[C] GCC对UTF8 BOM的支持

最近玩些特俗字符，结果对yqutest.cpp源码文件编译时先碰到error:converting to execution character set: Illegal bytesequence错误。GCC的源码字符集与执行字符集默认是UTF-8编码，为了避免源码文件乱码，最好也是采用UTF-8编码来存储源码文件。将源码编码转成UTF-8，问题得以解决。但是否需要UTF-8 BOM(byte-order mark)呢？我一时兴起添加了BOM，十六进制为EF BB BF，即对应八进制的357 273 277，编译结果如下：

mryqu> g++ yqutest.cpp -o yqutst123
yqutest.cpp:1: error: stray '\357' in program
yqutest.cpp:1: error: stray '\273' in program
yqutest.cpp:1: error: stray '\277' in program
yqutest.cpp:1: error: stray '#' in program
yqutest.cpp:1: error: expected constructor, destructor, or type conversion before '<' token
mryqu> g++ -v
Using built-in specs.
Target: amd64-undermydesk-freebsd
Configured with: FreeBSD/amd64 system compiler
Thread model: posix
gcc version 4.2.1 20070719  [FreeBSD]

折腾一下-finput-charset和-fextended-identifiers选项，不管用。

g++ -finput-charset=UTF-8 -fextended-identifiers yqutest.cpp -o yqutst123

后来看到Bug 33415 - Can’t compile .cpp file with UTF-8 BOM，才知道是我用的G++版本太低，起码GCC4.4.0才支持UTF-8 BOM。老老实实去掉BOM就可以编译过了。

参考

[C/C++] 各种C/C++编译器对UTF-8源码文件的兼容性测试（VC、GCC、BCB）
Using UTF-8 as the internal representation for strings in C and C++ with Visual Studio