I am trying to encode Bangla words in Python using a pandas DataFrame. With the encoding set to utf-8 it does not work, but with utf-8-sig it does. I know that utf-8-sig is UTF-8 with a BOM (byte order mark). But why is it called utf-8-sig, and how does it work?
- I believe this is a duplicate: https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom – juanpa.arrivillaga Jul 22 '19 at 20:08
- Welcome to Stack Overflow. [This site](https://forum.plasticscm.com/topic/1811-utf-8-with-signature-vs-utf-8-without-signature/) makes it clear that the BOM (Byte Order Mark) is a "signature," hence the "sig" in the name. [This question](https://stackoverflow.com/questions/23154355/python-utf-8-sig-bom-in-the-middle-of-the-file-when-appending-to-the-end) and its answers explain the reasons for it and how it works. [Here is the Python documentation](https://docs.python.org/3.7/library/codecs.html#encodings-and-unicode) on the topic. Is there something in particular you do not understand? – Rory Daulton Jul 22 '19 at 20:10
- Possible duplicate of [python utf-8-sig BOM in the middle of the file when appending to the end](https://stackoverflow.com/questions/23154355/python-utf-8-sig-bom-in-the-middle-of-the-file-when-appending-to-the-end) – Rory Daulton Jul 22 '19 at 20:10
- I couldn't find anything related to the 'sig' part. That's why I asked this question. – JuBaer AD Jul 25 '19 at 06:32
- This is a perfectly valid question. – Nobilis Feb 21 '20 at 10:43
1 Answer
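The "sig" in "utf-8-sig" is short for "signature" (i.e. a UTF-8 file with a signature). When you read a file with utf-8-sig, a BOM at the start is treated as file metadata and skipped, rather than being returned as part of the string. When you write with utf-8-sig, the BOM is placed at the start of the output.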
Hans Ginzel
eee
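To make the answer above concrete, here is a minimal sketch using only Python's built-in codecs. The sample word and the variable names are illustrative assumptions, not taken from the question. It shows that utf-8-sig writes the three-byte signature `EF BB BF` in front of the ordinary UTF-8 bytes, and that it drops that signature again when decoding, whereas plain utf-8 keeps it as the character U+FEFF.

```python
import codecs

# Illustrative sample text (an assumption; any Bangla string behaves the same).
text = "বাংলা"

plain = text.encode("utf-8")       # ordinary UTF-8 bytes, no signature
signed = text.encode("utf-8-sig")  # same bytes, preceded by the BOM

# The only difference between the two encodings is the leading signature.
assert signed == codecs.BOM_UTF8 + plain

# Decoding signed bytes as plain utf-8 keeps the BOM as a U+FEFF character.
as_utf8 = signed.decode("utf-8")
bom_kept = as_utf8.startswith("\ufeff")

# Decoding as utf-8-sig strips the signature and returns the original text.
as_sig = signed.decode("utf-8-sig")
bom_stripped = as_sig == text

# utf-8-sig also accepts input without a BOM, so it is safe for reading both.
no_bom_ok = plain.decode("utf-8-sig") == text

print(signed[:3].hex(), bom_kept, bom_stripped, no_bom_ok)
```

This probably also explains the pandas behaviour in the question: some programs, spreadsheet applications in particular, appear to rely on the signature to detect UTF-8, so a file written without it may display Bangla text incorrectly even though the bytes are valid UTF-8.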