I'm looking for a CSV file containing the UTF-8 encoding table. This file would contain the code point, the escaped HTML entity and the symbol itself.

I already found websites providing these correspondences, like this one or this other one. However, they do not offer offline access, and pasting all the content by hand would be, while technically possible, rather tedious.

Would you know where such a file is available?

Patrick Hoefler
merours

1 Answer

With Python 2 you can look up each Unicode character by its integer code point using unichr.

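import sys

# Python 2 only: relies on unichr, xrange and the print statement.
with open('unicode.csv', 'wb') as output:
    # sys.maxunicode is itself a valid code point, hence the + 1.
    for i in xrange(sys.maxunicode + 1):
        char = unichr(i)
        output.write(str(i))                # code point as an integer
        output.write(',')
        output.write(repr(char))            # escaped form, e.g. u'\ufa3a'
        output.write(',')
        output.write(char.encode('utf-8'))  # the character itself
        output.write(',')
        # HTML numeric character reference for non-ASCII characters
        output.write(char.encode('ascii', 'xmlcharrefreplace'))
        output.write('\n')
print sys.maxunicode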

This produces a file (unicode.csv) containing, for each code point, its integer value, its escaped Unicode representation, the character itself, and the HTML-escaped character (a numeric character reference for non-ASCII characters).

For example, each line looks like this:

64058,u'\ufa3a',墨,&#64058;
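
For readers on Python 3, where unichr, xrange and the print statement no longer exist, the same idea can be sketched roughly as follows. The names make_row and write_table are illustrative and do not come from the original code. Two assumptions differ from the Python 2 version: lone surrogates (U+D800 to U+DFFF) are skipped because Python 3 refuses to encode them as UTF-8, and the escaped column uses the unicode_escape codec, so it reads \ufa3a rather than u'\ufa3a'.

```python
import sys


def make_row(code_point):
    """Return (integer, escaped form, character, HTML reference) for one code point."""
    char = chr(code_point)
    # unicode_escape yields forms such as \ufa3a or \n
    escaped = char.encode('unicode_escape').decode('ascii')
    # xmlcharrefreplace leaves ASCII untouched and turns the rest into &#NNNN;
    html_ref = char.encode('ascii', 'xmlcharrefreplace').decode('ascii')
    return code_point, escaped, char, html_ref


def write_table(path, stop=sys.maxunicode + 1):
    """Write one comma-separated line per code point below `stop`."""
    # newline='' keeps Python from translating the '\n' line terminator
    with open(path, 'w', encoding='utf-8', newline='') as output:
        for i in range(stop):
            if 0xD800 <= i <= 0xDFFF:
                continue  # lone surrogates cannot be encoded as UTF-8
            output.write('%d,%s,%s,%s\n' % make_row(i))
```

Calling write_table('unicode.csv') writes the full table; passing a smaller stop value is a quick way to try it on a subset first.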

I put the code and the unicode.csv file on GitHub for easier access.

Note: Because the Unicode character set includes newline characters, CSV is not really the best format. (See lines 10 to 13 of unicode.csv.) I also added a Python script to generate a JSON file, which is safer than CSV for storing Unicode characters.
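
To illustrate why JSON copes better with these characters, here is a minimal Python 3 sketch. It is not the script from the repository; build_table and its dictionary layout are assumptions made for the example. It serializes code points 10 to 13 (line feed, vertical tab, form feed, carriage return) and checks that they survive a round trip, because json.dumps escapes them instead of writing them raw.

```python
import json


def build_table(start, stop):
    """Map each code point in [start, stop) to its character and HTML reference."""
    table = {}
    for i in range(start, stop):
        char = chr(i)
        table[i] = {
            'char': char,
            'html': char.encode('ascii', 'xmlcharrefreplace').decode('ascii'),
        }
    return table


# Code points 10 to 13 are the ones that break a naive line-per-record CSV.
text = json.dumps(build_table(10, 14))
restored = json.loads(text)

# The serialized text contains no raw line breaks, yet the data is intact.
# Note that JSON object keys are always strings, so 10 comes back as '10'.
assert '\n' not in text and '\r' not in text
assert restored['10']['char'] == '\n'
```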

philshem
  • The best solution would be to generate the lookup on the fly (my code takes about 1 second, but the load times for CSV and JSON are much longer). – philshem Apr 09 '14 at 14:12
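
Generating the lookup on the fly, as the comment suggests, could look something like this in Python 3. The function names are hypothetical. Unlike the xmlcharrefreplace approach, this sketch always returns a numeric reference, even for ASCII characters.

```python
import html


def lookup(char):
    """Return the code point and HTML numeric reference for a single character."""
    code_point = ord(char)
    return code_point, '&#%d;' % code_point


def from_reference(reference):
    """Turn an HTML character reference such as '&#64058;' back into text."""
    return html.unescape(reference)
```

No file needs to be loaded: ord, chr and html.unescape compute each correspondence directly.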