
I've come across a problem with special characters in a Django query (Django 1.9.2).

I've created a model that stores a single word, and I'm feeding it with words from a Spanish dictionary using code like this:

MyModel.objects.get_or_create(word=myword)

And now I've realized that words containing special characters haven't been added, so, for example, there is only one MyModel row in the database for both año and ano! And when I query the database I retrieve the same object for these two queries:

MyModel.objects.get(word='año')
MyModel.objects.get(word='ano')

...and no, those words are not the same ;D

I want to create one object for each word, of course.
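For reference, the model is roughly like this (a minimal sketch; only the field name `word` and the `db_index=True` mentioned in the comments below come from the post, the `max_length` and everything else is illustrative):

from django.db import models

class MyModel(models.Model):
    # Hypothetical reconstruction; only the name `word` and db_index=True are from the post.
    word = models.CharField(max_length=100, db_index=True)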

jgsogo
  • uhmmmm... Just wrote this here, and I have realized that the field `word` in `MyModel` is marked as `db_index=True`... Maybe a field of this type can store those special characters but, when building the hash to create the key/index, it discards special chars? Feature or bug? – jgsogo Sep 10 '16 at 18:56
  • Did you make sure that your database has UTF-8 encoding? By default I think MySQL uses latin-1. Also, did you try running an insert query on the database directly, using dbshell? – Swakeert Jain Sep 10 '16 at 19:46
  • I haven't; this issue comes from an 'almost' production environment and I want to gather as much info as I can before starting to break things. I'm almost sure the database is utf-8 (I'll check it) and it stores all characters properly; the problem is that when there is a "collision" between two words, it only stores the first one that arrives. – jgsogo Sep 10 '16 at 19:51
  • @SwakeertJain - MySQL has `utf8` and `latin1`; either is sufficient for Spanish (and the rest of Western Europe). However, the encoding is different. – Rick James Sep 11 '16 at 03:21

2 Answers


I have had similar problems before, and was able to solve the issue by setting the Python interpreter's default encoding at the process level.

In the context of production-level Django, I added this to the top of my wsgi.py file:

import sys

# Python 2 only: sys.setdefaultencoding() is hidden by site.py, so sys
# must be reloaded to get it back; neither exists in Python 3.
reload(sys)
sys.setdefaultencoding('utf8')

I believe that I found this trick in this answer.

Adam Hopkins

Short answer: You probably want COLLATION utf8_spanish2_ci.

Long answer:

If you are using CHARACTER SET utf8 (or utf8mb4) on the column/table in question, and if you need ano != año, you need COLLATION utf8_bin or utf8_spanish_ci or utf8_spanish2_ci. All other utf8 collations treat n = ñ. spanish2 differs from spanish in that ch is treated as a separate "letter" between c and d. Similarly for ll. More details.

Note that other 'accents' are ignored in comparisons for most utf8 collations except for utf8_bin. For example, C = Ç (except for _bin and _turkish).
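Under Django, the collation change can be applied with a raw-SQL migration along these lines (a sketch that assumes MySQL, a table named myapp_mymodel, and a VARCHAR(100) column; adjust the names, length, and target collation to your schema, e.g. utf8_bin if every accent must be distinct):

from django.db import migrations

class Migration(migrations.Migration):

    dependencies = [('myapp', '0001_initial')]  # hypothetical app/migration names

    operations = [
        migrations.RunSQL(
            # Re-collate only this column; utf8_spanish2_ci distinguishes n from ñ.
            "ALTER TABLE myapp_mymodel MODIFY word VARCHAR(100) "
            "CHARACTER SET utf8 COLLATE utf8_spanish2_ci;",
            reverse_sql="ALTER TABLE myapp_mymodel MODIFY word VARCHAR(100) "
                        "CHARACTER SET utf8 COLLATE utf8_general_ci;",
        ),
    ]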

Rick James
  • I need to create a new row for every different word, so I need `n`!=`ñ`, `á`!=`a`, `u`!=`ü`,... I suppose I need to change collation to `utf8_bin` then. (Just for curiosity: `ch` and `ll` are no longer letters since 1994 [RAE](http://www.rae.es/consultas/exclusion-de-ch-y-ll-del-abecedario)) – jgsogo Sep 11 '16 at 07:37
  • Yeah, sounds like utf8_bin -- Watch out; this means that 'A' != 'a'. Thanks for the info on ch and ll. – Rick James Sep 11 '16 at 15:13
  • You might avoid case problems by _storing_ `LCASE(...)`, then _comparing_ `col = LCASE(input_string)`. This would allow an index to still be useful with `utf8_bin`. – Rick James Sep 11 '16 at 15:22
  • I moved that column to `utf8_bin` and everything worked as expected. I have now created a copy of that column in the same table with `utf8_general_ci` and the same string; depending on the type of search I need, I use one or the other to retrieve the rows (see the sketch after these comments). Thanks a lot! – jgsogo Sep 12 '16 at 19:39
  • Two columns is a good trick. Usually they are split one for search, one for display. Your use case is 'similar'. – Rick James Sep 12 '16 at 19:44
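A sketch of the two-column approach described in the comments above (field names and lengths are illustrative; the per-column collations themselves would be set in MySQL, for example with ALTER TABLE statements like the migration sketched earlier):

from django.db import models

class MyModel(models.Model):
    word = models.CharField(max_length=100)         # collated utf8_bin in MySQL: exact, accent-sensitive matches
    word_search = models.CharField(max_length=100)  # collated utf8_general_ci: accent-insensitive searches

    def save(self, *args, **kwargs):
        # Keep the search copy in sync with the exact-match column.
        self.word_search = self.word
        super(MyModel, self).save(*args, **kwargs)

# Accent-sensitive lookup: año and ano are now different rows.
MyModel.objects.get(word=u'año')
# Accent-insensitive search: matches both año and ano.
MyModel.objects.filter(word_search=u'ano')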