81

I am trying to create a 'normalized' copy of a string, to help reduce duplicate names in a database. The names contain many international characters (e.g. accented letters), and I want to create a copy with the accents removed.

I did come across the method below, but cannot get it to work. I can't seem to find what the Unicode Hacks plugin is.

  # Utility method that returns an ASCIIfied, downcased, and sanitized string.
  # It relies on the Unicode Hacks plugin by means of String#chars. We assume
  # $KCODE is 'u' in environment.rb. By now we support a wide range of Latin
  # accented letters, based on the Unicode Character Palette bundled in Macs.
  def self.normalize(str)
     n = str.chars.downcase.strip.to_s
     n.gsub!(/[à áâãäåÄÄ?]/u,    'a')
     n.gsub!(/æ/u,                  'ae')
     n.gsub!(/[ÄÄ?]/u,                'd')
     n.gsub!(/[çÄ?ÄÄ?Ä?]/u,          'c')
     n.gsub!(/[èéêëÄ?Ä?Ä?Ä?Ä?]/u, 'e')
     n.gsub!(/Æ?/u,                   'f')
     n.gsub!(/[ÄÄ?ġģ]/u,            'g')
     n.gsub!(/[ĥħ]/,                'h')
     n.gsub!(/[ììíîïīĩĭ]/u,     'i')
     n.gsub!(/[įıijĵ]/u,           'j')
     n.gsub!(/[ķĸ]/u,               'k')
     n.gsub!(/[Å?ľĺļÅ?]/u,         'l')
     n.gsub!(/[ñÅ?Å?Å?Å?Å?]/u,       'n')
     n.gsub!(/[òóôõöøÅÅ?ÅÅ]/u,  'o')
     n.gsub!(/Å?/u,                  'oe')
     n.gsub!(/Ä?/u,                   'q')
     n.gsub!(/[Å?Å?Å?]/u,             'r')
     n.gsub!(/[Å?Å¡Å?ÅÈ?]/u,          's')
     n.gsub!(/[ťţŧÈ?]/u,           't')
     n.gsub!(/[ùúûüūůűŭũų]/u,'u')
     n.gsub!(/ŵ/u,                   'w')
     n.gsub!(/[ýÿŷ]/u,             'y')
     n.gsub!(/[žżź]/u,             'z')
     n.gsub!(/\s+/,                   ' ')
     n.gsub!(/[^\sa-z0-9_-]/,          '')
     n
  end

Do I need to 'require' a particular library/gem? Or maybe someone could recommend another way to go about this.

I am not using Rails, nor do I plan on doing so.

paradoja
Gus Shortz
  • 1
    Which ruby version are you using? – Huluk Mar 28 '13 at 16:21
  • Take a look at http://stackoverflow.com/questions/1268289/how-to-get-rid-of-non-ascii-characters-in-ruby – MurifoX Mar 28 '13 at 16:28
  • 3
    you could also look at: https://github.com/norman/unidecoder – amalrik maia Mar 28 '13 at 16:34
  • I'm using Ruby 1.9.3. I'll take a look at both of those suggestions. All I need is the character replacement that the method above performs, so if they can do that, great, and thanks :) – Gus Shortz Mar 28 '13 at 20:30
  • I finally found a reference to the Unicode Hacks plugin (http://www.railslodge.com/plugins/316-unicode-hacks), which provides the `chars` method that the `normalize` method I mentioned needs. It no longer seems to be supported, though. – Gus Shortz Mar 29 '13 at 01:49

4 Answers

241

I generally use I18n to handle this:

1.9.3p392 :001 > require "i18n"
 => true
1.9.3p392 :002 > I18n.transliterate("Hé les mecs!")
 => "He les mecs!"
user2398029
  • 3
    [The documentation](http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate). Being able to set transliterations on a per-locale basis is also very powerful. – Paul Fioravanti Mar 29 '13 at 10:54
  • 12
    This may not do what you expect on characters that don't have basic Latin mappings--for example Chinese characters. It just turns them to question marks. `(main)> I18n.transliterate("雙屬性集合之空間分群演算法-應用於地理資料")` `=> "?????????????-???????"` – David Mar 25 '14 at 18:20
  • 19
    Just a note for plain Ruby: if `I18n::InvalidLocale: :en is not a valid locale` is raised, set `I18n.available_locales = [:en]` before calling `I18n.transliterate`. – Alter Lagos Jul 15 '15 at 04:09
  • 1
    Note: This does not work for everything. For example, "Bùi Viện" becomes "Bui Vi?n". – CHawk Apr 17 '16 at 13:31
  • 3
    Didn't work for me: `(main)> I18n.transliterate "ŠKODA" => "ŠKODA"` – Michael Jul 12 '16 at 14:30
  • Those cases should be reported as I18n bugs. – user2398029 Jul 22 '16 at 00:46
  • I think it depends too much on configuration. It doesn't work for me either, even after specifying different locales. – kolen May 04 '18 at 17:06
30

ActiveSupport's parameterize method could be a nice and simple way to remove special characters so that the string can be used as a human-readable identifier:

> "Françoise Isaïe".parameterize
=> "francoise-isaie"
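
Since `parameterize` comes from ActiveSupport, plain Ruby needs that gem loaded before the call works. As a rough standard-library approximation of the same idea, here is a minimal sketch. It assumes Ruby 2.2 or later (for `String#unicode_normalize`), and `slugify` is a hypothetical helper name, not part of any library. It is not a reimplementation of `parameterize`, and letters without a canonical decomposition (such as `ø` or `ł`) are simply dropped by the final filter:

```ruby
# Hypothetical helper: builds a lowercase ASCII identifier from a string.
# Assumes Ruby >= 2.2 and a UTF-8 encoded input string.
def slugify(str, separator: "-")
  sep = Regexp.escape(separator)
  str.unicode_normalize(:nfd)                    # split "é" into "e" + combining accent
     .gsub(/\p{Mn}/, "")                         # drop the combining marks
     .downcase
     .gsub(/[^a-z0-9]+/) { separator }           # collapse everything else into the separator
     .gsub(/\A(?:#{sep})+|(?:#{sep})+\z/, "")    # trim separators at both ends
end

puts slugify("Françoise Isaïe")
puts slugify("L'Oréal", separator: " ")
```

The separator is passed through a block so that it is inserted literally, even if it contains characters that `gsub` would otherwise treat as backreferences.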
AlexGuti
20

So far the following is the only way I've been able to accomplish what I need:

str.tr(
"ÀÁÂÃÄÅàáâãäåĀāĂăĄąÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž",
"AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")

But using this feels very 'hackish', and I would love to find a better way.
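
One standard-library alternative to a hand-maintained `tr` table is to decompose the string into base letters plus combining marks and then drop the marks. The following is only a sketch under stated assumptions: it needs Ruby 2.2 or later (where `String#unicode_normalize` exists), so it would not run on 1.9.3 as-is. `strip_accents` and `NON_DECOMPOSABLE` are hypothetical names chosen for illustration, and the fallback table is deliberately incomplete:

```ruby
# Letters with no canonical decomposition still need an explicit mapping.
# Illustrative only, not an exhaustive table.
NON_DECOMPOSABLE = {
  "æ" => "ae", "Æ" => "AE",
  "œ" => "oe", "Œ" => "OE",
  "ø" => "o",  "Ø" => "O",
  "đ" => "d",  "Đ" => "D",
  "ł" => "l",  "Ł" => "L",
  "ß" => "ss"
}.freeze

# Hypothetical helper; assumes Ruby >= 2.2 and a UTF-8 encoded string.
def strip_accents(str)
  str.unicode_normalize(:nfd)                                   # "é" becomes "e" + U+0301
     .gsub(/\p{Mn}/, "")                                        # remove combining marks
     .unicode_normalize(:nfc)                                   # recompose whatever remains
     .gsub(Regexp.union(NON_DECOMPOSABLE.keys), NON_DECOMPOSABLE) # apply the fallback table
end

puts strip_accents("Bùi Viện")  # => Bui Vien
puts strip_accents("ŠKODA")     # => SKODA
puts strip_accents("Łódź")      # => Lodz
```

Unlike the `tr` call, this leaves characters it does not know about (for example Chinese text) untouched rather than replacing them.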

Gus Shortz
  • 1
    This works only for ISO-8859-1. What makes you think it works for UTF-8? – pts Nov 29 '14 at 19:58
  • 4
    This one works for UTF-8 on Ruby 2.2.3 and does exactly what I needed. It lacks some Romanian characters, though, so I've added them: `string.tr( "ÀÁÂÃÄÅàáâãäåĀāĂăĄąÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêëĒēĔĕĖėĘęĚěĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôõöøŌōŎŏŐőŔŕŖŗŘřŚśŜŝŞşŠšȘșſŢţŤťŦŧȚțÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųŴŵÝýÿŶŷŸŹźŻżŽž", "AAAAAAaaaaaaAaAaAaCcCcCcCcCcDdDdDdEEEEeeeeEeEeEeEeEeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOooooooOoOoOoRrRrRrSsSsSsSsSssTtTtTtTtUUUUuuuuUuUuUuUuUuUuWwYyyYyYZzZzZz")` – Alexander Jun 24 '17 at 09:21
  • Thanks, it worked. It lacks some Vietnamese characters, so I've added them: `tr("ÀÁÂÃÄÅàáâãäåĀāĂăĄąạảÇçĆćĈĉĊċČčÐðĎďĐđÈÉÊËèéêểệễëĒēĔĕĖėĘęĚěẹĜĝĞğĠġĢģĤĥĦħÌÍÎÏìíîïĨĩĪīĬĭĮįİıịỉĴĵĶķĸĹĺĻļĽľĿŀŁłÑñŃńŅņŇňŉŊŋÒÓÔÕÖØòóôộỗổõöøŌōŎŏŐőọỏơởợỡŔŕŖŗŘřŚśŜŝŞşŠšſŢţŤťŦŧÙÚÛÜùúûüŨũŪūŬŭŮůŰűŲųụưủửữựŴŵÝýÿŶŷŸŹźŻżŽžứừửựữốồộỗổờóợỏỡếềễểệẩẫấầậỳỹýỷỵặẵẳằắ", "AAAAAAaaaaaaAaAaAaaaCcCcCcCcCcDdDdDdEEEEeeeeeeEeEeEeEeEeeGgGgGgGgHhHhIIIIiiiiIiIiIiIiIiiiJjKkkLlLlLlLlLlNnNnNnNnnNnOOOOOOoooooooooOoOoOoooooooRrRrRrSsSsSsSssTtTtTtUUUUuuuuUuUuUuUuUuUuuuuuuuWwYyyYyYZzZzZzuuuuuooooooooooeeeeeaaaaayyyyyaaaaa")` – duyetpt Jul 16 '21 at 08:14
3

If you are using Rails:

"L'Oréal".parameterize(separator: ' ')
Dorian
Navid Khan