2

I have a legacy database table with a mixed encoding. Some lines are UTF-8 and some lines are ISO 8859-1.

Are there some heuristics I can apply on the content of a line to guess which encoding best represents the content?

Peter Mortensen
Jerome WAGNER

2 Answers

1

Try converting from UTF-8 first. If that fails, the data is not valid UTF-8, so you should probably convert from Latin-1 (ISO 8859-1) instead.
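As a rough illustration of this try-then-fall-back idea, here is a minimal Python sketch. Python is an assumption on my part (the question does not name a language), `decode_mixed` is a hypothetical helper name, and the sketch assumes each row's value can be read from the database as raw bytes.

```python
from typing import Tuple


def decode_mixed(raw: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for one row's raw bytes.

    Hypothetical helper: try strict UTF-8 first, fall back to ISO 8859-1.
    """
    try:
        # Strict decoding raises UnicodeDecodeError on any invalid UTF-8 sequence.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO 8859-1 maps every byte value to a code point, so this cannot fail.
        return raw.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    for sample in ("café".encode("utf-8"), "café".encode("iso-8859-1")):
        print(decode_mixed(sample))
```

Pure ASCII rows decode identically either way, so reporting them as UTF-8 is harmless. The heuristic can still misfire on an ISO 8859-1 row whose bytes happen to form valid UTF-8 (for example the two characters `Ã©`), but such sequences are uncommon in ordinary text.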

Ignacio Vazquez-Abrams
1

Compare

iconv("UTF-8", "ISO-8859-1//IGNORE", $text)

and

iconv("UTF-8", "ISO-8859-1", $text)

If the two results are not equal, consider it UTF-8.

Peter Mortensen
Vladislav Rastrusny
  • What is it supposed to do? How does it work? Why will the result be different in some cases? What is it doing? Please respond by [editing (changing) your answer](https://stackoverflow.com/posts/5259448/edit), not here in comments (***without*** "Edit:", "Update:", or similar - the answer should appear as if it was written today). – Peter Mortensen Mar 16 '22 at 17:03