How can I auto-detect ISO 8859-1 versus UTF-8 encoding in PHP?

Question

I have a legacy database table with mixed encodings. Some lines are UTF-8 and some are ISO 8859-1.

Are there some heuristics I can apply on the content of a line to guess which encoding best represents the content?

If you are willing to write a script in another language, I strongly recommend http://chardet.feedparser.org/, which is pretty reliable. – Wukerplank, Mar 10 '11 at 11:50
This might help you: http://php.net/manual/en/function.mb-detect-encoding.php – DhruvPathak, Mar 10 '11 at 11:47
You can have a look at https://stackoverflow.com/questions/910793/php-detect-encoding-and-make-everything-utf-8, which addresses the same problem. – Jon Skarpeteig, Mar 10 '11 at 11:50

score 1 · Answer 1 · answered Mar 10 '11 at 11:47

Convert from UTF-8. If that fails then it's not UTF-8, so you should probably convert from Latin-1 instead.
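One way to sketch this idea in PHP is shown below. It is only an illustration, not necessarily what the answerer had in mind: the helper name `to_utf8` is hypothetical, it assumes the mbstring extension is available, and it assumes UTF-8 and ISO 8859-1 are the only two encodings present in the table.

```php
<?php
// Hypothetical helper: returns $text as UTF-8, assuming the input is
// either valid UTF-8 already or ISO 8859-1 (Latin-1).
function to_utf8(string $text): string
{
    // Any ISO 8859-1 byte >= 0x80 that is not part of a well-formed
    // UTF-8 sequence makes this check fail.
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8 (pure ASCII also lands here)
    }

    // Not valid UTF-8: treat the bytes as ISO 8859-1 and convert.
    return mb_convert_encoding($text, 'UTF-8', 'ISO-8859-1');
}
```

This remains a heuristic: an ISO 8859-1 string whose bytes happen to form valid UTF-8 (for example "Ã©") will be treated as UTF-8 and left unchanged.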

score 1 · Answer 2 · edited Mar 16 '22 at 17:02

Compare

iconv("UTF-8", "ISO-8859-1//IGNORE", $text)

and

iconv("UTF-8", "ISO-8859-1", $text)

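If they are not equal, the strict conversion (the one without //IGNORE) failed and returned false: the text contains a byte sequence that is not valid UTF-8, or a character with no ISO 8859-1 equivalent, so it is probably not UTF-8. If both calls return the same string, consider it UTF-8.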
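A minimal sketch of this comparison follows, wrapped in a hypothetical helper (the name `converts_cleanly_from_utf8` is not from the answer) and assuming the iconv extension is enabled. The strict call returns false and raises a notice on an illegal or incomplete sequence, so the two results can only differ when the strict conversion fails. The behavior of `//IGNORE` varies between iconv implementations, which is why the sketch also checks the strict result for false explicitly.

```php
<?php
// Hypothetical helper: true when $text converts from UTF-8 to ISO 8859-1
// without hitting any illegal or unrepresentable sequence.
function converts_cleanly_from_utf8(string $text): bool
{
    // "@" suppresses the notice iconv raises on an illegal sequence.
    $strict  = @iconv("UTF-8", "ISO-8859-1", $text);
    $lenient = @iconv("UTF-8", "ISO-8859-1//IGNORE", $text);

    // The strict call must succeed and agree with the lenient one.
    return $strict !== false && $strict === $lenient;
}
```

A true result suggests the text is UTF-8. A false result suggests it is not, or that it is UTF-8 containing characters outside ISO 8859-1.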

edited Mar 16 '22 at 17:02 by Peter Mortensen

answered Mar 10 '11 at 12:13

What is it supposed to do? How does it work? Why will the result be different in some cases? What is it doing? Please respond by [editing (changing) your answer](https://stackoverflow.com/posts/5259448/edit), not here in comments (***without*** "Edit:", "Update:", or similar - the answer should appear as if it was written today). – Peter Mortensen Mar 16 '22 at 17:03
