Identifying Japanese language from UTF-8 base64 encoding

Question

I have an SMTP email that is mostly in Japanese, with some parts in English. The Subject of the email is encoded as UTF-8, Base64.

Subject: =?UTF-8?B?5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8?= =?UTF-8?B?44OIIDog5b6M5bel56iL44Oh44O844Kr44O844GM5by344GE?=

How do I detect whether this is Japanese/Chinese, and decode it to Japanese/Chinese text?

Can I achieve this in Perl/Java/Python?

score 5 · Answer 1 · answered Jun 26 '13 at 14:11

You have two steps here. First decode the header:

If you have an email, use a high-level email parser such as Courriel. The subject accessor will return the decoded subject.

If you just have the string, use Encode::MIME::Header:

use Encode qw(decode);
binmode STDOUT, ':encoding(UTF-8)';  # write the decoded characters out as UTF-8
print decode('MIME-Header', 'Subject: =?UTF-8?B?5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8?= =?UTF-8?B?44OIIDog5b6M5bel56iL44Oh44O844Kr44O844GM5by344GE?='), "\n";
__END__
Subject: 半導体製造装置プレビューノート : 後工程メーカーが強い

The second step is to find out the language. As a human, I can already tell that this is Japanese: the kana characters are the clue, since they occur only in Japanese writing. If that's all you need, then a string that matches \p{Kana} (katakana) or \p{Hira} (hiragana) is likely Japanese.
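For the kana check alone, a rough Java equivalent (Java being one of the languages the question asks about) might look like the sketch below. It is only a heuristic: any hiragana or katakana is taken to mean Japanese, while text that contains Han characters but no kana could be Chinese or kana-free Japanese. The class name ScriptGuess, the method guess and its return labels are made up for illustration. The string literals use Unicode escapes so the source file stays pure ASCII.

```java
import java.lang.Character.UnicodeScript;

class ScriptGuess {
    /** Hypothetical helper: returns "japanese", "han-only" or "other". */
    static String guess(String text) {
        boolean sawHan = false;
        // Walk the string by code point so supplementary characters are handled.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            UnicodeScript script = UnicodeScript.of(cp);
            if (script == UnicodeScript.HIRAGANA || script == UnicodeScript.KATAKANA) {
                return "japanese"; // kana occurs only in Japanese writing
            }
            if (script == UnicodeScript.HAN) {
                sawHan = true; // Han is shared by Chinese and Japanese
            }
            i += Character.charCount(cp);
        }
        return sawHan ? "han-only" : "other";
    }

    public static void main(String[] args) {
        // Three Han characters plus a katakana word from the decoded subject.
        System.out.println(guess("\u534a\u5c0e\u4f53\u30d7\u30ec\u30d3\u30e5\u30fc"));
        System.out.println(guess("Hello"));
    }
}
```

Run as written, this should print japanese and then other. A "han-only" result is where a real language detection library becomes necessary.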

For a more general solution, use a language detection library such as Lingua::Identify::CLD, Lingua::Ident, Lingua::Lid, Lingua::YALI, or WebService::Google::Language.

score 1 · Answer 2 · answered Jun 26 '13 at 06:50

You may want to check these:

chardet: a character set detection library developed by Mozilla and used in Firefox.

jchardet: a Java port of Mozilla's automatic charset detection algorithm.

Juned Ahsan


+1 The only answer that actually answers the question. – Thihara Jun 26 '13 at 07:05
The Perl ports of that library are mentioned in http://stackoverflow.com/questions/1970660/how-can-i-guess-the-encoding-of-a-string-in-perl – daxim Jun 26 '13 at 13:54
-1: it's the wrong tool. The question asked for language detection, not encoding detection. – daxim Jun 26 '13 at 14:01

score -1 · Answer 3 · answered Jun 26 '13 at 06:41

With Java, you need a library to decode your Base64 string to binary, for example Apache Commons Codec.

Then it's straightforward:

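  // Base64 here is org.apache.commons.codec.binary.Base64 (Apache Commons Codec)
  byte[] b = Base64.decodeBase64("5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8");
  // The header declares UTF-8, so interpret the bytes as UTF-8
  String s = new String(b, "UTF-8");
  System.out.println(s);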

It prints: 半導体製造装置プレビューノー, which is only the first of the two encoded parts of the subject (I don't know what it means, but it sure looks like Japanese).
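To decode the whole Subject rather than a single hand-copied part, the encoded words can be extracted and joined in code. The sketch below is one way to do that under some assumptions: it uses java.util.Base64 (available since Java 8) instead of Commons Codec, it handles only the Base64 ("B") form and not Q-encoding, and the names SubjectDecoder and decode are invented for this example. For real mail, a mail library's header decoder is the safer choice.

```java
import java.nio.charset.Charset;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubjectDecoder {
    // One Base64 encoded-word: =?charset?B?payload?=
    private static final Pattern ENCODED_WORD =
            Pattern.compile("=\\?([^?]+)\\?[Bb]\\?([^?]*)\\?=");

    /** Hypothetical helper: decodes Base64 encoded-words, leaves other text as is. */
    static String decode(String header) {
        // Whitespace that only separates two encoded-words is not part of the text.
        String joined = header.replaceAll("(\\?=)\\s+(=\\?)", "$1$2");
        Matcher m = ENCODED_WORD.matcher(joined);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            byte[] raw = Base64.getDecoder().decode(m.group(2));
            // Use the charset named in the encoded-word itself (UTF-8 here).
            String text = new String(raw, Charset.forName(m.group(1)));
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String subject =
                "=?UTF-8?B?5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8?= "
              + "=?UTF-8?B?44OIIDog5b6M5bel56iL44Oh44O844Kr44O844GM5by344GE?=";
        System.out.println(decode(subject));
    }
}
```

Each encoded word is decoded on its own and the results are concatenated, so the two parts of the subject come out as one string.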

obourgain


How can I identify which language it is in the code from the output text? – user1737619 Jun 26 '13 at 06:57
If you want to recognize whether the text is Japanese or English, I suggest looking at https://code.google.com/p/language-detection/. – obourgain Jun 26 '13 at 07:11
This code actually prints out ??????????????. Should I change the language settings in Eclipse for it to print Japanese? – user1737619 Jun 26 '13 at 07:13
I don't see how it can print ?????????. Be sure to parse only the part of the String between =?UTF-8?B? and ?= (there are two parts in the subject). – obourgain Jun 26 '13 at 07:25
I changed the text file encoding to UTF-8 in the project Properties in Eclipse and it worked. – user1737619 Jun 26 '13 at 08:28
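When the console shows only question marks, it can be hard to tell whether the decoding failed or just the display. One way to separate the two, sketched here as an assumption rather than something from the thread, is to print the code points in ASCII form, which survives any console encoding. CodePointDump and dump are hypothetical names, and java.util.Base64 (Java 8+) stands in for Commons Codec.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class CodePointDump {
    /** Hypothetical helper: renders each code point as U+XXXX, separated by spaces. */
    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // First three characters of the subject's first encoded part.
        byte[] b = Base64.getDecoder().decode("5Y2K5bCO5L2T");
        String s = new String(b, StandardCharsets.UTF_8);
        System.out.println(dump(s)); // prints: U+534A U+5C0E U+4F53
    }
}
```

If the dump shows code points in the CJK and kana ranges, the string was decoded correctly and only the console or project encoding needs to change.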
