0

I have an SMTP email that is mostly in Japanese, with some parts in English. The Subject of the email is UTF-8 text, encoded as base64.

Subject: =?UTF-8?B?5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8?= =?UTF-8?B?44OIIDog5b6M5bel56iL44Oh44O844Kr44O844GM5by344GE?=

How do I detect whether this is Japanese or Chinese, and decode it to readable Japanese/Chinese text?

Can I achieve this in Perl/Java/Python?

user1737619

3 Answers

5

You have two steps here. First decode the header:

If you have an email, use a high-level email parser such as Courriel. The subject accessor will return the decoded subject.

If you just have the string, use Encode::MIME::Header:

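use Encode qw(decode encode);
# Decode the RFC 2047 encoded-words into a Perl character string.
my $subject = decode 'MIME-Header', 'Subject: =?UTF-8?B?5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8?= =?UTF-8?B?44OIIDog5b6M5bel56iL44Oh44O844Kr44O844GM5by344GE?=';
# Encode back to UTF-8 bytes for output.
print encode('UTF-8', $subject), "\n";
__END__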
Subject: 半導体製造装置プレビューノート : 後工程メーカーが強い
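Since the question also asks about Python, here is a minimal sketch of the same header-decoding step using only the standard library's email.header module. The names RAW_SUBJECT and decode_subject are illustrative and not part of any library.

```python
from email.header import decode_header, make_header

# The header value from the question (without the "Subject: " field name).
RAW_SUBJECT = (
    "=?UTF-8?B?5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8?= "
    "=?UTF-8?B?44OIIDog5b6M5bel56iL44Oh44O844Kr44O844GM5by344GE?="
)


def decode_subject(raw: str) -> str:
    """Decode RFC 2047 encoded-words into a normal Unicode string."""
    # decode_header splits the value into (bytes, charset) chunks;
    # make_header reassembles them into a Header whose str() is the text.
    return str(make_header(decode_header(raw)))


subject = decode_subject(RAW_SUBJECT)
print(subject)
```

The whitespace between the two encoded-words is dropped during decoding, so the two parts join into one continuous string.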

The second step is to find out the language. As a human, I can already tell that this is Japanese. The kana characters are the clue: they only occur in Japanese writing. If that's all you need, then a string that matches \p{Kana} (Katakana) or \p{Hira} (Hiragana) is likely Japanese.

For a more general solution, use a language detection library such as Lingua::Identify::CLD, Lingua::Ident, Lingua::Lid, Lingua::YALI, or WebService::Google::Language.
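The kana check can be sketched in Python as follows. Python's built-in re module has no \p{...} script properties, so this sketch assumes the Hiragana, Katakana, and CJK Unified Ideographs code-point blocks as a stand-in. The function name guess_cjk and its return labels are hypothetical, and the result is a heuristic, not real language identification.

```python
import re

# Assumption: Unicode blocks used as an approximation of script properties.
KANA = re.compile(r"[\u3040-\u309F\u30A0-\u30FF]")  # Hiragana + Katakana blocks
HAN = re.compile(r"[\u4E00-\u9FFF]")                # CJK Unified Ideographs


def guess_cjk(text: str) -> str:
    """Rough heuristic: kana implies Japanese; Han alone is ambiguous."""
    if KANA.search(text):
        return "japanese"
    if HAN.search(text):
        # Han characters are shared by Chinese and Japanese,
        # so they cannot settle the language on their own.
        return "chinese-or-japanese"
    return "other"


print(guess_cjk("半導体製造装置プレビューノート : 後工程メーカーが強い"))
```

A short Japanese string written only in kanji would fall into the ambiguous bucket, which is where a language detection library becomes necessary.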

daxim
1

You may want to look at these:

chardet: character set detection developed by Mozilla and used in Firefox. Source code

jchardet is a Java port of Mozilla's automatic charset detection algorithm.
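For the subject line in the question, the charset may not need to be guessed at all, because each encoded-word declares it (the UTF-8 in =?UTF-8?B?...?=). A minimal Python sketch, using only the standard library, that reads the declared charset instead of detecting it; the variable names are illustrative.

```python
from email.header import decode_header

raw = (
    "=?UTF-8?B?5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8?= "
    "=?UTF-8?B?44OIIDog5b6M5bel56iL44Oh44O844Kr44O844GM5by344GE?="
)

# decode_header yields (payload, charset) pairs; charset is None for
# parts of the header that were not encoded-words.
parts = decode_header(raw)
declared = [charset for _, charset in parts]

# Decode each payload with the charset the header itself declares.
text = "".join(
    payload.decode(charset or "ascii") if isinstance(payload, bytes) else payload
    for payload, charset in parts
)
print(declared, text)
```

Charset detection of the chardet kind is more relevant when raw bytes arrive with no declared encoding.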

Juned Ahsan
  • +1 The only answer that actually answers the question. – Thihara Jun 26 '13 at 07:05
  • The Perl ports of that library are mentioned in http://stackoverflow.com/questions/1970660/how-can-i-guess-the-encoding-of-a-string-in-perl – daxim Jun 26 '13 at 13:54
  • -1, but it's the wrong tool, the question asked for language detection, not encoding detection. – daxim Jun 26 '13 at 14:01
-1

With Java, you need a library to decode your Base64 string to binary, for example Apache Commons Codec.

Then it's straightforward:

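  // Requires: import org.apache.commons.codec.binary.Base64;
  // Pass only the Base64 payload between "=?UTF-8?B?" and "?=" (first encoded-word shown here).
  byte[] b = Base64.decodeBase64("5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8");
  // Interpret the decoded bytes as UTF-8 text.
  String s = new String(b, "UTF-8");
  System.out.println(s);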

It prints: 半導体製造装置プレビューノー (I don't know what it means, but it sure looks like Japanese).
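The same manual approach can be sketched in Python, extended to cover both encoded-words in the subject. This assumes every encoded-word uses UTF-8 with the B (Base64) encoding, as in the question. ENCODED_WORD and decode_b64_words are hypothetical names.

```python
import base64
import re

SUBJECT = (
    "=?UTF-8?B?5Y2K5bCO5L2T6KO96YCg6KOF572u44OX44Os44OT44Ol44O844OO44O8?= "
    "=?UTF-8?B?44OIIDog5b6M5bel56iL44Oh44O844Kr44O844GM5by344GE?="
)

# Assumption: only UTF-8 / Base64 encoded-words appear in the header.
ENCODED_WORD = re.compile(r"=\?UTF-8\?B\?([A-Za-z0-9+/=]+)\?=", re.IGNORECASE)


def decode_b64_words(header: str) -> str:
    """Extract each Base64 payload, decode it, and join the results."""
    payloads = ENCODED_WORD.findall(header)
    # Join the raw bytes first, then decode once as UTF-8.
    raw = b"".join(base64.b64decode(p) for p in payloads)
    return raw.decode("utf-8")


decoded = decode_b64_words(SUBJECT)
print(decoded)
```

If the printed text shows up as question marks, the console or file encoding is probably not set to UTF-8.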

obourgain
  • How can I identify which language it is in the code from the output text? – user1737619 Jun 26 '13 at 06:57
  • If you want to recognize if the text is Japanese or English, I suggest to look at https://code.google.com/p/language-detection/. – obourgain Jun 26 '13 at 07:11
  • This code actually prints out ??????????????. Should I change the language settings for it to print japanese in eclipse? – user1737619 Jun 26 '13 at 07:13
  • I don't see how it can print ?????????. Be sure to parse only the part of the String between =?UTF-8?B? and ?= (there are two parts in the subject). – obourgain Jun 26 '13 at 07:25
  • I changed the text file encoding to UTF-8 in the project Properties of eclipse and it worked. – user1737619 Jun 26 '13 at 08:28