I am trying to read HTML from Chinese websites and extract their <title> value. Websites encoded in UTF-8 work fine, but GB2312 websites do not (for example, m.39.net, which shows 39������_�й����ȵĽ����Ż���վ instead of 39健康网_中国领先的健康门户网站).

Here is the code I use to accomplish that:

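URL url = new URL(urlstr);
URLConnection connection = url.openConnection();
InputStream inputStream = connection.getInputStream();
// No charset is passed here, so the bytes are decoded with the platform default charset.
String content = IOUtils.toString(inputStream);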
WoLfPwNeR

2 Answers

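String content = IOUtils.toString(inputStream, "GB2312"); may help. Without the charset argument, IOUtils.toString(inputStream) decodes the bytes with the platform default charset, which garbles GB2312 pages.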
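As a rough illustration of why the explicit charset matters, here is a minimal JDK-only sketch that reads a stream fully and decodes it with a named charset. The class and method names (ExplicitCharsetReader, readAll) are made up for this example and are not part of Commons IO; it also assumes the running JDK ships the GB2312 charset, which standard full JDK builds normally do.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;

// Hypothetical helper for illustration only; not part of Commons IO.
final class ExplicitCharsetReader {

    /** Reads the whole stream and decodes the bytes with the named charset. */
    static String readAll(InputStream in, String charsetName) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // Collect the raw bytes first; decoding happens once, at the end,
        // so multi-byte characters are never split across chunk boundaries.
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return new String(buffer.toByteArray(), Charset.forName(charsetName));
    }
}
```

Calling ExplicitCharsetReader.readAll(connection.getInputStream(), "GB2312") should then give readable text for a GB2312 page, whereas decoding the same bytes with a different charset produces the kind of garbled title shown in the question.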

If you want to detect the charset of a webpage, there are 3 ways as far as I know:

  1. use connection.getContentType() to read the charset parameter of the HTTP Content-Type header (note that connection.getContentEncoding() returns the Content-Encoding header, such as gzip, not the charset);
  2. parse <meta http-equiv="Content-Type" content="text/html;charset=ISO-8859-1"> or <meta charset="UTF-8"> in the HTML code (you have to download the HTML content first and then read the first several lines);
  3. use third-party libraries, e.g. those mentioned in this question.
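The first two approaches in the list above could be combined roughly as follows. This is only a sketch under some assumptions: the class and method names (CharsetSniffer, findCharset, detect) are invented for the example, the header value is assumed to come from connection.getContentType(), and the simple regular expression is not a full HTML parser, so unusual markup may slip past it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class CharsetSniffer {

    // Matches "charset=GB2312", "charset = 'utf-8'", charset="UTF-8", etc.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the first charset name found in the text, or null if there is none. */
    static String findCharset(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Tries the Content-Type header first, then a <meta> tag near the start of the body. */
    static Charset detect(String contentTypeHeader, byte[] body, Charset fallback) {
        String name = findCharset(contentTypeHeader);
        if (name == null) {
            // ISO-8859-1 maps every byte to a char, so the ASCII markup
            // of the <meta> tag survives whatever the real encoding is.
            int len = Math.min(body.length, 4096);
            name = findCharset(new String(body, 0, len, StandardCharsets.ISO_8859_1));
        }
        try {
            return (name != null && Charset.isSupported(name)) ? Charset.forName(name) : fallback;
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return fallback;
        }
    }
}
```

The detected Charset can then be used to decode the already downloaded bytes, for example with new String(body, charset). Libraries such as those in the third option are likely more robust when neither the header nor the markup declares a charset.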
Community
xiGUAwanOU

Have you seen http://commons.apache.org/proper/commons-io/apidocs/org/apache/commons/io/IOUtils.html? It includes an overload that takes the encoding:

toString(byte[] input, String encoding)
sideshowbarker
xiaoming
  • Seems like the earlier answer at http://stackoverflow.com/a/34735065/441757 had already suggested using `IOUtils.toString()`... – sideshowbarker Jan 12 '16 at 03:27