4

When I parse an HTML file using jsoup, text that spans multiple lines (separated by <br />) in the HTML file comes back as a single line, without newlines (\n). How can I parse the multi-line HTML document into multi-line strings?

I am using the method: Element.text()

Eg:

The HTML contains C code that is displayed properly on multiple lines in the HTML file, but when I extract the text, everything is returned on a single line without newline characters.

madth3

2 Answers

3

Replace each <br /> with a placeholder, extract the text, and then turn the placeholder back into a newline, like this:

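import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect("http://www.ejemplo.html").get(); // the document still contains the <br> tags
// Swap every <br> for the placeholder $$$; html() may print the tag as <br> or <br />,
// depending on the jsoup version and output settings, so handle both forms
String temp = doc.html().replace("<br />", "$$$").replace("<br>", "$$$");
doc = Jsoup.parse(temp); // parse again

String text = doc.body().text().replace("$$$", "\n"); // turn the placeholders back into new lines (\n)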
Garrett Hyde
acrux
0

The text() method of Element (and TextNode) calls appendWhitespaceIfBr(...), which replaces every <br /> (like any other whitespace) with a single space. Unfortunately, I see no mechanism for turning this off without modifying the jsoup source.

But maybe you can try replacing all <br /> tags with instances of a new subclass of Node that keeps the line break.
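
A lighter alternative to subclassing Node might be to walk the tree yourself and build the string by hand. The following is only a minimal sketch, assuming a jsoup version in which Node.traverse(NodeVisitor) is available; the class BrAwareText and the method textWithLineBreaks are hypothetical names, not part of jsoup:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class BrAwareText {

    // Hypothetical helper (not a jsoup API): collects the text below root
    // and emits '\n' for every <br> element it encounters.
    public static String textWithLineBreaks(Element root) {
        final StringBuilder out = new StringBuilder();
        root.traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    // text() normalises whitespace inside the text node itself
                    out.append(((TextNode) node).text());
                } else if (node instanceof Element
                        && "br".equals(((Element) node).tagName())) {
                    out.append('\n');
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // nothing to do when leaving a node
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        Element body = Jsoup.parse("<p>int a;<br />int b;<br>return a;</p>").body();
        System.out.println(textWithLineBreaks(body));
    }
}
```

Note that this sketch only reacts to <br>; block-level elements such as <p> or <div> do not produce line breaks here, so it would need extending if those matter for your document.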

ollo