4

What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the downloaded file names and threading would be a bonus.

The platform is Linux.

Charles Stewart
Cammel

6 Answers

5

wget | html2ascii

Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it).

See also: lynx.

dsummersl
dsm
  • Does html2text have an option to strip whitespace? I couldn't find one. – Cammel Jan 12 '09 at 17:55
  • Not that I am aware of, but you can use awk, sed, perl, etc. to strip whitespace. – dsm Jan 13 '09 at 08:44
  • Keep an eye out for limitations in the tools. `lynx`, for example, won't render things like tables. If `html2ascii` is anything like `pdftotext`, it's able to keep tables intact, but it limits output to 80 characters per line. Given a modestly wide table that would fit comfortably in, say, 150 characters per line, it'll insert line breaks, stack the text vertically, and completely destroy readability and/or grepability (if that's a word). – Brian Vandenberg Jul 31 '13 at 17:25
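
The whitespace cleanup suggested in the comments can also be done in a few lines of Python instead of awk/sed/perl. This is only a sketch: the function name `strip_whitespace` is made up for illustration, and the rules (trim each line, collapse runs of spaces and tabs, keep at most one blank line in a row) are an assumption about what "strip whitespace" should mean here.

```python
import re


def strip_whitespace(text: str) -> str:
    """Trim each line, collapse runs of spaces/tabs, and squeeze blank lines."""
    lines = []
    for raw in text.splitlines():
        line = re.sub(r"[ \t]+", " ", raw).strip()
        # Keep the line if it has content, or if it is the first blank
        # line after a non-blank one (leading blank lines are dropped).
        if line or (lines and lines[-1]):
            lines.append(line)
    # Drop a trailing blank line, if any.
    while lines and not lines[-1]:
        lines.pop()
    return "\n".join(lines)
```

It could be applied to the output of whichever converter is used, for example by reading the converted file, passing its contents through `strip_whitespace`, and writing the result back.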
3

Python Beautiful Soup allows you to build a nice extractor.
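
A minimal sketch of such an extractor, assuming the `beautifulsoup4` package is installed and that dropping `<script>` and `<style>` contents is what "text only" should mean. The function name `extract_text` is hypothetical.

```python
from bs4 import BeautifulSoup


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document, one text node per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are not readable text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # get_text joins the remaining text nodes; strip=True trims each piece
    # and drops pieces that are only whitespace.
    return soup.get_text(separator="\n", strip=True)
```

Because every text node ends up on its own line, inline markup such as `<b>` splits a sentence across lines; passing `separator=" "` instead keeps the output on a single line.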

matan h
S.Lott
0

I know that w3m can be used to render an HTML document and write its text content to a file, for example: `w3m -dump www.google.com > file.txt`.

For the remainder, I'm sure that wget can be used.

Jean Azzopardi
0

Look for the Simple HTML DOM parser for PHP on SourceForge. Use it to parse HTML that you have downloaded with cURL. Each DOM element has a "plaintext" attribute, which should give you only the text. I used this combination successfully in a lot of applications for quite some time.

Robert Elwell
0

Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ contains a lot of modules that provide the required functionality.

olle
0

Use wget to download the required HTML and then run html2text on the output files.
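
If the file-name control and threading mentioned in the question matter, the same two steps (download, then convert) can be sketched with only the Python standard library. This is a stand-in for wget and html2text, not a wrapper around them. All names here (`TextExtractor`, `html_to_text`, `filename_for`, `fetch_as_text`, `fetch_all`) are hypothetical, and the file-naming scheme, the `text` output directory, and the worker count are assumptions.

```python
import os
import re
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from urllib.parse import urlparse
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text nodes of an HTML string, one per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def filename_for(url):
    """Derive a flat, filesystem-safe .txt name from a URL (assumed scheme)."""
    parsed = urlparse(url)
    name = re.sub(r"[^A-Za-z0-9._-]+", "_", parsed.netloc + parsed.path).strip("_")
    return (name or "index") + ".txt"


def fetch_as_text(url, out_dir):
    """Download one URL and write its text content to out_dir."""
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    path = os.path.join(out_dir, filename_for(url))
    with open(path, "w", encoding="utf-8") as f:
        f.write(html_to_text(html))
    return path


def fetch_all(urls, out_dir="text", workers=4):
    """Fetch all URLs concurrently; returns the written paths in input order."""
    os.makedirs(out_dir, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch_as_text(u, out_dir), urls))
```

Calling `fetch_all` with a list of URLs read from a file would write one `.txt` file per URL. Note that the first failed download raises out of `fetch_all`; wrapping the body of `fetch_as_text` in a `try`/`except` is one way to skip bad URLs instead. Unlike `lynx` or `w3m`, this does no layout at all, so tables come out as one cell per line.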

Krishna Gopalakrishnan