HTML downloading and text extraction

Question

What would be a good tool, or set of tools, to download a list of URLs and extract only the text content? Spidering is not required, but control over the download file names, and threading would be a bonus.

The platform is Linux.

score 5 · Accepted Answer · edited May 29 '13 at 21:38

wget | html2ascii

Note: html2ascii can also be called html2a or html2text (and I wasn't able to find a proper man page on the net for it).

See also: lynx.

answered Jan 12 '09 at 14:30 by dsm · edited May 29 '13 at 21:38 by dsummersl

Does html2text have an option to strip whitespace? I couldn't find one. – Cammel Jan 12 '09 at 17:55
Not that I am aware of, but you can use awk, sed, perl, etc. to strip whitespace. – dsm Jan 13 '09 at 08:44
Keep an eye out for limitations in the tools. `lynx`, for example, won't render things like tables. If `html2ascii` is anything like `pdftotext`, it can keep tables intact, but it limits output to 80 characters per line. Given a modestly wide table that would fit comfortably in, say, 150 characters per line, it will insert newlines, stack the text vertically, and completely destroy readability and/or grepability (if that's a word). – Brian Vandenberg Jul 31 '13 at 17:25

score 3 · Answer 2 · edited Mar 03 '22 at 08:44

Python Beautiful Soup allows you to build a nice extractor.
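
As a rough illustration of what such an extractor might look like, here is a minimal sketch. It assumes the `beautifulsoup4` package is installed and uses the parser from Python's standard library; the function name `extract_text` and the choice of which tags to drop are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup


def extract_text(html):
    """Return the text of an HTML document, one text node per line."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove elements whose contents are not page text (assumed list).
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    # strip=True trims each text node and skips whitespace-only ones,
    # which also takes care of most stray whitespace in the output.
    return soup.get_text(separator="\n", strip=True)


if __name__ == "__main__":
    sample = "<body><h1>Title</h1><p>Some <b>bold</b> text.</p></body>"
    print(extract_text(sample))
```

The HTML itself still has to be fetched separately, for example with wget as the other answers suggest.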

answered Jan 12 '09 at 15:04 by S.Lott · edited Mar 03 '22 at 08:44 by matan h

score 0 · Answer 3 · answered Jan 12 '09 at 14:31

I know that w3m can be used to render an HTML document and put the text content in a text file, for example: `w3m www.google.com > file.txt` (the `-dump` option makes the dump-to-stdout behavior explicit).

For the remainder, I'm sure that wget can be used.

answered Jan 12 '09 at 14:31 by Jean Azzopardi

score 0 · Answer 4 · answered Jan 12 '09 at 14:34

Look for the Simple HTML DOM parser for PHP on SourceForge. Use it to parse HTML that you have downloaded with cURL. Each DOM element has a "plaintext" property, which should give you only the text. I have used this combination successfully in many applications for quite some time.

score 0 · Answer 5 · answered Jan 12 '09 at 14:36

Perl (Practical Extraction and Report Language) is a scripting language that is excellent for this type of work. http://search.cpan.org/ contains a lot of modules that have the required functionality.

answered Jan 12 '09 at 14:36 by olle

score 0 · Answer 6 · answered Jan 12 '09 at 14:40

Use wget to download the required html and then run html2text on the output files.
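
For the file-name control and threading mentioned in the question, the same two-step flow (download, then convert to text) can be sketched in Python using only the standard library. This is a stand-in for wget and html2text, not a wrapper around them; `name_for`, `html_to_text`, `download_all`, the output directory, and the worker count are all hypothetical names and defaults chosen for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def name_for(url):
    """Hypothetical naming scheme: host plus path, unsafe characters replaced."""
    parsed = urlparse(url)
    stem = re.sub(r"[^A-Za-z0-9._-]+", "_", parsed.netloc + parsed.path)
    return (stem.strip("_") or "index") + ".txt"


def fetch_and_save(url, out_dir):
    with urlopen(url, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    target = Path(out_dir) / name_for(url)
    target.write_text(html_to_text(html), encoding="utf-8")
    return target


def download_all(urls, out_dir="text", workers=8):
    """Fetch URLs on a thread pool; a failed download raises when collected."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda u: fetch_and_save(u, out_dir), urls))
```

Calling `download_all(urls)` with a list of URLs writes one `.txt` file per URL into the output directory. Swapping in a different `name_for` is the place to change how files are named.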

answered Jan 12 '09 at 14:40 by Krishna Gopalakrishnan
