0

Hey, can anyone tell me what command to run in a terminal to extract text from an HTML file using tags like <li>, <strong>, <b>, <title>, <td>, etc., as well as assignments of the form $var="strings" and JavaScript functions that use msgstring?

-> I am thinking of putting these tags in a text file.

-> Then I want to match the tags using a terminal command.

-> Then I have to write the matches to a dump file (text).

The reason is that I want to change the text according to a language preference.

I tried an awk script and egrep too, but I got poor results.

Marcel Korpel
Priya

5 Answers

2

This is exactly what pandoc is for.

pandoc filename.html -f html -t plain -o filename.txt

As a bonus, the resulting plain text is beautifully formatted.

See Pandoc Manual.

1

Doing this with awk and egrep would probably mean using regular expressions to parse HTML. This is a bad idea. See this famous answer.

Rather, use an HTML parser. See other answers in the link above for links to HTML parsers.
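To make the parser suggestion concrete, here is a minimal sketch using Python's standard-library html.parser module. The class name TagTextExtractor and the helper extract_text are made up for this illustration, and the tag list is just the one from the question; this is one possible approach under those assumptions, not a complete solution.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside a chosen set of tags (hypothetical helper)."""

    def __init__(self, wanted_tags):
        super().__init__(convert_charrefs=True)
        self.wanted = {t.lower() for t in wanted_tags}
        self.open_tags = []   # stack of currently open wanted tags
        self.results = []     # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        # html.parser already lowercases tag names
        if tag in self.wanted:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop up to and including the matching tag, so that
            # unclosed inner tags do not leave the stack stuck.
            while self.open_tags and self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.open_tags:
            # Attribute the text to the innermost open wanted tag
            self.results.append((self.open_tags[-1], text))


def extract_text(html, wanted_tags):
    """Return (tag, text) pairs for text inside any of wanted_tags."""
    parser = TagTextExtractor(wanted_tags)
    parser.feed(html)
    parser.close()
    return parser.results


if __name__ == "__main__":
    sample = (
        "<html><head><title>Home</title></head><body>"
        "<ul><li>One</li><li><b>Two</b></li></ul>"
        "<p>skip</p><table><tr><td>Cell</td></tr></table>"
        "</body></html>"
    )
    for tag, text in extract_text(sample, ["li", "strong", "b", "title", "td"]):
        print(f"{tag}\t{text}")
```

The tag list could just as well be read from a text file with one tag per line, and the printed pairs redirected into a dump file.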

As to parsing PHP source code:

As it is structurally similar to HTML, you might be able to use a (tolerant) HTML parser. Otherwise, use a PHP parser. See e.g. this answer.

Community
sleske
1

Use a regex like this:

perl -ne 'print "$1\n" while /<strong>(.*?)<\/strong>/g' file

Of course, your regex will be more complex, I guess.

sw0x2A
  • Like this, it will take a lot of time. I would have to enter each tag once and then parse it. It would be better to put all the tags in a text file and match that file against the file I need to parse. – Priya Dec 23 '10 at 11:54
0

You may want to clarify your question (sample input and expected output would help). By "command in terminal", I assume you mean a shell command.

This seems nontrivial, and you may need to write a shell script. See the Advanced Bash-Scripting Guide. But like sleske, I would also recommend a more advanced scripting language (Perl/Python).

subhacom
0

Here is my answer.

egrep -i -r -f myfile.txt [path] > dumpdata.txt

It's working, but I had to do more parsing afterwards to clean out all the JavaScript functions and the PHP variable values containing strings.
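The extra step for PHP string values could be sketched in Python roughly as follows. This is only a regex heuristic for simple assignments such as $var="strings", not a PHP parser, so it will miss concatenations, heredocs, and strings passed directly to functions. The names PHP_STRING_ASSIGN and extract_php_strings are invented for this example.

```python
import re

# Heuristic: $name = "..." or $name = '...', allowing backslash escapes
# inside the quotes. This is an assumption about the input style and
# does not understand PHP syntax in general.
PHP_STRING_ASSIGN = re.compile(
    r"""\$(\w+)\s*=\s*(["'])((?:\\.|(?!\2).)*)\2""",
    re.DOTALL,
)


def extract_php_strings(source):
    """Return (variable_name, raw_string_value) pairs found in source."""
    return [(m.group(1), m.group(3)) for m in PHP_STRING_ASSIGN.finditer(source)]


if __name__ == "__main__":
    sample = '$title = "Hello world";\n$count = 5;\n$msg=\'Bye\';'
    for name, value in extract_php_strings(sample):
        print(f"{name}\t{value}")
```

Numeric assignments and comparisons such as $a == "x" are skipped, because the pattern requires a quote right after a single equals sign.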

Priya