Parse HTML using C

Question

I need to extract some content from an HTML page (valid XHTML). I fetch the page with curl and keep it in memory.

I considered using regular expressions with the PCRE library, but I couldn't find any examples of using it from C. I then looked at HTML parsers, and again there isn't a good selection. All I could find was a sparsely documented libxml module called HTMLparser.

Are there any alternatives? If not, are there examples for the options I have already found?

Obligatory link to warning against parsing HTML with regular expressions: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — moopet, Aug 08 '16 at 10:07
See the following link, where I wrote up a complete solution using the libxml2 C library on Windows: http://stackoverflow.com/questions/5465965/how-can-libxml2-be-used-to-parse-data-from-xml/38826052#38826052 – Pankaj Vavadiya, Aug 08 '16 at 10:09

score 13 · Accepted Answer · answered Oct 06 '09 at 20:34

You want to use HTML Tidy for this. The libcurl site has some example source code to get you going; it demonstrates traversing the DOM tree. You don't need an XML parser, and Tidy doesn't fail on badly formatted HTML.

http://curl.haxx.se/libcurl/c/htmltidy.html

Byron Whitlock

This is what I ended up implementing. I didn't feel the need to pull out a hungry XML parser just to grab a single line of text. Thanks – Oct 12 '09 at 14:59

score 8 · Answer 2 · answered Oct 06 '09 at 20:31

I would use libhtmltidy plus whatever XML parser you like, such as expat or libxml. It depends on what you're looking for.

Michael Krelin - hacker

For readers' information: HTML parsers are software for automated parsing of Hypertext Markup Language (HTML). They have two main purposes. HTML traversal: offering programmers an interface to easily access and modify the HTML source (canonical example: DOM parsers). HTML cleaning: fixing invalid HTML and improving the layout and indentation of the resulting markup (canonical example: HTML Tidy). – Pankaj Vavadiya Aug 08 '16 at 09:57

score 2 · Answer 3 · answered Oct 06 '09 at 20:30

If you want to parse XML using C, then by far the best way to proceed is to use the LibXML library. The main page is at http://xmlsoft.org/. In addition to the downloads, it has explicit code examples that specifically show how to handle parsing. I know for a fact that precompiled versions are available for Mac and Windows, that most Linux and BSD distributions already include it, and that you can build it from source if you wish.

Tony Miller

Good choice, but it will choke on broken html, so I'd run it through libtidy first. – Michael Krelin - hacker Oct 06 '09 at 20:34

score 2 · Answer 4 · answered Aug 31 '16 at 14:12

Google recently created a pure C99 library for parsing HTML, HTML5 specifically. It's easy to use in any C program and actively developed.

https://github.com/google/gumbo-parser

Anton Kochkov

Most changes are from 2 years ago and the HTML5 standard has already been defined; isn't the code a little outdated? – Lucas Steffen Sep 02 '16 at 16:50

score 0 · Answer 5 · answered Jul 28 '20 at 16:24

myhtml is a fast C/C++ HTML5 parser that uses threads: https://github.com/lexborisov/myhtml

EgoPingvina

The myhtml project appears to have reached end of life and recommends using lexbor (https://github.com/lexbor/lexbor) instead. – Brecht Sanders Aug 01 '20 at 08:26
Yes, you are right. The latest version has been superseded by it. – EgoPingvina Aug 01 '20 at 13:49
