How to match HTML with Regex

Question

I'm trying to extract some URLs from an HTML file using Python. Here is what the text looks like:

preabc!precde<preefg<

I want to extract "cde" and "efg". The patterns I've tried:

pre(.*?)<
pre(.(?!^pre)).*?<

However, neither of them works :(. Note that the real lengths of "cde" and "efg" are unknown. I'm not familiar with regular expressions, so please explain your answers. Many thanks.

EDIT:

Sorry for my bad explanation and ambiguous example. I want to extract titles like "GIRL FRIENDS" with a certain date (2014-7-31 in this case):

<a href="http://rs.xidian.edu.cn/forum.php?mod=viewthread&tid=662128&extra=page%3D1" onclick="atarget(this)" class="s xst">GIRL FRIENDS</a>  ...<a href="http://rs.xidian.edu.cn/forum.php?mod=viewthread&tid=662128&extra=page%3D1&page=2">2</a> <a href="http://rs.xidian.edu.cn/forum.php?mod=redirect&tid=662128&goto=lastpost#lastpost" class="xi1">New</a> </th> <td class="by"> <cite> <a href="http://rs.xidian.edu.cn/home.php?mod=space&uid=265770" c="1">机器人</a></cite> 昨天 23:55 </td>

Why the downvote? Could you explain it rather than just downvote? — , Jul 31 '14 at 18:34
what makes "cde" and "efg" different from "pre" and "abc"? can you provide more examples of input + desired output? — redShadow, Jul 31 '14 at 18:45
btw, I hope you're not trying to parse HTML using regular expressions.. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — redShadow, Jul 31 '14 at 18:47
@skamazin Just "GIRL FRIENDS". There're many titles in the same "pattern" in the html file. — , Jul 31 '14 at 18:58
@Wisatbff why don't you use an HTML parser, such as ``lxml.html``? You'll have a much more robust solution without having to get crazy with hyper-complex regexes.. — redShadow, Jul 31 '14 at 19:00
What makes "GIRL FRIENDS" different from all the other titles? Telling us "There're many titles in the same "pattern"" is basically saying regex won't work for you at all — skamazin, Jul 31 '14 at 19:00
You need to use a parser for this. Look at BeautifulSoup or lxml — Adam Smith, Jul 31 '14 at 19:01
@Wisatbff Yeah, if you're looking to do this for a very long file or for many files, I would go with a parser and not a regex. But if it's only this one instance, I can find a regex that'll work for you. — skamazin, Jul 31 '14 at 19:04
Alrighty, try my regex in my answer. Tell me if something goes wrong — skamazin, Jul 31 '14 at 19:08

skamazin · Answer 1 · 2014-07-31T19:07:41.647

2

You can try:

 (>([A-Z ]+?)<|title="([\d-]+))


The more specific and less predictable the input gets, the more complicated and unreadable the regex is going to be. I don't suggest using regex for this; try an HTML parser instead.
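To see how the pattern above behaves, here is a minimal sketch that runs it over an abbreviated copy of the markup from the question. The shortened href values and the variable names are assumptions made for illustration. Note that [A-Z ] also accepts a lone space, so the pattern matches the whitespace between adjacent tags, and those hits have to be filtered out.

```python
import re

# Abbreviated copy of the question's markup (assumption: hrefs shortened).
html = (
    '<a href="forum.php?mod=viewthread&tid=662128" onclick="atarget(this)" '
    'class="s xst">GIRL FRIENDS</a>  ...'
    '<a href="forum.php?mod=viewthread&tid=662128&page=2">2</a> '
    '<a href="forum.php?mod=redirect&tid=662128" class="xi1">New</a> </th>'
)

pattern = re.compile(r'(>([A-Z ]+?)<|title="([\d-]+))')

# findall returns one (whole_match, text, date) tuple per match.
matches = pattern.findall(html)

# Drop the matches whose text group is only whitespace between tags.
titles = [text for _, text, _ in matches if text.strip()]

print(len(matches))  # total matches, including whitespace-only ones
print(titles)
```

On this snippet only one of the matches is a real title; the rest are single spaces between tags. Lowercase text such as "New" is skipped entirely, which shows how tightly the pattern is tied to this one example.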

edited Jul 31 '14 at 19:07

answered Jul 31 '14 at 18:37

skamazin


1

+1 the most frequently asked question gets the most frequent answer. http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – msw Jul 31 '14 at 19:12
Ok, forget it. You are right. I'll use an HTML parser instead. Thank you all:) – Jul 31 '14 at 19:12

score 1 · Answer 2 · answered Jul 31 '14 at 19:06

1

I think the best answer is to not try to parse HTML with a regex. There are lots of HTML parsing libraries available. Using a regex is only going to cause headaches.
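As a rough illustration of the parser route, here is a minimal sketch using only html.parser from the Python standard library. It assumes that title links are the a tags whose class list includes "xst", as in the sample from the question; the TitleParser class, the shortened href values, and that selection rule are all hypothetical. Filtering by date is not covered, since the sample does not show where the date is stored.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of <a> tags whose class list includes "xst"."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # None while we are outside a title link

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and "xst" in classes:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            self.titles.append("".join(self._buffer).strip())
            self._buffer = None


# Abbreviated copy of the question's markup (assumption: hrefs shortened).
html = (
    '<a href="forum.php?mod=viewthread&tid=662128" onclick="atarget(this)" '
    'class="s xst">GIRL FRIENDS</a>  ...'
    '<a href="forum.php?mod=viewthread&tid=662128&page=2">2</a> '
    '<a href="forum.php?mod=redirect&tid=662128" class="xi1">New</a> </th>'
)

parser = TitleParser()
parser.feed(html)
parser.close()
titles = parser.titles
print(titles)  # ['GIRL FRIENDS']
```

Third-party libraries such as BeautifulSoup or lxml, mentioned in the comments on the question, offer shorter ways to express the same selection.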

answered Jul 31 '14 at 19:06

Bradley Kaiser


f.rodrigues · Answer 3 · 2014-07-31T19:09:50.380

0

This should do the trick:

pre.*!pre(.*)<pre(.*)<

Explanation:

pre.*! skips the first part, the 'abc': it starts with pre, has a body of any characters of any length (the .* part matches anything), and ends with a !

pre(.*)< captures the cde. It works like the part above, but it stores whatever is in the body in matching group 1; the () are matching groups.

pre(.*)< captures the efg. Same as above, but it stores the body in matching group 2.

Note that the ! and both < are what divide the string.
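The explanation above can be checked with a short sketch on the example string from the question. It also runs the asker's first pattern to show why that attempt failed. The last pattern is not from this answer: it is a hypothetical generalisation that assumes the wanted values never contain ! or <.

```python
import re

text = "preabc!precde<preefg<"

# The asker's first attempt: the lazy group starts at the first "pre",
# so the first capture also swallows "abc!pre".
attempt = re.findall(r"pre(.*?)<", text)
print(attempt)  # ['abc!precde', 'efg']

# The pattern from this answer: a single match with two groups.
match = re.search(r"pre.*!pre(.*)<pre(.*)<", text)
print(match.groups())  # ('cde', 'efg')

# Hypothetical generalisation (an assumption, not part of the answer):
# forbid "!" and "<" inside the body so any number of items can be found.
general = re.findall(r"pre([^!<]*)<", text)
print(general)  # ['cde', 'efg']
```

The answer's pattern is fixed to exactly two items after a single !, while the character-class variant keeps working when more pre...< items follow.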

edited Jul 31 '14 at 19:09

answered Jul 31 '14 at 18:39

f.rodrigues


nope. [Beware ZA̡͊͠͝LGΌ!](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – msw Jul 31 '14 at 19:14
