0

I'm trying to extract some URLs from an HTML file using Python. Here is what the text looks like:

preabc!precde<preefg<

I want to extract "cde" and "efg". The patterns I've tried:

pre(.*?)<
pre(.(?!^pre)).*?<

However, neither of them works :(. Note that the real lengths of "cde" and "efg" are unknown. I'm not familiar with regular expressions, so please explain your answers. Many thanks.

EDIT:

Sorry for my bad explanation and ambiguous example. I want to extract titles like "GIRL FRIENDS" that have a certain date (2014-7-31 in this case):

<a href="http://rs.xidian.edu.cn/forum.php?mod=viewthread&amp;tid=662128&amp;extra=page%3D1" onclick="atarget(this)" class="s xst">GIRL FRIENDS</a> <span class="tps">&nbsp;...<a href="http://rs.xidian.edu.cn/forum.php?mod=viewthread&tid=662128&amp;extra=page%3D1&amp;page=2">2</a></span> <a href="http://rs.xidian.edu.cn/forum.php?mod=redirect&amp;tid=662128&amp;goto=lastpost#lastpost" class="xi1">New</a> </th> <td class="by"> <cite> <a href="http://rs.xidian.edu.cn/home.php?mod=space&amp;uid=265770" c="1">机器人</a></cite> <em><span><span title="2014-7-31">昨天&nbsp;23:55</span></span></em> </td>

Adam Smith

3 Answers

2

You can try:

 (>([A-Z ]+?)<|title="([\d-]+))
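To see what this pattern actually captures, here is a minimal sketch in Python. The `html` string is a trimmed copy of the snippet from the question (the shortened `href` is my own assumption, only there to keep the example short). Group 2 holds the text between `>` and `<` when it consists of uppercase letters and spaces only, and group 3 holds the date from a `title="..."` attribute.

```python
import re

# Trimmed copy of the HTML from the question; the href is shortened
# for readability (an assumption, not the original URL).
html = (
    '<a href="forum.php?tid=662128" class="s xst">GIRL FRIENDS</a> '
    '<em><span><span title="2014-7-31">昨天&nbsp;23:55</span></span></em>'
)

pattern = re.compile(r'(>([A-Z ]+?)<|title="([\d-]+))')

titles = []
dates = []
for match in pattern.finditer(html):
    # Group 2 also matches whitespace-only gaps such as "> <",
    # so those are filtered out here.
    if match.group(2) is not None and match.group(2).strip():
        titles.append(match.group(2))
    if match.group(3) is not None:
        dates.append(match.group(3))

print(titles)  # ['GIRL FRIENDS']
print(dates)   # ['2014-7-31']
```

Note that `[A-Z ]` only accepts uppercase titles, and nothing in the pattern ties a title to the date that follows it, so this is fragile on real pages.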


The more specific your requirements and the less predictable the input, the more complicated and unreadable the regex is going to be. I don't suggest using regex for this; try an HTML parser instead.
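As a rough illustration of the parser route, here is a sketch that uses only `html.parser` from the Python standard library. The class name `ThreadParser` and the rule "a title is the text of an `<a>` whose class list contains `xst`, and its date is the next `<span>` with a `title` attribute" are my assumptions, based solely on the one snippet in the question. The real page may need different rules, and the sample below is trimmed with shortened URLs.

```python
from html.parser import HTMLParser


class ThreadParser(HTMLParser):
    """Collect (title, date) pairs from rows shaped like the question's snippet."""

    def __init__(self):
        super().__init__()
        self.in_title = False       # True while inside <a class="... xst">
        self.current_title = None   # title waiting for its date
        self.rows = []              # list of (title, date) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and "xst" in classes:
            self.in_title = True
            self.current_title = ""
        elif tag == "span" and "title" in attrs and self.current_title is not None:
            # First dated span after a title link: pair them up.
            self.rows.append((self.current_title, attrs["title"]))
            self.current_title = None

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and self.current_title is not None:
            self.current_title += data


# Trimmed version of the HTML in the question (URLs shortened).
sample = (
    '<a href="forum.php?mod=viewthread&amp;tid=662128" onclick="atarget(this)" '
    'class="s xst">GIRL FRIENDS</a> '
    '<a href="forum.php?mod=redirect&amp;tid=662128" class="xi1">New</a> </th> '
    '<td class="by"> <cite> <a href="home.php?mod=space&amp;uid=265770" c="1">机器人</a></cite> '
    '<em><span><span title="2014-7-31">昨天&nbsp;23:55</span></span></em> </td>'
)

parser = ThreadParser()
parser.feed(sample)
parser.close()

wanted = [title for title, date in parser.rows if date == "2014-7-31"]
print(wanted)  # ['GIRL FRIENDS']
```

Third-party parsers would make this shorter, but the standard-library version keeps the example free of extra dependencies.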

skamazin
  • +1 the most frequently asked question gets the most frequent answer. http://stackoverflow.com/questions/6751105/why-its-not-possible-to-use-regex-to-parse-html-xml-a-formal-explanation-in-la – msw Jul 31 '14 at 19:12
  • Ok, forget it. You are right. I'll use an HTML parser instead. Thank you all :) –  Jul 31 '14 at 19:12
1

I think the best answer is not to try to parse HTML with a regex. There are lots of HTML parsing libraries available. Using a regex is only going to cause headaches.

Bradley Kaiser
0

This should do the trick:

pre.*!pre(.*)<pre(.*)<

Explanation:

pre.*! skips the first part, 'abc!'. It matches pre, then a body of any characters of any length (.* matches anything), and ends at the !.

pre(.*)< captures cde. It works like the part above, but the parentheses form a capturing group, so whatever the body matches is stored in group 1.

pre(.*)< captures efg. Same as above, but the result is stored in group 2.

Note that the ! and the two < characters are what divide the string.
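Here is a small Python check of the above against the sample string from the question. It also shows what the asker's first attempt returns, which explains why it looked broken. The last pattern, `pre([^!<]*)<`, is not part of this answer; it is a possible variant that assumes the wanted text never contains `!` or `<`.

```python
import re

text = "preabc!precde<preefg<"

# The asker's first attempt: the lazy group starts at the first "pre"
# and runs until the first "<", so it swallows "abc!precde".
lazy_result = re.findall(r"pre(.*?)<", text)
print(lazy_result)  # ['abc!precde', 'efg']

# The pattern from this answer: two capturing groups, split on "!" and "<".
match = re.search(r"pre.*!pre(.*)<pre(.*)<", text)
groups = match.groups()
print(groups)  # ('cde', 'efg')

# Hypothetical variant (an assumption, not from the answer): forbid "!" and
# "<" inside the group, so any number of "pre...<" chunks can be collected.
general = re.findall(r"pre([^!<]*)<", text)
print(general)  # ['cde', 'efg']
```

The answer's pattern is tied to exactly this layout (one `!` followed by two `pre...<` chunks), so it will need adjusting if the real text has a different shape.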

f.rodrigues
  • nope. [Beware ZA̡͊͠͝LGΌ!](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) – msw Jul 31 '14 at 19:14