
Suppose, for example, that a video website has a search option:

http://example.com/search=query

and it returns all the search results in this form:

<a href="LinkToVideo"</a><img src="ImageSource" alt="AltDescription"><b>VideoName</b>

I want to use this data, so I send a request to the website and then use re to return a list with LinkToVideo, ImageSource, AltDescription and VideoName:

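import re
import urllib2  # Python 2; in Python 3 this is urllib.request

def search(query):
    # Fetch the search page and pull out (link, image, alt, name) tuples.
    response = urllib2.urlopen("http://example.com/search=" + query)
    resp = response.read()
    return re.compile('<a href="(.+?)"</a><img src="(.+?)" alt="(.+?)"><b>(.+?)</b>').findall(resp)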

and it returns a list like this:

[('example.com/video1.mp4', 'example.com/image1.jpg', 'blah blah ', 'Cats'),('example.com/video2.mp4', 'example.com/image2.jpg', 'blah', 'Dogs'),('example.com/video3.mp4', 'example.com/image3.jpg', 'blah blah blah', 'Zebra')]

The problem is that I don't need the alt description, but its text varies between results, so I can't just match it literally.

I want the list to look like this:

[('example.com/video1.mp4', 'example.com/image1.jpg', 'Cats'), ('example.com/video2.mp4', 'example.com/image2.jpg', 'Dogs'), ('example.com/video3.mp4', 'example.com/image3.jpg','Zebra')]

I know I can ignore this, but on the real site (this is just an example) the list is much bigger and there is more data I need to ignore.

I searched Google and didn't find a solution. I'm sorry if the title doesn't describe the problem exactly.

Thanks

1 Answer

Use a non-capturing group ((?:…)) like this:

'<a href="(.+?)"</a><img src="(.+?)" alt="(?:.+?)"><b>(.+?)</b>'

Or just get rid of the group entirely:

'<a href="(.+?)"</a><img src="(.+?)" alt=".+?"><b>(.+?)</b>'
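As a quick check, here is a minimal sketch that runs both patterns against a sample string. The `html` value is made-up data in the same shape as the question's example, not output from a real site. Since `findall` only returns capturing groups, the alt text drops out of the tuples either way.

```python
import re

# Hypothetical sample markup, copied from the shape shown in the question.
html = (
    '<a href="example.com/video1.mp4"</a><img src="example.com/image1.jpg" alt="blah blah "><b>Cats</b>'
    '<a href="example.com/video2.mp4"</a><img src="example.com/image2.jpg" alt="blah"><b>Dogs</b>'
)

# Variant 1: the alt text is matched by a non-capturing group.
non_capturing = re.compile(r'<a href="(.+?)"</a><img src="(.+?)" alt="(?:.+?)"><b>(.+?)</b>')
# Variant 2: no group at all around the alt text.
no_group = re.compile(r'<a href="(.+?)"</a><img src="(.+?)" alt=".+?"><b>(.+?)</b>')

results = non_capturing.findall(html)
print(results)
# [('example.com/video1.mp4', 'example.com/image1.jpg', 'Cats'),
#  ('example.com/video2.mp4', 'example.com/image2.jpg', 'Dogs')]

# Both variants give the same three-element tuples.
same = results == no_group.findall(html)
```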

I should also point out that using regular expressions to parse arbitrary HTML is a pretty bad idea and has been known to cause madness. I'd strongly recommend using a proper HTML parser instead.
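To give an idea of what the parser route could look like, here is a rough sketch using `html.parser` from the Python 3 standard library (the question uses Python 2, so this is an assumption about your environment). It also assumes well-formed markup where the `<a>` tag wraps the `<img>` and `<b>` tags, which is not exactly the shape in the question; `VideoParser` and the sample `html` are hypothetical names and data.

```python
from html.parser import HTMLParser


class VideoParser(HTMLParser):
    """Collects (link, image, name) tuples; the alt attribute is simply never read."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._href = None
        self._src = None
        self._in_b = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._href = attrs.get("href")
        elif tag == "img":
            self._src = attrs.get("src")
        elif tag == "b":
            self._in_b = True

    def handle_data(self, data):
        # Text inside <b>...</b> is taken as the video name.
        if self._in_b:
            self.results.append((self._href, self._src, data))

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_b = False


# Hypothetical, well-formed sample markup.
html = (
    '<a href="example.com/video1.mp4"><img src="example.com/image1.jpg" alt="blah blah "><b>Cats</b></a>'
    '<a href="example.com/video2.mp4"><img src="example.com/image2.jpg" alt="blah"><b>Dogs</b></a>'
)

parser = VideoParser()
parser.feed(html)
parser.close()
print(parser.results)
# [('example.com/video1.mp4', 'example.com/image1.jpg', 'Cats'),
#  ('example.com/video2.mp4', 'example.com/image2.jpg', 'Dogs')]
```

Attributes you don't care about never need to appear in the code at all, so extra data on the real site shouldn't require touching a pattern.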

p.s.w.g
  • **Thanks!** Exactly what I was looking for! – user3611091 Jul 29 '14 at 21:06
  • Madness, haha, that's why people look at me funny? :) +1, also let's let OP know that for `".+?"` this option `"[^"]+"` is often preferred (any chars that are not a quote). @user3611091, since you say it's exactly what you're looking for, please consider clicking the checkmark on the left of the answer to accept it. Accepting an answer earns rep both for you and for the answerer. Thanks! – zx81 Jul 31 '14 at 04:44
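To illustrate why the negated character class from the comment above tends to be safer, here is a small sketch. The sample string is made up: the first image is not followed by a `<b>` tag, so the lazy `.+?` stretches across quotes and tags until it finds one, and pairs the wrong image with the name. `[^"]+` cannot cross a quote, so that match attempt fails and the regex moves on to the correct image.

```python
import re

# Hypothetical markup: the first <img> has no <b> name after it.
html = '<img src="a.jpg" alt="x"><i>skip</i><img src="b.jpg" alt="y"><b>Dogs</b>'

# Lazy dot: the alt part is free to run past the closing quote.
lazy = re.findall(r'<img src="(.+?)" alt=".+?"><b>(.+?)</b>', html)
# Negated class: each attribute value must stay inside its own quotes.
strict = re.findall(r'<img src="([^"]+)" alt="[^"]+"><b>(.+?)</b>', html)

print(lazy)    # [('a.jpg', 'Dogs')]  <- wrong image paired with the name
print(strict)  # [('b.jpg', 'Dogs')]
```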