
I am extracting URLs from a set of raw data, and I intend to do this using Python regular expressions.

I tried

(http.+)

But it matched everything from http to the end of the line.

Input

href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone

https://vine.co/v/i6iIrBwnTFI

Expected Output

http://twitter.com/download/iphone

https://vine.co/v/i6iIrBwnTFI

Command

2 Answers


Try this: http[^"\s]*

This assumes all your links start with http. The match stops as soon as it encounters a whitespace character or a "

Here is how you could use it:

import re
regexp = r'http[^"\s]*'
urls = '''href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone https://vine.co/v/i6iIrBwnTFI'''
output = re.findall(regexp, urls)
print(output)

['http://twitter.com/download/iphone', 'https://vine.co/v/i6iIrBwnTFI']
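To see why the original (http.+) over-matches, here is a minimal sketch comparing the two patterns on the first input line. The dot matches any character except a newline and + is greedy, so the match runs to the end of the line, while the negated character class stops at the first double quote or whitespace character. The variable names are only illustrative.

```python
import re

line = 'href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone'

# Greedy ".+" consumes everything up to the end of the line.
greedy = re.findall(r'http.+', line)

# The negated class stops at the first double quote or whitespace character.
bounded = re.findall(r'http[^"\s]*', line)

print(greedy)
print(bounded)
```

The first list holds the URL plus the rest of the line; the second holds only the URL.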

HakunaMaData

First, you should find out which characters are valid in a URL.

Then, the regular expression could be:

(http://|https://)([a-zA-Z0-9\-\._~:/\?\#\[\]@!$&'\(\)\*\+,;=]+)

In my Python interpreter, it looks like:

>>> import re
>>> regexp = r'''(http://|https://)([a-zA-Z0-9\-\._~:/\?\#\[\]@!$&'\(\)\*\+,;=]+)'''
>>> url = '''href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone https://vine.co/v/i6iIrBwnTFI'''
>>> r = re.findall(regexp, url)
>>> r
[('http://', 'twitter.com/download/iphone'), ('https://', 'vine.co/v/i6iIrBwnTFI')]
>>> [x[0]+x[1] for x in r]
['http://twitter.com/download/iphone', 'https://vine.co/v/i6iIrBwnTFI']
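If the scheme and the rest of the URL do not need to be captured separately, one possible variant drops the capturing groups so that re.findall returns whole URLs and no joining step is needed. This is only a sketch: the name URL_PATTERN is hypothetical, and adding % to the character class is an assumption made here so that percent-encoded URLs are not cut short.

```python
import re

# Same character set as the pattern above, without capturing groups.
# "%" is added as an assumption so percent-encoded URLs are not cut short.
URL_PATTERN = re.compile(r"https?://[a-zA-Z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")

text = '''href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone https://vine.co/v/i6iIrBwnTFI'''

# With no groups in the pattern, findall returns the full matches as strings.
urls = URL_PATTERN.findall(text)
print(urls)
```

Like the original, this pattern accepts characters such as ) and , at the end of a match, so URLs followed by punctuation in running text may need trimming.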
Bob Fred