
I want to extract certain HTML nodes from a large HTML text, but something in my regex is wrong.

I want to fetch all URLs that look like this:

<a href="ftp://mysite.com"> some stuff </a>

I am trying to do:

/<a href="ftp:(.+)">/

Sometimes this works, but sometimes it grabs everything up to a later closing >.

Is there a way to rewrite this regex so that it stops at the first >?

Unihedron
Nick Ginanto

3 Answers


Make your regex ungreedy:

/<a href="ftp:(.+?)">/
//        here __^

or:

/<a href="ftp:([^>"]+)">/

But it's better to use a parser.
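
As a rough illustration of the parser route, here is a minimal sketch that assumes Python and its standard-library html.parser module (the question does not say which language is in use). The class name FtpLinkCollector and the sample string are made up for this example.

```python
from html.parser import HTMLParser


class FtpLinkCollector(HTMLParser):
    """Collects href values of <a> tags whose URL starts with ftp:."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and value.startswith("ftp:"):
                self.links.append(value)


# Hypothetical sample input, not the real page from the question
sample = (
    '<p><a href="ftp://mysite.com"> some stuff </a> '
    '<a class="x" href="http://other.example">skip</a></p>'
)

collector = FtpLinkCollector()
collector.feed(sample)
print(collector.links)  # ['ftp://mysite.com']
```

Unlike the regexes, this does not care about attribute order or extra attributes inside the tag.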

Toto

* and + are greedy (they match as much as possible). Appending ? to them makes them non-greedy.

/<a href="ftp:(.+?)">/

Or you can exclude " by using a negated character class ([^...]):

/<a href="ftp:([^"]+)">/
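
To see the difference side by side, here is a small sketch assuming Python's re module (the patterns are the ones above without the / delimiters). The sample string with two links on one line is invented for the demonstration.

```python
import re

# Hypothetical sample: two matching links on the same line
html = '<a href="ftp://mysite.com"> one </a> <a href="ftp://other.example"> two </a>'

greedy = re.findall(r'<a href="ftp:(.+)">', html)
lazy = re.findall(r'<a href="ftp:(.+?)">', html)
negated = re.findall(r'<a href="ftp:([^"]+)">', html)

print(greedy)   # one long match that runs to the last "> on the line
print(lazy)     # ['//mysite.com', '//other.example']
print(negated)  # ['//mysite.com', '//other.example']
```

The greedy version swallows the first closing "> and only stops at the last one on the line, which is the behaviour described in the question.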

BTW, it's not a good idea to use regular expressions to parse HTML.

Community
falsetru

+ is a greedy quantifier: it matches as much as it possibly can while still allowing the rest of the regex to match. For this, I recommend a negated character class, [^"]+, which means any character except ", repeated one or more times.

/<a href="ftp:([^"]+)">/
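
One practical reason to prefer the negated class over .+? shows up when the tag carries more attributes after href. The sketch below assumes Python's re module. The sample tag and the relaxed pattern (the same regex without the trailing >) are my own additions, not part of the question.

```python
import re

# Hypothetical tag with an extra attribute after href
tag = '<a href="ftp://mysite.com" class="dl"> some stuff </a>'

lazy = re.search(r'<a href="ftp:(.+?)">', tag)
negated = re.search(r'<a href="ftp:([^"]+)">', tag)
# Assumed variation: drop the trailing > so extra attributes are tolerated
relaxed = re.search(r'<a href="ftp:([^"]+)"', tag)

print(lazy.group(1))     # //mysite.com" class="dl
print(negated)           # None
print(relaxed.group(1))  # //mysite.com
```

The non-greedy pattern still needs a "> to finish, so it runs past the end of the URL into the next attribute. The negated class cannot cross a quote, so it fails cleanly instead of returning a wrong URL.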


hwnd