0

I am trying to write a regex in PHP that identifies relative src paths. My idea was to use a lookahead (?= followed by a negation ^ and a subexpression (http), but this doesn't work. It works for a single character, but the ^ doesn't work with a subexpression. Is there an && operator or something?

 <img.*?src=[\'\"]\(?=^(http))

I need it to test for the entire string http, or else imgs whose paths start with h, t or p will be wrongly excluded. Any suggestions? Is this task too big for regex?

mario
joel

2 Answers

2

You can use negative lookahead, which is (?!...) instead of (?=...). For your example (I'd put the anchor at the start):

^(?!http)

Which reads: start of string, not immediately followed by "http".
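As a rough illustration, here is a minimal sketch of that anchored lookahead applied to bare src values. The sample paths are made up for the example and are not from the question.

```php
<?php
// Hypothetical sample values, chosen to include a path that starts with "h".
$srcs = [
    'images/logo.png',
    'http://example.com/a.png',
    'https://example.com/b.png',
    'help/icon.png',
];

// ^(?!http) succeeds only if the string does not begin with the full text "http".
$relative = array_values(array_filter($srcs, function ($src) {
    return preg_match('/^(?!http)/', $src) === 1;
}));

print_r($relative);
```

Note that help/icon.png is kept: the lookahead tests the whole sequence "http", not the individual letters h, t and p.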

Edit: since you updated with a fuller example:

<img [^>]*src=['"](?!http)([^'"]+)['"]

                          ^------^ - this capturing group captures the link
                                     which doesn't start with http

Of course, for proper parsing you should use DOM ;)
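For reference, a minimal sketch of how the pattern above might be used from PHP with preg_match_all. The HTML snippet and variable names are assumptions made up for the example.

```php
<?php
// Hypothetical markup, not from the question.
$html = <<<'HTML'
<img src="images/logo.png" alt="logo">
<img class="x" src='http://example.com/a.png'>
<img src="photos/tree.jpg">
HTML;

// Same pattern as above, with the single quotes escaped for a PHP string.
$pattern = '/<img [^>]*src=[\'"](?!http)([^\'"]+)[\'"]/';

preg_match_all($pattern, $html, $matches);

// Group 1 holds each src value that does not start with "http".
$relative = $matches[1];

print_r($relative);
```

The second img is skipped because the lookahead fails right after its opening quote, and there is no other src= in that tag for the engine to backtrack to.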

porges
0

It's not the most useful answer, but it sounds as though you've reached the limit of applicability for regex in HTML parsing.

As per this answer here, look at using an HTML DOM parser. I haven't used PHP DOM parsers much, but I know that in other languages a DOM parser often makes HTML tasks a 30-second job, rather than an hour or more of testing weird exceptional cases.
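A rough sketch of what the DOM route could look like with PHP's bundled DOMDocument class (this assumes the dom extension is enabled, and the markup is invented for the example). It applies the same "does not start with http" rule as the regex answer, using a plain string comparison instead of a lookahead.

```php
<?php
// Hypothetical markup, not from the question.
$html = '<p><img src="images/logo.png"><img src="https://example.com/a.png"><img alt="no src"></p>';

$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML($html);
libxml_clear_errors();

$relative = [];
foreach ($doc->getElementsByTagName('img') as $img) {
    // getAttribute returns an empty string when the attribute is missing.
    $src = $img->getAttribute('src');
    // Keep values that do not begin with "http" (case-insensitive).
    if ($src !== '' && strncasecmp($src, 'http', 4) !== 0) {
        $relative[] = $src;
    }
}

print_r($relative);
```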

Matt Mitchell
  • I tend to jump on the "Don't parse *ML with regex" bandwagon as well, but in this case, this question is really independent of HTML parsing. It's actually a question of URL parsing. Even if joel uses a proper parser to extract the URL, he still has the same basic problem. – Frank Farmer May 05 '11 at 02:42
  • @Frank Farmer - Yep, you're right, although if you had a parser to grab the value of the SRC attribute, couldn't you just do a `StartsWith("http://")` equivalent in PHP? – Matt Mitchell May 05 '11 at 02:44