0

I have written a regex query to extract a URL. Sample text:

<p>a tinyurl here <a href="https://vvvconf.instenv.atl-test.space/x/HQAU">https://vvvconf.instenv.atl-test.space/x/HQAU</a></p>

URL I need to extract:

https://vvvconf.instenv.atl-test.space/x/HQAU

My regex attempts:

  1. https:\/\/vvvconf.[a-z].*\/x\/[a-zA-z0-9]*
    

    This extracts:

    {"https://vvvconf.instenv.atl-test.space/x/HQAU">https://vvvconf.instenv.atl-test.space/x/HQAU"}
    
  2. >https:\/\/vvvconf.[a-z].*\/x\/[a-zA-z0-9]*<
    

    This extracts:

    {">https://vvvconf.instenv.atl-test.space/x/HQAU<"}
    

How can I refine the regex so I just extract the URL https://vvvconf.instenv.atl-test.space/x/HQAU?

John Kugelman
  • 330,190
  • 66
  • 504
  • 555
Vik G
  • 357
  • 1
  • 4
  • 19

1 Answers1

1

Depends if you want to extract the URL from the href attribute, or the text within the a tag. Assuming the latter you can use a positive lookbehind (if your regex flavor supports it)

const input = '<p>a tinyurl here <a href="https://vvvconf.instenv.atl-test.space/x/HQAU">https://vvvconf.instenv.atl-test.space/x/HQAU</a></p>';
const regex = /(?<=[>])https:[^<]*/;
console.log(input.match(regex))

Output:

[
  "https://vvvconf.instenv.atl-test.space/x/HQAU"
]

Explanation:

  • (?<=[>]) - positive lookbehind for > (expects but excludes >)
  • https: - expect start of URL (add as much as desired)
  • [^<]* - scan over everything that is not <
Peter Thoeny
  • 3,415
  • 1
  • 8
  • 14