-3

I have a text file of links produced by scraping, and I need a regular expression to extract them from the file. The links share the same structure but differ in length, for example:

https://www.cnbc.com/2016/10/12/billionaire-richard-branson-learned-a-key-business-lesson-playing-tennis.html

and this:

https://www.cnbc.com/2016/10/12/hedge-fund-bonus-makeover.html

I can write a regex for the base domain, but the title part that follows gives me a hard time. Mine is

[h][t][t][p][s]:\/\/[w][w][w].[c][n][b][c].[c][o][m]\/[2][0][1][5-8] 

for https://www.cnbc.com/2016/10/11/, but I don't know how to extend it to handle the different words that follow in each link.

jackson
  • I have tried something of my own [here](https://regex101.com/r/8iIuYL/2). Nevertheless, you can also refer to [this](https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url) – nice_dev May 27 '18 at 07:34
  • You should really read up on the basics of regular expression syntax. Most of those square brackets are totally unnecessary, but then you've left unescaped `.`s that match any character. – jonrsharpe May 27 '18 at 08:04
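
The point in the comment above can be checked with a small sketch. It is written in Python purely for illustration (the question does not name a language), and the `bad` string is a made-up input, not one from the asker's file. It compares the asker's pattern with a version that drops the single-character brackets and escapes the dots:

```python
import re

# The asker's pattern, verbatim: single-character classes and unescaped dots.
verbose = re.compile(
    r"[h][t][t][p][s]:\/\/[w][w][w].[c][n][b][c].[c][o][m]\/[2][0][1][5-8]"
)
# Same intent without the redundant brackets, with the dots escaped.
tidy = re.compile(r"https://www\.cnbc\.com/201[5-8]")

good = "https://www.cnbc.com/2016/10/12/hedge-fund-bonus-makeover.html"
# Hypothetical input: the dots in the host are replaced by other characters.
bad = "https://wwwXcnbcYcom/2016/10/12/hedge-fund-bonus-makeover.html"

verbose_matches_good = verbose.match(good) is not None
verbose_matches_bad = verbose.match(bad) is not None  # unescaped "." accepts X and Y
tidy_matches_good = tidy.match(good) is not None
tidy_matches_bad = tidy.match(bad) is not None

print(verbose_matches_good, verbose_matches_bad)
print(tidy_matches_good, tidy_matches_bad)
```

Both patterns accept the real link, but only the original one also accepts the malformed host, because an unescaped `.` stands for any character.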

2 Answers

1

You are overcomplicating things,

https?://\S+?cnbc\.com\S+

will probably do, see https://regex101.com/r/ci3O1I/1/ for a demo.
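
As a rough illustration of how this pattern could be applied to the contents of a file, here is a minimal Python sketch. The language choice and the `example.org` line are assumptions made for the example; the two CNBC links are the ones from the question.

```python
import re

# The pattern from this answer, unchanged.
pattern = re.compile(r"https?://\S+?cnbc\.com\S+")

# Stand-in for the text read from the links file.
text = """\
https://www.cnbc.com/2016/10/12/billionaire-richard-branson-learned-a-key-business-lesson-playing-tennis.html
https://www.cnbc.com/2016/10/12/hedge-fund-bonus-makeover.html
https://www.example.org/other.html
"""

# findall returns every non-overlapping match as a string.
links = pattern.findall(text)

for link in links:
    print(link)
```

Because `\S` never matches whitespace, each match stops at the end of its own line, so links of different lengths are handled without any special casing.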

Jan
  • https://www.cnbc.com/ali-montag/?page=31 this one is also selected in links file which is not required – jackson May 27 '18 at 07:54
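
One possible way to leave out pages like the one in the comment above is to require the dated article shape explicitly. The sketch below is in Python and assumes that every wanted link looks like `/YYYY/MM/DD/slug.html` on `www.cnbc.com`; that assumption comes from the two examples in the question and may not hold for the whole file.

```python
import re

# Assumed article shape: /YYYY/MM/DD/<slug>.html on www.cnbc.com.
article = re.compile(r"https?://www\.cnbc\.com/\d{4}/\d{2}/\d{2}/[\w-]+\.html")

text = """\
https://www.cnbc.com/2016/10/12/billionaire-richard-branson-learned-a-key-business-lesson-playing-tennis.html
https://www.cnbc.com/ali-montag/?page=31
https://www.cnbc.com/2016/10/12/hedge-fund-bonus-makeover.html
"""

# Only URLs with a date path and an .html slug are kept.
articles = article.findall(text)

for url in articles:
    print(url)
```

The author listing page is skipped because `ali-montag` cannot satisfy `\d{4}` right after the domain.
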
1

You can simplify your regex to something like this:

preg_match("/https?:\/\/www\.cnbc\.com\/201[5-8].*/", $string, $match);

This matches addresses that start with http or https,
followed by any link whose year is between 2015 and 2018.

See here how it works:
https://www.phpliveregex.com/p/o7p
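
For comparison, the same idea can be sketched in Python. This is a close variant rather than a literal port: it uses `https?` for the scheme and `\S*` for the rest of the link, and the 2014 and 2018 URLs are hypothetical test inputs.

```python
import re

# Scheme http or https, fixed host, then a year from 2015 to 2018.
pattern = re.compile(r"https?://www\.cnbc\.com/201[5-8]\S*")

urls = [
    "https://www.cnbc.com/2016/10/12/hedge-fund-bonus-makeover.html",
    "https://www.cnbc.com/2014/01/02/older-story.html",  # hypothetical, year too early
    "http://www.cnbc.com/2018/05/27/some-story.html",    # hypothetical, plain http
]

# Keep only the URLs that match from the start of the string.
kept = [u for u in urls if pattern.match(u)]

for u in kept:
    print(u)
```

The 2014 link is dropped by the `201[5-8]` class, while both the http and https variants pass.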

Andreas