Regular expression for links with the same structure but a different number of words

Question

I have a text file of links collected by scraping. I need a regular expression that extracts these links from the file. The links all share the same structure but differ in length, for example:

https://www.cnbc.com/2016/10/12/billionaire-richard-branson-learned-a-key-business-lesson-playing-tennis.html

and this:

https://www.cnbc.com/2016/10/12/hedge-fund-bonus-makeover.html

I can write a regex for the base domain, but the title part that follows gives me a tough time. Mine is:

[h][t][t][p][s]:\/\/[w][w][w].[c][n][b][c].[c][o][m]\/[2][0][1][5-8]

This is for https://www.cnbc.com/2016/10/11/, but I don't know how to extend it to cover the different words that follow in each link.

I have tried something of my own [here](https://regex101.com/r/8iIuYL/2). Nevertheless, you can also refer to [this](https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url) — nice_dev, May 27 '18 at 07:34
You should really read up on the basics of regular expression syntax. Most of those square brackets are totally unnecessary, but then you've left unescaped `.`s that match any character. — jonrsharpe, May 27 '18 at 08:04
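
To illustrate the comment above, here is a small Python sketch (the test strings and variable names are made up for the example). It suggests that a one-character class like `[h]` behaves the same as a plain `h`, and that an unescaped `.` accepts any character where a literal dot was meant.

```python
import re

url = "https://www.cnbc.com/2016/10/12/hedge-fund-bonus-makeover.html"

# A one-character class such as [h] matches the same text as a plain h.
bracketed = re.compile(
    r"[h][t][t][p][s]://[w][w][w]\.[c][n][b][c]\.[c][o][m]/[2][0][1][5-8]"
)
plain = re.compile(r"https://www\.cnbc\.com/201[5-8]")
same_match = bracketed.match(url).group(0) == plain.match(url).group(0)

# An unescaped dot matches any character, so "wwwXcnbc" slips through.
bad_host = "https://wwwXcnbc.com"  # hypothetical string, not a real link
unescaped_dot_matches = re.match(r"https://www.cnbc.com", bad_host) is not None
escaped_dot_matches = re.match(r"https://www\.cnbc\.com", bad_host) is not None

print(same_match, unescaped_dot_matches, escaped_dot_matches)
```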

score 1 · Answer 1 · answered May 27 '18 at 07:38

You are overcomplicating things,

https?://\S+?cnbc\.com\S+

will probably do, see https://regex101.com/r/ci3O1I/1/ for a demo.
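
Since the asker seems to be working in Python, here is a minimal sketch of how this pattern might be applied with `re.findall`. The sample text is an assumption, assembled from the links mentioned in this thread. Note that the pattern accepts any cnbc.com link, not only the dated article links.

```python
import re

# Sample text is assumed for illustration; the URLs come from this thread.
text = """
https://www.cnbc.com/2016/10/12/billionaire-richard-branson-learned-a-key-business-lesson-playing-tennis.html
https://www.cnbc.com/2016/10/12/hedge-fund-bonus-makeover.html
https://www.cnbc.com/ali-montag/?page=31
"""

pattern = re.compile(r"https?://\S+?cnbc\.com\S+")

# With no capturing groups, findall returns the full text of each match.
links = pattern.findall(text)

for link in links:
    print(link)
```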

Jan


This one is also matched in the links file, but it is not wanted: https://www.cnbc.com/ali-montag/?page=31 – jackson May 27 '18 at 07:54

score 1 · Accepted Answer · answered May 27 '18 at 08:00

You can simplify your regex to something like this:

preg_match("/https?:\/\/www\.cnbc\.com\/201[5-8].*/", $string, $match);

This matches addresses that start with http or https,
followed by a path that begins with a year between 2015 and 2018.

See here how it works:
https://www.phpliveregex.com/p/o7p
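
For Python, a rough equivalent might look like the sketch below. It is not a literal port: it assumes `https?` for the scheme and uses `\S*` instead of `.*` so that a match stops at the first whitespace. The sample text is made up from the links in this thread.

```python
import re

# Sample text is assumed for illustration; the URLs come from this thread.
text = """
https://www.cnbc.com/2016/10/12/billionaire-richard-branson-learned-a-key-business-lesson-playing-tennis.html
https://www.cnbc.com/2016/10/12/hedge-fund-bonus-makeover.html
https://www.cnbc.com/ali-montag/?page=31
"""

# Require a year from 2015 to 2018 right after the domain,
# which leaves out links such as the author page.
pattern = re.compile(r"https?://www\.cnbc\.com/201[5-8]\S*")

links = pattern.findall(text)

for link in links:
    print(link)
```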

Andreas


– jackson May 27 '18 at 08:03
What is that? I don't understand what that is. – Andreas May 27 '18 at 08:04
After matching the links from a file, Python prints them like this: – jackson May 27 '18 at 08:06
<_sre.SRE_Match object; span=(1596748, ...), match='https://www.cnbc.com/2018/03/02/how-richard-brans> <_sre.SRE_Match object; span=(1596841, 1596917), match=...> – jackson May 27 '18 at 08:07
But the match shown is not long enough to hold the whole URL. If you know what causes this, it would help me. – jackson May 27 '18 at 08:08
It doesn't help if you paste 50 of those strings here. I do not understand what it is! Your question stated two simple links, now you add "stuff" and talk about length. I don't get it. – Andreas May 27 '18 at 08:11
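
A possible explanation for the short strings in the comments above: in CPython, printing a match object shows an abbreviated debug display in which the matched text is cut off, while the match itself still holds the whole URL. A minimal sketch, assuming a made-up input line and the year-restricted pattern, of reading the full text with `group(0)`:

```python
import re

# Hypothetical input line; the URL is one of the links from the question.
text = (
    "see https://www.cnbc.com/2016/10/12/"
    "billionaire-richard-branson-learned-a-key-business-lesson-playing-tennis.html today"
)

m = re.search(r"https?://www\.cnbc\.com/201[5-8]\S*", text)

shown = repr(m)        # abbreviated display, meant for debugging only
full_url = m.group(0)  # the complete matched text
start, end = m.span()  # the span covers the whole URL

print(shown)
print(full_url)
```

With `re.finditer`, the same idea would be `[m.group(0) for m in pattern.finditer(text)]`, or simply `pattern.findall(text)` when the pattern has no capturing groups.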
