
I'm trying to write a bash script that finds URLs in a text file (example.com, example.eu, etc.) and copies them to another text file using egrep. My current output includes the URLs that I want, but unfortunately also a lot that I don't want, such as 123.123 or example.3xx.

My script currently looks like this:

egrep -o '\w*\.[^\d\s]\w{2,3}\b' lab4trace.txt > lab4url.txt

I tried some regex checker sites, but the regex gives more accurate results there than when I run it with egrep.

Any help is appreciated.

jsims281
Erik
  • Does this answer your question? https://stackoverflow.com/questions/13611973/how-to-grep-for-a-url-in-a-file – franzisk Mar 06 '20 at 14:36

2 Answers


If you know the domain suffixes in advance, you can use a regex that only accepts those suffixes, for example one ending in \.(com|eu|org)
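As a rough illustration of the suffix approach, here is a minimal sketch. The sample text and the three suffixes are made up for the example, and \b is a GNU grep extension, so this assumes GNU grep (the same assumption the question's \w and \b already make).

```shell
# Hypothetical sample input; in practice you would read your own file
# (e.g. lab4trace.txt) instead of this variable.
sample='visit example.com or example.eu
values 123.123 and example.3xx
logo.png from cdn.example.org'

# -o prints only the matched text, -E enables extended regular expressions.
# ([[:alnum:]-]+\.)+  one or more labels, each followed by a dot
# (com|eu|org)        the suffix must be one of the listed ones
# \b                  the suffix must end at a word boundary (GNU extension)
urls=$(printf '%s\n' "$sample" | grep -oE '([[:alnum:]-]+\.)+(com|eu|org)\b')

printf '%s\n' "$urls"
```

With this sample, 123.123, example.3xx and logo.png are skipped because none of them ends in a listed suffix. To write the result to a file, redirect the grep output as in the question (> lab4url.txt).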

Bhargav Rao
tavanez
  • That would help indeed, but I'm unsure how many of the suffixes I have are domain suffixes and how many are .png or similar. I considered downloading another text file with all supported domain suffixes and cross-referencing the two files, but that sounds like a hassle. – Erik Mar 06 '20 at 15:33

Based on https://stackoverflow.com/a/2183140/939457 (and https://www.rfc-editor.org/rfc/rfc2181#section-11), a domain name is a series of labels separated by dots, where each label can contain any character except a dot. Since you only want names that end in a valid TLD, you can use https://data.iana.org/TLD/tlds-alpha-by-domain.txt to generate a list of patterns:

grep -i -E -f <(curl -s https://data.iana.org/TLD/tlds-alpha-by-domain.txt | sed 's/^/([^.]{1,63}\\\.){1,4}/') <<'EOF'
aaa.ali.bab.yandex
fsfdsa.d.s
alpha flkafj
foo.bar.zone
alpha.beta.gama.delta.zappos
example.com
EOF

Result:

aaa.ali.bab.yandex
foo.bar.zone
alpha.beta.gama.delta.zappos
example.com

Note: this is a memory killer. The example above took 2GB because the list of TLDs is huge, so you might consider finding a list of commonly used TLDs and using that instead.
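One way to work with a shorter list is sketched below. The four-entry list is a hypothetical excerpt written in the same format as the IANA file (a leading # comment line, then uppercase names), and the variable names are only illustrative. Instead of one pattern per TLD, it joins the list into a single alternation. Whether that helps memory use with the full list has not been measured here.

```shell
# Hypothetical short TLD list in the IANA file's format:
# a '#' comment line followed by one uppercase TLD per line.
tlds='# hypothetical excerpt, not the real list
COM
EU
ORG
ZONE'

# Drop the comment line and join the rest with '|': COM|EU|ORG|ZONE
alternation=$(printf '%s\n' "$tlds" | grep -v '^#' | paste -sd'|' -)

# One combined pattern: dot-terminated labels followed by a listed TLD.
# \b is a GNU grep extension (word boundary).
pattern="([[:alnum:]-]+\\.)+(${alternation})\\b"

# -i because the list is uppercase, -o to print only the matched names.
matches=$(printf '%s\n' \
  'aaa.ali.bab.yandex' 'fsfdsa.d.s' 'alpha flkafj' 'foo.bar.zone' 'example.com' |
  grep -oiE "$pattern")

printf '%s\n' "$matches"
```

Unlike the per-TLD patterns above, the label part here only allows letters, digits and hyphens, so it is stricter than "any character except a dot". With this short list, aaa.ali.bab.yandex is not reported because YANDEX is not in the excerpt.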

Community
Sorin