
Can you help me write a regexp that is valid from the point of view of `sed` syntax? So far, every regexp I write is rejected by the terminal as invalid.

SpaceBucks
  • `sed` can't determine uniqueness. You can use a regexp to extract the URLs from the logs, then pipe to `sort -u` to get the unique values. – Barmar Feb 18 '20 at 13:41
  • See https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url and https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url?r=SearchResults&s=2|29.6615 – Barmar Feb 18 '20 at 13:42
  • The problem is not with getting the unique values; the problem is writing a regexp that is valid in `sed` syntax. – SpaceBucks Feb 18 '20 at 13:44
  • Better to pipe the result to `uniq` if you do not want to lose the original ordering – franzisk Feb 18 '20 at 13:45
  • Please give an example of log lines, so that we can help you with the regexp. Anyway, I would use grep and not sed – franzisk Feb 18 '20 at 13:46
  • @franzisk `uniq` only works if the lines are sorted. – Barmar Feb 18 '20 at 13:47
  • Example of a log record; I need to extract the URL after the HTTP response code: 41.201.181.27 - [2019-04-06 18:22:02] "GET /images/stands/photo_mois.jpg HTTP/1.1" 304 - "http://example.com/popup.php?choix=mois" "Mozilla/4.0" "-" – SpaceBucks Feb 18 '20 at 13:48
  • 2
    Show the regexp you tried. Remember that `sed` defaults to basic RE, it doesn't do PCRE. – Barmar Feb 18 '20 at 13:48
  • access logs don't contain full URLs, they leave out the `http://domain` prefix. – Barmar Feb 18 '20 at 13:49
  • 1
    There are many tools for extracting data from webserver access logs, you shouldn't need to use your own regexp. – Barmar Feb 18 '20 at 13:49
  • @Barmar you are right, I was wrong – franzisk Feb 18 '20 at 13:50
  • So how can I extract unique URLs from the log with `grep`, without writing my own regexp? – SpaceBucks Feb 18 '20 at 13:54
  • I posted an answer below. You don't even need grep – franzisk Feb 18 '20 at 13:56
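Since the thread stalls on `sed` syntax, here is a minimal sketch of a basic RE (no `-E`, no PCRE) that extracts the quoted referrer URL following the status code and size fields, assuming the layout of the example line in the comments:

```shell
# Sample line taken from the comments above.
line='41.201.181.27 - [2019-04-06 18:22:02] "GET /images/stands/photo_mois.jpg HTTP/1.1" 304 - "http://example.com/popup.php?choix=mois" "Mozilla/4.0" "-"'

# BRE only: groups are \( \), counted repetition is \{3\}.
# Match " 304 - " (status, size) and capture the quoted URL after it.
printf '%s\n' "$line" \
  | sed 's/.* [0-9]\{3\} [^ ]* "\([^"]*\)".*/\1/'
# → http://example.com/popup.php?choix=mois
```

Pipe the result through `sort -u` to deduplicate, as suggested above.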

1 Answer


If your log format is uniform, use this command:

cut -f4 -d\" < logfile | sort -u 

If you want to exclude the query string from the uniqueness check, use this:

cut -f4 -d\" < logfile | cut -f1 -d\? | sort -u 

Explanation

Filter the input with the `cut` command, taking the 4th field (`-f4`) and using `"` as the delimiter (`-d\"`). The second filter works the same way, using `?` as the delimiter to drop the query string.
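With the example line from the comments, the two pipelines behave like this (a sketch; `logfile` in the commands above stands in for your real access log):

```shell
line='41.201.181.27 - [2019-04-06 18:22:02] "GET /images/stands/photo_mois.jpg HTTP/1.1" 304 - "http://example.com/popup.php?choix=mois" "Mozilla/4.0" "-"'

# Splitting on ", field 4 is the referrer URL.
printf '%s\n' "$line" | cut -f4 -d\"
# → http://example.com/popup.php?choix=mois

# Cutting again on ? keeps only the part before the query string.
printf '%s\n' "$line" | cut -f4 -d\" | cut -f1 -d\?
# → http://example.com/popup.php
```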

franzisk
  • 1
    [The `cat`s are useless.](/questions/11710552/useless-use-of-cat) You are making some fairly bold assumptions about what the log format looks like; I'm guessing you assume they will be processing Apache logs. – tripleee Feb 18 '20 at 13:59
  • No, I read all of the user's comments, where he posted a log line example. Thanks for the cat suggestion, I'll fix it – franzisk Feb 18 '20 at 14:00
  • Can I somehow transform the command above to find only unique POST records? – SpaceBucks Feb 19 '20 at 10:15
  • Filter with `grep` before cutting, since the request method is in the second `"`-delimited field: `grep POST logfile | cut -f4 -d\" | sort -u`. Learn `grep`, it is very useful; it is just a filter for text parsing, very easy to use – franzisk Feb 19 '20 at 13:34
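The order of the filters matters here: the request method lives in the second `"`-delimited field, so `grep POST` has to run on the whole line before `cut` discards that field. A sketch with two made-up log lines (hypothetical IPs, paths, and referrers):

```shell
# Build a throwaway log file with one GET line and two identical POST lines.
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
1.2.3.4 - [2019-04-06 18:22:02] "GET /a HTTP/1.1" 200 - "http://example.com/a" "Mozilla/4.0" "-"
1.2.3.4 - [2019-04-06 18:22:03] "POST /b HTTP/1.1" 200 - "http://example.com/b" "Mozilla/4.0" "-"
1.2.3.4 - [2019-04-06 18:22:04] "POST /b HTTP/1.1" 200 - "http://example.com/b" "Mozilla/4.0" "-"
EOF

# Keep only POST lines, then take the referrer field and deduplicate.
grep POST "$logfile" | cut -f4 -d\" | sort -u
# → http://example.com/b

rm -f "$logfile"
```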