
Can you help me write a regexp that is valid from the point of view of `sed` syntax? So far, every regexp I write is rejected by the terminal as invalid.

SpaceBucks
  • `sed` can't determine uniqueness. You can use a regexp to extract the URLs from the logs, then pipe to `sort -u` to get the unique values. – Barmar Feb 18 '20 at 13:41
  • See https://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url and https://stackoverflow.com/questions/3809401/what-is-a-good-regular-expression-to-match-a-url?r=SearchResults&s=2|29.6615 – Barmar Feb 18 '20 at 13:42
  • The problem is not with getting the unique values; the problem is writing a regexp that is valid in `sed` syntax. – SpaceBucks Feb 18 '20 at 13:44
  • Better to pipe the result to `uniq` if you do not want to lose the original ordering – franzisk Feb 18 '20 at 13:45
  • Please give an example of log lines, so that we can help you with the regexp. Anyway, I would use grep and not sed – franzisk Feb 18 '20 at 13:46
  • @franzisk `uniq` only works if the lines are sorted. – Barmar Feb 18 '20 at 13:47
  • Example of a log record; I need to extract the URL after the HTTP response code: 41.201.181.27 - [2019-04-06 18:22:02] "GET /images/stands/photo_mois.jpg HTTP/1.1" 304 - "http://example.com/popup.php?choix=mois" "Mozilla/4.0" "-" – SpaceBucks Feb 18 '20 at 13:48
  • 2
    Show the regexp you tried. Remember that `sed` defaults to basic RE, it doesn't do PCRE. – Barmar Feb 18 '20 at 13:48
  • access logs don't contain full URLs, they leave out the `http://domain` prefix. – Barmar Feb 18 '20 at 13:49
  • 1
    There are many tools for extracting data from webserver access logs, you shouldn't need to use your own regexp. – Barmar Feb 18 '20 at 13:49
  • @Barmar you are right, I was wrong – franzisk Feb 18 '20 at 13:50
  • So how can I extract unique URLs from the log with `grep`, without writing my own regexp? – SpaceBucks Feb 18 '20 at 13:54
  • I posted an answer below. You don't even need grep – franzisk Feb 18 '20 at 13:56
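Since the thread stalls on `sed` syntax, here is a minimal sketch of a basic RE (no `-E`, no PCRE) that extracts the quoted referrer URL following the status code and size fields, assuming the layout of the example line in the comments:

```shell
# Sample line taken from the comments above.
line='41.201.181.27 - [2019-04-06 18:22:02] "GET /images/stands/photo_mois.jpg HTTP/1.1" 304 - "http://example.com/popup.php?choix=mois" "Mozilla/4.0" "-"'

# BRE only: groups are \( \), counted repetition is \{3\}.
# Match " 304 - " (status, size) and capture the quoted URL after it.
printf '%s\n' "$line" \
  | sed 's/.* [0-9]\{3\} [^ ]* "\([^"]*\)".*/\1/'
# → http://example.com/popup.php?choix=mois
```

Pipe the result through `sort -u` to deduplicate, as suggested above.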

1 Answer


If your log format is uniform, use this command:

cut -f4 -d\" < logfile | sort -u 

If you want to exclude the query string from the uniqueness check, use this:

cut -f4 -d\" < logfile | cut -f1 -d\? | sort -u 

Explanation

Filter the input with the `cut` command, taking the 4th field (`-f4`) and using `"` as the delimiter (`-d\"`). The second filter works the same way, using `?` as the delimiter to drop the query string.
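With the example line from the comments, the two pipelines behave like this (a sketch; `logfile` in the commands above stands in for your real access log):

```shell
line='41.201.181.27 - [2019-04-06 18:22:02] "GET /images/stands/photo_mois.jpg HTTP/1.1" 304 - "http://example.com/popup.php?choix=mois" "Mozilla/4.0" "-"'

# Splitting on ", field 4 is the referrer URL.
printf '%s\n' "$line" | cut -f4 -d\"
# → http://example.com/popup.php?choix=mois

# Cutting again on ? keeps only the part before the query string.
printf '%s\n' "$line" | cut -f4 -d\" | cut -f1 -d\?
# → http://example.com/popup.php
```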

franzisk
  • 1
    [The `cat`s are useless.](/questions/11710552/useless-use-of-cat) You are making some fairly bold assumptions about what the log format looks like; I'm guessing you assume they will be processing Apache logs. – tripleee Feb 18 '20 at 13:59
  • No, I read all of the user's comments, where he posted a log line example. Thanks for the cat suggestion, I'll fix it – franzisk Feb 18 '20 at 14:00
  • Can I somehow transform the command above to find only unique POST records? – SpaceBucks Feb 19 '20 at 10:15
  • Filter with `grep` before cutting, since the request method is in the second `"`-delimited field: `grep POST logfile | cut -f4 -d\" | sort -u`. Learn `grep`, it is very useful; it is just a filter for text parsing, very easy to use – franzisk Feb 19 '20 at 13:34
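The order of the filters matters here: the request method lives in the second `"`-delimited field, so `grep POST` has to run on the whole line before `cut` discards that field. A sketch with two made-up log lines (hypothetical IPs, paths, and referrers):

```shell
# Build a throwaway log file with one GET line and two identical POST lines.
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
1.2.3.4 - [2019-04-06 18:22:02] "GET /a HTTP/1.1" 200 - "http://example.com/a" "Mozilla/4.0" "-"
1.2.3.4 - [2019-04-06 18:22:03] "POST /b HTTP/1.1" 200 - "http://example.com/b" "Mozilla/4.0" "-"
1.2.3.4 - [2019-04-06 18:22:04] "POST /b HTTP/1.1" 200 - "http://example.com/b" "Mozilla/4.0" "-"
EOF

# Keep only POST lines, then take the referrer field and deduplicate.
grep POST "$logfile" | cut -f4 -d\" | sort -u
# → http://example.com/b

rm -f "$logfile"
```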