
How can I get the number of requests per unique host from a log file using a shell script?

I managed to get the requested output with the script below, but can someone help me with a proper solution?

filename=test.log

# The below for loop iterates over every whitespace-separated token in the file and collects the valid hostnames.
for i in $(cat $filename); do
  declare -a hosts
  # regex needs improvement
  if [[ "$i" =~ ^[A-Za-z0-9-]+\.[A-Za-z0-9-]+\.[A-Za-z0-9-]+$ ]]
  then
        hosts+=("$i")
  fi
done

# Get unique hostnames from the list and save the sorted unique results back into the array
hosts=($(echo "${hosts[@]}" | tr ' ' '\n' | sort -u | tr '\n' ' '))

# Get the number of occurrences of each host in the entire file
for i in ${hosts[@]}; do
        count=$(cat $filename| grep $i| wc -l)
        echo "$i" "$count"
done

The logfile contains the lines below.

unicomp6.unicomp.net - - [01/Jul/1995:00:00:06 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
burger.letters.com - - [01/Jul/1995:00:00:11 -0400] "GET /shuttle/countdown/liftoff.html HTTP/1.0" 304 0
burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 304 0
burger.letters.com - - [01/Jul/1995:00:00:12 -0400] "GET /shuttle/countdown/video/livevideo.gif HTTP/1.0" 200 0
d104.aa.net - - [01/Jul/1995:00:00:13 -0400] "GET /shuttle/countdown/ HTTP/1.0" 200 3985
unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310
unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
unicomp6.unicomp.net - - [01/Jul/1995:00:00:14 -0400] "GET /images/KSC-logosmall.gif HTTP/1.0" 200 1204
d104.aa.net - - [01/Jul/1995:00:00:15 -0400] "GET /shuttle/countdown/count.gif HTTP/1.0" 200 40310
d104.aa.net - - [01/Jul/1995:00:00:15 -0400] "GET /images/NASA-logosmall.gif HTTP/1.0" 200 786
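
For this sample input, the desired output (each unique host followed by its request count) is:

burger.letters.com 3
d104.aa.net 3
unicomp6.unicomp.net 4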

[Uniq array explanation](https://stackoverflow.com/questions/13648410/how-can-i-get-unique-values-from-an-array-in-bash)

I encountered this in a recent HackerRank test and didn't find a proper example anywhere. I hope this will be useful for someone.

SNR
  • First of all, [Don't read lines with for](https://mywiki.wooledge.org/DontReadLinesWithFor). Then paste your script at https://shellcheck.net for validation/suggestions. – Jetchisel Jul 11 '21 at 05:56
  • Please add your desired output (no description, no images, no links) for that sample input to your question (no comment). – Cyrus Jul 11 '21 at 06:59

2 Answers


Something like this should work with your given input file.

This uses an associative array (a bash 4+ feature) inside a while + read loop, plus Parameter Expansion to extract the host name from each log line.

#!/usr/bin/env bash

declare -A uniq
filename=test.log

while IFS= read -r lines; do
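  # ${lines%% -*} removes the longest suffix matching " -*", i.e. everything
  # from the first " -" onward, leaving just the host name as the array key.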
  ((uniq["${lines%% -*}"]++))
done < "$filename"

To check the value of the associative array uniq

declare -p uniq

Output

declare -A uniq=([unicomp6.unicomp.net]="4" [d104.aa.net]="3" [burger.letters.com]="3" )

To see/print the unique hosts (keys) from inside the while loop.

Change from:

((uniq["${lines%% -*}"]++))

To:

((!uniq["${lines%% -*}"]++)) && echo "${lines%% -*}"

The `!` makes the arithmetic expression true only while the count is still zero, so each host name is printed the first time it is seen.
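
Run against the sample log, this prints each host the first time it appears:

unicomp6.unicomp.net
burger.letters.com
d104.aa.net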

To print out just the unique hosts (keys) after the loop.

printf '%s\n' "${!uniq[@]}"

Output

unicomp6.unicomp.net
d104.aa.net
burger.letters.com

To print out the values, which are how many times each host name was seen in the log.

printf '%s\n' "${uniq[@]}"

Output

4
3
3

Loop through the keys and values of the associative array uniq to add more useful info.

for i in "${!uniq[@]}"; do
  printf '"%s" has a requests of [%d]\n' "$i" "${uniq[$i]}"
done

Output

"unicomp6.unicomp.net" has a requests of [4]
"d104.aa.net" has a requests of [3]
"burger.letters.com" has a requests of [3]
Jetchisel
  • Can you please explain the ```uniq``` usage inside the while loop? It would really improve the answer and be helpful. – SNR Jul 12 '21 at 05:53
  • `uniq` is an associative array, which was created by `declare -A`; the `(( ))` is an arithmetic construct in bash, and `++` increments. `"${lines%% -*}"` is a form of Parameter Expansion. Does that explain it? – Jetchisel Jul 12 '21 at 05:58
  • Just to add: the two ```%%``` in ```${lines%% -*}``` mean: from the end of the parameter, delete the longest (greedy) match, up to and including the pattern " -". – SNR Jul 12 '21 at 06:14
  • Also, you have linked [Uniq array explanation](https://stackoverflow.com/questions/13648410/how-can-i-get-unique-values-from-an-array-in-bash). Some posts there explain why there can't be duplicate keys inside an associative array. Those folks can explain it better than I can. – Jetchisel Jul 12 '21 at 06:19

If you are allowed to use standard tools (as your code implies), then you can simply do:

<test.log cut -f1 -d\  | sort | uniq -c
  • cut - extract first column, using space as delimiter
  • sort - order so uniq works
  • uniq - count the duplicates
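
For the sample log in the question, this produces the following (the exact spacing of the counts may vary between uniq implementations):

      3 burger.letters.com
      3 d104.aa.net
      4 unicomp6.unicomp.net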

If the logfile is huge (terabytes), this method should also avoid the memory issues of the bash associative-array approach: cut and uniq need almost no memory, and sort will switch to an external merge sort.

jhnc