
I'm trying to make a bash script that will tell me the latest stable version of the Linux kernel.

The problem is that, while I can remove everything after certain characters, I don't seem to be able to delete everything prior to certain characters.

#!/bin/bash

wget=$(wget --output-document - --quiet www.kernel.org | \grep -A 1 "latest_link")

wget=${wget##.tar.xz\">}

wget=${wget%</a>}

echo "${wget}"

Somehow the output "ignores" the wget=${wget##.tar.xz\">} line.

Tommaso Thea Cioni
  • It's not good practice to make a variable with the same name as a command. – agc Apr 06 '17 at 17:17
  • Instead of parsing the kernel.org HTML, use the RSS feed to get the version: https://www.kernel.org/feeds/kdist.xml Then it's just a matter of parsing it (a rough sketch follows these comments): http://stackoverflow.com/questions/893585/how-to-parse-xml-in-bash – konstructor Apr 06 '17 at 18:57
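
For reference, a rough sketch of that RSS approach; the assumption that each entry's <title> looks like "4.10.8: stable" is mine and not from the question, so the filtering may need adjusting:

#!/bin/bash

# Grab the kdist feed, pull out the <title> elements, and keep the first one
# marked "stable" (the "version: stable" title format is an assumption).
curl -s https://www.kernel.org/feeds/kdist.xml |
  grep -o '<title>[^<]*</title>' |
  grep ': stable' |
  head -n 1 |
  sed -e 's/<title>//' -e 's/:.*//'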

4 Answers

2

You're trying to remove the longest match of the pattern .tar.xz\"> from the beginning of the string, but your string doesn't start with .tar.xz, so there is no match.

You have to use

wget=${wget##*.tar.xz\">}
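}">

To see the difference, here is a small illustration with a made-up one-line stand-in for the string the grep pipeline returns:

s='<a href="https://cdn.kernel.org/pub/linux/kernel/v4.x/linux-4.10.8.tar.xz">4.10.8</a>'

without_star=${s##.tar.xz\">}   # pattern must match at the very start, so nothing is removed
with_star=${s##*.tar.xz\">}     # leading * lets it match, leaving: 4.10.8</a>

echo "$without_star"
echo "$with_star"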

Then, because you're in a script and not an interactive shell, there shouldn't be any need to escape grep as \grep (presumably to prevent an alias from being used), since aliases are not expanded in non-interactive shells.

And, as pointed out in the comments, naming a variable the same as an existing command (a frequent offender: test) is bound to lead to confusion.
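
Putting those points together, an untested sketch of the corrected script might look like this (latest is just a non-clashing variable name, and it assumes kernel.org's markup still matches what the question saw):

#!/bin/bash

# Fetch the page and keep the line after the "latest_link" marker
latest=$(wget --output-document - --quiet www.kernel.org | grep -A 1 "latest_link")

latest=${latest##*.tar.xz\">}   # leading * so everything up to .tar.xz"> is stripped
latest=${latest%</a>}           # drop the trailing closing tag

echo "${latest}"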

If you want to use command line tools designed to deal with HTML, you could have a look at the W3C HTML-XML-utils (Ubuntu: apt install html-xml-utils). Using them, you could get the info you want as follows:

$ curl -sL www.kernel.org | hxselect 'td#latest_link' | hxextract a -
4.10.8

Or, in detail:

curl -sL www.kernel.org |     # Fetch page
hxselect 'td#latest_link' |   # Select td element with ID "latest_link"
hxextract a -                 # Extract link text ("-" for standard input)
Benjamin W.
1

Whenever I need to extract a substring in bash I always see if I can brute force it in a couple of cut(1) commands. In your case, the following appears to work:

wget=$(wget --output-document - --quiet www.kernel.org | \grep -A 1 "latest_link")
echo $wget | cut -d'>' -f3 | cut -d'<' -f1

I'm certain there's a more elegant way, but this has simple syntax that I never forget. Note that it will break if 'wget' gets extra ">" or "<" characters in the future.

0

It is not recommended to use shell tools such as grep, awk, or sed to parse HTML files.

However, if you want a quick one-liner, then this awk should do the job:

wget --output-document - --quiet www.kernel.org |
awk '/"latest_link"/ { getline; n=split($0, a, /[<>]/); print a[n-2] }'

4.10.8
anubhava
0

sed method:

wget --output-document - --quiet www.kernel.org | \
  sed -n '/latest_link/{n;s/^.*">//;s/<.*//p}'

Output:

4.10.8
agc