46

The box has no Ruby/Python/Perl etc.

Only bash, sed, and awk.

One way is to replace characters using a map, but that becomes tedious.

Perhaps there is some built-in functionality I'm not aware of?

miken32
James Evans

6 Answers

67

Escaping HTML really just involves replacing three characters: <, >, and &. For extra points, you can also replace " and '. So, it's not a long sed script:

sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'
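
The order of the substitutions matters: `&` has to be handled first, otherwise the ampersands introduced by the later replacements would themselves be escaped. A minimal sketch that wraps the command above in a function reading from standard input (the name `escape_html` is hypothetical, chosen for illustration):

```shell
#!/bin/bash
# Hypothetical wrapper around the sed command above; reads stdin, writes stdout.
escape_html() {
    # '&' is replaced first so the entities produced by the later rules stay intact.
    sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'
}

# Example: escape a small HTML fragment.
printf '%s\n' '<a href="x">Tom & Jerry</a>' | escape_html
```

For the sample line this should print `&lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&lt;/a&gt;`.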
ruakh
  • 2
    +1 for elegance and efficiency. You should post your answer here: http://stackoverflow.com/questions/5929492/bash-script-to-convert-from-html-entities-to-characters where they recommend installing `recode`, `perl`, `php`, `xmlstarlet` and `w3m` (a web browser for crying out loud). The last answer recommends using Python3 which although installed by default (in Ubuntu at least) is overkill too. – WinEunuuchs2Unix Mar 26 '17 at 23:43
  • 1
    @WinEunuuchs2Unix: Thanks for your kind words! That question is asking about the opposite direction (`<` to ` – ruakh Mar 26 '17 at 23:58
  • @ruakh You're welcome :) Can't your sed search and replace simply be reversed to accomplish the same result as those answers? – WinEunuuchs2Unix Mar 27 '17 at 00:43
  • 1
    @WinEunuuchs2Unix: There are many ways to HTML-escape a given piece of text; for example, `<`, `<`, and `<` are all valid ways to escape ` – ruakh Mar 27 '17 at 01:21
  • Yes. My HTML-unescaping is limited to stack exchange site Ask Ubuntu and so far I've only noticed `&Amp;`, `$lt;` and `"`. The goal is to compare all the scripts on my drive I've published in Ask Ubuntu to see if they have been changed locally or revised by someone else in Ask Ubuntu. For fun I'm also extracting upvotes from the HTML file and putting it in the local file. This is the work in progress from a few nights ago: http://askubuntu.com/questions/894888/bash-template-to-use-zenity-or-yad-to-insert-edit-delete-records-in-a-file/896783#896783 – WinEunuuchs2Unix Mar 27 '17 at 01:31
  • Really useful. Same as a function: `escape_html() { sed $1 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'; }` – geotheory Nov 17 '18 at 10:06
13

You can use the recode utility:

    echo 'He said: "Not sure that - 2<1"' | recode ascii..html

Output:

    He said: &quot;Not sure that - 2&lt;1&quot;
Ivan
  • 1
    Probably not available if there's no Python/Ruby/Perl. – tbodt Nov 19 '16 at 22:48
  • Tested on 30 or so text files containing ASCII, and it even handles the null character `\0`. I use it to sandbox text file contents for the `srcdoc` attribute of a sandboxed `iframe` in HTML, which allows background styling to cascade from the parent frame. – vhs May 06 '20 at 15:20
10

Pure bash, no external programs:

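function htmlEscape () {
    local s
    # Replacements are quoted: bash 5.2+ (patsub_replacement, on by default)
    # expands an unquoted & in the replacement to the matched text.
    s=${1//&/"&amp;"}   # ampersands first, so later entities are not re-escaped
    s=${s//</"&lt;"}
    s=${s//>/"&gt;"}
    s=${s//'"'/"&quot;"}
    printf -- %s "$s"
}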

Just simple string substitution.
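
A sketch of the same idea, extended to escape single quotes as well and to process lines from standard input. The function names `html_escape_line` and `html_escape_stream` are hypothetical. The replacement strings are quoted on the assumption that the script may run under bash 5.2 or later, where an unquoted `&` in the replacement refers to the matched text:

```shell
#!/bin/bash
# Escape one string passed as the first argument.
html_escape_line() {
    local s=$1
    s=${s//&/"&amp;"}     # must come first
    s=${s//</"&lt;"}
    s=${s//>/"&gt;"}
    s=${s//\"/"&quot;"}
    s=${s//\'/"&#39;"}
    printf '%s\n' "$s"
}

# Escape every line read from stdin, including a final line without a newline.
html_escape_stream() {
    local line
    while IFS= read -r line || [ -n "$line" ]; do
        html_escape_line "$line"
    done
}

printf '%s\n' "He said: \"2<1 & it's fine\"" | html_escape_stream
```

Reading line by line in pure bash is slow on large inputs; for big files the sed version is likely the better choice.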

Vladimir Panteleev
miken32
2

Or use xmlstarlet, which can escape and unescape special XML characters:

$ echo '<abc&def>'| xml esc
&lt;abc&amp;def&gt;
schemacs
0

I'm using jq:

$ echo "2 < 4 is 'TRUE'" | jq -Rr @html
2 &lt; 4 is &#39;TRUE&#39;
yegor256
-5

The previous sed replacement defaces valid output like

&lt;

into

&amp;lt;

Adding a negative look-ahead so that "&" is only changed into "&amp;" if that "&" isn't already followed by "amp;" fixes that:

sed 's/&(?!amp;)/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'
Jean-François Fabre
nachtgeist
    Big mistake. When I HTML-encode a string `&`, it is because I want it to be rendered by some web browser as `&`. That is why it **must** be turned into `&amp;`. That way, HTML-encoding and HTML-decoding are in balance. You don't suppress HTML-encoding just because the input _looks like_ it has already been HTML-encoded. HTML-encoding is **not** idempotent. Failure to grasp that, eventually leads to XSS vulnerabilities. – Ruud Helderman Nov 10 '15 at 20:47
  • 1
    @Ruud is right; the right way to accomplish this is to escape ampersands first, like in ruakh's answer. – Brian McCutchon Jan 14 '16 at 21:07
  • 3
    I totally agree with what @Ruud said except that he should have emphasized **failure to grasp that leads to XSS vulnerabilities** – kmkaplan Feb 01 '17 at 08:32
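
The point made in these comments can be checked directly. The sketch below uses two hypothetical helpers, `escape_html` and `unescape_html` (the latter reverses only the five entities produced here, not HTML entities in general), to show that escaping is not idempotent and that text which already looks escaped still has to be escaped to survive a round trip:

```shell
#!/bin/bash
# Escape the five characters; '&' is handled first.
escape_html() {
    sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g; s/"/\&quot;/g; s/'"'"'/\&#39;/g'
}

# Reverse only those five entities; '&amp;' is handled last.
unescape_html() {
    sed 's/&lt;/</g; s/&gt;/>/g; s/&quot;/"/g; s/&#39;/'"'"'/g; s/&amp;/\&/g'
}

input='Write &lt; to show <'
once=$(printf '%s\n' "$input" | escape_html)
twice=$(printf '%s\n' "$once" | escape_html)
back=$(printf '%s\n' "$once" | unescape_html)

printf '%s\n' "$once"    # the literal "&lt;" in the input becomes "&amp;lt;"
printf '%s\n' "$twice"   # escaping again changes the text further
printf '%s\n' "$back"    # one unescape restores the original input
```

Had the `&` in `&lt;` been left alone because it "looked escaped", unescaping would turn it into `<` and the original text would be lost.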