Why was the percent sign chosen as escape character for URLs?

Question

URIs use percent encoding to represent characters which would otherwise be reserved (like the forward slash - %2F), not always displayable or recognizable (Unicode characters, e.g. non-Latin letters) or otherwise inconvenient (like the space character - %20).
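As a rough illustration of the three cases above, here is a minimal sketch using Python's standard `urllib.parse` module. The sample letter `ж` and the variable names are just my picks for illustration, and it assumes non-ASCII text is encoded as UTF-8 before escaping (which is what `quote` does by default).

```python
from urllib.parse import quote, unquote

# safe="" makes quote() escape "/" as well; by default it leaves "/" alone.
encoded_slash = quote("/", safe="")   # reserved character
encoded_space = quote(" ")            # inconvenient character
encoded_letter = quote("ж")           # non-Latin letter, UTF-8 bytes D0 B6

print(encoded_slash)    # %2F
print(encoded_space)    # %20
print(encoded_letter)   # %D0%B6

# Decoding reverses the escaping.
decoded = unquote("a%2Fb%20c")
print(decoded)          # a/b c
```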

RFC 1630 says

The choice of escape character for introducing representations of non-allowed characters also tends to be a matter of taste. An ANSI standard exists in the C language, using the back-slash character "\". The use of this character on unix command lines, however, can be a problem as it is interpreted by many shell programs, and would have itself to be escaped. It is also a character which is not available on certain keyboards. The equals sign is commonly used in the encoding of names having attribute=value pairs. The percent sign was eventually chosen as a suitable escape character.

Is there any reason that the percent sign was chosen, rather than, say, $, ^ or *, all of which (AFAIK) don't have a special function in URIs?

Guess for the reason: $ and * are also interpreted in many shell programs, and ^ may not have been available on some keyboards. — dirkt, Dec 24 '23 at 16:05
Even 1992 documentation presents the encoding without giving a rationale — I suspect the best approach to find the actual answer is to ask Tim Berners-Lee directly (or find an interview where the question is asked and answered). — Stephen Kitt, Dec 24 '23 at 16:40

Raffzahn · Accepted Answer · 2023-12-24T16:52:05.493

14

Looks to me as if the cited paragraph already answered that perfectly.

It must be a character available on as many keyboards as possible
It should not be one that already has to be escaped on the authors' own development system
It should not be in otherwise common use
What remains is a matter of taste.

Rule #1 eliminates all characters that are marked in ISO 646 (IA5) for national use or as potential modifiers:

(Table of ISO 646 character positions marked for national use, taken from Wikipedia)

Every character marked in the above table may thus not be used, which already includes the mentioned $ and ^.

Rule #2 eliminates \, at least for Unix-like systems. The same goes for <, > and |. If this had been developed on mainframes, it might have been other characters like : and . instead.

Rule #3 eliminates other characters like = and most punctuation (. and : and ,), which are commonly used as separators or in file names.

And then there is Rule #4: Taste. No one can argue with taste, and taste neither needs nor has any reasoning. Taste will be at best self-referencing (*1).

Bottom line:

Of the characters remaining after rules 1, 2 and 3, percent was simply the one they liked most.

*1 - I like it, because it's beautiful - Why is it beautiful? 'Cause I like it ...
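The elimination by rules 1 to 3 can be sketched as plain set subtraction. This is only an illustration, not a reconstruction of how the choice was actually made. The exact sets are my assumptions: the national-use positions follow ISO 646, the "modifier" set is my guess at characters usable as overstrike diacritics, and I added ; and / to rule 3 because they are reserved in URIs.

```python
import string

# Rule 1: ISO 646 national-use positions, plus characters that can
# serve as overstrike diacritics (the second set is an assumption).
national_use = set("#$@[\\]^`{|}~")
modifiers = set("\"',_")

# Rule 2: characters named in the answer as special to Unix shells.
shell_special = set("\\<>|")

# Rule 3: separators and file-name punctuation (";" and "/" assumed).
common_use = set("=.:,;/")

candidates = set(string.punctuation)
remaining = candidates - national_use - modifiers - shell_special - common_use

survivors = "".join(sorted(remaining))
print(survivors)  # !%&()*+-?
```

Under these assumptions % is one of nine survivors, so the final pick still comes down to rule 4.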

edited Dec 24 '23 at 16:52

answered Dec 24 '23 at 16:35

Raffzahn



The remaining non-alphanumeric characters after the listed 1/2/3 are (I think): !%&()*+-? Of those, () are not good because they are a pair - why use one by itself. ? as the start of query parameters is extremely logical. & as additional parameter separator (this parameter and the other parameter) is extremely logical, though arguably + would work there as well. So that really just leaves !%*+- - not a whole lot of choices, and I would argue that + and - are not great choices because of their usual mathematical/logical meaning, so that just leaves !%* – manassehkatz-Moving 2 Codidact Dec 24 '23 at 17:16

If rule 2 eliminates <, > and | as shell syntax, it should also eliminate & and ;, I think. – Toby Speight Dec 24 '23 at 17:52
@TobySpeight Yes, but no. While 1630 (URI) precedes 1738 (URL), both by Berners-Lee and of 1994, they diverge in usage. URL, as used for HTTP, only reserves /, ? and ; (!). Usage of & to join CGI parameters was added independently of and prior to both (1993) by NCSA to allow interactive forms. It wasn't until 4 years later (1997) that it was formally codified as 3875. – Raffzahn Dec 24 '23 at 18:15

Yes, I know & and ; were later used in CGI - I meant that they require quoting in shell language, like < and >. Actually, & is also a nuisance for HTML. – Toby Speight Dec 24 '23 at 18:49

Rule 4 is king here. % looks very aesthetically reasonable (at least to me?). I can’t imagine anybody choosing (i.e.) , except on the direst of circumstances. – Euro Micelli Dec 24 '23 at 19:09
@TobySpeight it's rather prior. NCSA CGI used & already before RFC 1630 was written. And that usage was made public. HTML inherited & from SGML. – Raffzahn Dec 24 '23 at 19:16
I don't know when they got added, but note that #, ?, and & are all in use elsewhere in the URI scheme. – trlkly Dec 25 '23 at 15:39
@trlkly URI does not reserve &, only = | ; | / | # | ? | : | space as noted with the BNF on page 23 (in addition, it discourages the use of ! and *). – Raffzahn Dec 25 '23 at 16:18
@Raffzahn Interesting. I don't know what is reserved, but I do know that & is used to separate query terms (e.g. http://www.example.com/?search=cats&type=pics). And those query terms are a common place where escape codes need to be used, since they often contain user input. – trlkly Dec 25 '23 at 17:36

@trlkly Yes, but those are not part of the URI specification; they belong to URL as defined in the later RFC 1738, which adds @ and & as reserved. – Raffzahn Dec 25 '23 at 18:13
Does SGML have anything useful to say about this? After all, the idea of HTML was that links should be embedded in a document, which at a fairly elementary level implies some form of quoting and escaping. – Mark Morgan Lloyd Dec 26 '23 at 07:29
@MarkMorganLloyd Not really, as SGML is about content formatting, not identification (URI/URL). – Raffzahn Dec 26 '23 at 13:23
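To illustrate the point raised in the comments about & separating query terms that may contain user input, here is a small sketch with Python's standard `urllib.parse`. The search text "cats & dogs" is made up for the example. Passing `quote_via=quote` is my choice so that spaces come out as %20 rather than the + that `urlencode` uses by default.

```python
from urllib.parse import urlencode, parse_qs, quote

# Hypothetical user input containing the separator character itself.
params = {"search": "cats & dogs", "type": "pics"}

# The literal "&" inside the value becomes %26, so it cannot be
# confused with the "&" that joins the two query terms.
query = urlencode(params, quote_via=quote)
url = "http://www.example.com/?" + query
print(url)  # http://www.example.com/?search=cats%20%26%20dogs&type=pics

# Parsing splits on the unescaped "&" first and unescapes afterwards.
decoded = parse_qs(query)
print(decoded["search"][0])  # cats & dogs
```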
