22

Unicode codepoints U+2596–U+259F can help you draw primitive graphics by offering all the combinations of on or off for the four quadrants of a glyph. They're available in this order:

▖  ▗  ▘  ▙  ▚  ▛  ▜  ▝  ▞  ▟
00 00 10 10 10 11 11 01 01 01
10 01 00 11 01 10 01 00 10 11

(Each column gives a glyph's top two quadrants on the first row of bits and its bottom two on the second, with 1 meaning "on".)

Actually, that's only ten of the sixteen you need for the full set. The remainder are scattered a bit:

0020 2580 2584 2588 258C 2590
     ▀    ▄    █    ▌    ▐
00   11   00   11   10   01
00   00   11   11   10   01

...but I think I more or less understand why they're scattered. Space is a leftover from ASCII, and the others sit in very pretty patterns in codepoint-space. The lower half block sits in a sequence of seven increasingly-tall blocks (terminating with the full block), and the left half block sits in a sequence of seven decreasingly-wide blocks (beginning with the full block), while the upper half and right half hang out at the respective ends of those sequences.
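To make the scattering concrete, here is a small Python sketch of the full mapping. (The bitmask convention, UL as the high bit down to LR as the low bit, is mine; nothing in the standard assigns bits to quadrants.)

    # All 16 on/off patterns for the four quadrants, keyed by a 4-bit mask
    # (UL UR LL LR, my own convention). Only ten land in U+2596..U+259F;
    # the other six are the scattered ones discussed above.
    BLOCKS = {
        0b0000: 0x0020,  # space
        0b0001: 0x2597,  # ▗
        0b0010: 0x2596,  # ▖
        0b0011: 0x2584,  # ▄
        0b0100: 0x259D,  # ▝
        0b0101: 0x2590,  # ▐
        0b0110: 0x259E,  # ▞
        0b0111: 0x259F,  # ▟
        0b1000: 0x2598,  # ▘
        0b1001: 0x259A,  # ▚
        0b1010: 0x258C,  # ▌
        0b1011: 0x2599,  # ▙
        0b1100: 0x2580,  # ▀
        0b1101: 0x259C,  # ▜
        0b1110: 0x259B,  # ▛
        0b1111: 0x2588,  # █
    }
    for mask, cp in sorted(BLOCKS.items()):
        print(f"{mask:04b} U+{cp:04X} {chr(cp)}")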

But how was the order of the first ten mentioned in this question chosen? Some hypotheses I pondered but rejected:

  1. The first three all have just one of the four quadrants "on"... but the fourth one does not follow this pattern; the final single-quadrant codepoint lies near the end of the sequence. So the ordering doesn't group them by quadrant count.

  2. If you think of the bottom left quadrant as the least significant bit, followed by the bottom right, top left, and top right, then the first three could be understood as simply "counting up" (skipping over the blocks that are scattered around the rest of codepoint space). But the fourth element of that counting pattern actually appears in position five instead.

  3. If you look at just the top left and top right quadrants, it looks like a nice clean Gray code. But the bottom two bits don't appear (to me) to continue that pattern. For example, I would expect 00/01 to transition through 10/01 to 10/00, but 10/01 appears later in the sequence. (A quick numeric check of this appears after this list.)

  4. There are ten of them, which factors to 2×5; perhaps laying them out that way reveals something?

     ▖ ▗ ▘ ▙ ▚
     ▛ ▜ ▝ ▞ ▟

     ▖ ▗
     ▘ ▙
     ▚ ▛
     ▜ ▝
     ▞ ▟

    I don't see anything obvious.
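Here's the quick check promised in hypothesis 3: a minimal Python sketch (using the same homemade bit convention as in the sketch above) showing that, between neighbouring glyphs in the sequence, the top two bits never flip more than one bit at a time, while the full four-bit patterns do.

    # Encode ▖▗▘▙▚▛▜▝▞▟ as 4-bit masks (UL UR LL LR, my own convention).
    order = [0b0010, 0b0001, 0b1000, 0b1011, 0b1001,
             0b1110, 0b1101, 0b0100, 0b0110, 0b0111]

    def max_bit_flips(seq):
        # Largest number of bits that change between consecutive entries;
        # a Gray-code-like sequence would give 1.
        return max(bin(a ^ b).count("1") for a, b in zip(seq, seq[1:]))

    print(max_bit_flips([m >> 2 for m in order]))  # 1: top halves stay Gray-ish
    print(max_bit_flips(order))                    # 3: the full patterns don't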

So what was the process used to decide their order?

user3840170
Daniel Wagner
  • Have I really lived long enough that a Unicode question can be considered "retro"? – Mark Ransom Apr 28 '23 at 02:22
  • @MarkRansom the beginning of Unicode is closer in time to the PDP-1 than it is to us – Nick Matteo Apr 28 '23 at 02:49
  • @MarkRansom Well, I don't really think this qualifies as retro. But I also think that this site isn't limited only to retro things; in particular the remit according to the help center includes "computing history and persons with a historic relation to computing" -- and I think this qualifies as that. History goes right up to a few minutes ago, so feel free to keep that youthful feeling alive! – Daniel Wagner Apr 28 '23 at 05:19

2 Answers

30

U+2596–U+259F were added in Unicode 3.2.0 (see the relevant delta code chart), as a result of this proposal. None of the available documents explain the sequence of code points chosen; worse than that, the proposal explains where the characters come from, but the code points don’t follow the order used in the source character sets. However, the character names suggest an order:

  • Quadrant LL
  • Quadrant LR
  • Quadrant UL
  • Quadrant UL and LL and LR
  • Quadrant UL and LR
  • Quadrant UL and UR and LL
  • Quadrant UL and UR and LR
  • Quadrant UR
  • Quadrant UR and LL
  • Quadrant UR and LL and LR

It’s possible that each character was described by listing its “set” quadrants from top-left to bottom-right, using acronyms shared throughout the proposal (“LL” etc.); and then the overall list was sorted alphabetically. The result is inconsistent (UL to LR inside individual character names, LL to UR in the list)…
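The hypothesis is easy to test against the published character names (which spell the quadrants out, rather than using the proposal's acronyms). A minimal Python sketch, relying on the standard unicodedata module:

    # Do the official names of U+2596..U+259F, sorted alphabetically,
    # reproduce exactly the code point order?  (They do.)
    import unicodedata

    codepoints = list(range(0x2596, 0x25A0))
    names = [unicodedata.name(chr(cp)) for cp in codepoints]
    print(names == sorted(names))  # True
    for cp, name in zip(codepoints, names):
        print(f"U+{cp:04X} {chr(cp)} {name}")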

(Intriguingly, the 3.2 technical report doesn’t mention these characters at all!)

Stephen Kitt
7

To start with, it's important to keep in mind that Unicode neither intends nor provides any meaningful ordering of characters, beyond the guarantee that no two characters share the same code point. Any addition is thus essentially arbitrary; the code chart pages merely help in picking what space to use next, to keep the standard manageable.

The remainder are scattered a bit [...] but I think I more or less understand why they're scattered. Space is a leftover from ASCII, and the others sit in very pretty patterns in codepoint-space.

The main reason for this is that the 'Terminal graphic characters', as those 10 are called, were added only after the 'pretty' ones. In Unicode 3.1 the code chart page at U+2580 (Block Elements) had exactly 10 code points left, which worked out perfectly for inserting them with 3.2, as seen in this PDF.

For the historical sequence:

The addition was originally proposed by Frank da Cruz (*1) for the Kermit Project at Columbia University in 1998, as part of an ongoing effort to integrate various terminal control, special, and graphics characters. The various quarter squares were needed to incorporate the following character sets:

  • Heath/Zenith 19 Graphics Character Set
  • Wyse Graphics 3 Character Set
  • Televideo 965 Multinational Character Set

Interestingly, the UR/LL variant was not to be found in any of those character sets; it was added for completeness.

Timeline for inclusion:

  • The first draft is dated September 30, 1998.
  • The original proposal dates from November 10, 1998 and suggested provisional usage at E0DB..E0E4.
  • It was intended to be part of Unicode 3.0.
  • By March 31, 2000 the proposal had been revised to reflect changes made in Unicode 3.0 (two characters were already included), as well as to add some more characters to accommodate certain IBM mainframe character sets.
  • In May 2000 the working group moved to accept the Terminal Graphics characters.
  • In September 2000 ISO WG2 accepted them.
  • In March 2002 they were officially published with Unicode 3.2, now occupying U+2596..U+259F.

As for the 'why that sequence' question:

One can only guess at the exact reason, as none of the referenced character sets had the exact same order, or even a logical order at all, except the Wyse WCS3. It features most of the quadrant symbols (13) and orders them in a somewhat circular fashion:

WCS3: |▝|▖|▗|▘|▐|▄|▌|▀|▙|▛|▜|▟|█|

Comparing those with Unicode 3.2 reveals that all WCS3 characters not already assigned otherwise show up in exactly the same (non-contiguous) sequence:

WCS3: |▝|▖|▗|▘|▐|▄|▌|▀|▙|▛|▜|▟|█|
      | | | | |*|*|*|*| | | | | |*| (Assigned otherwise)
UCS:  | |▖|▗|▘| | | | |▙|▛|▜|▟| |

With the sole exception of the upper right quadrant (▝), that is. It and the missing 'diagonal' ones (▚, ▞) seem to have been interspersed at will.
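This comparison is easy to verify mechanically; here's a minimal Python sketch of it (mine, not from any of the referenced documents):

    # Drop the WCS3 glyphs Unicode had already assigned elsewhere, then
    # check the relative order of the rest within U+2596..U+259F.
    wcs3 = "▝▖▗▘▐▄▌▀▙▛▜▟█"
    already_assigned = set("▐▄▌▀█")
    ucs = [chr(cp) for cp in range(0x2596, 0x25A0)]

    remaining = [c for c in wcs3 if c not in already_assigned]
    positions = [ucs.index(c) for c in remaining]
    print(remaining)  # ['▝', '▖', '▗', '▘', '▙', '▛', '▜', '▟']
    print(positions)  # [7, 0, 1, 2, 3, 5, 6, 9]
    # Everything after the out-of-place ▝ keeps its relative order:
    print(positions[1:] == sorted(positions[1:]))  # True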

Then again, Stephen's remark about them being simply alphabetically sorted by name does make a very good point.


*1 - How hard Frank da Cruz fought, and what absurd arguments he had to counter, is in part documented in the mail communication about their inclusion:

I for my part do NOT!!!! want to see these terminal graphic things in the BMP. They belong in Plane 1.

Perhaps, but as the lawyers say, the door was opened by the characters already included in blocks at U+2400, U+2500, U+2600, and U+2700. In any case, the intention here is to help Unicode become somewhat more "technology-neutral". Terminal emulation is a fact of life, and important to a significant number of serious and productive computer users; why should its special glyphs be excluded from the same status enjoyed by dingbats and astrological signs? Seriously, I think terminal emulation is far more mainstream than many Unicoders seem to think, and I hope it is a worthy goal to welcome this constituency into the fold, thus allowing them to continue their work in their accustomed manner, rather than according to the dictates of haute couture, with the added bonus of uniform access to the world's writing systems.

I think he makes a pretty solid point. Terminal characters are foundational among the character sets to be unified.

Toby Speight
Raffzahn
  • It was the thin end of a wedge that eventually led to the pile of poo that Unicode is famous for today. Somewhere along the way they lost all sense of what a "character" is. – Michael Kay Apr 26 '23 at 14:02
  • Great historical context, thank you! – Daniel Wagner Apr 26 '23 at 14:59
  • @MichaelKay It really should have been encoded as [REGIONAL INDICATOR P][REGIONAL INDICATOR O][REGIONAL INDICATOR O] – user253751 Apr 26 '23 at 15:51
  • Somewhere along the way they lost all sense of what a "character" is. It takes character to say "no" to ridiculous extensions. – dave Apr 26 '23 at 16:52
  • @MichaelKay Somewhere along the way? Unicode has been about being able to round-trip legacy character sets since day 1. That's why it has precomposed versions of extended Latin characters like é. – ssokolow Apr 26 '23 at 19:54
  • @MichaelKay Unicode was developed so all characters in the world had a set number. Emojis (including U+1F4A9) actually predate Unicode. This meant Unicode had to either decide that some characters did not deserve a number, or assign a number to them. – Catprog Apr 27 '23 at 06:19
  • Or they could have decided that some of the things that had been assigned a number in other coding schemes were not characters and were therefore out of scope. But they didn't, and the pile of poo is the inevitable consequence. – Michael Kay Apr 27 '23 at 13:39
  • @MichaelKay The quoted e-mail suggests the metaphorical wedge had been thoroughly inserted by the time of this proposal, since it mentions "the same status enjoyed by dingbats and astrological signs". In fact, version 1.0 of Unicode featured blocks for "Miscellaneous Dingbats" (now "Miscellaneous Symbols") and "Zapf Dingbats" (now "Dingbats"). There doesn't seem to have ever been an intention or will to define "character" other than "thing that people commonly represent using some form of text encoding". – IMSoP Apr 27 '23 at 15:29
  • @MichaelKay I don't like emojis, but a lot of people do use them in written text (admittedly, usually less formal text). Also, there was incompatibility between Google/Apple/Samsung (I think) emojis; I think Unicode was the right place to standardize. – Martin Bonner supports Monica Apr 27 '23 at 18:03
  • @MichaelKay: I don't mind the presence of obscure novelty symbols nearly so much as the ways that Unicode handles composite characters and text direction, but views things like small caps as a "formatting" issue outside its scope. – supercat Apr 27 '23 at 22:15
  • I think the word in the first paragraph should be "arbitrary" instead of "random". – Paŭlo Ebermann Apr 27 '23 at 23:43
  • @MichaelKay If it's in a character set, then it's a character. Or better: if there is a CODEPOINT that could be used, it needs to be included; otherwise Unicode cannot fulfil its main purpose of creating a single reference that any code can be transferred to (and from). It's neither about content nor about what that codepoint stands for. That's why not only accented characters have been included (as ssokolow mentions) but also different strokes for the same letter. The only point that could be made is whether it's printable. Thus dingbats are characters, as they have glyphs. – Raffzahn Apr 28 '23 at 11:00
  • @supercat I think people in CJK locales have even more reason to be upset at them for Han Unification. Given how visibly the Simplified Chinese, Traditional Chinese, and Japanese versions of a glyph can differ, they're effectively saying that, if you want to use English and Greek in the same document, you'd better use a file format that has an equivalent to <font ...></font> tags, except shoved over to non-Europeans. While not quite as bad as 8-bit code pages, it has a similar effect to when "P" and "Π" (uppercase pi) or "R" and "Ρ" (uppercase rho) have the same codepoint. – ssokolow Apr 29 '23 at 01:43
  • See Radical 169 (门 in your current font) for an example of how much the three can diverge, visually. – ssokolow Apr 29 '23 at 01:47
  • @ssokolow: Having to use something analogous to font tags might not have been unreasonable if they were included within the Unicode standard, but some Unicode strings can be cut and spliced in context-independent manner while others can't, and trying to figure out how to do even the most basic layout of a Unicode text string requires substantial application-level understanding of many complicated rules. – supercat Apr 29 '23 at 15:44