21

In the source code of the 1972 Pascal compiler (a very large OCR-ed PDF), there are declarations of variables and record fields of type ALFA, which are "packed arrays" of 10 characters. Characters were 6 bit wide on the CDC 6600, and 10 of them would fit in a 60-bit word, implementing a form of small string optimization: they could be passed by value in registers, hashed and compared efficiently, etc., which conferred a substantial speedup on computers with word-oriented architectures, and a similar technique could be beneficial even today.

The 10 character array type named ALFA survives even today, although without the benefit of the optimization.

In the high-level programming language context, a data type is defined by its representation and the set of allowed operations. This makes the ALFA type distinct, and not a mere shortcut for a small array of char, as its set of allowed operations is wider than the set of allowed operations for any other array type: for example, Pascal ALFA values can be compared for equality and ordering (page 17 of the PDF).

The novelty here is adding to a programming language a data type which combines characteristics of an array (individual characters of a variable of the ALFA type could be accessed by index if an implementation fully supported packed arrays), and of a scalar: variables of type ALFA could be passed as parameters to subroutines and returned from subroutines, whereas passing arrays by value or returning them was not allowed, and, at least in some implementations, they could be used in case statement expressions.

Contemporary implementations of small-string optimization hide the actual "small string" object inside a variant class. As Pascal had variant records, it was fully equipped.

Was Niklaus Wirth the first to invent the "small string" type, in the sense of making it a full-fledged predefined type in a programming language rather than an ad-hoc programming trick, or a pre-existing programming language had a similar data type?

Leo B.
  • 19,082
  • 5
  • 49
  • 141
  • 3
    Not sure what the question is targeting here, as holding characters in a machine word comes natural already way before there is any PASCAL or other compiler. 6 characters for a 36 bit 7090, 8 in a 48 bit CDC 1604, 10 in a 60 bit CDC 6600 or 4 (8 bit) in a 32 bit IBM /360 (The main reason why most identifiers in the IBM world are 4 or 8 characters in length). Similar with smaller systems like 3 chars in 18 bit PDP. In fact, doing single char operations (or variable length) was rather cumbersome in many of these early systems. – Raffzahn Jan 23 '22 at 23:04
  • 3
    Indeed. TOPS-10: filenames one machine word, command names 1 machine word, which was 6 chars.. RSX-11D/M filenames 2 machine words (3 chars was a bit limiting even on a mini) + 1 machine word extension. VAX/VMS. command names significant to 1 machine word, 4 chars. – dave Jan 23 '22 at 23:27
  • 1
    @Raffzahn Two phrases highlighted in bold should provide a clear answer what the question is targeting. – Leo B. Jan 24 '22 at 01:31
  • @LeoB. Then I may be to stupid to understand what you're asking about. The 60 bit type is not a language invention, but a machine specific data type that came with the CDC 6600. It's nothing abstract or new, but the natural way to store strings on a CDC 6600. Wirth didn't invent anything here, but enabled his Pascal implementation to use explicit this data type for strings storage – Raffzahn Jan 24 '22 at 01:43
  • @Raffzahn If it took mode than a decade of existence of programming languages to enable a programming language implementation to make explicit the natural way to store short strings in a word-oriented computer, then it was not that obvious, and some kind of mental leap was involved. – Leo B. Jan 24 '22 at 02:01
  • @LeoB. Ido not understand your last comment. CDC 6600 programs already stored strings of up to 10 characters into a single word, before there was a Pascal compiler. CDC Display code was already used with 1604 and 3000 series machines. They all were word addressable machines. Any kind of character handling was done by shifting and extracting or inserting. It was natural for them to handle characters that way, no matter what language. All programming language, no matter if Assembler, Fortran or COBOL on CDC used this. Adding ALFA to Pascal was natural to allow to access word stored strings. – Raffzahn Jan 24 '22 at 02:54
  • 5
    Small string optimization would be to not have a separate string type for small strings, but let the compiler optimize for small strings that fit in some minimal required storage. Even standard c++ string libraries do that nowadays, which must mean that the “trick” is very old indeed. ;-) – WimC Jan 24 '22 at 06:15
  • @WimC 50+ years ago it was not a fully automatic optimization, but merely compiler-assisted. – Leo B. Jan 24 '22 at 09:44
  • 1
    This is the same as FORTRAN's Hollerith constants, which were in use before the 1966 ANSI standard. – scruss Jan 24 '22 at 12:10
  • @WimC yes standard C++ libraries do that now but that is a 21st century thing - earlier c++ used Copy on Write and shared pointers - so that does not show its age. Also that small string optimization is not the same as described in the question – mmmmmm Jan 24 '22 at 12:41
  • 1
    Not packing characters you knew belonged together would waste a lot of space. – Thorbjørn Ravn Andersen Jan 24 '22 at 14:11
  • @mmmmmm Exactly. I meant that the title of the question can be misleading in that sense, depending on how you interpret SSO. – WimC Jan 24 '22 at 15:40
  • @scruss In Fortran, there was no specialized type dealing with these constants. – Leo B. Jan 24 '22 at 16:49
  • 1
    I think this answer is confusing packing of characters (essential on a word machine) with word-sized strings as a specific data type in a language. Pretty much any string support on a word machine used packing - e.g. Wirth's suggestion for Algol 60 in AB 17.3.2. – dave Jan 25 '22 at 01:33
  • @another-dave It's not about packing per se, it's about special-casing based on size of the packed array. – Leo B. Jan 25 '22 at 01:47
  • Ah, I misplaced my comment - it was supposed to be commentary on Raffzahn's answer. I understand the question itself to have the meaning you said – dave Jan 25 '22 at 02:01
  • Does B count as an example? It had multi-character literals stored directly in variables, so it definitely supported a programming style based on short strings, except it had no type discipline to enforce they are consistently used as such. Because if yes, then I think the question becomes somewhat slippery. – user3840170 Jan 26 '22 at 11:07
  • @user3840170 No, the question is specifically about the type. Storing Hollerith constants in REAL or INTEGER variables was possible back in Fortran, but manipulating them would be an ad-hoc trick. another-dave acknowledges below that it doesn't count. I've found a mention that there was an ALPHA type in Burroughs ESPOL, but it likely meant CHAR. – Leo B. Jan 26 '22 at 16:19
  • @LeoB. - FORTRAN wasn't big on creating types where it didn't need to. While it wasn't much fun to sling these packed strings around, you could manipulate them fairly well with the plain FORTRAN-66: see ch.2 of Fortran techniques with special reference to non-numerical applications – scruss Jan 30 '22 at 16:50
  • @scruss The format (80A1) means that the string becomes unpacked. – Leo B. Jan 30 '22 at 23:21

2 Answers2

17

TL;DR:

ALFA is a predefined machine specific type (like all Pascal types), that happens to be 10 on CDC 6600, but may have any other value on other machines. ALFA has been added to allow machine specific character handling - the same way the keyword packed was added for the 6600 for this purpose. Without this storage would not only be rather wasteful, but also any interfacing to OS or I/O be rather painful.

Wirth notes in section 6.1.1 Standard Scalar Types of his 1971 The Programming Language Pascal:

alfa the values are sequences of n characters, where n is an implementation dependent parameter. [...] Alfa quantities may be regarded as a packed representation of short character arrays.

(Emphasis added)

Section 14 Pascal 6000 (p.48) describes implementation specific amendments for the CDC 6000 series. Among implementation specific definition of integer, real and char, ALFA is noted as

the number of character packed into an alfa value is n=10


The Long Read:

There seems to be a tiny misalignment between the understanding of word oriented architecture vs. word addressing, as well as how a language is derived or that Pascal is a machine independent language.

Pascal is not machine independent

Pascal is by definition machine/implementation dependent. To cite section 6.1.1 and section 14 of The Programming Language Pascal:

  • integer
    • 6.1.1 : the values are integers within a range depending on the particular implementation.
    • 14 : is defined as type integer is -2^48+1 .. 2^48-1.
  • real
    • 6.1.1 : the values are a subset of real numbers depending on the particular implementation.
    • 14 : is defined according to the CDC 6000 floating point format.
  • char
    • 6.1.1 : the values are a set of characters depending on the particular implementation.
    • 14 : is defined by the CDC 6000 display code character set.

Section 10.1.3 defines in addition pack and unpack as standard procedures for character transfer into and from ALPHA arrays. Both are of course implementation specific with implied machine specific length values.

Pascal is not a greenfield development

It's always helpful to keep in mind that languages are usually not a greenfield development, but meant to run on real world machines with real world usage. This is especially true, as Wirth explicitly wanted to create an every day language suitable for engineering as well, all the way down to OS development. Supporting machine specific details (like packing) is essential.

Word architecture vs. Word addressing

Today's machines are usually word oriented but byte addressing. That means they work great with words, but can access each byte directly. A feature essentially made canon by the IBM /360 (*1). A machine with 32 bit word architecture, but memory addressing in bytes. Here character strings can be accessed by direct addressing.

The CDC 6600 is not only word oriented, but also word addressing (*2). This means the smallest addressable memory element is a (60 bit) word. To access smaller units, like characters, they have to be (un)packed.

No programmer of a 6600 would store a single character per word, maybe except for intermediate states. Strings were always stored as characters 10 per word.

Character handling was (usually) based on CDC Display Code, a custom 6 bit code, which was already used by the predecessor 48 bit machines (CDC 1604 and 3000 series with 8 characters per memory word. There is no way to address a single byte (or character). To access a single character it must be extracted by shifting (LXi/AXi) and masking (MXi). See Section 3.14 Character Manipulation on page 93..101 of Assembly Language Programming for the Control Data 6000 and Cyber series.

Other than the question implied, using 'packed' character arrays isn't an 'optimization', but the natural way for CDC machines. The idea to use a word per character (resulting in 90% wasted storage) would have been absolute alien to any programmer back then - much like it still is today. All CDC programming languages, Assembler, Fortran or COBOL used this, as they did so already before the 6600.

Adding ALFA as a dedicated type to Pascal as well came natural - same way as in CDC COBOL on 6600 or prior machines.

The concept of character, like offered by Pascal, is not native to this machine type. Character handling thus is ugly and comes with lots of inelegant code. Alfa offers a machine specific extension to a simplify handling of the basic character storage unit and allows easy interfacing to other components as well as I/O.

For practical use and the CDC 6000 ALFA is a shortcut for array [1..10] of char, as that's the most common usage of a character array on a CDC 6000. The very same way as STRING is a shortcut for array [1..80] of char. Quite handy when processing punch cards, isn't it?

Having ALFA is simply the way to define a machine specific storage solution in an independent way and postponing relevant decision to the implementation. While this makes program using ALFA not-portable, it does encapsulate everything around machine specific arrays in a way easy detected. There is in fact a nice article by Tannenbaum comparing Pascal to Algol 68 where he points out that pascal tries to support common handling for such formats.

Finally having ALFA in later implementations stand as well for 10 characters can be seen as compatibility service to ease porting older sources to newer machines. Same way as 'packing' is supported by modern compilers a straight copy operations.

Bottom Line: There is no dedicated 'invention', but acknowledging the architecture on source level, making programs fit and easy to read.


*1 - Others did as well, but it was the /360 that established today's view of bytes and words.

*2 - Notable here, the 6600 not only came before the /360 took off, but it's way of character handling was even more 'ancient' as it was inherited from it's predecessors.

Also, let's be honest the CDC was never made for business application - heck, many even employed other computers for the 'dirty' part :))

Jean-François Fabre
  • 10,805
  • 1
  • 35
  • 62
Raffzahn
  • 222,541
  • 22
  • 631
  • 918
  • 1
    Mentioning Assembler in a response to a question about types does not make sense. If CDC Fortran had a similar type, how did it look? Could values of that type be compared with .LT.? – Leo B. Jan 24 '22 at 09:36
  • 1
    @LeoB. All types are machine types and assembler does handle them like any other language. – Raffzahn Jan 24 '22 at 14:48
  • 5
    In context, the meaning of "type" should be understood within the type system of a programming language. Your answer beats around the bush, not answering the question as asked, but simply dumping everything you know related to familiar keywords in the question. – Leo B. Jan 24 '22 at 16:47
  • 1
    @LeoB. Pascals type system is simply NOT machine independent, but reflects an architecture assumed with a specific implementation, in this case CDC 6000. Of course you may insist on a theoretical view, but then it's made up in hindsight, not reflecting the way the language was created. – Raffzahn Jan 24 '22 at 18:33
  • You're right, as it is said in the standard that the size of the ALFA type is machine-dependent, whatever number of characters fits in a machine word. In that regard I'm somewhat surprised that on the BESM-6 where was only one ALFA type for 6 8-bit chars, but not the other (8 6-bit chars), as both were in use. But that's beside the point, my question is indeed about the theoretical view of type system in programming languages, not about implementations. – Leo B. Jan 24 '22 at 18:38
  • 1
    It is possible that programming languages before Pascal were either "theoretical" like LISP or ALGOL(s), not bothering to define anything machine-dependent, or too low-level and too weakly-typed, which made adding a specific type for small strings look unnecessary. Then Pascal could be the first combining the sufficiently strict type system and practical machine-dependent considerations to justify adding that type; and I'm asking if it was indeed the first. – Leo B. Jan 24 '22 at 18:46
  • @LeoB. But that's the whole point her. It was Wirth's main intention to create a practical language that can be used in every day cases and all the way to system programming. Algol was, as language designed to be machine independent - too much for what he had in mind. Thus Pascal is cut back from Algol to be more simple as well as able to get down to machine specific detail - like packaging smaller units into a word. There's a nice article by Tannenbaum about Pascal vs. Algol ... somewhere :) – Raffzahn Jan 24 '22 at 18:47
  • @LeoB. that's at best splitting hairs. After all, ALFA isn't really a type on it's own, but a shortcut (subtype) that may fit a machine word - or not. BESM-6 Pascal simply operates (as intended) along these lines, while later Pascals opted for turning it into a compatibility layer. Asking for some higher insight means redefining purpose in light of an all authoritative type system and hindsight, when in reality Pascal was foremost intended to be a usable language on real world hardware. – Raffzahn Jan 24 '22 at 18:53
  • The Pascal manual says that it is a scalar type on its own. In the quote from the manual you specifically trimmed the title of paragraph 6.1.1 "Standard scalar types", that's disingenuous. There is no point arguing anymore. I'm asking a simple question, was there a type with similar semantics in a pre-existing language, or not. – Leo B. Jan 24 '22 at 19:42
  • @LeoB. Please point out exactly where I have trimed the title? Why are you trying to make up false arguments? All the way to personal defamation? What style is that? Is it that hard to accept that the answer to your question is simply not what you expected? Grow up and learn about development of inserting intention in hindsight. Also, you're not asking a'simple question' but try to force a certain answer, ignoring facts. – Raffzahn Jan 24 '22 at 20:19
  • In the "long read", the title is omitted and the ALFA type is not listed. And you continue stating that ALFA isn't really a type on it's own which is clearly false. – Leo B. Jan 24 '22 at 20:30
  • 2
    @LeoB. ??? Wow, that's a real hard pull. The full title is mentioned before, and as usual not repeated. Likewise the apposite definition ALFA is not repeated - as both has been in detail (including description what these sections are about) have already been given. Serious, reading an answer only half ways just to construct some claim is so cheap , I had no idea, one could get that low with constructed claims, just to avoid accepting that you're looking for something that isn't there? – Raffzahn Jan 24 '22 at 20:36
  • 1
    Excuse me for not reading the TL;DR and going directly to what is intended to be actually read (a hint: being concise and to the point, avoiding the need for TL;DR, could help). How can you, then, argue in good conscience that ALFA is not a type of its own, contradicting what is said in the standard? I will only accept one of two answers: either "here is a language predating Pascal which has a type effectively equivalent to Pascal's ALFA" (and I wish for that answer), or "apparently the closest type in pre-existing languages is X in language Y, but it's not an exact match". – Leo B. Jan 24 '22 at 20:48
  • 1
    That's a fascinating answer, but it seems to have gone a little off the track as far as the question actually posed: "Who invented small string optimization? [...] Was [it] Niklaus Wirth [...]?" I am reasonably confident that it was not Wirth, and this answer seems generally to support that, but it doesn't seem quite to come around to who did. – John Bollinger Jan 24 '22 at 21:49
  • 1
    "Pascal not machine-independent" - in this sense, most languages aren't. Example: Algol 60 gave no indication of the range of an integer, but all implementations I've used restricted the type to whatever was convenient on the machine, typically a word. So this feels like a distinction without a difference. – dave Jan 24 '22 at 23:08
  • @LeoB. Excuse me for expecting the one asking a question to read an answer in whole (Hint, it's already all in the TL;DR part :)). Point is , the type Pascal called ALFA was already there, before Pascal was designed. Pascal did not invent it, the same way as Pascal did not invent the types of Integer, Real or Char. On this base, the question for an 'Inventor' simply doesn't make an sense. The language only provided programmers with a grip to handle these basic types provided by the machine. – Raffzahn Jan 24 '22 at 23:44
  • 3
    @Raffzahn In the future, if your first reaction to a question is "Not sure what the question is targeting here", please refrain from attempting to answer. – Leo B. Jan 25 '22 at 00:19
9

Before Pascal, there was Algol 68, which includes the data types bits and bytes, fixed-length array-like entities holding bool values or char values respectively, and which were expected to be implemented 'efficiently', i.e., probably as one machine word.

(Naturally, long bits and long bytes were possible, at the implementor's discretion)

Wirth was, of course, one of the original contributors to the design of Algol 68, before he (and others) resigned in objection to the direction the language was taking. Thus the direction of information flow is not clear, even though Pascal post-dates Algol 68. Wirth's earlier Algol W had a BITS type but not an explicit "machine word of characters" type.


Before all this, FORTRAN IV on the IBM 7090/7094 under IBSYS did not actually have a string data type, but a FORMAT of 'An' used on a READ would cause the next n characters to be read into a REAL or INTEGER variable (one machine word), padding or truncating to exactly 6 BCD characters. I think about all you could do with the data would be to eventually print it out again.

I think the FORTRAN case does not really answer the question, not being related to array types, but it does demonstrate that packed characters did not need a whole lot of invention on word-oriented machines.

dave
  • 35,301
  • 3
  • 80
  • 160