45

Since the early 1990s, I've known about a technique for bootstrapping arbitrary 16-bit x86 code from the subset of instructions that can be represented as printable ASCII bytes.

The first example of an ASCII executable I saw was a short text that could be prepended to a uuencoded file, resulting in an MS-DOS .COM executable that would uudecode and probably run itself.

That one I couldn't find, but here's an example of a post reminiscing about x86 ASCII executables, with a few sample files to play with.

For example, an ASCII executable to convert .COM files to executable ASCII starts as follows:

T_OOWW3=XXWX5 2PY5w3P_-l.P-KD1Ep-OLPZ-pJP-pw40PQX5fsPu
ASDWERT/Nide5Fe,xPQX-=.PQX-MQP-xx4_P5rjP5Z2P-jE,JP=
5O2,APQX5R8P-rJPPRX5iBP-x=PRX5TsP59DHHP5rIHP-w64ZP=
40-2APQX-MiP-trP5_WP-pBP51w,pPTYPZPZP__z1t3w.FNtKptDCZ
LGcP4mCC558taMjL.4Hh0.44r5tNNAbs55p4VGsO5n_55LlC8zp_rk
gS5_pOiq.AIkgWub7GwtcOI.C9xO7PC2aPf.stA2.yGQ5JGvMvc4O_
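
As a small illustration of why such text can execute at all: the first six characters of the listing above, `T_OOWW`, all fall in the byte range 0x40 to 0x5F, which in 16-bit x86 encodes the one-byte `inc`, `dec`, `push` and `pop` register instructions. The sketch below decodes only that group; the function name `decode_one_byte` is made up for this example, and it deliberately stops before the multi-byte instructions that follow in the listing.

```python
# Register order used by the 16-bit x86 one-byte register opcodes.
REGS16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]

# Base opcode of each one-byte group; the low 3 bits select the register.
GROUPS = {0x40: "inc", 0x48: "dec", 0x50: "push", 0x58: "pop"}


def decode_one_byte(code: bytes) -> list:
    """Decode a run of one-byte inc/dec/push/pop instructions (hypothetical helper)."""
    out = []
    for b in code:
        base = b & 0xF8
        if base not in GROUPS:
            raise ValueError(f"not a one-byte inc/dec/push/pop opcode: {b:#04x}")
        out.append(f"{GROUPS[base]} {REGS16[b & 7]}")
    return out


print(decode_one_byte(b"T_OOWW"))
# ['push sp', 'pop di', 'dec di', 'dec di', 'push di', 'push di']
```

All 32 opcodes of this group map to the characters `@` through `_`, which is one reason the x86 is comparatively friendly to this trick.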

What is the history of this technique? Was it invented for the x86 instruction set or earlier? Which existing instruction set architectures are known to allow it?

Seeing how many people have misinterpreted the question, a clarification:

The main usage of this trick was to publish binaries on USENET for people (or send them by e-mail to people) who don't use Unix and access "the cyberspace" from an MS-DOS machine. They may have no idea what uudecode is. With an ASCII-only executable the instructions are: copy the message in its entirety to a file, delete all lines up to and including the -- cut here -- line, rename the file to whatever.com (for example, uudecode.com) and run it.

For this to work, the file must consist only of printable ASCII: bytes 10 (LF), 13 (CR), and 32 to 126 (space to ~). It must contain no bytes with the high bit set (they might not pass through e-mail/USENET) and no other control bytes, especially no ^Z. It therefore has to begin with a cleverly constructed sequence of machine instructions that avoids all of the forbidden bytes yet manages to convert the rest of the text from ASCII to the binary intended to be executed.
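
The byte restriction can be stated as a simple check. This is a minimal sketch in Python (not part of any original tool); the names `ALLOWED`, `forbidden_bytes` and `is_ascii_executable_safe` are invented for the illustration.

```python
from typing import List, Tuple

# Allowed bytes as described above: LF (10), CR (13), and space (32) through '~' (126).
ALLOWED = frozenset({0x0A, 0x0D}) | frozenset(range(0x20, 0x7F))


def forbidden_bytes(data: bytes) -> List[Tuple[int, int]]:
    """Return (offset, byte) pairs for every byte outside the allowed set."""
    return [(i, b) for i, b in enumerate(data) if b not in ALLOWED]


def is_ascii_executable_safe(data: bytes) -> bool:
    """True if the data could survive a 7-bit, text-only transport unchanged."""
    return not forbidden_bytes(data)


first_line = b"T_OOWW3=XXWX5 2PY5w3P_-l.P-KD1Ep-OLPZ-pJP-pw40PQX5fsPu\r\n"
print(is_ascii_executable_safe(first_line))   # True
print(forbidden_bytes(b"MZ\x90\x00\x1a"))     # [(2, 144), (3, 0), (4, 26)]
```

The second call shows a typical binary header failing on a high-bit byte, a NUL and a ^Z.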

Alternatively, using ASCII executables could be a way to enter machine code by typing it in more efficiently than hex.

So the question is, is it possible to have a similar converter for 8080/Z80, 6502, or another microcomputer platform, and if yes, has it been used and for what purposes, if not transmitting executables through e-mail?

DrSheldon
Leo B.

8 Answers

25

If you go back a lot before the x86, this technique wasn't unusual at all. In fact, writing programs using printable letters and symbols was pretty much the norm for early computers, except that there were a number of encodings for words of varying bit size, and those encodings were not ASCII.

Examples:

  • On the IBM 1401 (1959), a program that looked like

    ,008015,022029,036043,050054,055062,063065,069080/333/M0792502F1.065HELLO WORLD

    would print "HELLO WORLD". Here , (set word mark), / (clear storage), M (move) etc. were opcodes, and the rest was operands. Wikipedia has a list of characters and corresponding opcodes.

  • On the Olivetti P101 "desktop computer" (1965), a program like

    b ↑
    B ↑
    b ↓
    B ÷
    A ◇

    would read two numbers, divide them and print the result. More examples in the manual. This machine didn't even have character sets with all Latin letters.

  • There was another early computer where the poor programmers had to translate each assembler instruction into two characters of a rather random teletype-like character set on paper tape, because initially there was no proper assembler, so "writing gibberish" was actually the proper method of programming this computer. Unfortunately, I can't remember at the moment which computer that was (will edit answer when I do).

And there are probably many more examples.

So the technique itself is quite old. Coming back to encoding machine language in ASCII: in principle, one can apply it to any kind of ISA. One just has to define which part of ASCII one considers admissible and which part of the instruction set of the particular CPU matches it, and then it becomes an exercise to encode what you want in this restricted manner.
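
The "define the admissible bytes, then see which instructions match" step can be sketched as follows. The example uses a handful of Z80 opcodes (first byte only) taken from the standard Z80 opcode table; the selection of opcodes and the helper name `split_by_admissible` are choices made for this illustration, not something from the answer above.

```python
# Admissible bytes for this sketch: space (0x20) through '~' (0x7E).
PRINTABLE = range(0x20, 0x7F)

# A few Z80 instructions and their (first) opcode byte.
Z80_SAMPLE = {
    "LD B,C": 0x41, "LD A,B": 0x78, "LD (HL),A": 0x77, "LD A,(HL)": 0x7E,
    "LD HL,nn": 0x21, "LD A,n": 0x3E, "INC HL": 0x23, "DEC A": 0x3D,
    "JR NZ,e": 0x20, "JR Z,e": 0x28, "JR e": 0x18,
    "ADD A,B": 0x80, "SUB B": 0x90, "JP nn": 0xC3, "CALL nn": 0xCD, "RET": 0xC9,
}


def split_by_admissible(opcodes, admissible=PRINTABLE):
    """Partition instruction names by whether their opcode byte is admissible."""
    ok = sorted(name for name, byte in opcodes.items() if byte in admissible)
    bad = sorted(name for name, byte in opcodes.items() if byte not in admissible)
    return ok, bad


usable, unusable = split_by_admissible(Z80_SAMPLE)
print("usable:  ", usable)
print("unusable:", unusable)
```

In this sample the register loads and conditional relative jumps survive, while the 8-bit arithmetic, `JP`, `CALL` and `RET` do not, so those would have to be synthesised at run time. Operand bytes (immediates, addresses, jump offsets) are subject to the same restriction and are not modelled here.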

jpmc26
dirkt
  • 4
    The IBM 1401 instruction encoding was specifically designed to have its opcodes represented as printable characters, wasn't it? I'm asking about a way of converting arbitrary binary code into a directly executable ASCII file on platforms where the instruction set was designed without that reservation. – Leo B. Sep 07 '17 at 16:13
  • 3
    @LeoB. No, it wasn't. It used 6 bit bites (sp!) so each and every possible binary code was also a printable character. There was no need to specially design it that way, or separate binary from printable ... these categories came up later. – Raffzahn Sep 07 '17 at 16:36
  • @Raffzahn The "set word mark" opcode maps to a comma too conveniently to believe that it was a pure coincidence. Back to my question: it is about allowing arbitrary executable binaries to be represented as 7-bit ASCII without needing any conversion. – Leo B. Sep 07 '17 at 16:48
  • To some extent the Zuse Z22/Z23 (1955) machines would also qualify, as the normal sequence of loading a program was to load and compile the source. In normal operation no binary programs were loaded. The compiler was part of the OS, residing in the lower 4k words of memory (originally drum, later core), while the compiled (user) program was written into the upper 4k words. – Raffzahn Sep 07 '17 at 16:58
  • 1
    @LeoB. With 40 of the 64 combinations representing valid opcodes it's hard to believe your assumption. Also, until now I understood your question to be about a file with only printable (or otherwise usual) characters used for text representation. If you now narrow this down to strict 7 bit ASCII, all machines using different codesets are out of the question. Sounds not really open and useful. – Raffzahn Sep 07 '17 at 17:10
  • @Raffzahn The main usage of this trick was to publish binaries on USENET for people (or send them by e-mail to people) who don't use Unix and access "the cyberspace" from an MS-DOS machine. They may have no idea what uudecode is. With an ASCII-only executable the instructions are: copy the message in its entirety to a file, delete all lines up to and including the -- cut here -- line, rename the file to whatever.com and run it. Easy as pie. – Leo B. Sep 07 '17 at 22:44
  • 2
    @LeoB. Sure, but what you're telling now basically contradicts your question. Originally you asked about a history before x86, and now you want to restrict it to MS-DOS users? It would be great if you could make up your mind what you're asking. – Raffzahn Sep 07 '17 at 22:57
  • I've just provided an example of the usage for x86/MS-DOS. The same would have worked for any machine that could receive e-mail, whose users could be not savvy enough to know about uudecode. Or as means to type in executable code from printed media in a form more compact than hex. – Leo B. Sep 07 '17 at 23:04
  • 3
    The Manchester Mark 1 was originally directly in teleprinter code - i.e., 5 bits on tape => 5 bits in memory. – dave Mar 04 '20 at 12:37
12

It was standard practice on the Sinclair ZX80 & ZX81 to put executable code into a REM statement at the beginning of a BASIC program.

REM statements are, of course, text comments, so this meets the spirit of your requirement for executable ASCII.

The ZX80 (1980) and ZX81 (1981) predate your question about the early 1990s by about 10 years and used the Z80 processor.

There is a guide on how to put executable code into REM statements here. Essentially, you reserve space with a REM statement and then poke the machine code bytes into that space.

This is just one example of the use of this technique. It was also used later in the 1980s on the BBC Micro. Small embedded subroutines were also put into REM statements on the HP9845, mostly to accelerate arithmetic calculations.

wizzwizz4
Chenmunka
  • 5
    I don't know much about zx80 and zx81, but on zx spectrum putting code into REM is a standard practice. Obviously not every byte value is treated by zx spectrum basic as ASCII - some are basic tokens, some are control codes (for example, they change color of text) and some just break further printing of such REM comment. Therefore I doubt whether zx80 and zx81 used ONLY ascii subset to put code into REM comment. – lvd Sep 07 '17 at 09:32
  • 7
    @lvd The ZX-81 didn't use ASCII at all. It had a character set different from ASCII. But that's nit-picking. – tofro Sep 07 '17 at 10:02
  • 2
    Similarly, Atari BASIC allows code be inserted into a string variable and executed. This was a common way to add assembly routines to a BASIC program, allowing faster running BASIC programs. – Tim Locke Sep 07 '17 at 12:02
  • 1
    @TimLocke Also worth mentioning here that on the Atari 8-bit machines, it was possible to enter any byte from the keyboard — every byte was associated with a graphical character — so in that case it wasn't just ASCII. – al45tair Sep 07 '17 at 13:05
  • 1
    "Executable ASCII" is a sequence of ASCII characters that can be executed as machine instructions directly, without any pre-processing or conversion. Otherwise you can say that a C program is, of course, ASCII, and is, of course, executable. – Leo B. Sep 07 '17 at 16:00
  • @LeoB. - That ASCII in a REM line is "directly executable". PRINT USR 16514 did the trick of having the BASIC interpreter directly calling that code. – tofro Sep 07 '17 at 16:23
  • This answer isn't about the question, as it describes a method to reserve space within a program to store arbitrary binary information, not ASCII (or any system specific encoding) as it has been asked for. – Raffzahn Sep 07 '17 at 16:40
  • @tofro I'm sorry, I think you're still missing the point of the question. – Leo B. Sep 07 '17 at 16:51
  • @LeoB. I don't think so - Bytes are just bytes, and any interpretation is up to you. In case someone arranges opcodes in a way that the program bytes look like a picture or like readable text, any computer can do that. The ZX Spectrum manual even had the Z80 opcodes included in its ASCII table. – tofro Sep 07 '17 at 17:03
  • @TimLocke In any language that allows putting arbitrary bytes in strings by some means of escaping and is flexible enough to pass control to an arbitrary address, you can do that; but this is not the point of the question. The question is about being able to use a strict subset of bytes (about 96 out of 256) in a binary code directly executed by a CPU. – Leo B. Sep 07 '17 at 17:06
  • @tofro "In case someone arranges opcodes" - exactly. The designers of the x86 instruction set haven't done that, but it is possible to come up with a useful executable (by the CPU) ASCII file for x86. I'm asking if it is possible for other existing platforms. – Leo B. Sep 07 '17 at 17:09
  • 1
    You can write useful programs for a Z80 using ASCII text only. Most of the register load and store opcodes are in that range, and most of the relative jumps as well. You would not be able to add and subtract, but otherwise, why not? For the RET to basic (0C9h), you'd probably have to revert to a bit of self-modifying code. But I very much doubt that all of the programs in your links really restrict themselves to pure 7-bit ASCII. – tofro Sep 07 '17 at 17:20
  • @lvd On the Spectrum such programs (this approach was popular in tape loaders because the loader must be in BASIC) usually appeared malformed when doing a LIST command (or even froze the system or gave a "C Nonsense in BASIC" error). So I think any bytes were encoded in REM. The format of a line of a BASIC program is line number, line length, line payload, \n, i.e. lines are length prefixed, so any bytes can be used after REM. – kolen Sep 07 '17 at 19:11
  • Clearly the technique produces REM statements so far from regular printable characters that when you do this, according to the article, "on the OLD ROM the command LIST will usually cause a system crash." – cjs Mar 04 '20 at 07:47
  • On the Atari, this trick worked because you could get the ADR of the string. How did the ZXs get the address of the REM? Was it simply knowing the line always started at a fixed address? What if it was two lines? – Maury Markowitz Mar 04 '20 at 12:07
  • The ZX81 most certainly did not use ASCII so this is not answering the OP's question, I'm afraid. Instead it misleads readers. – TonyM Jul 08 '20 at 06:55
11

Short Answer

It can be done in any environment that:

  1. Allows data files to be re-marked as program files,
  2. Has a loader format that's either primitive enough or entirely readable,
  3. Has a character set (doesn't have to be ASCII) that has a sufficient number of encodings that produce valid opcodes
  4. Has an address space layout that fits the possible encodings
  5. Necessary OS calls can be encoded (subset of 3.)

Numbers 4. and 5. can be circumvented depending on the machine, the OS and available text encodings.

Since some small machines have 256 printable characters, the difference between a binary and a text file is negligible anyway.

History here is a bit hard to grasp, as the necessity didn't arise in ye old days. Who needs tricks when you've got full access? But there was a somewhat similar situation for early mainframes: while there was a 'binary' mode for punch cards for many machines, one could also punch 'text' cards with encodings outside of the (usual) readable range. This worked, since the translation of 12-hole punch card code into 8-bit EBCDIC worked according to fixed rules.

Remember, real men always had a hand punch nearby, or at least carried a Port-A-Punch.

Raffzahn
  • 4
    As an example, on a SPARC having a directly executable ASCII file is impossible, because almost every useful instruction would have a zero byte somewhere. Apart from having no control characters except CR and LF, another main characteristic of an ASCII text is that all bytes in it have the upper bit cleared, making it transparently transmissible on a network that can add or strip byte parity arbitrarily. – Leo B. Sep 07 '17 at 16:17
  • @LeoB. Hmm, I have a hard time seeing the claim about lots of zero bytes when looking at SPARC encoding. Mind explaining? What is true is that all load/store instructions are in the 11 group, so it will need some really nifty tricks to get anything working. Even synthesising will be hard, as data manipulation is within the 10 group. So SPARC will most likely fail due to no. 3 - unless using a different codeset, like EBCDIC, where all letters fall into 11 :)) (On a side note, what's the problem with zero bytes in ASCII text?) – Raffzahn Sep 07 '17 at 16:31
  • 1
    I was thinking about load/store instructions that have a wide offset field. The problem with zero (or other control) bytes in ASCII text is that early communication protocols or file systems could use them for their own purposes without escaping those existing in the text being transmitted or stored. The same goes for bytes with the high bit set. – Leo B. Sep 07 '17 at 17:02
  • 1
    Well, for having 'low' addresses, the trick here would be the same as on an x86 machine. You can't just load an address like 0x0001 into a 16 bit x86 register, as it would generate a 16 bit constant with a zero byte as the second. Wouldn't it? So some secondary instructions to construct the address in a register (or memory) are needed anyway. – Raffzahn Sep 07 '17 at 17:32
  • "As an example, on a SPARC having a directly executable ASCII file is impossible" - @LeoB.: I guess, reformulating your question in a way "how to achieve ASCII-only machine code on ARM/MIPS/SPARC/etc." (somewhere on codegolf SO for example) would be only more useful than retrocomputing excavations. (I myself coming from a MIPS case where echo -e is not available on an embedded system, and wonder how I'd bootstrap my binary to such a system.) – pfalcon Sep 23 '18 at 22:31
9

I remember doing this on the university mainframe around 1975. This was on an ICL1904S. Note that the 1900 series had been around for more than 10 years at that time. I don't know when the feature came out but it had been around for some time.

You could list out any executable in card reader format. It would produce the executable in 6-bit characters in lines of 80 characters. Not really ASCII: the whole system ran on a 6-bit character set. These could then be embedded in the GEORGE 3/4 batch files.

It was absolutely brilliant because the uni ran a cleanup every 2-3 weeks. The OS would go through and delete all the executables and intermediate object files but it left the batch files alone, whatever their size.

cup
8

For a slightly interesting twist on this concept, consider Control Data mainframes.

These beasts included not only a CPU, but a "peripheral processing unit" (PPU) [1], and the CPU sent commands to the PPU via normal I/O channels.

The CPU was a 60-bit processor that used 6-bit character codes. The PPUs were 12-bit processors, so the CPU sent a stream of 2 characters to the PPU to send it commands. The PPU commands all required that the first character of the string be a 0.

In most CDC character sets (they had a few, since each one only supported 64 characters), a 0 character was a colon, and letters started at 1, so A=1, B=2, etc.

One semi-popular trick when I was in college was to get a user to execute a program that tried to print the string :D to the screen. As it happened, PPU command 4 was "log off user"...


  1. In the higher-end machines, this was officially a "peripheral processor" (PP) instead, but the concept remained essentially similar.
Jerry Coffin
3

This certainly seems to be possible on the 6502. While several seemingly crucial instructions (like STA, STX and STY) exist only with the 8th bit set, it's still possible to construct arbitrary bytes in RAM using SEC with the read-modify-write forms of ROL, ROR and/or LSR, provided the RAM addresses are printable ASCII. The full set of ADC/EOR/AND opcodes is also available to speed up construction of arbitrary bytes, and both JMP and JSR are available, as are BMI, BVC and BVS (BPL, at $10, falls in the control-character range).

It's straightforward to see how this could be used to construct a small program in zero-page, which could in turn accept hex digits or even Base64, and translate that into a full program.
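
The byte-construction idea can be sketched with a small generator and simulator. This is only an illustration, under the assumptions that the target is a zero-page address in the printable range and that only three opcodes are used: SEC ($38), LSR zp ($46) and ROR zp ($66). The function names `emit_build_byte` and `simulate` are invented here. Bits are shifted in from the top, least significant bit first, so the previous contents of the location do not matter after eight steps.

```python
SEC, LSR_ZP, ROR_ZP = 0x38, 0x46, 0x66   # ASCII '8', 'F', 'f'


def emit_build_byte(value: int, zp_addr: int) -> bytes:
    """Emit printable 6502 code that leaves `value` at zero-page `zp_addr`."""
    if not 0x20 <= zp_addr <= 0x7E:
        raise ValueError("zero-page address must itself be printable ASCII")
    out = bytearray()
    for bit in range(8):                          # least significant bit first
        if (value >> bit) & 1:
            out += bytes([SEC, ROR_ZP, zp_addr])  # shift a 1 into bit 7
        else:
            out += bytes([LSR_ZP, zp_addr])       # shift a 0 into bit 7
    return bytes(out)


def simulate(code: bytes, initial: int = 0xFF) -> dict:
    """Tiny simulator for just the three opcodes used above."""
    mem, carry, i = {}, 0, 0
    while i < len(code):
        op = code[i]
        if op == SEC:
            carry, i = 1, i + 1
        elif op in (LSR_ZP, ROR_ZP):
            addr = code[i + 1]
            old = mem.get(addr, initial)
            carry_in = carry if op == ROR_ZP else 0
            mem[addr] = (old >> 1) | (carry_in << 7)
            carry = old & 1                       # bit 0 falls into the carry
            i += 2
        else:
            raise ValueError(f"unsupported opcode {op:#04x}")
    return mem


code = emit_build_byte(0x8D, 0x40)                # 0x8D is the STA absolute opcode
print(code.decode("ascii"))
print(hex(simulate(code)[0x40]))                  # 0x8d
```

Each constructed byte costs 16 to 24 bytes of printable code in this naive form, which is why one would only build a tiny decoder this way and let that decoder unpack the rest.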

Chromatix
  • Does that mean it would be possible to type in something like 'SYS nnn : REM CODEHERE' where nnn is the address of the CODEHERE string in the basic input buffer? – Arc Mar 04 '20 at 23:12
  • @Arc Maybe, depending on details of the BASIC interpreter in question. But if you have a BASIC interpreter available, you might as well use that to do the conversion from an ASCII string. – Chromatix Mar 05 '20 at 02:29
2

The first example of an ASCII executable you saw is in the Google Usenet archive here

2

Is 1949 early enough?

The Manchester Mark 1 had 20-bit wide instructions which were conventionally written as four 5-bit characters using a variation on teleprinter code which Alan Turing adapted for the purpose by replacing control codes with printable characters so that all instructions and data could be written as text.

One might suggest this is cheating because all machine code can be written in e.g. hexadecimal to avoid ASCII control codes, or because it's not ASCII (which didn't exist until the 1960s). However, since the entire point of the exercise was to be able to enter text which can be executed directly, it surely counts.

Similar adaptations of ASCII also exist, e.g. ye olde CP437 which has glyphs for all 256 values. Some very large C=64 programs also shove code and data into the text-mode buffer (what with it being the only spare memory left) and you can watch the PETSCII characters twinkle on the screen as it executes.

pndc
  • The entire point was to be able to bootstrap arbitrary binary executable code on a byte-oriented instruction set architecture using just the standard printable ASCII subset (bytes 0x0A, optionally 0x0D, and 0x20-0x7e). – Leo B. Jul 09 '20 at 21:31