40

Why, usually, does 0 mean success in process return status codes?

When I worked at TANO Corp in New Orleans in the late 70s and early 80s, the convention there was the opposite: 1, true, was the "it's all OK", and 0, false, was "oops!"

In all the years I've known about this "0 means success" idiom, I've asked people if they have any clue where it came from, and the best answer was PURE speculation — admittedly — from the respondent: it means "nothing to report."

OK, that speculation makes sense to me. And a LOT of the origin-of-terms questions people ask here in Retro have answers like "it was borrowed / carried over from non-computing practice." In this case, "nothing to report" is a commonly heard phrase in a military context — and the military was a major early focus of modern computing.

Stephen Kitt
Richard T

5 Answers

46

Here is a Multics document on "Standard Error Handling Practice" from March 1969.


Scroll down to the second page and find "by convention, zero is the code for normal completion".


That indicates to me that by the time of Multics it was already common practice in computing that 0 indicated success/normal for subroutines. In Multics, of course, there was a close affinity between the way subroutines worked and the way processes worked - so it was natural that processes used the same mechanism. From there: UNIX.

By the way, consider the Multics "shell" - leveraging the remarkable Multics property that you could call any entry point in any executable file in the entire file system as a subroutine (if you had permission, of course) - it simply called all "commands" - system or user-written - with a normal subroutine call in machine language, and the dynamic linker handled the rest. That means that in Multics the return value from such a subroutine call became the return value that a command would check (and then it would call a system subroutine to correctly handle the failure from the command-line POV). From the same document:


The history of using 0 for success started a long time ago - most likely earlier than Multics - though it probably wasn't the only such convention. If you want an opinion, my guess is that, based on their experiences writing programs in "the early days", people recognized that there were frequently multiple reasons a given subroutine/system call/command could fail, and that some of those different reasons might be interesting to the caller. (E.g., "device not ready" vs. "no permission to use device".) On the other hand, there probably weren't many compelling examples for having multiple success codes on an API. They also knew that the easiest, cheapest, simplest method of returning a result from a subroutine was just to leave a value in a particular distinguished register - that was a looooong-time practice. With that in mind, you then look at the possibilities of singling out one particular in-band value of an integer, with a mind to making it simple and cheap to test for (by the caller) and maybe also simple and cheap to establish (by the callee) ... and 0 stands out.


BTW, how do you make any subroutine into an entry point of an executable? Simple! Multics allowed files to have multiple names and all you needed to do was add, to your file, the name of each subroutine you wanted to allow the dynamic linker to find. Especially for the system commands it was common to find an executable file with several, or a dozen, extra names attached to it - each one being a subroutine in that binary, and each one callable from the command line. The directory list command would list all the names of each file of course (which would look strange to modern eyes). It's a bit more convenient for the programmer than the modern technique of hard-linking the same file from multiple names and then having the code figure out which command was wanted by inspecting argv[0].
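The modern argv[0] technique mentioned above can be sketched in C. This is a minimal, hypothetical "multi-call" binary in the busybox style; the applet names `hello` and `goodbye` are made up for illustration:

```c
#include <stdio.h>
#include <string.h>

/* Strip any leading directory components from the invocation name. */
const char *basename_of(const char *path) {
    const char *slash = strrchr(path, '/');
    return slash ? slash + 1 : path;
}

/* Decide which "applet" to act as, based on the name we were invoked
 * under.  Returns 0 on success, non-zero on failure, per the
 * convention under discussion. */
int dispatch(const char *argv0) {
    const char *name = basename_of(argv0);
    if (strcmp(name, "hello") == 0)   { puts("hello");   return 0; }
    if (strcmp(name, "goodbye") == 0) { puts("goodbye"); return 0; }
    fprintf(stderr, "%s: unknown applet\n", name);
    return 1;
}
```

With `main` simply returning `dispatch(argv[0])`, hard-linking the binary under each applet name selects the behavior - the single-file, multiple-names idea, just with the name lookup done by the program instead of the dynamic linker.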

davidbak
27

TL;DR: CPUs treat zero as unique among all integers.

Zero is set apart from every other integer by the way ALUs work. At a low level it is thus advantageous to use zero as the return code for success, as it is the easiest value to detect. From there, it comes naturally to extend this to process/program return codes.


Way Back in Time and Close to Hardware

Much meaning can be put, in hindsight, onto return codes, but integer-based codes inherently benefit from integers being treated as first-class citizens by nearly all CPU architectures. This almost always includes a way to branch depending on an integer being zero or non-zero - either a value-based branch, or a fast, low-cost test followed by a branch.

At that point it's helpful to keep in mind that error handling is a burden slowing down execution - a substantial one, considering that programs are all about calling functions, within and from the OS.

By assigning zero as the default value for success, this advantage can be used for cheap (*1) error/non-error detection, reducing error-handling cost to the minimum possible.

Of course that argument might work either way, but looking at ordinary execution, functions usually need to be far more differentiated about why they failed than about why they succeeded. Reserving one return code for success and the remaining MAX_INT-1 values for error numbers again simplifies error handling. Of course, differentiated error handling with multiple fields and structure beyond a simple number would beat all of that - but it would also be overkill in 99.999% of all cases.
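The pattern described above - one success value, many failure values, one cheap test - can be sketched like this. The error codes here are hypothetical, invented for illustration:

```c
#include <stdio.h>

/* Hypothetical error codes: 0 is reserved for success, every other
 * value names a distinct failure reason. */
enum { OK = 0, ERR_NOT_READY = 1, ERR_NO_PERMISSION = 2 };

int open_device(int ready, int allowed) {
    if (!ready)   return ERR_NOT_READY;
    if (!allowed) return ERR_NO_PERMISSION;
    return OK;
}

/* The caller needs only one cheap zero test to cover every failure;
 * the specific reason is still available if it cares. */
int use_device(int ready, int allowed) {
    int rc = open_device(ready, allowed);
    if (rc) {
        fprintf(stderr, "open_device failed: %d\n", rc);
        return rc;   /* propagate the specific reason upward */
    }
    return OK;
}
```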

And then there was C and Unix

While (early) mainframe OSes used dedicated mechanisms for success and return/error information, the designers of Unix were all about simplifying to the absolute minimum. Using zero to distinguish the most notable case yields the best performance.

C/Unix not only used the advantage of integers within programs (*2), but extended it to the shell as well. After all, a program's main() is also just a function, so why bother converting the value in any way instead of simply forwarding it to the shell?

The C Programming Language, Second Edition, mentions this as a general rule on p. 27:

Typically, a return value of zero implies normal termination; non-zero values signal unusual or erroneous termination conditions.

Divide et Impera

What goes for integers works, of course, just as well with signed integers. The sign divides all non-zero values into two (almost) equally sized sets - a feature as easy to detect as zero, and likewise privileged by many architectures.

By using signed integers, not only reasons for failure but also reasons for success can be reported. C makes use of this as well (*3).

Two Halves of a Shell

(*4)

Although process exit codes are usually seen as unsigned integers, the sign principle was extended to shell use as well, by reserving values of 128 and above - for example, for the exit status of a sub-process terminated by a signal (128 + signal number).
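That 128-and-above convention can be sketched with the POSIX wait-status macros. `shell_status` is a hypothetical helper (not a real API) mirroring roughly how a POSIX-style shell such as bash computes `$?`; the raw `wstatus` encoding itself is platform-specific:

```c
#include <sys/wait.h>

/* Derive a shell-style status from a raw wait() status. */
int shell_status(int wstatus) {
    if (WIFEXITED(wstatus))
        return WEXITSTATUS(wstatus);     /* 0..255; 0 means success */
    if (WIFSIGNALED(wstatus))
        return 128 + WTERMSIG(wstatus);  /* e.g. 130 for SIGINT */
    return 1;                            /* anything else: generic failure */
}
```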

One Exit Code to Rule Them All

In batch programming, 'success' is the single most important 'message', as it's the one that must be detected to carry on with whatever comes next. Think of a very classic use case like processing data from a tape. Such a program may return, besides the basic

  • Everything worked fine and
  • Generic fail

exit codes for

  • Wrong tape,
  • No tape assigned or
  • Add follow up tape

Depending on the environment, the latter may require requesting operator assistance to mount the right tape or a follow-up tape, or to search an archive - all things the data-processing program cannot and should not do on its own, as the right handling depends heavily on the customer installation.

Again, the privilege of zero being special among all integers makes batch writing consistent, easy to do and, most importantly, easy to read. Anyone who has worked in a (classic) data center will know how important a consistent structure for batch files is.
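The tape example above could be driven by code like this. The exit codes and action strings are entirely hypothetical - only the meaning of 0 is fixed by convention:

```c
#include <string.h>

/* Hypothetical exit codes for the tape job; only 0 (success) has a
 * fixed meaning, the rest are made up for illustration. */
enum { TAPE_OK = 0, TAPE_FAIL = 1, TAPE_WRONG = 2,
       TAPE_NONE = 3, TAPE_NEXT = 4 };

/* The batch driver dispatches on the exit code; the zero test for
 * "carry on" comes first and is the common, cheap path. */
const char *operator_action(int rc) {
    if (rc == TAPE_OK) return "continue with the next job step";
    switch (rc) {
    case TAPE_WRONG: return "mount the correct tape";
    case TAPE_NONE:  return "assign a tape drive";
    case TAPE_NEXT:  return "mount the follow-up tape";
    default:         return "generic failure: call the operator";
    }
}
```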


Long story short:

Zero is privileged as a return value by hardware, so assigning it to the most common case comes naturally.


*1 - In a sense of compact code and fast execution.

*2 - Immortalised by the ubiquitous `if (rc) { /* error handling */ }`

*3 - Of course it wouldn't be C if it didn't get complicated at that point - for example, read() reserves only -1 for 'some error' and reports the real error number in errno, adding several pitfalls in non-trivial programs :)

*4 - In some ways, the use of signed integers - with its easy detection of positive values, negative values and zero - is much like the shell of a Bivalvia: two valves connected by a hinge :))

LercDsgn
Raffzahn
11

A process, or a system call, can have multiple outcomes, typically more than two. There could be several "successful" outcomes, and several "unsuccessful" ones.

When these outcomes are identified by numbers, choosing negative numbers to mean "unsuccessful", and non-negative numbers to mean "successful" is a fairly mnemonic choice, given the associated meanings of the words "positive" and "negative" in most fields (except, perhaps, some medical contexts where "negative" is often the desired result of a test).

With these conventions, finding out whether the result of an operation was favorable or unfavorable overall, on a machine with two's complement arithmetic, would involve just one instruction: checking the sign of the status value. Thus, 0 is not just "nothing to report", but rather, "successful, nothing more to report".

However, given that most applications have only one truly "successful" condition, in the case of process return codes conventions may vary, and the negative range may now mean "terminated involuntarily", and the positive range "terminated voluntarily" (which is at least a partial "success").
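The negative/non-negative split described above is the same one C's read() uses: a non-negative return is a success that carries information, while a negative value signals failure. A minimal sketch with a hypothetical helper:

```c
/* Hypothetical read()-style helper: a non-negative return is a
 * success carrying information (a count); any negative value is a
 * failure code. */
long parse_digits(const char *s) {
    if (s == 0) return -1;               /* failure: sign bit set */
    long n = 0;
    while (*s >= '0' && *s <= '9') { n++; s++; }
    return n;                            /* success: digits seen, may be 0 */
}

/* Success vs. failure is a single sign test - one instruction on
 * most two's complement machines. */
int is_success(long status) { return status >= 0; }
```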

Leo B.
  • Also (IMO) it hasn't actually worked well to supply multiple success codes. E.g., Windows NT allows multiple success codes in the API. In the first place they are hard to use when provided, there is usually little reason to use them, and since most APIs don't have multiple success codes if you wanted to use them you'd have to look them up in the docs each time to see if your API had them. When they do exist it is easy to get them wrong by people forgetting to test for error/success using the proper macro (thereby turning warnings to errors). .... – davidbak Aug 07 '22 at 15:44
  • ... Finally, in the case of process returns and the shell: they don't compose into pipelines or other shell constructs easily. – davidbak Aug 07 '22 at 15:44
  • @davidbak True; I was editing my answer at the same time. – Leo B. Aug 07 '22 at 15:46
  • @davidbak - that people write code without reading documentation is surely a poor reason for restricting your API's ability to define more than one type of success. Mind you, I was trained on VAX/VMS, which had a machine instruction for "is this a success/fail status". – dave Aug 07 '22 at 18:18
  • @another-dave - not really, if it concerns you that applications/libraries written on your platform be reliable. Even from the consideration that if it is considered difficult/annoying to write reliable apps/libs on your platform it'll get a bad reputation. And as an app/lib writer I'd like my platform to be easy to write correct, reliable code for, too. – davidbak Aug 07 '22 at 18:37
  • Nevertheless, there is more than one way for some routines to 'complete successfully', so you need some way to communicate that to the caller. And therefore, sooner or later, programmers will need to RTFM. – dave Aug 07 '22 at 21:18
  • @another-dave Like you, I have a strong VAX/VMS background, and I'm more persuaded by your arguments in these threads, but, as I'm now also a scientist, I'm WELL aware of how subconscious bias works! And I think our friend and quite sharp colleague Raffzahn is stuck on a bias for their belief in the efficiency of zero, while I'm not buying it as ALL the systems I've known can be just as efficient with one as zero. POSSIBLE CLARIFICATION: I'm not a big fan of long return status codes; just one or zero, PLEASE! There ARE more complex things in error and success; provide ONE mechanism for 'em! – Richard T Aug 07 '22 at 23:02
  • @another-dave: I suspect it wasn't that making it easier for people to use the tool without documentation as it was to avoid having to make additional documentation that would then be ignored; since after enough abstraction, someone would only care about errors, and not care about specifics of success. `if (errorCode) { /* Error Logging Code here */ }` lets you get away with not caring how a success happened, and only if an errorCode boolean'd in that if statement. Documentation can then be LOTR rather than Silmarillion, focusing on edge error cases. – Alexander The 1st Aug 08 '22 at 08:35
  • @AlexanderThe1st You are assuming that non-zero is truthy, which is begging the question. – richardb Aug 10 '22 at 17:22
  • @RichardT: I guess you've never looked at x86 assembly language and machine-code, then. Testing for non-zero is a 2-byte instruction, test eax,eax. Checking for exactly 1 is a 3-byte instruction, cmp eax,1. Also, returning zero is a 2-byte instruction, xor eax,eax, while returning any other integer value is normally done with a 5-byte mov eax, 1. Some other ISAs like ARM and MIPS can branch on a value being (non)zero in a single instruction (ARM cbz / cbnz, or MIPS beq $v0, $zero, target without needing to li a 1 into a register to compare with). – Peter Cordes Aug 10 '22 at 19:52
  • @RichardT: Of course, if 1 was the only possible non-zero status code, you could just check for non-zero, and indeed checking for that is just as efficient on modern mainstream ISAs. So most of this argument is based on having multiple failure codes. But returning 0 is still more efficient on some ISAs with variable-length machine code, notably x86, since x^x or x-x can generate a zero with no space needed for an immediate 1 operand. – Peter Cordes Aug 10 '22 at 19:56
  • (I realize none of the ISAs I mentioned or know about are old enough to be relevant to the establishment of this convention, but smaller code-size is almost always better when all else is equal, and many old ISAs including VAX have variable-length machine code instructions. Raffzahn's answer makes perfect sense to me.) – Peter Cordes Aug 10 '22 at 19:59
  • @richardb: Technically true of assuming that 0 is false in a boolean cast - that's the basic default presumption. Much easier to document for "Do I need to care that the function didn't work in this particular case? No? Then don't bother me with documentation about the many ways this can succeed - I specifically called you to not have to worry about the details when I don't have to." – Alexander The 1st Aug 10 '22 at 22:20
  • @davidbak: There are a few cases where it makes sense to treat multiple different kinds of "success" differently. For example, a request to check if a directory contains a file with a certain name may report that such a file exists and is accessible, that the directory was successfully determined not to contain such a file, or that an error occurred while trying to make the determination. Such cases aren't terribly common, but they're hardly unknown. – supercat Aug 18 '22 at 20:26
9

Not an answer as to 'first', but since we're talking a lot about Unix, it might be useful to point out that Unix had at least 3 'error' conventions. I talk here about PDP-11 Unix.

  1. Syscall success/failure. Success indicated by carry bit clear, possible single-word return value in R0. Failure indicated by carry bit set, negative error code in R0. Distinguishing success from failure is a branch on carry, not a test of R0.

  2. C RTL success/failure. For C library routines wrapping kernel calls, the problem was mapping the two kernel outputs (C-bit, R0) into one function return. This was done by picking a value that was "not possible" for a success return, generally 0 or -1, and saying that was an error indication. The actual kernel error code was written to a global variable (negated, I am not sure why).

  3. Process exit status. The case of interest is of course what the parent process actually sees, which observation occurs via a 'wait' call. The result is a composite of the kernel's termination reason (exit, kill, segv, etc), a flag indicating whether core was dumped, and the 8-bit exit status from the process if it voluntarily exited.
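The C RTL mapping in point 2 can be sketched as follows, in the style Ruslan's comment below describes for Linux (a negated error code in-band). `wrap_syscall` is a hypothetical stand-in for a real libc wrapper, and `raw_result` for whatever the kernel handed back:

```c
#include <errno.h>

/* Fold a kernel-style result (negative = negated error code) into the
 * C convention: return -1 and set the global errno on failure. */
long wrap_syscall(long raw_result) {
    if (raw_result < 0) {          /* kernel signalled failure */
        errno = -raw_result;       /* un-negate into errno */
        return -1;                 /* the one "impossible" success value */
    }
    return raw_result;             /* success: the real return value */
}
```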

So in Unix, the kernel only has one 'success' reason for termination - the process called 'exit', which is coded as zero (and there was no core dump in this case, so that flag is zero too). That leaves 8 bits for the process use.

Second-edition Unix man pages for the shell say nothing about the shell taking any particular action on zero/non-zero for the actual exit status; only about printing messages for the non-exit cases. I conclude there was not yet a strong convention in this regard. Perhaps it only becomes important when 'programmable' shells appear - maybe Programmer's Workbench Unix?

dave
  • "negated, I am not sure why" — presumably, to easily check for error (since -errno < 0, just as -1). E.g. write(2) returns a signed number of bytes, whose being negative indicates an error. Dunno about the UNIX kernel, but Linux returns -errno on error, which is then written by the libc wrapper to the actual errno variable after negation, and the wrapper then returns -1. – Ruslan Aug 08 '22 at 09:35
  • @ruslan - yes, Unix returns negative numbers for error codes, as I said. My question is, why not set those actual values in errno? Or, why would the kernel implementor and the C RTL implementor, who were presumably in very close communication, make opposite decisions? Your 'checking for error' sentence does not seem to hold water, since by intent, errno is only set by the RTL after an error has been reported. – dave Aug 08 '22 at 12:02
  • Because error codes are positive and couldn't be told from the actual return value. So the logic is like: if return value is negative, there's an error, negate the value to get error code. Otherwise it's the value corresponding to the purpose of the syscall: number of bytes read, file descriptor opened, etc. Maybe the error codes themselves could be made negative to avoid the need for negation, but that's another question why they weren't made negative. – Ruslan Aug 08 '22 at 12:06
  • The point in (PDP-11) Unix is that the actual return value or error code from the kernel is separately conveyed from the success/fail indication, so there is never any occasion on which it is necessary to determine whether a particular set of bits in the return value means success or failure. – dave Aug 08 '22 at 12:14
  • @Ruslan: The negation sounds weird. Some modern systems like Linux use in-band signalling of errors in their system call ABI with -errno values, e.g. for Linux, any return value unsigned >= -4095ULL is an error code. (See the kernel/library differences note in getpriority(2)). But MacOS/Darwin signals error / non-error out-of-band in the carry flag, like another-dave is saying PDP-11 Unix did, with the return value register (rax on x86-64) holding an errno value if CF==1, otherwise a normal return value. – Peter Cordes Aug 10 '22 at 20:07
  • Perhaps earlier Unix on PDP-8 had used in-band signalling in the ABI? Or inside the kernel, error codes were in-band? I don't suppose the kernel generated the carry flag result directly from the return value with a compare or something? Since some system calls need to return non-zero values, it couldn't be as simple as just doing 0 - retval which would set carry if non-zero. Oh, and if system-calls preserve user FLAGS except for carry, just generating a flags condition in the kernel before returning wouldn't work, if the restore mechanism is like x86 iret. – Peter Cordes Aug 10 '22 at 20:11
  • @PeterCordes do you mean PDP-7? UNIX didn't run on PDP-8. For PDP-7 we can see here that it did use in-band error indication (see e.g. open or chdir), though there seem to be no error codes, just one "I failed" code. – Ruslan Aug 10 '22 at 21:37
  • @Ruslan: Err, yes, I was mis-remembering which PDP Unix ran on before they developed C to port to PDP-11 from the original hand-written assembly implementation. Thanks for the correction, and checking on that guess. – Peter Cordes Aug 10 '22 at 21:40
  • These early man pages, which evidently refer to PDP-11 Unix, indicate sys calls setting the C bit, but with no specific mention of error codes. Still, since some calls use R0 as the return value on success, it's not a huge leap to put an error code there on failure. But absent doc or code, nothing to confirm my assertion of 'negative'. – dave Aug 11 '22 at 00:11
1

Many CPUs have a "flags" register that contains a number of bits automatically set or reset according to the last ALU or load operation.

One of those bits is commonly a "Z" (zero) bit, which is automatically set when a value is 0.

CPUs also typically have a "compare" instruction that performs a subtraction, discards the result, and sets the flags accordingly.

However, if you want to test for zero, the compare is unnecessary since the Z bit will be set automatically from the last operation that loaded a zero value.

So if you need two values to decide between "error" and "not-error", zero and non-zero are convenient because they save an instruction and a couple of cycles (valuable things in the late 60s) on each system call.

I could be wrong - I haven't really looked at this in depth yet - but skimming the Multics Processor Manual suggests at least one of the CPUs Multics ran on worked the same way: the description of the lda instruction (page 32), for example, mentions an "indicator" being set if the value is 0, I believe. So the tnz (transfer on non-zero) instruction could be used immediately after the call to jump somewhere if there was a problem, without a cmpa instruction first.

Of course, that manual has a copyright date of 1985, so I'm not 100% sure.

LawrenceC
  • Not all computer architectures set the flags on simple load immediate value to register operation, only some do. And some interfaces use the carry flag for indicating between success and failure. – Justme Aug 19 '22 at 21:48
  • Absolutely - for example, CPUs with "skip" instructions probably didn't have flag registers at all. What I read above led me to believe the CPU architectures Multics ran on may have worked like that, though. – LawrenceC Aug 22 '22 at 20:00