
Memory corruption bugs have always been a common problem in large C programs and projects. It was a problem in 4.3BSD back then, and it's still a problem today. No matter how carefully the program is written, if it's sufficiently large, it's often possible to discover yet another out-of-bound read or write bug in the code.

But there was a time when large programs, including operating systems, were written in assembly, not C. Were memory corruption bugs a common problem in large assembly programs? And how did they compare to C programs?

比尔盖子
  • I disagree with your premise that out-of-bounds writes are inevitable. Independent of the language, it is possible to prevent this, e.g. through bounds-checking every indexed write. Typically this isn't done for performance or code size reasons (when it's a deliberate choice), or simply because someone "didn't think of it". – Michael Graf Jan 21 '21 at 10:20
  • It depends who wrote the program. Seriously, some people pay attention to detail, and some do not. In my opinion as a one-time kernel-mode programmer, a decent assembly language is no worse than C in that regard. – dave Jan 21 '21 at 13:19
  • @another-dave, Yeah, you can write good code in any language, and you can write bad code in any language, but be fair! Some languages make it easier to avoid certain kinds of problem (e.g., memory leaks are more often found in C programs than in Java programs). – Solomon Slow Jan 21 '21 at 15:31
  • @another-dave "a decent assembly language is no worse than C in that regard" It's actually my motivation for asking this question. Some early design oversights in the C standard library produced things like scanf("%s", buf) or gets(buf), which are inherently dangerous and extremely easy to misuse. On the other hand, an ASM programmer is very unlikely to make this mistake. – 比尔盖子 Jan 21 '21 at 16:44
  • @比尔盖子 The main point here is really about using a 'decent assembler' - and using the features it offers. Any 'decent' assembler will offer ways to write code in an abstract way and let the assembler fiddle out all the details, thus avoiding many pitfalls. – Raffzahn Jan 21 '21 at 17:13
  • I'm mostly a Java programmer these days, but back when I wrote a lot of Macro-11 and Macro-32, I think I was on top of the concept of managing memory and access to same. Perhaps the problem lies in blunt programmers using sharp tools? – dave Jan 21 '21 at 17:44
  • @SolomonSlow, I agree with that. One nitpick: I’m not dissing garbage-collected languages (they are a valid choice) but don’t forget that GC doesn’t prevent memory leaks at all; on the contrary, it says “I give up; I’ll just leak every piece of heap memory I will ever allocate; if I actually run out of memory before the program completes, this neat trick will go and try to pick up after the mess I left behind” – Euro Micelli Jan 21 '21 at 18:08
  • @another-dave — At least in larger software projects, things like these should be laid out in coding standards and verified through code reviews, rather than being left to individual programmers paying more or less attention to detail. – Michael Graf Jan 21 '21 at 20:53
  • @EuroMicelli: Nonsense. In a tracing-GC language, allocation is determined by the existence of rooted references. An object will cease to be allocated the moment the last rooted reference to it ceases to exist. The memory allocator may not add the memory that had been used by the object to the pool of memory available for immediate reuse until the amount of memory it needs exceeds the amount immediately available, but that hardly means the storage was "leaked". As soon as the last rooted reference gets overwritten, the storage will become available for reuse if needed. – supercat Jan 21 '21 at 22:24
  • @MichaelGraf - did I suggest otherwise? But the person writing the code owns the primary responsibility to take care; the code review is only a gatekeeper. – dave Jan 22 '21 at 00:23
  • @supercat, it’s leaked in the sense that the memory cannot be reused until the GC does something; what you describe is an optimization detail. The fundamental nature stays the same: “the program no longer deallocates memory for reuse as soon as the object goes out of scope at a deterministic place; the language infrastructure might eventually recover the memory for reuse if needed, at some indeterminate time in the future. The program might never bother to do so if there isn’t enough memory pressure before the program ends”. GC doesn't get rid of leaks; it accepts them, and delays fixing them – Euro Micelli Jan 22 '21 at 01:34
  • Starting in 1967, I learned "You can program poorly in any language". – waltinator Jan 22 '21 at 02:01
  • @EuroMicelli: The memory can't be reused until there is a use for it, which is true of all allocation schemes. Further, even alloc/free-based memory managers often defer consolidation of adjacent free sections until there's an allocation request that can't be satisfied by a known free block. Many GC implementations work by identifying areas of storage that aren't used by any live objects, without knowing nor caring what objects may have previously occupied now-unused regions of storage. – supercat Jan 22 '21 at 15:57
  • I'm not sure the qualifier "large" is needed here. Memory corruption is possible in small programs as well. The issue is not size, but complexity, arriving when you breach the boundary between "obviously nothing wrong" and "nothing obviously wrong". – fadden Jan 22 '21 at 16:08
  • The nature of programming has changed in the last decade or so. Time was that the only not-my-teams-code was the OS itself. Now you use 43 badly-documented third-party libraries with sketchy interfaces. – dave Jan 22 '21 at 23:27
  • DMA - the easiest way to accidentally wipe random memory. Not necessarily a C/Asm problem, but I used it a ton more in Asm :) – Michael Dorgan Jan 23 '21 at 01:32
  • Voting to close as opinion-based, and the current list of answers kinda proves it. – pipe Jan 23 '21 at 03:01
  • @supercat "As soon as the last rooted reference gets overwritten, the storage will become available for reuse" This is not correct. In the general case, the system doesn't know that the last rooted reference has been overwritten until it does a garbage collection. Until that point, the storage cannot be reused. On properly tuned systems garbage collection happens infrequently or even never if a process's memory requirements are low. In the latter case, the storage for objects without rooted references will never be reclaimed. – JeremyP Jan 24 '21 at 13:41
  • @MichaelGraf Bounds checking doesn't prevent out-of-bounds errors; it just changes the nature of the damage they can cause from "memory corruption" to "denial of service". The only way to reliably prevent indexing errors is to not use indexing, ie by using enumerator-style abstractions instead. – Mason Wheeler Jan 24 '21 at 13:53
  • @JeremyP: If the system knows of a region of storage it can use to satisfy an allocation request, why should it care what other regions of storage are available? Storage that previously held an object to which no rooted reference exists will be available as soon as the system bothers checking whether that region of storage is available, which it will do as soon as there's an allocation request which doesn't get satisfied by some other means first. – supercat Jan 24 '21 at 18:37
  • @MasonWheeler — It does more. First, it moves the time of detection forward: the error is flagged when the erroneous write is attempted, rather than crashing an undefined time later when trying to read the overwritten data / execute the overwritten code; and it is flagged every time. This greatly increases the likelihood that the error will be caught in testing. Second, it provides information about the error, increasing the likelihood that the root cause will be found and the error fixed. – Michael Graf Jan 24 '21 at 20:19
  • @supercat Yes "as soon as the system bothers checking". The only way to do that is to traverse the entire rooted reference graph which is an expensive and time consuming operation. If the system finds it can't allocate an object, it triggers garbage collection. That might, by the way, involve moving allocated objects to different parts of RAM. You can't say the RAM is available as soon as all the references have gone, because it isn't. – JeremyP Jan 26 '21 at 09:04
  • @JeremyP: In a typical GC system, even if no storage is available at a time when code erases the last rooted reference to an object, and that action is immediately followed by an allocation request, the allocation will succeed by using the storage that had just been used by the object that no longer exists. The fact that a GC cycle occurred between the overwrite of the last reference and the successful completion of the allocation request is an implementation detail which is essentially invisible to the application outside of its effect on things like performance counters. – supercat Jan 26 '21 at 17:35
  • @JeremyP: Note that even in systems that use alloc/free semantics, it would not be uncommon for free() to simply set a flag on the formerly-allocated block that won't be checked unless or until the system does a free-block-scan to merge free blocks into a sorted list. A full reference scan would cost more than such a merge operation, but in a modern generational GC most scans will only need to traverse the parts of the object graph which have been created or touched since the previous GC scan. – supercat Jan 26 '21 at 17:41
  • @supercat How does the system know it has just erased the last rooted reference to an object? – JeremyP Jan 28 '21 at 09:29
  • @JeremyP: It wouldn't care. A tracing GC works in a fashion analogous to a bowling alley pinsetter's deadwood collector, which lifts all the pins that are still standing, and runs the sweeper bar underneath. The collector doesn't know nor care what pins need to be swept up, or where they are. It simply clears out the entire area. Any object to which no reference exists may be swept up any time the GC decides it would benefit from finding more free space. Increasing the number of objects that are created and almost immediately discarded will increase the frequency of gen-zero collections... – supercat Jan 28 '21 at 15:48
  • ...but the cost of a gen-zero collection is affected by the number of objects that survive, and the number of older objects that have been modified since the last gen-zero collection. If the rate at which long-lived objects are created or modified is constant, increasing the frequency of gen-zero collections will decrease the amount of work required for each one. If an object gets created and abandoned without registering a lock, finalizer, or any other special features, the GC will neither know nor care that the object ever existed. – supercat Jan 28 '21 at 15:52
  • @supercat That's the whole point. The sweeper bar can't sweep away the fallen pins until the standing pins have been lifted out of the way. The JVM cannot know if a particular block is available for reuse unless it knows it is not reachable from any rooted object. The only way to find that out is to traverse the rooted object graph. – JeremyP Jan 29 '21 at 08:42
  • @JeremyP: If objects which get kept by the pinsetter are moved to the next lane over so the first lane is devoid of anything but new objects, it will only be necessary to traverse the portion of the rooted object graph which has been touched since the last time the pinsetter ran. If an object hasn't been touched since the pinsetter ran, it can't contain any references to anything in the first lane. – supercat Jan 29 '21 at 12:38

12 Answers


Coding in assembly is brutal.

Rogue pointers

Assembly languages rely even more heavily on pointers (through address registers), and unlike in C, you can't rely on the compiler or static analysis tools to warn you about such memory corruptions / buffer overruns.

For instance in C, a good compiler may issue a warning there:

 char x[10];
 x[20] = 'c';

That's limited. As soon as the array decays to a pointer, such checks cannot be performed, but that's a start.

In assembly, without proper run-time instrumentation or formal analysis of the binary, you can't detect such errors at all.

Rogue (mostly address) registers

Another aggravating factor in assembly is that register preservation and the routine calling convention aren't standardized or guaranteed.

If a routine is called and by mistake doesn't save a particular register (besides the "scratch" registers that are known to be trashed on exit), it returns to the caller with a modified register. The caller doesn't expect that, which leads to reading/writing at the incorrect address. For instance, in 68k code:

    move.b  d0,(a3)+
    bsr  a_routine
    move.b  d0,(a3)+   ; memory corruption, a3 has changed unexpectedly
    ...

    a_routine:
        movem.l a0-a2,-(a7)
        ; do stuff
        lea     some_table(pc),a3   ; change a3 if some condition is met
        movem.l (a7)+,a0-a2         ; the routine forgot to save a3!
        rts

Using a routine written by someone else which doesn't use the same register saving conventions can lead to the same issue. I usually save all registers prior to using someone else’s routine.

On the other hand, a compiler uses the stack or a standard register parameter-passing scheme, handles local variables using the stack or other means, and preserves registers where needed, and it's all coherent across the whole program, guaranteed by the compiler (unless there are compiler bugs, of course).

Rogue addressing modes

I fixed a lot of memory violations in ancient Amiga games. Running them in a virtual environment with the MMU activated sometimes triggers read/write errors at completely bogus addresses. Most of the time those reads/writes have no effect, because the reads return 0 and the writes go in the woods, but depending on the memory configuration they can have nasty consequences.

There were also cases of addressing errors. I saw stuff like:

 move.l $40000,a0

instead of immediate

 move.l #$40000,a0

In that case, the address register contains whatever is stored at $40000 (probably trash) and not the address $40000 itself. This leads to catastrophic memory corruption in some cases. Often the intended action happens to be performed correctly somewhere else, so the game still works properly most of the time without this being fixed. But there were times when the games had to be properly fixed to restore correct behaviour.

In C, mistaking a value for a pointer leads to a warning.

(We gave up on a game such as "Wicked" that had more and more graphical corruption the more advanced you got in levels, but also depending on the way you passed the levels and their order...)

Rogue data sizes

In assembly, there are no types. This means that if I do

move.w #$4000,d0       ; writes only the low 16 bits of d0
move.l #1,(a0,d0.l)    ; indexed write via a0 + d0.l, long

only the lower half of the d0 register gets changed. Maybe that's what I wanted, maybe not. If d0 contained zero in its upper 16 bits, the code does what's expected; otherwise it adds a0 and the full 32 bits of d0, and the resulting write is "in the woods". A fix is:

move.l #1,(a0,d0.w)    ; indexed write via a0 + d0.w, long

But then if d0 > $7FFF it does something wrong too, because d0.w is then sign-extended and treated as negative (not the case with d0.l). So d0 needs sign extension or masking...

Those size errors can be seen in C code too, for instance when assigning to a short variable (which truncates the result), but even then you usually just get a wrong result, not fatal issues like the above (that is, if you don't lie to the compiler by forcing wrong type casts).

Assemblers have no types, but good assemblers let you use structures (a STRUCT keyword), which elevates the code a little by automatically computing structure offsets. But a wrong-size read can be catastrophic whether you're using structs/defined offsets or not:

move.w  the_offset(a0),d0

instead of

move.l  the_offset(a0),d0

isn't checked, and gives you the wrong data in d0. Make sure you drink enough coffee while coding, or just write documentation instead...

Rogue data alignment

The assembler usually warns about unaligned code, but not about unaligned pointers (because pointers have no type), which can trigger bus errors.

High level languages use types and avoid most of those errors by performing alignment/padding (unless, once again, lied to).

You can nevertheless write assembly programs successfully: use a strict methodology for parameter passing and register saving, try to cover 100% of your code with tests, and use a debugger (symbolic or not; this is still the code that you have written). That isn't going to remove all the potential bugs, especially the ones caused by wrong input data, but it will help.

user3840170
Jean-François Fabre
  • Yes, this is true. But I'm more interested in the historical experience from that era's development works on real programs, which is why I emphasized "large programs in assembly" in my question. – 比尔盖子 Jan 21 '21 at 10:35
  • I wrote large assembly programs in the 1990s, does that count? I also know of games written in assembly in the eighties/nineties where there were bugs – Jean-François Fabre Jan 21 '21 at 10:39
  • Yes. It certainly counts. – 比尔盖子 Jan 21 '21 at 10:40
  • But an experienced programmer knows these things, knows the tools they're working with, and is careful to review their own code. Being difficult does not mean that mistakes are inevitable. – dave Jan 21 '21 at 13:26
  • exactly. The people who coded those things professionally had a good methodology and experience. And with a lot of testing you can find most of the bugs too. You also need a good low-level debugger – Jean-François Fabre Jan 21 '21 at 14:50
  • "The assembler usually warns [...] not on unaligned data, that can trigger bus errors" Of course it does. It even aligns data for you. I assume what you're referring to are unaligned pointers, something an assembler can't do as it's a runtime value, so checking can only be done during runtime by machine check. Also: "High level languages use types and avoid most of those errors" Well, the same is true for any assembler worth its name. Using the worst of assembler as example gives a somewhat tilted picture. – Raffzahn Jan 21 '21 at 17:09
  • yes I mean unaligned pointers. An unaligned dc.l is detected. yes, you can define structures in assemblers, but that doesn't mean you have to respect/use them. – Jean-François Fabre Jan 21 '21 at 17:43
  • Not worthy of its own answer, but there's also (for back in the 16/32 bit X86 days at least) the joy that was segment registers... still remember spending altogether too much time debugging an assembly program where I was doing stuff with two registers where one of them used a different segment register (I want to say BX and BP, where the latter use the SS register and the former was CS) and forgot about that (as far as size, it was large enough that notepad couldn't edit it and would have been late 90's) – Foon Jan 21 '21 at 18:04
  • @Jean-FrançoisFabre unaligned pointers are a runtime issue, quite hard to protect against - especially if C is the language in comparison. "but that doesn't mean you have to respect/use them." is that a serious statement? Programmers not using the tools given / adhering to their own definitions are an issue independent of language, thus this may not make much sense when judging, as requested, Assembly vs. C, or does it? – Raffzahn Jan 21 '21 at 18:44
  • You sure about that? I did some large assembly projects and almost never had pointer errors. It's dynamic memory that leads to pointer errors, not language choice, and assembly did a better job of managing the memory layout in a way C just couldn't. – Joshua Jan 21 '21 at 20:51
  • Not worthy of an answer, but this question provides some of the motivation for the development of Lisp and related languages. All of the bugs associated with undisciplined use of pointers will be in the interpreter, and this means that debugging the interpreter will have a big payoff. – Walter Mitty Jan 21 '21 at 20:56
  • @Foon: just for the record, SS is the default when [R/E]BP is the base, otherwise DS for all other addressing modes. (ESP/RSP as the base register also implies SS, which is possible with 32 and 64-bit addressing modes. 64-bit mode fixes CS/DS/ES/SS bases at 0, but the exception type for a non-canonical address (#GP general-protection or #SS stack-segment) does actually depend on the addressing mode). And yeah, non-flat memory models are a pain, glad I never had to actually care about it. – Peter Cordes Jan 22 '21 at 05:43
  • And there is also the issue of no memory initialization to 0 (something that civilized frameworks/runtimes do by default); if someone checks for "null pointer" but forgot to init it with zero, you are in for a fun time. – PTwr Jan 22 '21 at 14:42
  • True, but that doesn't depend on the language but on the operating system. You'll have the same issues in C if you don't initialize your global data on those hostile systems. Fun fact: the Ada language guarantees that null pointers are null. the startup sees to it. – Jean-François Fabre Jan 22 '21 at 14:59
  • A lot of the comments in this thread point to the fact that a good architecture is important when programming close to the machine. The pile-up that is the x86 ISA is not programmer-friendly. It seems to me that the perceived difficulty of avoiding wild accesses depends very much on what hardware you've had the fortune (or misfortune) to deal with. – dave Jan 23 '21 at 17:57

I spent most of my career writing assembler, solo, small teams and large teams (Cray, SGI, Sun, Oracle). I worked on embedded systems, OS, VMs, and bootstrap loaders. Memory corruption was seldom if ever a problem. We hired sharp people, and the ones that failed were managed into different jobs more appropriate to their skills.

We also tested fanatically - both at the unit level and system level. We had automated testing that ran constantly both on simulators and real hardware.

Near the end of my career I interviewed with a company and asked how they did their automated testing. Their response of "What?!?" was all I needed to hear; I ended the interview.

jack bochsler
  • Your interview story is a beautiful thing. I'm glad you had the perspective to avoid a mess. – chicks Jan 27 '21 at 18:21

Simple idiotic errors abound in assembly, no matter how careful you are. It turns out that even stupid compilers for poorly-defined high level languages (like C) constrain a huge range of possible errors as semantically or syntactically invalid. A mistake with a single extra or forgotten keystroke is far more likely to refuse to compile than it is to assemble. Constructs you can validly express in assembly which just don't make any sense because you're doing it all wrong are less likely to translate into something that's accepted as valid C. And since you're operating at a higher level, you're more likely to squint at it and go "huh?" and rewrite the monster you just wrote.

So assembly development and debugging is, indeed, painfully unforgiving. But most such errors break things hard, and would show up in development and debugging. I would hazard the educated guess that, if the developers are following the same basic architecture and same good development practices, the final product should be about as robust. The sort of errors a compiler catches can be caught with good development practices, and the sort of errors compilers don't catch may or may not be caught with such practices. It will take a lot longer to get to the same level, though.

RETRAC
  • Think so? I would go rather say the support for errors in C is at least as 'good' as in Assembler, if not better. If programmers don't use the features offered, it can't be blamed on the tool. – Raffzahn Jan 21 '21 at 18:36
  • +1 for "most such errors break things hard". In my Assembler class in college, when running my program I was always ready to dive for the power button whenever my program started doing weird things (random pixels printing across the screen, my computer speaker beeping, etc.), just to avoid the small but real possibility of accidentally triggering dangerous commands such as directly driving the hard drive heads. Never had that happen, but my programs did tend to fail spectacularly. :) – bob Jan 21 '21 at 19:23
  • @bob , I heard that's about how Cthulhu was called the first time. Maybe it was you... – Aganju Jan 23 '21 at 04:59
  • re: A mistake with a single extra or forgotten keystroke is far more likely to refuse to compile than it is to assemble Either way, you'd see it when you read your code after you decided it worked. You do read your code, right? – dave Jan 23 '21 at 18:00
  • @Aganju nice! :) – bob Jan 25 '21 at 14:23

I wrote the original garbage collector for MDL, a Lisp-like language, back in 1971-72. It was quite a challenge for me back then. It was written in MIDAS, an assembler for the PDP-10 running ITS.

Avoiding memory corruption was the name of the game in that project. The entire team dreaded a successful demo crashing and burning when the garbage collector was invoked. And I had no really good debugging plan for that code. I did more desk checking than I have ever done before or since: making sure there were no fencepost errors, making sure that when a group of vectors was moved the target didn't contain any non-garbage. Over and over, testing my assumptions.

I never found any bugs in that code, except for ones found by desk checking. After we went live none ever surfaced during my watch.

I'm just plain not as smart as I was fifty years ago. I couldn't do anything like that today. And systems of today are thousands of times bigger than MDL was.

Walter Mitty

Memory corruption bugs have always been a common problem in large C programs [...] But there was a time when large programs, including operating systems, were written in assembly, not C.

You're aware that there are other languages that were quite common already early on? Like COBOL, FORTRAN or PL/1?

Were memory corruption bugs a common problem in large assembly programs?

This depends of course on multiple factors, like

  • the Assembler used, as different assembler programs offer different levels of programming support.
  • program structure, as especially large programs need to adhere to a checkable structure
  • modularisation, and clear interfaces
  • the kind of program written, as not every task requires pointer fiddling
  • best practice style

A good assembler not only makes sure that data is aligned, but also offers tools to handle complex data types, structures and the like in an abstract fashion, reducing the need to 'manually' calculate pointers.

An assembler used for any serious project is, as always, a macro assembler (*1), thus capable of encapsulating primitive operations into higher-level macro instructions, enabling more application-centric programming while avoiding many pitfalls of pointer handling (*2).

Program types are also quite influential. Applications usually consist of various modules, many of which can be written almost or completely without pointer usage (or with only controlled usage). Again, using the tools provided by the assembler is key to less faulty code.

Next would be best practice, which goes hand in hand with many of the above. Simply do not write programs/modules that need multiple base registers, or that hand over large chunks of memory instead of dedicated request structures, and so on...

But best practice starts early on and with seemingly simple things. Just take the example of a primitive (sorry) CPU like the 6502, with maybe a set of tables, all adjusted to page borders for performance. When loading the address of one of these tables into a zero page pointer for indexed access, using the tools of the assembler means writing

     LDA   #<Table
     STA   Pointer

Quite a few programs I've seen rather go

     LDA   #0
     STA   Pointer

(or worse, if on a 65C02)

     STZ   Pointer

The usual argumentation is 'But it is aligned anyway'. Is it? Can that be guaranteed for all future iterations? What about some day when address space gets tight and the tables need to be moved to non-aligned addresses? Plenty of great (a.k.a. hard-to-find) errors are to be expected.

So Best practice again brings us back to using the Assembler and all the tools it offers.

Do not try to play Assembler instead of the Assembler - let it do its job for you.

And then there is the runtime, something that applies to all languages but is often forgotten. Besides things like stack checking or bounds checks on parameters, one of the most effective ways to catch pointer errors is simply locking the first and last memory page against write and read (*3). It catches not only the beloved null pointer error, but also all low positive or negative numbers which are often the result of some prior indexing going wrong. Sure, the runtime is always the last resort, but this one is an easy one.

Above all, maybe the most relevant reason is

  • the machine's ISA

in reducing the chances of memory corruption by reducing the need to handle pointers at all.

Some CPU architectures simply require fewer (direct) pointer operations than others. There is a huge gap between architectures that include memory-to-memory operations and those that don't, like accumulator-based load/store architectures. The latter inherently require pointer handling for anything larger than a single element (byte/word).

For example, to move a field, say a customer name, around in memory, a /360 uses a single MVC instruction, with addresses and transfer length generated by the assembler from the data definitions, while a load/store architecture, designed to handle each byte separately, has to set up pointers and a length in registers and loop around, moving single elements.

Since such operations are quite common, the resulting potential for errors is just as common. Or, put more generally:

Programs for CISC processors are usually less prone to errors than those written for RISC machines.

Of course and as usual, everything can be screwed up by bad programming.

And how did it compare to C programs?

Much the same - or rather, C is the HLL equivalent of the most primitive CPU ISA, so anything offering higher-level instructions will fare better.

C is inherently a RISCy language. Operations provided are reduced to a minimum, which goes with a minimum ability for check against unintended operations. Using unchecked pointers is not only standard but required for many operations, opening many possibilities for memory corruption.

Take, in contrast, an HLL like Ada: here it's almost impossible to create pointer havoc, unless it's intended and explicitly declared as an option. A good part of this is (as with the ISA before) due to higher data types and handling them in a typesafe manner.


For the experience part, I spent most of my professional life (>30y) on Assembly projects, roughly 80% mainframe (/370) and 20% micros (mostly 8080/x86) - plus a lot more privately :) Mainframe programming covered projects as large as 2+ million LOC (instructions only), while micro projects stayed around 10-20k LOC.


*1 - No, something that merely offers a way to replace text passages with premade text is at best a textual preprocessor, not a macro assembler. A macro assembler is a meta tool to create the language needed for a project. It offers tools to tap the information the assembler gathers about the source (field size, field type, and many more) as well as control structures to formulate handling, used to generate appropriate code.

*2 - It's easy to bemoan that C wasn't fitted with any serious macro capability; it would not only have removed the need for many obscure constructs, but also have enabled much advancement by extending the language without the need to write a new one.

*3 - Personally I prefer to make page 0 write-protected only and fill the first 256 bytes with binary zero. That way all null (or low) pointer writes still result in a machine error, but reading from a null pointer returns, depending on the type, a byte/halfword/word/doubleword containing zero - well, or a null string :) I know, it's lazy, but it makes life much easier if one has to incorporate other people's code. Also, the remaining page can be used for handy constant values like pointers to various global sources, ID strings, constant field content and translate tables.

Raffzahn
  • 222,541
  • 22
  • 631
  • 918
  • 3
    one of the most effective ways to catch pointer errors is simply locking the first and last memory page against write and read yeah, but 1) you need an MMU for that and 2) you're not going to catch all errors, only the ones that are really nonsense. The electricfence linux library is able to protect all allocated memory (not stack) against page boundaries, provided that you perform 2 tests (protection before the buffer, protection after the buffer, different alignment for both) – Jean-François Fabre Jan 21 '21 at 18:35
  • @Jean-FrançoisFabre No doubt, there are lots of improvements possible - much the same way as writing good code is an area as wide as the universe :) Above is only meant as a simple and easy to understand example. There are many more, like making code non writable and so on. Regarding the nonsense, my experience is that these are the most common, and as usual, catching the most common with the least effort is a great gain, isn't it? – Raffzahn Jan 21 '21 at 18:40
  • In many cases, code which doesn't need to accommodate page crossings can be significantly smaller and faster than code that does. Many assemblers for the 6502 can, with suitable use of macros, either be made to arrange things so they won't cross page boundaries, or at minimum squawk if objects would get placed in ways that cross page boundaries, thus allowing the programmer to rearrange things. – supercat Jan 21 '21 at 20:11
  • @supercat Exactly my point. Don't try to be the Assembler, let the Assembler do the work. – Raffzahn Jan 21 '21 at 20:41
  • 1
    Assemblers can't generally determine when code should be designed to support page crossings and when it shouldn't. For the most part, the only "help" I would seek to receive from an assembler when dealing with such issues would be to squawk if its auto-generated memory layout would result in problematic page crossings. In some simple cases, a compiler might be able to auto-select from among a few alternatives to see if any will work, but the effort required to make that work would often exceed the effort to rearrange things if the assembler issues any page-crossing squawks. – supercat Jan 21 '21 at 20:52
  • 1
    @Raffzahn As the RISC designers noted, don't try to be the compiler, either! – RETRAC Jan 28 '21 at 18:12
  • 1
    @RETRAC Exactly. That's why I prefer either a handy assembler that takes away as much tinkering as possible, or a sufficient abstract language, like Ada, doing the same. – Raffzahn Jan 28 '21 at 18:32
7

I have written OS mods in assembly on CDC G-21, Univac 1108, DECSystem-10, DECSystem-20, all 36 bit systems, plus 2 IBM 1401 assemblers.

"Memory corruption" existed, mostly as an entry on a "Things Not To Do" list.

On a Univac 1108 I found a hardware error where the first half-word fetch (the interrupt handler address) after a hardware interrupt would return all 1s, instead of the contents of the address. Off into the weeds, with interrupts disabled, no memory protect. Round and round it goes, where it stops nobody knows.

waltinator
  • 347
  • 1
  • 4
7

You are comparing apples and pears. High level languages were invented because programs reached a size which was unmanageable with assembler. Example: "V1 had 4,501 lines of assembly code for its kernel, initialisation and shell. Of those, 3,976 account for the kernel, and 374 for the shell." (From this answer.)

The. V1. Shell. Had. 374. Lines. Of. Code.

Today's bash has maybe 100,000 lines of code (a wc over the repo yields 170k), not counting central libraries like readline and localization. High-level languages are in use partly for portability but also because it is virtually impossible to write programs of today's size in assembler. It's not just more error prone — it's nigh impossible.

  • It's possible to write assembly-language in ways that are highly maintainable and scalable, if one uses consistent conventions for register usage and function entry/exit. On the other hand, doing so would negate much of the potential efficiency benefit one might hope to receive from using assembly language. Compilers for high level languages can exploit the fact that two string literals may happen to be equal in the source code, but ensure that changing the value of one won't inadvertently change the other. In assembly code, one would either have to use two strings which would each... – supercat Jan 22 '21 at 21:02
  • ...get written into the object code, or else have multiple references to a shared string and ensure that if its value is changed, a new copy is created and only references that should point to the new value are updated to point to that new copy. – supercat Jan 22 '21 at 21:03
  • 2
    @supercat Of course you can code in assembler in a fashion which resembles C (elaborate macros, register usage convention). You end up re-implementing or emulating a lot of what C and its library do -- only without the benefit of C's checks and enforcements ;-). – Peter - Reinstate Monica Jan 22 '21 at 22:45
4

I don't think memory corruption is generally more of a problem in assembly language than in any other language which uses unchecked array-subscripting operations, when comparing programs that perform similar tasks. While writing correct assembly code may require attention to details beyond those that would be relevant in a language like C, some aspects of assembly language are actually safer than C. In assembly language, if code performs a sequence of loads and stores, an assembler will produce load and store instructions in the order given without questioning whether they are all necessary. In C, by contrast, if a clever compiler like clang is invoked with any optimization setting other than -O0 and given something like:

extern char x[],y[];
int test(int index)
{
    y[0] = 1;
    if (x+2 == y+index)
        y[index] = 2;
    return y[0];
}

it may determine that the value of y[0] when the return statement executes will always be 1, and that there's thus no need to reload its value after writing to y[index], even though the only defined circumstance where the write to y[index] could occur would be if x[] is two bytes long, y[] happens to immediately follow it, and index is zero, implying that y[0] would actually be left holding the number 2.

supercat
  • 35,993
  • 3
  • 63
  • 159
  • I'm not sure I understand the point of your example. Are you talking about mixing C with asm, where x[] and y[] are defined in an asm file so you can control their relative layout (e.g. putting x and y contiguous in that order)? If you tell the C compiler about it with a global struct containing two arrays, or a union with two vs. one large array, it would be a lot closer to well-defined behaviour for x+2 to point into y. – Peter Cordes Jan 22 '21 at 05:58
  • Comparing two pointers derived from different objects is undefined behaviour in ISO C, so this optimization is legal whether you like it or not. I thought you could probably get well-defined behaviour for the semantics you seem to want in C by casting the pointers to uintptr_t (https://godbolt.org/z/beq655), but clang still returns an immediate 1 instead of reloading. Not sure if that's a bug or not. – Peter Cordes Jan 22 '21 at 05:58
  • 1
    @PeterCordes: If x is a two-element array, evaluation of x+2 would be defined behavior. If y does not happen to immediately follow x in the address space, or index is non-zero, behavior would be defined as skipping the assignment to y[index] (which is indeed what happens). If index is zero and y does immediately follow x, that situation would seem to be rather explicitly addressed in N1570 6.5.9p6. I wouldn't mind if the Standard said the comparison between x+2 and y+index would yield 0 or 1 in Unspecified fashion, but clang's behavior... – supercat Jan 22 '21 at 16:06
  • ...is inconsistent even with that. Even if a programmer doesn't deliberately place x and y adjacent, they could end up that way. If a program would work correctly if y[0] and the return value are both 1, or if they are both 2, but not if they differ, correctness shouldn't be affected by the proximity of x and y. If the function were written in assembly language, there would be no doubt but that its return value would always equal y[0], but in clang's dialect of C not so much. – supercat Jan 22 '21 at 16:19
  • 1
    Ah you're right, I was assuming the == itself was UB, but N1570 6.5.9p6 confirms that it's fully well defined. So this does seem to be a clang bug. In some sense, that doesn't mean C has this downside, but in practice Implementation bugs can make language features less usable. – Peter Cordes Jan 22 '21 at 16:21
  • @PeterCordes: Both clang and gcc make unsound assumptions that if two pointer expressions compare equal, they may be freely substituted without regard for aliasing sets. In the case of clang, I think this assumption is baked into the LLVM back end since Rust exhibits the same behavior (which isn't justified under the Rust language spec either). Unfortunately, there isn't a term for "C as processed by any compiler that doesn't use a gcc or llvm back end", and it doesn't really matter what the Standard says if commercial compilers consistently support constructs beyond what it requires... – supercat Jan 22 '21 at 16:24
  • ...while clang and gcc can't be relied upon to do everything the Standard does require. The authors of the C Standard explicitly said they did not wish to preclude the use of the language as a form of "high-level assembler", but the only way to make clang and gcc reliable is to disable all optimizations altogether. To be fair to gcc, it can sometimes produce decent code at -O0 if one makes generous use of register qualifiers and other hand-holding, but that's a bit of a pain. – supercat Jan 22 '21 at 16:25
3

Assembler requires more intimate knowledge of the hardware you're using than languages like C or Java do. The truth is, though, that assembler has been in use in almost everything from the first computerized cars and early video game systems, up through the 1990s, to the Internet-of-Things devices we use today.

While C offered type safety, it still didn't offer other safety measures like void-pointer checking or bounded arrays (at least, not without extra code). It was quite easy to write a program that would crash and burn as well as any assembler program.

Tens of thousands of video games were written in assembler, demo compos have been producing small yet impressive demos in only a few kilobytes of code/data for decades now, thousands of cars still use some form of assembler today, as do a few lesser-known operating systems (e.g. MenuetOS). You might have dozens or even hundreds of things in your house that were programmed in assembler that you don't even know about.

The main problem with assembly programming is that you need to plan more vigorously than you do in a language like C. It's perfectly possible to write a program with even 100k lines of code in assembler without a single bug, and it's also possible to write a program with 20 lines of code that has 5 bugs.

It's not the tool that's the problem, it's the programmer. I would say that memory corruption was a common problem in early programming in general. This was not limited to assembler, but also C (which was notorious for leaking memory and accessing invalid memory ranges), C++, and other languages where you could directly access memory, even BASIC (which had the ability to read/write specific I/O ports on the CPU).

Even with modern languages that do have safe-guards, we will see programming errors that crash games. Why? Because there's not enough care taken into designing the application. Memory management hasn't disappeared, it's been tucked into a corner where it's harder to visualize, causing all kinds of random havoc in modern code.

Virtually every language is susceptible to various kinds of memory corruption if used incorrectly. Today, the most common problems are memory leaks, which are easier than ever to introduce accidentally due to closures and abstractions.

It's unfair to say that assembler was inherently more or less memory-corrupting than other languages; it just got a bad rap because of how challenging it was to write proper code.

phyrfox
  • 2,503
  • 1
  • 13
  • 14
2

It was a very common problem. IBM's FORTRAN compiler for the 1130 had quite a few: the ones I remember involved cases of incorrect syntax that weren't detected. Moving to machine-near higher level languages didn't obviously help: early Multics systems written in PL/I crashed frequently. I think that programming culture and technique had more to do with ameliorating this situation than language did.

John Doty
  • 2,344
  • 6
  • 12
2

I did a few years of assembler programming, followed by decades of C. Assembler programs did not seem to have more bad pointer bugs than C, but a significant reason for that was that assembler programming is comparatively slow work.

The teams I was in wanted to test their work every time they'd written an increment of functionality, which was typically every 10-20 assembler instructions. In higher-level languages, you typically test after a similar number of lines of code, which have a lot more functionality. That trades off against the safety of a HLL.

Assembler stopped being used for large-scale programming tasks because it gave lower productivity, and because it usually wasn't portable to other kinds of computer. In the last 25 years I've written about 8 lines of assembler, and that was to generate error conditions for testing an error handler.

John Dallman
  • 13,177
  • 3
  • 46
  • 58
1

Not when I was working with computers back then. We had many problems but I never encountered memory corruption issues.

Now, I worked on several IBM machines (7090, 360, 370, S/3, S/7) and also on 8080- and Z80-based micros. Other computers may well have had memory problems.