34

Writing your own code in assembly is strongly discouraged nowadays since, in most cases, gcc -O3 does magic. But in the ‘80s it was believed that compiled C code took 4(?) or more times as long to run as a well-organized assembly equivalent. When and why did writing C, even for performance-critical code, become the received practice? Which compiler first made it so, and on which architecture?

Are there any high-level language compilers (Ada/COBOL/Fortran/Pascal) outside the C family which generate optimized code that outperforms the average assembly programmer?

user3840170
  • 23,072
  • 4
  • 91
  • 150
Schezuk
  • 3,752
  • 1
  • 17
  • 40
  • Comments are not for extended discussion; this conversation has been moved to chat. – Chenmunka Sep 13 '20 at 17:24
  • 15
    On Unix workstations, as most software was compiled with GCC, the CPUs started to be designed to run code compiled by GCC faster. – Ian Ringrose Sep 13 '20 at 18:19
  • 8
    Never I tell you!, never, never ....... [ :-) ] – Russell McMahon Sep 14 '20 at 10:30
  • 1
    There is no problem with writing assembly by hand except that you are less productive - you can only write a certain amount of code per day - and you can pack more work into higher-level language code than into assembly. So by pure economics it doesn't make sense for humans to write assembly. An exception is code that needs to be absolutely as fast as possible, where an experienced assembly programmer MAY know things that it is not yet possible for a programmer to hint to a compiler. – Thorbjørn Ravn Andersen Sep 14 '20 at 10:51
  • 13
    What's an "average" assembly-language programmer these days, though? Is the average getting better because only the motivated need to do it? – dave Sep 14 '20 at 15:40
  • 6
    my personal experience (video games, so it was all asm, then later moved to C++) is that hand asm was always faster, but became not worth it around the Pentium 3 era, except for some specialized code. The one thing that took compilers a very long time to be good at was generating the FPU code for math operations. – Thomas Sep 14 '20 at 20:30
  • 2
    @IanRingrose "CPUs optimized for GCC" happened long after compilers could beat decent hand-coded assembly. In the early 90s, GCC was effectively non-existent on Sun workstations. But even back then, IME Sun's compiler produced optimized code that was about as fast if not faster than decent assembly - but with a lot less programmer effort required. I'd guess compilers were "winning" widely by the late 80s at the latest. – Andrew Henle Sep 15 '20 at 14:27
  • 2
    I remember Carmack or Romero being interviewed when Wolf 3D came out, and them saying that one line-drawing routine aside, it was all C. At that point in time it was surprising (to me at least) that this was the case. Later I discovered that the Amiga version of Marble Madness was in C. – Alan B Sep 16 '20 at 06:21
  • 3
    @Alan B, it does not really matter what proportion of the routines is in C. What matters is the proportion of time your program sits in that single routine written in assembly. And then the speed-up that can be reached in a well-tuned assembly routine vs output of the optimizing compiler can easily translate into appreciable speed-up of the whole program. – introspec Sep 16 '20 at 13:40
  • 2
    @introspec: I once took some DSP code that had been written entirely in assembly language except for some floating-point math (fixed-point DSP!) which the programmer didn't know how to write in assembly language. I reworked some of the math to use short assembly-language routines for multiply-accumulate, simple FIR filtering, and integer square root, and rewrote everything else in C. The system went from running about 20% as fast as needed to running 5-10x as fast as needed, but with a tiny fraction as much assembly-language code. – supercat Sep 16 '20 at 15:31
  • @supercat, yes, absolutely. No-one is saying that everything needs to be written in assembly; it is usually impractical and unnecessary. However, the performance of even the best compilers at getting these really hot inner loops right is rarely impressive from the point of view of real assembly programmers. – introspec Sep 16 '20 at 16:20
  • 3
    @introspec: Often, getting inner loops right requires knowing things that programmers can't very well convey. For example, suppose one is targeting a typical 32-bit ARM, has a word-aligned group of 256 bytes that will hold values 0..127, wants to decrement all the ones that are non-zero, and would regard it acceptable if the presence of any byte value 128..255 arbitrarily disturbs the values of any other bytes in the region. Is there any portable way to specify such behavior in a way that would allow a compiler to generate code that isn't at least twice as slow as optimal? – supercat Sep 16 '20 at 16:57
  • 1
    @introspec: On the Cortex-M0, loading each word, adding 0x7F7F7F7F [value kept in register], "and-not"ing with 0x7F7F7F7F, shifting right 7, subtracting from the original, and storing it back would take eight cycles per four bytes, not counting loop overhead. Consolidating loads and stores to use LDM/STM would take 26 cycles per 16 bytes while leaving a "low" register available as a loop counter. I can't imagine any compiler coming anywhere near close to that. (A C sketch of this trick appears after this comment thread.) – supercat Sep 16 '20 at 17:02
  • @supercat, My experience with assembly is a lot more retro, so I am best qualified to talk about the Z80. And I think it is an excellent CPU to consider in this context, because it is fairly different from the higher-level language abstractions people are so used to nowadays. E.g., the most efficient paradigm for filling memory on the Z80 is to set up sequences of PUSH instructions, possibly interspersed with commands for reloading CPU registers. This is so unlike the way in which C operates that any form of machine translation cannot possibly map onto this paradigm well. – introspec Sep 16 '20 at 20:18
  • @introspec: IMHO, the Standard's requirement that C implementations support recursion really sank the quality of C compilers for that platform. On platforms where recursion would be totally impractical (e.g. 8051 or PIC) linkers can statically allocate local variables so that functions which are never live simultaneously can store their variables in the same place. If one avoids using stack-allocated variables, the Z80 can be almost workable as a C platform (I've used it), though having 8-bit operations require values in A while 16-bit operations require values in HL makes many things... – supercat Sep 16 '20 at 20:23
  • @supercat, recursion definitely was one of the key issues. However, I think that the issue is much deeper than that. In demomaking for Z80 we have some very successful patterns like POP HL : LDI (which is a LUT translating specially-formed pairs of values into bytes) or its slightly more complex variation POP HL : LDI : LD A,(HL) : LD (BC),A. To write code like this you actively move stack pointer, pre-allocate specific parts of memory to optimize LUT access based on the address ranges that you will need to use, etc, etc. It is really a very different mode of thinking. – introspec Sep 16 '20 at 20:33
  • ...much clunkier than they really should be. Incidentally, I find myself wondering if the design of the Z80's new instructions was done at a time when the chip was expected to have an 8-bit ALU, and if the decision to use a 4-bit ALU was based upon the adequacy of a 4-bit ALU for the instructions inherited from the 8080? A lot of the instruction set decisions would make sense given an 8-bit ALU, but the 4-bit ALU imposes such a severe time penalty as to seriously degrade their usefulness. – supercat Sep 16 '20 at 20:33
  • @introspec: Agreed, but code using IY-based automatic objects is so much slower than code using statically-allocated ones that moving from the latter to the former would save more execution time than any further improvements one could make beyond that. BTW, I find it sad that LDI-family instructions decrement BC rather than just B. Decrementing just B would have allowed execution to be two cycles faster, and even if code had to follow an LDIR with DEC C/JP NZ, an extra 14 cycles every 256 bytes to manage the high-byte count would have been a lot cheaper than 2 cycles every byte. – supercat Sep 16 '20 at 20:37

15 Answers

55

As a former professional assembly language programmer I would say that by the late 1980s a number of C compilers had become available whose output was as good as what a skilled assembly programmer could produce. I used various x86 C compilers around then; JPI C and WATCOM C in 1988, and MSVC 1.0 in 1994, produced output as good as anything I could write and even taught me the occasional trick.

Cecil Ward
  • 731
  • 5
  • 4
  • 2
    Watcom had a truly fantastic optimizer by the early-to-mid 1990s that I would have put up against almost any assembly language programmer, but I wasn't using their original tools from the late 1980s, so I didn't know if it was always so awesome. – Cody Gray - on strike Sep 14 '20 at 20:54
  • 4
    I spent quite a bit of time looking at output from Borland Turbo C 2.0 (1988) and that compiler varied wildly between smart and dumb. Some tricks (x % 8 -> and ax,7, x = !x -> neg ax; sbb ax,ax; inc ax) were ever-present, but inefficiencies (repetitive wasteful calculations of bx when doing a lot of struct accesses, making bad decisions about stack vs. si/di for frequently-used local vars...) cancelled out the benefit somewhat. – smitelli Sep 15 '20 at 13:42
  • 7
    @smitelli: GCC is still like that. When targeting the Cortex-M0, even if code loads a constant into an automatic variable before a loop, and there would be a register available to hold the value throughout the loop, gcc will still sometimes not only apply constant propagation to replace the variable with a constant, and then reload the constant on every iteration of the loop, but even end up adding an extra register move on top of that. – supercat Sep 15 '20 at 15:33
  • @smitelli: The choice of whether to use stack vs SI/DI would often depend upon whether variables were declared 'register'. Suitable use of that qualifier can make a huge difference when using Turbo C (and also, incidentally, when using gcc with the -O0 flag). On some occasions, code which makes good use of the register qualifier may be more efficient when compiled with -O0 than it would at any other optimization setting, because using -O0 will prevent gcc from making a counter-productive optimization. – supercat Oct 21 '21 at 16:51
  • 1
    One thing that the optimising compiler won't do is self-modifying code, so sometimes it's still worth writing some inner loops in assembly if you think you can make use of it. – Sam May 01 '23 at 11:29
40

For a start, it is widely known that the original FORTRAN compiler for the IBM 704 was specifically designed to generate object code of high enough quality to make assembly programming of numerical routines unnecessary. FORTRAN compilers have largely kept up that legacy ever since.

C compilers have historically varied in quality a great deal. It must be remembered that C was originally designed as a sort of "portable assembly language" with a reasonable correspondence to the instructions and addressing modes of the PDP-11. Suitably written C with even a simple compiler could be remarkably efficient. But object code produced by some early compilers, particularly for microcomputer platforms such as the PC, was unreasonably bad.

Today, even with the sophisticated compilers now available, it is still usually possible for a skilled assembly coder to write better code than a compiler produces. They may use instructions that the compiler does not know how to use, or understand the algorithms more deeply than can be expressed in C. At a minimum, they can start with the output of a compiler and improve from there.

The advantage of the compiler is that it generates code more quickly, i.e. with less developer effort, and the source code is easier to maintain. Today's sophisticated compilers help to reduce the performance deficit that traditionally went along with that. But sophisticated compilers are not new.

Reid
  • 103
  • 2
Chromatix
  • 16,791
  • 1
  • 49
  • 69
  • Comments are not for extended discussion; this conversation has been moved to chat. – Chenmunka Sep 14 '20 at 15:27
  • Sometimes the assembly programmer can take advantage of knowledge of the problem that the compiler can't know. For example (n * 103) >> 10 is equivalent to n / 10, if you know that n is an integer in the range 0 to 99 (a quick check appears after these comments). Compilers know lots of tricks like that, but they don't know that your input is restricted. – Mark Ransom Sep 29 '21 at 16:14
  • @MarkRansom: Modern compilers do understand range restrictions. See e.g. https://stackoverflow.com/questions/40447195/can-i-hint-the-optimizer-by-giving-the-range-of-an-integer – MSalters Oct 22 '21 at 09:58
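
The identity in Mark Ransom's comment above is easy to check exhaustively; a throwaway test, assuming n really is restricted to 0..99:

    #include <assert.h>

    int main(void)
    {
        /* (n * 103) >> 10 only matches n / 10 on the restricted range,
           which is exactly the kind of fact a compiler cannot assume unaided. */
        for (int n = 0; n <= 99; n++)
            assert(((n * 103) >> 10) == n / 10);
        return 0;
    }
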
29

Compilers started generating more efficient code than the average assembler programmer the moment architectures became so complex that the assembler programmer was no longer able to cope with all their details. Things like which instruction should go to pipe U or pipe V of a Pentium processor, etc.

Which one was the first? I'd say, for the x86 architecture, it was the Intel Compiler. At least it was the first one (ttbomk) that was able to detect candidate loops for vectorization and use MMX, SSE and AVX instructions with them.

However, the Watcom C compiler had a reputation for generating very good quality machine code in the days before the Pentium and even the 486. Some of the optimization options I later saw in the Intel compiler were already available in Watcom.
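
As an illustration (my own sketch, not output from any particular compiler), the kind of loop such early auto-vectorizers could map onto SSE instructions looks like this:

    /* Unit stride, a trip count known before the first iteration, and no
       cross-iteration dependences: a vectorizer can process four floats
       per SSE instruction. The restrict qualifiers promise the compiler
       that x and y do not overlap. */
    void saxpy(float *restrict y, const float *restrict x, float a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }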

mcleod_ideafix
  • 18,784
  • 2
  • 70
  • 102
  • AVX didn't exist until after other compilers (like GCC) gained the ability to auto-vectorize. But yes, good point for MMX and SSE. Current ICC still has some auto-vectorization capabilities that current gcc and clang lack, e.g. loops like memchr where the trip count isn't known / calculable ahead of the first iteration. GCC / clang still can't vectorize search loops, but ICC can. – Peter Cordes Sep 14 '20 at 22:46
  • The specific features involved in the complexity were probably out-of-order processing and superscalar design. – Mark Sep 15 '20 at 01:34
  • I like that this answer doesn't treat the compiler in a vacuum. As compilers were getting smarter, CPUs were getting more complex to program for. – Cort Ammon Sep 15 '20 at 15:55
  • A good thing about this example is it does highlight the level of detail an assembly programmer may need to know for the Pentium. As well as the differences between the U and V pipes in terms of integer instructions other questions would be do you try to interleave simple and complex integer instructions to keep them both fed and is the SIU quicker than the CIU for instructions they can both handle? – Single Malt Sep 26 '20 at 18:43
  • you mean i'm not supposed to read the entire 12,900 page reference manual for ARM v8-A in order to program my apple M2? – don bright May 04 '23 at 00:43
22

I came across an interesting comment a few days ago that Donald Knuth was deeply impressed when he discovered that 5 * 5 - 25 was optimised by an (ALGOL?) compiler to a register clear. That would have been in the late 1950s.

Frances Allen's book on optimisation was published in 1972. I agree that a lot of 1980s-era PC compilers produced poor code, but they were also notable for (a) being cheap and (b) not assuming the availability of an arbitrarily-large amount of memory which would have allowed them to optimise an arbitrarily-complex expression (let alone attempting to optimise across expressions).

I'd also note hearing a comment in the late 1980s that some of the most efficient compilers were for Modula-2, since the source language gave the compiler sufficient hints as to what was expected. The Topspeed compilers (written largely by George Barwood) were pretty good.

So I think a reasonable answer would be that in principle a compiler could approximate the efficiency of a human programmer in the early to mid 1970s, provided that the user paid enough for the compiler and provided that the host on which it ran had sufficient resources.

Mark Morgan Lloyd
  • 2,428
  • 6
  • 22
  • 7
    The DEC BLISS compilers got to be pretty good; somewhere around 1981, I ran a test where I rewrote some of my own MACRO-32 in BLISS-32, and the code was only 1.7 times larger, which I thought was fair enough. I read an interview with Ritchie in which he said that if DEC had let him have the BLISS-11 compiler, they would not have needed to invent C. But DEC was pretty close-fisted with BLISS. – dave Sep 12 '20 at 23:08
  • 3
    As a counterpoint, I know that at least one Modula-2 compiler from the 1980s (for the ARM) was excruciatingly bad. On a CPU which had an integer multiply instruction and a rich register set, it would emit 20 instructions followed by a subroutine call to do an integer multiply. This was a big factor in the selection of "Arthur" (written directly in assembler) rather than its more sophisticated competitor for the Acorn Archimedes; Arthur could do things in 256KB that the other one couldn't do in 4MB. – Chromatix Sep 13 '20 at 07:35
  • 2
    @Chromatix And it cost how much and wanted what resources? Those were the days in which if you were serious you bought yourself a Logitech compiler hosted on a VAX... none of this PC 640K crap. Which leads to the extreme case of Stallman's observation on the Pastel compiler: it built (and presumably optimised) the entire parse tree in memory before generating code. In the middle is the interesting case of Tree Meta, where the syntax equations contained an explicit directive to specify the point at which the tree should be "unparsed" to machine code. – Mark Morgan Lloyd Sep 13 '20 at 09:06
  • @MarkMorganLloyd Cost was a real factor here - a 4MB machine was simply unaffordable at that time (circa 1986) though it could be built, and the way it swapped incessantly under even a light load meant users would not be getting what they paid for. "Arthur" - short for "A RISC operating system by Thursday" - did more in practice with far fewer resources, and allowed useful, fast programs to be written in interpreted BASIC. And you can run an updated version of it on a Raspberry Pi. – Chromatix Sep 13 '20 at 10:25
  • 2
    @Chromatix Exactly. But I think my reply still stands: provided that you had the resources, you could have an efficient compiler in the mid-70s. I'd suggest that even with the improved techniques available today, if you built a machine with 64K you'd be hard pushed to improve on the efficacy of something like CP/M Turbo Pascal. – Mark Morgan Lloyd Sep 13 '20 at 10:53
  • The gist of these comments seems to be that the quality of compilation depends on what you've got to run the compiler. – dave Sep 13 '20 at 18:01
  • @another-dave - yes - and back then there were no "compiler frameworks" - multi-targeted compilers - e.g., the gcc family much less the llvm family. A manufacturer (or a university, and in some few cases an ISV like Softech or Intermetrics) provided compilers and each was a one off. E.g., DEC had compilers for multiple languages for its entire product line and barely had shared code generators between languages for the same machine - they certainly didn't have a global optimizer that could be targeted to different architectures. And they didn't even have common compiler dev teams! – davidbak Sep 13 '20 at 18:21
  • The industry is certainly littered with the remains of companies who've attempted to implement a compiler family with shared code. I think Digital Research got as far as having a couple of compilers (Common User Back End possibly?) but their quality was notoriously bad. I think JPI/TopSpeed was about the first successful attempt, and they failed because they overestimated the market... I've been told a very interesting inside story but I can't repeat it. – Mark Morgan Lloyd Sep 13 '20 at 19:24
  • @davidbak: Even today, I think the concept of multi-architecture optimizations still has trouble. When targeting a Cortex-M0, if code loads a constant into a register-qualified automatic object before a loop, gcc with optimizations disabled will let the constant sit in the register through all iterations of the loop, but enabling optimizations will sometimes cause it to reload the constant on every loop iteration. Propagating the constant into the loop may be helpful on some architectures, but on the popular Cortex-M0 platform it's horrible. – supercat Sep 14 '20 at 03:20
  • @supercat - I believe you're correct. And even more correct when you look into SIMD-style vectorization, by the compiler. (Basically, you've got to do it yourself if you want SIMD on multiple architectures, e.g., ARM + x86.) – davidbak Sep 14 '20 at 03:26
  • @Chromatix Do you know the history of that Modula-2 compiler? Sounds exactly what I would expect from a ... non-committed port of a compiler for another platform. – Thorbjørn Ravn Andersen Sep 15 '20 at 12:49
  • @ThorbjørnRavnAndersen I was vaguely aware of it, but I don't think a copy ever crossed my path while I was at GM in Devon. I suspect that it was one of the class of compiler which didn't really use disc for anything, which even in the mid-80s was still fairly common for things like the BBC Micro and other home computers. Needless to say, that implied rather severe resource constraints... – Mark Morgan Lloyd Sep 15 '20 at 13:15
  • When I worked at Psion in the 80s-90s, targeting the 8086, we first used Turbo C then Watcom a little (superb) and finally switched everything to JPI C. We couldn’t use Watcom without inspecting generated code by hand because unfortunately its output was incompatible with our architecture. Storing the value of segment registers was illegal in our o/s because segment register values could be modified by interrupts as the o/s moved memory around while it was being accessed. Watcom occasionally did things like push ds / pop es iirc, which was verboten (Poss was in switch statements, I forget). – Cecil Ward Oct 22 '21 at 17:58
  • I was aware of Psion's use of JPI/TopSpeed, and there were hints of either a special compiler mode or possibly a special set of project defaults... weren't you using one of the V-series chips? By then it would have been v3, I remember George Barwood's binary patch that told the v1 code generator to not use segment registers which allowed them to port their tools to OS/2. – Mark Morgan Lloyd Oct 23 '21 at 18:16
13

The primary advantage that C would have over an assembly-language programmer is the ability to adjust generated code to deal with changes to various constants. When using a quality compiler, if one writes "unsigned foo = bar/8;" a compiler can generate a shift instruction, but if the constant would later need to be 5, a compiler can switch to using other means of performing the computation, such as a combination of a multiply and shift. If, by contrast, the code had been written in optimal assembly language, changing the divisor would require more substantial changes to the code.
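
As a sketch of that point (the standard reciprocal-multiplication technique; the exact constants vary by compiler and word size): dividing by 8 is a single shift, but when the constant changes to 5 a compiler can silently switch strategy, where hand-written assembly would need reworking.

    #include <stdint.h>

    uint32_t div8(uint32_t bar) { return bar >> 3; }

    /* 0xCCCCCCCD is ceil(2^34 / 5), so the top bits of the 64-bit
       product hold bar / 5 exactly for every 32-bit bar. */
    uint32_t div5(uint32_t bar)
    {
        return (uint32_t)(((uint64_t)bar * 0xCCCCCCCDu) >> 34);
    }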

Otherwise, while the makers of some free compilers may like to claim that their compilers can do as well or better than assembly languages, and while they may find some "clever" optimizations that occasionally allow them to do so, they routinely generate code which is significantly worse than would be expected from any competent assembly-language programmer. For example, when targeting the popular Cortex-M0 microcontroller, gcc will process the use of a constant within a loop by generating code that reloads the constant every time through the loop. Even if the constant is loaded to a register-qualified object before the loop, gcc will still defer the load until the value is used, and re-execute the load on every iteration through the loop.

supercat
  • 35,993
  • 3
  • 63
  • 159
  • Besides changing constants, switching to a different inlining strategy must be fun. – Holger Sep 14 '20 at 07:07
  • 2
    @Holger: I should perhaps have been a bit more general about the kinds of changes that C code readily accommodates, but the basic point is that hand-written machine code which is designed for a particular precise usage case can often substantially outperform compiled code. Writing assembly code to be more broadly useful often requires sacrificing performance. If one were writing an assembly-language loop which needs to perform some task 120 times today, but may need to do it 126 times tomorrow, one might unroll it 8x but have to include logic that could add an extra 1-7 repetitions... – supercat Sep 14 '20 at 14:54
  • 2
    ...if needed (determined at runtime) while a compiler could automatically switch between unrolling strategies to yield optimal results for any constant number of repetitions. As noted, however, while the free compilers employ some rather complex and clever optimizations, they are also prone to miss some low hanging fruit. There's IMHO something philosophically wrong with a compiler that simultaneously makes complex and unsound inferences based upon pointer equality, but misses as much low-hanging fruit as clang and gcc do. – supercat Sep 14 '20 at 14:59
  • 2
    Recognizing, for example, that the sequence "subtract; compare with zero; jump if not equal" may be replaced with "subtract; jump if not equal" is not an obscure optimization; nor is recognizing that there is no need to sign-extend a register whose upper bits are never going to be used. If gcc had a mode which would apply that sort of optimization, identify what variables or constants would be eligible for a "register" qualifier, and automatically apply that qualifier to the ones that are used the most, that would reap more than half of the execution time savings that would be available... – supercat Sep 14 '20 at 15:03
  • 2
    ...for many programs, without sacrificing compatibility with programs written for low-level compilers that will process constructs "in a documented fashion characteristic of the environment" even when they are outside the Standard's jurisdiction. – supercat Sep 14 '20 at 15:05
  • 4
    If the described behavior is generally reproducible, then it's a major defect in the GCC ARM code-generator and you should really file it as a bug with the GCC folks. – Cody Gray - on strike Sep 14 '20 at 20:56
  • @CodyGray: Unsound optimizations based on pointer equality have been reported to the maintainers of clang and gcc many versions ago, but still remain. See https://godbolt.org/z/1sTv7T for a minimalist example of the problem. Note that evaluation of x+1 is well-defined as yielding a pointer "just past" the array, and the Standard explicitly addresses the scenario where a just-past pointer is compared to the starting address of whatever object happens to immediately follow it. So far as I can tell, the maintainers of clang and gcc view the fact that the Standard specifies... – supercat Sep 14 '20 at 21:20
  • ...behavior in that case as being a defect in the Standard that needlessly impairs optimizations. I could see some usefulness to allowing implementations that do not claim to be suitable for low-level programming to treat such comparisons as arbitrarily yielding 0 or 1, chosen in Unspecified fashion, but no usefulness to allowing implementations to use the fact that a comparison returns true as justification for assuming that a pointer that compares equal to x+1 cannot possibly have been produced by taking the address of y. – supercat Sep 14 '20 at 21:22
  • @CodyGray: As for the failure to optimize subs/cmp #0/bne, I'd rather have compiler writers prioritize the generation of correct code than worry about performance. – supercat Sep 14 '20 at 21:46
  • @supercat I’m not sure whether you got me wrong. My comment was agreeing, just adding that it's not only constants that can easily be changed in C code. Especially with the old machines (thinking of my Amiga/m68k experience), the next CPU generation could have a substantially bigger code cache, making a different loop-unrolling or function-inlining policy more efficient, which means just recompiling with different options for C code, but may need substantial rewriting of assembly code. And yes, for a particular piece of code and a particular target architecture, handwritten asm may still outperform the compiler. – Holger Sep 15 '20 at 07:20
  • @Holger: Unless one releases multiple versions of an executable code program for people whose CPUs have more or less cache, or the customers of a program will rebuild it themselves, a compiler's ability to exploit target architectural features will be limited to those the programmer and compiler explicitly decide to target. JIT-based systems seem superior from that perspective. Compiled code is something of a middle ground, but in many cases achieving optimal performance on machines with different kinds of cache, etc. will require using different algorithms on different targets. – supercat Sep 26 '20 at 17:18
  • @Holger: I wish that rather than try to push C to accomplish the kinds of tasks for which FORTRAN was designed, people wanting high-performance computing would focus their efforts on creating a language which would give more algorithmic flexibility to compilers, among other things by expressing the concept "If X happens, this program will not be able to execute usefully, but must still behave in tolerably useless fashion", along with expressing what range of actions would be "tolerably useless" versus "intolerable". If e.g. one is trying to perform a stress physics simulation... – supercat Sep 26 '20 at 17:21
  • ...for a bridge, a program which will normally take 10 minutes to run, but will 1% of the time simply hang may be superior to one that always runs successfully but takes 15 minutes, and infinitely more useful than one which would never hang, but 0.1% of the time produce answers that are seemingly valid but wrong. Compilers can only be reasonably expected to generate the most efficient code meeting requirements if they have a full description of what the requirements actually are. C, however, lacks any constructs to make distinctions between tolerable versus intolerable actions, which... – supercat Sep 26 '20 at 17:32
  • ...form an essential part of many programs' requirements. – supercat Sep 26 '20 at 17:35
12

There's another factor going on here, also, that I have noticed in examining compiler output vs what I would have written (admittedly, I haven't done enough assembly to be a real expert at it):

Given what the compiler knows, I have been impressed at how efficiently the code was generated. However, in every case I have examined, I could have done better because I knew things about the problem that the compiler didn't.

Loren Pechtel
  • 301
  • 1
  • 5
  • Which I think reinforces what I said about languages that give sufficient hints to the compiler to let it get the optimisation right, and opens the question of how a language can best solicit advice without insisting that the programmer use predefined functions (written in a particular style and heavy with hints) or litter his code with pragmata. – Mark Morgan Lloyd Sep 14 '20 at 06:32
  • 2
    A key insight here might be that a good C programmer can write code which is optimizer-friendly. There's very little in real-world problems that can be expressed only in assembly. One hint in this regard is restrict in C99 - it was added to make code optimizer-friendly, but in the intervening 20 years no similar extensions have been added. (A small sketch follows these comments.) – MSalters Sep 14 '20 at 09:14
  • @MSalters: Unfortunately, some compilers make it annoyingly difficult to write code that is optimizer friendly. For example, clang and gcc process restrict in such a way that within a statement like if (p==q) doSomething(p); the value passed to doSomething won't necessarily be recognized as being based upon p. Although the Standard's definition of "based upon" is sufficiently ambiguous that such behavior might be conforming, nothing in the Standard would indicate any intention to forbid comparisons between restrict-qualified pointers and anything else. – supercat Sep 14 '20 at 19:03
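
A small sketch of the aliasing guarantee MSalters mentions (hypothetical function, for illustration): without restrict, the compiler must assume the store through out may modify in[0] and reload it every iteration; with restrict it can keep the value in a register.

    void twice_first(float *restrict out, const float *restrict in, int n)
    {
        for (int i = 0; i < n; i++)
            out[i] = in[0] * 2.0f;   /* in[0] can be hoisted out of the loop */
    }
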
7

Never. That's my short and provocative answer. The code generated by the compiler was chosen by a programmer, and the optimisations it applies can also be applied to assembly. Given unlimited time and resources, a programmer will always be able to generate better code than the compiler. The question is whether it is worthwhile to try to overcome the limitations of the compiler or not. There is a limit a compiler cannot break that a human can: the compiler has to conform to certain constraints (ABI, UB, calling conventions, register usage, etc.) that the human can decide to violate.

Patrick Schlüter
  • 4,120
  • 1
  • 15
  • 22
  • 5
    The OP wanted a comparison for the "average" programmer though. The advantage for a compiler is that the compiler knows all the tricks a truly expert programmer does, and the "average" programmer may not. – Graham Sep 14 '20 at 13:09
  • 6
    You’re wrong. Specify any amount of time, you spend the time on assembler code, I spend the same time on writing C code, my code will run faster. Because writing assembler code takes so much longer, I’ll have better algorithms. – gnasher729 Sep 14 '20 at 14:14
  • 3
    @gnasher729 In a challenge like that, the assembler champion could spend a small fraction of the time to write a simple C implementation to get most of the optimizer's benefit, then spend the rest of the time tuning and profiling without having to worry about spurious pessimization or other compiler/optimizer oddities. Writing C is faster than writing assembly in general, and I'd agree that most manual optimization costs more than it saves/earns/is worth, but I wouldn't dismiss the abilities of a decent assembly programmer. – Extrarius Sep 14 '20 at 14:30
  • 1
    @gnasher729 you misread my answer. You restate what my clumsy prose stated with "giving unlimited time and resources". Most of the time there is no point in writing in assembly because it doesn't make sense commercially. Extrarius understood the point. It's cheaper (time-wise and effort-wise) to write C (or any other HL language), but the code generated is far from perfect, and any decent assembly programmer can extract quite an improvement. People also quite often overestimate the quality of optimization of compilers. – Patrick Schlüter Sep 14 '20 at 15:14
  • 4
    This seems a fair answer. The original question claims, without presenting evidence, that some unspecified C compilers (on unspecified ISAs) can generate faster code than some unspecified "average" assembly-level programmer. – dave Sep 14 '20 at 16:00
  • 5
    While I agree this is a fair answer, I submit it's wrong. On x86 at least, I believe the question's claim is very true. The average assembly-language programmer does horribly inefficient things on modern x86, mistakes that a compiler would never make. You need a real wizard of an assembly-language programmer who is also an expert on the x86 microarchitecture in order to beat a good optimizing compiler. Those wizards exist, but they certainly aren't "average". This is all beyond the simple "time it takes to write" metric that gnasher and others are talking about. @another-dave – Cody Gray - on strike Sep 14 '20 at 20:59
  • giving unlimited time and resources to the programmer, he will always be able to generate better code - that's the key point. All programming tasks put some value on development time. When we say "an average programmer's code", we mean code they'd actually take the time to write, given their motivation / job pressure to get something working, not always to explicitly spend time optimizing. – Peter Cordes Sep 14 '20 at 23:03
  • 1
    But if you start with C compiler output, it's possible to beat a compiler by using that as a starting point and fixing missed optimizations in hot loops. With careful benchmarking you can avoid making it worse, and sometimes find or stumble into optimizations. But from scratch could easily be worse, as @Cody said. To be fair, modern x86 is really pretty robust and free of "glass jaws" if you follow the basic recommendations of Agner Fog's optimization guide. If you don't shoot yourself in the foot with store-forwarding stalls, OoO exec can work its magic. (There are some pitfalls though...) – Peter Cordes Sep 14 '20 at 23:06
  • Also, C UB (Undefined Behaviour) is not a constraint on the compiler; it's exactly the opposite. It's an opportunity to make asm that makes assumptions (namely that UB never happens). signed overflow UB is what allows widening an int loop counter to pointer width, instead of redoing sign-extension every iteration, for example. https://blog.llvm.org/posts/2011-05-13-what-every-c-programmer-should-know/ explains more. – Peter Cordes Sep 14 '20 at 23:10
  • 6
    @CodyGray - focus on x86 is a modern disease; in retrocomputing SE we should take a wider view. – dave Sep 15 '20 at 00:47
  • I would assume that this answer is wrong. Even an assembly language programmer should be unable to compete with someone who has the sole task of optimizing the assembly output of a compiler. Sure there's likely cases in which the assembly programmer is better, but those should be extremely rare. – xyious Sep 15 '20 at 17:37
  • 1
    I think that this may well be the best answer to the given question. It is obvious that compilers can do useful optimizations, some of which may be pretty laborious. What a lot of people seem to be missing is that assembly is a different language from the high level language of your choice and that the truly optimal expression of the algorithm is invariably lost in translation, however brilliant the translator is. And, frankly, most of the translators are just not that brilliant. – introspec Sep 15 '20 at 22:39
  • @PeterCordes: In many situations where multiple corner-cases behaviors would be equally acceptable, jump-the-rails UB prevents programmers from letting compilers choose among them. If e.g. the behaviors of both (int)((long long)int1*int2)/int3); and ((int)(1u*int1*int2) / int3) would be acceptable in cases where int1*int2 would exceed the range of int, letting a compiler choose could offer major speedups versus having a function always use the first or always use the second, but if the expression int1*int2 would jump the rails in case of overflow, a programmer would... – supercat Sep 16 '20 at 15:38
  • ...be required to pick one and force the compiler to use it even when the other would be much faster. – supercat Sep 16 '20 at 15:39
  • "There is a limit a compiler cannot break that a human can" -- I rather think it's the other way around. Compilers can use an arbitrary amount of "short-term memory" with perfect recall; humans cannot. And rules can be enforced or relaxed for compilers just as they are for human developers. – jeffB Sep 26 '20 at 19:29
  • There comes a point where the time used by hand assembling a piece of code exceeds the time saved in every single run of the program ever and it's surprisingly easy to cross that particular line. – JeremyP Sep 30 '20 at 12:51
  • @JeremyP: Conversely, compilers spend more time trying to optimize many pieces of code than those pieces of code will spend executing. Further, the knowledge that machines can have about applications' corner-case requirements is fundamentally limited, and in many cases it would be impossible to reliably generate optimal code meeting requirements without knowing what those requirements actually are. – supercat Oct 10 '20 at 21:58
  • 3
    This is the correct answer. the reason is that as assembly language is now used only when the compiler is inadequate, today's 'average' assembly language code must be better than a compiler (otherwise why use it?). – Bruce Abbott Sep 29 '21 at 09:38
7

My eureka moment was in the late 80's (88 or 89) when a Senior developer on our team decided that a hand-coded assembly language routine he needed to change must be rewritten in C first. At that time we were using the WATCOM C compiler. The immediate result was that the compiled C version was 20% smaller. I no longer recall what the speed difference was.

That day I sent an email to WATCOM's top developer on the C compiler reporting the result, and claimed that I'd never write another routine in assembly language. I still haven't, although with the rise of Arduino and tiny microprocessors, I would no longer rule it out.

  • That's pretty kewl. – Tomachi Sep 15 '20 at 11:41
  • I did a fairly big hand-coded assembler project on an ATtiny88, and a C compiler for sure couldn't have made it fit, but it had a stack leak which took me one day to find, so I strongly advise against it. – Janka Sep 15 '20 at 22:26
  • 2
    Seriously, how is this even an argument? The assembly code may have been optimized for speed, so the fact that an optimizing compiler produced smaller code is frankly meaningless without a side-by-side comparison of both speed and size of both routines. – introspec Sep 16 '20 at 13:44
  • @introspec one surprising finding from compilers that could optimize for either speed or size via a switch was that programs optimized for size tended to be faster as well. – Mark Ransom Sep 30 '21 at 19:08
  • @MarkRansom, as an experienced assembly programmer, I believe it to be almost never true. You can almost always get a bit of extra speed by sacrificing some memory. So what you are observing is not an indication of the efficiency of small programs but, conversely, an indication of the compiler's inability to optimize for speed all that well. – introspec Sep 30 '21 at 22:03
  • @introspec see for example Why does GCC generate 15-20% faster code if I optimize for size instead of speed?. I don't think that was the first time I heard of it either. – Mark Ransom Sep 30 '21 at 22:48
  • @introspec the actual reasons why the speed-optimized code was slower are irrelevant, unless it was a temporary bug that was long ago fixed. And as I said that wasn't the only example I've seen, just the first one to pop up when I did a search. – Mark Ransom Oct 01 '21 at 15:20
7

It's really a cost/benefit problem. Hand-optimized assembly could still be faster, as you're optimizing for a specific code path, not a more general one. That being said, each iteration of a compiler could make better decisions and generate tighter code with less room for further optimization. At some point, the extra few instructions that could be saved are not worth the time/cost to hand-optimize. There was a time, I believe the early 90's, when we were using partial assembly. Some routines were hand-optimized assembly for critical performance, but most were done in higher-level languages. Eventually, those hand-optimized assembly routines were re-coded into higher-level languages as chips became faster and the need for performance gains was reduced.

As recently as a few years ago I dusted off my wizard's cap and robes and hand-coded a tiny inline ASM routine to perform a simple transformation... more because I could shave a few tics off of a routine that was being called in a tight loop and could manage the registers myself. The end result was something that outperformed a similarly coded C routine by approximately a factor of two (although we are talking tics). It is possible that a future version of the compiler could generate tighter code and/or new processor technologies would further reduce any noticeable gains.

skamradt
  • 171
  • 2
  • In many cases, figuring out the most efficient sequence of machine code instructions to accomplish a task which is simple and repetitive, but doesn't match any of a compiler's built-in constructs, will often be easier than trying to find a piece of C code which can achieve the same output even if the latter is possible. Ironically, the hard part is often figuring out how to make the build system integrate the machine code into the rest of the project. I wish there were a standard syntax to express the concept "put this structure into code memory and treat it as a function" on platforms... – supercat Sep 26 '20 at 17:09
  • ...where that would make sense, since an assembler that could produce output in that format would make it possible to use assembly code in a manner that would allow rebuilding with any tool set in cases where one didn't need to change the machine code, and would allow one to use e.g. a browser-based cross-assembler in cases where one did need to change it. – supercat Sep 26 '20 at 17:11
6

I would say that the big change in compiler tech came around 1988, with the invention of static single assignment form (SSA).

Before this, optimizers could often be beaten by humans because the optimizer struggled to prove that a given suboptimal code form had the semantics needed for an optimization to apply. SSA changed this, making lots of powerful optimizations (such as value numbering) cheap to apply.
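
A minimal illustration (my own example; the SSA names in the comments are the compiler's internal view, not C source):

    /* Because every SSA name is assigned exactly once, value numbering can
       prove that y recomputes x's first value even though x is reassigned,
       without the expensive dataflow analysis pre-SSA optimizers needed. */
    int redundant(int a, int b, int c)
    {
        int x = a + b;   /* SSA: x1 = a + b                     */
        x = x + c;       /* SSA: x2 = x1 + c                    */
        int y = a + b;   /* SSA: y1 = a + b -- same value as x1 */
        return x + y;    /* SSA: return x2 + y1                 */
    }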

Simon Farnsworth
  • 1,174
  • 7
  • 18
  • 1
    SSA works well for objects of automatic duration whose address isn't taken. It works less well for other kinds of objects, whose relationships may not be statically resolvable. Applying it to other forms of objects allows compilers to determine that a particular code form has the semantics necessary to apply a certain transform in many cases where proving that would be impossible--sometimes because the code doesn't actually have the semantics in question. – supercat Apr 28 '23 at 17:21
3

And yet another point on this topic... the "four times" may have been referring to specific, common platforms.

Older languages generally had global scope and limited subroutine functionality. For instance, FORTRAN had user-controlled scoping and in many cases, there was no local data in a routine. Programs also generally used subs as defined functions, as opposed to a way of organizing code (well...).

In contrast, Algol-derived languages use the block as the primary code-organization concept, and programs are generally a collection of subroutine calls. Because blocks have local scope, every one of these calls generally results in the creation (and destruction) of an activation record. As a result, there is significant overhead in the call dynamics of C (et al.) that older languages didn't have.

This led to the widespread use of intermediate systems programming languages, like BLISS mentioned above. On micros, these languages generally combined block-like layout with non-recursive call semantics that didn't require activation records. For instance, Action! on the Atari was generally considered to be about half the speed of hand-coded assembler, whereas C programs were much slower.

While larger platforms like the PDP's and VAXen had larger stacks and userspace controls that aided block-oriented languages, along with the room needed for more optimizations, I suspect at least some of that "four times" was a result of this same effect. Assembler on the same platform could tightly control the calls and unwind with ease, things that did not come to compilers until later.

You can also see this in the performance of systems that were designed to support block-oriented languages; things like the CRISP, the original RISC II and various stack-oriented machines generally offered performance from C that was highly competitive with assembler - that was their entire raison d'etre.

Maury Markowitz
  • 19,803
  • 1
  • 47
  • 138
  • Well, IIRC there was a discussion on a certain transformation of graphics, comparing amateur code, optimized amateur code, C code by a pro, and asm by a pro. Each step saved about 10x in execution time. But I forget whether it was posted here on RC, or somewhere else discussing some kind of 8-bit C. – Schezuk Mar 02 '21 at 18:19
  • When targeting something like the Z80 or the original 8088/8086, there can be a huge performance difference between a loop that keeps all non-array-ish objects in registers, versus code that can't, and assembly language can exploit byte-split registers in ways that programmers wouldn't be able to. The advantage of hand-written assembly language diminishes once one is no longer working with tight loops whose working set can be kept, or largely kept, in registers. After that, what matters is often whether code written in a high-level language can exploit things like chunking optimizations. – supercat Mar 02 '21 at 23:26
  • If one is using a compiled language or dialect which makes it possible to e.g. process suitably aligned pairs of 16-bit values using 32-bit or 64-bit types, that may allow a roughly 2x or 4x speedup versus what would be possible otherwise. If one is using a dialect that does not allow that, that will represent an extra 2x or 4x advantage for using assembly language versus that particular compiled dialect. – supercat Mar 02 '21 at 23:29
  • Hmm, as far as I recall, BLISS had recursive calls, nested routines, and uplevel references. It might be that the compilers were better at optimizing the cases that didn't use any of these. On VAX, the activation record was basically constructed by the CALLS instruction anyway, with maybe a subtract in the callee to allocate stack storage (the compiler tended to allocate the max required stack at routine entry). – dave Mar 03 '21 at 03:02
3

The choice between compiled language and assembly programming isn't always binary. In 1988 I was writing a sort-merge system on MS-DOS for specialised variable-length records. This was intended for CD-ROM publishing, so we had machines that exceeded the 32MB volume limit for MS-DOS 3.x, by using larger disc sectors. I was using MS C 5.0, I think.

The comparison routine for two records was obviously performance-critical. Rather than writing it in assembly, I spent a couple of days writing versions in C, reading the assembler listing from the compiler, and benchmarking promising-looking versions. I got a factor of two speed-up from an initial naïve design, with rather less work than writing assembler.

John Dallman
  • 13,177
  • 3
  • 46
  • 58
  • 1
    I think the advantage of writing in C would be the ability to have code function, even if sub-optimally, on other platforms to which it might be migrated. If code needs to move to a different hardware platform after 10 years because the old one becomes unmaintainable, improvements in machine speed would likely mean that even sub-optimal performance on the new machine would be better than optimized performance on the old one. What's annoying are situations where it's much easier to figure out what the optimal sequence of instructions should be than to coax devtools to generate it. – supercat Apr 29 '23 at 17:40
1

I guess the difference between "an average programmer" and a compiler is that the compiler has "mechanical sympathy" with the hardware it compiles for. I also feel the need to quote Donald Knuth / Hoare / Dijkstra, depending on who you ask: "premature optimisation is the root of all evil".

In today's world of cloud computing, it all gets fuzzy: virtual machines, containers and runtime virtual machines (e.g. Java's Virtual Machine) can all co-exist together. Therefore, compiler micro-optimisations are meaningless in the grander scheme of things - code optimised for a container might be irrelevant on the VM / physical hardware it runs on.

Of course, if we're talking about bare-metal control, then it matters. However, these scenarios are quite niche - unless we're talking about running code on microcontrollers, where optimising CPU cycles also optimises power: x CPU cycles cost microamps of battery life, so this can be critical for some applications.

Processors have branch-prediction caches, and L1 and L2 caches to speed up memory access and branching, as well as disk/SSD-backed virtual memory. Processors can also pipeline instructions and effectively run some parts of the code in parallel when there are groups of unrelated instructions unaffected by the order in which they are executed (superscalar, out-of-order execution). Intel's Hyper-Threading technology takes this further by interleaving independent instruction streams, and there were probably others before them, but I'm not certain who they are without some proper research.

The JVM has a HotSpot compiler. The HotSpot compiler converts frequently interpreted portions of bytecode into native machine code, to avoid parsing/translating the same bytecode over and over. Compilers have optimisations too, like inlining code to save on some kind of machine-code call instruction (which normally involves additional cycles for saving the return address, at the very least). From a heuristics perspective, it's the empirical data that drives the on-the-fly optimisation, so it's going to catch stuff you never even thought about.

Much of this depends on the processor, the language, the operating system, the compiler and the data that needs to be processed.

KRK Owner
  • 129
  • 2
  • On the flip side, programmers often know much more about corner-case behavioral requirements than compilers can. If the optimal machine code that would meet requirements would have a corner-case behavior which can't be expressed in C, and if a significant but not unlimited variety of behaviors would be acceptable in the corner case, it may be impossible for to write source code which from which any compiler should be expected to even be capable of generating optimal machine code unless the compiler extends the language to support such corner cases. – supercat Oct 10 '20 at 21:55
1

DISCLAIMER: I'm no "expert", but I have been around the block quite a few times.

This should have occurred to me earlier, but being otherwise distracted, I never fully articulated my thoughts.

I'll give you some background:

  1. Optimisations are very likely heuristic. If > 50% of specific sequences of code do x, rather than y, then optimise for that scenario. Rinse and repeat. This was something that Systems Programmers on mainframes did for a living to eke out some more CPU cycles.
  2. We're not just talking about C here. It's done in many languages...
  3. Processors have (adaptive) branch prediction logic/caches (see How does the branch predictor know if it is not correct?)
  4. VMs have optimising compilers which alter the underlying runtime bytecode to speed things up a bit by switching whether a branch needs to be made (possibly costing slightly more CPU cycles to fetch more memory if the branch is far away (see 3))
  5. It's possible that some higher-level languages could be leveraging functionality from C or C++ compilers... meaning the same "boilerplate" code is shared across many languages.
  6. No doubt there are many, many more than I'd care to know.
  7. The moral of the story: you either have to be lucky, or FULLY understand what's going on in order to crack performance optimisation.
Toby Speight
  • 1,611
  • 14
  • 31
KRK Owner
  • 129
  • 2
  • 1
    Is there a reason you wrote a second answer, instead of extending the existing? – Raffzahn Mar 02 '21 at 01:38
  • Not particularly, except that there are many aspects to consider here. Optimisation is a very broad-ranging subject, and is subjective in itself. I simply wanted to add another perspective without causing further confusion. Happy to merge them together if you thing this would help? – KRK Owner Mar 04 '21 at 00:49
  • 1
    None of this addresses the question - when did the crossover occur? (I'm not convinced it's a good question, as it's too subjective). – Toby Speight Sep 29 '21 at 07:49
  • 1
    @TobySpeight - thanks for being on-point. I also agree that it's highly subjective to precisely pinpoint where a crossover occurred. That's probably because engineers have always been creative in optimising things based upon the world of constraints they live within.... hardware, software, knowledge, experience, contacts... – KRK Owner Feb 24 '23 at 00:59
0

Quoting lilburne:

Many years ago I was teaching someone to program in C. The exercise was to rotate a graphic through 90 degrees. He came back with a solution that took several minutes to complete, mainly because he was using multiplies and divides etc.

I showed him how to recast the problem using bit shifts, and the time to process came down to about 30 seconds on the non-optimizing compiler he had.

I had just got an optimizing compiler and the same code rotated the graphic in < 5 seconds. I looked at the assembly code that the compiler was generating, and from what I saw decided there and then that my days of writing assembler were over.
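
A sketch of the kind of recasting described (illustrative, not the original exercise): with a power-of-two image width, the pixel address arithmetic y*width + x reduces to shifts and adds, (y << 8) + x, so the loop needs no multiplies or divides at all.

    #include <stdint.h>

    /* Rotate a 256x256 8-bit image 90 degrees clockwise. The array
       indexing compiles down to shifts and adds. */
    void rotate90(uint8_t dst[256][256], const uint8_t src[256][256])
    {
        for (unsigned y = 0; y < 256; y++)
            for (unsigned x = 0; x < 256; x++)
                dst[x][255 - y] = src[y][x];
    }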

Sep Roland
  • 1,043
  • 5
  • 14
Schezuk
  • 3,752
  • 1
  • 17
  • 40