
I'm "re-learning" 68000 assembly language and came across the "MOVEQ" command that is labeled "MOVE QUICK".

According to the NXP M68000 Programmer's Reference Manual (referenced below), the MOVEQ (MOVE QUICK) instruction is described as:

Moves a byte of immediate data to a 32-bit data register. The data in an 8-bit
field within the operation word is sign-extended to a long operand in the data
register as it is transferred.

I've searched the manual and cannot find why it's "quick".

Meaning, what's the difference (in performance) in the following instructions?

MOVEQ #100, D0
MOVE #100, D0

I gather that MOVEQ is a better fit for moving 8-bit data. Or is it ONLY able to move 8 bits of data? I cannot seem to confirm this from the manual.

REF:

https://www.nxp.com/files-static/archives/doc/ref_manual/M68000PRM.pdf

cbmeeks

3 Answers


The MOVE immediate instruction takes 8 cycles in byte and word modes. There are two memory reads, one for the instruction and one for the immediate value.

The MOVEQ instruction encodes the immediate value into the instruction opcode itself, so it takes only 4 cycles and 1 memory read. It can only take a byte immediate value.

Instruction         Performance
MOVEQ #1, D0        4 clocks, 1 memory read
MOVE.b #1, D0       8 clocks, 2 memory reads
MOVE.w #1000, D0    8 clocks, 2 memory reads

Note that the immediate value loaded for byte and word size moves overwrites the entire 32 bits of the register, and is sign extended.

As such, for loading values $00-$FF, it is twice as fast in instruction cycles and uses half as much memory bandwidth (important on systems where it is shared with DMA).
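To make the sign-extension behaviour concrete, here is a small Python model of what MOVEQ does to the destination register (my own illustration, not from any 68000 documentation):

```python
def moveq(imm8):
    """Model of MOVEQ: the 8-bit immediate from the opcode word is
    sign-extended to 32 bits, overwriting the whole data register."""
    imm8 &= 0xFF
    if imm8 & 0x80:                  # sign bit set: extend with ones
        return imm8 | 0xFFFFFF00
    return imm8                      # $00-$7F stays as-is

# MOVEQ #100, D0 leaves D0 = $00000064;
# MOVEQ #-1, D0 leaves D0 = $FFFFFFFF.
```

So a MOVEQ of a negative value clobbers all 32 bits, which (as the comments below note) is why only a MOVE.L of an immediate in the range -128..127, not a MOVE.B, can be safely replaced by it.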

user
  • "for loading values 0-255..." - more precisely, it loads a 32-bit value between -128 and +127. This is 3 times faster than doing it the normal way with move.l #... – Bruce Abbott Jul 19 '19 at 01:20
  • @BruceAbbott that's a good point, it does sign extend to 32 bits. – user Jul 19 '19 at 07:54
  • Showing the actual hex codes of the three instructions will improve the answer. – Leo B. Jul 19 '19 at 23:17
  • Isn't this design of MOVE.b/w/l a little toxic? E.g. the immediate that follow the MOVE could decode into an invalid instruction had it been done speculatively? I believe modern RISC ISAs have all removed support for long immediate values if the opcode and the value can not both fit into one instruction word? – user3528438 Dec 25 '21 at 05:14
  • Why did it need its own instruction rather than MOVE.b being translated into MOVEQ's opcode by the assembler? Is there any value in providing the "slow" MOVE.b instruction? – Luca Citi Dec 25 '21 at 07:56
  • @user3528438 I believe 68ks use microcode translation into internal simple instructions, so the pipeline benefits from fixed length probably aren't all that big. DRAM bandwidth is much more of an issue, especially with no cache on chip. Also, note they only squeezed 8 bits into the instruction word. 12 or 16 is OK, but loading your 32-bit constants 8 bits at a time? Also there was little in the way of branch prediction then, so probably not a big risk. 68k isn't my favourite ISA though, so I may be wrong. I'd expect clever assemblers to rewrite, but as an option; cycle counting was big back then. – Dan Dec 25 '21 at 18:35
  • The Apple MPW assembler would convert a MOVE immediate to register into a MOVEQ if the value was small enough. (The programmer might not know it was possible if the value was defined instead of being hardcoded.)

    The MOVE.B instruction can have any source and destination mode, including memory to memory. At the time it was easier and simpler to have a standard operand decoder + microcode generator that you re-used for as many instructions as possible. Today, with many more gates, maybe you would reserve the move.b immediate -> register opcode pattern for another purpose.

    – Hugh Fisher Dec 26 '21 at 00:09
  • @LucaCiti: MOVE.B only operates on the 8 least significant bits of the register (leaving the remaining 24 bits untouched). Only MOVE.L #-128..127, dN can be safely replaced by MOVEQ without considering the context. It is redundant, but can speed up many common sequences (e.g. returning a fixed 32-bit value in a register). On a 68000 MOVEQ+ADD.L is also faster than ADDI.L for example. – user786653 Dec 26 '21 at 07:27
  • @user786653 Yes, sorry, I got confused. I guess my (corrected) question is why does one need an extra assembly instruction (MOVEQ) rather than encoding all MOVE.L instructions with an immediate argument -128..127 with MOVEQ's bytecode. Is there any reason to have a slow MOVE.L? – Luca Citi Dec 27 '21 at 01:05
  • @LucaCiti: You want a separate instruction in your assembler code, so the assembler can give a warning/error if the immediate is out of range (it might be an assembly-time expression for example). Also not every assembler performs the optimization (it's actually not safe if self-modifying code is used). If you're asking why the long version is allowed, that's because it would be much too complicated for everything involved (CPU/assemblers/linkers/programmers using SMC etc.) to disallow it for little benefit. – user786653 Dec 27 '21 at 06:35
  • @user786653 I see. I'm probably reasoning with the mindset of a compiler rather than an assembler. In this case there is a close matching between mnemonic and bytecode and the programmer is responsible for choosing the correct version. – Luca Citi Dec 27 '21 at 08:58
13

To give the exact cycle-by-cycle breakdown:

MOVEQ is a one word instruction so will nominally perform in four cycles; in practice it can occur immediately following operation decoding because all necessary information is within the instruction word. Four cycles are then expended fetching the next value to feed into the instruction prefetch queue.

Both MOVE.b and MOVE.w are two-word instructions. The 68000 actually knows both words before the instruction begins, so the move itself can occur pretty much immediately, but a further two words must then be fetched to repopulate the instruction prefetch queue, which occupies eight cycles before the next instruction can begin.

MOVE.l is a three-word instruction. The 68000's prefetch queue is only two words long. So after decoding it can't actually be completed until a further word has been fetched, and after that fetch a further two will be needed to repopulate the queue. So twelve cycles total.
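Assuming the figures above (4 clocks per 16-bit bus cycle, no wait states, and a two-word prefetch queue to refill), the timings reduce to a simple word count. This is a back-of-envelope sketch of that rule, not the actual microcode:

```python
CLOCKS_PER_BUS_CYCLE = 4  # one 16-bit memory access on a wait-state-free 68000

def immediate_move_clocks(instruction_words):
    """Floor on execution time for a move-immediate: every word of the
    instruction stream (opcode, extension words, prefetch refill) costs
    one bus cycle."""
    return instruction_words * CLOCKS_PER_BUS_CYCLE

print(immediate_move_clocks(1))   # MOVEQ:         4 clocks
print(immediate_move_clocks(2))   # MOVE.b/.w #n:  8 clocks
print(immediate_move_clocks(3))   # MOVE.l #n:    12 clocks
```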

MOVEs are the most primitive operation available; for anything else, the general rule is that the number of words needed to encode an operation, plus the number then needed to [re]populate the prefetch queue, gives only a floor for the cycle count. See Yacht.txt for a more detailed summary of the work each instruction does; bear in mind that things like RTS are only one word long but imply two further prefetches, since the whole queue needs to be replenished, and that anything which might change the supervisor flag will often result in a refetch of data that's ostensibly already in the queue, in case the memory subsystem serves different contents depending on that flag.

Tommy
  • (obiter: this answer was offered despite the other answer already being present because I felt the fact of the prefetch queue makes it a different answer, technically. Even if very similar) – Tommy Jul 19 '19 at 20:56
  • Comment only: Your last paragraph doesn't seem to quite 'scan' correctly - or I'm still half asleep :-). I think "that anything that things" contains at least one typo (but may not) and " in case the ..." may not say what you want as precisely as it could (but may :-) ). – Russell McMahon Jul 19 '19 at 22:32
  • Are there any circumstances in which the time to execute a 68000 instruction would vary with context? For example, how would the timing of muls r0,r1 / moveq r0,#0 / rts compare to muls r0,r1 / move.l r0,#0 / rts? – supercat Jul 19 '19 at 22:33
  • @supercat I can't think of anything, as every instruction is microcoded to make sure the prefetch queue is exactly full again within its execution time. It's not intelligent like, say, the instruction queue on an 8086. – Tommy Jul 20 '19 at 13:40
  • @Tommy: I can't think of any advantage to waiting until the second fetch is complete before starting instruction execution, but could easily imagine that instruction decode couldn't start until the cycle after the first fetch was complete; starting the fetch of the second word immediately, without regard for whether it's needed, would allow it to begin a cycle or two sooner than would otherwise be possible. It may have been possible to design the 68000 to shave two cycles off an RTS if the attached memory system could process a two-cycle "ignored value" read, but... – supercat Jul 20 '19 at 16:34
  • ...a typical DRAM system requires that once a read is started, severe memory corruption is likely if anything prevents it from running to conclusion. – supercat Jul 20 '19 at 16:35

I've searched the manual and cannot find why it's "quick".

Simply because MOVEQ is a single-word (two-byte) instruction, which can be fetched in a single memory cycle, while an equivalent constant move will be 2 (MOVE.W) or 3 words (MOVE.L) and will need one or two additional memory cycles, each taking four clocks.

So effectively you'll get the following execution timing:

  • MOVEQ #5,D0 - 4 clocks
  • MOVE.B #5,D0 - 8 clocks
  • MOVE.W #5,D0 - 8 clocks
  • MOVE.L #5,D0 - 12 clocks

making MOVEQ about 50/66% faster.

MOVEQ even got its own opcode group (7) to squeeze it all into a single word.

ADDQ and SUBQ work similarly (*1) - except they are mixed into the Scc/DBcc/TRAPcc group (5).

I gather the MOVEQ is a better fit for moving 8-bit data. Or, is it ONLY 8-bits of data as I cannot seem to confirm.

Only. There is no room for more than 8 bits of constant within the 16-bit instruction word (*2), as the encoding is

|OPCODE|Dest.| Res || Data      |
|Group |Reg. |     ||           |
| 0111 | xxx |  0  || yyyy yyyy |

*1 - Not exactly like it, as they carry additional parameters.

*2 - Well, in the original 68000 encoding there was one unused bit, but one more bit wouldn't get you far.
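Following that bit layout, a MOVEQ opcode word can be assembled mechanically. The helper below is a hypothetical illustration (the name `assemble_moveq` is mine, not from any real assembler), showing both the range check an assembler would perform and the resulting word:

```python
def assemble_moveq(reg, imm):
    """Build the 16-bit MOVEQ opcode word: 0111 rrr 0 dddddddd,
    where rrr is the destination data register and dddddddd the
    signed byte immediate."""
    if not 0 <= reg <= 7:
        raise ValueError("destination must be a data register D0-D7")
    if not -128 <= imm <= 127:
        raise ValueError("MOVEQ immediate must fit in a signed byte")
    return 0x7000 | (reg << 9) | (imm & 0xFF)

assert assemble_moveq(0, 1) == 0x7001    # MOVEQ #1, D0
assert assemble_moveq(0, -1) == 0x70FF   # MOVEQ #-1, D0
```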

Raffzahn
  • Don't forget the move.b instruction... – UncleBod Jul 18 '19 at 16:14
  • @UncleBod MOVE.B is exactly like MOVE.W – Raffzahn Jul 18 '19 at 16:40
  • "making MOVEQ about 50/66% faster." - https://math.stackexchange.com/questions/1404234/what-does-200-faster-mean-how-can-something-be-more-than-100-faster/1404242 – Bruce Abbott Jul 20 '19 at 23:20
  • @BruceAbbott And your point is? – Raffzahn Jul 21 '19 at 10:19
  • @Raffzahn I think he's suggesting 4 cycles is 100% faster than 8 cycles because the speed is doubled. That post claims "faster" means "speed", not "time reduction". I don't think it matters at all. We can all do basic arithmetic operations on the numbers 4 and 8, and so understand what you said, without resorting to doing duelling banjos with dictionaries. – Dan Dec 24 '21 at 16:19
  • @DanSheppard Oh, I'm all with you, except, a percentage is always a relative notation between two values, and faster (or slower) is likewise a relation, one that marks a direction. In this case the direction is clear, as MOVEQ is faster than MOVE in absolute numbers. So a relation using the term 'faster (than)' must be based on the slower one, mustn't it, so MOVE with 8 clocks is 100%. MOVEQ is 4, so it's 50% faster. 100% would simply mean it being executed in zero clocks :)) – Raffzahn Dec 24 '21 at 16:27
  • @Raffzahn I think "faster" refers to stuff done divided by time. If I drive 60 miles and it takes two hours and you drive 60 miles and it takes only one hour, were you going 50% faster than me? No. You were going 100% faster because you were doing 60mph and I was doing 30mph. You can do twice as many moveqs as I can do move.bs so you are 100% faster. – JeremyP Dec 27 '21 at 15:10