Let's start by explaining what each instruction does. REP OPD works as follows:
for (; ecx > 0; ecx--) OPD
The REP prefix repeats the string instruction that follows it, decrementing ECX after each iteration, until ECX reaches 0. Notice in your code that ECX is set to 13 (address 0040106A).
STOS, on the other hand, stores the value of AL, AX, or EAX into the memory location pointed to by EDI, then advances EDI by the operand size. Which register is used is determined by the size of the memory operand, hence the DWORD in your code (EAX, 4 bytes per store).
Combined, these two instructions form a loop that fills a block of memory with the value in EAX.
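Put together, REP STOSD behaves like the following C sketch. This models the semantics only, not the performance; the pointer parameter plays the role of EDI, and the direction flag DF = 0 (forward stores) is assumed:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative C model of REP STOSD with DF = 0:
   store EAX at [EDI], advance EDI by 4, decrement ECX, until ECX == 0. */
static void rep_stosd(uint32_t *edi, uint32_t ecx, uint32_t eax) {
    while (ecx != 0) {   /* REP: repeat while ECX != 0      */
        *edi++ = eax;    /* STOSD: store EAX, advance EDI   */
        ecx--;           /* REP decrements ECX each round   */
    }
}
```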
Now, to put things in C form: if we want to initialize an array of bytes to 0 (i.e. what memset does), we can write it this way:
unsigned char t[MAX_CHAR];
for (int i = 0; i < MAX_CHAR; i++)
t[i] = 0;
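That loop is exactly what memset(t, 0, MAX_CHAR) does. A quick sketch showing the two produce the same result (MAX_CHAR = 13 here is just an illustrative value, matching the ECX = 13 in your code):

```c
#include <string.h>
#include <assert.h>

#define MAX_CHAR 13  /* illustrative size; matches the ECX = 13 in your code */

/* Zero the array with the explicit loop, as above */
static void zero_loop(unsigned char *t) {
    for (int i = 0; i < MAX_CHAR; i++)
        t[i] = 0;
}

/* Same effect via the library call the loop is equivalent to */
static void zero_memset(unsigned char *t) {
    memset(t, 0, MAX_CHAR);
}
```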
This C code can be translated to multiple assembly equivalents. It depends mostly on the compiler and the specified level of optimization.
One variant based on REP STOS could be (note that STOS addresses memory implicitly through EDI; it takes no explicit memory operand, and STOSB is the byte-sized form that matches our unsigned char array):
mov edi, offset t //Point EDI at the destination buffer
mov ecx, MAX_CHAR //Initialize ecx to the number of iterations desired
xor eax, eax //Initialize eax to 0
rep stosb //Store AL at [EDI], advance EDI, decrement ECX, until ECX = 0
Another equivalent assembly variant could be:
mov ecx, MAX_CHAR //Initialize ecx to the number of iterations
xor eax, eax //Initialize eax to 0
loop0: //Define loop label
mov byte ptr [t + ecx - 1], al //Copy al into t[i] where i = ecx - 1
dec ecx //Decrement ecx ==> ecx = ecx - 1
jnz loop0 //Jump back unless the result of the decrement is 0
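In C, that countdown loop corresponds to the sketch below. Note that the store has to target index ecx - 1: since ecx runs from MAX_CHAR down to 1, that is what covers t[MAX_CHAR - 1] through t[0] without skipping t[0] or writing past the end of the array (MAX_CHAR = 13 is illustrative):

```c
#include <assert.h>

#define MAX_CHAR 13  /* illustrative; matches the ECX = 13 in your code */

/* C equivalent of the countdown loop: ecx runs MAX_CHAR..1, so the
   store targets index ecx - 1 to cover t[MAX_CHAR-1] .. t[0]. */
static void zero_countdown(unsigned char *t) {
    for (unsigned int ecx = MAX_CHAR; ecx != 0; ecx--)
        t[ecx - 1] = 0;
}
```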
These two assembly snippets are functionally equivalent. The only difference is cost: one takes more cycles than the other. In other words, one is faster. How do we decide which is best? If you check Agner Fog's instruction tables, page 162, you'll notice that REP STOS has a latency of n in the worst case (n being the number of iterations).
If you look around the instruction tables, you'll see that XOR REG, SAME costs about 0.25 cycles and MOV R, IMM about 0.5 cycles. Overall, the performance of the first assembly snippet can therefore be estimated as: latency = n + 0.75 cycles (for 1000 iterations, ~1000.75 cycles).
If we look at the second assembly snippet, we get the following:
mov ecx, MAX_CHAR //0.50 cycles
xor eax, eax //0.25 cycles
loop0:
mov byte ptr [t + ecx - 1], al //1 cycle
dec ecx //2 cycles
jnz loop0 //1 cycle
In this case, we get latency = 4 * n + 0.75.
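Plugging numbers into both formulas makes the comparison concrete. This is pure arithmetic using the cycle counts quoted above:

```c
#include <assert.h>

/* The two latency estimates from the text, as functions of the
   iteration count n (cycle figures taken from the discussion above) */
static double rep_stos_latency(double n) { return n + 0.75; }       /* n + 0.75  */
static double loop_latency(double n)     { return 4.0 * n + 0.75; } /* 4n + 0.75 */
```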
Now, you might think that the first code is faster than the second one because n < 4n. Don't forget, though, that Intel architectures are pipelined, and there is more going on under the hood. What I can assure you is that the first code is VERY slow compared to the second one. Why? REP STOS is microcoded (see the Fused µOps column). That means it isn't a single hardware instruction: the decoder expands it into multiple µops (two here), an approach that could, in the past with the Pentium III and IV, save you time. The problem is that microcoded instructions don't flow well through the pipeline.
The second assembly snippet will be much faster because each of its instructions maps directly to dedicated hardware, with no microcode expansion needed to perform the task. These instructions therefore pipeline easily, so the computed 4n overall latency can drop far below n (if everything is cached properly): a whole loop iteration can come down to around 0.75 cycles for all three instructions, plus a bonus when the branch predictor gets it right.
Keep in mind that these instructions were born in a time when CPUs had a hard time running loop constructs efficiently, especially for string processing, because of branch-heavy code and small caches. With the advent of efficient branch predictors and larger, more sophisticated cache hierarchies, these instructions became somewhat obsolete. Few compilers will generate code based on REP STOS or REP LODS nowadays, unless you're targeting a pretty old CPU or using a pretty old compiler (like Turbo C, for example).
I hope this post helped shed some light on these instructions and how they can be used. Let me know if you need more details or further explanation.