Let's start by explaining what each instruction does. REP OPD works as follows:
for (; ecx > 0; ecx--) OPD
The REP prefix repeats the string instruction that follows it, decrementing ECX after each iteration, until ECX reaches 0. Notice in your code that ECX is set to 13 (address 0040106A).
STOS, on the other hand, stores the value of AL, AX, or EAX into the memory location pointed to by EDI, then advances EDI by the operand size. Which register is used is determined by the size of the memory operand, hence the DWORD in your code (EAX, 4 bytes per store).
Combined, these two instructions form a loop that fills a block of memory with the value in EAX.
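Put together, REP STOSD behaves like the following C sketch. This models the semantics only, not the performance; the pointer parameter plays the role of EDI, and the direction flag DF = 0 (forward stores) is assumed:

```c
#include <stdint.h>
#include <assert.h>

/* Illustrative C model of REP STOSD with DF = 0:
   store EAX at [EDI], advance EDI by 4, decrement ECX, until ECX == 0. */
static void rep_stosd(uint32_t *edi, uint32_t ecx, uint32_t eax) {
    while (ecx != 0) {   /* REP: repeat while ECX != 0      */
        *edi++ = eax;    /* STOSD: store EAX, advance EDI   */
        ecx--;           /* REP decrements ECX each round   */
    }
}
```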
Now, to put things in C form: if we want to initialize an array of bytes to 0 (i.e. what memset does), we can write it this way:
unsigned char t[MAX_CHAR];
for (int i = 0; i < MAX_CHAR; i++)
t[i] = 0;
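That loop is exactly what memset(t, 0, MAX_CHAR) does. A quick sketch showing the two produce the same result (MAX_CHAR = 13 here is just an illustrative value, matching the ECX = 13 in your code):

```c
#include <string.h>
#include <assert.h>

#define MAX_CHAR 13  /* illustrative size; matches the ECX = 13 in your code */

/* Zero the array with the explicit loop, as above */
static void zero_loop(unsigned char *t) {
    for (int i = 0; i < MAX_CHAR; i++)
        t[i] = 0;
}

/* Same effect via the library call the loop is equivalent to */
static void zero_memset(unsigned char *t) {
    memset(t, 0, MAX_CHAR);
}
```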
This C code can be translated to multiple assembly equivalents. It depends mostly on the compiler and the specified level of optimization.
One variant based on REP STOS could be (note that STOS addresses memory implicitly through EDI; it takes no explicit memory operand, and STOSB is the byte-sized form that matches our unsigned char array):
mov edi, offset t //Point EDI at the destination buffer
mov ecx, MAX_CHAR //Initialize ecx to the number of iterations desired
xor eax, eax //Initialize eax to 0
rep stosb //Store AL at [EDI], advance EDI, decrement ECX, until ECX = 0
Another equivalent assembly variant could be:
mov ecx, MAX_CHAR //Initialize ecx to the number of iterations
xor eax, eax //Initialize eax to 0
loop0: //Define loop label
mov byte ptr [t + ecx - 1], al //Copy al into t[i] where i = ecx - 1
dec ecx //Decrement ecx ==> ecx = ecx - 1
jnz loop0 //Jump back unless the result of the decrement is 0
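In C, that countdown loop corresponds to the sketch below. Note that the store has to target index ecx - 1: since ecx runs from MAX_CHAR down to 1, that is what covers t[MAX_CHAR - 1] through t[0] without skipping t[0] or writing past the end of the array (MAX_CHAR = 13 is illustrative):

```c
#include <assert.h>

#define MAX_CHAR 13  /* illustrative; matches the ECX = 13 in your code */

/* C equivalent of the countdown loop: ecx runs MAX_CHAR..1, so the
   store targets index ecx - 1 to cover t[MAX_CHAR-1] .. t[0]. */
static void zero_countdown(unsigned char *t) {
    for (unsigned int ecx = MAX_CHAR; ecx != 0; ecx--)
        t[ecx - 1] = 0;
}
```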
These two assembly snippets are functionally equivalent. The only difference is cost: one takes more cycles than the other. In other words, one is faster. How do we decide which is best? If you check Agner Fog's instruction tables, page 162, you'll notice that REP STOS has a latency of n in the worst case (n being the number of iterations).
If you look around the instruction tables, you'll see that XOR REG, SAME costs about 0.25 cycles and MOV R, IMM about 0.5 cycles. Overall, the performance of the first assembly snippet can therefore be estimated as: latency = n + 0.75 cycles (for 1000 iterations, ~1000.75 cycles).
If we look at the second assembly snippet, we get the following:
mov ecx, MAX_CHAR //0.50 cycles
xor eax, eax //0.25 cycles
loop0:
mov byte ptr [t + ecx - 1], al //1 cycle
dec ecx //2 cycles
jnz loop0 //1 cycle
In this case, we get latency = 4 * n + 0.75.
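Plugging numbers into both formulas makes the comparison concrete. This is pure arithmetic using the cycle counts quoted above:

```c
#include <assert.h>

/* The two latency estimates from the text, as functions of the
   iteration count n (cycle figures taken from the discussion above) */
static double rep_stos_latency(double n) { return n + 0.75; }       /* n + 0.75  */
static double loop_latency(double n)     { return 4.0 * n + 0.75; } /* 4n + 0.75 */
```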
Now, you might think that the first code is faster than the second one because n < 4n. Don't forget, though, that Intel architectures are pipelined, and there is more going on under the hood. What I can assure you is that the first code is VERY slow compared to the second one. Why? REP STOS is microcoded (see the Fused µOps column). That means it isn't a single hardware instruction: the decoder expands it into multiple µops (two here), an approach that could, in the past with the Pentium III and IV, save you time. The problem is that microcoded instructions don't flow well through the pipeline.
The second assembly snippet will be much faster because each of its instructions maps directly to dedicated hardware, with no microcode expansion needed to perform the task. These instructions therefore pipeline easily, so the computed 4n overall latency can drop far below n (if everything is cached properly): a whole loop iteration can come down to around 0.75 cycles for all three instructions, plus a bonus when the branch predictor gets it right.
Keep in mind that these instructions were born in a time when CPUs had a hard time running loop constructs efficiently, especially for string processing, because of branch-heavy code and small caches. With the advent of efficient branch predictors and larger, more sophisticated cache hierarchies, these instructions became somewhat obsolete. Few compilers will generate code based on REP STOS or REP LODS nowadays, unless you're targeting a pretty old CPU or using a pretty old compiler (like Turbo C, for example).
I hope this post helped shed some light on these instructions and how they can be used. Let me know if you need more details or further explanation.