I have seen the related questions, including here and here, but it seems that the only instruction ever mentioned for serializing rdtsc is cpuid.
Unfortunately, cpuid takes roughly 1000 cycles on my system, so I am wondering if anyone knows of a cheaper (fewer cycles and no read or write to memory) serializing instruction?
I looked at iret, but that seems to change control flow, which is also undesirable.
I have actually looked at the white paper linked in Alex's answer about rdtscp, but it says:
The RDTSCP instruction waits until all previous instructions have been executed before reading the counter. However, subsequent instructions may begin execution before the read operation is performed.
That second point seems to make it less than ideal.
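For concreteness, here is a minimal sketch of the two approaches I am comparing. It assumes GCC or Clang on x86-64 and a CPU that supports rdtscp. The helper names tsc_after_cpuid and tsc_rdtscp are my own, made up for illustration, and are not taken from the white paper.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc and __rdtscp (GCC/Clang) */

/* Hypothetical helper: execute CPUID as a serializing instruction,
 * then read the time-stamp counter. This is the expensive variant. */
static inline uint64_t tsc_after_cpuid(void)
{
    unsigned int eax, ebx, ecx, edx;
    /* volatile keeps the compiler from dropping the CPUID even though its
     * outputs are unused; the "memory" clobber is only a compiler barrier
     * and does not itself generate a memory access. */
    __asm__ __volatile__("cpuid"
                         : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx)
                         : "a"(0), "c"(0)
                         : "memory");
    (void)eax; (void)ebx; (void)ecx; (void)edx;
    return __rdtsc();
}

/* Hypothetical helper: RDTSCP waits for earlier instructions to execute,
 * but, per the quoted text, later instructions may begin before the read. */
static inline uint64_t tsc_rdtscp(void)
{
    unsigned int aux;  /* receives the IA32_TSC_AUX value; ignored here */
    return __rdtscp(&aux);
}
```

The first helper is what costs me roughly 1000 cycles per call; the second avoids cpuid but leaves the end of the measured region unprotected, which is the gap I am asking about.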