
Some PCIe devices (for example, FPGA cards) can expose segments of their physical memory to the host via BARs, and the host can access that memory region through the memory device (on Linux, we can memory-map the device into virtual memory). I suppose the device itself could also access this part of memory through a /dev/mem-style mapping if it runs Linux too.

One thing a program can do with that (virtual) memory is atomic operations such as "__atomic_sub_fetch", which can be very useful when writing high-performance code.

My question is: what if the memory comes from the PCIe shared memory described above (and is mapped into the user's virtual memory space)? Do atomic operations still hold? I do not know whether PCIe can guarantee atomicity, considering that atomic operations could come from both the host's and the device's CPUs at the same time. If yes, how does the performance compare to the same atomic operation on regular memory?

I have seen a related question asked here, but with no direct answer: PCI Express BAR memory mapping basic understanding

Thanks a lot!

BZKN
SCLaker

2 Answers


OP Question 1: My question is what if the memory comes from the above PCIe shared memory (and mapped to user's virtual memory space)? Does the atomic operation still hold?

  • Yes. Both the FPGA and host CPU software can request a lock for exclusive access to a memory region in order to perform atomic operations. For example, OpenCL shared virtual memory (SVM) introduces fine-grained host-device synchronization, which allows the host and device to access shared data structures concurrently and to synchronize at the granularity of atomic load/store instructions. This enables true concurrency between software threads and FPGA kernels in the presence of shared data structures.

  • Having said that, such synchronization of concurrent memory access through atomic load/store operations requires a mechanism ensuring that a CPU or FPGA kernel/accelerator access to shared data is guarded against an interfering access to the same location from the other side until the access has completed (atomicity of the access).

  • Furthermore, the answer on SO here says that PCIe 3.0 does support certain "Locked Transactions".

  • Furthermore, since your question mentions an FPGA, let's take a concrete example: atomic operation support in the 7 Series FPGAs Integrated Block for PCI Express v3.3. Its documentation states that the 7 Series FPGAs Integrated Block for PCI Express supports both sending and receiving atomic operations (AtomicOps) as defined in the PCI Express Base Specification v2.1. The specification defines three TLP types that allow advanced synchronization mechanisms amongst multiple producers and/or consumers. The integrated block treats AtomicOp TLPs as Non-Posted Memory Transactions. The three TLP types are:

    • FetchAdd
    • Swap
    • CAS (Compare and Swap)

OP Question 2: If yes, how is its perf compare to the same atomic operation on the regular memory?

This depends. One significant factor is the size of the data. For example, in some applications the same atomic operation can perform better on a regular memory system if the array size is small, while with larger array sizes it can perform better with SVM. At times, in the case of SVM, achieving runtime performance merely equal to regular memory can itself be considered a gain, since SVM has its own overheads.

BZKN
  • What you're saying may be true for certain devices, but I'm not sure that it is generally true for any generic PCIe device. See [my answer](https://stackoverflow.com/a/70677297/119527). – Jonathon Reinhart Jan 12 '22 at 06:41

I think the answer is "No."

Atomic operations like __atomic_fetch_add are implemented (on x86) as an instruction with a LOCK prefix. This prefix would traditionally tell the CPU to "lock" the bus by asserting a LOCK# signal which other physical CPUs would respect. Nowadays, this atomicity is all handled by the cache coherency protocol (MESI) which dictates the behavior of the cache hierarchy.

See What is processor Lock# signal and how it works?.

The point is that these CPU instructions only protect the memory against other CPUs.

So you may be able to use atomic instructions to provide atomicity from the perspective of software running on other CPU cores, but to my knowledge there are no atomic primitives available on the PCIe protocol that would provide atomicity against the device itself.

See: http://xillybus.com/tutorials/pci-express-tlp-pcie-primer-tutorial-guide-1


Edit: Actually, I might be wrong. This answer says that PCIe 3.0 does support certain "Locked Transactions". But I'm not sure that an x86 CPU will translate a lock inc instruction against a memory-mapped PCIe address to a PCIe FetchAdd instruction. I would be very interested to hear more insight here.

Jonathon Reinhart