Hi, I am reading some material about kernel fusion in TensorRT. From this post and this blog, I understand that the main benefit of kernel fusion is reusing data in shared memory/registers, which reduces the number of load/store operations.
This code example is also very clear: it fuses kernels to avoid multiple kernel launches and uses register memory to reduce the number of data loads/stores.
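To check my understanding, here is my own minimal CUDA sketch (not taken from the linked example; the kernel names and ops are just placeholders): `y = relu(a + b)` written first as two kernels with a global-memory temporary, then fused so the intermediate value never leaves a register.

```cuda
// Unfused: two launches, and the intermediate `tmp` lives in global memory.
__global__ void add_kernel(const float* a, const float* b, float* tmp, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = a[i] + b[i];             // store to global memory
}

__global__ void relu_kernel(const float* tmp, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(tmp[i], 0.0f);       // reload from global memory
}

// Fused: one launch, the intermediate stays in a register, saving one global
// store plus one global load per element and one kernel launch.
__global__ void add_relu_fused(const float* a, const float* b, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a[i] + b[i];                   // kept on-chip (register)
        y[i] = fmaxf(t, 0.0f);
    }
}
```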
My question comes from one paper, which mentions the following performance issues:
Two performance issues: (i) One often has to transfer the output matrix from one operator to another in the GPU global memory; (ii) Switching between operators introduces on- and off-chip data movement because the lifetime of an on-chip variable cannot go across kernels. Although the state-of-the-art optimizations, such as vertical and horizontal kernel fusion from TensorRT, can mitigate the overhead of issue (i), issue (ii) unfortunately remains. The root cause lies in the fact that TensorRT cannot change how each operator is implemented.
My question is mainly about the second issue they mention. Is it because TensorRT is a graph-level optimization? So even though multiple operators are fused, the data transfer between them is still necessary. Let me give a toy example of the data path I mean (rough CUDA sketch below): global memory -> shared memory (kernel A) -> global memory -> shared memory (kernel B).
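In code, the toy example I have in mind would look roughly like this (hypothetical kernels A and B, written by me just to illustrate the data path): because shared memory is scoped to a single kernel launch, kernel A has to spill its result to global memory, and kernel B has to reload it before it can stage anything in its own shared memory.

```cuda
// Hypothetical kernels, only to illustrate the data path:
// global -> shared (A) -> global -> shared (B).
__global__ void kernel_A(const float* in, float* inter, int n) {
    __shared__ float tile[256];                  // lifetime ends with this launch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = in[i];               // global -> shared
        inter[i] = tile[threadIdx.x] * 2.0f;     // shared -> global (forced spill)
    }
}

__global__ void kernel_B(const float* inter, float* out, int n) {
    __shared__ float tile[256];                  // cannot see kernel_A's tile
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = inter[i];            // global -> shared (reload)
        out[i] = tile[threadIdx.x] + 1.0f;
    }
}

// Host side: two launches, so `inter` must be a global-memory buffer.
//   kernel_A<<<(n + 255) / 256, 256>>>(d_in, d_inter, n);
//   kernel_B<<<(n + 255) / 256, 256>>>(d_inter, d_out, n);
```

If my understanding is right, only a kernel that implements both operators could keep `tile` on-chip and skip the round trip through `inter`, which I guess is what the paper means by TensorRT not being able to change how each operator is implemented.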
Any further material is appreciated.