I had an application that was originally single threaded and worked as follows:
- gather the items to be drawn (occlusion / frustum culling / sorting into batches)
- draw items using an immediate context
- present
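For context, here is roughly what that single-threaded loop looked like. This is a simplified sketch: `Batch`, `CullScene`, `DrawBatch`, `m_camera`, `m_immediateContext` and `m_swapChain` are placeholders for my own code and the usual D3D11 objects.

```cpp
// Simplified single-threaded frame loop.
void RenderFrame()
{
    // 1) gather: occlusion / frustum culling, sort into batches (placeholder helper)
    std::vector<Batch> batches = CullScene(m_camera);

    // 2) draw everything on the immediate context (placeholder helper)
    for (const Batch& b : batches)
        DrawBatch(m_immediateContext, b);

    // 3) present
    m_swapChain->Present(1, 0);
}
```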
I decided to use a deferred context and parallelize step 2 as follows:
- gather the items to be drawn (occlusion / frustum culling / sorting into batches)
- draw items in parallel using a deferred context
- execute the command lists from step 2
- present
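In code, the new version looked roughly like this. Again a simplified sketch: `SplitIntoChunks` and `ParallelFor` stand in for my own partitioning and job-system code, and each worker's deferred context was created up front with `ID3D11Device::CreateDeferredContext`.

```cpp
// Simplified parallel frame loop. Each worker records draw calls into its own
// deferred context, producing an ID3D11CommandList; the main thread then plays
// the lists back on the immediate context before presenting.
void RenderFrameDeferred()
{
    std::vector<Batch> batches = CullScene(m_camera);
    std::vector<std::vector<Batch>> chunks = SplitIntoChunks(batches, m_workerCount);

    std::vector<Microsoft::WRL::ComPtr<ID3D11CommandList>> commandLists(chunks.size());

    ParallelFor(chunks.size(), [&](size_t i)
    {
        ID3D11DeviceContext* deferred = m_deferredContexts[i].Get(); // one per worker
        for (const Batch& b : chunks[i])
            DrawBatch(deferred, b);                                  // record on the deferred context
        deferred->FinishCommandList(FALSE, &commandLists[i]);        // close recording into a command list
    });

    // execute the command lists from step 2 on the immediate context
    for (auto& cl : commandLists)
        m_immediateContext->ExecuteCommandList(cl.Get(), FALSE);

    m_swapChain->Present(1, 0);
}
```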
However, I saw almost no performance benefit. I finally decided to move the "execute command lists" step after present:
- gather the items to be drawn (occlusion / frustum culling / sorting into batches)
- draw items in parallel using a deferred context
- present (which now actually presents the previous frame's scene)
- execute the command lists
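The only code change from the sketch above is the tail of the loop, something like:

```cpp
    // ... same culling + parallel recording into commandLists as above ...

    // present first: this shows the scene whose command lists were executed
    // at the end of the previous frame
    m_swapChain->Present(1, 0);

    // then hand this frame's command lists to the driver; they will be shown
    // by next frame's Present call
    for (auto& cl : commandLists)
        m_immediateContext->ExecuteCommandList(cl.Get(), FALSE);
```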
I saw a huge increase in performance with this method. My theory as to why this worked is that there are basically 3 bottlenecks:
- graphics card (triggered by present)
- my app's cpu thread
- device driver cpu thread (triggered by executing the command lists)
In my original order, each successive stage was blocked waiting for the previous one to finish. Now, however, I believe all three are running in parallel:
- present with (frame - 2) data
- device driver thread with (frame - 1) data
- my cpu thread to run with this frame's data
My question is: is this a common pattern, or is there a better way to achieve maximum parallelization while using deferred context rendering?