The last big problem I encountered was working around bottlenecks. I had a lot of fields that were frequently referenced by many threads, and, even with the program running flat out, CPU usage sat in the 8 to 10 percent range because most threads were waiting on locks rather than working. Hopefully your threads won't interfere with each other like that, but I spent a lot of time pausing my debugger and checking who was stuck where. I ended up breaking up my lock/synchronized blocks so that each one guarded a minimum number of fields, and working with local variables whenever possible. (There are plenty of other reasons to stick with local variables. This was just one more.)
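To make the idea of a narrow synchronized block concrete, here is a minimal sketch in Java. The class and field names (NarrowLockDemo, total, count) are made up for illustration and are not from my actual program. The expensive part runs on local variables with no lock held, and the lock is taken only long enough to publish the result.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example: names and workload are assumptions for illustration.
class NarrowLockDemo {
    private final Object lock = new Object();
    private long total; // shared, guarded by lock
    private long count; // shared, guarded by lock

    // Sum the batch into a local variable first, then hold the lock
    // only for the two field updates.
    void addBatch(int[] values) {
        long localSum = 0;
        for (int v : values) {
            localSum += v;
        }
        synchronized (lock) {
            total += localSum;
            count += values.length;
        }
    }

    // Copy the shared fields into locals under the lock, compute outside it.
    double average() {
        long t;
        long c;
        synchronized (lock) {
            t = total;
            c = count;
        }
        return c == 0 ? 0.0 : (double) t / c;
    }

    // Starts several threads that each add the batch {1, 2, 3} repeatedly.
    static double run(int threads, int batchesPerThread) {
        NarrowLockDemo demo = new NarrowLockDemo();
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread t = new Thread(() -> {
                for (int b = 0; b < batchesPerThread; b++) {
                    demo.addBatch(new int[] {1, 2, 3});
                }
            });
            workers.add(t);
            t.start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return demo.average();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1000));
    }
}
```

The point is not the arithmetic but where the lock sits: the loop over the batch never blocks another thread, so the time any thread spends holding the lock stays tiny.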
I also ended up putting blocks of code in separate threads. This might use up a bit of CPU in overhead, but I had 90 to 92 percent available. The code in the separate thread didn't have to wait while its parent thread waited on a lock, and it didn't hold up the parent thread while it waited on a lock itself.
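One way to sketch that hand-off in Java is with an ExecutorService. This is an illustration under my own assumptions, not the code I actually wrote: OffloadDemo, lockedUpdate, and the numbers are invented. The parent submits the lock-bound piece, carries on with its own work, and only collects the result when it needs it.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical example: names and workload are assumptions for illustration.
class OffloadDemo {
    private final Object sharedLock = new Object();
    private int sharedCounter; // guarded by sharedLock

    // The part that may have to wait on a contended lock.
    int lockedUpdate(int delta) {
        synchronized (sharedLock) {
            sharedCounter += delta;
            return sharedCounter;
        }
    }

    static int run() {
        OffloadDemo state = new OffloadDemo();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Hand the lock-bound block to another thread.
            Callable<Integer> task = () -> state.lockedUpdate(5);
            Future<Integer> pending = pool.submit(task);

            // Meanwhile the parent does work that needs no lock.
            int localWork = 0;
            for (int i = 1; i <= 10; i++) {
                localWork += i;
            }

            // Block only at the point where the result is actually needed.
            return localWork + pending.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Whether this pays off depends on how much idle CPU you have and how long the lock waits are. In my case there was plenty of the former and too much of the latter.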
And I went looking for data structures that were both thread-safe and thread-efficient. That's how I found Java's ConcurrentSkipListMap.
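For anyone who hasn't used it, here is a small sketch of ConcurrentSkipListMap shared across threads with no explicit locking. The class name SkipListDemo and the workload are hypothetical. Every thread bumps a per-key counter with merge, and the map keeps its keys sorted throughout.

```java
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical example: names and workload are assumptions for illustration.
class SkipListDemo {
    // Each thread adds 1 to every key in [0, keys). No synchronized blocks needed.
    static ConcurrentSkipListMap<Integer, Integer> fill(int threads, int keys) {
        ConcurrentSkipListMap<Integer, Integer> map = new ConcurrentSkipListMap<>();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int k = 0; k < keys; k++) {
                    // merge retries internally on contention, so no increment
                    // is lost (the function may be invoked more than once).
                    map.merge(k, 1, Integer::sum);
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return map;
    }

    public static void main(String[] args) {
        ConcurrentSkipListMap<Integer, Integer> map = fill(4, 100);
        System.out.println(map.size() + " keys, first=" + map.firstKey()
                + ", last=" + map.lastKey() + ", count at 0=" + map.get(0));
    }
}
```

Because the map is sorted, calls such as firstKey, lastKey, and headMap are available as well, which a ConcurrentHashMap would not give you.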
Everything else here, as I write this, is more important, and maybe you won't get into quite the tangle of threads I did, but, as I say, this is where I ended up. It's really just a rehash of the problems you listed, but faced after the decisions had supposedly all been made and implemented.