
I bought a 24-core processor (AMD Ryzen Threadripper 2970WX) for CPU-intensive workloads like converting large media files and rendering video effects. But it generally uses only 35-55% of the total capacity, even when doing large multi-core jobs over many hours, such as:

  • Rendering video effects with Premiere Pro (with or without graphics acceleration enabled)
  • Exporting a huge file with Adobe Media Encoder (with or without graphics acceleration enabled)
  • Converting a huge video file with Handbrake

Below are screenshots taken from Process Explorer and Core Temp while converting a huge video file with Handbrake.exe. By hovering over the individual core histograms, I can see that Handbrake.exe is the main consumer of every core, but it seems to be limited to about 33-34% usage (that said, a few minutes ago it seemed to increase to 40% usage per-core for Handbrake, for a while, so it's not completely consistent).

The same is true when using Adobe Media Encoder or Premiere Pro to do a large render job. Process Explorer looks about the same.

Is my CPU being under-utilised, and what can I do to un-throttle it if so? Or is it just something to do with how Process Explorer presents the information, and in reality I'm using the full capacity? I don't know much about CPUs, I just want to make sure I'm getting my money's worth!

I considered whether it could be thermal throttling, but Core Temp (2nd screenshot) shows the temperature hovering around 40°C, which doesn't seem high to me.

Screenshot of Microsoft Process Explorer showing current CPU usage over 24 cores, averaging 48% usage

Screenshot of application "Core Temp", showing CPU temperature of 40 degrees Celsius


UPDATE: I just discovered Cinebench, and ran it, and it immediately maxed out all 24 cores at 100% usage (and CPU temp reached 64°C). I guess that rules out thermal throttling. So why are Handbrake and Adobe Media Encoder (the main apps I need to be fast) apparently throttled?

callum
  • Is your disk bandwidth maxed out? Is it offloading some processing to the GPU? Most current software doesn't use the CPU and GPU in parallel; it chooses one or the other. – Brannon Jan 14 '21 at 13:13
  • @Brannon I can confirm I'm not using the GPU, at least when using Adobe Media Encoder (which I have set to CPU-only). How to find out if disk bandwidth is maxed out? It seems unlikely, it's a very fast NVMe drive. That said, I do sometimes get weird unexplained UI freezes in Explorer and in 'Save' dialogs, lasting 10 seconds or so, but only happens a couple of times a day. – callum Jan 14 '21 at 13:32
  • Another thing: I've got 128GB RAM, which is barely utilised, so if the bottleneck could be disk bandwidth, then I'd welcome any suggestions on how to configure Windows 10 to make better use of RAM! – callum Jan 14 '21 at 13:39
  • How big are the videos? One ugly option might be to set up a RAM disk and see if that improves performance. – Journeyman Geek Jan 14 '21 at 14:08
  • @JourneymanGeek - just tried with a ramdisk as you suggest, i.e. using Handbrake to compress a 19GB video file, both reading and writing to a ramdisk. Unfortunately the CPU usage looks exactly the same. – callum Jan 14 '21 at 14:26
  • 24 cores, 48 threads. So if I/O and RAM are fast enough, then isn't 50% maxed out? – Hennes Jan 14 '21 at 14:39
  • That should make it reasonably plausible that your storage may not be the bottleneck – Journeyman Geek Jan 14 '21 at 14:47
  • @Hennes I don’t know enough about how it works. Are you saying it’s 48 threads between 24 cores, so each thread can only use half of a core? I still don’t get why the video conversion task can’t use the full capacity, couldn’t it just be parallelised to 48 threads in that case? – callum Jan 17 '21 at 15:27
  • This is cutting tons of corners, but as a simplified analogy: 2 threads per core is like placing two people at one desk, each with their own notepad but with only one phone, one calculator, etc. Depending on what they're doing, one might be on the phone while the other looks something up in storage, so both threads run at full speed. But if both need the phone, one has to wait and no speed is gained. On average, a CPU with threading performs about a third faster than one without, for much less than a third of added CPU cost, so it's a good thing to have. – Hennes Jan 17 '21 at 16:34
  • So I would not expect all 48 threads to run at 100%. Just how much they should (or could) run at is a wild guess, since it depends on tons of variables. You could reboot, disable threading and see if the average core is suddenly working harder (but undo that after testing). That would mostly be an interesting data point for diagnosis; other causes are more likely, like file locking, I/O bandwidth, memory bandwidth, ... – Hennes Jan 17 '21 at 16:38
  • Do keep in mind that software work is, in general, not infinitely subdivisible. Video encoding is still an algorithm that has well-defined limits in how much of it can run in parallel. Sometimes, efficiency can be sacrificed to enable more parallel processing. Different codecs have different characteristics in this regard. – Daniel B Aug 25 '22 at 19:53
  • Agree with @DanielB. You do not specify which codec you are using for conversions, but not all algorithms scale well to a large number of threads. x264, for instance, can suffer quality loss when using too many threads; x265 is supposed to maintain quality but can suffer latency because of communication between threads. – PierU Aug 25 '22 at 21:01
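Daniel B's point about limited subdivisibility can be made concrete with Amdahl's law: if only a fraction p of the work can run in parallel, the best possible speedup on n cores is 1/((1 - p) + p/n), no matter how many threads are spawned. A quick sketch (the p values are illustrative guesses, not measurements of any real encoder):

```python
# Amdahl's law: best-case speedup on n cores when only a fraction p
# of the work is parallelisable. The p values below are illustrative.

def speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

for p in (0.80, 0.95, 0.99):
    s = speedup(p, 24)
    print(f"p={p:.2f}: {s:5.2f}x speedup, avg utilisation {s / 24:.0%}")
```

With p = 0.95 the theoretical ceiling on 24 cores is about 47% average utilisation, which lands squarely in the 35-55% range the question reports.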

2 Answers


I'm not entirely sure what the issue is, but here are my 2 cents that won't fit in a comment. Maybe someone can make use of it.

Firstly, video processing workload is best processed on a GPU. A powerful CPU is good, but a good GPU is better; it's simply more suited for the job. See this answer for details. Going for a powerful CPU may be an expensive dead end. You've already bought it though, and maybe it fits your workflow better for some reason, so there's that.

Secondly, a 2nd generation Threadripper is less than ideal because it consists of two NUMA domains. What that means is that it's basically two CPUs glued together into one package and configured to appear as a single CPU. This approach has a potential problem: each of these internal CPUs forms its own NUMA domain, and they can't access each other's caches directly. When a workload is moved to a core on the other CPU, the relevant cached data must either be moved over or flushed to RAM (I'm not sure how this works exactly). This adds latency and wastes time that could be spent on computation. Some programs are NUMA-aware, meaning they will manage their workload to avoid this memory shuffling, but your software may not take it into account. This issue was remediated in 3rd generation Threadrippers, if I remember correctly.
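If you want to test whether cross-node traffic is part of the problem, Windows' built-in `start` command can confine a new process to a single NUMA node. A rough sketch, not a tested recipe (the install path, node number, and file names are illustrative):

```
start /NODE 0 /WAIT "" "C:\Program Files\HandBrake\HandBrakeCLI.exe" --input in.mkv --output out.mkv
```

If per-core utilisation rises noticeably when the encoder only sees one node, NUMA shuffling is a likely contributor.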

gronostaj
  • Didn't those Threadrippers have a mode that kept jobs on one die? Might help performance here, even if OP doesn't get to use all their cores – Journeyman Geek Jan 14 '21 at 14:07
  • Thanks, I do actually have a good GPU too (2080 Ti), but it isn't always very reliable; I usually set Premiere Pro to "CPU-only" because certain effects result in black screen when using GPU acceleration (searching online suggests this is a common problem). Regarding 2nd point, that's very interesting, but could it really account for the massive underutilisation of CPU I'm seeing in Process Explorer? – callum Jan 14 '21 at 14:10
  • @callum I'm not sure if it explains basically halving the CPU load on a Threadripper (hence the disclaimer in the first paragraph), but it's not unheard of for systems with actual dual CPUs to display such behavior – gronostaj Jan 14 '21 at 14:24
  • @gronostaj Interesting, thanks. Did you see my update about Cinebench successfully using 100% on all cores? So the problem is only apparent in real-world programs that I actually care about being fast, like Premiere Pro, Media Encoder, and Handbrake. If the problem you describe with dual CPUs could account for such behaviour, would you expect it to affect all programs, including stress-testers like Cinebench? – callum Jan 14 '21 at 14:31
  • @callum See the "NUMA aware" part of my answer. Programs can be coded to work around this issue. It's additional work though, and NUMA systems are not common, so it may not pay off for developers. – gronostaj Jan 14 '21 at 14:33

I don't fully agree with @gronostaj. A good GPU is great for video rendering, games, and the like, but in my experience software (CPU) transcoding can usually be superior in quality and is far more flexible than hardware (GPU) transcoding. SVT-AV1 on a GPU, anyone? I was not aware of the 2nd gen Threadripper's NUMA nodes, so thanks for that info.

You don't specify an encoder. The information below is for x265 with 10-bit encoding. If you're interested in AV1, use only the SVT-AV1 encoder, since it is designed for far more parallelism than the reference AOM encoder.

With NUMA in mind and HandBrake on the screen, perhaps try the parallel options from https://x265.readthedocs.io/en/master/cli.html#performance-options

Specifically in the Advanced Options box try:

pools:pmode

This will enable worker pools on all nodes. If pools alone isn't right try

pools=24:pmode:wpp
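For comparison outside HandBrake, a standalone x265 command using the same performance options might look roughly like this (file names are placeholders; `"+,+"` assumes the two NUMA nodes described in the other answer, with `+` meaning "use all cores on that node"):

```
x265 --input source.y4m --pools "+,+" --pmode --wpp --output out.hevc
```

The x265 documentation describes `--pmode` as spending extra CPU cycles in exchange for speed, which suits a many-core chip that would otherwise sit half idle.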

at2010