
I'm trying to show that the stalls due to branch misprediction may be reduced by a certain optimization. My intuition suggests this could be due to a reduction in the stall cycles related to loads that delay the branch outcome.

For this, I was planning to use the Linux Perf utility to read the hardware performance counter values. There is a related metric called `branch-load-misses`; however, no useful description of it is provided.

Can anybody please confirm if this is the right metric to use? If not, please suggest a related metric that could be of help.

Thank you

1 Answer


For measuring branch prediction and the branch misprediction rate, you can use Intel VTune Profiler. Download link: https://software.intel.com/content/www/us/en/develop/tools/oneapi/components/vtune-profiler.html#gs.bh5zrq

Just create a custom VTune analysis with two events:

BR_INST_RETIRED.ALL_BRANCHES

BR_MISP_RETIRED.ALL_BRANCHES

You'll need to divide one by the other manually to get the ratio, though.
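If you'd rather stay with Linux `perf`, the same two Intel events can usually be requested by name, and the ratio computed by hand. This is a sketch, not a definitive recipe: the `br_*` event names vary by CPU model (check `perf list`), `./my_program` is a placeholder for your workload, and the counts below are made-up example numbers.

```shell
# Collect the two counts with Linux perf (requires PMU access; shown
# commented out because it only works on real hardware):
#   perf stat -e br_inst_retired.all_branches,br_misp_retired.all_branches ./my_program
# perf's generic aliases report the miss rate for you automatically:
#   perf stat -e branches,branch-misses ./my_program

# Dividing one count by the other, e.g. with awk (placeholder numbers
# standing in for the counts perf would print):
branches=1000000
misses=12000
awk -v b="$branches" -v m="$misses" \
    'BEGIN { printf "misprediction rate: %.2f%%\n", 100 * m / b }'
# prints: misprediction rate: 1.20%
```

With the generic `branches`/`branch-misses` pair, `perf stat` prints the miss percentage in its summary line, so the manual division is only needed when you use the raw event names.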

BR_INST_RETIRED.ALL_BRANCHES

Counts all (macro) branch instructions retired.

BR_MISP_RETIRED.ALL_BRANCHES

Counts all the retired branch instructions that were mispredicted by the processor.

A branch misprediction occurs when the processor incorrectly predicts the destination of the branch.

When the misprediction is discovered at execution, all the instructions executed in the wrong (speculative) path must be discarded, and the processor must start fetching from the correct path.

In case you don't know, a custom analysis in VTune can be created by selecting any pre-defined analysis and pressing the 'Customize...' button in the top-right corner.

E.g., you can select Microarchitecture Exploration, uncheck all the checkboxes there, press 'Customize...', scroll down to the table of CPU events, and uncheck the events you don't need / add the ones you do.

Regards

  • Ok, yes those are useful HW events to look at, but what HW event does Linux `perf` map its `branch-loads` and `branch-load-misses` to? You're not answering that part, or the querent's attempt to measure the total misprediction *penalties* rather than just the misprediction *rate*. – Peter Cordes Sep 16 '21 at 11:01
  • (Linux `perf stat` already calculates the miss rate if you ask it to measure both `branches` and `branch-misses`, which I think on Intel HW maps to `br_misp_retired.all_branches` or `br_misp_retired.all_branches_pebs`) – Peter Cordes Sep 16 '21 at 11:02
  • And yes the CPU has to discard uops from the wrong path, but the surrounding code and microarchitectural conditions can have an effect on how many that is, and how much overall throughput that costs. [Avoid stalling pipeline by calculating conditional early](https://stackoverflow.com/q/49932119) discusses one way that branch misses can be cheaper, by *not* having the branch condition be the end of a long dep chain that can't be confirmed for a long time. – Peter Cordes Sep 16 '21 at 11:05