
I'm profiling the performance of PETSc's linear solvers. As I understand it,

$$\text{speedup}=\frac{\text{Sequential Time}}{\text{Parallel Time}}.$$

I know that running the parallel code on one processor can be used as a proxy for sequential performance. However, I don't think it is a good measure of an equivalent sequential code because of the parallel overhead it incurs: a dedicated sequential code is often faster than the parallel code run on a single processor. I suppose that I could search for numerical libraries that implement the same solver, but there is no guarantee that the algorithm is really equivalent.

As I understand it,

$$\text{Parallel Time on one processor} = \text{Sequential Time} + \text{Parallel Overhead}.$$

Thus, if there is a way to quantify the parallel overhead, we could subtract it from the parallel time on one processor to get a better measure of the sequential time.
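For concreteness, here is a tiny sketch (in Python, with made-up timings; the overhead value is an assumed estimate, not something I know how to measure) of how the two speedup definitions would differ:

```python
# Hypothetical timings (seconds). T1 is the parallel code run on one
# process, Tp on p processes, and `overhead` is an assumed estimate of
# the parallel bookkeeping cost still incurred on one process.
T1 = 120.0       # parallel code, 1 process
Tp = 35.0        # parallel code, 4 processes
overhead = 8.0   # assumed parallel overhead on 1 process

# Speedup measured against the 1-process parallel run:
speedup_vs_parallel = T1 / Tp

# Speedup against the estimated "equivalent sequential" time,
# obtained by subtracting the overhead from the 1-process time:
T_seq_est = T1 - overhead
speedup_vs_sequential = T_seq_est / Tp

print(speedup_vs_parallel)    # ≈ 3.43
print(speedup_vs_sequential)  # 3.2
```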

So, my questions then would be:

  1. Is there a way to compute the parallel overhead of a parallel code run on a single processor when no sequential code is available?
  2. Is it really necessary? Is the parallel performance on one processor good enough to approximate the sequential performance, in general?
Dan
Paul
  • Speedup is a relative measure and does not say much about absolute performance. Speedup can be measured either against a pure sequential implementation doing the same useful work or against a parallel implementation (with overhead, as you indicate). In each case, the speedup can be quite different due to the parallel overhead and/or possibly different execution paths in the instruction stream of the application. Why do you want to estimate the parallel overhead? – Allan P. Engsig-Karup Jan 29 '12 at 16:14
  • @AllanP.Engsig-Karup: I think my primary motivation is just knowing how to get a reasonable estimate of the sequential performance when all I have is the parallel code. – Paul Jan 29 '12 at 16:45
  • @Paul: I will try to give an actual answer later, but there are often tradeoffs made in order to get a scalable algorithm. Speedup should really be measured relative to the best serial implementation that, given the same input, produces the same output (modulo small perturbations). – Jack Poulson Jan 29 '12 at 17:36
  • @Jack: Agree. Also, in my opinion, relative speedups are not very useful unless absolute timings are given as well. – Allan P. Engsig-Karup Jan 29 '12 at 17:50
  • @Jack: I disagree. The best serial implementation might not be yours. Speedup of a code should probably be measured against its own serial performance. Then, your implementation should be measured against other implementations in both relative and absolute terms. Using an unrelated serial implementation in the numerator is likely to be misleading. – Bill Barth Jan 29 '12 at 19:53
  • @BillBarth: Sure, but it should certainly be part of the discussion. I agree that it would confuse the graph if unexplained. I am simply arguing that artificial coefficients should be blatantly obvious to outside observers. – Jack Poulson Jan 29 '12 at 19:58
  • @Jack: Agreed. As long as there's an absolute performance comparison between methods/implementation as well, there should be no confusion. – Bill Barth Jan 29 '12 at 19:59
  • A reference for the best serial performance can be the parallel version executed in sequential mode if there is no original sequential-only version available. – Allan P. Engsig-Karup Jan 29 '12 at 20:22
  • Just measuring speedup is in general not useful, since we could be comparing against a bad and inefficient implementation. Knowledge of the execution time of a sequential version can be useful for assessing performance gains in parallel; ultimately, however, most people in practice are interested in how much time and how many resources are needed to complete a given task in absolute terms (e.g., a day? a month?). – Allan P. Engsig-Karup Jan 29 '12 at 20:26

2 Answers


I think that as long as you say what you measure speedup against, no one will fault you for using the time the parallel version of the code takes to run on one processor. If you also give the total time for one of your cases (say the uni-processor time), then people will be able to compare your implementation to others in the literature or to their own.

For some problems, a uni-processor result may not even be possible to compute given memory or time constraints. Given that, most people understand, when looking at speedup results, that they are computed relative to the smallest number of processors available, and that the uni-processor datum, when present, is the parallel code run on one processor.

There are no hard and fast rules, but you should be explicit about what you are doing and give your reader enough information to compute other quantities that might interest them.
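For example, a speedup/efficiency table relative to the smallest processor count measured, with the absolute baseline time reported alongside, might be produced like this (the timing numbers are invented for illustration):

```python
# Hypothetical timing data: processor count -> wall-clock time (s).
times = {1: 240.0, 2: 130.0, 4: 70.0, 8: 40.0}

base_p = min(times)    # smallest processor count measured
base_t = times[base_p] # absolute baseline time, reported explicitly

print(f"baseline: p={base_p}, t={base_t:.1f}s (absolute time)")
for p in sorted(times):
    speedup = base_t / times[p]
    efficiency = speedup / (p / base_p)
    print(f"p={p:2d}  t={times[p]:6.1f}s  "
          f"speedup={speedup:4.2f}  eff={efficiency:4.2f}")
```

Reporting the absolute baseline time alongside the relative numbers is what lets readers compare against other implementations.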

Bill Barth

I am not familiar with PETSc internals (unlike other PETSc experts here), but IMHO there should be no parallel overhead with PETSc as long as you run your job as a single process (i.e., no partitioning, etc.).

Remember that PETSc can also be installed without MPI, meaning that whatever small MPI overhead might be there (assuming any actual MPI calls are made at all when running on one core, which I highly doubt) can also be discounted.

This holds, of course, only when the parallel overhead is mostly communication and not algorithmic.
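For instance, PETSc's configure script accepts `--with-mpi=0` to build without MPI at all, so one could time the same solver in a no-MPI build and in the MPI build run on one process to bound any residual MPI cost (flags other than `--with-mpi=0` are illustrative):

```shell
# Sketch: build a no-MPI PETSc to compare against the MPI build run as
# a single process. --with-mpi=0 disables MPI entirely;
# --with-debugging=0 requests an optimized build.
./configure --with-mpi=0 --with-debugging=0
make all
# Then time the same solver in both builds on one process; the
# difference bounds the MPI-related part of the 1-process overhead.
```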

stali
  • That may be true for PETSc, but it is probably not true for parallel code in general. – Paul Jan 29 '12 at 16:43