2

I am analysing CPU useage on a large network. In order to do that, I was provided with a large excel sheet. It contains batchID (means we dedicate a CPU to run that task) startTime, endTime (means we know the CPU is fully occupied during this time).

Based on this data, I need to understand how many batches are running at a particular instance. Therefore, I will use a chat with x-axis been the time and y-axis been the count of batches running at each time instance.

The whole file is over 15000 rows over two days's data. Here is a fraction of it.

BATCHID     startTime             endTime
560062  13/10/2011 11:59:23 13/10/2011 11:59:26
560061  13/10/2011 08:59:18 13/10/2011 08:59:21
560060  13/10/2011 05:59:21 13/10/2011 05:59:30
560059  13/10/2011 02:59:34 13/10/2011 02:59:43
560058  13/10/2011 01:57:24 13/10/2011 01:57:29
560057  13/10/2011 01:57:24 13/10/2011 01:57:28
560056  12/10/2011 23:59:19 12/10/2011 23:59:28
560055  12/10/2011 20:59:21 12/10/2011 20:59:30
560054  12/10/2011 18:02:13 12/10/2011 18:02:22
560053  12/10/2011 18:02:13 12/10/2011 18:02:21
560052  12/10/2011 18:02:12 12/10/2011 18:02:21
560051  12/10/2011 18:02:07 12/10/2011 18:02:16
560050  12/10/2011 18:02:03 12/10/2011 18:02:11
560049  12/10/2011 18:02:10 12/10/2011 18:02:19
560048  12/10/2011 18:02:11 12/10/2011 18:02:16
560047  12/10/2011 18:02:09 12/10/2011 18:02:13
560046  12/10/2011 18:02:04 12/10/2011 18:02:13
560045  12/10/2011 18:02:12 12/10/2011 18:02:21

Requirment:

  1. We need array to contain the time slice data. This could be every 1 minuts or 5 minuts. If we need to analyse two days for every 1 minuts interval, we could then need 2880 data points for the x-axis.
  2. Because at any instance there are could be many jobs running. We'll need to device a mechnism to count number of running batches at that time slice.

I suspect Excel 2003 can do a good job as the number of columns is limited to 256.

I'm welcoming any advise on how to permoer this task efficiently in Octave/MATLAB, ORACLE PL/SQL, R or Bash Script.

Dean
  • 121
  • 1

1 Answers1

3

Satisfying requirement (1) is trivial: just enter the start time in a cell, enter the next time in the array below it, select the two, and drag down as far as you like. Excel will extend this to an arithmetic progression of times. Alternatively, place the start time in one cell, then in the cell below it add a small increment:

=E2+Duration

Requirement (2) uses a simple trick: the number of batches running at time $t$ must equal the number of batches started on or before $t$ minus the number of batches ended on or before $t$. (This trick would be the basis for an efficient implementation in any platform, not just Excel.) Count these with formulas like

=COUNTIF(Start, "<=" & E2)

and

=COUNTIF(End, "<=" & E2)

(where E2 is a cell containing $t$, [Start] is a named range for the "startTime" data, and [End] is a named range for the "endTime" data).

Here's an example using a small part of your data. (Extending it to 15000 rows and, say, 2800 time periods will be no problem and should recalculate almost instantly. The graph should be changed to a scatterplot to accommodate that many time periods, though.)

Screenshot

whuber
  • 322,774
  • This is really a clever method. But it still has a problem. For example, there are should be 6 batches running at 18:02:11 for row 3,4,5,6,7 and 8. – Dean Oct 15 '11 at 12:35
  • @Dean The code is correct. What you're overlooking is the rounding that occurs in Excel's display of the [Period] values: note that the displayed difference between periods is not constant, indicating that there are some fractional seconds here. As you can see in the [Started] column, all six starts are counted, but also one batch ended at exactly 18:02:11 (row 7), which is counted in the [Ended] column. Thus, immediately after 18:02:11, there really were only 6-1=5 batches still running. To avoid confusion, it's probably wise to set all the period cutpoints at fractional seconds. – whuber Oct 16 '11 at 14:56