
I have 4 tables (I've specified main columns for each):

  1. monitors (id)
  2. monitor_node (monitor_id, node_id, average_response_time)
  3. monitor_checks (id, monitor_id, node_id, response_time)
  4. nodes (id)

Their relations:

1 monitor - N checks

N monitor - N nodes

1 check - 1 monitor & 1 node


So, I'm logging a lot of checks for each monitor. monitor_checks will have millions of rows (up to roughly 500 million).

When I insert another bunch of checks (~1k rows), I need to recalculate the average response time per node (table monitor_node, column average_response_time).

I'm pretty sure I'm doing this the wrong way and need a faster solution. What I do now: after inserting 1k rows into monitor_checks, I calculate the average response time for each monitor (grouping by monitor_id). Based on that I build an array and use insert ... on duplicate key update against monitor_node to do a bulk update. Apart from the average response time I calculate some other attributes, which go along with it in the same bulk update.
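Condensed into a single statement (in my code the select and the upsert are separate steps and a few more attributes are set), the recalculation for one node looks roughly like this:

insert into monitor_node (monitor_id, node_id, average_response_time)
select monitor_id, 2 as node_id, avg(response_time)
from monitor_checks
where node_id = 2
group by monitor_id
on duplicate key update
    average_response_time = values(average_response_time);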

Making the array with info and executing insert ... on duplicate key update is fast enough.

The slow query is:

select monitor_id, avg(response_time) as avg_response_time 
from `monitor_checks` 
where `node_id` = 2 
group by `monitor_id`

which takes something like 10-20 seconds on ~4 million rows.

I also realised that it's not necessary to base the average on all checks; the last 50-100 rows per monitor would be enough. But I couldn't figure out how to do this efficiently. I found a fancy solution for grouping checks per monitor and limiting each group to a number of rows (How to get latest 2 records of each group), but it took way too long as well.
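A window-function version of that idea (the last 100 checks per monitor for one node) would look roughly like this, but it still has to read and rank every check of the node before limiting, so it was slow for me as well:

select monitor_id, avg(response_time) as avg_response_time
from (
    select monitor_id, response_time,
           row_number() over (partition by monitor_id order by id desc) as rn
    from monitor_checks
    where node_id = 2
) ranked
where rn <= 100
group by monitor_id;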

So the question is: how to quickly recalculate average response times in monitor_node when inserting ~1k rows in monitor_checks table?

DB Fiddle: https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=bd95afc030361bf1d87f8bc5c3935c2f

Final desired result:

monitor_id  node_id  average_response_time
1           1        0.30
1           2        0.25
2           1        0.55
2           2        0.65
Victor
  • @Akina my bad. Updated :) – Victor Mar 05 '20 at 09:13
  • I converted your DDLs into a fiddle. Please edit it: 1) remove all fields which are not necessary for the task 2) add some sample data (5-10 records per table is enough) 3) add a separate insert which emulates your "bunch of checks" (~5 records too) 4) show the final table state which you need after processing the inserted data. – Akina Mar 05 '20 at 09:30
  • @Akina hopefully done, never saw the site before, thank you! – Victor Mar 05 '20 at 10:55
  • In the question, I've updated it – Victor Mar 05 '20 at 11:05
  • 1
    The calculation and update are simple - fiddle, and there is not a surprise that this queries are expensive on a huge data array. it's not necessary to get average time based on all checks, I can as well use last 50-100 rows. - You mean last 50-100 at all, or 50-100 for each separate (monitor_id, node_id) pair? And check YOUR server version - is it 8.0 (like the fiddle is used now) or it is 5.x? – Akina Mar 05 '20 at 11:25
  • I use 8.0.19, just checked. I need to get the average per pair, but I filter everything by node_id in advance. So, at the beginning of my code I have a monitors array and a node id (I retrieve data each time for a specific node). Then it'd be okay for me to get the last 50-100 checks for each monitor, as long as the checks are filtered by node_id. So yes, generally speaking I keep stats for each pair, and also for the whole monitor (I update its stats separately, getting the average from monitor_node, which is not that expensive since there are not that many rows) – Victor Mar 05 '20 at 11:37
  • fiddle - I added a record-count restriction. Test the query which calculates the averages on your real data array - does it take less time than the calculation over the whole data? – Akina Mar 05 '20 at 11:51
  • Hmm, for some reason it just keeps executing endlessly. Honestly I haven't used such operators before, so maybe I'm doing something wrong, but I just connected to MySQL and copy-pasted the query – Victor Mar 05 '20 at 12:10
  • This means the server decided to use a full scan, not the index. Kill it. – Akina Mar 05 '20 at 12:21
  • Please test SELECT COUNT(DISTINCT monitor_id, node_id) FROM monitor_checks - what is the result and the time? – Akina Mar 05 '20 at 12:24
  • Execution: 5 s 723 ms, the result is 6570 btw; select count(id) returns 3 956 786 rows – Victor Mar 05 '20 at 12:36
  • 1
    5s for list and 10-20s for average calculations? hmmm... I thought that 50-100 values per pair would be less than 1%, but you get > 15%. Then there is simply nothing to optimize. See the answer Gerard H. Pille - he is right, it is cheaper to store an array of the last 100 values and recalculate it for each pair, adding new values and discarding the oldest ones. I have tried one more variant fiddle but it will not help too on such statistic. – Akina Mar 05 '20 at 12:41
  • did you try to change columns order in index and create it on (node_id, monitor_id, response_time)? – NikitaSerbskiy Mar 05 '20 at 12:53

2 Answers


I would add an array column to monitor_node containing the most recent 100 response times. When adding new response times, drop the ones beyond the first 100. That way no extra data is needed when a new average has to be calculated.

As an image may say more than a thousand words, I've updated the DB Fiddle. A trigger does the job, but that logic could be incorporated in the job doing the inserts.

create trigger monitor_checks_air
  after insert on monitor_checks for each row
  begin
  update monitor_node mn
    -- prepend the new response time and keep only the newest 100 entries
    set mn.rec_resp_times =
          json_extract(
             json_array_insert(mn.rec_resp_times, '$[0]', new.response_time),
             '$[0 to 99]'),
        -- single-table SET assignments are applied left to right,
        -- so this average is computed over the freshly trimmed array
        mn.average_response_time = (
            select sum(jt.rt)
              from json_table(mn.rec_resp_times,
                             '$[*]' columns( rt double path '$[0]')) as jt
          )
          / json_length(mn.rec_resp_times)
    where mn.monitor_id = new.monitor_id
      and mn.node_id    = new.node_id;
  end;
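For existing (monitor_id, node_id) pairs the array has to be seeded once before the trigger (or the insert job) takes over. A one-off backfill could look roughly like this, assuming monitor_checks.id grows with time so the highest ids are the most recent checks (json_arrayagg gives no guaranteed element order, which only affects which of the seeded values fall out of the 100-element window first):

update monitor_node mn
join (
    select monitor_id, node_id, json_arrayagg(response_time) as times
    from (
        select monitor_id, node_id, response_time,
               row_number() over (partition by monitor_id, node_id
                                  order by id desc) as rn
        from monitor_checks
    ) ranked
    where rn <= 100
    group by monitor_id, node_id
) recent on recent.monitor_id = mn.monitor_id
        and recent.node_id    = mn.node_id
set mn.rec_resp_times = recent.times;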

The fastest query is the one you don't execute.

Gerard H. Pille
  • What's the difference - building such an array and storing it, or calculating the average over this array and storing that? – Akina Mar 05 '20 at 11:52
  • You would have to build the array only once, and from then on, just update it when you register new checks. – Gerard H. Pille Mar 05 '20 at 11:54
  • But the averages are STORED in the table monitor_node after calculation - they're calculated only once. Then they are taken from the table until the next bunch of data is inserted. – Akina Mar 05 '20 at 11:56
  • I understand, but my proposition is to add the data used to calculate it. That way, no need to look it up. – Gerard H. Pille Mar 05 '20 at 11:58
  • But after we calculate and store, we need neither the lookup nor the calculation... – Akina Mar 05 '20 at 12:01
  • You do! Victor said : "When I insert another bunch of checks, I need to calculate ...". – Gerard H. Pille Mar 05 '20 at 12:05
  • But after "insert another bunch of checks" the "most recent 100 response times" must be renewed, its previous state is not applicable yet. – Akina Mar 05 '20 at 12:08
  • @GerardH.Pille woah, thank you so much for your effort! I wanted to ask: if I implement this logic in my code, wouldn't it be much slower than having a trigger? I've almost never used triggers and am going to test it now, but I realise that if I do the same in my code I'd run 1000 queries for 1000 monitors, getting the latest 100 checks. And time is crucial for me, I literally need to handle all these checks within one minute, so probably I need to move all the logic into triggers or stored procedures? – Victor Mar 05 '20 at 17:02
  • Adding this logic to the routine responsible for inserting the batch of 1000 checks should be a lot faster than having it executed by a trigger. You'd only need the trigger solution if you cannot control where the inserts come from. – Gerard H. Pille Mar 05 '20 at 17:08
  • Adding 1000 records with DB Fiddle took 3 seconds. https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=d80bb20582ea2bd72a9b0da6de79355e – Gerard H. Pille Mar 05 '20 at 18:28
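A sketch of the batch-level variant described in the last comments, assuming the freshly inserted checks can be identified by an id watermark (the @last_processed_id variable here is hypothetical, and json_arrayagg gives no guaranteed order, which only affects which values drop out of the 100-element window first):

update monitor_node mn
join (
    select monitor_id, node_id,
           json_arrayagg(response_time) as new_times
    from monitor_checks
    where id > @last_processed_id  -- hypothetical marker for the freshly inserted batch
    group by monitor_id, node_id
) b on b.monitor_id = mn.monitor_id
   and b.node_id    = mn.node_id
set mn.rec_resp_times = json_extract(
        json_merge_preserve(b.new_times,
                            coalesce(mn.rec_resp_times, json_array())),
        '$[0 to 99]');

average_response_time can then be recomputed from the trimmed array in a second step, the same way the trigger does it.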

Build and maintain a summary table. It would probably look something like the following. I am assuming you care about the average response for each hour for each monitor-node pair?

CREATE TABLE response_summary (
    hr DATETIME NOT NULL,  -- the hourly bucket (datetime truncated to the hour)
    monitor_id ...,
    node_id ...,
    ct INT UNSIGNED NOT NULL,
    sum_resp FLOAT NOT NULL,
    PRIMARY KEY(node_id, monitor_id, hr),
    INDEX(monitor_id),
    INDEX(hr)
) ENGINE=InnoDB;

When you have a batch of 1K readings, there are multiple ways you can pour it into that table. I'll code it assuming you do most of the work in MySQL. (Some of this could be done in the client instead.)

CREATE TEMPORARY TABLE batch (dt, monitor_id, node_id, response);  -- column types omitted; dt is the raw check datetime
INSERT INTO batch VALUES (,,,), (,,,), (,,,), ..., (,,,);  -- the 1K readings
INSERT INTO response_summary
        (hr, monitor_id, node_id,
         ct, sum_resp)
    SELECT CONCAT(LEFT(dt, 13), ':00:00') AS hr,
           monitor_id, node_id,
           COUNT(*), SUM(response)
        FROM batch
        GROUP BY node_id, monitor_id, hr
    ON DUPLICATE KEY UPDATE
        ct = ct + VALUES(ct),
        sum_resp = sum_resp + VALUES(sum_resp);

For more discussion:

The order of the columns in the suggested PK is optimal for your where node_id = 2 group by monitor_id.
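For example, the per-node averages that the original slow query computed from monitor_checks can then be read from the summary instead (a sketch against the table definition above):

SELECT monitor_id,
       SUM(sum_resp) / SUM(ct) AS avg_response_time
FROM response_summary
WHERE node_id = 2
GROUP BY monitor_id;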

To get the average for a day:

    SUM(sum_resp) / SUM(ct)

This works to find the average over however many samples you have. If you need, for example, the day's average to be the average of 24 averages, then something else should be done.
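As a sketch, the average for one monitor-node pair over one day (the ids and the date are just example values):

SELECT SUM(sum_resp) / SUM(ct) AS avg_response_time
FROM response_summary
WHERE node_id = 2
  AND monitor_id = 1
  AND hr >= '2020-03-05'
  AND hr <  '2020-03-05' + INTERVAL 1 DAY;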

Rick James