Why is "select join insert" so much faster than "loop insert"?

Question

MySQL docs state that the time required for inserts has these approximate proportions:

Connecting: (3)

Sending query to server: (2)

Parsing query: (2)

Inserting row: (1 × size of row)

Inserting indexes: (1 × number of indexes)

Closing: (1)

In other words, if we use stored procedures to do inserts into a table without indexes, the deciding factor for the time required would simply be:

Inserting row: (1 × size of row)

So I test with a stored procedure that made 2 million inserts:

create table t(i int)engine innodb;
delimiter $
    create procedure f()begin
        while @a<=2000000 do
            insert t select null;
            set@a=@a+1;
        end while;
    end$
delimiter ;

set@a=1;
begin;call f;commit;drop procedure f;
select count(*)from t;drop table t;

The query call f; takes 29.44 seconds.

Rolando suggests (below) to use a prepared statement, so I updated the stored procedure to use prepare:

create table t(i int)engine innodb;
delimiter $
    create procedure f()begin
        while @a<=2000000 do
            execute e; # # # # # # # # # # # # # # changed insert statement
            set@a=@a+1;
        end while;
    end$
delimiter ;

set@a=1;
prepare e from'insert t select null';  # # # # # # added prepare statement
    begin;call f;commit;drop procedure f;
deallocate prepare e;  # # # # # # # # # # # # # # added optional deallocate prepare statement
select count(*)from t;drop table t;

However, the query call f; now still takes 30.27 seconds.

It seems like there's no way to beat raw "begin + loop insert + commit" (unless we resort to file loading using load data infile).

However, if we insert using select joins, it only takes 6.28 seconds:

create table t(i int)engine innodb;
insert t select null
from(
    select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10
)`10`
join(
    select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10
)`100`
join(
    select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10
)`1000`
join(
    select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10
)`10k`
join(
    select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10
)`100k`
join(
    select 1 union select 2 union select 3 union select 4 union select 5 union select 6 union select 7 union select 8 union select 9 union select 10
)`1m`
join(
    select 1 union select 2
)`2m`;
select count(*)from t;drop table t;

I'm using MySQL (though I suspect this isn't a MySQL-specific issue).

Why is "select join insert" so much faster than "loop insert"?

What's the explanation for this oddity?

score 2 · Answer 1 · edited Jun 15 '20 at 09:05

QUERY #1

Each time you do an INSERT, you are doing this under the hood

SET @sql = 'insert t select null';
PREPARE s FROM @sql;
EXECUTE s;
DEALLOCATE PREPARE s;

Within the stored procedure, you fully parse, compile, execute and deallocate structures for the prepared SQL statement 2 million times.

QUERY #2

Running insert t select null from(, you fully parse, compile, execute and deallocate structures for the prepared SQL statement ONCE !!!

The execution of the Cartesian joins is really a self-contained atomic operation. Though it does perform iterations ( O(n²) running time ) to form the result table, it is not tethered by creating/destroying query parsing structures with each iteration as you are doing in the stored procedure (despite an O(n) running time). The execution plan of the single INSERT may be uglier to behold, but will surely run faster.

The inline queries are most likely memory tables. The resulting join could be a temporary MEMORY table or temporary MyISAM table. Either way, the formation of the result set is a single operation.

CONCLUSION

Inserting 2 million rows on a single INSERT is always faster that than doing an INSERT of a single row 2 million times. In principle, this would be more evident if you connected, inserted and disconnected. Connecting once beats connecting 2 million times (See my post How costly is opening and closing of a DB connection?)

@RickJames I know I sound silly saying that. I'll try to rephrase ... — RolandoMySQLDBA, Apr 08 '15 at 19:41
@RolandoMySQLDBA, This doesn't seem right. I've tried with a manual prepare before the while loop: prepare e from'insert t select null' and then I replaced the insert statement in the code above with execute e (see the update), but then the timing is 31.23 seconds. As we can see, allocation and deallocation is done only once, but the performance is still slow. Do you know what's the explanation for this oddity? — Pacerier, Apr 09 '15 at 04:01
@RolandoMySQLDBA, Am I missing something? I'm not sure why this answer is getting upvotes. As I mentioned above, allocation and deallocation is done only once since it is a prepare execute, but the performance is still slow. — Pacerier, Apr 11 '15 at 17:39

Rick James · Answer 2 · 2015-04-11T17:49:18.373

1

I have a simpler way to put it.

"Batched INSERTs and LOAD DATA run 10 times as fast as single-row INSERTs."

By "batching", I mean INSERT INTO t (a,b) VALUES (1,2), (2,3), .... The optimal number is between 100 and 1000 rows per INSERT. Beyond that, you get into diminishing returns.

Here are some issues that impact performance, especially for Batched Inserts:

innodb_flush_log_on_trx_commit = 1 is secure, but takes ~10ms (typical spinning media) per INSERT (unless part of a transaction).
When Replicating a large INSERT, you are clogging up replication, so you might want to do only 100 rows at a time.
Huge columns (BLOB/TEXT) can lead to hitting some limit when batching.

Edit As for the title --

The "select join insert" is a single query
The "loop insert" is lots of 1-row inserts

The overhead for talking to the server, parsing the query, etc, is significant. Hence the latter is 5 times as slow. This is similar to the "10" I mentioned above.

edited Apr 11 '15 at 17:49

answered Apr 08 '15 at 19:00

Rick James

78,038
5
47
113

Yes as stated in the question I know that load data infile is the most optimum, but that's not quite relevant in our comparison of "select join insert" vs "loop insert". The question is, Are you aware why "select join insert" is so much faster than "loop insert"? – Pacerier Apr 11 '15 at 17:39
I added an edit. – Rick James Apr 11 '15 at 17:49
The difference in time may also be affected by the transaction logging (one statement vs a million). (Not sure about this, just an idea.) – ypercubeᵀᴹ Apr 11 '15 at 18:08
innodb_flush_log_at_trx_commit makes a huge difference on 1 vs 1M. (=1 is slow; =2 is fast.) – Rick James Apr 11 '15 at 18:17

Why is "select join insert" so much faster than "loop insert"?

2 Answers2

QUERY #1

QUERY #2

CONCLUSION