We have a MySQL database running on AWS RDS with a very large event table (table A) of about 70 million rows. We are trying to create an aggregate table (table B) that groups the events from table A to the daily level, so that the data is easier to work with. The query I use is as follows:
INSERT INTO Aggregate_table_B
SELECT
    A.username AS username,
    LEFT(A.event_cts, 10) AS eventtime,
    SUM(A.total_event_value) AS total_daily_event_value
FROM
    Table_A AS A
WHERE
    A.id >= x
    AND A.id < y
GROUP BY
    username,
    eventtime;
The id column is the event index in table A.
Now, if the value y is very close to MAX(id) of table A, the query fails with a lock wait timeout error; but if y is sufficiently smaller than MAX(id), the query runs successfully. So if I run the same query twice back to back, once with the id condition in the "good" range and once in the "bad" range, the bad-range query hits the lock wait timeout while the good-range query runs fine.
The error message is as follows:
Error Code: 1205. Lock wait timeout exceeded; try restarting transaction
I have identified the threshold value of y above which the query stops working, and I have checked the row with that id value; it looks completely normal.
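To drill further into what the INSERT is actually waiting on, one idea (just a sketch, assuming the InnoDB information_schema tables are available on this RDS instance) is to capture the transaction state from a second session while the failing query sits in its lock wait:

-- Run from a second session while the failing INSERT ... SELECT is waiting:
SELECT trx_id, trx_state, trx_started, trx_wait_started,
       trx_rows_locked, trx_query
FROM information_schema.INNODB_TRX;

Any transaction other than the INSERT itself that shows up here would be a candidate for the lock holder.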
I also tried running this query:

SHOW OPEN TABLES WHERE In_use > 0;

while the first query was running with a y value large enough to trigger the error, and the result shows only one lock placed on each table, not multiple locks.
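As far as I understand, the In_use column of SHOW OPEN TABLES reflects table-level usage in the table cache rather than InnoDB row locks, so it may not reveal the conflict. A row-level view I am thinking of checking instead (assuming MySQL 5.6/5.7; on MySQL 8.0 these tables were replaced by performance_schema.data_locks and performance_schema.data_lock_waits):

-- MySQL 5.6/5.7: which transaction is blocking which
SELECT * FROM information_schema.INNODB_LOCK_WAITS;
-- and the individual locks involved in the wait
SELECT * FROM information_schema.INNODB_LOCKS;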
I also ran

SHOW PROCESSLIST;

to see whether some sleeping connection was holding the lock, but the output is the same whether the query fails or succeeds with the different id ranges.
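Another thing I could capture is the InnoDB monitor output right after the timeout fires; if I read the documentation correctly, its TRANSACTIONS section shows exactly which lock the timed-out transaction was waiting for:

-- Run immediately after (or during) the lock wait and read the TRANSACTIONS section:
SHOW ENGINE INNODB STATUS;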
Another observation: if I run a variant of the first query that selects only a single event row in the condition:
INSERT INTO Aggregate_table_B
SELECT
    A.username AS username,
    LEFT(A.event_cts, 10) AS eventtime,
    SUM(A.total_event_value) AS total_daily_event_value
FROM
    Table_A AS A
WHERE
    A.id = z
GROUP BY
    username,
    eventtime;
then the query runs fine even when z exceeds the threshold that broke the first query. The error only appears again if I specify a value of z that is greater than MAX(id).
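If I understand InnoDB correctly, an INSERT ... SELECT places shared locks on the source rows it reads from Table_A (at least under the default REPEATABLE READ isolation with statement-based binary logging), whereas a plain SELECT is a non-locking consistent read. So one isolation test I am considering is a locking SELECT over the "bad" range, with no write to Aggregate_table_B at all; if this alone reproduces the timeout, the problem lies in the locking read on Table_A rather than in the insert (a sketch; on MySQL 8.0 the clause is FOR SHARE instead of LOCK IN SHARE MODE):

-- Locking read only, mimicking the shared locks that INSERT ... SELECT takes:
SELECT
    A.username AS username,
    LEFT(A.event_cts, 10) AS eventtime,
    SUM(A.total_event_value) AS total_daily_event_value
FROM
    Table_A AS A
WHERE
    A.id >= x
    AND A.id < y
GROUP BY
    username,
    eventtime
LOCK IN SHARE MODE;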
One last point: this database is destroyed every day, and a new version is created that contains updated data from the previous day plus all the days before it. The error occurs only on some days. This means that data that belonged to an earlier database state in which the query failed still exists in a later state in which the same query runs fine.
I would appreciate any insights into how I can drill down to the actual cause of this error.