1

I'm trying to understand why the data size of my table increased so much after performing this ALTER TABLE statement on an InnoDB table:

ALTER TABLE alert ADD COLUMN receiver_id INT(10) UNSIGNED

Note that I used gh-ost to perform the ALTER because the size of the table was rather large. I know that this will create a net new table, copy the rows and rebuild the indexes as data is copied. Based on what I know, I would expect the outcome to be very similar to how OPTIMIZE TABLE works.

Here's some stats from the old table (I got these from SQL Workbench, but they should be calculated from information_schema):

  • Table rows: 40037398
  • AVG row length: 8800
  • Data length: 328 GiB
  • Index length: 12.9 GiB
  • Table size (estimate): 341 GiB
  • Autoincrement: 65447094

And here's a screenshot of the column data from information_schema.COLUMNS, including the data types and encodings:

enter image description here

And here are the same stats for the new table which has the receiver_id column applied to it:

  • Table rows: 24763900
  • AVG row length: 18735
  • Data length: 432.1 GiB
  • Index length: 12.2 GiB
  • Table size (estimate): 444.3 GiB
  • Autoincrement: 65472610

And here's a screenshot of the column data from information_schema.COLUMNS, including the data types and encodings. You should see the additional INT column at the bottom:

enter image description here

The difference in autoincrement makes sense because after the tables had been swapped, new writes were being applied to the latter. The decrease in index length also makes some sense to me, as rebuilding indexes can optimize the storage underneath. Can anyone provide any insight into why Table rows, Data length would change so dramatically?

EDIT 10/4 adds COLUMN data screenshots.

Xopherus
  • 121
  • 2
  • What ROW_FORMAT was the table (before and after)? – Rick James Oct 12 '19 at 00:39
  • Is each VARCHAR and TEXT usually the same length from row to row? Or maybe just some are 'constant' length? SELECT AVG(LENGTH(x)), STD(LENGTH(x)) FROM t for each of the 5 text columns. – Rick James Oct 12 '19 at 00:49

1 Answers1

0

It's difficult to answer without knowing the nature of the data being copied and maybe of the integration software that is used.

But I decided to share my guesses hoping it may lead to better diagnostic:

  1. Row length increase may be because texts were converted into Unicode during migration.

  2. Lower row count maybe due to some kind of deduplication that also removed records with less text data thus increasing average row length. Or it might be just that the migration process was aborted in the middle.

It might help to make a more educated guess if you supplement the question with SHOW CREATE TABLE output for both source and target tables.

ichalov
  • 56
  • 1
  • 6
  • Thanks for responding @ichalov! I suspected encoding at first as well, but it seems like that was unchanged. I will edit the post to add that information in there though.

    The lower row count doesn't really make much sense to me because the select count(*) queries were the same.

    – Xopherus Oct 04 '19 at 12:53
  • Though from this article (https://dba.stackexchange.com/questions/151769/mysql-difference-between-using-count-and-information-schema-tables-for-coun), it seems like the row count metric is already inaccurate, so it shouldn't be trusted. The big outlier still seems to be the data length. – Xopherus Oct 04 '19 at 13:26
  • Another idea is to try ANALYZE TABLE. It may change values in question to the correct ones. I guess AVG row length is a calculated value, so it should decrease if Table rows is recalculated to be bigger. – ichalov Oct 04 '19 at 16:07
  • I've done that as well. The numbers change, but not significantly so. The diffs between data length are still over 100GiB apart. – Xopherus Oct 04 '19 at 17:56
  • Another option is to copy both tables in the same database and compare them by LEFT JOIN. But that's probably not easy with these table sizes (maybe better done by cold file copy and not mysqldump). I can also propose to perform a statistical research of the data, e.g. take and compare gistograms of some text field length from both tables, with by query like select floor(pow(length(<text_field>), 0.1)), count(*) from <table> group by 1 order by 1; 30% difference in overall file size doesn't look too big (compared to other discrepancies). Maybe it somehow got fragmented during migration? – ichalov Oct 04 '19 at 19:19
  • You may also make LEFT JOIN testing faster by copying only list of ids from the old database into new. – ichalov Oct 04 '19 at 19:27
  • The text columns are utf8 before and after, regardless of the encoding during migration. So, I don't believe guess #1. – Rick James Oct 12 '19 at 00:30
  • The "Rows" and "Row length" are estimated (in InnoDB); the only thing that is reliable is "Data length", which will match the disk space used. Since "Data length" grew significantly, the Question is still valid. Also, COUNT(*) is reliable (and it did not change). – Rick James Oct 12 '19 at 00:33
  • Abortion of the migration would have no effect. Unfinished results would be thrown away, not left as junk in the resulting table. – Rick James Oct 12 '19 at 00:34
  • Percona tools has a way to checksum a table -- this can provide 'proof' that two tables have identical data. – Rick James Oct 12 '19 at 00:36