At this year's LinuxForum I was manning the MySQL booth together with Carsten Pedersen. We were kept quite busy with lots of people coming by to tell us about their use of the MySQL database in their particular projects, or to ask about or discuss some particular issue of theirs. Which was fine, since the talks did not appeal much to me anyway.
One guy (I forgot who) had a small performance problem in his application. The application is a database of about 550,000 companies, storing the name and various other bits of information about each company. What I would call a “small” database (since it is easily kept completely cached in RAM), though not a trivial one.
This application has a facility to search for a company using any part of the company name:
SELECT ... FROM company WHERE name LIKE '%pattern%'
The query was taking too long, I think something like 10 or 15 seconds. Clearly a full table scan is needed, since neither BTREE nor full-text indexes can be used, but even so, more than 10 seconds seems way over the top on modern hardware.
He was using InnoDB, so the obvious suggestion was to make sure that InnoDB was configured with a buffer pool large enough to cache the entire database in RAM (the machine had sufficient memory for that, but innodb_buffer_pool_size must be set sufficiently large for MySQL to actually cache the data).
Still, the problem got me thinking: how fast can one actually do full table scans in MySQL on modern hardware? 15 seconds for a measly 550,000 records seems pretty poor given that modern CPUs execute billions of instructions each second. Probably, in this particular case, insufficient caching was causing disk I/O, reducing the speed. But another issue might be that this table has lots of fields with different information, so the scan needs to not only search the ‘name’ column, but also skip over all of the other fields in the table.
In this case, an index on the ‘name’ column can help. Normally you are told that an index cannot be used in a query using LIKE '%...%'. But if all the columns used in the query are available from a single index, MySQL can use an index scan instead of a table scan. Even though the whole of the index must be scanned, this can still be beneficial if the index is smaller than the table.
So let's run a quick test on some actual data. First, a table definition:
CREATE TABLE company (
  id INT AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(255),
  addr VARCHAR(255),
  phone VARCHAR(64),
  status INT
) ENGINE=innodb;
The table in the original application undoubtedly had many more columns, but this will do for a quick test.
Now some Perl code to fill the table with some realistic data (the DBI connection details below are just placeholders):
use DBI;

# Placeholder connection parameters; AutoCommit is off so that rows
# can be inserted in batches of 500 per transaction.
my $dbh = DBI->connect("dbi:mysql:test", "user", "password",
                       { RaiseError => 1, AutoCommit => 0 });

for (1..550000) {
  $dbh->commit() if ($_ % 500 == 0);
  # Build a random company name of one to three random "words".
  my $words = int(rand(3)) + 1;
  my @word = ();
  for (1..$words) {
    my $x = '';
    my $length = int(rand(13)) + 2;
    for (1..$length) {
      $x .= chr(int(rand(26)) + ord('a'));
    }
    push @word, $x;
  }
  my $name = join(' ', @word);
  $dbh->do(<<SQL, undef, $name, "foobar", "+1 234 56", int(rand(1000)));
INSERT INTO company(name, addr, phone, status) VALUES (?, ?, ?, ?)
SQL
}
$dbh->commit();
And let's set a reasonable InnoDB buffer pool size (250 MB) in my.cnf:
innodb-buffer-pool-size = 262144000
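After restarting the server, it is easy to verify that the setting actually took effect; the value reported should simply be the one from my.cnf:

mysql> SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
+-------------------------+-----------+
| Variable_name           | Value     |
+-------------------------+-----------+
| innodb_buffer_pool_size | 262144000 |
+-------------------------+-----------+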
Now let us try it:
mysql> CREATE UNIQUE INDEX cover1 on company(name, id);

mysql> EXPLAIN SELECT id FROM company WHERE name LIKE '%xgef%';
+----+-------------+---------+-------+---------------+--------+---------+------+--------+--------------------------+
| id | select_type | table   | type  | possible_keys | key    | key_len | ref  | rows   | Extra                    |
+----+-------------+---------+-------+---------------+--------+---------+------+--------+--------------------------+
|  1 | SIMPLE      | company | index | NULL          | cover1 | 262     | NULL | 548220 | Using where; Using index |
+----+-------------+---------+-------+---------------+--------+---------+------+--------+--------------------------+
1 row in set (0.00 sec)

mysql> SELECT id FROM company WHERE name LIKE '%xgef%';
+--------+
| id     |
+--------+
| 310501 |
| 408997 |
| 274935 |
| 418352 |
| 335782 |
|  98100 |
| 239031 |
| 454742 |
+--------+
8 rows in set (0.52 sec)
So half a second to search all 550,000 names. Much better than 15 seconds, and completely adequate for the application in question, where such a search is performed perhaps once every minute.
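For comparison, if the covering index is dropped again, the optimizer has no usable key left for this query, and EXPLAIN falls back to reporting a plain full table scan (type ALL with key NULL):

mysql> DROP INDEX cover1 on company;
mysql> EXPLAIN SELECT id FROM company WHERE name LIKE '%xgef%';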
Of course half a second for a search would not be acceptable for all applications. To speed it up more, one idea might be to store all suffixes of company names in a separate table, along with the company id. This table would be quite big, but still of reasonable size: If names are on average 25 characters, the number of rows would be about 14 million. Then with an index on the column of name suffixes, a query like this:
SELECT id FROM suffix_table WHERE name_suffix LIKE 'pattern%'
could use the index to immediately look up the required data with millisecond response time. But that kind of complexity was not required for the problem at hand.
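Still, for the curious, here is a rough sketch of how such a suffix table could be created, reusing the suffix_table and name_suffix names from the query above:

CREATE TABLE suffix_table (
  name_suffix VARCHAR(255),
  id INT,
  KEY (name_suffix)
) ENGINE=innodb;

and some Perl along the same lines as the fill code above to populate it:

# Insert every suffix of every company name; 'acme' yields
# 'acme', 'cme', 'me' and 'e', each pointing back to the company id.
my $sth = $dbh->prepare("SELECT id, name FROM company");
$sth->execute();
while (my ($id, $name) = $sth->fetchrow_array()) {
  for my $i (0 .. length($name) - 1) {
    $dbh->do("INSERT INTO suffix_table(name_suffix, id) VALUES (?, ?)",
             undef, substr($name, $i), $id);
  }
}
$dbh->commit();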