Selecting random records from a table is helpful when showing items with no preferences or testing different result sets.
The following SQL defines a table holding images. Imagine that we wish to pick up 10 random rows from it. In our test, we fill it with ~50k (524,288) rows, approximately 90 MB storage space. Note that we also have a PRIMARY KEY index on “id”, and it uses ~5 MB storage space.
CREATE TABLE media ( id int NOT NULL AUTO_INCREMENT, caption varchar(64), url varchar(128) NOT NULL, PRIMARY KEY (id) ); INSERT INTO media (caption, url) VALUES (REPEAT('a', 60), REPEAT('b', 100)); INSERT INTO media (caption, url) SELECT caption, url FROM media; INSERT INTO media (caption, url) SELECT caption, url FROM media; -- repeat the above statement until satisfied
Using ORDER BY RAND()
A simple way to select random rows is using the RAND() function and ORDER BY. RAND() returns a random floating-point value between 0 and 1.
SELECT id, caption, url FROM media ORDER BY RAND() LIMIT 10
This query does the following:
- Scan the whole table to retrieve all rows.
- Generate a random number for each row.
- Sort all rows ordering by the random numbers.
- Return the first 10 rows in the sorted result.
Obviously, there are some unnecessary heavy jobs in it. This approach is simple, but not efficient for big data, say, a table of 1 million rows. In our test with 50k rows, it takes 1.2 seconds.
A simple improvement is to do sorting only on the “id”s.
SELECT id, caption, url FROM media, ( SELECT id AS sid FROM media ORDER BY RAND( ) LIMIT 10 ) tmp WHERE media.id = tmp.sid;
This query uses a subquery to generate 10 random “id”s and then use a join to select rows by these “id”s. In our test, it runs in 0.44 seconds. Using EXPAIN, we get the following execution plan for the subquery selecting random ids.
Using index; Using temporary; Using filesort
From this we see the improvement happens in the following parts:
- Inside the subquery, use the index to pick up ids; so there is no scan of the whole table.
- Sorting is only on ids which has much smaller size. (It is roughly the index size 5 MB, comparing with the table size 90 MB)
- After 10 ids are selected, JOIN is simple using the index.
A similar query using IN clause is as below. Note that earlier versions of MySQL does not support this query since it has LIMIT within IN clause.
SELECT id, caption, url FROM media WHERE id in ( SELECT id FROM media ORDER BY RAND() LIMIT 10 );
Generating Random Ids for Rows
A better way to generate 10 random ids and use them to pick up rows. However, here we will assume the ids are almost evenly distributed.
The following query generates one random id, and then select the 10 closet rows with ids following the random id.
SELECT id, caption, url FROM media, (SELECT RAND() * (SELECT MAX(id)-10 FROM media) AS tid) AS tmp WHERE media.id >= tmp.tid ORDER BY id ASC LIMIT 10;
RAND() * (SELECT MAX(id) FROM media) would generate a random id between 0 and the maximum id.
This solution is not perfect. It is not 10 random rows, but rather 10 rows from one random point. It does not work well if ids are not evenly distributed or there are lots of missing ids. However, it may be good enough for some applications. In our test it runs in 0.0009 seconds, 500 times faster!
To generate 10 random ids, it is better to define a function which essentially runs the above query 10 times (one row each time). To get a truly random result for arbitrarily distributed ids, it is better to build a mapping table which maps the original ids to sequentially generated new ids.