Remove Duplicate Rows in MySQL
Duplicate elimination is necessary when the database is messed up or when we integrate data from multiple sources. This tutorial shows several ways of removing duplicates in MySQL that may fit in different application scenarios.
Suppose we have a table defined as below, containing a collection of addresses. Although we have primary key on the “id” attribute, we might have duplicates in “address” and we wish to remove such duplicates.
CREATE TABLE address_table ( id INT PRIMARY KEY, address VARCHAR(64), update_time TIMESTAMP);
An example dataset is as below, with duplicated addresses.
INSERT INTO address_table VALUES (1, "123 A Street", NOW()), (2, "456 B Route", NOW()), (3, "123 A Street", NOW());
A quick solution is to use “ALTER IGNORE” to add a unique index on “address”.
ALTER IGNORE TABLE address_table ADD UNIQUE INDEX(address);
According to MySQL document, when IGNORE is specified, for rows with duplicates on the unique key,
only the first row is kept, the other conflicting rows are deleted.
A second approach is to create another table and then using “GROUP BY” to transfer the data.
CREATE TABLE tmp LIKE address_table; INSERT INTO tmp SELECT id, address, update_time FROM address_table GROUP BY address;
Here we use the MySQL extension of GROUP BY where the select list can refer to nonaggregated columns (which are not in the GROUP BY clause). The drawback is that the values for these nonaggregated columns may be from any row within a group.
Although the two approaches above do the job, we do not have controls on which rows in the duplicates to be kept. Oftentimes, we want to keep specific rows; for example, the ones with the largest “id” in the “address_table”.
One solution is the following. We first create a temporary table holding the rows we want to keep for each duplicate group, and then remove all duplicates which are not in the temporary table.
CREATE TABLE tmp (id INT, address VARCHAR(64)); INSERT INTO tmp SELECT MAX(id), address FROM address_table GROUP BY address HAVING COUNT(*) > 1; DELETE address_table FROM address_table, tmp WHERE address_table.address = tmp.address AND address_table.id < tmp.id;
Here we use the “multiple-table DELETE” to remove rows from “address_table” which satisfy a joining condition with another table “tmp”. This approach is particularly useful when the table size is big but only a small fraction has duplicates, since it avoids transferring large amounts of data.