Remove Duplicate Rows in MySQL

Duplicate elimination is necessary when the database is messed up or when we integrate data from multiple sources. This tutorial shows several ways of removing duplicates in MySQL that may fit in different application scenarios.

Suppose we have a table defined as below, containing a collection of addresses. Although we have primary key on the “id” attribute, we might have duplicates in “address” and we wish to remove such duplicates.

CREATE TABLE address_table (
    id INT PRIMARY KEY, 
    address VARCHAR(64), 
    update_time TIMESTAMP);

An example dataset is as below, with duplicated addresses.

INSERT INTO address_table VALUES
    (1, "123 A Street", NOW()),
    (2, "456 B Route", NOW()),
    (3, "123 A Street", NOW());

A quick solution is to use “ALTER IGNORE” to add a unique index on “address”.

ALTER IGNORE TABLE address_table ADD UNIQUE INDEX(address);

According to MySQL document, when IGNORE is specified, for rows with duplicates on the unique key,
only the first row is kept, the other conflicting rows are deleted.

A second approach is to create another table and then using “GROUP BY” to transfer the data.

CREATE TABLE tmp LIKE address_table;
INSERT INTO tmp 
    SELECT id, address, update_time 
    FROM address_table
    GROUP BY address;

Here we use the MySQL extension of GROUP BY where the select list can refer to nonaggregated columns (which are not in the GROUP BY clause). The drawback is that the values for these nonaggregated columns may be from any row within a group.

Although the two approaches above do the job, we do not have controls on which rows in the duplicates to be kept. Oftentimes, we want to keep specific rows; for example, the ones with the largest “id” in the “address_table”.

One solution is the following. We first create a temporary table holding the rows we want to keep for each duplicate group, and then remove all duplicates which are not in the temporary table.

CREATE TABLE tmp (id INT, address VARCHAR(64));
INSERT INTO tmp 
    SELECT MAX(id), address
    FROM address_table
    GROUP BY address
    HAVING COUNT(*) > 1;
DELETE address_table 
    FROM address_table, tmp 
    WHERE address_table.address = tmp.address
    AND address_table.id < tmp.id;

Here we use the “multiple-table DELETE” to remove rows from “address_table” which satisfy a joining condition with another table “tmp”. This approach is particularly useful when the table size is big but only a small fraction has duplicates, since it avoids transferring large amounts of data.

Comments

comments