Nearest Neighbor Search by Distance in Large Datasets with MySQL
Nearest neighbor search is common in geographical databases. Here we consider searching for nearby locations in a large geographical database. Many commercial databases have geographical indexes supporting such queries; however, in MySQL there is no simple implementation. We show a simple way below.
Consider the following table definition.
CREATE TABLE `locations` ( `id` INT NOT NULL AUTO_INCREMENT PRIMARY KEY, `lat` DOUBLE NOT NULL, `lng` DOUBLE NOT NULL, `address` VARCHAR(64) NOT NULL, KEY `lat` (`lat`), KEY `lng` (`lng`) );
Suppose we have a center at ($lat0, $lng0), and we wish to find nearby locations by distance. The following query contains a formula calculating the distance from the center. It returns the closet 20 locations within 10 km from the center. Here 6371 is the earth’s radius in km.
SELECT *, ( 6371 * acos( cos(radians($lat0)) * cos(radians(lat)) * cos(radians(lng) - radians($lng0)) + sin(radians($lat0)) * sin(radians(lat)) ) ) AS distance FROM `locations` HAVING distance < 10 ORDER BY distance LIMIT 20;
The query works well on small datasets; however, it is not efficient when the table is large (say, with 1 million records). This query scans all rows in the table, calculates the distance to the center, and finally sorts the result by distance. It runs in linear time of the number of rows. None of the indexes is effective.
A quick fix is to specify a bounding box for latitude and longitude. The following query specifies a bounding box by simply adding/subtracting 1 degree from latitude/longitude of the center.
SELECT * FROM `locations` WHERE lat < $lat0 + 1 AND lat > $lat0 - 1 AND lng < $lng0 + 1 AND lng > $lng0 - 1;
And the next query sorts the result by the distance to the center.
SELECT *, ( 6371 * acos( cos(radians($lat0)) * cos(radians(lat)) * cos(radians(lng) - radians($lng0)) + sin(radians($lat0)) * sin(radians(lat)) ) ) AS distance FROM `locations` WHERE lat < $lat0 + 1 AND lat > $lat0 - 1 AND lng < $lng0 + 1 AND lng > $lng0 - 1 HAVING distance < 10 ORDER BY distance LIMIT 20;
However, this simple “degree manipulation” may not be good for applications requiring precise results. A better (but more complicated) way is to calculate the extremal degrees of points within certain distance from the center (the north/south/east/west point of 10 km to the center). Suppose we have parameters ($lat0, $lng0, $dist) for the center and the query distance.
SELECT * FROM `locations` WHERE lat < degrees( asin( sin(radians($lat0)) * cos($dist / 6371) + cos(radians($lat0)) * sin($dist / 6371) * cos(radians(0)) )) AND lat > degrees( asin( sin(radians($lat0)) * cos($dist / 6371) + cos(radians($lat0)) * sin($dist / 6371) * cos(radians(180)) )) AND lng < $lng0 - degrees( atan2(sin(radians(90)) * sin(radians($dist / 6371)) * cos(radians($lat0)), cos(radians($dist / 6371)) - sin(radians($lat0)) * sin(radians($lat0))) ) AND lng > $lng0 + degrees( atan2(sin(radians(90)) * sin(radians($dist / 6371)) * cos(radians($lat0)), cos(radians($dist / 6371)) - sin(radians($lat0)) * sin(radians($lat0))) )
The queries with bounding boxes will benefit from the indexes on “lat” and “lng”; the queries will be run with index scan on “lat” and “lng”. Note again that, we assume the dataset is large and scattered while the querying box is relatively small.
The final query below searches the nearest 20 locations within distance $dist (in kilometers) from the center ($lat0, $lng0), with the result sorted by the distance.
SELECT *, ( 6371 * acos( cos(radians($lat0)) * cos(radians(lat)) * cos(radians(lng) - radians($lng0)) + sin(radians($lat0)) * sin(radians(lat)) ) ) AS distance FROM `locations` WHERE lat < degrees( asin( sin(radians($lat0)) * cos($dist / 6371) + cos(radians($lat0)) * sin($dist / 6371) * cos(radians(0)) )) AND lat > degrees( asin( sin(radians($lat0)) * cos($dist / 6371) + cos(radians($lat0)) * sin($dist / 6371) * cos(radians(180)) )) AND lng < $lng0 - degrees( atan2(sin(radians(90)) * sin(radians($dist / 6371)) * cos(radians($lat0)), cos(radians($dist / 6371)) - sin(radians($lat0)) * sin(radians($lat0))) ) AND lng > $lng0 + degrees( atan2(sin(radians(90)) * sin(radians($dist / 6371)) * cos(radians($lat0)), cos(radians($dist / 6371)) - sin(radians($lat0)) * sin(radians($lat0))) ) ORDER BY distance LIMIT 20;
One should note that the range of latitude is from -90 to 90 and the range of longitude is from -180 to 180; if the bounding box overlaps with the boundaries of the ranges, the WHERE condition has to be adapted!