Building a Fast Address Autocomplete Service Using Elasticsearch
In this blog post, we will build a fast, privacy-focused address autocomplete service that leverages Elasticsearch's search capabilities to suggest addresses as the user types.
Core Features
- Rapid Response: Deliver address suggestions within milliseconds, even with a dataset of 10 million records.
- Geolocation Support: Return addresses sorted by distance from the user's location.
Why not use Google Places API?
While the Google Places API is a popular choice for address autocomplete, it comes with potential costs and privacy concerns. The Google Places API uses usage-based pricing, and the free tier has usage limits. For address autocomplete, requests accumulate quickly because every user interaction with the address field can trigger a request. Additionally, using the Google Places API means that user data is shared with Google, which may not be desirable in some cases. By building our own address autocomplete API, we can ensure that user data is not shared with third parties.
Prerequisites
- Node.js
- Elasticsearch
- Logstash
First, install Node.js on your machine. Next, ensure Elasticsearch is installed, or that you have access to an Elasticsearch instance. Finally, install Logstash on your machine.
Downloading the Data
The first step is to download the data. In this project, we will use the Open Database of Addresses (ODA) from Statistics Canada. The data is available in CSV format and can be downloaded from the link above. After downloading and unzipping the data, put the CSV files in the `data` directory. The structure of the `data` directory should look like this:
data
├── ODA_AB_v1.csv
├── ODA_BC_v1.csv
├── ODA_MB_v1.csv
├── ODA_NB_v1.csv
├── ODA_NS_v1.csv
├── ODA_NT_v1.csv
├── ODA_ON_v1.csv
├── ODA_PE_v1.csv
├── ODA_QC_v1.csv
└── ODA_SK_v1.csv
Indexing the Data
The next step is to index the data into Elasticsearch. To do this, we will use Logstash. Logstash is a tool that can ingest data from multiple sources, transform it, and send it to multiple destinations. In this project, we will use Logstash to read the CSV files and send the data to Elasticsearch.
Create a new file called `oda.conf` in the `logstash` directory with the following content:
input {
  file {
    path => "/path/to/data/*.csv"
    start_position => "beginning"
    sincedb_path => "/dev/null"
  }
}

filter {
  csv {
    separator => ","
    columns => ["latitude","longitude","source_id","id","group_id","street_no","street","str_name","str_type","str_dir","unit","city","postal_code","full_addr","city_pcs","str_name_pcs","str_type_pcs","str_dir_pcs","csduid","csdname","pruid","provider"]
  }
  mutate {
    convert => {
      "latitude" => "float"
      "longitude" => "float"
      "pruid" => "integer"
    }
    rename => {
      "latitude" => "[location][lat]"
      "longitude" => "[location][lon]"
    }
    # we only keep these fields: location, city, full_addr, pruid, postal_code, source_id, id, group_id
    remove_field => ["latitude","longitude","street_no","street","str_name","str_type","str_dir","unit","city_pcs","str_name_pcs","str_type_pcs","str_dir_pcs","csduid","csdname","provider"]
  }
}

output {
  elasticsearch {
    # index => "canada-addresses-%{+YYYY.MM.dd}" # use this instead for a date-based index name
    index => "canada-addresses-2024.04.05"
    document_id => "%{id}"
    hosts => "https://localhost:9200"  # change to your Elasticsearch host
    user => "elastic"                  # change to your Elasticsearch user
    password => "changeme"             # change to your Elasticsearch password
    cacert => "/path/to/your/ca.crt"   # change to your Elasticsearch certificate
  }
}
Before running Logstash, you need to create an index in Elasticsearch with the proper mapping. You can do this by sending a PUT request to Elasticsearch with the curl command. For example:
# Create an index called "canada-addresses-2024.04.05" with the mapping.
# Change the hostname, username, password, and certificate path to match your Elasticsearch instance.
curl -X PUT "https://localhost:9200/canada-addresses-2024.04.05" -H 'Content-Type: application/json' -u elastic:changeme --cacert /path/to/your/ca.crt -d '{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      },
      "city": {
        "type": "text"
      },
      "full_addr": {
        "type": "text"
      },
      "pruid": {
        "type": "integer"
      },
      "postal_code": {
        "type": "text"
      }
    }
  }
}'
Alternatively, you can use Kibana Dev Tools to create the index. Go to Dev Tools under the Management section in Kibana and run the following command:
PUT /canada-addresses-2024.04.05
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      },
      "city": {
        "type": "text"
      },
      "full_addr": {
        "type": "text"
      },
      "pruid": {
        "type": "integer"
      },
      "postal_code": {
        "type": "text"
      }
    }
  }
}
After creating the index, you can run Logstash to index the data:
logstash -f logstash/oda.conf
The dataset includes 10 million records, so indexing the data may take some time.
Building the Address Autocomplete Service
Now that we have indexed the data, we can build the address autocomplete service. The service will provide an endpoint to autocomplete addresses based on the user's input. We will use Node.js, Express, TypeScript, and the Elasticsearch JavaScript client to build the service.
Setting Up the Project
Create a new directory called `address-autocomplete-api` and navigate to it:
mkdir address-autocomplete-api
cd address-autocomplete-api
Initialize a new Node.js project:
npm init -y
Install the required dependencies:
npm i express @elastic/elasticsearch dotenv
npm i -D typescript ts-node-dev @types/express nodemon
NOTE
- `express` is a web framework for Node.js.
- `@elastic/elasticsearch` is the official Elasticsearch client for Node.js.
- `dotenv` loads environment variables from a `.env` file.
- `typescript` is a superset of JavaScript that adds static types.
- `ts-node-dev` is a TypeScript execution environment for Node.js.
- `@types/express` provides the TypeScript type definitions for Express.
- `nodemon` monitors changes in your source code and automatically restarts the server.
Initialize a TypeScript configuration file:
npx tsc --init
Create a `.env` file in the root directory with the following content:
API_PORT=4000 # the port the service will run on
ES_ENDPOINT=https://localhost:9200 # the Elasticsearch endpoint
ES_CA_CERT=./ca.crt # the path to the Elasticsearch certificate
NODE_EXTRA_CA_CERTS=./ca.crt # the path to the Elasticsearch certificate
ES_USERNAME=elastic # the Elasticsearch username
ES_PASSWORD=changeme # the Elasticsearch password
ES_INDEX=canada-addresses-2024.04.05 # the Elasticsearch index you created in the previous step
Create an `index.ts` file in the root directory with the following content:
import express from 'express';
import dotenv from 'dotenv';

// load .env file
dotenv.config();

const API_PORT = process.env.API_PORT || 3000;

const app = express();

app.get("/", (req, res) => {
  res.send("OK");
});

app.listen(API_PORT, () => {
  console.log(`Canada Address Autocomplete API is running on port ${API_PORT}, http://localhost:${API_PORT}`);
});
Update the `package.json` file to include the following scripts:
"scripts": {
"dev": "nodemon index.ts"
}
Now you can start the service:
npm run dev
The service will be available at http://localhost:4000 or the port you specified in the `.env` file. You can test the service by sending a GET request to the `/` endpoint:
curl http://localhost:4000/
You should receive an `OK` response.
Adding the Autocomplete Endpoint
Next, we will add an endpoint to autocomplete addresses.
Before adding the endpoint, we need to load the necessary environment variables. Update the `index.ts` file with the following content:
import express from 'express';
import dotenv from 'dotenv';

// load .env file
dotenv.config();

const API_PORT = process.env.API_PORT || 3000;

const ES_ENDPOINT = process.env.ES_ENDPOINT;
if (!ES_ENDPOINT) {
  console.error("Missing ES_ENDPOINT environment variable");
  process.exit(1);
}

const ES_CA_CERT = process.env.ES_CA_CERT;
if (!ES_CA_CERT) {
  console.error("Missing ES_CA_CERT environment variable");
  process.exit(1);
}

const ES_USERNAME = process.env.ES_USERNAME;
if (!ES_USERNAME) {
  console.error("Missing ES_USERNAME environment variable");
  process.exit(1);
}

const ES_PASSWORD = process.env.ES_PASSWORD;
if (!ES_PASSWORD) {
  console.error("Missing ES_PASSWORD environment variable");
  process.exit(1);
}

const ES_INDEX = process.env.ES_INDEX;
if (!ES_INDEX) {
  console.error("Missing ES_INDEX environment variable");
  process.exit(1);
}

const app = express();

app.get("/", (req, res) => {
  res.send("OK");
});

app.listen(API_PORT, () => {
  console.log(`Canada Address Autocomplete API is running on port ${API_PORT}, http://localhost:${API_PORT}`);
});
Next, create an Elasticsearch client to connect to the Elasticsearch instance. Update the `index.ts` file with the following content:
import express from 'express';
import dotenv from 'dotenv';
import { Client } from '@elastic/elasticsearch';
import fs from 'fs';

// load .env file
dotenv.config();

const API_PORT = process.env.API_PORT || 3000;

const ES_ENDPOINT = process.env.ES_ENDPOINT;
if (!ES_ENDPOINT) {
  console.error("Missing ES_ENDPOINT environment variable");
  process.exit(1);
}

const ES_CA_CERT = process.env.ES_CA_CERT;
if (!ES_CA_CERT) {
  console.error("Missing ES_CA_CERT environment variable");
  process.exit(1);
}

const ES_USERNAME = process.env.ES_USERNAME;
if (!ES_USERNAME) {
  console.error("Missing ES_USERNAME environment variable");
  process.exit(1);
}

const ES_PASSWORD = process.env.ES_PASSWORD;
if (!ES_PASSWORD) {
  console.error("Missing ES_PASSWORD environment variable");
  process.exit(1);
}

const ES_INDEX = process.env.ES_INDEX;
if (!ES_INDEX) {
  console.error("Missing ES_INDEX environment variable");
  process.exit(1);
}

const app = express();

const esClient = new Client({
  node: ES_ENDPOINT,
  tls: {
    ca: fs.readFileSync(ES_CA_CERT),
    // skip certificate hostname verification; for local development only
    rejectUnauthorized: false,
  },
  auth: {
    username: ES_USERNAME,
    password: ES_PASSWORD,
  },
});

app.get("/", (req, res) => {
  res.send("OK");
});

app.listen(API_PORT, () => {
  console.log(`Canada Address Autocomplete API is running on port ${API_PORT}, http://localhost:${API_PORT}`);
});
Now, we can add the autocomplete endpoint. The endpoint will accept the query parameters `q`, `lat`, and `lon`. The `q` parameter is the address query, and the `lat` and `lon` parameters are the user's geolocation. The endpoint will return a list of addresses that match the query.
For the address query, we will use the `match_phrase_prefix` query to match on the address prefix, because we want to suggest addresses that match the user's input before they finish typing. We will also filter the results to ensure that each address has the required fields (`full_addr`, `city`, and `pruid`).
If the user provides the `lat` and `lon` parameters, we will sort the results by distance from the user's location. Since the `location` field has a `geo_point` mapping in Elasticsearch, we can leverage Elasticsearch's `_geo_distance` sort to order the results by distance.
Update the `index.ts` file with the following content:
import express from 'express';
import dotenv from 'dotenv';
import { Client } from '@elastic/elasticsearch';
import fs from 'fs';
import { LatLonGeoLocation, QueryDslQueryContainer, SearchRequest, SortCombinations } from '@elastic/elasticsearch/lib/api/typesWithBodyKey';

// load .env file
dotenv.config();

const API_PORT = process.env.API_PORT || 3000;

const ES_ENDPOINT = process.env.ES_ENDPOINT;
if (!ES_ENDPOINT) {
  console.error("Missing ES_ENDPOINT environment variable");
  process.exit(1);
}

const ES_CA_CERT = process.env.ES_CA_CERT;
if (!ES_CA_CERT) {
  console.error("Missing ES_CA_CERT environment variable");
  process.exit(1);
}

const ES_USERNAME = process.env.ES_USERNAME;
if (!ES_USERNAME) {
  console.error("Missing ES_USERNAME environment variable");
  process.exit(1);
}

const ES_PASSWORD = process.env.ES_PASSWORD;
if (!ES_PASSWORD) {
  console.error("Missing ES_PASSWORD environment variable");
  process.exit(1);
}

const ES_INDEX = process.env.ES_INDEX;
if (!ES_INDEX) {
  console.error("Missing ES_INDEX environment variable");
  process.exit(1);
}

const app = express();

const esClient = new Client({
  node: ES_ENDPOINT,
  tls: {
    ca: fs.readFileSync(ES_CA_CERT),
    // skip certificate hostname verification; for local development only
    rejectUnauthorized: false,
  },
  auth: {
    username: ES_USERNAME,
    password: ES_PASSWORD,
  },
});

app.get("/autocomplete", async (req, res) => {
  const { q, lat, lon } = req.query;
  const filter: QueryDslQueryContainer[] = [
    // full_addr, city, pruid are required fields
    { exists: { field: "full_addr" } },
    { exists: { field: "city" } },
    { exists: { field: "pruid" } },
  ];
  const query: QueryDslQueryContainer = {
    bool: {
      must: [
        {
          // match_phrase_prefix will match "123 Main St" with "123 Main Street"
          match_phrase_prefix: { full_addr: q?.toString() || "" },
        },
      ],
      filter,
    },
  };
  const sort: SortCombinations[] = [];
  const searchRequestBody: SearchRequest['body'] = {
    _source: ["full_addr", "city", "postal_code", "pruid", "location"],
    query,
    sort,
  };
  // if the user provides lat and lon, sort by distance
  if (lat && lon) {
    const location: LatLonGeoLocation = {
      lon: parseFloat(lon.toString()),
      lat: parseFloat(lat.toString()),
    };
    sort.push({
      _geo_distance: {
        location,
        order: "asc",
        unit: "km",
        mode: "min",
        distance_type: "plane",
        ignore_unmapped: true,
      },
    });
  }
  try {
    const result = await esClient.search({
      index: ES_INDEX,
      size: 20,
      body: searchRequestBody,
    });
    const addresses: Address[] = result.hits.hits.map((hit) => {
      const address = hit._source as Address;
      const { full_addr, city, pruid, postal_code, location } = address;
      // distance is the first element in the sort array
      const distance = hit.sort?.[0] as number | undefined;
      return {
        full_addr,
        city,
        pruid,
        postal_code,
        location,
        distance,
      };
    });
    res.json(addresses);
  } catch (error) {
    console.error(error);
    res.status(500).json({ error: "Internal Server Error" });
  }
});

app.get("/", (req, res) => {
  res.send("OK");
});

app.listen(API_PORT, () => {
  console.log(`Canada Address Autocomplete API is running on port ${API_PORT}, http://localhost:${API_PORT}`);
});

// Types
type Pruid = 10 | 11 | 12 | 13 | 24 | 35 | 46 | 47 | 48 | 59 | 60 | 61 | 62;

interface Address {
  full_addr: string;
  city: string;
  pruid: Pruid;
  postal_code?: string;
  location: LatLonGeoLocation;
  distance?: number; // distance in km, only set when lat/lon are provided
}
Now you can start the service:
npm run dev
The service will be available at http://localhost:4000 or the port you specified in the `.env` file. You can test the service by sending a GET request to the `/autocomplete` endpoint with the query parameter `q`:
curl http://localhost:4000/autocomplete?q=123%20Main%20St
You will receive a JSON response with the addresses that match the query.
[
  {
    "full_addr": "123 MAIN ST",
    "city": "RESERVE MINES",
    "pruid": 12,
    "postal_code": null,
    "location": {
      "lon": "-60.01856",
      "lat": "46.18440"
    }
  },
  {
    "full_addr": "123 MAIN ST",
    "city": "SPRINGHILL",
    "pruid": 12,
    "postal_code": null,
    "location": {
      "lon": "-64.05301",
      "lat": "45.65114"
    }
  }
]
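On the client side, not every keystroke should become a request. The sketch below shows two illustrative helpers (the function names and the 300 ms delay are my own choices, not part of the service): a generic debounce and a URL builder for the endpoint above.

```typescript
// Illustrative client-side helpers for calling the /autocomplete endpoint.
// debounce() delays a call until the user pauses typing, so that not every
// keystroke triggers a request.
function debounce<T extends unknown[]>(fn: (...args: T) => void, ms: number): (...args: T) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: T) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), ms);
  };
}

// Build the request URL, URL-encoding the query and the optional coordinates.
function buildAutocompleteUrl(base: string, q: string, lat?: number, lon?: number): string {
  const params = new URLSearchParams({ q });
  if (lat !== undefined && lon !== undefined) {
    params.set("lat", String(lat));
    params.set("lon", String(lon));
  }
  return `${base}/autocomplete?${params.toString()}`;
}

// Fire at most one request per 300 ms pause in typing.
const onInput = debounce((text: string) => {
  console.log("would fetch:", buildAutocompleteUrl("http://localhost:4000", text));
}, 300);

onInput("123 Main St");
```

Note that `URLSearchParams` encodes spaces as `+`, which is equivalent to `%20` in a query string and is decoded the same way by Express.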
The `/autocomplete` endpoint also supports specifying the geolocation of the user to return results closer to the user. You can pass the `lat` and `lon` query parameters to the endpoint. For example:
curl "http://localhost:4000/autocomplete?q=123%20Main%20St&lat=43.65114&lon=-79.05301"
You will receive a JSON response with the addresses that match the query and are sorted by distance from the user's location.
[
  {
    "full_addr": "123 Main St",
    "city": "Markham",
    "pruid": 35,
    "postal_code": null,
    "location": {
      "lon": "-79.26074",
      "lat": "43.87854"
    },
    "distance": 30.29256914280844
  },
  {
    "full_addr": "123 Main St",
    "city": "Markham",
    "pruid": 35,
    "postal_code": null,
    "location": {
      "lon": "-79.30980",
      "lat": "43.86351"
    },
    "distance": 31.352539804009215
  },
  {
    "full_addr": "123 Main St",
    "city": "Liverpool",
    "pruid": 12,
    "postal_code": null,
    "location": {
      "lon": "-64.71274",
      "lat": "44.03990"
    },
    "distance": 1150.830347120216
  }
]
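The `distance` values above come from the `distance_type: "plane"` setting in the sort: a fast equirectangular approximation that is accurate at city scale but degrades over long distances. A minimal sketch of that kind of computation, assuming a spherical Earth with mean radius 6371 km (this sketch closely reproduces the first distance above, though it is not necessarily byte-for-byte what Elasticsearch computes internally):

```typescript
// Equirectangular ("plane") distance approximation.
// Assumes a spherical Earth with mean radius 6371 km.
const EARTH_RADIUS_KM = 6371;
const DEG_TO_RAD = Math.PI / 180;

function planeDistanceKm(lat1: number, lon1: number, lat2: number, lon2: number): number {
  const meanLat = ((lat1 + lat2) / 2) * DEG_TO_RAD;
  // Scale the longitude difference by cos(latitude): meridians converge toward the poles.
  const x = (lon2 - lon1) * DEG_TO_RAD * Math.cos(meanLat);
  const y = (lat2 - lat1) * DEG_TO_RAD;
  return Math.sqrt(x * x + y * y) * EARTH_RADIUS_KM;
}

// User location from the example query vs. the first Markham hit:
const d = planeDistanceKm(43.65114, -79.05301, 43.87854, -79.26074);
console.log(d.toFixed(2)); // ≈ 30.29 km, matching the first "distance" above
```

For long-range or high-accuracy use cases, `distance_type: "arc"` (the default, using great-circle distance) is more accurate at the cost of some speed.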
NOTE
- The `distance` field is only returned when the `lat` and `lon` query parameters are provided, and it is measured in kilometers.
- The `pruid` field is the province ID. You can find the list of province IDs here.
- The `postal_code` field may not always be available in the dataset.
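The `pruid` values follow Statistics Canada's standard two-digit province and territory codes (the same codes as the `Pruid` type in `index.ts`). For display purposes, a small lookup table can translate them; the helper name below is illustrative:

```typescript
// Statistics Canada province/territory codes (PRUID) mapped to names.
const PROVINCE_NAMES: Record<number, string> = {
  10: "Newfoundland and Labrador",
  11: "Prince Edward Island",
  12: "Nova Scotia",
  13: "New Brunswick",
  24: "Quebec",
  35: "Ontario",
  46: "Manitoba",
  47: "Saskatchewan",
  48: "Alberta",
  59: "British Columbia",
  60: "Yukon",
  61: "Northwest Territories",
  62: "Nunavut",
};

function provinceName(pruid: number): string {
  return PROVINCE_NAMES[pruid] ?? `Unknown (${pruid})`;
}

console.log(provinceName(12)); // "Nova Scotia" — matches the sample responses above
console.log(provinceName(35)); // "Ontario"
```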
Conclusion
In this blog post, we built a privacy-focused address autocomplete service using Elasticsearch. We indexed address data from the Open Database of Addresses (ODA) into Elasticsearch, created an address autocomplete API using Node.js and Express, and used the Elasticsearch JavaScript client to query the data. By building our own address autocomplete service, we can ensure that user data is not shared with third parties and avoid potential costs associated with third-party services.
Source Code
The complete code for this blog post is available on GitHub.
License
- The Open Database of Addresses (ODA) is a collection of open address point data and is made available under the Open Government License - Canada.