Building a Geocoder in Python
In this work we'll build a geocoder - a piece of software which, given some string referring to a place (e.g. 'hamburg'), returns possible locations that the string may refer to (for example as coordinates).
Geocoding systems are commonly available as rate-limited APIs, e.g. https://developers.google.com/maps/documentation/geocoding/overview. Often free tiers exist, but once a lot of data is requested, the cost can be substantial (see ArcGIS pricing).
For a university project quite a while ago, I scraped the Twitter API to map out sentiment around climate change across the globe. The only problem there was that most tweets are not geo-tagged - so while millions of tweets could be collected, most of these could not directly be placed on a map. Many Twitter users do state their location in their bio, although not necessarily accurately: some name their country, others their city, some put a flag, some a combination of these, and many don't put a real location at all.
In any case, at scale a commercial geocoding API won't help us when we have collected a large amount of 'informal' location data like this. So let's try building our own!
🔗 GeoNames
GeoNames is a public geographical database available under a Creative Commons License. It contains names of geographical points along with alternate names, coordinates, country code, elevation, etc.
For a full description of all tables, check http://download.geonames.org/export/dump/.
🔗 The plan
We want to use the GeoNames dataset to build a search index based on 'name' information - generally these are strings indicating a location. There are likely many places that have the same name, so we'll also need to add some information to order/narrow down/choose candidates.
The following (from here) contains the information we have available:
The main 'geoname' table has the following fields :
---------------------------------------------------
geonameid : integer id of record in geonames database
name : name of geographical point (utf8) varchar(200)
asciiname : name of geographical point in plain ascii characters, varchar(200)
alternatenames : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
latitude : latitude in decimal degrees (wgs84)
longitude : longitude in decimal degrees (wgs84)
feature class : see http://www.geonames.org/export/codes.html, char(1)
feature code : see http://www.geonames.org/export/codes.html, varchar(10)
country code : ISO-3166 2-letter country code, 2 characters
cc2 : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
admin1 code : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)
admin3 code : code for third level administrative division, varchar(20)
admin4 code : code for fourth level administrative division, varchar(20)
population : bigint (8 byte int)
elevation : in meters, integer
dem : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
timezone : the iana timezone id (see file timeZone.txt) varchar(40)
modification date : date of last modification in yyyy-MM-dd format
Given a search string, we'd like to return some subset of this information (which we could potentially use later to make a choice among matches).
🔗 Imports and Environment Variables
Note - this blog post is an Org Mode narrative, which runs this work top-to-bottom. In all my narratives, I try to define vital imports and environment variables up top.
from collections import defaultdict
from pathlib import Path
import numpy as np
import pandas as pd
data_dir = Path("../data")
🔗 Load data
Let's start off by downloading the dataset into a data directory.
mkdir -p ../data
wget http://download.geonames.org/export/dump/allCountries.zip -O ../data/allCountries.zip
unzip ../data/allCountries.zip -d ../data
We'll use pandas dataframes to make use of this data as quickly as possible. We use the pyarrow engine for faster reads, as this table is quite large, and read some of the codes as strings (over numbers) to make later operations easier.
columns = [
"geonameid",
"name",
"asciiname",
"alternatenames",
"latitude",
"longitude",
"feature class",
"feature code",
"country code",
"cc2",
"admin1 code",
"admin2 code",
"admin3 code",
"admin4 code",
"population",
"elevation",
"dem",
"timezone",
"modification date",
]
df = pd.read_table(
data_dir / "allCountries.txt",
header=None,
names=columns,
dtype={
"cc2": str,
"admin1 code": str,
"admin2 code": str,
"admin3 code": str,
"admin4 code": str,
},
index_col="geonameid",
engine="pyarrow",
)
df.head(2)
| geonameid | name | asciiname | alternatenames | latitude | longitude | feature class | feature code | country code | cc2 | admin1 code | admin2 code | admin3 code | admin4 code | population | elevation | dem | timezone | modification date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2986043 | Pic de Font Blanca | Pic de Font Blanca | Pic de Font Blanca,Pic du Port | 42.64991 | 1.53335 | T | PK | AD | | 00 | | | | 0 | NaN | 2860 | Europe/Andorra | 2014-11-05 |
| 2994701 | Roc Mélé | Roc Mele | Roc Mele,Roc Meler,Roc Mélé | 42.58765 | 1.74028 | T | MT | AD | AD,FR | 00 | | | | 0 | NaN | 2803 | Europe/Andorra | 2020-06-10 |
🔗 Filtering the main table
GeoNames contains not just place names, but also names of bodies of water, landmarks, notable buildings, and more. The type of feature recorded in each row of our table is stated in the feature code column, for which the full list of descriptions can be found here. We'll filter for the following:
- PPL: populated place; a city, town, village, or other agglomeration of buildings where people live and work
- PPL__: (underscores designate additional letters) the various subcategories of PPL
- ADM# (for # in [1, 4]): #-th order administrative division
- ADMD: administrative division; an administrative division of a country, undifferentiated as to administrative level
- PCL_: (underscore designates additional letters) political entity; roughly speaking these are countries
- TERR: territory; includes a few places like Western Sahara or Svalbard
Strictly speaking, we don't have to filter at all - our search will work regardless (although the index will take up a little more memory/storage). We do it here to restrict the results a little, such that we end up with places that may be part of an address line.
ppls = [
"PPL",
"PPLL",
"PPLA",
"PPLX",
"PPLC",
"PPLQ",
"PPLA2",
"PPLW",
"PPLF",
"PPLH", # does not exist anymore, historical
"PPLA3",
"PPLR",
"PPLCH",
"PPLA4",
"PPLS",
"PPLA5",
"PPLG",
"PRSH",
]
adms = ["ADMD", "ADM1", "ADM2", "ADM3", "ADM4", "ADM5"]
# We will ignore historical countries, like the German DDR, under PCLH
pcls = ["PCLI", "PCLD", "PCLIX", "PCLS", "PCLF", "PCL", "TERR"]
df = df[df["feature code"].isin(ppls + adms + pcls)]
🔗 Preprocess/transform data
Now that we have all our data loaded, we will replace some of those [country, admin] codes with names, as names will be much nicer to deal with than codes. We'll start by creating a list of countries. Three territories (TERR feature code) that use the AU (Australia) country code will be dropped, since we need country codes to map uniquely to names. We could drop TERR altogether, but then we would also lose Antarctica - and that's a no-go!
countries = df[df["feature code"].isin(set(pcls))].set_index(["country code"])
drop_terrs = [
"Territory of Ashmore and Cartier Islands",
"Coral Sea Islands Territory",
"Jervis Bay Territory",
]
countries = countries[~countries.name.isin(drop_terrs)]
assert not any(countries.index.duplicated())
Now we can efficiently add the country string like so:
df["country"] = countries.loc[df["country code"], "name"].to_numpy()
df["country"].value_counts().head(10)
People’s Republic of China      826993
Republic of India               554988
Republic of Indonesia           344059
Mexico                          250444
United States                   248855
Russian Federation              203915
Islamic Republic of Pakistan    147126
Republic of France              116931
Federal Republic of Germany      91627
Nepal                            87141
Name: country, dtype: int64
We also printed the top 10 countries by number of locations indexed in GeoNames as a sanity check. The top countries are all either very large or very populous, so this makes sense.
Since we need combinations of code columns to add the names of admin regions, df.merge is handier in this case (albeit probably a little slower). Note that using .values is important to make sure pandas merely appends a same-length vector here (otherwise it might use the geonameid index, whose order is not guaranteed to be correct through the merge).
cols = ["country code", "admin1 code"]
adm1 = df[df["feature code"] == "ADM1"]
df["admin1"] = df[cols].merge(adm1[cols + ["name"]], on=cols, how="left")["name"].values
cols.append("admin2 code")
adm2 = df[df["feature code"] == "ADM2"]
df["admin2"] = df[cols].merge(adm2[cols + ["name"]], on=cols, how="left")["name"].values
cols.append("admin3 code")
adm3 = df[df["feature code"] == "ADM3"]
df["admin3"] = df[cols].merge(adm3[cols + ["name"]], on=cols, how="left")["name"].values
cols.append("admin4 code")
adm4 = df[df["feature code"] == "ADM4"]
df["admin4"] = df[cols].merge(adm4[cols + ["name"]], on=cols, how="left")["name"].values
df[["country", "admin1", "admin2", "admin3", "admin4"]].dropna().head()
| geonameid | country | admin1 | admin2 | admin3 | admin4 |
|---|---|---|---|---|---|
| 2765121 | Republic of Austria | Wien | Wien Stadt | Gemeindebezirk Liesing | Siebenhirten |
| 2767332 | Republic of Austria | Wien | Wien Stadt | Gemeindebezirk Liesing | Rodaun |
| 2770793 | Republic of Austria | Wien | Wien Stadt | Gemeindebezirk Liesing | Erlaa |
| 2771781 | Republic of Austria | Wien | Wien Stadt | Gemeindebezirk Liesing | Wien Mauer |
| 2772487 | Republic of Austria | Wien | Wien Stadt | Gemeindebezirk Liesing | Liesing |
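The four merge blocks above are nearly identical, so the same can also be written as a loop - an equivalent reformulation, not extra functionality:

# Equivalent loop form of the four admin-name merges above.
cols = ["country code"]
for level in range(1, 5):
    cols.append(f"admin{level} code")
    adm = df[df["feature code"] == f"ADM{level}"]
    df[f"admin{level}"] = (
        df[cols].merge(adm[cols + ["name"]], on=cols, how="left")["name"].values
    )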
In practice, only admin1 and admin2 may actually be really useful for querying, as the other two are both empty for 75% of the data, with only 8% of the data having admin4 available:
df[["admin3", "admin4"]].isnull().value_counts(normalize=True)
admin3  admin4
True    True      0.747178
False   True      0.150851
        False     0.081360
True    False     0.020611
dtype: float64
🔗 Building our return structure
Our dataframe is already indexed by a unique identifier for each location, and contains all names and codes we are interested in. Let's select the important features that can help determine which of the returned matches is the one the user was looking for:
locations = df[
[
"name",
"latitude",
"longitude",
"feature code",
"country code",
"country",
"admin1",
"admin2",
"admin3",
"admin4",
"population",
]
]
locations = locations.fillna("")
# Let's also save this in case we'd like to package an API at some point
locations.to_parquet(data_dir / "locations.parquet")
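If we do build that API later, the preprocessed table can simply be reloaded instead of redoing all of the steps above (a trivial sketch, assuming the same data_dir):

# Reload the preprocessed locations table, e.g. in a separate process.
locations = pd.read_parquet(data_dir / "locations.parquet")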
🔗 Building our search index
For our index, we'd like to have a dictionary that maps from search terms to all places that match the search term. Let's start by creating a comma-separated string containing all potential names for each place:
names = df.name + "," + df.asciiname + "," + df.alternatenames
# Get rid of trailing comma in case alternatenames was empty
names = names.str.strip(",")
names.iloc[:10]
geonameid
3038816                            Xixerella,Xixerella
3038832                      Vila,Vila,Casas Vila,Vila
3038899    Tossalet i Vinyals,Tossalet i Vinyals,Tossalet...
3038987                    Sornàs,Sornas,Sornas,Sornàs
3038999                                  Soldeu,Soldeu
3039039                    Solà d’Encamp,Sola d'Encamp
3039077                        Sispony,Sispony,Sispony
3039132                        Segudet,Segudet,Segudet
3039154       El Tarter,El Tarter,Ehl Tarter,Эл Тартер
3039162    Sant Julià de Lòria,Sant Julia de Loria,Paroch...
dtype: object
Now we lowercase and split this string, and make one large list containing geonameids and corresponding names.
location_ids_names = []
for geonameid, all_names in names.items():
all_names = set(all_names.lower().split(","))
location_ids_names.extend([(geonameid, x.strip()) for x in all_names])
len(location_ids_names)
12790646
Our index will consist of over 12M query terms - it contains the names of all populated places which exist in the GeoNames database. For something like Twitter bios, this should work just fine!
A fast (and simple) lookup can be done using a Python dictionary - dictionaries are implemented as hash tables, giving us constant-time lookups for a single query. We'll use defaultdict(list) to construct our dictionary, appending new ids to query terms as we encounter them.
name_lookup = defaultdict(list)
for geonameid, name in location_ids_names:
name_lookup[name].append(geonameid)
This dictionary maps from a search term to a list of geonameids corresponding to all locations which contain the search term in their list of names:
name_lookup["honolulu"]
[4471780, 5856194, 5856195, 5856199]
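If we want to reuse this index without rebuilding it from allCountries.txt, we could persist it next to the parquet file from earlier - a minimal sketch using pickle (the file name is my choice, not something used elsewhere in this post):

import pickle

# Store the index as a plain dict so loading it back doesn't require the
# defaultdict factory.
with open(data_dir / "name_lookup.pkl", "wb") as f:
    pickle.dump(dict(name_lookup), f)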
🔗 Making lookups
We now have everything we need to query our locations given a search query. The lookup simply looks like this:
- Get geonameids of matches by indexing our search dictionary with the query term
- Return locations of matching geonameids
For step 2, our locations are currently stored in a pandas DataFrame, which we can index directly by geonameid. Indexing a dataframe is not very efficient for single-item lookups, but probably fine depending on how many lookups we want to execute (and in what timeframe).
def search(place: str) -> pd.DataFrame:
geonameids = name_lookup.get(place, [])
return locations.loc[geonameids]
search("honolulu")
| geonameid | name | latitude | longitude | feature code | country code | country | admin1 | admin2 | admin3 | admin4 | population |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4471780 | Honolulu | 35.36294 | -77.28163 | PPL | US | United States | North Carolina | Craven County | Township 1 | | 0 |
| 5856194 | Honolulu County | 21.45543 | -157.96138 | ADM2 | US | United States | Hawaii | Honolulu County | | | 1016508 |
| 5856195 | Honolulu | 21.30694 | -157.85833 | PPLA | US | United States | Hawaii | Honolulu County | | | 371657 |
| 5856199 | Hawaiian Beaches | 19.55417 | -154.89000 | ADMD | US | United States | Hawaii | Hawaii County | | | 0 |
A single search takes about 250 microseconds on my machine, although this number can go up and down depending on the number of items that actually match a query.
%timeit search("honolulu")
248 µs ± 2.22 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
🔗 Multiple search terms
Having a single search term can lead to lots of matches, and little specificity. Often, locations contain multiple (especially two) terms, for example:
- "Hamburg, Germany" - referring to the German city
- "Hamburg, NJ" - referring to Hamburg in the state of New Jersey, USA
We can leverage the fact that locations of all granularities exist in GeoNames. In the above example, "Hamburg, NJ", both Hamburg and NJ can be queried, and a selection can be made that's consistent with both queries.
Recall that each location is coded according to the following hierarchy:
<country code> <admin1 code> <admin2 code> <admin3 code> <admin4 code>
Let's take the following example queries (shortened up to admin2):
| Query | name | country | admin1 | admin2 | feature code |
|---|---|---|---|---|---|
| hamburg | Hamburg | Germany | Hamburg | Hamburg | PPLA |
| hamburg | Hamburg | USA | New Jersey | Sussex County | PPL |
| hamburg | Hamburg | Canada | Manitoba | | PPLL |
| germany | Germany | Germany | | | PCLI |
| germany | Germany | Canada | New Brunswick | | PPLL |
We're trying to find "Hamburg, Germany" (which should be obvious); however, there is also a "Hamburg, Canada" as well as a "Germany, Canada". Could the latter be an acceptable match? No! "Germany, Canada" is a populated locality (PPLL) in the New Brunswick province of Canada. "Hamburg, Canada" is another PPLL, in the Manitoba province.
🔗 Naive implementation
When looking up two search terms, we assume that one of them refers to a populated place, and the other to an administrative division the place is part of. We want to find an efficient way to look this up - in pseudocode our algorithm might read something like this:
Given search terms a & b:
1. Look up a and b in our lookup table
2. For each match in a:
     For each match in b:
       If a is an administrative division:
         If b is part of a: -> match
       If b is an administrative division:
         If a is part of b: -> match
A naive Python implementation of this, utilizing our pandas data structures, looks like this (going only up to ADM2 for simplicity):
def search2(a: str, b: str):
matches_a = search(a)
matches_b = search(b)
out = []
for _, loc_a in matches_a.iterrows():
for _, loc_b in matches_b.iterrows():
if loc_a["feature code"] in (pcls + ["ADM1", "ADM2"]):
loc_b_admins = loc_b[["country", "admin1", "admin2"]]
# If we have a match, there can't be a more granular division
if any(loc_a["name"] == loc_b_admins):
out.append(loc_b)
if loc_b["feature code"] in (pcls + ["ADM1", "ADM2"]):
loc_a_admins = loc_a[["country", "admin1", "admin2"]]
# If we have a match, there can't be a more granular division
if any(loc_b["name"] == loc_a_admins):
out.append(loc_a)
return pd.concat(out, axis=1).T
This works - here we see how 4 correct results are returned for the query "hamburg, germany" (they're all correct because Hamburg refers both to a place and to several administrative divisions):
search2("hamburg", "germany")[["name", "country", "admin1", "feature code"]]
| | name | country | admin1 | feature code |
|---|---|---|---|---|
| 2911297 | Free and Hanseatic City of Hamburg | Federal Republic of Germany | Free and Hanseatic City of Hamburg | ADM1 |
| 2911298 | Hamburg | Federal Republic of Germany | Free and Hanseatic City of Hamburg | PPLA |
| 6547395 | Hamburg, Freie und Hansestadt | Federal Republic of Germany | Free and Hanseatic City of Hamburg | ADM3 |
| 7602585 | Hamburg, Freie und Hansestadt | Federal Republic of Germany | Free and Hanseatic City of Hamburg | ADM4 |
The only problem is its speed - especially as we get to heavy queries, our for loops and single-element dataframe accessors are going to bite us. Let's query for "San Antonio, Mexico" (San Antonio is the most common place name in this dataset, and there are many San Antonios in Mexico!), and time the result. Note that this is probably not the most realistic query given its lack of specificity, but let's see how it does:
%time len(search2("san antonio", "mexico"))
CPU times: user 8.6 s, sys: 0 ns, total: 8.6 s
Wall time: 8.59 s

1072
Ouch! Several seconds to retrieve 1072 matches for a single query. At this rate we'll never be able to get location data for a large dataset containing millions of queries.
🔗 Better utilizing hierarchy
It's actually not trivial to optimize this. Sure, using numpy arrays and faster iterators over our dataframes (like itertuples) will gain us some speed, but not enough to be satisfying. Can we vectorize this, perhaps in part? I first played around with broadcasting np.equal over the two searches and filtering the boolean result - this is fast, but can use quite a bit of memory in edge cases (as we'd be creating a potentially large matrix). This approach also got complex quickly because it's harder to leverage the fact that some results are administrative divisions.
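For illustration, the broadcasting idea looks roughly like this - a sketch of the discarded approach, which ignores the administrative-division check (exactly the part where it got complicated):

# Candidates for both terms, as returned by our search function.
matches_a = search("germany")
matches_b = search("hamburg")
# Compare every candidate name of query a against every [country, admin1, admin2]
# value of query b's candidates in one go. The intermediate boolean array has shape
# (len(matches_a), len(matches_b), 3) - the potentially large matrix mentioned above.
names_a = matches_a["name"].to_numpy()[:, None, None]
admins_b = matches_b[["country", "admin1", "admin2"]].to_numpy()[None, :, :]
b_contained_in_a = np.equal(names_a, admins_b).any(axis=2)  # shape (len_a, len_b)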
In the end, the winner exploits the hierarchical structure of the dataset, allowing us to vectorize one of the two for-loops using numpy functions. Recall how earlier I stated that ever more granular divisions are geographically contained in the preceding division? When given two query terms, we assume that one refers to the place someone would like to reference, and the other to some region containing that place. Given two potential matches, we'd thus like to efficiently check whether one could be the "containing region" (or administrative division) of the other.
To achieve this, we'll first create a new feature, admin_string, which is the concatenation of all admin codes. We remove occurrences of 00, which represents 'general features where no specific admin code is defined'.
cols = ['country code', 'admin1 code', 'admin2 code', 'admin3 code', 'admin4 code']
admin_string = df[cols[0]].str.replace("00", "")
for col in cols[1:]:
admin_string = admin_string + df[col].str.replace("00", "")
locations["admin_string"] = admin_string.to_numpy()
We loop over both queries' matches separately, and accept a result if a given admin_string is a prefix of one of the other query's matches' admin_strings. To vectorize this for each "direction" of the query, we can use np.char.startswith(admin_strings, s), where s is the admin_string of a given location. Due to the hierarchical nature of locations, a matching place must have a paired administrative region whose admin_string is a prefix of the place's own admin_string!
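To make the prefix check concrete, here's a toy illustration with made-up admin_string values (real ones are concatenations of GeoNames codes, as built above):

# Two hypothetical candidates: one inside region "AA01", one inside region "BB05".
candidate_admin_strings = np.array(["AA01234", "BB05678"])
# Which candidates lie inside the region with admin_string "AA"? -> [ True False]
print(np.char.startswith(candidate_admin_strings, "AA"))
# And inside the more specific region "AA01"? -> [ True False]
print(np.char.startswith(candidate_admin_strings, "AA01"))

The full implementation applies this check in both query directions: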
def search2_opt(a, b):
matches_a = search(a)
matches_b = search(b)
index_a, index_b = matches_a.index, matches_b.index
admin_strings_a = matches_a["admin_string"].to_numpy(dtype=np.string_)
admin_strings_b = matches_b["admin_string"].to_numpy(dtype=np.string_)
matches = []
for s, feature_code in zip(admin_strings_a, matches_a["feature code"]):
if feature_code in (pcls + ["ADM1", "ADM2", "ADM3", "ADM4"]):
matches.extend(index_b[np.char.startswith(admin_strings_b, s)].to_list())
for s, feature_code in zip(admin_strings_b, matches_b["feature code"]):
if feature_code in (pcls + ["ADM1", "ADM2", "ADM3", "ADM4"]):
matches.extend(index_a[np.char.startswith(admin_strings_a, s)].to_list())
return locations.loc[matches]
Let's see how we fare in our worst-case scenario (using timeit now, as this one runs fast!), as well as the other example I gave earlier:
%timeit search2_opt("san antonio", "mexico")
%timeit search2_opt("hamburg", "germany")
13.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.5 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
From seconds to a few milliseconds - this represents a speedup of over 600x in the worst case, and several hundred times in the normal case. Neither method is very pretty in code, but both are short enough to grasp quickly, and they do their job just fine.
🔗 Example output
I didn't show many examples yet, so there may still be doubts as to how well this works in practice - fair! Here are some examples where I pick random places off the globe and show how it arrives at the correct location. Note that population is often only filled in for places which have at least thousands of inhabitants, so random/obscure searches will seldom return useful values there.
search2_opt("albuquerque", "brazil")
| geonameid | name | latitude | longitude | feature code | country code | country | admin1 | admin2 | admin3 | admin4 | population | admin_string |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3408075 | Albuquerque Né | -8.00250 | -37.41722 | PPLL | BR | Federative Republic of Brazil | Pernambuco | Sertânia | | | 0 | BR302614105 |
| 3408077 | Albuquerque | -6.88722 | -58.34083 | PPL | BR | Federative Republic of Brazil | Pará | Jacareacanga | | | 0 | BR161503754 |
| 3408078 | Albuquerque | -4.58333 | -37.76667 | PPL | BR | Federative Republic of Brazil | Ceará | Aracati | | | 0 | BR062301109 |
| 3464318 | Embaúba | -20.98250 | -48.83556 | PPL | BR | Federative Republic of Brazil | São Paulo | Embaúba | | | 0 | BR273514957 |
| 3472737 | Bairro do Albuquerque | -22.38580 | -42.90893 | PPLL | BR | Federative Republic of Brazil | Rio de Janeiro | Teresópolis | | | 0 | BR213305802 |
| 3472738 | Albuquerque | -19.40611 | -57.41000 | PPLL | BR | Federative Republic of Brazil | Mato Grosso do Sul | Corumbá | | | 0 | BR1153207 |
| 7566388 | Albuquerque | -22.38060 | -42.91934 | PPLL | BR | Federative Republic of Brazil | Rio de Janeiro | Teresópolis | | | 0 | BR213305802 |
search2_opt("albuquerque", "rio de janeiro")
| geonameid | name | latitude | longitude | feature code | country code | country | admin1 | admin2 | admin3 | admin4 | population | admin_string |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3472737 | Bairro do Albuquerque | -22.3858 | -42.90893 | PPLL | BR | Federative Republic of Brazil | Rio de Janeiro | Teresópolis | | | 0 | BR213305802 |
| 7566388 | Albuquerque | -22.3806 | -42.91934 | PPLL | BR | Federative Republic of Brazil | Rio de Janeiro | Teresópolis | | | 0 | BR213305802 |
search2_opt("rio de janeiro", "chiapas")
| geonameid | name | latitude | longitude | feature code | country code | country | admin1 | admin2 | admin3 | admin4 | population | admin_string |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8871984 | Río de Janeiro | 16.11278 | -91.54250 | PPL | MX | Mexico | Estado de Chiapas | La Trinitaria | | | 201 | MX05099 |
| 8907915 | Río de Janeiro | 17.49389 | -93.10333 | PPL | MX | Mexico | Estado de Chiapas | Pichucalco | | | 17 | MX05068 |
🔗 Conclusion
Using the freely available GeoNames Gazetteer data, we're able to write a relatively competent, yet simple, geocoder in Python that can look up thousands of queries per second (the times above use only a single core [i7-7700K]) on a single machine.
This may not even be a competitive offline geocoder (I link some good ones just below), but it is simple, adaptable, and a fun showcase of how easily we can get something usable with some Python. Note though that, while Python is great for preparing the data, the actual query implementation would probably be a lot nicer (and faster) in other languages, where we could simply implement the naive version efficiently.
By the way, the original goal was to automatically enrich Twitter data with geographical coordinates, but what I've shown returns potentially many matches per query. It's not too hard to make a selection - adding it here would just make this article too long! Some ideas (a minimal sketch of the first follows the list):
- Sort by population, take first
- Prefer certain feature codes (e.g. prefer seats of administrative divisions over 'normal' places)
- Add more features encoding some kind of ranking to use in sorting
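Here's a minimal sketch of the first idea; geocode_one is a hypothetical helper name, not something defined earlier:

from typing import Optional

def geocode_one(a: str, b: Optional[str] = None) -> Optional[pd.Series]:
    """Return the single most populous match for a query, or None if nothing matched."""
    matches = search2_opt(a, b) if b else search(a)
    if matches.empty:
        return None
    return matches.sort_values("population", ascending=False).iloc[0]

# e.g. geocode_one("hamburg", "germany") picks the most populous of the matches we saw above.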
If you're interested in the plot I needed the geocoder for, have a look at https://timlod.github.io/.
🔗 Street-level data
This approach relies completely on the GeoNames data, and can't return what the dataset doesn't provide - the biggest drawback being addresses. Most geocoders allow you to look up streets; this one can't. However, the approach could possibly be adapted to use e.g. OpenStreetMap data (which is significantly larger and probably more complex to work with).
Once we go there though, we may want to look into pre-built solutions that are already competent, like Nominatim, photon or pelias. Note that those are usually a bit involved to set up and require you to download many GB of index data.