Tim Loderhose

Building a Geocoder in Python

In this post we'll build a geocoder - a piece of software which, given a string referring to a place (e.g. 'hamburg'), returns the possible locations that string may refer to (for example as coordinates).

Geocoding systems are commonly available as rate-limited APIs, e.g. https://developers.google.com/maps/documentation/geocoding/overview. Often free tiers exist, but once a lot of data is requested, the cost can be substantial (see ArcGIS pricing).

For a university project quite a while ago, I scraped the Twitter API to map out sentiment around climate change across the globe. The only problem there was that most tweets are not geo-tagged - so while millions of tweets could be collected, most of these could not directly be placed on a map. Many Twitter users do state their location in their bio, although not necessarily accurately: some name their country, others their city, some put a flag, some a combination of these, and many don't put a real location at all.

In any case, at scale a commercial geocoding API won't help us when we have collected a large amount of 'informal' location data like this. So let's try building our own!

🔗 GeoNames

GeoNames is a public geographical database available under a Creative Commons License. It contains names of geographical points along with alternate names, coordinates, country code, elevation, etc.

For a full description of all tables, check http://download.geonames.org/export/dump/.

🔗 The plan

We want to use the GeoNames dataset to build a search index based on 'name' information - generally these are strings indicating a location. There are likely many places that have the same name, so we'll also need to add some information to order/narrow down/choose candidates.

The following (from here) contains the information we have available:

The main 'geoname' table has the following fields :
---------------------------------------------------
geonameid         : integer id of record in geonames database
name              : name of geographical point (utf8) varchar(200)
asciiname         : name of geographical point in plain ascii characters, varchar(200)
alternatenames    : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
latitude          : latitude in decimal degrees (wgs84)
longitude         : longitude in decimal degrees (wgs84)
feature class     : see http://www.geonames.org/export/codes.html, char(1)
feature code      : see http://www.geonames.org/export/codes.html, varchar(10)
country code      : ISO-3166 2-letter country code, 2 characters
cc2               : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
admin1 code       : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code       : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) 
admin3 code       : code for third level administrative division, varchar(20)
admin4 code       : code for fourth level administrative division, varchar(20)
population        : bigint (8 byte int) 
elevation         : in meters, integer
dem               : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
timezone          : the iana timezone id (see file timeZone.txt) varchar(40)
modification date : date of last modification in yyyy-MM-dd format

Given a search string, we'd like to return some subset of this information (which we could potentially use later to make a choice among matches).
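
To make the goal concrete, the end result boils down to two structures: a mapping from names to GeoNames ids, and a collection of details keyed by id. A toy sketch (the toy_* names and entries here are purely illustrative; the real structures are built step by step below):

# Toy sketch of what we're building (entries simplified for illustration).
toy_name_lookup = {"hamburg": [2911297, 2911298]}  # name -> all matching geonameids
toy_locations = {  # geonameid -> details we'd like to return
    2911297: {"name": "Free and Hanseatic City of Hamburg", "country": "Germany"},
    2911298: {"name": "Hamburg", "country": "Germany"},
}
# A query is then just two lookups:
[toy_locations[i] for i in toy_name_lookup.get("hamburg", [])]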

🔗 Imports and Environment Variables

Note - this blog post is an Org Mode narrative, which runs this work top-to-bottom. In all my narratives, I try to define vital imports and environment variables up top.

from collections import defaultdict
from pathlib import Path

import numpy as np
import pandas as pd
data_dir = Path("../data")

🔗 Load data

Let's start off by downloading the dataset into a data directory.

mkdir -p ../data
wget http://download.geonames.org/export/dump/allCountries.zip -O ../data/allCountries.zip
unzip ../data/allCountries.zip -d ../data

We'll use pandas dataframes to get going with this data as quickly as possible. We use the pyarrow engine for faster reads, as this table is quite large, and read some of the codes as strings (rather than numbers) to make later operations easier.

columns = [
    "geonameid",
    "name",
    "asciiname",
    "alternatenames",
    "latitude",
    "longitude",
    "feature class",
    "feature code",
    "country code",
    "cc2",
    "admin1 code",
    "admin2 code",
    "admin3 code",
    "admin4 code",
    "population",
    "elevation",
    "dem",
    "timezone",
    "modification date",
]
df = pd.read_table(
    data_dir / "allCountries.txt",
    header=None,
    names=columns,
    dtype={
        "cc2": str,
        "admin1 code": str,
        "admin2 code": str,
        "admin3 code": str,
        "admin4 code": str,
    },
    index_col="geonameid",
    engine="pyarrow",
)
df.head(2)
name asciiname alternatenames latitude longitude feature class feature code country code cc2 admin1 code admin2 code admin3 code admin4 code population elevation dem timezone modification date
geonameid
2986043 Pic de Font Blanca Pic de Font Blanca Pic de Font Blanca,Pic du Port 42.64991 1.53335 T PK AD 00 0 NaN 2860 Europe/Andorra 2014-11-05
2994701 Roc Mélé Roc Mele Roc Mele,Roc Meler,Roc Mélé 42.58765 1.74028 T MT AD AD,FR 00 0 NaN 2803 Europe/Andorra 2020-06-10

🔗 Filtering the main table

GeoNames contains not just place names, but also names of bodies of water, landmarks, notable buildings, and more. The type of feature recorded in each row of our table is stated in the feature code column, for which the full list of descriptions can be found here. We'll filter for the following:

  • PPL - populated place: a city, town, village, or other agglomeration of buildings where people live and work
  • PPL__ - (underscores designate additional letters) various subcategories of PPL
  • ADM# - (for # from 1 to 5) #th-order administrative division
  • ADMD - administrative division: an administrative division of a country, undifferentiated as to administrative level
  • PCL_ - (underscore designates additional letters) political entity; roughly speaking these are countries
  • TERR - territory: includes a few places like Western Sahara or Svalbard

Strictly speaking, we don't have to filter at all - our search will work regardless (although the index will take up a little more memory/storage). We filter here to restrict the results a little, such that we end up with places that could plausibly be part of an address line.

ppls = [
    "PPL",
    "PPLL",
    "PPLA",
    "PPLX",
    "PPLC",
    "PPLQ",
    "PPLA2",
    "PPLW",
    "PPLF",
    "PPLH",  # does not exist anymore, historical
    "PPLA3",
    "PPLR",
    "PPLCH",
    "PPLA4",
    "PPLS",
    "PPLA5",
    "PPLG",
    "PRSH",
]
adms = ["ADMD", "ADM1", "ADM2", "ADM3", "ADM4", "ADM5"]
# We will ignore historical countries, like the German DDR, under PCLH
pcls = ["PCLI", "PCLD", "PCLIX", "PCLS", "PCLF", "PCL", "TERR"]
df = df[df["feature code"].isin(ppls + adms + pcls)]

🔗 Preprocess/transform data

Now that we have all our data loaded, we will replace some of those [country, admin] codes with names, as names will be much nicer to deal with than codes. We'll start by creating a list of countries. Three territories (TERR feature code) that use the AU (Australia) country code will be dropped, since we need to map country codes to names uniquely. We could drop TERR altogether, but then we would also lose Antarctica - and that's a no-go!

countries = df[df["feature code"].isin(set(pcls))].set_index(["country code"])
drop_terrs = [
    "Territory of Ashmore and Cartier Islands",
    "Coral Sea Islands Territory",
    "Jervis Bay Territory",
]
countries = countries[~countries.name.isin(drop_terrs)]
assert not any(countries.index.duplicated())

Now we can efficiently add the country name like so:

df["country"] = countries.loc[df["country code"], "name"].to_numpy()
df["country"].value_counts().head(10)
People’s Republic of China      826993
Republic of India               554988
Republic of Indonesia           344059
Mexico                          250444
United States                   248855
Russian Federation              203915
Islamic Republic of Pakistan    147126
Republic of France              116931
Federal Republic of Germany      91627
Nepal                            87141
Name: country, dtype: int64

We also printed the top 10 countries by the number of locations indexed in GeoNames as a sanity check. The top countries are all geographically large or have large populations, so this makes sense.

Since we need combinations of code columns to add names of admin regions, df.merge is handier in this case (albeit probably a little slower). Note that using .values is important to make sure pandas merely appends a same-length vector here (otherwise it might use the geonameid index, whose order is not guaranteed to be correct through the merge).

cols = ["country code", "admin1 code"]
adm1 = df[df["feature code"] == "ADM1"]
df["admin1"] = df[cols].merge(adm1[cols + ["name"]], on=cols, how="left")["name"].values
cols.append("admin2 code")
adm2 = df[df["feature code"] == "ADM2"]
df["admin2"] = df[cols].merge(adm2[cols + ["name"]], on=cols, how="left")["name"].values
cols.append("admin3 code")
adm3 = df[df["feature code"] == "ADM3"]
df["admin3"] = df[cols].merge(adm3[cols + ["name"]], on=cols, how="left")["name"].values
cols.append("admin4 code")
adm4 = df[df["feature code"] == "ADM4"]
df["admin4"] = df[cols].merge(adm4[cols + ["name"]], on=cols, how="left")["name"].values

df[["country", "admin1", "admin2", "admin3", "admin4"]].dropna().head()
country admin1 admin2 admin3 admin4
geonameid
2765121 Republic of Austria Wien Wien Stadt Gemeindebezirk Liesing Siebenhirten
2767332 Republic of Austria Wien Wien Stadt Gemeindebezirk Liesing Rodaun
2770793 Republic of Austria Wien Wien Stadt Gemeindebezirk Liesing Erlaa
2771781 Republic of Austria Wien Wien Stadt Gemeindebezirk Liesing Wien Mauer
2772487 Republic of Austria Wien Wien Stadt Gemeindebezirk Liesing Liesing

In practice, only admin1 and admin2 may actually be really useful for querying, as the other two are both empty for 75% of the data, with only 8% of data having admin4 available:

df[["admin3", "admin4"]].isnull().value_counts(normalize=True)
admin3  admin4
True    True      0.747178
False   True      0.150851
        False     0.081360
True    False     0.020611
dtype: float64

🔗 Building our return structure

Our dataframe is already indexed by a unique identifier for each location, and contains all names and codes we are interested in. Let's keep just the important columns that can help determine which of the returned matches is the one the user was looking for:

locations = df[
    [
        "name",
        "latitude",
        "longitude",
        "feature code",
        "country code",
        "country",
        "admin1",
        "admin2",
        "admin3",
        "admin4",
        "population",
    ]
]
locations = locations.fillna("")
# Let's also save this in case we'd like to package an API at some point
locations.to_parquet(data_dir / "locations.parquet")

🔗 Building our search index

For our index, we'd like to have a dictionary that maps from search terms to all places that match the search term. Let's start by creating a comma-separated string containing all potential names for each place:

names = df.name + "," + df.asciiname + "," + df.alternatenames
# Get rid of trailing comma in case alternatenames was empty
names = names.str.strip(",")
names.iloc[:10]
geonameid
3038816                                  Xixerella,Xixerella
3038832                            Vila,Vila,Casas Vila,Vila
3038899    Tossalet i Vinyals,Tossalet i Vinyals,Tossalet...
3038987                          Sornàs,Sornas,Sornas,Sornàs
3038999                                        Soldeu,Soldeu
3039039                          Solà d’Encamp,Sola d'Encamp
3039077                              Sispony,Sispony,Sispony
3039132                              Segudet,Segudet,Segudet
3039154             El Tarter,El Tarter,Ehl Tarter,Эл Тартер
3039162    Sant Julià de Lòria,Sant Julia de Loria,Paroch...
dtype: object

Now we lowercase and split this string, and make one large list containing geonameids and corresponding names.

location_ids_names = []
for geonameid, all_names in names.items():
    all_names = set(all_names.lower().split(","))
    location_ids_names.extend([(geonameid, x.strip()) for x in all_names])

len(location_ids_names)
12790646

Our index will consist of over 12M query terms - it contains all names of the populated places (and administrative divisions) we kept from the GeoNames database. For something like Twitter bios, this should work just fine!

A fast (and simple) lookup can be done using a Python dictionary - dictionaries are implemented as hash tables, giving us constant-time lookups for a single query. We'll use defaultdict(list) to construct our dictionary, appending new ids to query terms as we encounter them.

name_lookup = defaultdict(list)
for geonameid, name in location_ids_names:
    name_lookup[name].append(geonameid)

This dictionary maps from a search term to a list of geonameids corresponding to all locations which contain the search term in their list of names:

name_lookup["honolulu"]
4471780 5856194 5856195 5856199

🔗 Making lookups

We now have everything we need to look up locations given a search query. The lookup simply works like this:

  1. Get geonameids of matches by indexing our search dictionary with the query term
  2. Return locations of matching geonameids

For step 2, locations are currently stored in a pandas DataFrame, which we can index directly by geonameid. Indexing a dataframe is not very efficient for single-item lookups, but probably fine depending on how many lookups we want to execute (and in what timeframe).

def search(place: str) -> pd.DataFrame:
    geonameids = name_lookup.get(place, [])
    return locations.loc[geonameids]

search("honolulu")
name latitude longitude feature code country code country admin1 admin2 admin3 admin4 population
geonameid
4471780 Honolulu 35.36294 -77.28163 PPL US United States North Carolina Craven County Township 1 0
5856194 Honolulu County 21.45543 -157.96138 ADM2 US United States Hawaii Honolulu County 1016508
5856195 Honolulu 21.30694 -157.85833 PPLA US United States Hawaii Honolulu County 371657
5856199 Hawaiian Beaches 19.55417 -154.89000 ADMD US United States Hawaii Hawaii County 0

A single search takes about 250 microseconds on my machine, although this number can go up and down depending on the number of items that actually match a query.

%timeit search("honolulu")
248 µs ± 2.22 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
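
If we ever needed more throughput for these single-item lookups, one option (not used in the rest of this post) would be to pre-convert the rows to plain Python dicts and skip pandas indexing at query time - a minimal sketch:

# Optional: trade memory for speed by keeping each row as a plain dict
# (sketch only - the rest of this post sticks with the DataFrame version).
location_records = locations.to_dict(orient="index")  # geonameid -> {column: value}

def search_fast(place: str) -> list[dict]:
    return [location_records[i] for i in name_lookup.get(place, [])]

search_fast("honolulu")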

🔗 Multiple search terms

A single search term can lead to lots of matches and little specificity. Often, location strings contain multiple (especially two) terms, for example:

  • "Hamburg, Germany" - referring to the German city
  • "Hamburg, NJ" - referring to Hamburg in the state of New Jersey, USA

We can leverage the fact that locations of all granularities exist in GeoNames. In the above example, "Hamburg, NJ", both Hamburg and NJ can be queried, and a selection can be made that's consistent with both queries.

Recall that each location is placed within the following hierarchy of codes:

<country code> <admin1 code> <admin2 code> <admin3 code> <admin4 code>

Let's take the following example queries (shortened up to admin2):

query     name     country  admin1         admin2         feature code
hamburg   Hamburg  Germany  Hamburg        Hamburg        PPLA
          Hamburg  USA      New Jersey     Sussex County  PPL
          Hamburg  Canada   Manitoba                      PPLL
germany   Germany  Germany                                PCLI
          Germany  Canada   New Brunswick                 PPLL

We're trying to find "Hamburg, Germany" (which should be obvious); however, there seem to be a "Hamburg, Canada" as well as a "Germany, Canada". Could the latter be an acceptable match? No! "Germany, Canada" is a populated locality (PPLL) in the province of New Brunswick, and "Hamburg, Canada" is another PPLL in Manitoba.

🔗 Naive implementation

When looking up two search terms, we assume that one of them refers to a populated place, and the other to an administrative division the place is part of. We want to find an efficient way to look this up - in pseudocode our algorithm might read something like this:

Given search terms a & b:
1. Lookup a, b in our lookup table
2. For each match in a:
     For each match in b:
       If a is an administrative division:
         If b is part of a:
         -> match
       If b is an administrative division:
         If a is part of b:
         -> match

A naive Python implementation of this, utilizing our pandas data structures, looks like this (going only up to ADM2 for simplicity):

def search2(a: str, b: str):
    matches_a = search(a)
    matches_b = search(b)
    out = []
    for _, loc_a in matches_a.iterrows():
        for _, loc_b in matches_b.iterrows():
            if loc_a["feature code"] in (pcls + ["ADM1", "ADM2"]):
                loc_b_admins = loc_b[["country", "admin1", "admin2"]]
                # If we have a match, there can't be a more granular division
                if any(loc_a["name"] == loc_b_admins):
                    out.append(loc_b)
            if loc_b["feature code"] in (pcls + ["ADM1", "ADM2"]):
                loc_a_admins = loc_a[["country", "admin1", "admin2"]]
                # If we have a match, there can't be a more granular division
                if any(loc_b["name"] == loc_a_admins):
                    out.append(loc_a)
    return pd.concat(out, axis=1).T

This works - here we see 4 correct results returned for the query "hamburg, germany" (they're all correct because Hamburg refers both to a place and to several administrative divisions):

search2("hamburg", "germany")[["name", "country", "admin1", "feature code"]]
name country admin1 feature code
2911297 Free and Hanseatic City of Hamburg Federal Republic of Germany Free and Hanseatic City of Hamburg ADM1
2911298 Hamburg Federal Republic of Germany Free and Hanseatic City of Hamburg PPLA
6547395 Hamburg, Freie und Hansestadt Federal Republic of Germany Free and Hanseatic City of Hamburg ADM3
7602585 Hamburg, Freie und Hansestadt Federal Republic of Germany Free and Hanseatic City of Hamburg ADM4

The only problem is its speed - especially for heavy queries, our for loops and single-element dataframe accessors are going to bite us. Let's query for "San Antonio, Mexico" (San Antonio is the most common place name in this dataset, and there are many San Antonios in Mexico!) and time the result. Note that this is probably not the most realistic query given its lack of specificity, but let's see how it does:

%time len(search2("san antonio", "mexico"))
CPU times: user 8.6 s, sys: 0 ns, total: 8.6 s
Wall time: 8.59 s
1072

Ouch! Several seconds to retrieve 1072 matches for a single query. At this rate we'll never be able to get location data for a large dataset containing millions of queries.

🔗 Better utilizing hierarchy

It's actually not trivial to optimize this. Sure, using numpy arrays and faster iterators over our dataframes (like itertuples) will gain us some speed, but not enough to be satisfying. Can we vectorize this, perhaps in part? I first played around with broadcasting np.equal over the two searches and filtering the boolean result - this is fast, but can use quite a bit of memory in edge cases (as we'd be creating a potentially large matrix). This approach also got complex quickly because it's harder to leverage the fact that some results are administrative divisions.
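
For the curious, the broadcasting idea looked roughly like this (a sketch of the abandoned approach, not the exact code; it only covers one direction and still lacks the check that the matching row of a really is an administrative division):

# Sketch of the broadcasting idea (abandoned): compare every name from query a
# against every admin column of query b in one vectorized step.
matches_a, matches_b = search("hamburg"), search("germany")
names_a = matches_a["name"].to_numpy()
admins_b = matches_b[["country", "admin1", "admin2"]].to_numpy()
# hits[i, j] is True if name i of query a appears among the admin names of row j of query b.
# An (A x B x 3) boolean array is created along the way, which blows up memory for heavy queries.
hits = (names_a[:, None, None] == admins_b[None, :, :]).any(axis=-1)
candidates_b = matches_b[hits.any(axis=0)]  # rows of b consistent with some match of a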

In the end, the winner exploits the hierarchical structure of the dataset, allowing us to vectorize one of the two for-loops using numpy functions. Recall how earlier I stated that ever more granular divisions are geographically contained in the preceding division? When given two query terms, we assume that one refers to the place someone would like to reference, and the other to some region containing that place. Given two potential matches, we'd thus like to efficiently check whether one could be the "containing region" (or administrative division) of the other.

To achieve this, we'll first create a new feature, admin_string, which is the concatenation of all admin codes. We remove occurrences of 00, which represents 'general features where no specific admin code is defined'.

cols = ['country code', 'admin1 code', 'admin2 code', 'admin3 code', 'admin4 code']
admin_string = df[cols[0]].str.replace("00", "")
for col in cols[1:]:
    admin_string = admin_string + df[col].str.replace("00", "")

locations["admin_string"] = admin_string.to_numpy()

We loop over both queries' matches separately, and accept a result from one query whenever the admin_string s of an administrative-division match from the other query is a prefix of the result's admin_string. To vectorize this for each "direction" of the query, we can use np.char.startswith(admin_strings, s), where s is the admin_string of a given location. Due to the hierarchical nature of locations, any match must be contained in a paired administrative region whose admin_string is a prefix of its own admin_string!

def search2_opt(a, b):
    matches_a = search(a)
    matches_b = search(b)
    index_a, index_b = matches_a.index, matches_b.index
    admin_strings_a = matches_a["admin_string"].to_numpy(dtype=np.string_)
    admin_strings_b = matches_b["admin_string"].to_numpy(dtype=np.string_)
    matches = []
    for s, feature_code in zip(admin_strings_a, matches_a["feature code"]):
        if feature_code in (pcls + ["ADM1", "ADM2", "ADM3", "ADM4"]):
            matches.extend(index_b[np.char.startswith(admin_strings_b, s)].to_list())
    for s, feature_code in zip(admin_strings_b, matches_b["feature code"]):
        if feature_code in (pcls + ["ADM1", "ADM2", "ADM3", "ADM4"]):
            matches.extend(index_a[np.char.startswith(admin_strings_a, s)].to_list())
    return locations.loc[matches]

Let's see how we fare in our worst-case scenario (using timeit now, as this one runs fast!), as well as on the other example given earlier:

%timeit search2_opt("san antonio", "mexico")
%timeit search2_opt("hamburg", "germany")
13.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.5 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

From seconds to a few milliseconds - this represents a speedup of well over 600x in the worst case, and several hundred times in the normal case. Neither method is particularly pretty, but both are short enough to grasp quickly and do their job just fine.

🔗 Example output

I haven't shown that many examples, so there may still be doubts as to how well this works in practice - fair! Here are some examples where I pick random places off the globe and show how the search arrives at a correct location. Note that population is often only filled in for places with at least thousands of inhabitants, so random/obscure searches will seldom return useful values there.

search2_opt("albuquerque", "brazil")
name latitude longitude feature code country code country admin1 admin2 admin3 admin4 population admin_string
geonameid
3408075 Albuquerque Né -8.00250 -37.41722 PPLL BR Federative Republic of Brazil Pernambuco Sertânia 0 BR302614105
3408077 Albuquerque -6.88722 -58.34083 PPL BR Federative Republic of Brazil Pará Jacareacanga 0 BR161503754
3408078 Albuquerque -4.58333 -37.76667 PPL BR Federative Republic of Brazil Ceará Aracati 0 BR062301109
3464318 Embaúba -20.98250 -48.83556 PPL BR Federative Republic of Brazil São Paulo Embaúba 0 BR273514957
3472737 Bairro do Albuquerque -22.38580 -42.90893 PPLL BR Federative Republic of Brazil Rio de Janeiro Teresópolis 0 BR213305802
3472738 Albuquerque -19.40611 -57.41000 PPLL BR Federative Republic of Brazil Mato Grosso do Sul Corumbá 0 BR1153207
7566388 Albuquerque -22.38060 -42.91934 PPLL BR Federative Republic of Brazil Rio de Janeiro Teresópolis 0 BR213305802
search2_opt("albuquerque", "rio de janeiro")
name latitude longitude feature code country code country admin1 admin2 admin3 admin4 population admin_string
geonameid
3472737 Bairro do Albuquerque -22.3858 -42.90893 PPLL BR Federative Republic of Brazil Rio de Janeiro Teresópolis 0 BR213305802
7566388 Albuquerque -22.3806 -42.91934 PPLL BR Federative Republic of Brazil Rio de Janeiro Teresópolis 0 BR213305802
search2_opt("rio de janeiro", "chiapas")
name latitude longitude feature code country code country admin1 admin2 admin3 admin4 population admin_string
geonameid
8871984 Río de Janeiro 16.11278 -91.54250 PPL MX Mexico Estado de Chiapas La Trinitaria 201 MX05099
8907915 Río de Janeiro 17.49389 -93.10333 PPL MX Mexico Estado de Chiapas Pichucalco 17 MX05068

🔗 Conclusion

Using the freely available GeoNames Gazetteer data, we're able to write a relatively competent, yet simple, geocoder in Python that can look up thousands of queries per second on a single machine (the timings above use only a single core of an i7-7700K).

This may not even be a competitive offline geocoder (I link some good ones just below), but it is simple, adaptable, and a fun showcase of how easily we can get something usable with some Python. Note though that while Python is great for preparing the data, the actual query implementation would probably be a lot nicer (and faster) in other languages, where we could simply implement the naive version efficiently.

By the way, the original goal was to automatically enrich Twitter data with geographical coordinates, but what I've shown returns potentially many matches per query. It's not too hard to make a selection - adding it here would just make this article too long! Some ideas (with a rough sketch after the list):

  • Sort by population and take the first result
  • Prefer certain feature codes (e.g. prefer seats of administrative divisions over 'normal' places)
  • Add more features encoding some kind of ranking to use when sorting
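
As a rough illustration of the first two ideas (the preference dictionary and its ordering below are made up for illustration, not something used elsewhere in this post):

# Rank matches by a (made-up) feature-code preference, then break ties by population.
feature_preference = {"PCLI": 0, "PPLC": 1, "PPLA": 2, "PPL": 3}

def pick_best(matches: pd.DataFrame) -> pd.Series:
    rank = matches["feature code"].map(feature_preference).fillna(len(feature_preference))
    return matches.assign(rank=rank).sort_values(
        ["rank", "population"], ascending=[True, False]
    ).iloc[0]

pick_best(search2_opt("hamburg", "germany"))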

If you're interested in the plot I needed the geocoder for, have a look at https://timlod.github.io/.

🔗 Street-level data

This approach relies completely on the GeoNames data and can't return what the dataset doesn't provide - the biggest gap being street addresses. Most geocoders let you look up streets; this one can't. However, the approach could likely be adapted to use e.g. OpenStreetMap data (which is significantly larger and probably more complex).

Once we go there though, we may want to look into pre-built solutions that are already competent, like Nominatim, photon or pelias. Note that those are usually a bit involved to set up and require you to download many GB of index data.