Building a Geocoder in Python
In this work we'll build a geocoder - a piece of software which, given some string referring to a place (e.g. 'hamburg'), returns possible locations that the string may refer to (for example as coordinates).
Geocoding systems are commonly available as rate-limited APIs, e.g. https://developers.google.com/maps/documentation/geocoding/overview. Often free tiers exist, but once a lot of data is requested, the cost can be substantial (see ArcGIS pricing).
For a university project quite a while ago, I scraped the Twitter API to map out sentiment around climate change across the globe. The only problem there was that most tweets are not geo-tagged - so while millions of tweets could be collected, most of these could not directly be placed on a map. Many Twitter users do state their location in their bio, although not necessarily accurately: some name their country, others their city, some put a flag, some a combination of these, and many don't put a real location at all.
In any case, at scale a commercial geocoding API won't help us when we have collected a large amount of 'informal' location data like this. So let's try building our own!
🔗 GeoNames
GeoNames is a public geographical database available under a Creative Commons License. It contains names of geographical points along with alternate names, coordinates, country code, elevation, etc.
For a full description of all tables, check http://download.geonames.org/export/dump/.
🔗 The plan
We want to use the GeoNames dataset to build a search index based on 'name' information - generally these are strings indicating a location. There are likely many places that have the same name, so we'll also need to add some information to order/narrow down/choose candidates.
The following (from here) contains the information we have available:
The main 'geoname' table has the following fields :
---------------------------------------------------
geonameid : integer id of record in geonames database
name : name of geographical point (utf8) varchar(200)
asciiname : name of geographical point in plain ascii characters, varchar(200)
alternatenames : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000)
latitude : latitude in decimal degrees (wgs84)
longitude : longitude in decimal degrees (wgs84)
feature class : see http://www.geonames.org/export/codes.html, char(1)
feature code : see http://www.geonames.org/export/codes.html, varchar(10)
country code : ISO-3166 2-letter country code, 2 characters
cc2 : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters
admin1 code : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20)
admin2 code : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80)
admin3 code : code for third level administrative division, varchar(20)
admin4 code : code for fourth level administrative division, varchar(20)
population : bigint (8 byte int)
elevation : in meters, integer
dem : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat.
timezone : the iana timezone id (see file timeZone.txt) varchar(40)
modification date : date of last modification in yyyy-MM-dd format
Given a search string, we'd like to return some subset of this information (which we could potentially use later to make a choice among matches).
🔗 Imports and Environment Variables
Note - this blog post is an Org Mode narrative, which runs this work top-to-bottom. In all my narratives, I try to define vital imports and environment variables up top.
from collections import defaultdict
from pathlib import Path
import numpy as np
import pandas as pd
data_dir = Path("../data")
🔗 Load data
Let's start off by downloading the dataset into a data directory.
mkdir -p ../data
wget http://download.geonames.org/export/dump/allCountries.zip -O ../data/allCountries.zip
unzip ../data/allCountries.zip -d ../data
We'll use pandas dataframes to make use of this data as quickly as possible. We use the pyarrow engine for faster reads, as this table is quite large, and read some of the codes as strings (over numbers) to make later operations easier.
columns = [
"geonameid",
"name",
"asciiname",
"alternatenames",
"latitude",
"longitude",
"feature class",
"feature code",
"country code",
"cc2",
"admin1 code",
"admin2 code",
"admin3 code",
"admin4 code",
"population",
"elevation",
"dem",
"timezone",
"modification date",
]
df = pd.read_table(
data_dir / "allCountries.txt",
header=None,
names=columns,
dtype={
"cc2": str,
"admin1 code": str,
"admin2 code": str,
"admin3 code": str,
"admin4 code": str,
},
index_col="geonameid",
engine="pyarrow",
)
df.head(2)
| geonameid | name | asciiname | alternatenames | latitude | longitude | feature class | feature code | country code | cc2 | admin1 code | admin2 code | admin3 code | admin4 code | population | elevation | dem | timezone | modification date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2986043 | Pic de Font Blanca | Pic de Font Blanca | Pic de Font Blanca,Pic du Port | 42.64991 | 1.53335 | T | PK | AD | | 00 | | | | 0 | NaN | 2860 | Europe/Andorra | 2014-11-05 |
| 2994701 | Roc Mélé | Roc Mele | Roc Mele,Roc Meler,Roc Mélé | 42.58765 | 1.74028 | T | MT | AD | AD,FR | 00 | | | | 0 | NaN | 2803 | Europe/Andorra | 2020-06-10 |
🔗 Filtering the main table
GeoNames contains not just place names, but also names of bodies of water, landmarks, notable buildings, and more. The type of feature recorded in each row of our table is stated in the feature code column, for which the full list of descriptions can be found here. We'll filter for the following:
- PPL: populated place; a city, town, village, or other agglomeration of buildings where people live and work
- PPL__: (underscores designate additional letters) the various subcategories of PPL
- ADM# (for # in [1, 4]): #-th order administrative division
- ADMD: administrative division; an administrative division of a country, undifferentiated as to administrative level
- PCL_: (underscore designates additional letters) political entity; roughly speaking these are countries
- TERR: territory; includes a few places like Western Sahara or Svalbard
Strictly speaking, we don't have to filter at all - our search will work regardless (although the index will take up a little more memory/storage). We do it here to restrict the results a little, such that we end up with places that may be part of an address line.
ppls = [
"PPL",
"PPLL",
"PPLA",
"PPLX",
"PPLC",
"PPLQ",
"PPLA2",
"PPLW",
"PPLF",
"PPLH", # does not exist anymore, historical
"PPLA3",
"PPLR",
"PPLCH",
"PPLA4",
"PPLS",
"PPLA5",
"PPLG",
"PRSH",
]
adms = ["ADMD", "ADM1", "ADM2", "ADM3", "ADM4", "ADM5"]
# We will ignore historical countries, like the German DDR, under PCLH
pcls = ["PCLI", "PCLD", "PCLIX", "PCLS", "PCLF", "PCL", "TERR"]
df = df[df["feature code"].isin(ppls + adms + pcls)]
🔗 Preprocess/transform data
Now that we have all our data loaded, we will replace some of those [country, admin] codes with names, as names will be much nicer to deal with than codes. We'll start by creating a list of countries. Three territories (TERR feature code) that use the AU (Australia) country code will be dropped, since we need country codes to map uniquely to names. We could drop TERR altogether, but then we would also lose Antarctica - and that's a no-go!
countries = df[df["feature code"].isin(set(pcls))].set_index(["country code"])
drop_terrs = [
"Territory of Ashmore and Cartier Islands",
"Coral Sea Islands Territory",
"Jervis Bay Territory",
]
countries = countries[~countries.name.isin(drop_terrs)]
assert not any(countries.index.duplicated())
Now we can efficiently add the country string like so:
df["country"] = countries.loc[df["country code"], "name"].to_numpy()
df["country"].value_counts().head(10)
People’s Republic of China      826993
Republic of India               554988
Republic of Indonesia           344059
Mexico                          250444
United States                   248855
Russian Federation              203915
Islamic Republic of Pakistan    147126
Republic of France              116931
Federal Republic of Germany      91627
Nepal                            87141
Name: country, dtype: int64
We also printed the top 10 countries by number of locations indexed in GeoNames as a sanity check. The top countries are all either very large or very populous, so this makes sense.
Since we need combinations of code columns to add the names of admin regions, df.merge is handier in this case (albeit probably a little slower). Note that using .values is important to make sure pandas merely appends a same-length vector here (otherwise it might use the geonameid index, whose order is not guaranteed to be correct through the merge).
cols = ["country code", "admin1 code"]
adm1 = df[df["feature code"] == "ADM1"]
df["admin1"] = df[cols].merge(adm1[cols + ["name"]], on=cols, how="left")["name"].values
cols.append("admin2 code")
adm2 = df[df["feature code"] == "ADM2"]
df["admin2"] = df[cols].merge(adm2[cols + ["name"]], on=cols, how="left")["name"].values
cols.append("admin3 code")
adm3 = df[df["feature code"] == "ADM3"]
df["admin3"] = df[cols].merge(adm3[cols + ["name"]], on=cols, how="left")["name"].values
cols.append("admin4 code")
adm4 = df[df["feature code"] == "ADM4"]
df["admin4"] = df[cols].merge(adm4[cols + ["name"]], on=cols, how="left")["name"].values
df[["country", "admin1", "admin2", "admin3", "admin4"]].dropna().head()
| geonameid | country | admin1 | admin2 | admin3 | admin4 |
|---|---|---|---|---|---|
| 2765121 | Republic of Austria | Wien | Wien Stadt | Gemeindebezirk Liesing | Siebenhirten |
| 2767332 | Republic of Austria | Wien | Wien Stadt | Gemeindebezirk Liesing | Rodaun |
| 2770793 | Republic of Austria | Wien | Wien Stadt | Gemeindebezirk Liesing | Erlaa |
| 2771781 | Republic of Austria | Wien | Wien Stadt | Gemeindebezirk Liesing | Wien Mauer |
| 2772487 | Republic of Austria | Wien | Wien Stadt | Gemeindebezirk Liesing | Liesing |
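The four merge blocks above are nearly identical, so the same can also be written as a loop - an equivalent reformulation, not extra functionality:

# Equivalent loop form of the four admin-name merges above.
cols = ["country code"]
for level in range(1, 5):
    cols.append(f"admin{level} code")
    adm = df[df["feature code"] == f"ADM{level}"]
    df[f"admin{level}"] = (
        df[cols].merge(adm[cols + ["name"]], on=cols, how="left")["name"].values
    )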
In practice, only admin1 and admin2 may actually be really useful for querying, as the other two are both empty for 75% of the data, with only 8% of the data having admin4 available:
df[["admin3", "admin4"]].isnull().value_counts(normalize=True)
admin3  admin4
True    True      0.747178
False   True      0.150851
        False     0.081360
True    False     0.020611
dtype: float64
🔗 Building our return structure
Our dataframe is already indexed by a unique identifier for each location, and contains all names and codes we are interested in. Let's select the important features that can help determine which of the returned matches is the one the user was looking for:
locations = df[
[
"name",
"latitude",
"longitude",
"feature code",
"country code",
"country",
"admin1",
"admin2",
"admin3",
"admin4",
"population",
]
]
locations = locations.fillna("")
# Let's also save this in case we'd like to package an API at some point
locations.to_parquet(data_dir / "locations.parquet")
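If we do build that API later, the preprocessed table can simply be reloaded instead of redoing all of the steps above (a trivial sketch, assuming the same data_dir):

# Reload the preprocessed locations table, e.g. in a separate process.
locations = pd.read_parquet(data_dir / "locations.parquet")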
🔗 Building our search index
For our index, we'd like to have a dictionary that maps from search terms to all places that match the search term. Let's start by creating a comma-separated string containing all potential names for each place:
names = df.name + "," + df.asciiname + "," + df.alternatenames
# Get rid of trailing comma in case alternatenames was empty
names = names.str.strip(",")
names.iloc[:10]
geonameid
3038816                            Xixerella,Xixerella
3038832                      Vila,Vila,Casas Vila,Vila
3038899    Tossalet i Vinyals,Tossalet i Vinyals,Tossalet...
3038987                    Sornàs,Sornas,Sornas,Sornàs
3038999                                  Soldeu,Soldeu
3039039                    Solà d’Encamp,Sola d'Encamp
3039077                        Sispony,Sispony,Sispony
3039132                        Segudet,Segudet,Segudet
3039154       El Tarter,El Tarter,Ehl Tarter,Эл Тартер
3039162    Sant Julià de Lòria,Sant Julia de Loria,Paroch...
dtype: object
Now we lowercase and split this string, and make one large list containing geonameids and corresponding names.
location_ids_names = []
for geonameid, all_names in names.items():
all_names = set(all_names.lower().split(","))
location_ids_names.extend([(geonameid, x.strip()) for x in all_names])
len(location_ids_names)
12790646
Our index will consist of over 12M query terms - it contains the names of all populated places which exist in the GeoNames database. For something like Twitter bios, this should work just fine!
A fast (and simple) lookup can be done using a Python dictionary - dictionaries are implemented as hash tables, giving us constant-time lookups for a single query. We'll use defaultdict(list) to construct our dictionary, appending new ids to query terms as we encounter them.
name_lookup = defaultdict(list)
for geonameid, name in location_ids_names:
name_lookup[name].append(geonameid)
This dictionary maps from a search term to a list of geonameids corresponding to all locations which contain the search term in their list of names:
name_lookup["honolulu"]
[4471780, 5856194, 5856195, 5856199]
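If we want to reuse this index without rebuilding it from allCountries.txt, we could persist it next to the parquet file from earlier - a minimal sketch using pickle (the file name is my choice, not something used elsewhere in this post):

import pickle

# Store the index as a plain dict so loading it back doesn't require the
# defaultdict factory.
with open(data_dir / "name_lookup.pkl", "wb") as f:
    pickle.dump(dict(name_lookup), f)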
🔗 Making lookups
We now have everything we need to query our locations given a search query. The lookup simply looks like this:
- Get geonameids of matches by indexing our search dictionary with the query term
- Return locations of matching geonameids
For step 2, our locations are currently stored in a pandas DataFrame, which we can index directly by geonameid. Indexing a dataframe is not very efficient for single-item lookups, but probably fine depending on how many lookups we want to execute (and in what timeframe).
def search(place: str) -> pd.DataFrame:
geonameids = name_lookup.get(place, [])
return locations.loc[geonameids]
search("honolulu")
| geonameid | name | latitude | longitude | feature code | country code | country | admin1 | admin2 | admin3 | admin4 | population |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 4471780 | Honolulu | 35.36294 | -77.28163 | PPL | US | United States | North Carolina | Craven County | Township 1 | | 0 |
| 5856194 | Honolulu County | 21.45543 | -157.96138 | ADM2 | US | United States | Hawaii | Honolulu County | | | 1016508 |
| 5856195 | Honolulu | 21.30694 | -157.85833 | PPLA | US | United States | Hawaii | Honolulu County | | | 371657 |
| 5856199 | Hawaiian Beaches | 19.55417 | -154.89000 | ADMD | US | United States | Hawaii | Hawaii County | | | 0 |
A single search takes about 250 microseconds on my machine, although this number can go up and down depending on the number of items that actually match a query.
%timeit search("honolulu")
248 µs ± 2.22 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
🔗 Multiple search terms
Having a single search term can lead to lots of matches, and little specificity. Often, locations contain multiple (especially two) terms, for example:
- "Hamburg, Germany" - referring to the German city
- "Hamburg, NJ" - referring to Hamburg in the state of New Jersey, USA
We can leverage the fact that locations of all granularities exist in GeoNames. In the above example, "Hamburg, NJ", both Hamburg and NJ can be queried, and a selection can be made that's consistent with both queries.
Recall that each location is coded according to the following hierarchy:
<country code> <admin1 code> <admin2 code> <admin3 code> <admin4 code>
Let's take the following example queries (shortened up to admin2):
| Query | name | country | admin1 | admin2 | feature code |
|---|---|---|---|---|---|
| hamburg | Hamburg | Germany | Hamburg | Hamburg | PPLA |
| hamburg | Hamburg | USA | New Jersey | Sussex County | PPL |
| hamburg | Hamburg | Canada | Manitoba | | PPLL |
| germany | Germany | Germany | | | PCLI |
| germany | Germany | Canada | New Brunswick | | PPLL |
We're trying to find "Hamburg, Germany" (which should be obvious); however, there is also a "Hamburg, Canada" as well as a "Germany, Canada". Could the latter be an acceptable match? No! "Germany, Canada" is a populated locality (PPLL) in the New Brunswick province of Canada. "Hamburg, Canada" is another PPLL, in the Manitoba province.
🔗 Naive implementation
When looking up two search terms, we assume that one of them refers to a populated place, and the other to an administrative division the place is part of. We want to find an efficient way to look this up - in pseudocode our algorithm might read something like this:
Given search terms a & b:
1. Look up a and b in our lookup table
2. For each match in a:
     For each match in b:
       If a is an administrative division:
         If b is part of a: -> match
       If b is an administrative division:
         If a is part of b: -> match
A naive Python implementation of this, utilizing our pandas data structures, looks like this (going only up to ADM2 for simplicity):
def search2(a: str, b: str):
matches_a = search(a)
matches_b = search(b)
out = []
for _, loc_a in matches_a.iterrows():
for _, loc_b in matches_b.iterrows():
if loc_a["feature code"] in (pcls + ["ADM1", "ADM2"]):
loc_b_admins = loc_b[["country", "admin1", "admin2"]]
# If we have a match, there can't be a more granular division
if any(loc_a["name"] == loc_b_admins):
out.append(loc_b)
if loc_b["feature code"] in (pcls + ["ADM1", "ADM2"]):
loc_a_admins = loc_a[["country", "admin1", "admin2"]]
# If we have a match, there can't be a more granular division
if any(loc_b["name"] == loc_a_admins):
out.append(loc_a)
return pd.concat(out, axis=1).T
This works - here we see how 4 correct results are returned for the query "hamburg, germany" (they're all correct because Hamburg refers both to a place and to several administrative divisions):
search2("hamburg", "germany")[["name", "country", "admin1", "feature code"]]
| | name | country | admin1 | feature code |
|---|---|---|---|---|
| 2911297 | Free and Hanseatic City of Hamburg | Federal Republic of Germany | Free and Hanseatic City of Hamburg | ADM1 |
| 2911298 | Hamburg | Federal Republic of Germany | Free and Hanseatic City of Hamburg | PPLA |
| 6547395 | Hamburg, Freie und Hansestadt | Federal Republic of Germany | Free and Hanseatic City of Hamburg | ADM3 |
| 7602585 | Hamburg, Freie und Hansestadt | Federal Republic of Germany | Free and Hanseatic City of Hamburg | ADM4 |
The only problem is its speed - especially as we get to heavy queries, our for loops and single-element dataframe accessors are going to bite us. Let's query for "San Antonio, Mexico" (San Antonio is the most common place name in this dataset, and there are many San Antonios in Mexico!), and time the result. Note that this is probably not the most realistic query given its lack of specificity, but let's see how it does:
%time len(search2("san antonio", "mexico"))
CPU times: user 8.6 s, sys: 0 ns, total: 8.6 s
Wall time: 8.59 s

1072
Ouch! Several seconds to retrieve 1072 matches for a single query. At this rate we'll never be able to get location data for a large dataset containing millions of queries.
🔗 Better utilizing hierarchy
It's actually not trivial to optimize this. Sure, using numpy arrays and faster iterators over our dataframes (like itertuples) will gain us some speed, but not enough to be satisfying. Can we vectorize this, perhaps in part? I first played around with broadcasting np.equal over the two searches and filtering the boolean result - this is fast, but can use quite a bit of memory in edge cases (as we'd be creating a potentially large matrix). This approach also got complex quickly because it's harder to leverage the fact that some results are administrative divisions.
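For illustration, the broadcasting idea looks roughly like this - a sketch of the discarded approach, which ignores the administrative-division check (exactly the part where it got complicated):

# Candidates for both terms, as returned by our search function.
matches_a = search("germany")
matches_b = search("hamburg")
# Compare every candidate name of query a against every [country, admin1, admin2]
# value of query b's candidates in one go. The intermediate boolean array has shape
# (len(matches_a), len(matches_b), 3) - the potentially large matrix mentioned above.
names_a = matches_a["name"].to_numpy()[:, None, None]
admins_b = matches_b[["country", "admin1", "admin2"]].to_numpy()[None, :, :]
b_contained_in_a = np.equal(names_a, admins_b).any(axis=2)  # shape (len_a, len_b)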
In the end, the winner exploits the hierarchical structure of the dataset, allowing us to vectorize one of the two for-loops using numpy functions. Recall how earlier I stated that ever more granular divisions are geographically contained in the preceding division? When given two query terms, we assume that one refers to the place someone would like to reference, and the other to some region containing that place. Given two potential matches, we'd thus like to efficiently check whether one could be the "containing region" (or administrative division) of the other.
To achieve this, we'll first create a new feature, admin_string, which is the concatenation of all admin codes. We remove occurrences of 00, which represents 'general features where no specific admin code is defined'.
cols = ['country code', 'admin1 code', 'admin2 code', 'admin3 code', 'admin4 code']
admin_string = df[cols[0]].str.replace("00", "")
for col in cols[1:]:
admin_string = admin_string + df[col].str.replace("00", "")
locations["admin_string"] = admin_string.to_numpy()
We loop over both queries' matches separately, and accept a result if a given admin_string is a prefix of one of the other query's matches' admin_strings. To vectorize this for each "direction" of the query, we can use np.char.startswith(admin_strings, s), where s is the admin_string of a given location. Due to the hierarchical nature of locations, a matching place must have a paired administrative region whose admin_string is a prefix of the place's own admin_string!
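To make the prefix check concrete, here's a toy illustration with made-up admin_string values (real ones are concatenations of GeoNames codes, as built above):

# Two hypothetical candidates: one inside region "AA01", one inside region "BB05".
candidate_admin_strings = np.array(["AA01234", "BB05678"])
# Which candidates lie inside the region with admin_string "AA"? -> [ True False]
print(np.char.startswith(candidate_admin_strings, "AA"))
# And inside the more specific region "AA01"? -> [ True False]
print(np.char.startswith(candidate_admin_strings, "AA01"))

The full implementation applies this check in both query directions: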
def search2_opt(a, b):
matches_a = search(a)
matches_b = search(b)
index_a, index_b = matches_a.index, matches_b.index
admin_strings_a = matches_a["admin_string"].to_numpy(dtype=np.string_)
admin_strings_b = matches_b["admin_string"].to_numpy(dtype=np.string_)
matches = []
for s, feature_code in zip(admin_strings_a, matches_a["feature code"]):
if feature_code in (pcls + ["ADM1", "ADM2", "ADM3", "ADM4"]):
matches.extend(index_b[np.char.startswith(admin_strings_b, s)].to_list())
for s, feature_code in zip(admin_strings_b, matches_b["feature code"]):
if feature_code in (pcls + ["ADM1", "ADM2", "ADM3", "ADM4"]):
matches.extend(index_a[np.char.startswith(admin_strings_a, s)].to_list())
return locations.loc[matches]
Let's see how we fare in our worst-case scenario (using timeit now, as this one runs fast!), as well as the other example I gave earlier:
%timeit search2_opt("san antonio", "mexico")
%timeit search2_opt("hamburg", "germany")
13.5 ms ± 2.02 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.5 ms ± 307 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
From seconds to a few milliseconds - this represents a speedup of over 600x in the worst case, and several hundred times in the normal case. Neither method is very pretty in code, but both are short enough to grasp quickly, and they do their job just fine.
🔗 Example output
I didn't show many examples yet, so there may still be doubts as to how well this works in practice - fair! Here are some examples where I pick random places off the globe and show how it arrives at the correct location. Note that population is often only filled in for places which have at least thousands of inhabitants, so random/obscure searches will seldom return useful values there.
search2_opt("albuquerque", "brazil")
| geonameid | name | latitude | longitude | feature code | country code | country | admin1 | admin2 | admin3 | admin4 | population | admin_string |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3408075 | Albuquerque Né | -8.00250 | -37.41722 | PPLL | BR | Federative Republic of Brazil | Pernambuco | Sertânia | | | 0 | BR302614105 |
| 3408077 | Albuquerque | -6.88722 | -58.34083 | PPL | BR | Federative Republic of Brazil | Pará | Jacareacanga | | | 0 | BR161503754 |
| 3408078 | Albuquerque | -4.58333 | -37.76667 | PPL | BR | Federative Republic of Brazil | Ceará | Aracati | | | 0 | BR062301109 |
| 3464318 | Embaúba | -20.98250 | -48.83556 | PPL | BR | Federative Republic of Brazil | São Paulo | Embaúba | | | 0 | BR273514957 |
| 3472737 | Bairro do Albuquerque | -22.38580 | -42.90893 | PPLL | BR | Federative Republic of Brazil | Rio de Janeiro | Teresópolis | | | 0 | BR213305802 |
| 3472738 | Albuquerque | -19.40611 | -57.41000 | PPLL | BR | Federative Republic of Brazil | Mato Grosso do Sul | Corumbá | | | 0 | BR1153207 |
| 7566388 | Albuquerque | -22.38060 | -42.91934 | PPLL | BR | Federative Republic of Brazil | Rio de Janeiro | Teresópolis | | | 0 | BR213305802 |
search2_opt("albuquerque", "rio de janeiro")
| geonameid | name | latitude | longitude | feature code | country code | country | admin1 | admin2 | admin3 | admin4 | population | admin_string |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3472737 | Bairro do Albuquerque | -22.3858 | -42.90893 | PPLL | BR | Federative Republic of Brazil | Rio de Janeiro | Teresópolis | | | 0 | BR213305802 |
| 7566388 | Albuquerque | -22.3806 | -42.91934 | PPLL | BR | Federative Republic of Brazil | Rio de Janeiro | Teresópolis | | | 0 | BR213305802 |
search2_opt("rio de janeiro", "chiapas")
| geonameid | name | latitude | longitude | feature code | country code | country | admin1 | admin2 | admin3 | admin4 | population | admin_string |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8871984 | Río de Janeiro | 16.11278 | -91.54250 | PPL | MX | Mexico | Estado de Chiapas | La Trinitaria | | | 201 | MX05099 |
| 8907915 | Río de Janeiro | 17.49389 | -93.10333 | PPL | MX | Mexico | Estado de Chiapas | Pichucalco | | | 17 | MX05068 |
🔗 Conclusion
Using the freely available GeoNames Gazetteer data, we're able to write a relatively competent, yet simple, geocoder in Python that can look up thousands of queries per second (the times above use only a single core [i7-7700K]) on a single machine.
This may not even be a competitive offline geocoder (I link some good ones just below), but it is simple, adaptable, and a fun showcase of how easily we can get something usable with some Python. Note though that, while Python is great for preparing the data, the actual query implementation would probably be a lot nicer (and faster) in other languages, where we could simply implement the naive version efficiently.
By the way, the original goal was to automatically enrich Twitter data with geographical coordinates, but what I've shown returns potentially many matches per query. It's not too hard to make a selection - adding it here would just make this article too long! Some ideas (a minimal sketch of the first follows the list):
- Sort by population, take first
- Prefer certain feature codes (e.g. prefer seats of administrative divisions over 'normal' places)
- Add more features encoding some kind of ranking to use in sorting
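Here's a minimal sketch of the first idea; geocode_one is a hypothetical helper name, not something defined earlier:

from typing import Optional

def geocode_one(a: str, b: Optional[str] = None) -> Optional[pd.Series]:
    """Return the single most populous match for a query, or None if nothing matched."""
    matches = search2_opt(a, b) if b else search(a)
    if matches.empty:
        return None
    return matches.sort_values("population", ascending=False).iloc[0]

# e.g. geocode_one("hamburg", "germany") picks the most populous of the matches we saw above.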
If you're interested in the plot I needed the geocoder for, have a look at https://timlod.github.io/.
🔗 Street-level data
This approach relies completely on the GeoNames data, and can't return what the dataset doesn't provide - the biggest drawback being addresses. Most geocoders allow you to look up streets; this one can't. However, the approach could possibly be adapted to use e.g. OpenStreetMap data (which is significantly larger and probably more complex to work with).
Once we go there though, we may want to look into pre-built solutions that are already competent, like Nominatim, photon or pelias. Note that those are usually a bit involved to set up and require you to download many GB of index data.