In this post I’ll describe the process of bringing together the region shapes from the Natural Earth dataset with the regions provided in the GeoLiteCity database. In the GeoLiteCity db, the regions are referenced by a two-letter ID (FIPS 10-4 for some countries, ISO 3166-2 for others). Initially I thought that those IDs would be the same as the ones used in the Geonames admin-level 1 region db, which brought me to the first idea of mapping the regions via name similarity.
Trying to match via region names..
So, the basic idea was to use the region meta-data in the Geonames db to map the Natural Earth regions to the GeoLiteCity db. Both Geonames and Natural Earth store several variants of each region name, which could be fed into a string similarity measure such as Levenshtein distance to score candidate pairs. Before I actually started matching the regions, I grabbed some statistics on the distribution of countries and regions in both datasets. Here’s a summary of the results:
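To make the name-matching idea a bit more concrete, here is a minimal sketch in Python. It is not the code used for this post: the function names, the normalization to a score between 0 and 1, and the region IDs in the usage example are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means identical after lowercasing."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_match(names, candidates):
    """Pick the candidate whose best name variant is most similar.

    names: name variants of one region (non-empty list)
    candidates: dict of candidate id -> non-empty list of name variants
    Returns (candidate_id, score).
    """
    best_id, best_score = None, -1.0
    for cand_id, cand_names in candidates.items():
        score = max(similarity(n, c) for n in names for c in cand_names)
        if score > best_score:
            best_id, best_score = cand_id, score
    return best_id, best_score


# Hypothetical IDs and name variants, for illustration only.
candidates = {"X1": ["Bavaria", "Bayern"], "X2": ["Berlin"]}
match_id, score = best_match(["Bayern"], candidates)
```

Comparing every name variant on one side with every variant on the other means a region still matches if only one of its spellings agrees.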
In all datasets combined, a total of 237 countries are included. Of those countries, only 90 have the same number of region polygons and region meta-data entries (shown in green). The red slices represent countries that have both polygons and meta-data available, but with a surplus either in region meta-data (46) or in region polygons (35). Finally, the blue slices represent countries that have no region polygons at all (60) or that have polygons but no meta-data (6).
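The five categories above come from comparing two counts per country. A small sketch of that comparison, assuming the counts are already available as dictionaries keyed by country code (the function name, category labels and toy numbers are made up for this example, not taken from the real datasets):

```python
from collections import Counter


def classify_countries(polygon_counts, metadata_counts):
    """Assign each country to one of the five categories described above."""
    categories = {}
    for country in set(polygon_counts) | set(metadata_counts):
        polys = polygon_counts.get(country, 0)
        metas = metadata_counts.get(country, 0)
        if polys == 0:
            categories[country] = "no polygons"
        elif metas == 0:
            categories[country] = "no meta-data"
        elif polys == metas:
            categories[country] = "equal"
        elif metas > polys:
            categories[country] = "more meta-data"
        else:
            categories[country] = "more polygons"
    return categories


# Toy counts with placeholder country codes.
polygon_counts = {"AA": 3, "BB": 2, "CC": 5, "DD": 4}
metadata_counts = {"AA": 3, "BB": 4, "CC": 2, "EE": 6}
categories = classify_countries(polygon_counts, metadata_counts)
summary = Counter(categories.values())
```

Note that equal counts only say that the numbers agree, not that the regions themselves correspond to each other.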
To learn a little more about those countries, I displayed the results on a world map:
Nevertheless, I went ahead with the matching and created this more detailed map of the coherence between Geonames and Natural Earth regions. Green means a perfect match (no regions left over on either side), while dark red means that almost no region could be mapped properly. The map reveals the difference between France and the UK: in France, only some island regions are missing, while in the UK there’s a complete mismatch between regions.
..and what I learned from this
The good side of trying out the region name matching was that it made me aware of the UK case, which fundamentally changed my initial assumptions. To begin with, here’s the map of the UK regions in the Natural Earth shapefile:
The first thing I learned is that the Natural Earth regions are very detailed in some cases. In fact, they’re too detailed to be useful for our purpose. The regions simply get too small to be clicked. The second thing I learned is that Geonames and Natural Earth have different definitions of administrative level 1 regions. While Geonames sees England, Scotland, Wales and Northern Ireland as first-level regions, Natural Earth goes down to the very detailed level shown above. Then I checked whether the GeoLiteCity database also stores just the four UK “regions”, which would be very sad because they’re too general. And surprise, GeoLiteCity uses yet another set of regions for the UK.
The implications are that we need a completely different way of matching and that we need to merge regions in some countries to a reasonable degree.
Merging too detailed regions
To find out which countries’ regions need to be merged, I looked at the countries that have the highest number of regions. These include the United Kingdom, Slovenia, the Philippines, Macedonia and Uganda.
To merge the regions I used meta-data I (fortunately) found in the Natural Earth shapefile. For instance, the UK provinces (e.g. Derby) stored the name of the region they belong to (e.g. East Midlands). The actual merging was done by my good friend the Python Polygon package.
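The grouping step that precedes the actual polygon union can be sketched as follows. This is only an illustration: the input is assumed to be a list of (province, parent region) pairs read from the shapefile attributes, the function name is made up, and the union of the grouped polygons is left to whatever polygon library is used.

```python
from collections import defaultdict


def group_by_parent(provinces):
    """Group province names by the parent region stored in their attributes.

    provinces: iterable of (province_name, parent_region_name) pairs.
    Provinces without a parent name are returned separately, since they
    need to be assigned by hand.
    """
    groups = defaultdict(list)
    unassigned = []
    for province, parent in provinces:
        if parent:
            groups[parent].append(province)
        else:
            unassigned.append(province)
    return dict(groups), unassigned


# Illustrative records; the last one has no parent region name.
records = [
    ("Derby", "East Midlands"),
    ("Nottingham", "East Midlands"),
    ("Bristol", "South West"),
    ("Some Province", ""),
]
groups, unassigned = group_by_parent(records)
```

Each resulting group then corresponds to one merged polygon, and the unassigned list shows at a glance how much manual work is left.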
However, for Macedonia, the region names turned out to be incomplete, so I had to match the regions by hand using an overlay image I took from Wikipedia. Here’s a screencast that shows me doing this:
In the end, I got a CSV file that stores all regions that need to be joined along with the new region ID. The data is used both for rendering the maps and in the next step of matching the GeoIP regions. For instance, here’s what the merged regions look like for the United Kingdom:
And, finally, matching the regions
Since the definitions of what’s an admin-level 1 region seem to vary across Natural Earth and GeoLiteCity, I needed to use a more low-level approach to match the regions. Fortunately, the GeoLiteCity db stores plenty of locations for each region, which I used for the matching.
For every source region (GeoLiteCity) I randomly selected five locations. Then I did point-in-polygon tests for each location with each of the target regions (Natural Earth). The target region that contains most of the locations is probably the one that matches the source region.
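Here is a rough sketch of this sampling-and-voting scheme. It is a simplification rather than the code used for the post: every target region is assumed to be a single simple polygon given as a list of (x, y) vertices (no holes, no multi-part regions), the point-in-polygon test is a plain ray-casting implementation, and all names are made up for the example.

```python
import random
from collections import Counter


def point_in_polygon(x, y, polygon):
    """Ray casting (even-odd rule); polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through y?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def match_region(locations, target_polygons, sample_size=5, rng=random):
    """Vote for the target region containing most of the sampled locations.

    locations: list of (x, y) points belonging to one source region
    target_polygons: dict of target region id -> list of (x, y) vertices
    Returns (best_region_id or None, Counter of hits per target region).
    """
    sample = rng.sample(locations, min(sample_size, len(locations)))
    votes = Counter()
    for x, y in sample:
        for region_id, polygon in target_polygons.items():
            if point_in_polygon(x, y, polygon):
                votes[region_id] += 1
    if not votes:
        return None, votes
    return votes.most_common(1)[0][0], votes


# Two adjacent squares as toy target regions.
targets = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
# Four locations fall into A, one stray location falls into B.
locations = [(0.5, 0.5), (1, 1), (1.5, 0.5), (0.5, 1.5), (3, 1)]
best, votes = match_region(locations, targets)
```

Keeping the full vote counter around, instead of only the winner, makes it easy to inspect source regions that hit several target regions.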
I checked the accuracy by counting the number of matched regions:
- regions that match no target region: 73
- regions that match to exactly one region: 2434
- regions that match to two regions: 437
- regions that match to three or more regions: ~140
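A tally like the one above can be derived from the per-region hit counts. A small sketch, assuming the hits are stored as one dictionary per source region (the data layout, names and toy values are assumptions for this example):

```python
from collections import Counter


def summarize_matches(votes_by_source):
    """Count source regions by the number of target regions they hit.

    votes_by_source: dict of source region id -> dict of
    target region id -> number of sampled locations inside it.
    """
    summary = Counter()
    for votes in votes_by_source.values():
        matched = sum(1 for hits in votes.values() if hits > 0)
        label = "3+" if matched >= 3 else str(matched)
        summary[label] += 1
    return summary


# Toy data with placeholder IDs.
votes_by_source = {
    "S1": {},
    "S2": {"T1": 5},
    "S3": {"T1": 3, "T2": 2},
    "S4": {"T1": 2, "T2": 2, "T3": 1},
    "S5": {"T4": 4},
}
summary = summarize_matches(votes_by_source)
```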
The cause of the missing matches could be that some regions are simply missing in the map (e.g. small island polygons). I think those cases can be ignored. Another reason could be that a country has been split up recently (like Sudan) and thus some regions couldn’t be found.
The cause of the multiple matches is that some GeoLiteCity regions are more general (or larger) than the regions in the Natural Earth map. One solution would be to merge the affected map regions.
Investigating the multiple matches
So, again, let’s look at the details of what we’ve done. Here’s a close-up view of a case in Northern Germany where multiple map regions were matched to the same GeoLiteCity region. Obviously, in this case the cause is the imprecision of the Natural Earth shapefile.
Here’s another example, showing Northern Switzerland, where a region matched four map regions. We can see that there are errors in the GeoLiteCity locations, too.
Looking at those examples, I think the best idea is to simply take the region that matches the most locations. We’ll hopefully get some more feedback from local Piwik users.
As suspected, the locations for South Sudan are still assigned to Sudan in the GeoLiteCity database, which is going to be fixed.
In the end, I’m quite happy with this solution. Next time, I will start to write the JS code that “renders” the SVG maps using RaphaelJS and that also allows displaying some data (either region-based or location-based) in the map.
I further investigated the “missing” regions that can’t be matched to any of the map polygons. For instance, here’s a close-up view of Southern England, where some regions couldn’t be matched. The reason is that the locations are a little bit outside the polygon.
The core of the problem is that the data is not accurate enough: either the region outline or the location coordinates are wrong. An easy fix is to build this inaccuracy into the matching algorithm. Since we don’t know the exact coordinates of a given GeoIP location, the same location could as well be moved (slightly) in any direction. The algorithm now also checks the neighborhood for each point, which reduced the number of missing regions to 14.
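One way to build such a tolerance into the test is to also probe a few points on a small circle around each location. The sketch below assumes exactly that; the post does not spell out how the neighborhood is checked, so the radius, the eight probe directions and all names are assumptions, and polygons are again simple lists of (x, y) vertices.

```python
import math


def point_in_polygon(x, y, polygon):
    """Ray casting (even-odd rule); polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def neighborhood(x, y, radius, steps=8):
    """Yield the point itself, then `steps` points on a circle around it."""
    yield (x, y)
    for k in range(steps):
        angle = 2 * math.pi * k / steps
        yield (x + radius * math.cos(angle), y + radius * math.sin(angle))


def in_polygon_with_tolerance(x, y, polygon, radius):
    """True if the point or any of its probe points lies inside the polygon."""
    return any(point_in_polygon(px, py, polygon)
               for px, py in neighborhood(x, y, radius))


square = [(0, 0), (2, 0), (2, 2), (0, 2)]
# Slightly outside the right edge: the strict test fails, the tolerant one passes.
strict = point_in_polygon(2.05, 1, square)
tolerant = in_polygon_with_tolerance(2.05, 1, square, radius=0.1)
```

The radius trades missing matches against false ones: too small and nearby locations are still lost, too large and locations start hitting neighboring regions as well.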
The remaining missing regions are due to missing polygons for small island parts of countries like the United States or Finland: