Blue Skies Above Us

Just here to have fun.


6-Minute Read


The goal here is to filter OpenStreetMap data extracts to only the features you need to use. The result is a smaller, more manageable dataset.


For my map wall art I’m making maps of many of the major cities in the United States (and will do other areas once I’m done with that). To be efficient, I’m using continent-wide datasets from OpenStreetMap (so I can just zoom to an area and export PDFs). My maps are fairly minimal, and the datasets contain a lot of data that I don’t need. This post is a simple tutorial about how to do an SQL query of the data as you convert it to a database table in order to reduce the size of the output. This isn’t a secret or anything, but I only found a couple examples of this process, so I thought I’d add my own.

Geofabrik’s OSM Data Extracts

Geofabrik’s OSM data extracts are large files. The North America extract is 9.4 GB. The extract is an osm.pbf file. My original understanding of the pbf format was that it is basically a compressed version of the OSM data, and it is recommended to export the data to a database, such as Spatialite or Postgresql. Then, the data will be split into appropriate data types, such as line, point, and multipolygon, and then can be symbolized appropriately.

If you convert this extract to Spatialite, it produces a 50 GB file. I’m not a database expert, but this is far from lite, and the data takes a really long time to initially load in QGIS. After that it’s actually functional and speedy. But still, it feels wrong, I don’t want to have 50 GB files on my computer. And I plan to use European data, which is a 22 GB PBF file, and would grow into an enormous database table.

So, Postgresql is a better solution. I choose to use GDAL (ogr2ogr) for this. osm2pgsql is another tool, which I tried and didn’t like, though I flushed it from my memory and don’t recall why.

As I initially prepared to convert the PBF to postgresql, I thought about how the extract contains a lot of data that I don’t need. For example, my maps don’t have any buildings, and they don’t include any points. They are basically just roads and water and land boundaries. And sure, I could query the data after it’s been converted and then either delete the unused stuff or copy the desired stuff into a new table. But after reading these two resources - How to Convert osm.pbf files to esri shapefiles and OGR-SQL - I realized that I can include an SQL select command in the conversion command (note: those two resources have LOTS of examples, so check them out)!

The Conversion

ogr2ogr -sql "select * 
  from multipolygons 
  where natural 
  OR leisure IN('marina') 
  OR hstore_get_value(other_tags, 'waterway') = 'riverbank' 
  OR hstore_get_value(other_tags, 'waterway') = 'riverbed'" 
  -f PostgreSQL PG:"dbname='osm_na' 
  -nln land_water

I’ll unpack the codeblock. First I want to note that the OSM data is unique (in my experience) in that it contains multipolygons as well as “closed ways.” Closed ways are basically, as far as I can tell, lines that become multipolygons by virtue of being closed. There’s probably more to it than that. But I found that with ogr2ogr you can either convert the closed ways into a multipolygon or a line. This tripped me up at first, because I assumed the feature types in the PBF file were already defined. Plus, it seems inconsistent and unnecessary for the data to be structured into these two very similar feature types.

The code selects the various water polygons and closed way waterway and converts them into multipolygons in a postgresql database. As you can see, there are lots of different types of waterways, and it took me a fair amount of trial and error to find them all (basically, zooming to cities, and then realizing that some random chunk of a river is tagged as a riverbank instead of a waterway).

Starting at the end of this chunk, this is taking data from the 9.4 GB north-america-latest.osm.pbf file, and the final output is a postgresql table called land_water within the database osm_na.

Now to the beginning: we immediately run an sql query, selecting our data. What’s interesting is that the select * from multipolygons, while part of the query, actually informs the conversion, telling it which feature type the output should be. For example, if your output type is going to be a line, you’d do select * from lines. The “closed ways” can be converted either to line or multipolygon! Oh, and do you know which of these tags are closed ways rather than multipolygons? Cuz I don’t.

This is the strangest part: when data in PBF files have “too many” fields, ogr2ogr concatenates the values of a bunch of those fields into an “hstore” (what?) formatted single field called other_tags. I found this to be incredibly confusing. “Hey, why can’t I query a field/value that had identified on the OSM webmap?” It’s because that field is now tucked away in some other field with a bunch of junk, and the querying syntax is neither remotely intuitive nor well-documented.

hstore_get_value(other_tags, 'waterway') = 'riverbank'

Additionally, ogr2ogr has a poorly-documented config file called osmconfig.ini (example). You can use it to specify which fields you don’t want relegated to other_tags. And I’ll be the first person on the internet to tell you that this file is probably located at /usr/share/gdal/ on linux. According to that last link, you can make a copy in another folder, and tell gdal to use the copy. But I tried that and it didn’t work! And because any update to gdal would probably overwrite the original config file, I wasn’t able to keep my favorite fields out of other_tags. So instead I used that nutso query you see above.

So, that’s about it! I ran similar queries/conversions for roads and water lines. Additionally, I got some polygons from this OSM source, because the coastlines in the PBF file contain big administrative buffers that I had trouble removing. And that source has them clipped to the actual natural boundaries.

I know there’s nothing new in this post. But sometimes it helps to have a few posts out there explaining these things. Again, see the links in this post for other examples.

What is it?


Nothing yet.

Recent Posts



Founded 2005. Over the years I've posted writing, comics, ringtones, and stuff about maps, bikes, programming, pinball. And I had a robust music blog mostly about '90s hardcore punk (category = music).