Dewey Dunnington has an outstanding blog post for those who are interested in the technical side of the cloud-native data paradigm. The post looks into how pruning[1] works with GeoParquet 1.0 and 1.1. Pruning is what makes cloud-native data especially fast to view and query – over the internet (using so-called HTTP range requests), but also when used locally.
Dewey’s blog post compares different approaches to efficiently querying GeoParquet files, using:

- SedonaDB[2]
- Sedona Spark[3]
- DuckDB[4]
- GeoPandas[5]
- GDAL[6]
The post also provides the Python code – with interspersed SQL, of course – for each approach.
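To give a rough idea of what pruning looks like in practice, here is a minimal sketch (not taken from Dewey's post) that queries a remote GeoParquet 1.1 file with DuckDB from Python. The URL, the query window, and the `bbox` covering column with `xmin`/`ymin`/`xmax`/`ymax` fields are assumptions; the point is that a predicate on the bbox column can be answered from Parquet row-group statistics, so row groups outside the query window are pruned and never downloaded.

```python
import duckdb

con = duckdb.connect()

# httpfs lets DuckDB read remote files via HTTP range requests;
# spatial adds the GEOMETRY type in case the geometry column is selected too.
for ext in ("httpfs", "spatial"):
    con.execute(f"INSTALL {ext}")
    con.execute(f"LOAD {ext}")

# Hypothetical GeoParquet 1.1 file with a 'bbox' covering column.
url = "https://example.com/data/buildings.parquet"

# Hypothetical query window (xmin, ymin, xmax, ymax).
qxmin, qymin, qxmax, qymax = -73.99, 40.75, -73.96, 40.78

# Bounding-box intersection test against the bbox struct column:
# only row groups whose min/max statistics overlap the window are fetched.
sql = f"""
    SELECT count(*)
    FROM read_parquet('{url}')
    WHERE bbox.xmin <= {qxmax} AND bbox.xmax >= {qxmin}
      AND bbox.ymin <= {qymax} AND bbox.ymax >= {qymin}
"""
print(con.execute(sql).fetchone()[0])
```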
Support for the two GeoParquet versions, as well as the ease of use, varies slightly between what Dewey calls the “don’t worry, we take care of the details” solutions and the solutions “that need a little help”.
Highly recommended for anybody interested in cloud-native data[7] – there are very interesting details in this post!
Footnotes
[1] In essence, when querying data, pruning refers to skipping – that is, not transferring or reading – data that does not affect a query’s result. Cloud-native data excels at enabling pruning.
[2] SedonaDB is an open-source single-node analytical database engine with geospatial as a first-class citizen.
[3] Sedona Spark is a distributed batch analytics and processing engine on Apache Spark clusters.
[4] DuckDB is an open-source column-oriented Relational Database Management System (RDBMS) designed for analytics use-cases.
[5] GeoPandas is an extension to the popular data science library Pandas that enables support for geospatial data.
[6] GDAL, short for “Geospatial Data Abstraction Library”, is a translator library for raster and vector geospatial data formats that underpins much of the modern geospatial data stack.
[7] Let me quote Dewey’s blog post to emphasize that many people should be interested: “Basically, GeoParquet was designed to allow traditional GIS software to take advantage of a decade plus of heavy investment in the Parquet format and the software that reads and writes it. Two spot examples are that (1) software that reads traditional GIS formats is not particularly good at splitting the work up so that all the cores on your computer are put to good use and (2) most formats are not particularly good at being plonked onto a web server and have sections of them selectively queried (which can often save data producers from standing up a 24/7 web service).”