Querying the Most Granular Demographics Dataset

Click to learn more about author Florian Grüning.

Many use cases require detailed population data. For example, a detailed breakdown of the demographic structure is a significant factor in predicting real estate prices. Humanitarian projects such as vaccination campaigns or rural electrification plans also depend heavily on good population data.

Finding high-quality, up-to-date data on a global scale for these use cases is very challenging. Most countries conduct a census only about once every ten years, so those datasets become outdated quickly. Arguably the best datasets available for population densities and demographics are published by Facebook under its Data for Good initiative. Facebook combines official census data with its internal data and uses machine learning algorithms for image recognition to determine the location and type of buildings.

Combining these sources yields a detailed statistical breakdown of demographic groups in 1-arcsecond blocks, a resolution of approximately 30 meters.
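
To make the "approximately 30 meters" figure concrete, here is a minimal sketch of the arithmetic. It assumes a spherical Earth with a mean radius of about 6,371 km, which is a simplification; the function name is made up for this illustration. Note that the east-west extent of a 1-arcsecond block shrinks with the cosine of the latitude, so the blocks are only roughly square near the equator.

```python
import math

# Mean Earth radius in meters (spherical approximation, an assumption of this sketch).
EARTH_RADIUS_M = 6_371_008.8


def arcsecond_cell_size_m(latitude_deg: float) -> tuple[float, float]:
    """Return the (north-south, east-west) extent in meters of a 1-arcsecond block."""
    one_arcsecond_rad = math.radians(1 / 3600)
    north_south = EARTH_RADIUS_M * one_arcsecond_rad
    # Meridians converge toward the poles, so the east-west extent scales with cos(latitude).
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west


for lat in (0.0, 35.9, 60.0):
    ns, ew = arcsecond_cell_size_m(lat)
    print(f"latitude {lat:5.1f}: {ns:.1f} m x {ew:.1f} m")
```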

Each square contains statistical values for the following demographic groups:

Total
Female
Male
Children under 5
Youth 15 – 24
Elderly 60 plus
Women of reproductive age 15 – 49

For each country, Facebook provides one file per demographic group, either as a GeoTIFF or as a CSV. Each CSV row contains the latitude and longitude of a cell and the respective population value.
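
As a rough sketch of what working with such a CSV looks like, the snippet below streams rows one at a time. The column names (latitude, longitude, population) and the sample values are assumptions made for this example; the headers in the actual files may differ.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Cell(NamedTuple):
    latitude: float
    longitude: float
    population: float


def read_cells(handle) -> Iterator[Cell]:
    """Yield one Cell per CSV row without loading the whole file into memory."""
    # Column names are hypothetical; adjust them to the headers of the real file.
    for row in csv.DictReader(handle):
        yield Cell(
            float(row["latitude"]),
            float(row["longitude"]),
            float(row["population"]),
        )


# Made-up sample rows standing in for a real country file.
SAMPLE_CSV = """latitude,longitude,population
35.8989,14.5146,12.4
35.8992,14.5146,9.1
35.9375,14.3754,3.0
"""

cells = list(read_cells(io.StringIO(SAMPLE_CSV)))
total = sum(cell.population for cell in cells)
print(f"{len(cells)} cells, total population {total:.1f}")
```

For a real file, pass an open file handle instead of the in-memory sample.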

Working with static CSV files alone can be cumbersome. That is why we created an open-source wrapper that exposes the data through an API. You can also download the data for entire countries directly via a CLI. We preprocess the data to make it easy to query, using Uber’s H3 spatial indexing.

Thanks to the H3 indexing, it is easy to build queries on top of the database. Using either H3 cells or coordinate pairs, you can retrieve the population for a point, a given radius, or a polygon. That makes it straightforward to aggregate the population at the zip code level, for example.
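
To illustrate what a radius query computes, here is a minimal sketch. It does not use H3 or the wrapper's API: a brute-force haversine filter over a small in-memory list stands in for the indexed lookup, and the function names and sample values are invented for this example.

```python
import math

# Mean Earth radius in meters (spherical approximation).
EARTH_RADIUS_M = 6_371_008.8


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def population_within(cells, center_lat: float, center_lon: float, radius_m: float) -> float:
    """Sum the population of all cells whose center lies within radius_m of the given point."""
    return sum(
        pop
        for lat, lon, pop in cells
        if haversine_m(center_lat, center_lon, lat, lon) <= radius_m
    )


# Made-up (latitude, longitude, population) tuples.
cells = [
    (35.8989, 14.5146, 12.4),
    (35.8992, 14.5146, 9.1),
    (35.9375, 14.3754, 3.0),
]

nearby = population_within(cells, 35.8990, 14.5146, radius_m=100)
print(f"population within 100 m: {nearby:.1f}")
```

A spatial index avoids scanning every row like this, which is what makes the same query practical on millions of cells.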

We aggregate the squares into H3 cells at resolution 11 and store them in MongoDB, together with the aggregated values for each demographic group. Because we use JS streams and MongoDB’s aggregation pipelines, memory usage stays low, and you can process millions of rows on your local machine.
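
The idea behind this streaming aggregation can be sketched in a few lines. This is not the project's implementation: it is written in Python rather than JavaScript, it keeps the totals in a dictionary instead of MongoDB, and cell_key is a hypothetical stand-in that snaps coordinates to a square grid where the real pipeline would compute an H3 index.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# (latitude, longitude, demographic group, value)
Row = Tuple[float, float, str, float]


def cell_key(lat: float, lon: float, size_deg: float = 0.001) -> Tuple[int, int]:
    """Hypothetical stand-in for an H3 index: snap a coordinate to a square grid cell."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def aggregate(rows: Iterable[Row]) -> Dict[Tuple[int, int], Dict[str, float]]:
    """Consume rows one at a time and sum the values per cell and demographic group."""
    totals: Dict[Tuple[int, int], Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for lat, lon, group, value in rows:
        totals[cell_key(lat, lon)][group] += value
    return totals


def sample_rows() -> Iterator[Row]:
    """Made-up rows; a real run would stream them from the per-group CSV files."""
    yield (35.8982, 14.5146, "total", 12.0)
    yield (35.8987, 14.5146, "total", 9.0)
    yield (35.8982, 14.5146, "female", 6.0)
    yield (35.9375, 14.3754, "total", 3.0)


totals = aggregate(sample_rows())
for key, groups in totals.items():
    print(key, dict(groups))
```

Because the rows arrive through a generator, memory use grows with the number of distinct cells, not with the number of input rows.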

For quick data exploration and visualization, you can directly create datasets compatible with Kepler.gl or Unfolded.ai to make beautiful maps. We published an example map for Malta, where the densely populated regions and the urban core are immediately visible.
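
As one possible shape for such a dataset, the sketch below converts a few cells into a GeoJSON FeatureCollection, a format that Kepler.gl can load. The export format the wrapper actually produces is not specified here, so treat the function name, the property name, and the sample values as assumptions.

```python
import json


def to_geojson(cells) -> dict:
    """Convert (latitude, longitude, population) tuples into a GeoJSON FeatureCollection."""
    features = [
        {
            "type": "Feature",
            # GeoJSON orders coordinates as [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"population": pop},
        }
        for lat, lon, pop in cells
    ]
    return {"type": "FeatureCollection", "features": features}


# Made-up sample cells.
cells = [
    (35.8989, 14.5146, 12.4),
    (35.9375, 14.3754, 3.0),
]

geojson = to_geojson(cells)
print(json.dumps(geojson, indent=2))
```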

With Facebook’s population data directly queryable, creating predictive models or visualizations is much faster, so data teams can spend their time on value-adding tasks. That is also the main reason why we are building an open-source community for third-party data integration. If you want to get your hands on more connectors like this one, star us on GitHub and join our Slack community.
