Exploratory geospatial data analysis with GeoPandas and DuckDB

In this article, I will show you how to use two popular Python libraries to perform some geospatial analysis of road accident data in the UK.
I already knew that DuckDB was a fast analytical database, but I had only recently discovered that, via an extension, it also provides a large number of useful geospatial functions.
GeoPandas was new to me. It is a Python library that makes working with spatial data feel like working with standard pandas, but with geometry types (points, polygons, and so on). It lets you join geometries together and quickly visualise layers with matplotlib.
Keen to try out both tools, I looked for a practical mini-project that would be interesting and a good learning aid.
Long story short, I decided to use the libraries to determine the safest city in the UK to drive or walk around. You could probably do everything I'm going to show using geopandas alone, but I wanted an excuse to try DuckDB's spatial capabilities too, so I used both.
A few ground rules
The accident and casualty data I will be using comes from an official UK national source. However, I will be focusing on only six UK cities: London, Edinburgh, Cardiff, Glasgow, Birmingham, and Manchester.
My approach to finding the “safest” city is to calculate the total number of road-related casualties in each city over a five-year period, then divide this number by the area of each city in km². This will be my “safety index”: the lower that number, the safer the city.
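The index itself is trivial to compute. Here is a minimal sketch (the function name is mine, not from any library; the example figures are from the results table later in the article):

```python
def safety_index(total_casualties: int, area_km2: float) -> float:
    """Casualties per square kilometre over the study period.
    By our definition, a lower value means a safer city."""
    return total_casualties / area_km2

# e.g. a city with 1,000 casualties spread over 100 km2 scores 10.0
print(safety_index(1000, 100.0))
```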
Finding our city boundary data
This was the hardest part of the whole process.
Rather than treating “the city” as a single administrative polygon, which means different things in different places, I represented each city as the footprint of its Built-up Area (BUA). I did this using the Office for National Statistics (ONS) Built-up Areas map layer, including all BUA parts that fall within a logical administrative mask for that city. Each mask comes from official boundary data and is chosen to capture the broad urban area:
- London → the London Region (Regions, Dec 2023) boundary.
- Manchester → union of the 10 Greater Manchester Local Authority Districts (LADs, May 2024).
- Birmingham → union of the 7 West Midlands Combined Authority LADs.
- Glasgow → union of the 8 Glasgow City Region councils.
- Edinburgh and Cardiff → their single LADs.
How the polygons are created in code
I downloaded the boundary data directly from the UK Office for National Statistics (ONS) ArcGIS FeatureServer in GeoJSON format using the requests library. For each city, we start by building a mask from official administrative units: the London Region (Dec 2023) for London; a union of Local Authority Districts (LADs, May 2024) for Greater Manchester (10 LADs), the West Midlands core (7 LADs), and the Glasgow City Region (8 councils); and a single LAD each for Edinburgh and Cardiff.
Next, I query the Built-up Areas (BUA 2022) polygon layer using each mask's bounding box to find candidates, keep only those that actually intersect the mask, then dissolve (combine) the results into one multipart polygon per city (“BUA 2022 aggregate”). The data is fetched and handled internally in EPSG:4326 (WGS84), i.e. latitude and longitude expressed in degrees. When reporting areas, we reproject to EPSG:27700 (OSGB) and calculate the area in square kilometres to avoid distortion.
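To see why areas should be computed in a projected CRS rather than in raw degrees, here is a back-of-the-envelope illustration in pure Python (the constants are standard approximate spherical-Earth figures, not from the article's data):

```python
import math

def approx_cell_area_km2(lat_deg: float) -> float:
    """Approximate ground area of a 1-degree x 1-degree cell at a given latitude.
    One degree of latitude spans roughly 110.57 km everywhere, but one degree
    of longitude shrinks with cos(latitude) from about 111.32 km at the equator."""
    return 111.32 * math.cos(math.radians(lat_deg)) * 110.57

# At the UK's latitude a "square degree" covers far less ground than at the
# equator, so treating lon/lat degrees as uniform units distorts areas badly.
print(approx_cell_area_km2(0.0))   # equator
print(approx_cell_area_km2(51.5))  # roughly London's latitude
```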
Each city boundary is saved to a GeoJSON file and loaded into Python using the geopandas library.
To check that our boundary data is correct, the individual city layers are combined into a single GeoDataFrame, cast to a common coordinate reference system (EPSG:4326), and plotted on top of a UK outline taken from Natural Earth data (via a GitHub mirror). To focus on the mainland only, we crop the UK outline to a bounding box around the cities, excluding the most remote overseas territories. The area of each city is also calculated.
Boundary data licensing
All the boundary data I used is open data with favourable terms for reuse.
London, Edinburgh, Cardiff, Glasgow, Birmingham & Manchester
- Source: Office for National Statistics (ONS) – Local Authority Districts (May 2024) Boundaries UK BGC and Regions (December 2023) Boundaries EN BGC.
- License: Open Government License v3.0 (OGL).
- Conditions: You are free to use, adapt, and share the data (including commercially) as long as you provide attribution.
UK outline
- Source: Natural Earth – Admin 0 – Countries.
- License: Public domain (no restrictions).
- Citation: Natural Earth, Admin 0 – Countries. Public domain. Available at:
City boundary code
Here is the Python code you can use to download the data files for each city and verify the boundaries on a map.
Before running the main code, please make sure you have installed the following libraries. You can use pip or any other method you prefer.
geopandas
matplotlib
requests
pandas
duckdb
jupyter
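Assuming a standard Python environment, the whole list can be installed in one go:

```shell
pip install geopandas matplotlib requests pandas duckdb jupyter
```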
import requests, json
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
# ---------- ONS endpoints ----------
# Fill in the ONS ArcGIS FeatureServer query URLs for these layers:
LAD24_FS = ""     # Local Authority Districts (May 2024)
REGION23_FS = ""  # Regions (December 2023)
BUA_FS = ""       # Built-up Areas (BUA 2022)
# ---------- Helpers ----------
def arcgis_geojson(url, params):
    r = requests.get(url, params={**params, "f": "geojson"}, timeout=90)
    r.raise_for_status()
    return r.json()

def sql_quote(s: str) -> str:
    return "'" + s.replace("'", "''") + "'"
def fetch_lads_by_names(names):
    where = "LAD24NM IN ({})".format(",".join(sql_quote(n) for n in names))
    data = arcgis_geojson(LAD24_FS, {
        "where": where,
        "outFields": "LAD24NM",
        "returnGeometry": "true",
        "outSR": "4326"
    })
    gdf = gpd.GeoDataFrame.from_features(data.get("features", []), crs="EPSG:4326")
    if gdf.empty:
        raise RuntimeError(f"No LADs found for: {names}")
    return gdf
def fetch_lad_by_name(name):
    return fetch_lads_by_names([name])
def fetch_region_by_name(name):
    data = arcgis_geojson(REGION23_FS, {
        "where": f"RGN23NM={sql_quote(name)}",
        "outFields": "RGN23NM",
        "returnGeometry": "true",
        "outSR": "4326"
    })
    gdf = gpd.GeoDataFrame.from_features(data.get("features", []), crs="EPSG:4326")
    if gdf.empty:
        raise RuntimeError(f"No Region feature for: {name}")
    return gdf
def fetch_buas_intersecting_bbox(minx, miny, maxx, maxy):
    data = arcgis_geojson(BUA_FS, {
        "where": "1=1",
        "geometryType": "esriGeometryEnvelope",
        "geometry": json.dumps({
            "xmin": float(minx), "ymin": float(miny),
            "xmax": float(maxx), "ymax": float(maxy),
            "spatialReference": {"wkid": 4326}
        }),
        "inSR": "4326",
        "spatialRel": "esriSpatialRelIntersects",
        "outFields": "BUA22NM,BUA22CD,Shape__Area",
        "returnGeometry": "true",
        "outSR": "4326"
    })
    return gpd.GeoDataFrame.from_features(data.get("features", []), crs="EPSG:4326")
def aggregate_bua_by_mask(mask_gdf: gpd.GeoDataFrame, label: str) -> gpd.GeoDataFrame:
    """
    Fetch BUA 2022 polygons intersecting a mask (LAD/Region union) and dissolve to one polygon.
    Uses Shapely 2.x union_all() to build the mask geometry.
    """
    # Union the mask polygon(s)
    mask_union = mask_gdf.geometry.union_all()
    # Fetch candidate BUAs via mask bbox, then filter by real intersection with the union
    minx, miny, maxx, maxy = gpd.GeoSeries([mask_union], crs="EPSG:4326").total_bounds
    buas = fetch_buas_intersecting_bbox(minx, miny, maxx, maxy)
    if buas.empty:
        raise RuntimeError(f"No BUAs intersecting bbox for {label}")
    buas = buas[buas.intersects(mask_union)]
    if buas.empty:
        raise RuntimeError(f"No BUAs actually intersect mask for {label}")
    dissolved = buas[["geometry"]].dissolve().reset_index(drop=True)
    dissolved["city"] = label
    return dissolved[["city", "geometry"]]
# ---------- Group definitions ----------
GM_10 = ["Manchester","Salford","Trafford","Stockport","Tameside",
"Oldham","Rochdale","Bury","Bolton","Wigan"]
WMCA_7 = ["Birmingham","Coventry","Dudley","Sandwell","Solihull","Walsall","Wolverhampton"]
GLASGOW_CR_8 = ["Glasgow City","East Dunbartonshire","West Dunbartonshire",
"East Renfrewshire","Renfrewshire","Inverclyde",
"North Lanarkshire","South Lanarkshire"]
EDINBURGH = "City of Edinburgh"
CARDIFF = "Cardiff"
# ---------- Build masks ----------
london_region = fetch_region_by_name("London") # Region mask for London
gm_lads = fetch_lads_by_names(GM_10) # Greater Manchester (10)
wmca_lads = fetch_lads_by_names(WMCA_7) # West Midlands (7)
gcr_lads = fetch_lads_by_names(GLASGOW_CR_8) # Glasgow City Region (8)
edi_lad = fetch_lad_by_name(EDINBURGH) # Single LAD
cdf_lad = fetch_lad_by_name(CARDIFF) # Single LAD
# ---------- Aggregate BUAs by each mask ----------
layers = []
london_bua = aggregate_bua_by_mask(london_region, "London (BUA 2022 aggregate)")
london_bua.to_file("london_bua_aggregate.geojson", driver="GeoJSON")
layers.append(london_bua)
man_bua = aggregate_bua_by_mask(gm_lads, "Manchester (BUA 2022 aggregate)")
man_bua.to_file("manchester_bua_aggregate.geojson", driver="GeoJSON")
layers.append(man_bua)
bham_bua = aggregate_bua_by_mask(wmca_lads, "Birmingham (BUA 2022 aggregate)")
bham_bua.to_file("birmingham_bua_aggregate.geojson", driver="GeoJSON")
layers.append(bham_bua)
glas_bua = aggregate_bua_by_mask(gcr_lads, "Glasgow (BUA 2022 aggregate)")
glas_bua.to_file("glasgow_bua_aggregate.geojson", driver="GeoJSON")
layers.append(glas_bua)
edi_bua = aggregate_bua_by_mask(edi_lad, "Edinburgh (BUA 2022 aggregate)")
edi_bua.to_file("edinburgh_bua_aggregate.geojson", driver="GeoJSON")
layers.append(edi_bua)
cdf_bua = aggregate_bua_by_mask(cdf_lad, "Cardiff (BUA 2022 aggregate)")
cdf_bua.to_file("cardiff_bua_aggregate.geojson", driver="GeoJSON")
layers.append(cdf_bua)
# ---------- Combine & report areas ----------
cities = gpd.GeoDataFrame(pd.concat(layers, ignore_index=True), crs="EPSG:4326")
# ---------- Plot UK outline + the six aggregates ----------
# UK outline (Natural Earth 1:10m, simple countries)
ne_url = ""  # Natural Earth 1:10m countries GeoJSON URL (e.g. via a GitHub mirror)
world = gpd.read_file(ne_url)
uk = world[world["ADMIN"] == "United Kingdom"].to_crs(4326)
# Crop frame to our cities
minx, miny, maxx, maxy = cities.total_bounds
uk_crop = uk.cx[minx-5 : maxx+5, miny-5 : maxy+5]
fig, ax = plt.subplots(figsize=(9, 10), dpi=150)
uk_crop.boundary.plot(ax=ax, color="black", linewidth=1.2)
cities.plot(ax=ax, column="city", alpha=0.45, edgecolor="black", linewidth=0.8, legend=True)
# Label each polygon using an interior point
label_pts = cities.representative_point()
for (x, y), name in zip(label_pts.geometry.apply(lambda p: (p.x, p.y)), cities["city"]):
    ax.text(x, y, name, fontsize=8, ha="center", va="center")
ax.set_title("BUA 2022 Aggregates: London, Manchester, Birmingham, Glasgow, Edinburgh, Cardiff", fontsize=12)
ax.set_xlim(minx-1, maxx+1)
ax.set_ylim(miny-1, maxy+1)
ax.set_aspect("equal", adjustable="box")
ax.set_xlabel("Longitude"); ax.set_ylabel("Latitude")
plt.tight_layout()
plt.show()
After running this code, you should have six GeoJSON files saved in your current directory, and you should also see a map like the one below, which tells us that our city files contain valid data.
Finding our risk data
Our final piece of the data puzzle is the accident data itself. The UK government publishes road collision data covering five-year periods; the latest release covers 2019 to 2024. This data set covers the whole of the UK, so we will need to filter it down to the six cities we are interested in.
To view or download the accident data in CSV format (source: Department for Transport – Road Safety Data), use the link below.
As with the city boundary data, this is published under the Open Government Licence v3.0 (OGL 3.0) and so carries the same licence terms.
The collision data set contains a large number of fields, but for our purposes, we only need three of them:
latitude
longitude
number_of_casualties
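Before loading anything into DuckDB, the column extraction can be sketched with the standard library alone. The tiny inline CSV below is a made-up stand-in for the real file, using the column names the DuckDB code later assumes (Latitude, Longitude, Number_of_Casualties):

```python
import csv
import io

# A made-up two-row stand-in for the real collisions CSV
raw = """Latitude,Longitude,Number_of_Casualties,Weather
51.5,-0.12,2,Rain
53.4,-2.24,1,Fine
"""

# Keep only the three fields we care about, cast to usable types
rows = [
    (float(r["Latitude"]), float(r["Longitude"]), int(r["Number_of_Casualties"]))
    for r in csv.DictReader(io.StringIO(raw))
]
print(rows)
```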
Getting our casualty count for each city is now a three-step process.
1 / Loading the collision data set into DuckDB
If you haven't come across DuckDB before, the TL;DR is that it's a fast, in-process (in-memory or persistent) database written in C++, designed for analytical SQL workloads.
One of the main reasons I like it is its speed; it is one of the fastest data analytics libraries I have used. It can also be extended through extensions, such as the spatial one, which we will use now.
Now we can load the collision data like this.
import requests
import duckdb
# Remote CSV (last 5 years collisions)
url = ""
local_file = "collisions_5yr.csv"
# Download the file
r = requests.get(url, stream=True)
r.raise_for_status()
with open(local_file, "wb") as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)
print(f"Downloaded {local_file}")
# Connect once
con = duckdb.connect(database=':memory:')
# Install + load spatial on THIS connection
con.execute("INSTALL spatial;")
con.execute("LOAD spatial;")
# Create accidents table with geometry
con.execute("""
CREATE TABLE accidents AS
SELECT
TRY_CAST(Latitude AS DOUBLE) AS latitude,
TRY_CAST(Longitude AS DOUBLE) AS longitude,
TRY_CAST(Number_of_Casualties AS INTEGER) AS number_of_casualties,
ST_Point(TRY_CAST(Longitude AS DOUBLE), TRY_CAST(Latitude AS DOUBLE)) AS geom
FROM read_csv_auto('collisions_5yr.csv', header=True, nullstr='NULL')
WHERE
TRY_CAST(Latitude AS DOUBLE) IS NOT NULL AND
TRY_CAST(Longitude AS DOUBLE) IS NOT NULL AND
TRY_CAST(Number_of_Casualties AS INTEGER) IS NOT NULL
""")
# Quick preview
print(con.execute("DESCRIBE accidents").df())
print(con.execute("SELECT * FROM accidents LIMIT 5").df())
You should see the following result.
Downloaded collisions_5yr.csv
column_name column_type null key default extra
0 latitude DOUBLE YES None None None
1 longitude DOUBLE YES None None None
2 number_of_casualties INTEGER YES None None None
3 geom GEOMETRY YES None None None
latitude longitude number_of_casualties
0 51.508057 -0.153842 3
1 51.436208 -0.127949 1
2 51.526795 -0.124193 1
3 51.546387 -0.191044 1
4 51.541121 -0.200064 2
geom
0 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
3 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
4 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...
2 / Loading the city boundary data using DuckDB's spatial functions
The function we will use to do this is called ST_Read, which can import various types of geospatial file via the GDAL library.
city_files = {
"London": "london_bua_aggregate.geojson",
"Edinburgh": "edinburgh_bua_aggregate.geojson",
"Cardiff": "cardiff_bua_aggregate.geojson",
"Glasgow": "glasgow_bua_aggregate.geojson",
"Manchester": "manchester_bua_aggregate.geojson",
"Birmingham": "birmingham_bua_aggregate.geojson"
}
for city, file in city_files.items():
    con.execute(f"""
        CREATE TABLE {city.lower()} AS
        SELECT
            '{city}' AS city,
            geom
        FROM ST_Read('{file}')
    """)
con.execute("""
CREATE TABLE cities AS
SELECT * FROM london
UNION ALL SELECT * FROM edinburgh
UNION ALL SELECT * FROM cardiff
UNION ALL SELECT * FROM glasgow
UNION ALL SELECT * FROM manchester
UNION ALL SELECT * FROM birmingham
""")
3 / Joining the accidents to the city polygons and counting the casualties
The main geospatial function we are using this time is called ST_Within. It returns true if the first geometry lies entirely within the second.
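For a point and a polygon, ST_Within boils down to the classic point-in-polygon test. A pure-Python ray-casting sketch (with made-up coordinates) shows the idea:

```python
def point_in_polygon(x, y, poly):
    """Ray casting: shoot a ray to the right from (x, y) and count how many
    polygon edges it crosses; an odd count means the point is inside.
    `poly` is a list of (x, y) vertex tuples in order."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x_cross > x:
                inside = not inside
    return inside

# A toy square "city" and two candidate accident points
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(point_in_polygon(2.0, 2.0, square))  # inside
print(point_in_polygon(5.0, 2.0, square))  # outside
```

Real spatial joins are far more involved (spatial indexes, boundary edge cases, multipart polygons), which is exactly why we let DuckDB do the work.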
import duckdb
casualties_per_city = con.execute("""
SELECT
c.city,
SUM(a.number_of_casualties) AS total_casualties,
COUNT(*) AS accident_count
FROM accidents a
JOIN cities c
ON ST_Within(a.geom, c.geom)
GROUP BY c.city
ORDER BY total_casualties DESC
""").df()
print("Casualties per city:")
print(casualties_per_city)
Note that I ran the above query on a powerful desktop, and it still took a few minutes to return results, so please be patient. In the end, you should see a result like this.
Casualties per city:
city total_casualties accident_count
0 London 134328.0 115697
1 Birmingham 14946.0 11119
2 Manchester 4518.0 3502
3 Glasgow 3978.0 3136
4 Edinburgh 3116.0 2600
5 Cardiff 1903.0 1523
Breaking down the results
Not surprisingly, London has by far the highest overall number of casualties. But given its size, is it more or less dangerous to drive or walk through than the other cities?
There does seem to be a problem with the casualty counts for Manchester and Glasgow: based on their sizes, both should arguably be higher. One suggestion is that this could be down to how collisions on the many busy motorway and urban-fringe roads (M8/M74; M61/M62/M67) are attributed, but I will leave that investigation as an exercise for the reader!
To determine our safety ranking, we need to know the area of each city so we can calculate a casualty rate.
Fortunately, DuckDB has a function for that too: ST_Area returns the area of a geometry.
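For a planar (projected) polygon, the area ST_Area returns can be computed with the shoelace formula. Here is a pure-Python sketch (a simplified illustration, not DuckDB's actual implementation):

```python
def polygon_area(coords):
    """Shoelace formula: area of a simple polygon given its (x, y) vertices
    in order. With coordinates in metres (e.g. EPSG:27700), the result is m^2."""
    s = 0.0
    n = len(coords)
    for i in range(n):
        x1, y1 = coords[i]
        x2, y2 = coords[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

# A 2000 m x 3000 m rectangle -> 6,000,000 m^2 = 6 km^2
rect = [(0.0, 0.0), (2000.0, 0.0), (2000.0, 3000.0), (0.0, 3000.0)]
print(polygon_area(rect) / 1e6)
```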
# --- Compute areas in km² (CRS84 -> OSGB 27700) ---
print("\nCalculating areas in km^2...")
areas = con.execute("""
SELECT
city,
ST_Area(
ST_MakeValid(
ST_Transform(
-- ST_Simplify(geom, 0.001), -- Experiment with epsilon value (e.g., 0.001 degrees)
geom,
'OGC:CRS84','EPSG:27700'
)
)
) / 1e6 AS area_km2
FROM cities
ORDER BY area_km2 DESC;
""").df()
print("City Areas:")
print(areas.round(2))
Running this gave me the following output, which looks about right.
city area_km2
0 London 1321.45
1 Birmingham 677.06
2 Manchester 640.54
3 Glasgow 481.49
4 Edinburgh 123.00
5 Cardiff 96.08
We now have all the information we need to announce which city is the safest in the UK. Remember: the lower the “safety_index” value, the safer the city.
city area_km2 casualties safety_index (casualties/area)
0 London 1321.45 134328 101.65
1 Birmingham 677.06 14946 22.07
2 Manchester 640.54 4518 7.05
3 Glasgow 481.49 3978 8.26
4 Edinburgh 123.00 3116 25.33
5 Cardiff 96.08 1903 19.81
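The ranking can be reproduced from the casualty and area figures above with a few lines of pandas (note that dividing 1903 by 96.08 gives 19.81 for Cardiff):

```python
import pandas as pd

# Casualty totals and areas taken from the earlier query results
df = pd.DataFrame({
    "city": ["London", "Birmingham", "Manchester", "Glasgow", "Edinburgh", "Cardiff"],
    "casualties": [134328, 14946, 4518, 3978, 3116, 1903],
    "area_km2": [1321.45, 677.06, 640.54, 481.49, 123.00, 96.08],
})
df["safety_index"] = (df["casualties"] / df["area_km2"]).round(2)
df = df.sort_values("safety_index").reset_index(drop=True)
print(df)
```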
I am not comfortable including the results for Manchester and Glasgow, because of the doubts about their casualty counts mentioned earlier.
Taking that into account, and because I am the boss of this article, I declare Cardiff the winner of the award for the UK's safest city for drivers and pedestrians. What do you think of these results? Do you live in one of the cities I looked at? If so, do the results match your experience of driving or walking there?
In summary
We explored how to perform exploratory data analysis on geospatial data. Our goal was to decide which of six major UK cities is the safest to drive or walk around. Using a combination of geopandas and duckdb, we were able to:
- Use geopandas to download city boundary data from official government sources that accurately represents the extent of each city.
- Download and import the UK-wide CSV of five years of collision data, keeping the three fields of interest: latitude, longitude, and the number of casualties.
- Join the accident data to the city boundary data on latitude/longitude using DuckDB's spatial functions, to get the total number of casualties for each city over five years.
- Use DuckDB's spatial functions to calculate the area in km² of each city.
- Calculate a safety index for each city: its number of casualties divided by its area. We discounted two of our results because of doubts about the accuracy of some of the data. Cardiff had the lowest remaining safety index and so was crowned the safest of the cities we tested.
Given the right input data, geospatial analysis using the tools I've described can be very useful in many industries. Traffic and risk analysis (as shown here), flood, deforestation, wildfire, and drought analysis all come to mind. Basically, any system or process tied to a spatial coordinate system is ripe for this kind of exploratory analysis.



