Tutorial

How to Speed Up GeoPandas: Tips for Large Datasets

Problem statement

GeoPandas works well for many GIS tasks, but performance often drops when datasets get larger. Common problem areas include:

  • reading large Shapefiles or GeoJSON files
  • spatial joins against many features
  • overlays on complex polygons
  • repeated reprojection
  • row-by-row loops with iterrows() or apply()
  • high memory usage from loading unnecessary columns and geometries

In practice, this shows up as slow scripts, memory errors, or workflows that are fast on small test data but unusable on real project data.

Whatever the symptom, the core of GeoPandas performance optimization is the same: reduce how much data GeoPandas has to load, compare, and write at each step.

Quick answer

The fastest ways to speed up GeoPandas on large datasets are:

  • read only the columns you need
  • filter rows as early as possible
  • use faster formats like Parquet or GeoPackage for intermediate outputs
  • avoid Python loops and apply() when vectorized operations are available
  • confirm spatial indexes are available for joins and filters
  • simplify or repair geometries when appropriate
  • reproject only when a task actually requires it
  • break large workflows into smaller saved steps

Step-by-step solution

1. Identify where GeoPandas is slow

Before changing code, measure which step is slow: file reading, filtering, geometry operations, joins, or export.

import time
import geopandas as gpd

start = time.perf_counter()
parcels = gpd.read_file("data/parcels.gpkg")
print(f"Read time: {time.perf_counter() - start:.2f} seconds")

start = time.perf_counter()
parcels = parcels[parcels["land_use"] == "residential"]
print(f"Filter time: {time.perf_counter() - start:.2f} seconds")

start = time.perf_counter()
parcels = parcels.to_crs("EPSG:3857")
print(f"Reprojection time: {time.perf_counter() - start:.2f} seconds")

This helps you avoid optimizing the wrong part of the workflow.

For memory-heavy layers, also check dataset size after loading:

parcels.info(memory_usage="deep")

2. Reduce the amount of data loaded into memory

A common large-dataset problem is loading far more data than you need.

Read only the columns you need

If your GeoPandas engine and file format support it, use columns= with read_file() to avoid loading unnecessary fields.

import geopandas as gpd
import time

start = time.perf_counter()
full = gpd.read_file("data/parcels.gpkg")
print(f"Full read: {time.perf_counter() - start:.2f} seconds")

start = time.perf_counter()
small = gpd.read_file(
    "data/parcels.gpkg",
    columns=["parcel_id", "land_use", "geometry"]
)
print(f"Selected columns read: {time.perf_counter() - start:.2f} seconds")

This is especially useful when the source has many text fields you do not use.
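
If pyogrio is installed, you can also inspect a layer's fields and feature count before deciding what to load. A minimal sketch, assuming the same parcels file as above:

import pyogrio

# Read metadata only; no features are loaded into memory
info = pyogrio.read_info("data/parcels.gpkg")
print(info["fields"])    # available attribute columns
print(info["features"])  # feature count
print(info["crs"])       # layer CRS

This makes it easier to pick a columns= list without opening the full dataset.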

Filter features early

Do not run expensive spatial operations on the full dataset if you can reduce it first.

import geopandas as gpd

parcels = gpd.read_file(
    "data/parcels.gpkg",
    columns=["parcel_id", "land_use", "geometry"]
)
schools = gpd.read_file("data/schools.gpkg")

# Reduce input size before spatial join
residential = parcels[parcels["land_use"] == "residential"].copy()

joined = gpd.sjoin(residential, schools, predicate="intersects", how="inner")

For testing, run the workflow on a smaller subset first:

test_subset = parcels.head(5000).copy()
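
Depending on your engine, some filtering can be pushed into the read itself. With the pyogrio engine, read_file() accepts a bbox spatial filter, and recent versions also pass through a SQL-style where clause. A sketch, assuming the same layer and column names (the bbox coordinates are placeholders):

import geopandas as gpd

# Spatial filter at read time: only features intersecting this bbox are loaded
nearby = gpd.read_file(
    "data/parcels.gpkg",
    bbox=(500000, 4100000, 510000, 4110000)
)

# Attribute filter at read time (pyogrio engine, GDAL SQL WHERE syntax)
residential = gpd.read_file(
    "data/parcels.gpkg",
    where="land_use = 'residential'"
)

Pushing filters into the read can avoid ever materializing rows you would otherwise drop immediately.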

3. Use faster file formats for intermediate results

File format has a large effect on GeoPandas memory usage and runtime.

  • Shapefile: widely compatible, but slow to parse, with 10-character field-name and 2 GB file-size limits
  • GeoJSON: text-heavy and often slow for large layers
  • GeoPackage: usually faster and more capable than Shapefile for many workflows
  • Parquet: often the best choice for repeated Python-based analysis workflows

Save cleaned intermediate results so you do not repeat expensive steps.

cleaned = residential.to_crs("EPSG:3857")

cleaned.to_parquet("data/intermediate/residential_3857.parquet")
# or
cleaned.to_file("data/intermediate/residential_3857.gpkg", driver="GPKG")

reloaded = gpd.read_parquet("data/intermediate/residential_3857.parquet")

If you repeatedly read the same processed layer, this can save a lot of time.

4. Improve geometry operation performance

Complex geometries make overlays, joins, clipping, and reprojection slower.

Simplify geometries when acceptable

If you do not need full boundary precision for a first-pass analysis, simplify before heavy operations.

districts = gpd.read_file("data/districts.gpkg").to_crs("EPSG:3857")

districts["geometry"] = districts.geometry.simplify(
    tolerance=10,
    preserve_topology=True
)

Use this carefully. Simplified boundaries can change spatial results near edges.
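
One way to sanity-check a tolerance before committing to it is to compare total area before and after simplification. A rough sketch using the same districts layer:

import geopandas as gpd

districts_raw = gpd.read_file("data/districts.gpkg").to_crs("EPSG:3857")

# Gauge distortion: relative change in total area at this tolerance
original_area = districts_raw.geometry.area.sum()
simplified_area = districts_raw.geometry.simplify(tolerance=10).area.sum()

change = abs(simplified_area - original_area) / original_area
print(f"Relative area change: {change:.4%}")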

Check invalid geometries before overlay or joins

Invalid polygons can slow operations or cause failures. Use a repair method that fits your GeoPandas and Shapely version.

buildings = gpd.read_file("data/buildings.gpkg")

invalid = ~buildings.geometry.is_valid
buildings.loc[invalid, "geometry"] = buildings.loc[invalid, "geometry"].buffer(0)

buffer(0) is a common workaround, but it can change geometry. Validate results before continuing.
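
On recent versions (GeoPandas 0.12 or newer with Shapely 2.0), make_valid() is usually a safer repair than buffer(0). A sketch under that version assumption:

# make_valid() requires GeoPandas >= 0.12 with Shapely >= 2.0
invalid = ~buildings.geometry.is_valid
buildings.loc[invalid, "geometry"] = buildings.loc[invalid, "geometry"].make_valid()

# Confirm the repair worked; badly broken inputs may become multi-part geometries
print(buildings.geometry.is_valid.all())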

Reproject only when needed

CRS changes are expensive on large datasets. Reproject once, not repeatedly.

roads = gpd.read_file("data/roads.gpkg")
if roads.crs is None:
    raise ValueError("roads layer has no CRS; set it with set_crs() first")
if roads.crs.to_epsg() != 3857:
    roads = roads.to_crs("EPSG:3857")

5. Speed up spatial joins and spatial filters

A slow GeoPandas spatial join is often caused by too many candidate comparisons.

Confirm a spatial index is available

parcels = gpd.read_file("data/parcels.gpkg")
schools = gpd.read_file("data/schools.gpkg")

print(parcels.has_sindex)
print(schools.has_sindex)

# Build index if needed by accessing it
_ = schools.sindex

GeoPandas uses spatial indexing in many operations, but checking helps when debugging performance.
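
sjoin() uses the index internally, but you can also query it directly when you need custom candidate filtering. A sketch, assuming GeoPandas 0.12 or newer, where sindex.query() accepts an array of geometries:

import numpy as np

# Query the schools index with all parcel geometries at once.
# With a predicate, exact tests run after the bounding-box pass.
# The two arrays hold aligned positions into parcels and schools.
parcel_pos, school_pos = schools.sindex.query(
    parcels.geometry, predicate="intersects"
)

# Deduplicate parcels that match several schools
matched_parcels = parcels.iloc[np.unique(parcel_pos)]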

Limit candidate features before joining

Use a bounding box filter when possible.

xmin, ymin, xmax, ymax = schools.total_bounds
candidate_parcels = parcels.cx[xmin:xmax, ymin:ymax]

joined = gpd.sjoin(candidate_parcels, schools, predicate="intersects", how="inner")

This is a bounding-box prefilter, not an exact spatial match, so run the actual spatial join after it.

Use the correct predicate

A more specific predicate can reduce unnecessary matches.

joined = gpd.sjoin(parcels, schools, predicate="within", how="inner")

Use within, contains, or intersects based on the actual task.

6. Replace slow row-by-row patterns

Python loops are a common source of slow GeoPandas workflows.

Avoid loops over rows

Slow pattern:

areas = []
for _, row in parcels.iterrows():
    areas.append(row.geometry.area)

parcels["area_m2"] = areas

Faster vectorized pattern:

parcels = parcels.to_crs("EPSG:3857")
parcels["area_m2"] = parcels.geometry.area

For attribute logic, use pandas vectorized operations instead of apply() where possible.

parcels["is_large"] = parcels["area_m2"] > 1000

Separating geometry operations from plain attribute logic usually improves performance and keeps code simpler.
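
For multi-branch attribute logic, numpy.select() keeps the classification vectorized where apply() would be tempting. A sketch using the area_m2 column computed above (the thresholds are placeholders):

import numpy as np

# Vectorized multi-branch classification instead of a row-wise apply()
conditions = [
    parcels["area_m2"] < 500,
    parcels["area_m2"] < 2000,
]
choices = ["small", "medium"]

parcels["size_class"] = np.select(conditions, choices, default="large")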

7. Manage memory in long workflows

Drop unused columns and temporary outputs as soon as possible.

joined = joined.drop(
    columns=["owner_name", "mailing_address", "index_right"],
    errors="ignore"
)

If the full workflow does not fit in memory, process smaller pieces and save outputs.

subset1 = parcels.iloc[:50000].copy()
subset1.to_parquet("data/intermediate/parcels_part1.parquet")

GeoPandas is an in-memory tool. For very large pipelines, writing intermediate files is often better than keeping everything in one session.
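
If a single layer is too large to load at once, the pyogrio engine lets read_file() slice a file with skip_features and max_features. A sketch, assuming pyogrio is installed and the same hypothetical paths:

import geopandas as gpd
import pyogrio

total = pyogrio.read_info("data/parcels.gpkg")["features"]
chunk_size = 50_000

for i, offset in enumerate(range(0, total, chunk_size)):
    chunk = gpd.read_file(
        "data/parcels.gpkg",
        skip_features=offset,
        max_features=chunk_size,
    )
    # Process each slice, persist it, and let it go out of scope
    chunk.to_parquet(f"data/intermediate/parcels_part{i}.parquet")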

8. Know when GeoPandas is not the right tool for the full job

GeoPandas is strong for analysis, QA, preprocessing, and export. It is not always the best option for every large production pipeline.

If you are working with millions of features or repeated heavy joins, move the largest filtering and join steps into:

  • PostGIS
  • DuckDB with spatial support
  • another spatial database or processing engine

Then use GeoPandas for smaller final steps.
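
As one example, DuckDB's spatial extension can run the heavy join in SQL and hand a manageable result back to Python. A rough sketch, assuming the extension installs cleanly and the same hypothetical files and columns:

import duckdb

con = duckdb.connect()
con.execute("INSTALL spatial; LOAD spatial;")

# Run the expensive intersection join inside DuckDB, not in Python
result = con.execute("""
    SELECT p.parcel_id, s.school_id
    FROM ST_Read('data/parcels.gpkg') AS p
    JOIN ST_Read('data/schools.gpkg') AS s
      ON ST_Intersects(p.geom, s.geom)
""").fetchdf()

The result is a plain pandas DataFrame; reattach geometry with GeoPandas only for the smaller final steps.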

Code examples

Example workflow for faster GeoPandas processing

import geopandas as gpd

# 1. Read only needed columns
parcels = gpd.read_file(
    "data/parcels.gpkg",
    columns=["parcel_id", "land_use", "geometry"]
)
schools = gpd.read_file(
    "data/schools.gpkg",
    columns=["school_id", "geometry"]
)

# 2. Filter early
parcels = parcels[parcels["land_use"] == "residential"].copy()

# 3. Match CRS once
if parcels.crs != schools.crs:
    schools = schools.to_crs(parcels.crs)

# 4. Check and repair invalid geometry if needed
invalid = ~parcels.geometry.is_valid
parcels.loc[invalid, "geometry"] = parcels.loc[invalid, "geometry"].buffer(0)

# 5. Run spatial join
result = gpd.sjoin(parcels, schools, predicate="intersects", how="inner")

# 6. Keep only needed output fields
result = result[["parcel_id", "school_id", "geometry"]]

# 7. Export to a fast reusable format
result.to_parquet("data/output/residential_parcels_near_schools.parquet")

Explanation

GeoPandas slows down mainly because of three things:

  • large geometries: complex polygons increase comparison and reprojection cost
  • text-heavy formats: GeoJSON and Shapefile are usually slower to parse than Parquet or GeoPackage
  • Python-level iteration: loops and apply() add overhead compared with vectorized operations

Early filtering has a large effect because every later step runs on fewer rows. If you reduce a 1,000,000-feature layer to 100,000 features before a spatial join, you also reduce index building, geometry comparisons, and output size.

For large datasets, the most practical improvements usually come from reducing input size, avoiding repeated work, and saving intermediate outputs in a faster format.

Edge cases or notes

  • CRS issues: spatial joins and distance calculations require layers in a matching CRS, and area or distance values are only meaningful in a suitable projected CRS. Reproject once, not repeatedly. If you need help, see How to Fix CRS Mismatch in GeoPandas.
  • Invalid geometries: overlays and joins may fail or slow down when polygons are invalid. buffer(0) is only one workaround and can alter shapes.
  • Simplification tradeoff: simplified geometry can change boundaries and affect join or overlay results.
  • Disk speed matters: SSD vs network storage can change file read and write performance significantly.
  • Geometry complexity matters: 10,000 very complex polygons can be slower than 100,000 simple points.
  • Parquet is not a universal interchange format: it is most useful when your workflow stays mainly in Python or other modern data tools.

Internal links

For the broader workflow, see GeoPandas basics for spatial data processing.

FAQ

Why is GeoPandas so slow with large datasets?

Usually because too much data is loaded at once, geometries are complex, file formats are inefficient, or the workflow uses loops instead of vectorized operations.

Does GeoPandas use a spatial index automatically?

GeoPandas uses spatial indexing in many operations such as spatial joins, but behavior depends on your environment and installed backends. It is still useful to check has_sindex and reduce candidate features before the operation.

Which file format is faster than Shapefile for GeoPandas workflows?

GeoPackage is usually better than Shapefile, and Parquet is often faster for repeated Python-based analysis workflows.

Should I use apply() for GeoPandas performance-sensitive tasks?

Usually no. Prefer vectorized GeoPandas geometry methods and pandas column operations when possible.

When should I switch from GeoPandas to PostGIS or another database?

If your workflow involves millions of features, repeated heavy joins, or memory failures, a spatial database is often a better place for the largest filtering and join steps.
