How to Speed Up GeoPandas: Tips for Large Datasets
Problem statement
GeoPandas works well for many GIS tasks, but performance often drops when datasets get larger. Common problem areas include:
- reading large Shapefiles or GeoJSON files
- spatial joins against many features
- overlays on complex polygons
- repeated reprojection
- row-by-row loops with iterrows() or apply()
- high memory usage from loading unnecessary columns and geometries
In practice, this shows up as slow scripts, memory errors, or workflows that are fast on small test data but unusable on real project data.
The main goal of GeoPandas performance optimization is to reduce how much data the library has to load, compare, and write at each step.
Quick answer
The fastest ways to speed up GeoPandas on large datasets are:
- read only the columns you need
- filter rows as early as possible
- use faster formats like Parquet or GeoPackage for intermediate outputs
- avoid Python loops and apply() when vectorized operations are available
- confirm spatial indexes are available for joins and filters
- simplify or repair geometries when appropriate
- reproject only when a task actually requires it
- break large workflows into smaller saved steps
Step-by-step solution
1. Identify where GeoPandas is slow
Before changing code, measure which step is slow: file reading, filtering, geometry operations, joins, or export.
import time
import geopandas as gpd
start = time.perf_counter()
parcels = gpd.read_file("data/parcels.gpkg")
print(f"Read time: {time.perf_counter() - start:.2f} seconds")
start = time.perf_counter()
parcels = parcels[parcels["land_use"] == "residential"]
print(f"Filter time: {time.perf_counter() - start:.2f} seconds")
start = time.perf_counter()
parcels = parcels.to_crs("EPSG:3857")
print(f"Reprojection time: {time.perf_counter() - start:.2f} seconds")
This helps you avoid optimizing the wrong part of the workflow.
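The repeated start/print timing pattern above can be wrapped in a small helper so each step only needs one extra line. This is a minimal sketch using only the standard library; the `timed` name is illustrative, not part of GeoPandas.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    """Print how long the wrapped block took, labeled for readability."""
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.2f} seconds")

# Usage sketch (assumes gpd and the file paths from the example above):
# with timed("Read time"):
#     parcels = gpd.read_file("data/parcels.gpkg")
```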
For memory-heavy layers, also check dataset size after loading:
parcels.info(memory_usage="deep")
2. Reduce the amount of data loaded into memory
A common large-dataset problem is loading far more data than you need.
Read only the columns you need
If your GeoPandas engine and file format support it, use columns= with read_file() to avoid loading unnecessary fields.
import geopandas as gpd
import time
start = time.perf_counter()
full = gpd.read_file("data/parcels.gpkg")
print(f"Full read: {time.perf_counter() - start:.2f} seconds")
start = time.perf_counter()
small = gpd.read_file(
    "data/parcels.gpkg",
    columns=["parcel_id", "land_use", "geometry"]
)
print(f"Selected columns read: {time.perf_counter() - start:.2f} seconds")
This is especially useful when the source has many text fields you do not use.
Filter features early
Do not run expensive spatial operations on the full dataset if you can reduce it first.
import geopandas as gpd
parcels = gpd.read_file(
    "data/parcels.gpkg",
    columns=["parcel_id", "land_use", "geometry"]
)
schools = gpd.read_file("data/schools.gpkg")
# Reduce input size before spatial join
residential = parcels[parcels["land_use"] == "residential"].copy()
joined = gpd.sjoin(residential, schools, predicate="intersects", how="inner")
For testing, run the workflow on a smaller subset first:
test_subset = parcels.head(5000).copy()
3. Use faster file formats for intermediate results
File format has a large effect on GeoPandas memory usage and runtime.
- Shapefile: widely compatible, but slower and limited
- GeoJSON: text-heavy and often slow for large layers
- GeoPackage: usually better than Shapefile for many workflows
- Parquet: often the best choice for repeated Python-based analysis workflows
Save cleaned intermediate results so you do not repeat expensive steps.
cleaned = residential.to_crs("EPSG:3857")
cleaned.to_parquet("data/intermediate/residential_3857.parquet")
# or
cleaned.to_file("data/intermediate/residential_3857.gpkg", driver="GPKG")
reloaded = gpd.read_parquet("data/intermediate/residential_3857.parquet")
If you repeatedly read the same processed layer, this can save a lot of time.
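One way to structure this is a small cache helper: build the layer once, save it, and reuse the saved copy on later runs. The helper below is a generic sketch (the `load_or_build` name and its parameters are illustrative); with GeoPandas you would pass `gpd.read_parquet` as the reader and a `to_parquet` call as the writer.

```python
from pathlib import Path

def load_or_build(cache_path, build, read, write):
    """Return the cached file if it exists; otherwise build, save, and return it."""
    cache = Path(cache_path)
    if cache.exists():
        return read(cache)
    result = build()
    cache.parent.mkdir(parents=True, exist_ok=True)
    write(result, cache)
    return result

# Usage sketch with GeoPandas (assumes `residential` from the examples above):
# cleaned = load_or_build(
#     "data/intermediate/residential_3857.parquet",
#     build=lambda: residential.to_crs("EPSG:3857"),
#     read=gpd.read_parquet,
#     write=lambda gdf, path: gdf.to_parquet(path),
# )
```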
4. Improve geometry operation performance
Complex geometries make overlays, joins, clipping, and reprojection slower.
Simplify geometries when acceptable
If you do not need full boundary precision for a first-pass analysis, simplify before heavy operations.
districts = gpd.read_file("data/districts.gpkg").to_crs("EPSG:3857")
districts["geometry"] = districts.geometry.simplify(
    tolerance=10,
    preserve_topology=True
)
Use this carefully. Simplified boundaries can change spatial results near edges.
Check invalid geometries before overlay or joins
Invalid polygons can slow operations or cause failures. Use a repair method that fits your GeoPandas and Shapely version.
buildings = gpd.read_file("data/buildings.gpkg")
invalid = ~buildings.geometry.is_valid
buildings.loc[invalid, "geometry"] = buildings.loc[invalid, "geometry"].buffer(0)
buffer(0) is a common workaround, but it can change geometry. Validate results before continuing.
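If your Shapely version is 1.8 or newer, shapely.validation.make_valid is usually a safer repair than buffer(0), because it is intended to return a valid geometry without discarding input vertices. A minimal sketch with a self-intersecting "bowtie" polygon:

```python
from shapely.geometry import Polygon
from shapely.validation import make_valid

# A self-intersecting "bowtie" polygon: invalid by OGC rules
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
print(bowtie.is_valid)    # False

# make_valid splits it into a valid (multi)polygon
repaired = make_valid(bowtie)
print(repaired.is_valid)  # True
```

Recent GeoPandas versions also expose this as a GeoSeries method (make_valid()); check what your installed version supports before relying on it.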
Reproject only when needed
CRS changes are expensive on large datasets. Reproject once, not repeatedly.
roads = gpd.read_file("data/roads.gpkg")
if roads.crs is None or roads.crs.to_epsg() != 3857:
    roads = roads.to_crs("EPSG:3857")
5. Speed up spatial joins and spatial filters
A slow GeoPandas spatial join is often caused by too many candidate comparisons.
Confirm a spatial index is available
parcels = gpd.read_file("data/parcels.gpkg")
schools = gpd.read_file("data/schools.gpkg")
print(parcels.has_sindex)
print(schools.has_sindex)
# Build index if needed by accessing it
_ = schools.sindex
GeoPandas uses spatial indexing in many operations, but checking helps when debugging performance.
Limit candidate features before joining
Use a bounding box filter when possible.
xmin, ymin, xmax, ymax = schools.total_bounds
candidate_parcels = parcels.cx[xmin:xmax, ymin:ymax]
joined = gpd.sjoin(candidate_parcels, schools, predicate="intersects", how="inner")
This is a bounding-box prefilter, not an exact spatial match, so run the actual spatial join after it.
Use the correct predicate
A more specific predicate can reduce unnecessary matches.
joined = gpd.sjoin(parcels, schools, predicate="within", how="inner")
Use within, contains, or intersects based on the actual task.
6. Replace slow row-by-row patterns
Python loops are a common source of slow GeoPandas workflows.
Avoid loops over rows
Slow pattern:
areas = []
for _, row in parcels.iterrows():
    areas.append(row.geometry.area)
parcels["area_m2"] = areas
Faster vectorized pattern:
parcels = parcels.to_crs("EPSG:3857")
parcels["area_m2"] = parcels.geometry.area
For attribute logic, use pandas vectorized operations instead of apply() where possible.
parcels["is_large"] = parcels["area_m2"] > 1000
Separating geometry operations from plain attribute logic usually improves performance and keeps code simpler.
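For multi-way attribute classification, the same vectorized approach extends with numpy.select, which is typically much faster than apply() row by row. A minimal sketch assuming an area_m2 column like the one computed above (the thresholds and labels are illustrative):

```python
import numpy as np
import pandas as pd

parcels = pd.DataFrame({"area_m2": [250.0, 1500.0, 6000.0]})

# Conditions are evaluated top to bottom; first match wins
conditions = [
    parcels["area_m2"] > 5000,
    parcels["area_m2"] > 1000,
]
choices = ["very_large", "large"]

# Rows matching no condition receive the default label
parcels["size_class"] = np.select(conditions, choices, default="small")
print(parcels["size_class"].tolist())  # ['small', 'large', 'very_large']
```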
7. Manage memory in long workflows
Drop unused columns and temporary outputs as soon as possible.
joined = joined.drop(
    columns=["owner_name", "mailing_address", "index_right"],
    errors="ignore"
)
If the full workflow does not fit in memory, process smaller pieces and save outputs.
subset1 = parcels.iloc[:50000].copy()
subset1.to_parquet("data/intermediate/parcels_part1.parquet")
GeoPandas is an in-memory tool. For very large pipelines, writing intermediate files is often better than keeping everything in one session.
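The slicing above can be generalized into a small helper that yields index ranges, so each chunk is processed and written in turn. A standard-library-only sketch (the chunk_ranges name is illustrative):

```python
def chunk_ranges(n_rows, chunk_size):
    """Yield (start, stop) slice bounds covering n_rows in chunk_size pieces."""
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)

# Usage sketch with GeoPandas (assumes `parcels` from the examples above):
# for i, (start, stop) in enumerate(chunk_ranges(len(parcels), 50_000)):
#     part = parcels.iloc[start:stop].copy()
#     part.to_parquet(f"data/intermediate/parcels_part{i}.parquet")

print(list(chunk_ranges(120_000, 50_000)))
# [(0, 50000), (50000, 100000), (100000, 120000)]
```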
8. Know when GeoPandas is not the right tool for the full job
GeoPandas is strong for analysis, QA, preprocessing, and export. It is not always the best option for every large production pipeline.
If you are working with millions of features or repeated heavy joins, move the largest filtering and join steps into:
- PostGIS
- DuckDB with spatial support
- another spatial database or processing engine
Then use GeoPandas for smaller final steps.
Code examples
Example workflow for faster GeoPandas processing
import geopandas as gpd
# 1. Read only needed columns
parcels = gpd.read_file(
    "data/parcels.gpkg",
    columns=["parcel_id", "land_use", "geometry"]
)
schools = gpd.read_file(
    "data/schools.gpkg",
    columns=["school_id", "geometry"]
)
# 2. Filter early
parcels = parcels[parcels["land_use"] == "residential"].copy()
# 3. Match CRS once
if parcels.crs != schools.crs:
    schools = schools.to_crs(parcels.crs)
# 4. Check and repair invalid geometry if needed
invalid = ~parcels.geometry.is_valid
parcels.loc[invalid, "geometry"] = parcels.loc[invalid, "geometry"].buffer(0)
# 5. Run spatial join
result = gpd.sjoin(parcels, schools, predicate="intersects", how="inner")
# 6. Keep only needed output fields
result = result[["parcel_id", "school_id", "geometry"]]
# 7. Export to a fast reusable format
result.to_parquet("data/output/residential_parcels_near_schools.parquet")
Explanation
GeoPandas slows down mainly because of three things:
- large geometries: complex polygons increase comparison and reprojection cost
- text-heavy formats: GeoJSON and Shapefile are usually slower to parse than Parquet or GeoPackage
- Python-level iteration: loops and apply() add overhead compared with vectorized operations
Early filtering has a large effect because every later step runs on fewer rows. If you reduce a 1,000,000-feature layer to 100,000 features before a spatial join, you also reduce index building, geometry comparisons, and output size.
For large datasets, the most practical improvements usually come from reducing input size, avoiding repeated work, and saving intermediate outputs in a faster format.
Edge cases or notes
- CRS issues: spatial joins and distance or area calculations require matching CRS. Reproject once, not repeatedly. If you need help, see How to Fix CRS Mismatch in GeoPandas.
- Invalid geometries: overlays and joins may fail or slow down when polygons are invalid. buffer(0) is only one workaround and can alter shapes.
- Simplification tradeoff: simplified geometry can change boundaries and affect join or overlay results.
- Disk speed matters: SSD vs network storage can change file read and write performance significantly.
- Geometry complexity matters: 10,000 very complex polygons can be slower than 100,000 simple points.
- Parquet is not a universal interchange format: it is most useful when your workflow stays mainly in Python or other modern data tools.
Internal links
For the broader workflow, see GeoPandas basics for spatial data processing.
Related tasks:
If you are troubleshooting related failures, see:
- How to fix invalid geometries in GeoPandas
- Fixing Memory Errors in GeoPandas When Working with Large Files
- GeoPandas Not Reading Shapefile: Common Causes and Fixes
FAQ
Why is GeoPandas so slow with large datasets?
Usually because too much data is loaded at once, geometries are complex, file formats are inefficient, or the workflow uses loops instead of vectorized operations.
Does GeoPandas use a spatial index automatically?
GeoPandas uses spatial indexing in many operations such as spatial joins, but behavior depends on your environment and installed backends. It is still useful to check has_sindex and reduce candidate features before the operation.
Which file format is faster than Shapefile for GeoPandas workflows?
GeoPackage is usually better than Shapefile, and Parquet is often faster for repeated Python-based analysis workflows.
Should I use apply() for GeoPandas performance-sensitive tasks?
Usually no. Prefer vectorized GeoPandas geometry methods and pandas column operations when possible.
When should I switch from GeoPandas to PostGIS or another database?
If your workflow involves millions of features, repeated heavy joins, or memory failures, a spatial database is often a better place for the largest filtering and join steps.
Related articles
How to Fix CRS Mismatch in GeoPandas
How to identify and fix CRS mismatch issues in GeoPandas using set_crs() and to_crs() before spatial operations.
GeoPandas Not Reading Shapefile: Common Causes and Fixes
Common causes and fixes for GeoPandas failing to read a shapefile, including missing files, paths, and encoding issues.
Fixing Memory Errors in GeoPandas When Working with Large Files
A GeoPandas memory error usually appears when you try to read, process, or export a large spatial dataset and Python runs out of available RAM. Sometimes this appears as `MemoryError`; in other cases, the process becomes extremely slow.