Fixing Memory Errors in GeoPandas When Working with Large Files
Problem statement
A GeoPandas memory error usually appears when you try to read, process, or export a large spatial dataset and Python runs out of available RAM. Sometimes this appears as MemoryError. In other cases, the process becomes extremely slow, the kernel crashes, or the script exits during a heavy operation.
This commonly happens when working with:
- large shapefiles or GeoJSON files
- datasets with many attribute columns
- very complex polygons or multipart geometries
- memory-heavy operations such as to_crs(), sjoin(), overlay(), dissolve(), and buffer()
- large exports to GeoJSON or shapefile
The practical goal is not just to avoid a crash. It is to reduce memory use enough to finish the GIS task reliably.
Quick answer
To fix a GeoPandas out-of-memory problem:
- test a small subset first to confirm the file and workflow are valid
- keep only the columns you need
- reduce data early before expensive operations
- prefer GeoPackage or Parquet over GeoJSON for repeated work
- drop unused fields before joins, overlays, and dissolves
- avoid unnecessary GeoDataFrame copies
- simplify or repair geometries before heavy processing when appropriate
- process large datasets in batches by tile, region, or group
- write intermediate results to disk instead of holding everything in memory
Step-by-step solution
1) Check where the memory error happens
First, identify the step that fails: reading, processing, or writing.
Reading the file
If the error happens in read_file(), common causes include:
- a very large GeoJSON
- a shapefile with many fields
- complex polygon geometry
- an inefficient source format for the job
```python
import geopandas as gpd

path = "data/large_parcels.geojson"
gdf = gpd.read_file(path)
print(len(gdf))
```
If this fails immediately, focus on reading less data or changing formats.
Processing the data
If the file loads successfully but later fails, the memory spike is usually caused by operations such as to_crs(), sjoin(), overlay(), dissolve(), or buffer().
```python
import geopandas as gpd

parcels = gpd.read_file("data/parcels.gpkg")
flood = gpd.read_file("data/flood_zones.gpkg")
result = parcels.overlay(flood, how="intersection")
```
Writing the output
Large exports can also fail, especially to GeoJSON or shapefile.
```python
import geopandas as gpd

result = gpd.read_parquet("work/result.parquet")
result.to_file("output/result.geojson", driver="GeoJSON")
```
If reading and processing work but export fails, keep intermediate data in a more efficient format and export the final result last.
2) Check file size and structure before loading everything
A large file on disk usually needs more memory after it is parsed into geometry and attribute objects.
Inspect file size on disk
```python
from pathlib import Path

path = Path("data/large_buildings.geojson")
size_mb = path.stat().st_size / (1024 * 1024)
print(f"{size_mb:.1f} MB on disk")
```
Check geometry type and column count
Too many fields and complex geometry both increase memory use.
```python
import geopandas as gpd

gdf = gpd.read_file("data/sample_area.gpkg")
print(gdf.geom_type.value_counts())
print(gdf.columns)
```
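A rough in-memory estimate is also useful here: `memory_usage(deep=True)` counts the Python objects behind each column, which is usually far larger than the file size on disk suggests. A self-contained sketch with synthetic data:

```python
import geopandas as gpd
from shapely.geometry import Point

# Synthetic layer: one wide string column plus point geometry
gdf = gpd.GeoDataFrame(
    {"name": ["a" * 50] * 1_000},
    geometry=[Point(i, i) for i in range(1_000)],
    crs="EPSG:4326",
)

# deep=True measures the actual objects, not just the column pointers
bytes_in_memory = gdf.memory_usage(deep=True).sum()
print(f"{bytes_in_memory / 1024:.0f} KiB in memory")
```

Comparing this number against available RAM, before scaling up to the full dataset, makes the later steps much less of a guessing game.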
Test with a small sample
A small spatial sample helps confirm the workflow before running the full job.
```python
import geopandas as gpd

bbox = (-74.05, 40.68, -73.85, 40.85)  # minx, miny, maxx, maxy
sample = gpd.read_file("data/large_buildings.gpkg", bbox=bbox)
print(len(sample))
```
When the file format and driver support spatial filtering, this reads only the features that intersect the box rather than the whole file.
3) Reduce memory use when reading large spatial files
Keep only needed columns
If you only need a few fields, reduce the GeoDataFrame immediately after reading.
```python
import geopandas as gpd

gdf = gpd.read_file("data/parcels.gpkg")
gdf = gdf[["parcel_id", "land_use", "geometry"]]
```
This does not always reduce memory during the initial read, but it lowers later processing cost.
Use spatial filtering when supported
Bounding box filtering is useful for large workflows, especially when you only need one study area.
```python
import geopandas as gpd

study_bbox = (500000, 4500000, 510000, 4510000)
roads = gpd.read_file("data/roads.gpkg", bbox=study_bbox)
```
This can reduce how much data is read when the file format and driver support spatial filtering.
Prefer more efficient formats
If you repeatedly use the same large GeoJSON or shapefile, convert it once and reuse the converted version.
```python
import geopandas as gpd

gdf = gpd.read_file("data/large_parcels.geojson")
gdf.to_file("data/large_parcels.gpkg", driver="GPKG")
gdf.to_parquet("data/large_parcels.parquet")
```
GeoPackage works well for GIS desktop workflows. Parquet is often a better choice for repeated Python analysis.
4) Reduce memory use during processing
Drop unused columns before heavy operations
Before joins or overlays, remove fields you do not need.
```python
parcels = parcels[["parcel_id", "geometry"]]
flood = flood[["risk_class", "geometry"]]
```
Avoid unnecessary copies
Every extra copy increases memory use.
```python
roads = roads.to_crs(3857)
roads = roads[roads["class"] != "service"]
```
Reuse the same variable when earlier intermediate data is no longer needed.
Simplify or clean geometries
If full geometry precision is not required, simplify first.
```python
import geopandas as gpd

zones = gpd.read_file("data/zones.gpkg")
zones["geometry"] = zones.geometry.simplify(tolerance=5, preserve_topology=True)
```
Repair invalid geometries before expensive operations.
```python
zones = zones[zones.geometry.notna()].copy()
zones["geometry] = zones.geometry.buffer(0)
```
Reproject only when necessary
Do not reproject every layer at the start unless the analysis actually needs it.
```python
import geopandas as gpd

parcels = gpd.read_file("data/parcels.gpkg")
small = parcels[parcels["land_use"] == "residential"]
small_3857 = small.to_crs(3857)
```
Filtering first and reprojecting later often lowers peak memory use.
Process data in smaller batches
Split by region, district, tile, or attribute group. If you batch with a district bounding box, expect extra features near the edges unless you clip or filter them afterward.
```python
import geopandas as gpd

districts = gpd.read_file("data/districts.gpkg")
for district_id in districts["district_id"].unique():
    district = districts[districts["district_id"] == district_id]
    bbox = district.total_bounds
    subset = gpd.read_file("data/buildings.gpkg", bbox=bbox)
    subset.to_parquet(f"temp/buildings_{district_id}.parquet")
```
5) Use safer approaches for memory-heavy operations
Filter both layers before a spatial join
Reduce both inputs before running the join.
```python
import geopandas as gpd

study_bbox = (500000, 4500000, 510000, 4510000)
addresses = gpd.read_file("data/addresses.gpkg", bbox=study_bbox)
schools = gpd.read_file("data/schools.gpkg", bbox=study_bbox)
joined = gpd.sjoin(
    addresses[["addr_id", "geometry"]],
    schools[["school_id", "geometry"]],
    predicate="intersects",
)
```
Use simpler operations instead of overlay when possible
If you only need a yes/no spatial relationship, a spatial join may be enough. Full overlay() is heavier because it creates new geometries.
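For instance, flagging which parcels touch a flood zone needs only sjoin(), not overlay(). A self-contained sketch with toy geometries:

```python
import geopandas as gpd
from shapely.geometry import Point, box

parcels = gpd.GeoDataFrame(
    {"parcel_id": [1, 2]},
    geometry=[Point(1, 1), Point(9, 9)],
    crs="EPSG:3857",
)
flood = gpd.GeoDataFrame(
    {"risk_class": ["high"]},
    geometry=[box(0, 0, 5, 5)],
    crs="EPSG:3857",
)

# sjoin only matches rows; it never builds new intersection geometries
flagged = gpd.sjoin(parcels, flood, how="left", predicate="intersects")
flagged["in_flood_zone"] = flagged["risk_class"].notna()
print(flagged[["parcel_id", "in_flood_zone"]])
```

The left join keeps every parcel, and a missing match simply shows up as a False flag, so no clipped geometry is ever created or stored.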
Reduce fields before dissolve
Keep only the grouping field and geometry before dissolve().
```python
import geopandas as gpd

land = gpd.read_file("data/landcover.gpkg")
land = land[["class_name", "geometry"]]
dissolved = land.dissolve(by="class_name")
```
Code examples
Convert a large source file to a better working format
```python
import geopandas as gpd

gdf = gpd.read_file("data/large_roads.shp")
gdf.to_file("work/large_roads.gpkg", driver="GPKG")
gdf.to_parquet("work/large_roads.parquet")
```
Write intermediate subsets to disk
```python
import geopandas as gpd

roads = gpd.read_parquet("work/large_roads.parquet")
subset = roads[roads["class"] == "primary"]
subset.to_parquet("temp/primary_roads.parquet")
del subset
```
Test a smaller workflow before the full run
```python
import geopandas as gpd

parcels = gpd.read_file("data/parcels.gpkg")
sample = parcels.head(5000).copy()
sample["geometry"] = sample.geometry.simplify(2, preserve_topology=True)
sample.to_parquet("temp/sample_parcels.parquet")
```
Explanation
GeoPandas stores geometry as in-memory objects, and those objects usually take much more memory than the file size on disk suggests. Text-based formats such as GeoJSON are especially expensive because the text must be parsed into geometry and attribute structures first. Heavy operations such as overlay(), buffer(), and dissolve() also create temporary geometry objects, which increases peak memory use.
That is why fixing a large-file GeoPandas memory issue usually means one of these changes:
- load less data
- reduce geometry complexity
- avoid expensive operations when a simpler one will work
- process the data in chunks instead of all at once
- use a more efficient storage format for intermediate work
In practice, the most useful fixes are usually:
- converting GeoJSON to GeoPackage or Parquet
- filtering by study area before large operations
- dropping unused columns early
- delaying to_crs() until it is needed
- writing intermediate outputs to disk
Edge cases or notes
Spatial filtering is not always exact or always available
Filtering with bbox= depends on the file format and driver. It is useful for reducing read size, but behavior can vary. Also, a bounding box returns features that intersect the box, which may include extra features near the boundary.
CRS mismatches can trigger unnecessary processing
If two layers use different coordinate reference systems, you may need to_crs() before a join or overlay. But reproject only the layers and steps that require it. Reprojecting a large dataset too early can trigger a memory spike.
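A cheap guard is to compare CRSs first and reproject only when they actually differ, preferably on the smaller layer. A minimal sketch:

```python
import geopandas as gpd
from shapely.geometry import Point

a = gpd.GeoDataFrame({"id": [1]}, geometry=[Point(0, 0)], crs="EPSG:4326")
b = gpd.GeoDataFrame({"id": [1]}, geometry=[Point(0, 0)], crs="EPSG:3857")

# Reproject only the smaller layer, and only when the CRSs differ
if a.crs != b.crs:
    b = b.to_crs(a.crs)

print(a.crs == b.crs)  # True
```

If the two layers already match, the check costs nothing and the expensive reprojection is skipped entirely.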
Invalid geometries increase processing cost
Invalid polygons can make overlay(), buffer(), and dissolve() slower and heavier. If a memory error appears during geometry processing, check geometry validity before rerunning the workflow.
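A quick validity check before the heavy step can be sketched like this; the bowtie polygon is a deliberately invalid toy input:

```python
import geopandas as gpd
from shapely.geometry import Polygon

# A self-intersecting "bowtie" polygon is invalid
bowtie = Polygon([(0, 0), (2, 2), (2, 0), (0, 2)])
zones = gpd.GeoDataFrame({"zone": [1]}, geometry=[bowtie], crs="EPSG:3857")

print(int(zones.is_valid.sum()), "valid of", len(zones))

# buffer(0) is a common repair; newer GeoPandas versions also offer make_valid()
zones["geometry"] = zones.geometry.buffer(0)
print(bool(zones.is_valid.all()))  # True
```

Running this count on a sample first tells you whether a repair pass is worth the extra time before the full overlay or dissolve.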
Common pitfalls
- reading huge GeoJSON files when a GeoPackage version would work better
- keeping several intermediate GeoDataFrames in memory
- exporting large results directly to GeoJSON
- using full overlay() when sjoin() or a simple spatial filter is enough
- assuming a 1 GB file only needs 1 GB of RAM
If the data is valid but still too large, redesign the workflow around subsets, tiles, or regions.
Internal links
For background, see How GeoPandas Handles Vector Data in Python.
Troubleshooting and related performance guides:
- Fix Slow Spatial Join in GeoPandas
- How to Fix CRS Mismatch in GeoPandas
- GeoPandas Not Reading Shapefile: Common Causes and Fixes
- How to Speed Up GeoPandas: Tips for Large Datasets
FAQ
Why does GeoPandas use much more memory than the file size on disk?
Because the file is parsed into in-memory geometry objects and attribute data. Text formats like GeoJSON are especially inefficient, and heavy operations create temporary objects that increase peak memory use.
What file format is best when GeoPandas runs out of memory?
For repeated Python workflows, Parquet is usually a strong choice. For desktop GIS workflows, GeoPackage is often a better working format than GeoJSON or shapefile.
How can I process a large shapefile in GeoPandas without loading everything at once?
Use spatial filtering where supported, split the work by tile or region, and write intermediate outputs to disk instead of keeping the whole workflow in memory.
Does simplifying geometry help fix a GeoPandas memory error?
Yes, when geometry complexity is part of the problem. Simplifying polygons or lines reduces vertex count and can make overlay, buffering, and export lighter.
Related articles
Keep exploring with more guides in this category.
How to Fix CRS Mismatch in GeoPandas
How to identify and fix CRS mismatch issues in GeoPandas using set_crs() and to_crs() before spatial operations.
GeoPandas Not Reading Shapefile: Common Causes and Fixes
Common causes and fixes for GeoPandas failing to read a shapefile, including missing files, paths, and encoding issues.
How to Speed Up GeoPandas: Tips for Large Datasets
GeoPandas works well for many GIS tasks, but performance often drops when datasets get larger.