Tutorial

Fixing Memory Errors in GeoPandas When Working with Large Files

Problem statement

A GeoPandas memory error usually appears when you try to read, process, or export a large spatial dataset and Python runs out of available RAM. Sometimes this appears as MemoryError. In other cases, the process becomes extremely slow, the kernel crashes, or the script exits during a heavy operation.

This commonly happens when working with:

  • large shapefiles or GeoJSON files
  • datasets with many attribute columns
  • very complex polygons or multipart geometries
  • memory-heavy operations such as to_crs(), sjoin(), overlay(), dissolve(), and buffer()
  • large exports to GeoJSON or shapefile

The practical goal is not just to avoid a crash. It is to reduce memory use enough to finish the GIS task reliably.

Quick answer

To fix a GeoPandas out-of-memory problem:

  • test a small subset first to confirm the file and workflow are valid
  • keep only the columns you need
  • reduce data early before expensive operations
  • prefer GeoPackage or Parquet over GeoJSON for repeated work
  • drop unused fields before joins, overlays, and dissolves
  • avoid unnecessary GeoDataFrame copies
  • simplify or repair geometries before heavy processing when appropriate
  • process large datasets in batches by tile, region, or group
  • write intermediate results to disk instead of holding everything in memory

Step-by-step solution

1) Check where the memory error happens

First, identify the step that fails: reading, processing, or writing.

Reading the file

If the error happens in read_file(), common causes include:

  • a very large GeoJSON
  • a shapefile with many fields
  • complex polygon geometry
  • an inefficient source format for the job

import geopandas as gpd

path = "data/large_parcels.geojson"
gdf = gpd.read_file(path)
print(len(gdf))

If this fails immediately, focus on reading less data or changing formats.
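
One quick diagnostic is to read only the first rows. read_file() accepts a rows argument, so you can check that the file parses at all without loading everything:

import geopandas as gpd

# read only the first 1,000 features as a low-memory smoke test
preview = gpd.read_file("data/large_parcels.geojson", rows=1000)
print(preview.head())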

Processing the data

If the file loads successfully but a later step fails, the memory spike usually comes from operations such as to_crs(), sjoin(), overlay(), dissolve(), or buffer().

import geopandas as gpd

parcels = gpd.read_file("data/parcels.gpkg")
flood = gpd.read_file("data/flood_zones.gpkg")

# the intersection step is typically where peak memory occurs
result = parcels.overlay(flood, how="intersection")

Writing the output

Large exports can also fail, especially to GeoJSON or shapefile.

import geopandas as gpd

result = gpd.read_parquet("work/result.parquet")
result.to_file("output/result.geojson", driver="GeoJSON")

If reading and processing work but export fails, keep intermediate data in a more efficient format and export the final result last.
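
As a sketch of that pattern, keep the heavy intermediate layer in Parquet and convert only the final, reduced result to GeoJSON (the status column here is a hypothetical filter):

import geopandas as gpd

# keep the intermediate result in an efficient binary format
result = gpd.read_parquet("work/result.parquet")

# export only the final, filtered layer to the expensive text format
final = result[result["status"] == "approved"]  # hypothetical column
final.to_file("output/final_result.geojson", driver="GeoJSON")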

2) Check file size and structure before loading everything

A file usually needs much more memory once it is parsed into geometry and attribute objects than its size on disk suggests.

Inspect file size on disk

from pathlib import Path

path = Path("data/large_buildings.geojson")
size_mb = path.stat().st_size / (1024 * 1024)
print(f"{size_mb:.1f} MB on disk")

Check geometry type and column count

Too many fields and complex geometry both increase memory use.

import geopandas as gpd

gdf = gpd.read_file("data/sample_area.gpkg")
print(gdf.geom_type.value_counts())
print(gdf.columns)

Test with a small sample

A small spatial sample helps confirm the workflow before running the full job.

import geopandas as gpd

bbox = (-74.05, 40.68, -73.85, 40.85)  # minx, miny, maxx, maxy
sample = gpd.read_file("data/large_buildings.gpkg", bbox=bbox)
print(len(sample))

If the format and driver support spatial filtering, this reads only the features that intersect the box instead of parsing the whole file.

3) Reduce memory use when reading large spatial files

Keep only needed columns

If you only need a few fields, reduce the GeoDataFrame immediately after reading.

import geopandas as gpd

gdf = gpd.read_file("data/parcels.gpkg")
gdf = gdf[["parcel_id", "land_use", "geometry"]]

This does not always reduce memory during the initial read, but it lowers later processing cost.
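
If the pyogrio engine is installed, you may also be able to skip unneeded attribute columns during the read itself; a sketch, assuming GeoPandas with pyogrio available:

import geopandas as gpd

# pyogrio can restrict attribute columns at read time; geometry is still read
gdf = gpd.read_file(
    "data/parcels.gpkg",
    engine="pyogrio",
    columns=["parcel_id", "land_use"],
)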

Use spatial filtering when supported

Bounding box filtering is useful for large workflows, especially when you only need one study area.

import geopandas as gpd

study_bbox = (500000, 4500000, 510000, 4510000)
roads = gpd.read_file("data/roads.gpkg", bbox=study_bbox)

This can reduce how much data is read when the file format and driver support spatial filtering.

Prefer more efficient formats

If you repeatedly use the same large GeoJSON or shapefile, convert it once and reuse the converted version.

import geopandas as gpd

gdf = gpd.read_file("data/large_parcels.geojson")
gdf.to_file("data/large_parcels.gpkg", driver="GPKG")
gdf.to_parquet("data/large_parcels.parquet")

GeoPackage works well for GIS desktop workflows. Parquet is often a better choice for repeated Python analysis.

4) Reduce memory use during processing

Drop unused columns before heavy operations

Before joins or overlays, remove fields you do not need.

parcels = parcels[["parcel_id", "geometry"]]
flood = flood[["risk_class", "geometry"]]

Avoid unnecessary copies

Every extra copy increases memory use.

# reassign in place instead of creating roads_3857, roads_filtered, ...
roads = roads.to_crs(3857)
roads = roads[roads["class"] != "service"]

Reuse the same variable when earlier intermediate data is no longer needed.
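
When an intermediate layer is genuinely finished, dropping the last reference to it lets Python reclaim the memory; a minimal sketch:

import gc

import geopandas as gpd

parcels = gpd.read_file("data/parcels.gpkg")
flood = gpd.read_file("data/flood_zones.gpkg")

result = parcels.overlay(flood, how="intersection")

del parcels, flood  # the inputs are no longer needed after the overlay
gc.collect()        # optional: encourage Python to release memory sooner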

Simplify or clean geometries

If full geometry precision is not required, simplify first.

import geopandas as gpd

zones = gpd.read_file("data/zones.gpkg")
zones["geometry"] = zones.geometry.simplify(tolerance=5, preserve_topology=True)

Repair invalid geometries before expensive operations.

zones = zones[zones.geometry.notna()].copy()
zones["geometry"] = zones.geometry.buffer(0)

Reproject only when necessary

Do not reproject every layer at the start unless the analysis actually needs it.

import geopandas as gpd

parcels = gpd.read_file("data/parcels.gpkg")
small = parcels[parcels["land_use"] == "residential"]

small_3857 = small.to_crs(3857)

Filtering first and reprojecting later often lowers peak memory use.

Process data in smaller batches

Split by region, district, tile, or attribute group. If you batch with a district bounding box, expect extra features near the edges unless you clip or filter them afterward, as in the sketch after this loop.

import geopandas as gpd

districts = gpd.read_file("data/districts.gpkg")

for district_id in districts["district_id"].unique():
    district = districts[districts["district_id"] == district_id]
    bbox = tuple(district.total_bounds)  # (minx, miny, maxx, maxy)
    subset = gpd.read_file("data/buildings.gpkg", bbox=bbox)
    subset.to_parquet(f"temp/buildings_{district_id}.parquet")
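
To drop the edge features that a bbox read pulls in from neighboring districts, one option is to clip each subset against the district polygon before saving; a variant of the loop above using geopandas.clip:

import geopandas as gpd

districts = gpd.read_file("data/districts.gpkg")

for district_id in districts["district_id"].unique():
    district = districts[districts["district_id"] == district_id]
    subset = gpd.read_file("data/buildings.gpkg", bbox=tuple(district.total_bounds))
    # remove features the bbox read pulled in from outside the district
    subset = gpd.clip(subset, district)
    subset.to_parquet(f"temp/buildings_{district_id}.parquet")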

5) Use safer approaches for memory-heavy operations

Filter both layers before a spatial join

Reduce both inputs before running the join.

import geopandas as gpd

study_bbox = (500000, 4500000, 510000, 4510000)

addresses = gpd.read_file("data/addresses.gpkg", bbox=study_bbox)
schools = gpd.read_file("data/schools.gpkg", bbox=study_bbox)

joined = gpd.sjoin(
    addresses[["addr_id", "geometry"]],
    schools[["school_id", "geometry"]],
    predicate="intersects"
)

Use simpler operations instead of overlay when possible

If you only need a yes/no spatial relationship, a spatial join may be enough. Full overlay() is heavier because it creates new geometries.
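
For example, to flag parcels that fall in any flood zone, a left sjoin with an intersects predicate avoids building new intersection geometry; a sketch reusing the parcel and flood layers from step 1:

import geopandas as gpd

parcels = gpd.read_file("data/parcels.gpkg")
flood = gpd.read_file("data/flood_zones.gpkg")

# a left join keeps every parcel; unmatched rows get NaN in the join columns
flagged = gpd.sjoin(
    parcels[["parcel_id", "geometry"]],
    flood[["risk_class", "geometry"]],
    how="left",
    predicate="intersects",
)

# parcels intersecting several zones appear once per match
flagged["in_flood_zone"] = flagged["index_right"].notna()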

Reduce fields before dissolve

Keep only the grouping field and geometry before dissolve().

import geopandas as gpd

land = gpd.read_file("data/landcover.gpkg")
land = land[["class_name", "geometry"]]
dissolved = land.dissolve(by="class_name")

Code examples

Convert a large source file to a better working format

import geopandas as gpd

gdf = gpd.read_file("data/large_roads.shp")
gdf.to_file("work/large_roads.gpkg", driver="GPKG")
gdf.to_parquet("work/large_roads.parquet")

Write intermediate subsets to disk

import geopandas as gpd

roads = gpd.read_parquet("work/large_roads.parquet")
subset = roads[roads["class"] == "primary"]
subset.to_parquet("temp/primary_roads.parquet")
del subset

Test a smaller workflow before the full run

import geopandas as gpd

parcels = gpd.read_file("data/parcels.gpkg")
sample = parcels.head(5000).copy()
sample["geometry"] = sample.geometry.simplify(2, preserve_topology=True)
sample.to_parquet("temp/sample_parcels.parquet")

Explanation

GeoPandas stores geometry as in-memory objects, and those objects usually take much more memory than the file size on disk suggests. Text-based formats such as GeoJSON are especially expensive because the text must be parsed into geometry and attribute structures first. Heavy operations such as overlay(), buffer(), and dissolve() also create temporary geometry objects, which increases peak memory use.
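
To see the gap, compare file size on disk with an in-memory estimate. pandas' memory_usage(deep=True) gives a rough figure, though it can undercount the geometry column because geometries are Python objects:

import geopandas as gpd

gdf = gpd.read_file("data/sample_area.gpkg")

mb = gdf.memory_usage(deep=True).sum() / (1024 * 1024)
print(f"approx. {mb:.1f} MB in memory")  # geometry column is likely undercounted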

That is why fixing a large-file GeoPandas memory issue usually means one of these changes:

  • load less data
  • reduce geometry complexity
  • avoid expensive operations when a simpler one will work
  • process the data in chunks instead of all at once
  • use a more efficient storage format for intermediate work

In practice, the most useful fixes are usually:

  • converting GeoJSON to GeoPackage or Parquet
  • filtering by study area before large operations
  • dropping unused columns early
  • delaying to_crs() until it is needed
  • writing intermediate outputs to disk

Edge cases or notes

Spatial filtering is not always exact or always available

bbox= filtering depends on the file format and driver support. It is useful for reducing read size, but behavior can vary. Also, a bounding box returns features that intersect the box, which may include extra features near the boundary.
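
If the selection must be exact, follow the bbox read with a precise predicate against the box geometry; a sketch using shapely:

import geopandas as gpd
from shapely.geometry import box

bbox = (-74.05, 40.68, -73.85, 40.85)
gdf = gpd.read_file("data/large_buildings.gpkg", bbox=bbox)

# keep only features strictly inside the box, not merely intersecting it
window = box(*bbox)
inside = gdf[gdf.within(window)]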

CRS mismatches can trigger unnecessary processing

If two layers use different coordinate reference systems, you may need to_crs() before a join or overlay. But reproject only the layers and steps that require it. Reprojecting a large dataset too early can trigger a memory spike.
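
A simple guard is to compare CRS values and reproject the smaller layer, so the large one stays untouched; a sketch:

import geopandas as gpd

parcels = gpd.read_file("data/parcels.gpkg")    # large layer
flood = gpd.read_file("data/flood_zones.gpkg")  # smaller layer

# reproject the smaller layer into the larger layer's CRS
if parcels.crs != flood.crs:
    flood = flood.to_crs(parcels.crs)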

Invalid geometries increase processing cost

Invalid polygons can make overlay(), buffer(), and dissolve() slower and heavier. If a memory error appears during geometry processing, check geometry validity before rerunning the workflow.
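
To check validity up front, count invalid geometries and repair only those rows; make_valid() assumes a recent GeoPandas and Shapely, with buffer(0) as the older fallback shown earlier:

import geopandas as gpd

zones = gpd.read_file("data/zones.gpkg")

invalid = ~zones.is_valid
print(invalid.sum(), "invalid geometries")

# repair only the invalid rows
zones.loc[invalid, "geometry"] = zones.geometry[invalid].make_valid()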

Common pitfalls

  • reading huge GeoJSON files when a GeoPackage version would work better
  • keeping several intermediate GeoDataFrames in memory
  • exporting large results directly to GeoJSON
  • using full overlay() when sjoin() or a simple spatial filter is enough
  • assuming a 1 GB file only needs 1 GB of RAM

If the data is valid but still too large, redesign the workflow around subsets, tiles, or regions.

Internal links

For background, see How GeoPandas Handles Vector Data in Python.

FAQ

Why does GeoPandas use much more memory than the file size on disk?

Because the file is parsed into in-memory geometry objects and attribute data. Text formats like GeoJSON are especially inefficient, and heavy operations create temporary objects that increase peak memory use.

What file format is best when GeoPandas runs out of memory?

For repeated Python workflows, Parquet is usually a strong choice. For desktop GIS workflows, GeoPackage is often a better working format than GeoJSON or shapefile.

How can I process a large shapefile in GeoPandas without loading everything at once?

Use spatial filtering where supported, split the work by tile or region, and write intermediate outputs to disk instead of keeping the whole workflow in memory.

Does simplifying geometry help fix a GeoPandas memory error?

Yes, when geometry complexity is part of the problem. Simplifying polygons or lines reduces vertex count and can make overlay, buffering, and export lighter.
