DEM Pit Filling Algorithms: Python Implementation and Validation

Raw digital elevation models contain topographic depressions that block every flow routing algorithm. These depressions, called pits or sinks, arise from sensor noise, interpolation artifacts, radar shadowing, and genuine geomorphic features like kettle lakes. Handling them correctly is a non-negotiable early step in the Hydrology Data Preparation & DEM Processing pipeline: it follows coordinate reference system alignment and sits upstream of flow direction computation and watershed delineation. Choose the wrong algorithm or apply it without validation and every hydrological metric derived downstream inherits the error. For spatial resolution considerations that interact with sink frequency, see spatial resolution trade-offs in DEM preprocessing.

This page covers the three principal depression-filling algorithms, decision logic for selecting among them, a production-ready richdem implementation, and a formal validation protocol. For high-resolution LiDAR-specific considerations — where micro-depressions representing real hydrological features must be preserved — see best practices for filling sinks in high-resolution LiDAR data.

Prerequisites & Environment Setup

Hydrological raster operations demand strict data integrity, predictable memory allocation, and reproducible dependency management. Install the required stack into an isolated environment before touching production data.

bash

# conda-forge provides pre-compiled richdem packages for Linux, macOS, and Windows
conda create -n hydro python=3.11
conda activate hydro
conda install -c conda-forge richdem rasterio numpy scipy

System resource notes:

16 GB RAM minimum for regional LiDAR-derived DEMs at 1–3 m resolution
32–64 GB recommended for statewide or multi-HUC8 mosaics
SSD scratch space at least 3× the input DEM size (filled copy + difference raster + flow direction output)
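
As a rough way to sanity-check these figures before loading a raster, the sketch below estimates the in-memory array size and scratch space from the grid dimensions. The function name `estimate_dem_footprint` is hypothetical, the 3× multiplier is taken from the note above, and the estimate ignores library overhead and temporary copies, so treat the result as a lower bound.

```python
def estimate_dem_footprint(width: int, height: int, bytes_per_cell: int = 8,
                           scratch_multiplier: float = 3.0) -> dict:
    """Rough RAM and scratch estimate for one single-band DEM (illustrative)."""
    cells = width * height
    array_bytes = cells * bytes_per_cell  # 8 bytes per cell for float64
    gib = 1024 ** 3
    return {
        "cells": cells,
        "array_gib": array_bytes / gib,
        "scratch_gib": array_bytes * scratch_multiplier / gib,
    }


# Hypothetical 40,000 x 40,000 float64 mosaic
est = estimate_dem_footprint(40_000, 40_000)
print(f"{est['cells'] / 1e6:.0f} M cells | {est['array_gib']:.1f} GiB in RAM | "
      f"{est['scratch_gib']:.1f} GiB scratch")
```

Pass `bytes_per_cell=4` for float32 input; the working copy used during filling may still be float64.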

Input data specifications before running any fill:

Requirement	Rationale
Single-band raster, `float32` or `float64`	Integer elevation causes precision loss in spill-point arithmetic
Projected CRS (UTM, Albers, State Plane)	Geographic coordinates produce incorrect slope and area calculations
Explicit nodata value documented in raster header	Prevents nodata cells from being raised during fill
Histogram check: no extreme outliers	Spurious max/min values create artificial depressions thousands of metres deep
Verified against source tile checksums	Silent data corruption causes reproducibility failures

Review your source specifications in SRTM and LiDAR Data Acquisition to anticipate depression frequency, vertical accuracy, and sensor-specific noise patterns before selecting fill parameters.

Algorithm Mechanics

Three algorithms dominate production DEM preprocessing. They differ in computational complexity, memory footprint, and how aggressively they modify terrain.

Priority-Flood (Barnes et al., 2014)

Priority-Flood uses a min-heap priority queue seeded from raster boundary cells. The algorithm processes cells in strict elevation order, propagating inward from the raster edge. Any cell found lower than its already-processed neighbour is raised to match that neighbour’s elevation — the minimum change needed to guarantee drainage continuity.

Complexity: O(N log N) where N is total cell count. For a 10,000 × 10,000 raster (100 M cells), N log₂ N is roughly 2.7 billion heap comparisons (log₂ of 10⁸ is about 26.6), which a modern workstation typically completes in minutes.

Why it produces minimal terrain change: Because cells are processed lowest-first, the algorithm only raises a depression to its lowest possible spill point. It never raises terrain higher than strictly necessary. Downstream flow paths are created by the minimum modification that removes the sink.

Edge cases:

Flat areas after filling: multiple cells raised to the same elevation produce undefined D8 flow direction assignments. Apply a gradient enforcement step (Barnes’ epsilon variant or rd.ResolveFlats) after filling.
Endorheic basins: true closed basins (playas, dry lakes) are filled identically to artifact sinks. Where real internal drainage must be preserved, mask these features before running fill.
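
The mechanics described above can be sketched in a few lines of pure Python. This is a minimal illustration of the basic Priority-Flood idea, not richdem's implementation: it works on a list of rows, ignores nodata, and applies no epsilon gradient. The function name `priority_flood_fill` is hypothetical.

```python
import heapq


def priority_flood_fill(dem):
    """Return a depression-filled copy of `dem` (list of rows of floats)."""
    rows, cols = len(dem), len(dem[0])
    filled = [row[:] for row in dem]
    closed = [[False] * cols for _ in range(rows)]
    heap = []

    # Seed the queue with every edge cell; these can always drain off-grid.
    for r in range(rows):
        for c in range(cols):
            if r in (0, rows - 1) or c in (0, cols - 1):
                heapq.heappush(heap, (filled[r][c], r, c))
                closed[r][c] = True

    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
    while heap:
        elev, r, c = heapq.heappop(heap)  # lowest unprocessed frontier cell
        for dr, dc in neighbours:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and not closed[nr][nc]:
                # A neighbour lower than the current spill level is raised to it.
                filled[nr][nc] = max(filled[nr][nc], elev)
                closed[nr][nc] = True
                heapq.heappush(heap, (filled[nr][nc], nr, nc))
    return filled


dem = [
    [5.0, 5.0, 5.0, 5.0, 5.0],
    [5.0, 2.0, 1.0, 2.0, 5.0],
    [5.0, 2.0, 0.0, 2.0, 4.0],  # the 4.0 on the right edge is the spill point
    [5.0, 2.0, 1.0, 2.0, 5.0],
    [5.0, 5.0, 5.0, 5.0, 5.0],
]
filled = priority_flood_fill(dem)
print(filled[2])  # → [5.0, 4.0, 4.0, 4.0, 4.0]
```

The interior is raised only to 4.0, the lowest edge elevation, which is the minimal change discussed above. The result is also a perfectly flat surface, which is why a flat-resolution step normally follows.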

Wang & Liu (2006)

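A spill-elevation algorithm. As published, it grows inward from the raster edge along least-cost paths using a priority queue, which gives O(N log N) behaviour, and Barnes et al. (2014) cite it among the earlier descriptions of the approach they generalise as Priority-Flood. Simplified implementations that instead rescan the raster until no cell changes are easier to write but approach O(N²) in the worst case, which makes them prohibitively slow on grids larger than roughly 5 million cells.

A faithful implementation raises each depression only to its spill elevation, so its output matches Priority-Flood. If an implementation over-fills gently sloping terrain, treat that as an implementation defect rather than a property of the published method. A compact Wang & Liu implementation remains useful for small watersheds (under 500 km² at 10 m resolution) where algorithmic simplicity and interpretability outweigh performance.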

Planchon & Darboux (2002)

A recursive algorithm that initialises all non-boundary cells to a large fill value, then propagates elevation downward from boundary edges. Elegant in theory but carries significant call-stack overhead on large grids, and its memory requirement scales poorly with raster extent. Modern libraries rarely expose it as a first-class option.
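
To make the "flood, then drain" idea concrete, here is a minimal sketch of the direct (non-recursive) form of the method: the interior starts at infinity and is lowered on repeated passes until nothing changes. It is an illustration only, written for a list of rows with no nodata handling; the function name `planchon_darboux_fill` and the `epsilon` default are assumptions, and the repeated full-grid passes are exactly what makes this form slow on large rasters.

```python
import math


def planchon_darboux_fill(dem, epsilon=0.0):
    """Fill depressions by flooding the interior, then draining it iteratively."""
    rows, cols = len(dem), len(dem[0])
    # Stage 1: edge cells keep their elevation, interior cells start at +inf.
    water = [
        [dem[r][c] if r in (0, rows - 1) or c in (0, cols - 1) else math.inf
         for c in range(cols)]
        for r in range(rows)
    ]
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]

    # Stage 2: lower the water surface until a full pass changes nothing.
    changed = True
    while changed:
        changed = False
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                if water[r][c] <= dem[r][c]:
                    continue  # cell is already dry
                for dr, dc in neighbours:
                    level = water[r + dr][c + dc] + epsilon
                    if dem[r][c] >= level:
                        water[r][c] = dem[r][c]  # drains fully to bare terrain
                        changed = True
                        break
                    if water[r][c] > level:
                        water[r][c] = level  # drains down to the neighbour's level
                        changed = True
    return water


dem = [
    [5.0, 5.0, 5.0, 5.0, 5.0],
    [5.0, 2.0, 1.0, 2.0, 5.0],
    [5.0, 2.0, 0.0, 2.0, 4.0],
    [5.0, 2.0, 1.0, 2.0, 5.0],
    [5.0, 5.0, 5.0, 5.0, 5.0],
]
pd_filled = planchon_darboux_fill(dem)
print(pd_filled[2])
```

With `epsilon=0.0` the interior settles at the spill elevation of 4.0; a small positive `epsilon` would instead leave a slight gradient toward the outlet.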

Algorithm Selection Decision Guide

The comparison table below summarises the key trade-offs when choosing a fill algorithm for a given dataset.

Algorithm comparison

Algorithm	Complexity	Memory	Terrain Modification	Best for
Priority-Flood	O(N log N)	O(N)	Minimal (spill-point only)	All production workflows
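Wang & Liu	O(N log N) as published; O(N²) worst case for naive rescanning variants	O(N)	Spill-point only when implemented faithfully	Small DEMs, teaching/prototyping
Planchon & Darboux	Super-linear (about O(N^1.2) to O(N^1.5) reported)	O(N) high constant	Moderate	Rarely used in production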
Breach + Fill hybrid	O(N log N)	O(N)	Minimal + carved channels	LiDAR, urban drainage

Step-by-Step Workflow

Step 1: Validate input raster

python

import logging
import rasterio
import numpy as np
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
logger = logging.getLogger(__name__)


def validate_dem(path: Path) -> dict:
    """Return metadata dict; raise on any disqualifying condition."""
    with rasterio.open(path) as src:
        meta = {
            "crs": src.crs,
            "dtype": src.dtypes[0],
            "nodata": src.nodata,
            "width": src.width,
            "height": src.height,
            "cell_count": src.width * src.height,
        }
        if src.crs is None:
            raise ValueError(f"No CRS defined in {path}. Run CRS alignment first.")
        if src.crs.is_geographic:
            logger.warning(
                "Geographic CRS detected (%s). Reproject to a projected CRS before "
                "computing areas or slopes.", src.crs
            )
        if src.nodata is None:
            logger.warning(
                "No nodata value defined; nodata cells may be treated as valid elevation."
            )
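        arr = src.read(1)
        if src.nodata is None:
            valid = arr.ravel()
        elif np.isnan(src.nodata):
            valid = arr[~np.isnan(arr)]  # NaN never compares equal to itself
        else:
            valid = arr[arr != src.nodata]
        if valid.size == 0:
            raise ValueError(f"No valid elevation cells found in {path}.")
        meta["elev_min"] = float(valid.min())
        meta["elev_max"] = float(valid.max())
        meta["elev_range_m"] = meta["elev_max"] - meta["elev_min"]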
    logger.info(
        "DEM validated: %d × %d cells | elev %.1f–%.1f m | CRS %s",
        meta["width"], meta["height"], meta["elev_min"], meta["elev_max"],
        meta["crs"].to_epsg(),
    )
    return meta

Step 2: Choose algorithm based on grid size

python

def select_algorithm(cell_count: int, resolution_m: float, source: str = "unknown") -> str:
    """
    Return recommended fill algorithm identifier given dataset characteristics.

    Parameters
    ----------
    cell_count   : total cells (width × height)
    resolution_m : native cell size in metres
    source       : 'lidar', 'srtm', 'copernicus', or 'unknown'
    """
    if cell_count > 5_000_000:
        algo = "priority_flood"
        logger.info("Large grid (%d M cells): selecting Priority-Flood.", cell_count // 1_000_000)
    elif source.lower() == "lidar" and resolution_m < 2.0:
        algo = "priority_flood"
        logger.info("Sub-2 m LiDAR: selecting Priority-Flood with flat-area resolution.")
    else:
        algo = "priority_flood"  # always the safe default
        logger.info("Standard grid: Priority-Flood selected (safe default).")
    return algo

Step 3: Execute fill with `richdem`

python

import richdem as rd


def fill_depressions(input_path: Path, output_path: Path, resolve_flats: bool = True) -> None:
    """
    Fill topographic depressions using Priority-Flood.

    Parameters
    ----------
    input_path    : path to validated input DEM (float32 GeoTIFF)
    output_path   : destination for hydrologically conditioned DEM
    resolve_flats : if True, apply Barnes epsilon gradient to flat areas after filling
    """
    logger.info("Loading DEM from %s", input_path)
    with rasterio.open(input_path) as src:
        profile = src.profile.copy()
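        nodata = src.nodata if src.nodata is not None else -9999.0
        # Work in float32, the output dtype: flat resolution nudges cells by very
        # small increments that a later float64 -> float32 cast would round away.
        dem_array = src.read(1).astype(np.float32)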
        profile.update(
            dtype="float32",
            compress="lzw",
            tiled=True,
            blockxsize=512,
            blockysize=512,
            nodata=nodata,
        )

    # Isolate valid data from nodata before handing to richdem
    valid_mask = dem_array != nodata
    dem_array[~valid_mask] = -9999.0

    rd_dem = rd.rdarray(dem_array, no_data=-9999.0)
    logger.info(
        "Running Priority-Flood fill (%d × %d cells)", rd_dem.shape[1], rd_dem.shape[0]
    )
    filled = rd.FillDepressions(rd_dem, in_place=False)

    if resolve_flats:
        logger.info("Resolving flat areas with epsilon gradient.")
        rd.ResolveFlats(filled, in_place=True)

    filled_array = np.array(filled, dtype=np.float32)
    filled_array[~valid_mask] = nodata  # restore nodata footprint exactly

    output_path.parent.mkdir(parents=True, exist_ok=True)
    with rasterio.open(output_path, "w", **profile) as dst:
        dst.write(filled_array, 1)
    logger.info("Filled DEM written to %s", output_path)

Step 4: Generate validation artefacts

python

def write_difference_raster(
    original_path: Path, filled_path: Path, diff_path: Path
) -> dict:
    """
    Subtract original from filled DEM. Returns statistics for QA logging.
    Positive values indicate cells that were raised; negatives should be zero.
    """
    with rasterio.open(original_path) as src_o, rasterio.open(filled_path) as src_f:
        orig = src_o.read(1).astype(np.float32)
        fill = src_f.read(1).astype(np.float32)
        profile = src_o.profile.copy()
        # "or" would wrongly replace a legitimate nodata value of 0
        nodata = src_o.nodata if src_o.nodata is not None else -9999.0
        valid = orig != nodata

    diff = fill - orig
    diff[~valid] = nodata

    stats = {
        "cells_modified": int((diff[valid] > 0).sum()),
        "max_raise_m": float(diff[valid].max()),
        "mean_raise_m": (
            float(diff[valid][diff[valid] > 0].mean())
            if (diff[valid] > 0).any() else 0.0
        ),
        "fraction_modified": float((diff[valid] > 0).mean()),
    }
    logger.info(
        "Difference raster stats: %d cells raised | max raise %.3f m | %.2f%% of valid area",
        stats["cells_modified"], stats["max_raise_m"], stats["fraction_modified"] * 100,
    )

    profile.update(dtype="float32", nodata=nodata)
    with rasterio.open(diff_path, "w", **profile) as dst:
        dst.write(diff, 1)
    logger.info("Difference raster written to %s", diff_path)
    return stats

Production-Ready Code

The function below integrates all steps with structured logging, provenance metadata, and a guard against destructive over-filling. It assumes a single-band GeoTIFF that fits in memory; adjust the raise threshold and nodata handling for your data before adopting it in a pipeline.

python

import hashlib
import json
import logging
import time
from datetime import datetime, timezone
from pathlib import Path

import numpy as np
import rasterio
import richdem as rd

logger = logging.getLogger(__name__)


def process_dem_pit_filling(
    input_path: str | Path,
    output_dir: str | Path,
    resolve_flats: bool = True,
    max_raise_threshold_m: float = 50.0,
) -> dict:
    """
    End-to-end DEM depression filling with validation and provenance logging.

    Parameters
    ----------
    input_path            : path to raw DEM (float32/float64 GeoTIFF, projected CRS)
    output_dir            : directory for filled DEM, difference raster, and provenance JSON
    resolve_flats         : apply Barnes epsilon gradient post-fill (strongly recommended)
    max_raise_threshold_m : abort if any single cell is raised by more than this value;
                            indicates unexpected data quality issues

    Returns
    -------
    dict with output paths and QA statistics
    """
    input_path = Path(input_path)
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    stem = input_path.stem
    filled_path = output_dir / f"{stem}_filled.tif"
    diff_path = output_dir / f"{stem}_fill_diff.tif"
    prov_path = output_dir / f"{stem}_provenance.json"

    # --- Input checksum ---
    sha256 = hashlib.sha256(input_path.read_bytes()).hexdigest()
    logger.info("Input checksum (SHA-256): %s", sha256)

    # --- Load and validate ---
    t0 = time.monotonic()
    with rasterio.open(input_path) as src:
        profile = src.profile.copy()
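        nodata = src.nodata if src.nodata is not None else -9999.0
        # Work in float32, the output dtype: flat resolution nudges cells by very
        # small increments that a later float64 -> float32 cast would round away.
        dem_array = src.read(1).astype(np.float32)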
        crs_wkt = src.crs.to_wkt() if src.crs else None
        transform_str = str(src.transform)

    valid_mask = dem_array != nodata
    cell_count = int(valid_mask.sum())
    logger.info(
        "Loaded %s — %d valid cells | nodata=%.0f | CRS defined: %s",
        input_path.name, cell_count, nodata, crs_wkt is not None,
    )

    # --- Fill ---
    dem_array[~valid_mask] = -9999.0
    rd_dem = rd.rdarray(dem_array, no_data=-9999.0)
    logger.info("Executing Priority-Flood fill...")
    filled_rd = rd.FillDepressions(rd_dem, in_place=False)

    if resolve_flats:
        rd.ResolveFlats(filled_rd, in_place=True)
        logger.info("Flat areas resolved with epsilon gradient.")

    filled_array = np.array(filled_rd, dtype=np.float32)

    # --- Guard: check maximum raise ---
    diff = filled_array - dem_array.astype(np.float32)
    max_raise = float(diff[valid_mask].max())
    if max_raise > max_raise_threshold_m:
        raise RuntimeError(
            f"Maximum cell raise ({max_raise:.1f} m) exceeds threshold "
            f"({max_raise_threshold_m} m). Check for nodata leakage or extreme outliers."
        )

    filled_array[~valid_mask] = nodata
    diff[~valid_mask] = nodata

    # --- Write outputs ---
    profile.update(
        dtype="float32", compress="lzw", tiled=True,
        blockxsize=512, blockysize=512, nodata=nodata,
    )
    with rasterio.open(filled_path, "w", **profile) as dst:
        dst.write(filled_array, 1)
    logger.info("Filled DEM -> %s", filled_path)

    with rasterio.open(diff_path, "w", **profile) as dst:
        dst.write(diff, 1)
    logger.info("Difference raster -> %s", diff_path)

    elapsed = time.monotonic() - t0
    qa_stats = {
        "cells_raised": int((diff[valid_mask] > 0).sum()),
        "max_raise_m": max_raise,
        "fraction_modified": float((diff[valid_mask] > 0).mean()),
        "elapsed_seconds": round(elapsed, 2),
    }

    # --- Provenance ---
    provenance = {
        "input": str(input_path.resolve()),
        "input_sha256": sha256,
        "output_filled": str(filled_path.resolve()),
        "output_diff": str(diff_path.resolve()),
        "algorithm": "Priority-Flood (Barnes et al. 2014) via richdem",
        "resolve_flats": resolve_flats,
        "crs_wkt": crs_wkt,
        "transform": transform_str,
        "nodata": nodata,
        "qa": qa_stats,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
    }
    prov_path.write_text(json.dumps(provenance, indent=2))
    logger.info("Provenance written -> %s | elapsed %.1f s", prov_path, elapsed)

    return {
        "filled_path": filled_path,
        "diff_path": diff_path,
        "provenance_path": prov_path,
        "qa": qa_stats,
    }
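
The checksum step in the function above reads the whole file into memory with `read_bytes()`, which can be costly for multi-gigabyte DEMs. A possible alternative, sketched here with only the standard library, hashes the file in fixed-size chunks. The helper name `sha256_of_file` and the 8 MiB chunk size are illustrative choices, not part of the original function.

```python
import hashlib


def sha256_of_file(path, chunk_bytes=8 * 1024 * 1024):
    """Return the SHA-256 hex digest of a file without loading it all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"" at EOF.
        for chunk in iter(lambda: fh.read(chunk_bytes), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The digest is identical to the one produced by hashing the full byte string, so existing provenance records remain comparable.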

Validation Protocol

Filling a DEM is only half the task. The following protocol verifies hydrological soundness before the conditioned surface enters flow routing or watershed delineation.

1. Difference raster analysis

Inspect the difference raster produced by write_difference_raster. Any cell with a negative value indicates the filled DEM is lower than the original — this should never occur with a correctly implemented fill. Large contiguous raised areas (over 1 m in low-relief terrain, over 5 m in mountainous terrain) indicate either a data outlier in the original or an aggressive fill that may be smoothing legitimate closed basins.

2. Flow direction check

Run a D8 flow direction algorithm on the filled raster. Every valid cell must carry a defined flow direction (no undefined or stagnant cells). If flow accumulation computed with rd.FlowAccumulation shows flow paths that stop inside the domain instead of continuing to an outlet, flat-area resolution was skipped or failed.
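
One library-independent way to approximate this check is to count interior cells that have no strictly lower 8-neighbour, since D8 cannot assign such cells a downslope direction. The sketch below is a simplified stand-in that operates on a list of rows and ignores nodata; the function name `count_undrained_cells` is hypothetical.

```python
def count_undrained_cells(dem):
    """Count interior cells with no strictly lower 8-neighbour (pits or flats)."""
    rows, cols = len(dem), len(dem[0])
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]
    stuck = 0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            here = dem[r][c]
            if not any(dem[r + dr][c + dc] < here for dr, dc in neighbours):
                stuck += 1
    return stuck


raw = [
    [5.0, 5.0, 5.0, 5.0, 5.0],
    [5.0, 2.0, 1.0, 2.0, 5.0],
    [5.0, 2.0, 0.0, 2.0, 4.0],
    [5.0, 2.0, 1.0, 2.0, 5.0],
    [5.0, 5.0, 5.0, 5.0, 5.0],
]
# The same grid after filling to the 4.0 spill level, before flat resolution.
filled_flat = [row[:] for row in raw]
for r in range(1, 4):
    for c in range(1, 4):
        filled_flat[r][c] = 4.0

print(count_undrained_cells(raw))          # one true pit at the centre
print(count_undrained_cells(filled_flat))  # every filled cell sits on a flat
```

The filled grid reports more stuck cells than the raw one because filling replaces a single pit with a flat, which is the situation flat-area resolution is meant to correct.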

3. Stream network overlay

Extract a stream network using a drainage area threshold appropriate to your terrain (see stream threshold tuning for guidance) and overlay it against the National Hydrography Dataset or equivalent authoritative vector hydrography. Systematic lateral offsets trace to CRS misalignment; missing tributaries indicate over-filling or an inappropriate threshold.

4. Edge spill verification

All flow accumulation paths must terminate at the raster boundary (or at designated endorheic outlets). Interior termination points that are not designated basin sinks indicate residual depressions that were not filled.

5. Statistical thresholds (reference benchmarks)

Metric	Acceptable range	Action if exceeded
`fraction_modified`	< 5% of valid cells	Review nodata mask and outlier cells
`max_raise_m`	< 20 m (coarse DEM) / < 5 m (LiDAR)	Check for data voids treated as valid cells
Cells with undefined D8 direction	0	Re-run with `resolve_flats=True`
Stream network offset vs NHD	< 1 cell width	Verify CRS and datum alignment
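
The thresholds in the table can be applied mechanically to the statistics dictionary returned by write_difference_raster. The sketch below assumes that dictionary's `fraction_modified` and `max_raise_m` keys; the function name `evaluate_qa`, the `source` labels, and the `undefined_d8_cells` argument are hypothetical, and the limits are the reference benchmarks above rather than universal constants.

```python
def evaluate_qa(stats, source="coarse", undefined_d8_cells=0):
    """Return a list of human-readable QA failures (empty list means pass)."""
    max_raise_limits = {"coarse": 20.0, "lidar": 5.0}  # metres, from the table
    failures = []
    if stats["fraction_modified"] >= 0.05:
        failures.append("fraction_modified: review nodata mask and outlier cells")
    if stats["max_raise_m"] >= max_raise_limits[source]:
        failures.append("max_raise_m: check for data voids treated as valid cells")
    if undefined_d8_cells > 0:
        failures.append("undefined D8 direction: re-run with resolve_flats=True")
    return failures


ok = evaluate_qa({"fraction_modified": 0.02, "max_raise_m": 3.0}, source="lidar")
bad = evaluate_qa({"fraction_modified": 0.08, "max_raise_m": 6.0},
                  source="lidar", undefined_d8_cells=4)
print(ok)
print(len(bad), "checks failed")
```

The stream network offset check is left out because it needs vector hydrography rather than raster statistics.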

Common Failure Modes & Optimization

Memory exhaustion on large grids

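richdem loads the entire array into RAM. A 30 m DEM covering the continental United States at float64 exceeds 50 GB. Split the input into overlapping rectangular tiles with gdal_translate (for example via -srcwin), using overlap buffers of at least 500 m, fill each tile independently, then mosaic the outputs with rasterio.merge.merge and method="first" to resolve the seam areas. A fixed buffer cannot guarantee correct spill levels for depressions wider than the overlap, so inspect the seams in the difference raster after mosaicking.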
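
The tiling arithmetic can be sketched without any geospatial library. The generator below yields pixel windows as (column offset, row offset, width, height), the same order gdal_translate's -srcwin option expects. The name `overlapping_windows` and the tile size are illustrative assumptions; the 17-cell overlap is simply 500 m rounded up at 30 m resolution.

```python
import math


def overlapping_windows(width, height, tile_size, overlap):
    """Yield (col_off, row_off, win_width, win_height) windows in pixels."""
    for row0 in range(0, height, tile_size):
        for col0 in range(0, width, tile_size):
            # Grow each core tile by the overlap, clamped to the raster extent.
            c0 = max(col0 - overlap, 0)
            r0 = max(row0 - overlap, 0)
            c1 = min(col0 + tile_size + overlap, width)
            r1 = min(row0 + tile_size + overlap, height)
            yield (c0, r0, c1 - c0, r1 - r0)


overlap_cells = math.ceil(500 / 30)  # 500 m buffer on a 30 m grid
windows = list(overlapping_windows(100, 60, tile_size=50, overlap=5))
print(overlap_cells, windows)
```

Each window can then be passed to gdal_translate or read with a rasterio window; the core (non-overlap) part of every tile is what should take priority when the filled tiles are mosaicked.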

Flat-area stagnation

Priority-Flood guarantees continuous drainage paths but cannot assign a unique steepest-descent direction when multiple cells are raised to the same elevation. The result is undefined D8 direction in filled depressions. Always set resolve_flats=True or call rd.ResolveFlats separately. Skipping this step causes flow accumulation to stagnate and D-Infinity routing patterns to behave erratically.

Nodata treated as valid elevation

If the raster’s nodata value is not explicitly declared in the GeoTIFF header, the sentinel (for example -9999) is read as real elevation. richdem then treats void regions as extremely low terrain, which produces enormous artificial raises inside interior voids and artificial drainage corridors toward voids that touch the raster edge. Always specify no_data explicitly in the rd.rdarray constructor and confirm the value matches the raster’s actual nodata encoding.

Endorheic basin destruction

Priority-Flood fills all depressions to their spill point, including legitimate closed basins (playas, saline lakes, internally drained catchments in arid regions). Before running fill, identify and mask true endorheic basins using authoritative basin boundary datasets. Process masked areas separately or flag them in the provenance record.

Projection artifacts producing phantom depressions

Reprojecting a DEM from geographic to projected coordinates resamples every cell, and interpolating kernels such as bilinear or cubic can introduce sub-cell elevation noise that creates phantom depressions. Either reproject with a smoother kernel such as cubic spline and run a light smoothing pass before filling, or accept that Priority-Flood will remove these artifact pits automatically. Review resampling DEMs without losing hydrologic connectivity if resampling preceded your fill step.

Integer overflow in large accumulation grids

When flow accumulation values are stored as int32, cells draining large basins (over 2.1 billion cells upstream) will overflow. Use float64 or int64 for the accumulation raster when working with continental-scale mosaics.
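
The overflow limit follows directly from the int32 range. Because a cell's accumulation in cell counts can never exceed the number of cells in the grid, the total cell count gives a conservative test. The helper name `accumulation_dtype` is hypothetical.

```python
INT32_MAX = 2**31 - 1  # 2,147,483,647, about 2.1 billion


def accumulation_dtype(total_cells: int) -> str:
    """Pick an integer dtype name that cannot overflow for cell-count accumulation."""
    return "int32" if total_cells <= INT32_MAX else "int64"


print(accumulation_dtype(10_000 * 10_000))    # 100 M cells
print(accumulation_dtype(150_000 * 100_000))  # 15 billion cells
```

If accumulation is weighted (for example by cell area or rainfall) rather than counted in cells, the bound no longer applies and a floating-point dtype is the safer choice.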

When to Use This vs. Alternatives

Priority-Flood via richdem is the correct default for the vast majority of DEM preprocessing tasks. The following cases call for a different approach or additional preprocessing:

Sub-2 m LiDAR at high feature density: micro-depressions from road culverts, vernal pools, and building footprints must be distinguished from artifact pits. Use a breach-and-fill hybrid (WhiteboxTools BreachDepressions) before or instead of filling. See best practices for filling sinks in high-resolution LiDAR data for the detailed workflow.
Resampled DEMs where spatial resolution trade-offs were made: if your DEM has been resampled from a finer source, review resampling DEMs without losing hydrologic connectivity before pit filling to ensure resampling artifacts do not inflate pit counts.
Multi-directional routing (D-Infinity or MFD): fill is still required, but flat-area resolution is even more critical because divergent flow algorithms are more sensitive to flat plateau artefacts than D8. See multiple flow direction methods for the downstream implications.
Very small study areas (under 100 km²): a simple Wang & Liu implementation is acceptable for teaching purposes. A faithful implementation fills to the same spill elevations as Priority-Flood, and runtimes stay manageable at this size.

Frequently Asked Questions

What is the difference between pit filling and depression carving?

Pit filling raises cells to the lowest spill point, preserving surrounding terrain. Depression carving lowers cells to force drainage downhill. Filling is preferred for most workflows because it minimally modifies the DEM; carving can create artificial channels and flatten natural slopes.

When should I use Priority-Flood instead of Wang & Liu?

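Use Priority-Flood (Barnes et al. 2014) as the default, and always for datasets above roughly 5 million cells. Its min-heap implementation runs in O(N log N). Wang & Liu’s published method is also priority-queue based, and Barnes et al. cite it among the earlier descriptions of the same approach, so a faithful implementation fills to the same spill elevations. The slowdowns and over-filling seen in practice come from simplified implementations that rescan the grid repeatedly.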

Can I skip pit filling if I am using a D-Infinity routing algorithm?

No. D-Infinity and other multi-directional routing algorithms still stagnate inside true topographic depressions. All flow routing methods require a hydrologically conditioned DEM with continuous drainage paths to the raster boundary or valid basin outlets.

How do I handle endorheic basins when filling a DEM?

Priority-Flood fills all depressions to their spill point, including legitimate closed basins such as playas and saline lakes. Before running fill, identify and mask true endorheic basins using authoritative basin boundary datasets. Process masked areas separately or flag them in the provenance record to prevent artificially connecting internally-drained catchments to the broader stream network.

Hydrology Data Preparation & DEM Processing — parent section covering the full DEM conditioning pipeline
Best Practices for Filling Sinks in High-Resolution LiDAR Data — LiDAR-specific pit strategy, breach-fill hybrids, and micro-depression preservation
Coordinate Reference System Alignment — must complete before pit filling to avoid phantom depressions from reprojection artifacts
D8 Flow Direction Implementation — the immediate downstream step after pit filling
Stream Threshold Tuning — how flow accumulation thresholds determine stream network density after a conditioned DEM is produced
Multiple Flow Direction Methods — divergent routing approaches that require the same hydrological conditioning as D8
