feat(waterdata): add waterdata.xarray module returning CF datasets by thodson-usgs · Pull Request #297 · DOI-USGS/dataretrieval-python

thodson-usgs · 2026-05-30T13:15:21Z

Supersedes #281 (closed when its branch worktree-waterdata-drop-hash-ids was renamed to xarray-extension).

Summary

Adds dataretrieval.waterdata.xarray, a module that mirrors the Water Data
time-series getters but returns CF-conventions xarray.Datasets with series
metadata populated, instead of bare DataFrames.

from dataretrieval.waterdata import xarray as wdx
# dense=True shows the readable gridded view; the default is a ragged array
ds = wdx.get_daily(monitoring_location_id="USGS-05407000", parameter_code="00060",
                   time="2024-06-01/2024-06-05", dense=True)

discharge (monitoring_location_id, time)
    long_name:     Discharge, cubic feet per second
    units:         ft3 s-1
    cell_methods:  time: mean
    standard_name: water_volume_transport_in_river_channel
coords: monitoring_location_id (cf_role=timeseries_id), time, longitude, latitude
attrs: Conventions=CF-1.11, institution, source, references(URL)

Coverage

get_daily, get_continuous, get_latest_continuous, get_latest_daily,
get_nearest_continuous, get_peaks, get_field_measurements, get_samples,
and (preliminary) get_stats_por / get_stats_date_range.

Layout

The default is a CF contiguous ragged array (featureType = "timeSeries"):
every observation is concatenated along a single obs dimension, one
(monitoring location, parameter, statistic) series per timeseries instance,
with row_size linking them. Only real observations are stored (no NaN fill),
so it scales to large, very ragged multi-site pulls. Pass dense=True for
the alternative (monitoring_location_id, time) grid — one named variable per
parameter, NaN-filled — ergonomic for a few overlapping series but memory-costly
for ragged collections.
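
To make the trade-off concrete, here is a minimal stdlib-only sketch of the two layouts. It is not the module's implementation: the site IDs and values are made up, and `series_slice` is a hypothetical helper showing how `row_size` recovers one series from the shared obs axis.

```python
import math

# Hypothetical observations for two series of unequal length.
series = {
    "USGS-A": [("2024-06-01", 10.0), ("2024-06-02", 11.0), ("2024-06-03", 12.0)],
    "USGS-B": [("2024-06-02", 5.0)],
}

# Contiguous ragged layout: concatenate every observation along one obs
# axis and record how many observations each series contributed.
timeseries_id = list(series)
row_size = [len(series[k]) for k in timeseries_id]
obs_time = [t for k in timeseries_id for t, _ in series[k]]
obs_value = [v for k in timeseries_id for _, v in series[k]]


def series_slice(i):
    """Return the obs-axis slice of the i-th series, derived from row_size."""
    start = sum(row_size[:i])
    return slice(start, start + row_size[i])


# Dense layout: a (series, time) grid, NaN wherever a series has no value.
times = sorted(set(obs_time))
dense = [
    [dict(series[k]).get(t, math.nan) for t in times]
    for k in timeseries_id
]
```

Here the ragged form stores four values while the dense grid stores six, two of them NaN; the gap widens as the series overlap less.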

How it works

- CF attributes are derived from columns the getter already returns:
  unit_of_measure → units (UDUNITS where mapped), statistic_id →
  cell_methods, parameter_code → standard_name / vertical_datum /
  usgs_parameter_code. Only the human-readable parameter name comes from a
  small, cached parameter_code-keyed metadata lookup.
- The timeseries identity carries cf_role=timeseries_id (the synthesized
  timeseries_id coordinate when ragged, monitoring_location_id when dense),
  with longitude / latitude per site from point geometry, qualifier /
  approval_status as ancillary variables, and hydrologic_unit_code /
  state_name when the metadata call already provides them.
- xarray is an optional dependency (pip install dataretrieval[xarray]);
  it is not imported by dataretrieval.waterdata, so the core package stays
  xarray-free.
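
The column-to-attribute derivation can be sketched as a handful of dictionary lookups. Everything named here is an assumption for illustration: the map names, the `cf_attrs` helper, the raw unit string `ft^3/s`, and the pass-through of unmapped units are not taken from the module, whose vocabulary maps live in waterdata.types and may differ.

```python
# Hypothetical vocabulary maps (illustrative subsets only).
UNITS = {"ft^3/s": "ft3 s-1"}  # raw unit string -> UDUNITS form
CELL_METHODS = {
    "00001": "time: maximum",
    "00002": "time: minimum",
    "00003": "time: mean",
}
STANDARD_NAMES = {"00060": "water_volume_transport_in_river_channel"}


def cf_attrs(unit_of_measure, statistic_id, parameter_code):
    """Build variable attributes from columns a getter already returns."""
    attrs = {
        "usgs_parameter_code": parameter_code,
        # Assumption: an unmapped unit string is passed through unchanged.
        "units": UNITS.get(unit_of_measure, unit_of_measure),
    }
    if statistic_id in CELL_METHODS:
        attrs["cell_methods"] = CELL_METHODS[statistic_id]
    if parameter_code in STANDARD_NAMES:
        attrs["standard_name"] = STANDARD_NAMES[parameter_code]
    return attrs


daily_discharge = cf_attrs("ft^3/s", "00003", "00060")
```

With these maps, `daily_discharge` carries the same units, cell_methods, and standard_name shown in the Summary example.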

Design note: the plain getters are unchanged

An earlier iteration of this branch made the get_* getters drop hash/UUID
columns by default. That was reverted: the hash-dropping now lives entirely
inside the xarray builders, which surface only the columns they convert, so
opaque per-record UUIDs and per-series join keys never reach the Dataset. The
DataFrame-returning getters and their public API are untouched. The wrappers
accept (and ignore) an include_hash argument for call-compatibility; it does
not apply to the xarray path.
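
A rough sketch of that division of labor, using plain records instead of DataFrames. The function names, the column whitelist, and the ID values are hypothetical; the real builders may select columns differently.

```python
# Hypothetical whitelist of columns the builder knows how to convert.
CONVERTED_COLUMNS = ("monitoring_location_id", "time", "value", "unit_of_measure")


def plain_getter(**kwargs):
    """Stand-in for an unchanged DataFrame-returning getter: IDs stay intact."""
    return [
        {
            "daily_id": "hypothetical-record-uuid",
            "time_series_id": "hypothetical-series-hash",
            "monitoring_location_id": "USGS-05407000",
            "time": "2024-06-01",
            "value": 10.0,
            "unit_of_measure": "ft^3/s",
        }
    ]


def xarray_style_getter(include_hash=None, **kwargs):
    """Wrapper sketch: include_hash is accepted but deliberately ignored."""
    records = plain_getter(**kwargs)
    # Surface only the converted columns, so opaque IDs never reach the output.
    return [{k: r[k] for k in CONVERTED_COLUMNS if k in r} for r in records]


converted = xarray_style_getter(include_hash=True)
```

Because the filtering happens after the plain getter returns, the getter's output and signature are the same whether or not the wrapper exists.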

Status

Draft. Known gaps to harden before merge:

- the statistics conversion is a preliminary flat layout (not yet a
  percentile / day-of-year structure);
- broader coverage for mixed-unit groups and properties= subsets (both
  currently guarded with a warning / empty-Dataset fallback).

NaT-time rows are dropped with a warning; a failed (supplementary) metadata
lookup degrades to a dataset without parameter names rather than discarding the
data; the per-process metadata cache is bounded (FIFO) with a public
clear_metadata_cache() opt-out; and the doc extra installs xarray +
netCDF4 so the demo notebook renders in the docs build.

Add dataretrieval.waterdata.xarray, optional-dependency wrappers that mirror the Water Data time-series getters but return CF-conventions xarray.Dataset objects instead of bare DataFrames.

- Ragged (CF contiguous ragged array) layout by default; pass dense=True for the NaN-filled (monitoring_location_id, time) grid with one named variable per parameter.
- CF metadata is derived from columns the getters already return (unit_of_measure -> units, statistic_id -> cell_methods, parameter_code -> standard_name/vertical_datum), plus a cached parameter-name lookup; sites carry cf_role=timeseries_id with lon/lat.
- Coverage: get_daily, get_continuous, get_latest_continuous, get_latest_daily, get_nearest_continuous, get_peaks, get_field_measurements, get_samples, and preliminary get_stats_por / get_stats_date_range.
- xarray is an optional extra (pip install dataretrieval[xarray]); the core package never imports it.

Hash-valued ID columns are dropped inside the xarray builders, so the plain getters are left untouched. CF vocabulary maps live in waterdata.types (xarray-free, plain data). Adds a demo notebook + docs entry and offline converter unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

thodson-usgs mentioned this pull request May 30, 2026

feat(waterdata): add waterdata.xarray module returning CF datasets #281

Closed

thodson-usgs wants to merge 1 commit into DOI-USGS:main from thodson-usgs:xarray-extension
