feat(waterdata): add waterdata.xarray module returning CF datasets #297

Draft
thodson-usgs wants to merge 1 commit into DOI-USGS:main from thodson-usgs:xarray-extension

@thodson-usgs
Collaborator

Supersedes #281 (closed when its branch worktree-waterdata-drop-hash-ids was renamed to xarray-extension).

Summary

Adds dataretrieval.waterdata.xarray, a module that mirrors the Water Data
time-series getters but returns CF-conventions xarray.Datasets with series
metadata populated, instead of bare DataFrames.

from dataretrieval.waterdata import xarray as wdx
# dense=True shows the readable gridded view; the default is a ragged array
ds = wdx.get_daily(monitoring_location_id="USGS-05407000", parameter_code="00060",
                   time="2024-06-01/2024-06-05", dense=True)
discharge (monitoring_location_id, time)
    long_name:     Discharge, cubic feet per second
    units:         ft3 s-1
    cell_methods:  time: mean
    standard_name: water_volume_transport_in_river_channel
coords: monitoring_location_id (cf_role=timeseries_id), time, longitude, latitude
attrs: Conventions=CF-1.11, institution, source, references(URL)

Coverage

get_daily, get_continuous, get_latest_continuous, get_latest_daily,
get_nearest_continuous, get_peaks, get_field_measurements, get_samples,
and (preliminary) get_stats_por / get_stats_date_range.

Layout

The default is a CF contiguous ragged array (featureType = "timeSeries"):
every observation is concatenated along a single obs dimension, one
(monitoring location, parameter, statistic) series per timeseries instance,
with row_size linking them. Only real observations are stored (no NaN fill),
so it scales to large, very ragged multi-site pulls. Pass dense=True for
the alternative (monitoring_location_id, time) grid — one named variable per
parameter, NaN-filled — ergonomic for a few overlapping series but memory-costly
for ragged collections.
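
To make the trade-off concrete, here is a minimal pure-Python sketch of the two layouts. It does not use xarray or the module's builders; the site names, values, and the slice_series helper are hypothetical, and only the roles of the obs axis and row_size follow the description above.

```python
import math

# Two hypothetical series with different lengths (a "ragged" collection).
series = {
    "site-A": [("2024-06-01", 10.0), ("2024-06-02", 11.0), ("2024-06-03", 12.0)],
    "site-B": [("2024-06-02", 5.0)],
}

# Contiguous ragged layout: every observation is concatenated along one
# "obs" axis, and row_size records how many belong to each series.
timeseries_id = list(series)
row_size = [len(series[sid]) for sid in timeseries_id]
obs_time = [t for sid in timeseries_id for t, _ in series[sid]]
obs_value = [v for sid in timeseries_id for _, v in series[sid]]


def slice_series(index):
    """Recover one series from the flat arrays using cumulative row_size."""
    start = sum(row_size[:index])
    stop = start + row_size[index]
    return list(zip(obs_time[start:stop], obs_value[start:stop]))


# Dense layout: a (series, time) grid over the union of timestamps,
# NaN-filled wherever a series has no observation.
times = sorted(set(obs_time))
dense = [
    [dict(series[sid]).get(t, math.nan) for t in times]
    for sid in timeseries_id
]

print(len(obs_value), "stored values (ragged)")
print(len(dense) * len(times), "grid cells (dense)")
```

The ragged form stores only the observations that exist, while the dense grid grows with the union of timestamps across all series, which is where the memory cost for very ragged pulls comes from.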

How it works

  • CF attributes are derived from columns the getter already returns:
    unit_of_measure -> units (UDUNITS where mapped), statistic_id ->
    cell_methods, parameter_code -> standard_name / vertical_datum /
    usgs_parameter_code. Only the human-readable parameter name comes from a
    small, cached parameter_code-keyed metadata lookup.
  • the timeseries identity carries cf_role=timeseries_id (the synthesized
    timeseries_id coordinate when ragged, monitoring_location_id when dense),
    with longitude / latitude per site from point geometry, qualifier /
    approval_status as ancillary variables, and hydrologic_unit_code /
    state_name when the metadata call already provides them.
  • xarray is an optional dependency (pip install dataretrieval[xarray]);
    it is not imported by dataretrieval.waterdata, so the core package stays
    xarray-free.
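
The column-to-attribute derivation in the first bullet can be sketched roughly as below. This is an illustration only: the table names, the cf_attrs helper, the raw unit string "ft^3/s", and the fallback of passing an unmapped unit through unchanged are all assumptions, and the real vocabulary maps in waterdata.types may be shaped differently. The attribute values match the example output in the Summary.

```python
# Hypothetical lookup tables; the real maps live in waterdata.types and may
# differ in keys, coverage, and structure.
UNITS = {"ft^3/s": "ft3 s-1"}            # raw unit string -> UDUNITS form
CELL_METHODS = {"00003": "time: mean"}   # statistic_id -> CF cell_methods
STANDARD_NAMES = {"00060": "water_volume_transport_in_river_channel"}


def cf_attrs(unit_of_measure, statistic_id, parameter_code):
    """Build variable attributes from values the getter already returns."""
    attrs = {
        # Assumption: an unmapped unit is passed through unchanged.
        "units": UNITS.get(unit_of_measure, unit_of_measure),
        "usgs_parameter_code": parameter_code,
    }
    # Only set CF attributes that have a known mapping.
    if statistic_id in CELL_METHODS:
        attrs["cell_methods"] = CELL_METHODS[statistic_id]
    if parameter_code in STANDARD_NAMES:
        attrs["standard_name"] = STANDARD_NAMES[parameter_code]
    return attrs


print(cf_attrs("ft^3/s", "00003", "00060"))
```

Keeping the maps as plain dictionaries means they can be imported and tested without xarray installed, which is consistent with the core package staying xarray-free.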

Design note: the plain getters are unchanged

An earlier iteration of this branch made the get_* getters drop hash/UUID
columns by default. That was reverted: the hash-dropping now lives entirely
inside the xarray builders, which surface only the columns they convert, so
opaque per-record UUIDs and per-series join keys never reach the Dataset. The
DataFrame-returning getters and their public API are untouched. The wrappers
accept (and ignore) an include_hash argument for call-compatibility; it does
not apply to the xarray path.
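
As a rough sketch of that design, the builders can be pictured as an allowlist: only named columns are carried forward, so anything else is dropped without the getters changing. The column names, the record shape, and both function names below are hypothetical and are not taken from the branch.

```python
# Hypothetical allowlist; the columns the real builders convert may differ.
CONVERTED_COLUMNS = (
    "monitoring_location_id",
    "time",
    "value",
    "qualifier",
    "approval_status",
)


def surface_columns(records, keep=CONVERTED_COLUMNS):
    """Keep only allowlisted keys; opaque ID columns fall away implicitly."""
    return [{k: r[k] for k in keep if k in r} for r in records]


def get_daily_sketch(records, include_hash=None):
    """Stand-in wrapper: include_hash is accepted but deliberately ignored."""
    return surface_columns(records)


raw = [
    {
        "id": "3f2c0a1e-0000-0000-0000-000000000000",  # made-up per-record UUID
        "time_series_id": "abc123",                    # made-up join key
        "monitoring_location_id": "USGS-05407000",
        "time": "2024-06-01",
        "value": 10.0,
    }
]
print(get_daily_sketch(raw, include_hash=True))
```

An allowlist fails closed: a new opaque column added upstream would not leak into the Dataset unless a builder is taught to convert it.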

Status

Draft. Known gaps to harden before merge:

  • the statistics conversion is a preliminary flat layout (not yet a
    percentile / day-of-year structure);
  • broader coverage for mixed-unit groups and properties= subsets (both
    currently guarded with a warning / empty-Dataset fallback).

Already handled: NaT-time rows are dropped with a warning; a failed
(supplementary) metadata lookup degrades to a dataset without parameter names
rather than discarding the data; the per-process metadata cache is bounded
(FIFO) with a public clear_metadata_cache() opt-out; and the doc extra installs
xarray + netCDF4 so the demo notebook renders in the docs build.

Add dataretrieval.waterdata.xarray, optional-dependency wrappers that
mirror the Water Data time-series getters but return CF-conventions
xarray.Dataset objects instead of bare DataFrames.

- Ragged (CF contiguous ragged array) layout by default; pass dense=True
  for the NaN-filled (monitoring_location_id, time) grid with one named
  variable per parameter.
- CF metadata is derived from columns the getters already return
  (unit_of_measure -> units, statistic_id -> cell_methods,
  parameter_code -> standard_name/vertical_datum), plus a cached
  parameter-name lookup; sites carry cf_role=timeseries_id with lon/lat.
- Coverage: get_daily, get_continuous, get_latest_continuous,
  get_latest_daily, get_nearest_continuous, get_peaks,
  get_field_measurements, get_samples, and preliminary
  get_stats_por / get_stats_date_range.
- xarray is an optional extra (pip install dataretrieval[xarray]); the
  core package never imports it. Hash-valued ID columns are dropped
  inside the xarray builders, so the plain getters are left untouched.

CF vocabulary maps live in waterdata.types (xarray-free, plain data).
Adds a demo notebook + docs entry and offline converter unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>