feat(waterdata): add waterdata.xarray module returning CF datasets#297
Draft
thodson-usgs wants to merge 1 commit into
Add dataretrieval.waterdata.xarray, optional-dependency wrappers that mirror the Water Data time-series getters but return CF-conventions xarray.Dataset objects instead of bare DataFrames.

- Ragged (CF contiguous ragged array) layout by default; pass dense=True for the NaN-filled (monitoring_location_id, time) grid with one named variable per parameter.
- CF metadata is derived from columns the getters already return (unit_of_measure -> units, statistic_id -> cell_methods, parameter_code -> standard_name/vertical_datum), plus a cached parameter-name lookup; sites carry cf_role=timeseries_id with lon/lat.
- Coverage: get_daily, get_continuous, get_latest_continuous, get_latest_daily, get_nearest_continuous, get_peaks, get_field_measurements, get_samples, and preliminary get_stats_por / get_stats_date_range.
- xarray is an optional extra (pip install dataretrieval[xarray]); the core package never imports it.

Hash-valued ID columns are dropped inside the xarray builders, so the plain getters are left untouched. CF vocabulary maps live in waterdata.types (xarray-free, plain data). Adds a demo notebook + docs entry and offline converter unit tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary
Adds dataretrieval.waterdata.xarray, a module that mirrors the Water Data time-series getters but returns CF-conventions xarray.Datasets with series metadata populated, instead of bare DataFrames.
Coverage
get_daily, get_continuous, get_latest_continuous, get_latest_daily, get_nearest_continuous, get_peaks, get_field_measurements, get_samples, and (preliminary) get_stats_por / get_stats_date_range.
Layout
The default is a CF contiguous ragged array (featureType = "timeSeries"): every observation is concatenated along a single obs dimension, with one (monitoring location, parameter, statistic) series per timeseries instance and row_size linking them. Only real observations are stored (no NaN fill), so it scales to large, very ragged multi-site pulls. Pass dense=True for the alternative (monitoring_location_id, time) grid, with one named, NaN-filled variable per parameter. The dense grid is ergonomic for a few overlapping series but memory-costly for ragged collections.
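The two layouts can be sketched without xarray at all. The block below is a minimal, standard-library illustration of the contiguous ragged idea (flat obs arrays plus row_size) and of what densifying costs; the series keys and values are made up, series_slice is a hypothetical helper, and the dense grid here has one row per series for simplicity rather than the module's (monitoring_location_id, time) grid with one variable per parameter.

```python
from itertools import accumulate

# Three hypothetical series of unequal length: (time step, value) pairs.
# Keys and numbers are invented for illustration only.
series = {
    "site-A/00060/00003": [(1, 10.0), (2, 11.0), (3, 12.0)],
    "site-B/00060/00003": [(2, 5.0)],
    "site-C/00065/00003": [(1, 1.5), (3, 1.7)],
}

# Contiguous ragged layout: every observation sits on one flat "obs" axis,
# and row_size records how many consecutive observations each series owns.
timeseries_id = list(series)
row_size = [len(obs) for obs in series.values()]
time = [t for obs in series.values() for t, _ in obs]
value = [x for obs in series.values() for _, x in obs]


def series_slice(i):
    """Return the slice of the obs axis that belongs to series i."""
    starts = [0, *accumulate(row_size)]
    return slice(starts[i], starts[i + 1])


# Dense layout: a full (series, time) grid, NaN wherever nothing was observed.
all_times = sorted(set(time))
nan = float("nan")
dense = [[nan] * len(all_times) for _ in timeseries_id]
for i in range(len(timeseries_id)):
    s = series_slice(i)
    for t, x in zip(time[s], value[s]):
        dense[i][all_times.index(t)] = x

n_fill = sum(1 for row in dense for x in row if x != x)  # NaN != NaN
print(f"ragged stores {len(value)} values; dense stores {len(value) + n_fill}")
```

Even in this tiny case a third of the dense grid is fill; the gap widens as sites with very different record lengths are combined, which is the motivation for making the ragged layout the default.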
How it works
- CF metadata is derived from columns the getters already return: unit_of_measure → units (UDUNITS where mapped), statistic_id → cell_methods, parameter_code → standard_name / vertical_datum / usgs_parameter_code. Only the human-readable parameter name comes from a small, cached parameter_code-keyed metadata lookup.
- Sites carry cf_role=timeseries_id (the synthesized timeseries_id coordinate when ragged, monitoring_location_id when dense), with longitude/latitude per site from point geometry, qualifier/approval_status as ancillary variables, and hydrologic_unit_code/state_name when the metadata call already provides them.
- xarray is an optional dependency (pip install dataretrieval[xarray]); it is not imported by dataretrieval.waterdata, so the core package stays xarray-free.
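As a rough sketch of the column-to-attribute derivation described above, the function below builds a CF attribute dict from one record's columns. The three lookup tables are a tiny illustrative subset written for this example, not the maps that live in waterdata.types; cf_attrs, the long_name key, and the fall-back-to-raw-unit behaviour are all assumptions.

```python
# Illustrative subsets only; the real vocabulary maps are in waterdata.types.
UNITS = {"ft^3/s": "ft3 s-1"}  # unit_of_measure -> UDUNITS-style string
CELL_METHODS = {  # USGS statistic code -> CF cell_methods
    "00001": "time: maximum",
    "00002": "time: minimum",
    "00003": "time: mean",
}
STANDARD_NAMES = {  # USGS parameter code -> CF standard_name
    "00060": "water_volume_transport_in_river_channel",
}


def cf_attrs(row, parameter_name=None):
    """Build variable attributes from columns a getter already returns.

    `row` is a plain dict standing in for one record; `parameter_name` is
    the optional result of the (supplementary) metadata lookup.
    """
    code = row["parameter_code"]
    attrs = {"usgs_parameter_code": code}

    unit = row.get("unit_of_measure")
    if unit:
        # Assumption: unmapped units are passed through unchanged.
        attrs["units"] = UNITS.get(unit, unit)

    cell_methods = CELL_METHODS.get(row.get("statistic_id"))
    if cell_methods:
        attrs["cell_methods"] = cell_methods

    standard_name = STANDARD_NAMES.get(code)
    if standard_name:
        attrs["standard_name"] = standard_name

    if parameter_name:  # only present when the metadata lookup succeeded
        attrs["long_name"] = parameter_name
    return attrs


row = {"parameter_code": "00060", "statistic_id": "00003", "unit_of_measure": "ft^3/s"}
print(cf_attrs(row, parameter_name="Discharge"))
```

Because everything except long_name comes from the record itself, a failed lookup only costs the human-readable name; the remaining attributes are still populated.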
Design note: the plain getters are unchanged
An earlier iteration of this branch made the get_* getters drop hash/UUID columns by default. That was reverted: the hash-dropping now lives entirely inside the xarray builders, which surface only the columns they convert, so opaque per-record UUIDs and per-series join keys never reach the Dataset. The DataFrame-returning getters and their public API are untouched. The wrappers accept (and ignore) an include_hash argument for call-compatibility; it does not apply to the xarray path.
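One way to picture "surface only the columns they convert" is an allow-list applied inside the builder, so ID columns are dropped by omission rather than by name. This is a sketch under that assumption: CONVERTED_COLUMNS, build_records, the wrapper name get_daily_sketch, and the sample column names are hypothetical, and plain dicts stand in for DataFrame rows.

```python
# Hypothetical allow-list: anything not named here never reaches the output.
CONVERTED_COLUMNS = (
    "monitoring_location_id",
    "parameter_code",
    "statistic_id",
    "time",
    "value",
    "unit_of_measure",
)


def build_records(rows):
    """Keep only the convertible columns from each getter row."""
    return [{k: r[k] for k in CONVERTED_COLUMNS if k in r} for r in rows]


def get_daily_sketch(rows, include_hash=False):
    """Stand-in for an xarray wrapper.

    include_hash is accepted so calls written for the plain getter still
    work, but it is deliberately ignored on this path.
    """
    return build_records(rows)


rows = [
    {
        "daily_id": "hypothetical-per-record-uuid",
        "time_series_id": "hypothetical-join-key",
        "monitoring_location_id": "USGS-00000000",
        "parameter_code": "00060",
        "statistic_id": "00003",
        "time": "2024-01-01",
        "value": 10.0,
        "unit_of_measure": "ft^3/s",
    }
]
print(get_daily_sketch(rows, include_hash=True))
```

The input rows are not modified, which mirrors the point of the design note: filtering happens on the way into the Dataset, and the getter's own output is left alone.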
Status
Draft. Known gaps to harden before merge:
percentile / day-of-year structure);
properties= subsets (both currently guarded with a warning / empty-Dataset fallback).
NaT-time rows are dropped with a warning; a failed (supplementary) metadata lookup degrades to a dataset without parameter names rather than discarding the data; the per-process metadata cache is bounded (FIFO) with a public clear_metadata_cache() opt-out; and the doc extra installs xarray + netCDF4 so the demo notebook renders in the docs build.