02 - SSI on an xarray grid with bounded soil moisture¶

Demonstrates the N-D entry point rsd.standardize_xr and the bounded-variable pathway (bounds=) used for soil moisture and other physically bounded quantities.

We build a small synthetic 5x5 grid of monthly soil moisture in [0, sat], run standardize_xr(..., bounds=(0, sat)) so the logit pre-transform handles the physical bound, and confirm the result is a DataArray with the same dimensions as the input.

No external data; runs in a couple of seconds.

In [1]:

Copied!

import numpy as np
import pandas as pd
import xarray as xr

import rsd

print(f"rsd version: {rsd.__version__}")
import numpy as np
import pandas as pd
import xarray as xr

import rsd

print(f"rsd version: {rsd.__version__}")

rsd version: 0.1.dev0+d20260520

1. Build a synthetic bounded soil-moisture grid¶

5 latitude x 5 longitude pixels, 50 years of monthly values. Soil moisture is drawn from a beta distribution (which lives on (0, 1)) and scaled to a saturation of sat = 0.4 m^3/m^3. Seasonal variation in the beta shape parameter creates a realistic wet/dry cycle.

In [2]:

Copied!





rng = np.random.default_rng(0)
n_years = 50
n_lat, n_lon = 5, 5
sat = 0.4   # m^3/m^3

times = pd.date_range("1970-01-01", periods=n_years * 12, freq="MS")
months = np.array([t.month for t in times])

data = np.empty((len(times), n_lat, n_lon))
for i in range(n_lat):
    for j in range(n_lon):
        data[:, i, j] = sat * np.array(
            [rng.beta(2 + 0.2 * m, 5 - 0.2 * m) for m in months]
        )

sm = xr.DataArray(
    data,
    dims=["time", "lat", "lon"],
    coords={
        "time": times,
        "lat": np.linspace(-2, 2, n_lat),
        "lon": np.linspace(-2, 2, n_lon),
    },
    name="soil_moisture",
    attrs={"units": "m3 m-3", "long_name": "synthetic soil moisture"},
)
# Compact header summary instead of dumping the full array.
print(f"xarray.DataArray  {sm.name!r}  dims={sm.dims}  shape={sm.shape}  dtype={sm.dtype}")
print(f"  coords: time({len(sm.time)})  lat({len(sm.lat)})  lon({len(sm.lon)})")
print(f"  attrs:  {sm.attrs}")
rng = np.random.default_rng(0)
n_years = 50
n_lat, n_lon = 5, 5
sat = 0.4   # m^3/m^3

times = pd.date_range("1970-01-01", periods=n_years * 12, freq="MS")
months = np.array([t.month for t in times])

data = np.empty((len(times), n_lat, n_lon))
for i in range(n_lat):
    for j in range(n_lon):
        data[:, i, j] = sat * np.array(
            [rng.beta(2 + 0.2 * m, 5 - 0.2 * m) for m in months]
        )

sm = xr.DataArray(
    data,
    dims=["time", "lat", "lon"],
    coords={
        "time": times,
        "lat": np.linspace(-2, 2, n_lat),
        "lon": np.linspace(-2, 2, n_lon),
    },
    name="soil_moisture",
    attrs={"units": "m3 m-3", "long_name": "synthetic soil moisture"},
)
# Compact header summary instead of dumping the full array.
print(f"xarray.DataArray  {sm.name!r}  dims={sm.dims}  shape={sm.shape}  dtype={sm.dtype}")
print(f"  coords: time({len(sm.time)})  lat({len(sm.lat)})  lon({len(sm.lon)})")
print(f"  attrs:  {sm.attrs}")

xarray.DataArray  'soil_moisture'  dims=('time', 'lat', 'lon')  shape=(600, 5, 5)  dtype=float64
  coords: time(600)  lat(5)  lon(5)
  attrs:  {'units': 'm3 m-3', 'long_name': 'synthetic soil moisture'}

2. Compute SSI-3 with an explicit reference¶

We split the 50-year series into a 20-year reference baseline (1970-1989) and use the full series as the target. This is the canonical reference-vs-target pattern the package is designed for (example 01 demonstrates the same pattern in 1-D): the reference defines the CDF; target values are scored against it on a fixed, shared scale.

Two parameters distinguish this from the SPI example:

bounds=(0.0, sat) activates the logit pre-transform. Each per-timestep value is mapped from the bounded interval (0, sat) into unbounded logit space before accumulation, so the rest of the pipeline (accumulation, deseasonalisation, tail fitting) sees real-valued data with no boundary effects.
agg="mean" keeps soil moisture in physical units (a 3-month mean is still a soil-moisture value). Sum is more natural for fluxes such as precipitation.

Months are extracted automatically from the time coordinate; no need to pass months_target/months_reference here.

In [3]:

Copied!





# Use the first 20 years (1970-1989) as the reference; the full
# 50-year series is the target. The reference defines the fixed CDF;
# target values are evaluated against it.
sm_ref = sm.isel(time=slice(0, 20 * 12))

ssi = rsd.standardize_xr(
    target=sm,
    reference=sm_ref,
    method="rsd",
    scale=3,
    agg="mean",
    bounds=(0.0, sat),
    parallel=True,    # use dask for per-chunk parallelism
).compute()

# Compact header summary instead of dumping the full array.
print(f"xarray.DataArray  dims={ssi.dims}  shape={ssi.shape}  dtype={ssi.dtype}")
print(f"  attrs: {ssi.attrs}")
# Use the first 20 years (1970-1989) as the reference; the full
# 50-year series is the target. The reference defines the fixed CDF;
# target values are evaluated against it.
sm_ref = sm.isel(time=slice(0, 20 * 12))

ssi = rsd.standardize_xr(
    target=sm,
    reference=sm_ref,
    method="rsd",
    scale=3,
    agg="mean",
    bounds=(0.0, sat),
    parallel=True,    # use dask for per-chunk parallelism
).compute()

# Compact header summary instead of dumping the full array.
print(f"xarray.DataArray  dims={ssi.dims}  shape={ssi.shape}  dtype={ssi.dtype}")
print(f"  attrs: {ssi.attrs}")

xarray.DataArray  dims=('time', 'lat', 'lon')  shape=(600, 5, 5)  dtype=float32
  attrs: {'method': 'rsd', 'scale': 3, 'agg': 'mean', 'reference_period': '1970-01-01 to 1989-12-01'}

3. Sanity-check¶

Output should be a float32 DataArray with the same dimensions and coordinates as the target input. The first scale - 1 = 2 time positions are NaN by construction (the rolling window is not yet full). Inside the reference window the z-scores should be near standard-normal; outside the reference window they are still approximately N(0,1) here because the synthesis is stationary (reference and target are drawn from the same distribution).

In [4]:

Copied!





print(f"dtype: {ssi.dtype}")
print(f"shape: {ssi.shape}  (time, lat, lon)")
print(f"first 2 time positions all NaN:  {bool(ssi.isel(time=slice(0, 2)).isnull().all())}")
post_warmup = ssi.isel(time=slice(2, None))
print(f"post-warmup mean (~0): {float(post_warmup.mean()):+.3f}")
print(f"post-warmup std  (~1): {float(post_warmup.std()):+.3f}")
print(f"dtype: {ssi.dtype}")
print(f"shape: {ssi.shape}  (time, lat, lon)")
print(f"first 2 time positions all NaN:  {bool(ssi.isel(time=slice(0, 2)).isnull().all())}")
post_warmup = ssi.isel(time=slice(2, None))
print(f"post-warmup mean (~0): {float(post_warmup.mean()):+.3f}")
print(f"post-warmup std  (~1): {float(post_warmup.std()):+.3f}")

dtype: float32
shape: (600, 5, 5)  (time, lat, lon)
first 2 time positions all NaN:  True
post-warmup mean (~0): -0.022
post-warmup std  (~1): +1.060

A note on `agg="sum"` with bounds¶

The logit pre-transform is applied before accumulation, so both agg="sum" and agg="mean" work cleanly with bounded inputs. The scientific choice between the two depends on what you want to standardise: an average soil-moisture state (mean) or an accumulated water-content quantity (sum). Both produce valid SSI fields under the bounded path.

What next¶

See 03_diagnostics_showcase.ipynb for rsd.diagnose, which checks the exchangeability assumption that the RSD pooling step relies on.
Full API reference at https://hydro-rsd.readthedocs.io.