Water flows, so does information¶

The GitHub repository needed to reproduce this notebook can be found here. You can also execute the notebook on Binder by clicking here.

In this notebook I will show that the information about the streamflow at a point on a river can be transfered to any point on a river of the same hydrological basin, not only downstream but also upstream. It comes from the fact that a part of the water present in the streamflow of a river comes from other rivers that flow into it. This hierarchical relationship can be used by algorithms to pass information. Possible applications can be found in situations where available data is not sufficient to unambiguously calibrate a model, for instance when sparse measurements of water level are available, as it is the case for satellite-based estimates.

Introduction¶

A hydrological network is of hierarchical nature. As we can see in the figure below, where the Amazon basin's river network is represented, lots of small streams flow into bigger ones, which get bigger and bigger as they collect more flows (while also being less and less numerous), and so on until only one river remains and reaches the basin's outlet.

import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
%matplotlib inline
from matplotlib.sankey import Sankey
import numpy as np
from numba import jit
from pandas import DataFrame
import sys
sys.path.append('../py')
from models import gr4j, delay
from misc import get_kde, uniform_density, lnprob_from_density, plot_series, dist_map
import mcmc
from IPython.display import Image
Image("../data/hydrosheds_amazon_large.jpg")

We will limit the scope of this study to the simplest form of network - a linear one, where we have only one river that flows through several subbasins. In the figure below, we denote these subbasins by b_n, while the basins that include the upstream subbasins are called B_n and defined by:

B₀ = {b₀}

B₁ = {b₀, b₁}

...

B_n = {b₀, b₁, ..., b_n}

fig = plt.figure(figsize=(17, 5))
ax = fig.add_subplot(1, 1, 1, xticks=[], yticks=[])
sankey = Sankey(ax=ax, unit=None)
sankey.add(flows=[1, -1], trunklength=2, patchlabel='$b_0$')
ws_nb = 5
for i in range(1, ws_nb):
    sankey.add(flows=[1, -1], trunklength=2, patchlabel=f'$b_{i}$', prior=i-1, connect=(1, 0))
sankey.finish()
plt.title('Subbasins along a linear river network')
plt.show()

Let's first load some precipitation data. It doesn't matter very much where it comes from, it just looks like a plausible precipitation time series (see figure below).

p = np.load('../data/p.npy')

We will take a constant potential evapotranspiration, even though this is not very plausible, but it is not very important here. The precipitation and potential evapotranspiration are uniform across all subbasins.

df = DataFrame()
df['p'] = p
df['e'] = 5

days = 365 # show last year's data
columns = {'e': '$E$', 'p': '$P$'}
df[['p', 'e']][-days:].rename(columns=columns).plot(figsize=(17, 5))
plt.legend()
plt.title('Precipitation (P) and potential evapotranspiration (E) over every subbasin')
plt.show()

In order to generate synthetic streamflow data, we choose quite an arbitrary hydrological model, and feed it with the precipitation and potential evapotranspiration data.

x_true = [1000, 0, 100, 10]
g = gr4j(x_true) # arbitrary rainfall-runoff model
df['q_true_0'] = g.run([df.p.values, df.e.values])

This is for our basin B₀. Now we want the stream at its outlet to flow into another subbasin, which itself flows into another subbasin, etc. We will suppose that all subbasins have the same area, and that the time it takes for the water to flow from the outlet of a subbasin to the outlet of the next one is the same for every subbasin, and can be described by a simple delay in the streamflow signal. From this we can compute the streamflow at the outlet of a basin recursively from the streamflow at the outlet of its upstream basin.

d_true = 5 # 5 days for the water from a subbasin to flow into the next
d = delay(d_true)
for i in range(1, ws_nb):
    area_head = i
    area_tail = 1
    df[f'q_true_{i}'] = (d.run(df[f'q_true_{i-1}'].values) * area_head + df.q_true_0.values * area_tail) / (area_head + area_tail)

We want to convert our streamflows to water levels, which will be the measured quantity. This is usually done through a rating curve. Here we will just take the water level to be equal to the square root of the streamflow.

for i in range(ws_nb):
    df[f'h_true_{i}'] = np.sqrt(df[f'q_true_{i}'])

In order to simulate a situation where water level is partially known, we want to remove information from our synthetic water level time series. This can be done by inserting missing values and by adding noise. It is similar to the type of data we can find in satellite estimates. The noise is chosen to be Gaussian around the true value, and the remaining data is spread in time accross the subbasins.

downsample = 30 # keep one water level measure every month for each subbasin
offset = downsample // ws_nb
std = 0.5
for i in range(ws_nb):
    # noise is additive gaussian
    h_noised = np.random.normal(df[f'h_true_{i}'].values, std, df.p.values.size)
    h = np.ones_like(h_noised) * np.nan
    h[offset*i::downsample] = h_noised[offset*i::downsample]
    df[f'h{i}'] = h

The figure below shows that the information we have about the water levels is spread accross the subbasins, not only in space but also in time. The remaining of this study presents the forward-backward algorithm, which is well suited for taking advantage of such a distributed information.

days = 365 # show last year's data
plt.figure(figsize=(17, 5))
for i in range(ws_nb):
    plt.plot(df[f'h_true_{i}'].tail(days), label=f'$True({i})$')
for i in range(ws_nb):
    plt.scatter(df.tail(days).index, df[f'h{i}'].tail(days), label=f'$Measured({i})$')
plt.legend()
plt.title('Water level information spread accross subbasins')
plt.show()

Forward¶

The first step is to pass the information forward (i.e. downstream), just like water flows. We start by calibrating B₀'s model using the available measurements (P, E, H₀) over it. The calibration tries to find the parameters of the model that best fit the observed data, but instead of running an optimization algorithm like the downhill simplex algorithm, we use a Markov chain Monte Carlo algorithm (MCMC). While the former gives us the optimal value of the model parameters given the measurements as is, the latter gives us their probability density by taking uncertainty in the measured streamflow into account. For instance, the poor measurements available might not be sufficient to determine the model parameters precisely, but there are more measurements downstream that might help. The idea is then to use a model of B₁ that makes use of B₀ and b₁'s models, and to calibrate this dual model with the measurements over B₁. Note that the dual model consists of twice as many parameters as a single model (plus one parameter for the propagation delay), which could be an issue if we didn't have any prior information about these parameters. But it is not the case: the calibration of B₀'s model helped us know its parameters better. When this information is used during the calibration of B₁'s dual model, it constrains the possible values of the parameters of b₁'s model. In this process, information passes from B₀ to b₁. Now, if we repeat the same operations for B₁ (whose parameters have information from H₀ and H₁) and b₂, using information from H₂, we can see that each calibration of a model uses information from its upstream subbasins. And if we keep on doing it until the outlet, we effectively make the information flow downstream.

Note that the basic mechanism in which information is exchanged between two basins is the calibration of a dual model, i.e. a model that splits a basin in two submodels. It is through the calibration of this dual model that information can be transfered between the parameters of the two submodels, and consequently between the streamflows simulated by the submodels.

Let's first get the posterior probability distribution for the model parameters of B₀, given a water level H₀. We take a uniform distribution for the prior of these parameters, because we don't know anything about them. For the uncertainty in the water level measurements, we take a Gaussian around the observed values, just like the noise we added on the true water level (but applied to the measured water level since we don't know the true one).

Since the hydrological model can only simulate streamflows, we need a way to transform a streamflow signal to a water level signal. We use the same technique as in this post (probability distribution mapping).

peq = df[['p', 'e']]
peq['h'] = df['h0']
warmup = 365 * 2 # discard two first years for the warmup of the model

# prior probability distribution is uniform
x_range = ((0.1, 1e4), (-1, 1), (0.1, 1e3), (0.1, 1e2))
x_prior = [uniform_density(*r) for r in x_range]
lnprob_prior = [lnprob_from_density(p, *r) for p, r in zip(x_prior, x_range)]

def lnprob(x):
    # prior
    lnp = 0
    for i, v in enumerate(x):
        lnp += lnprob_prior[i](v)
    if not np.isfinite(lnp):
        return -np.inf, np.ones_like(peq.p.values) * np.inf

    h_obs = peq.h.values
    q_sim = gr4j(x).run([peq.p.values, peq.e.values])
    h_sim = np.hstack((np.full(warmup, np.nan), dist_map(q_sim[warmup:], h_obs[warmup:])))
    # remove warmup and missing values
    df = DataFrame({'h_sim': h_sim, 'h_obs': h_obs})[warmup:].dropna()
    std2 = std * std
    # error on measured water level is gaussian with standard deviation std
    return lnp + np.sum(-np.square(df.h_sim.values - df.h_obs.values) / (2 * std2) - np.log(np.sqrt(2 * np.pi * std2))), (q_sim, h_sim)

sample_nb = 2_000 # number of samples generated by MCMC
burnin = sample_nb // 10 # number of burnin samples
x_start = [100, 0, 10, 1] # arbitrary initial parameter values for MCMC
x0 = x_start
sampler = mcmc.Sampler(x0, lnprob)
samples, qh_sim = sampler.run(sample_nb, burnin)
q_sim = np.empty((sample_nb, *peq.p.values.shape))
h_sim = np.empty((sample_nb, *peq.p.values.shape))
for i in range(sample_nb):
    q_sim[i, :] = qh_sim[i][0]
    h_sim[i, :] = qh_sim[i][1]

100%|██████████| 2200/2200 [01:22<00:00, 26.51it/s]

plot_series(ensemble=h_sim[:, -days:], true=df.h_true_0.values[-days:], measure=df.h0.values[-days:], title='Simulated water level at the outlet of $B_0$ using measured water level $H_0$')

plot_series(ensemble=q_sim[:, -days:], true=df.q_true_0.values[-days:], title='Simulated streamflow at the outlet of $B_0$ using measured water level $H_0$')

The figure below shows the posterior probability distribution of B₀'s model parameters (using Kernel Density Estimation). We can see that they are more or less centered on their true value (vertical line), with some uncertainty. It can happen that the model is way off the true value for a particular parameter, and uses another parameter to compensate for it.

def plot_dist(dist, x_true):
    x_nb = len(dist)
    f, ax = plt.subplots(1, x_nb)
    f.set_figwidth(x_nb * 4)
    for i in range(x_nb):
        x, y = dist[i]
        ax[i].plot(x, y)
        ax[i].axvline(x_true[i], color='red', alpha=0.3)
        ax[i].set_title(f'$X_{i+1}$')
    plt.show()

x_b0 = [get_kde(samples[:, i]) for i in range(4)]
plot_dist(x_b0, x_true)

For B₁, with discharge Q₁, we need a model that consists of the models of B₀ (with its propagation delay) and b₁. Thus the lnprob function has to be a little more complex. Also, it has to take into account the prior information on the parameters of B₀'s model.

peq = df[['p', 'e']]
peq['h'] = df['h1']
# prior probability distribution is uniform for subbasin b1
x_range = ((0.1, 1e4), (-1, 1), (0.1, 1e3), (0.1, 1e2))
# prior probability distribution for subbasin b0 comes from previous iteration
x_b1 = [uniform_density(*r) for r in x_range]
d_range = (0, 10)
d_prior = uniform_density(*d_range)
lnprob_prior = [lnprob_from_density(p, *r) for p, r in zip(x_b0, x_range)]
lnprob_prior += [lnprob_from_density(d_prior, *d_range)]
lnprob_prior += [lnprob_from_density(p, *r) for p, r in zip(x_b1, x_range)]

def lnprob(x):
    # prior
    lnp = 0
    for i, v in enumerate(x):
        lnp += lnprob_prior[i](v)
    if not np.isfinite(lnp):
        return -np.inf, np.ones_like(peq.p.values) * np.inf

    h_obs = peq.h.values
    std2 = std * std
    x_head = x[:5]
    g_head = gr4j(x_head)
    x_tail = x[5:]
    g_tail = gr4j(x_tail)
    q_sim = (g_head.run([peq.p.values, peq.e.values]) + g_tail.run([peq.p.values, peq.e.values])) / 2
    h_sim = np.hstack((np.full(warmup, np.nan), dist_map(q_sim[warmup:], h_obs[warmup:])))
    df = DataFrame({'h_sim': h_sim, 'h_obs': h_obs})[warmup:].dropna()
    return lnp + np.sum(-np.square(df.h_sim.values - df.h_obs.values) / (2 * std2) - np.log(np.sqrt(2 * np.pi * std2))), (q_sim, h_sim)

d_start = 1
x0 = [x_b0[i][0][np.argmax(x_b0[i][1])] for i in range(4)] + [d_start] + x_start
sampler = mcmc.Sampler(x0, lnprob)
samples, qh_sim = sampler.run(sample_nb, burnin)
q_sim = np.empty((sample_nb, *peq.p.values.shape))
h_sim = np.empty((sample_nb, *peq.p.values.shape))
for i in range(sample_nb):
    q_sim[i, :] = qh_sim[i][0]
    h_sim[i, :] = qh_sim[i][1]

100%|██████████| 2200/2200 [04:15<00:00,  8.50it/s]

x_nb = 5
f, ax = plt.subplots(1, x_nb)
f.set_figwidth(x_nb * 3.2)
for i in range(x_nb):
    x, y = get_kde(samples[:, i])
    if i < 4:
        p = ax[i].plot(*x_b0[i], alpha=0.1)
        ax[i].axvline(x_true[i], color='r', alpha=0.3)
        ax[i].set_title(f'$X_{i+1}$')
    else:
        ax[i].axvline(d_true, color='r', alpha=0.3)
        ax[i].set_title(f'$d$')
    ax[i].plot(x, y, color=p[0].get_color())
plt.show()

The figure below shows the posterior probability distribution for the parameters of b₁'s model. Information from B₀'s model parameters was used.

x_b1 = [get_kde(samples[:, i]) for i in range(5, 9)]
plot_dist(x_b1, x_true[:4])

Now if we want to continue this process downstream, we could just create a model which includes the models of all its upstream subbasins. But that would not scale, first from a computational point of view, because each model adds 5 new parameters and the MCMC algorithm needs to run the model N times for each sample (where N is the total number of parameters), and then from a problem complexity point of view, because each time we add a parameter we add a dimension to the space that the MCMC algorithm has to explore. Even if the prior information we have on the parameters limits this space, it is always very important to limit the number of parameters for the algorithm to converge.

To prevent the problem complexity from growing, we can reduce our dual model to a single model. Indeed, we should now have a dual model that can produce (i.e. simulate) streamflow data with a better precision than the measured streamflow, since we also used the information in the streamflow of its upstream basin. All we need to do is to get the posterior probability distribution for the parameters of this single model using the simulated streamflow from the dual model. Fortunately, we already computed the streamflow of the dual model previously when we ran the MCMC algorithm. From this ensemble of streamflow time series, we get the streamflow probability density at each time step (using KDE), and use it for the calibration of the single model.

n_kde = 10 # KDE points
q_kde = np.empty((2, n_kde, q_sim.shape[1]))
for i in range(q_kde.shape[2]):
    q_kde[:, :, i] = get_kde(q_sim[:, i], nb=n_kde)

plot_series(ensemble=h_sim[:, -days:], true=df.h_true_1.values[-days:], measure=df.h1.values[-days:], title='Water level at the outlet of $B_1$ (dual model)')

plot_series(ensemble=q_sim[:, -days:], true=df.q_true_1.values[-days:], title='Streamflow at the outlet of $B_1$ (dual model)')

For the calibration of the single model, we compute the posterior probability as the product of the probabilities of the simulated streamflow given the observed streamflow (which has been simulated by the dual model). This implies that the data is independent, and for that to be true we need to take into account that the streamflow signal is autocorrelated, i.e. the streamflow at time t₀ is not independent from the streamflow at time t₁ if t₀ and t₁ are too close. Thus we cannot take the streamflow values at each time step, and instead we must take e.g. one data every six months.

peq = df[['p', 'e']]
sim_step = 365 // 2 # take one streamflow data every 6 months when computing posterior probability

def lnprob(x):
    q_sim = gr4j(x).run([peq.p.values, peq.e.values])
    lnp = 0
    for i in range(warmup, q_sim.size, sim_step):
        lnp += lnprob_from_density(q_kde[:, :, i])(q_sim[i])
    return lnp, q_sim

x0 = x_start
sampler = mcmc.Sampler(x0, lnprob)
samples, q_sim = sampler.run(sample_nb, burnin)

100%|██████████| 2200/2200 [00:43<00:00, 50.92it/s]

q_sim = np.array(q_sim)
plot_series(ensemble=h_sim[:, -days:], true=df.h_true_1.values[-days:], measure=df.h1.values[-days:], title='Water level at the outlet of $B_1$ (single model)')

plot_series(ensemble=q_sim[:, -days:], true=df.q_true_1.values[-days:], title='Streamflow at the outlet of $B_1$ (single model)')

And now we are back in the situation we started in: we can create a dual model that consists of the models of B₁ and b₂, etc. The following is the full algorithm which corresponds to the forward pass.

def get_lnprob(peq, lnprob_prior, area_head, area_tail, q_kde=None):
    def lnprob(x):
        lnp = 0
        for i, v in enumerate(x):
            lnp += lnprob_prior[i](v)
        if not np.isfinite(lnp):
            return -np.inf, np.ones_like(peq.p.values) * np.inf

        x_head = x[:5] # this includes the delay in the gr4j model
        g_head = gr4j(x_head)
        if area_tail > 0:
            x_tail = x[5:]
            g_tail = gr4j(x_tail)
            q_tail = g_tail.run([peq.p.values, peq.e.values])
        else:
            q_tail = 0
        q_sim = (g_head.run([peq.p.values, peq.e.values]) * area_head + q_tail * area_tail) / (area_head + area_tail)
        if q_kde is None:
            # observation is measured water level
            h_obs = peq.h_obs.values
            h_sim = np.hstack((np.full(warmup, np.nan), dist_map(q_sim[warmup:], h_obs[warmup:])))
            df = DataFrame({'h_sim': h_sim, 'h_obs': h_obs})[warmup:].dropna()
            std2 = std * std
            return lnp + np.sum(-np.square(df.h_sim.values - df.h_obs.values) / (2 * std2) - np.log(np.sqrt(2 * np.pi * std2))), q_sim
        else:
            # observation is simulated streamflow
            lnp_q = 0
            for i in range(warmup, q_sim.size, sim_step):
                lnp_q += lnprob_from_density(q_kde[:, :, i])(q_sim[i])
            return lnp + lnp_q, q_sim
    return lnprob

q_ensemble = {}
for ws_i in range(ws_nb):
    peq = df[['p', 'e']]
    peq['h_obs'] = df[f'h{ws_i}']
    if ws_i == 0:
        area_head = 1
        area_tail = 0
        # prior probability distribution is uniform for head basin
        x0 = x_start
        x_prior = [uniform_density(*r) for r in x_range]
        lnprob_prior = [lnprob_from_density(p, *r) for p, r in zip(x_prior, x_range)]
    else:
        area_head = ws_i
        area_tail = 1
        x0 = [xy[0][np.argmax(xy[1])] for xy in x_head]
        x0 += [d_start] + x_start
        # prior probability distribution is uniform for tail basin
        lnprob_prior = [lnprob_from_density(p, *r) for p, r in zip(x_head, x_range)]
        lnprob_prior += [lnprob_from_density(uniform_density(*d_range), *d_range)]
        x_tail = [uniform_density(*r) for r in x_range]
        lnprob_prior += [lnprob_from_density(p, *r) for p, r in zip(x_tail, x_range)]
    lnprob = get_lnprob(peq, lnprob_prior, area_head, area_tail)
    # run MCMC
    sampler = mcmc.Sampler(x0, lnprob)
    samples, q_sim = sampler.run(sample_nb, burnin)
    # get simulated streamflow and uncertainty
    q_sim = np.array(q_sim)
    q_ensemble[f'f{ws_i}'] = q_sim
    # plot updated streamflow
    plot_series(ensemble=q_sim[:, -days:], true=df[f'q_true_{ws_i}'].values[-days:], title=f'Streamflow at the outlet of $B_{ws_i}$')
    if (ws_i > 0) and (ws_i < ws_nb - 1):
        # reduce dual model to single model
        peq = df[['p', 'e']]
        x_prior = [uniform_density(*r) for r in x_range]
        lnprob_prior = [lnprob_from_density(p, *r) for p, r in zip(x_prior, x_range)]
        q_kde = np.empty((2, n_kde, q_sim.shape[1]))
        for i in range(q_kde.shape[2]):
            q_kde[:, :, i] = get_kde(q_sim[:, i], nb=n_kde)
        lnprob = get_lnprob(peq, lnprob_prior, 1, 0, q_kde)
        x0 = x_start
        sampler = mcmc.Sampler(x0, lnprob)
        samples, q_sim = sampler.run(sample_nb, burnin)
    x_head = [get_kde(samples[:, i]) for i in range(4)]

100%|██████████| 2200/2200 [01:21<00:00, 26.98it/s]

100%|██████████| 2200/2200 [04:18<00:00,  6.76it/s]

100%|██████████| 2200/2200 [00:39<00:00, 55.54it/s]
100%|██████████| 2200/2200 [04:02<00:00,  9.08it/s]

100%|██████████| 2200/2200 [00:40<00:00, 54.50it/s]
100%|██████████| 2200/2200 [05:08<00:00,  8.36it/s]

100%|██████████| 2200/2200 [00:32<00:00, 67.42it/s]
100%|██████████| 2200/2200 [04:16<00:00,  7.67it/s]

Backward¶

Passing the information forward results in the downstream basins benefiting from more information than the upstream basins, since the information accumulates at each iteration. So for instance Q₄ is better known than Q₀ after the update. But we can feel that an information on a downstream basin can also constrain an upstream basin. Take the extreme situation where it never rains on a downstream subbasin, then the streamflow at its outlet has to come from the upstream basin.

The solution is to pass the information backward (i.e. upstream). This is less intuitive since water doesn't flow upstream, but from the information point of view it works both ways: when we calibrate b₁'s model using B₀'s model, we can say that the information in Q₀ constrains the possible values of Q₁ further (information passes forward), or we can equivalently say that the information in Q₁ constrains the possible values of Q₀ further (information passes backward). Now if instead of starting from the two first basins we start from the two last ones, and repeat this process by going upstream, we effectively make the information flow backward.

We start by calibrating B₄'s model using the measured water level H₄, and we do the same with B₃ and H₃ (which gives us some information on its parameters). Then we create a dual model consisting of a model for B₃ (with prior information on its parameters thanks to the previous calibration) and a model for b₄ (with no prior information), and we calibrate it with the ensemble of simulated streamflows for B₄ that we previously generated. This gives an ensemble of simulated streamflows for B₃ that have used information from the measured water levels H₃ and H₄. We can now repeat the same process with the dual model consisting of B₂ and b₃, using the ensemble of simulated streamflows for B₃ previously generated, etc. The following is the full algorithm corresponding to the backward pass.

def get_lnprob(peq, lnprob_prior, area_head, area_tail, q_kde=None):
    def lnprob(x):
        lnp = 0
        for i, v in enumerate(x):
            lnp += lnprob_prior[i](v)
        if not np.isfinite(lnp):
            return -np.inf, np.ones_like(peq.p.values) * np.inf
        x_head = x[:4] # don't include delay because we return q_head
        g_head = gr4j(x_head)
        if area_tail > 0:
            x_tail = x[5:]
            g_tail = gr4j(x_tail)
            q_tail = g_tail.run([peq.p.values, peq.e.values])
        else:
            q_tail = 0
        q_head = g_head.run([peq.p.values, peq.e.values])
        if len(x) > 4:
            d = delay(x[4])
            q_head_delayed = d.run(q_head)
        else:
            q_head_delayed = q_head
        q_sim = (q_head_delayed * area_head + q_tail * area_tail) / (area_head + area_tail)
        if q_kde is None:
            # observation is measured water level
            h_obs = peq.h_obs.values
            h_sim = np.hstack((np.full(warmup, np.nan), dist_map(q_sim[warmup:], h_obs[warmup:])))
            df = DataFrame({'h_sim': h_sim, 'h_obs': h_obs})[warmup:].dropna()
            std2 = std * std
            return lnp + np.sum(-np.square(df.h_sim.values - df.h_obs.values) / (2 * std2) - np.log(np.sqrt(2 * np.pi * std2))), q_head
        else:
            # observation is simulated streamflow
            lnp_q = 0
            for i in range(warmup, q_sim.size, sim_step):
                lnp_q += lnprob_from_density(q_kde[:, :, i])(q_sim[i])
            return lnp + lnp_q, q_head
    return lnprob

for ws_i in range(ws_nb-1, 0, -1):
    if ws_i == ws_nb - 1:
        nb = 2
    else:
        nb = 1
    for n in range(nb-1, -1, -1):
        # head basin, no prior information
        peq = df[['p', 'e']]
        peq['h_obs'] = df[f'h{ws_i-1+n}']
        x_prior = [uniform_density(*r) for r in x_range]
        lnprob_prior = [lnprob_from_density(p, *r) for p, r in zip(x_prior, x_range)]
        x0 = x_start
        lnprob = get_lnprob(peq, lnprob_prior, 1, 0)
        # run MCMC
        sampler = mcmc.Sampler(x0, lnprob)
        samples, q_sim = sampler.run(sample_nb, burnin)
        q_sim = np.array(q_sim)
        if n == 1:
            # outlet basin: no information passing
            q_ensemble[f'b{ws_nb-1}'] = q_sim
            plot_series(ensemble=q_sim[:, -days:], true=df[f'q_true_{ws_nb-1}'].values[-days:], title=f'Streamflow at the outlet of $B_{ws_nb-1}$')
            q_kde = np.empty((2, n_kde, q_sim.shape[1]))
            for i in range(q_kde.shape[2]):
                q_kde[:, :, i] = get_kde(q_sim[:, i], nb=n_kde)
        else:
            # head basin of a dual calibration has prior information
            x_head = [get_kde(samples[:, i]) for i in range(4)]

    peq = df[['p', 'e']]
    area_head = ws_i
    area_tail = 1
    x0 = [xy[0][np.argmax(xy[1])] for xy in x_head] # most likely value
    x0 += [d_start] + x_start
    # prior probability distribution is uniform for tail basin
    lnprob_prior = [lnprob_from_density(p, *r) for p, r in zip(x_head, x_range)]
    lnprob_prior += [lnprob_from_density(d_prior, *d_range)]
    x_tail = [uniform_density(*r) for r in x_range]
    lnprob_prior += [lnprob_from_density(p, *r) for p, r in zip(x_tail, x_range)]
    lnprob = get_lnprob(peq, lnprob_prior, area_head, area_tail, q_kde)
    # run MCMC
    sampler = mcmc.Sampler(x0, lnprob)
    samples, q_sim = sampler.run(sample_nb, burnin)
    # get simulated streamflow and uncertainty
    q_sim = np.array(q_sim)
    # KDE of streamflow time series will be used at next iteration
    q_kde = np.empty((2, n_kde, q_sim.shape[1]))
    for i in range(q_kde.shape[2]):
        q_kde[:, :, i] = get_kde(q_sim[:, i], nb=n_kde)
    q_ensemble[f'b{ws_i-1}'] = q_sim
    # plot updated streamflow
    plot_series(ensemble=q_sim[:, -days:], true=df[f'q_true_{ws_i-1}'].values[-days:], title=f'Streamflow at the outlet of $B_{ws_i-1}$')

100%|██████████| 2200/2200 [01:26<00:00, 15.14it/s]

100%|██████████| 2200/2200 [01:30<00:00, 24.23it/s]
100%|██████████| 2200/2200 [03:09<00:00, 14.44it/s]

100%|██████████| 2200/2200 [01:23<00:00, 26.43it/s]
100%|██████████| 2200/2200 [03:09<00:00, 13.22it/s]

100%|██████████| 2200/2200 [01:20<00:00, 25.87it/s]
100%|██████████| 2200/2200 [02:49<00:00, 11.58it/s]

100%|██████████| 2200/2200 [01:17<00:00, 28.25it/s]
100%|██████████| 2200/2200 [02:49<00:00, 13.02it/s]

Just like the forward pass accumulates information downstream, the backward pass accumulates information upstream, i.e. B₀'s streamflow is better known than B₄'s streamflow. The information gathered by the two passes has to be combined, which is what the smoothing step does.

Smoothing¶

The final step is to merge the forward and the backward passes, since they are complementary in their use of information. This is done by combining their ensemble of simulated streamflows at each time step.

for ws_i in range(ws_nb):
    q_sim = np.vstack((q_ensemble[f'f{ws_i}'], q_ensemble[f'b{ws_i}']))
    q_sim[q_sim==np.inf] = np.nan
    q_mean = np.nanmean(q_sim, axis=0)
    plot_series(q_sim[:, -days:], df[f'q_true_{ws_i}'].values[-days:], mean=q_mean[-days:], title=f'Updated streamflow at the outlet of $B_{ws_i}$')

Conclusion¶

This notebook showed how the forward-backward algorithm can be applied to hydrology in the context of distributed information, when sparse water level measurements are spread over the hydrological network, and when these measurements are of poor precision. The power of the bayesian framework can be leveraged to take advantage of this situation and pass the information downstream (forward) and upstream (backward), resulting in an efficient use of the overall information. Satellite measurements, and in particular satellite-based water level estimates, are well suited to this kind of processing. With the growing number of so-called virtual stations from e.g. Theia's Hydroweb, and precipitation products from e.g. NASA's Global Precipitation Measurement mission, hydrological modelling based only on satellite data becomes possible.