<!DOCTYPE html>

mcmc

Bayesian methods for hydrological model parameter inference

In my last post, I used a Monte Carlo method in order to infer the GR4J model parameters when the streamflow was known as a water level, as opposed to a discharge. Because the hydrological model is only able to simulate a discharge, an extra step was needed to convert the observed water level to an equivalent discharge, leading to a high level of uncertainty. The Monte Carlo method that I implemented was very CPU intensive. In this notebook, I will show that it is possible to take advantage of much more efficient methods (MCMC or Markov Chain Monte Carlo), provided that we embrace the Bayesian framework. Fortunately, Python comes with a very nice library for Bayesian inference: PyMC3.

In [1]:
from pymc3 import Model, Uniform, Normal, sample
from pymc3.distributions import Continuous
import pymc3 as pm
from theano import as_op
import theano.tensor as tt
import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
from mcmc import gr4j, delay, dist_map
from scipy import stats
from IPython.display import SVG

plt.style.use('ggplot')
%matplotlib notebook

In the GROSS modelling system, the key step for inferring the streamflow on any point of a basin's river network is the splitting of this basin into two subbassins, as shown in the figure below:

In [2]:
SVG(filename='sb.svg') 
Out[2]:
SB0 SB1

In this figure, the whole basin (B) consists of two subbasins, SB0 in blue and SB1 in green. The river represented with a dashed line drains water from SB1 to its outlet (let's call the corresponding discharge Q1). Similarly, water from SB0 is drained to its outlet with a discharge Q0. At the outlet of B, the discharge (Q) is the sum of Q0 and Q1.

In GROSS, instead of fitting a hydrological model to B using an observed discharge Q, we fit two models at once, one to SB0 and the other to SB1, so that their simulated discharges match the observed water level H at the outlet of B. This model fitting involves the following steps:

  • simulate a discharge at the outlet of SB1 (Qsim1),
  • propagate Qsim1 to the outlet of B through a delay model (Qsim1_del),
  • simulate a discharge at the outlet of SB0 (Qsim0),
  • sum the two simulated discharges Qsim0 and Qsim1_del (Qsim),
  • convert the simulated discharge Qsim to an equivalent water level using distribution mapping (Hsim),
  • assess the quality of the simulated water level Hsim given the observed water level H at the outlet of B.

It surely results in a lot of uncertainty! Let's see how we can use a Bayesian framework to solve this problem. First, let's generate some data. We will start from the same dataset as here, which corresponds to 40 years of observed precipitation and potential evapotranspiration at a daily time step over a basin (B). We won't use the observed discharge for this basin, but instead re-create it by simulation, so that we can see if/how the Bayesian parameter inference works.

In [3]:
# load some observed data for P and E (we won't use Q but instead compute it synthetically)
peq = pd.read_pickle('peq.pkl')

# precipitation is the same for SB0 and SB1
p0 = peq.p.values.copy()
p1 = p0.copy()

# potential evapotranspiration is the same for SB0 and SB1
e0 = peq.e.values.copy()
e1 = e0.copy()

# discharge at the outlet of SB0 (due to the precipitation on SB0, not B)
true_param0 = [300, 0.5, 50, 5]
q0 = gr4j(true_param0).run([p0, e0])[0]

# discharge at the outlet of SB1, takes 3.3 days to reach the outlet of B
true_param1 = [100, -0.5, 30, 3, 3.3]
q1 = gr4j(true_param1[:4]).run([p1, e1])[0]
q1_del = delay(true_param1[4]).run(q1)

# true water level at the outlet of SB1, given an arbitrary rating curve
h1_true = (q1 ** 0.3) * 0.2 + 0.5

# observed water level at the outlet of SB1
noise_level = 0.1
noise1 = noise_level * np.random.randn(len(q1))
h1 = h1_true + noise1

# discharge at the outlet of B, assuming SB0 and SB1 have the same area
q = (q0 + q1_del) / 2

# true water level at the outlet of B, given an arbitrary rating curve
h_true = (q ** 0.5) * 0.1 + 2

# observed water level at the outlet of B
noise = noise_level * np.random.randn(len(q))
h = h_true + noise

df = DataFrame(index=peq.index)
df['h_true'] = h_true
df['h'] = h
df['h1_true'] = h1_true
df['h1'] = h1
In [4]:
df.loc['1985':'1986'].plot(figsize=(12, 5))
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb7bca82940>

We can see that the water level signals are very noisy, which can be the case when they are measured from a satellite. Now that we have all the data we need, let's set up our Bayesian model. This is pretty straightforward with PyMC3. We call it the dual model because it involves 2 GR4J models. Note that because we don't have any prior information about the GR4J model parameters, we give them a uniform distribution.

In [5]:
# 2 years of warmup for the GR4J model
warmup = 365 * 2

q_mod0 = gr4j().set_pe([p0, e0])
q_mod1 = gr4j().set_pe([p1, e1])

# Deterministic model for the simulation of water level with two GR4J models
@as_op(itypes=[tt.dscalar, tt.dscalar, tt.dscalar, tt.dscalar, tt.dscalar, tt.dscalar, tt.dscalar, tt.dscalar, tt.dscalar], otypes=[tt.dvector])
def dual_gr4j(x1_0, x2_0, x3_0, x4_0, x1_1, x2_1, x3_1, x4_1, d1):
    q_sim0 = q_mod0.set_x([x1_0, x2_0, x3_0, x4_0]).run()[0]
    q_sim1 = q_mod1.set_x([x1_1, x2_1, x3_1, x4_1]).run()[0]
    q_sim1_del = delay(float(d1)).run(q_sim1)
    q_sim = (q_sim0 + q_sim1_del) / 2
    h_sim = dist_map(q_sim[warmup:], h[warmup:])
    return h_sim

# PyMC3 model
dual_mcmc = Model()
with dual_mcmc:
    # priors for unknown model parameters

    # SB0
    x1_0 = Uniform('x1_0', 0.1, 1000)
    x2_0 = Uniform('x2_0', -1, 1)
    x3_0 = Uniform('x3_0', 0.1, 300)
    x4_0 = Uniform('x4_0', 0.1, 10)

    # SB1
    x1_1 = Uniform('x1_1', 0.1, 1000)
    x2_1 = Uniform('x2_1', -1, 1)
    x3_1 = Uniform('x3_1', 0.1, 300)
    x4_1 = Uniform('x4_1', 0.1, 10)
    d1   = Uniform('d1', 0, 20)

    # simulated water level time series
    h_sim = dual_gr4j(x1_0, x2_0, x3_0, x4_0, x1_1, x2_1, x3_1, x4_1, d1)

    # likelihood of observed water level
    h_obs = Normal('h_obs', mu=h_sim, sd=noise_level, observed=h[warmup:])

We use the Metropolis algorithm for the sampling method, and draw 100 samples from the posterior distribution, from which we discard the first half (burn-in). The reason for drawing such a small amount of samples is that in GROSS, this operation is going to be done a lot of times, possibily thousands of times, so it has to be as quick as possible. This will also test the efficiency of the MCMC algorithm.

In [6]:
sample_nb = 100
with dual_mcmc:
    step = pm.Metropolis([x1_0, x2_0, x3_0, x4_0, x1_1, x2_1, x3_1, x4_1, d1])
    trace = sample(sample_nb, step=step, njobs=1)
100%|██████████| 100/100 [00:33<00:00,  2.69it/s]
In [7]:
_ = pm.traceplot(trace)

As we can see from the figure above and the table below, the difference between the true parameters and the inferred parameters is quite large.

In [8]:
burnin = int(sample_nb / 2)
inf_param0 = [np.median(trace[burnin:]['x' + str(i + 1) + '_0']) for i in range(4)]
inf_param1 = [np.median(trace[burnin:]['x' + str(i + 1) + '_1']) for i in range(4)] + [np.median(trace[burnin:]['d1'])]
df = DataFrame(columns=['X1', 'X2', 'X3', 'X4', 'delay'])
df.loc['True parameters for SB0'] = true_param0 + ['']
df.loc['Inferred param. for SB0'] = inf_param0 + ['']
df.loc['True parameters for SB1'] = true_param1
df.loc['Inferred param. for SB1'] = inf_param1
df
Out[8]:
X1 X2 X3 X4 delay
True parameters for SB0 300.000000 0.500000 50.000000 5.000000
Inferred param. for SB0 124.512830 -0.359621 35.441697 6.858287
True parameters for SB1 100.000000 -0.500000 30.000000 3.000000 3.3
Inferred param. for SB1 308.330469 -0.329015 28.290769 3.745556 1.8248

The following figure shows the differences between the inferred discharges (based on the inferred parameters) and the true discharges.

In [9]:
df = DataFrame(index=peq.index)
df['q_sim0'] = q_mod0.set_x(inf_param0).run()[0]
df['q_sim1'] = q_mod1.set_x(inf_param1[:4]).run()[0]
df['q_sim1_del'] = delay(inf_param1[4]).run(df.q_sim1.values)
df['q_sim'] = (df.q_sim0 + df.q_sim1_del) / 2
df['q0'] = q0
df['q1'] = q1
df['q'] = q
df = df.loc['1953/02':'1953/05']

plt.figure(figsize=(12, 5))
plt.fill_between(df.index, df.q_sim0, df.q0, color='b', label='SB0')
plt.fill_between(df.index, df.q_sim1, df.q1, color='g', label='SB1')
plt.fill_between(df.index, df.q_sim, df.q, color='r', label='B')
plt.legend()
plt.show()

We can see that even though the inferred discharges for SB0 and SB1 are quite far from the truth, the discharge for B is inferred quite precisely. This is because the dual model was free to combine the internal discharges Qsim0 an Qsim1, as long as the resulting water level Hsim was close to H. In other words, the model was able to compensate for the errors. To prevent this, we need to constrain the model with more data.

Note: I could also show that a diversity of information can constrain the model enough. Indeed, the fact that the precipitation is the same for SB0 and SB1 doesn't allow the model to discriminate between the two subbasins. The greater the difference between the precipitations, the easier for the model to infer the parameters.

With satellite water level data, we often have access to points that are located inside the basin. Here, that would be the water level at the outlet of SB1 (H1). It means that we can infer the GR4J parameters for SB1 with H1 independently, and then use that information to further constrain the inferrence of the GR4J parameters for both SB0 and SB1 (using H). In Bayesian words, that would be called "updating": we can use new (independent) data to compute a posterior distribution, which is used as a prior distribution for a subsequent estimation. This new data (H1) allows estimating the posterior distribution of SB1's GR4J parameters, which is then used as the prior distribution for the dual model, instead of the uniform distribution.

So let's use H1 to estimate SB1's GR4J parameters. For now we don't care about SB0, we can set up a Bayesian model which only involves SB1's GR4J model. We will call this Bayesian model the single model.

In [10]:
# Deterministic model for the simulation of water level with one GR4J model
@as_op(itypes=[tt.dscalar, tt.dscalar, tt.dscalar, tt.dscalar], otypes=[tt.dvector])
def single_gr4j(x1_1, x2_1, x3_1, x4_1):
    q_sim1 = q_mod1.set_x([x1_1, x2_1, x3_1, x4_1]).run()[0]
    h_sim1 = dist_map(q_sim1[warmup:], h1[warmup:])
    return h_sim1

# PyMC3 model
single_mcmc = Model()
with single_mcmc:
    # priors for unknown model parameters

    # for SB1
    x1_1 = Uniform('x1_1', 0.1, 1000)
    x2_1 = Uniform('x2_1', -1, 1)
    x3_1 = Uniform('x3_1', 0.1, 300)
    x4_1 = Uniform('x4_1', 0.1, 10)

    # simulated water level time series
    h_sim1 = single_gr4j(x1_1, x2_1, x3_1, x4_1)

    # likelihood of observed water level
    h_obs1 = Normal('h_obs1', mu=h_sim1, sd=0.1, observed=h1[warmup:])
In [11]:
with single_mcmc:
    step = pm.Metropolis([x1_1, x2_1, x3_1, x4_1])
    trace = sample(sample_nb, step=step, njobs=1)
100%|██████████| 100/100 [00:07<00:00, 12.57it/s]
In [12]:
_ = pm.traceplot(trace)
In [13]:
inf_param1 = [np.median(trace[burnin:]['x' + str(i + 1) + '_1']) for i in range(4)]
df = DataFrame(columns=['X1', 'X2', 'X3', 'X4'])
df.loc['True parameters for SB1'] = true_param1[:4]
df.loc['Inferred param. for SB1'] = inf_param1
df
Out[13]:
X1 X2 X3 X4
True parameters for SB1 100.00000 -0.500000 30.000000 3.000000
Inferred param. for SB1 109.01368 0.441685 29.816628 2.842258

We can see that the algorithm did a much better job at estimating SB1's GR4J parameters. When we used the dual model, we had one observation (H) which agregated the effects of several processes, described by two GR4J models. Now we have one observation (H1) which is the result of only one process, described by one GR4J model. So we see that here it is not only because of new data that we can estimate our parameters with greater precision, but also because we used this data with a simpler model, thus drastically reducing the dimensionality.

Now let's inject this new information in our dual model. We use Kernel Density Estimation to compute a probability distribution from the posterior samples. In PyMC3, we have to wrap this probability distribution in a custom Continuous distribution class.

In [14]:
def from_posterior(param):
    class From_posterior(Continuous):
        def __init__(self, *args, **kwargs):
            self.from_posterior_logp = _from_posterior_logp(kwargs.pop('param'))
            super(From_posterior, self).__init__(*args, **kwargs)
        def logp(self, value):
            return self.from_posterior_logp(value)

    class From_posterior_logp:
        def __init__(self, param):
            t = trace[param][burnin:]
            tmin, tmax = np.min(t), np.max(t)
            self.x = np.linspace(tmin, tmax, 100)
            self.y = stats.gaussian_kde(t)(self.x)
            self.y /= np.sum(self.y)
        def from_posterior_logp(self, value):
            return np.array(np.log(np.interp(value, self.x, self.y, left=0, right=0)))

    def _from_posterior_logp(param):
        @as_op(itypes=[tt.dscalar], otypes=[tt.dscalar])
        def _(value):
            return From_posterior_logp(param).from_posterior_logp(value)
        return _

    return From_posterior(param, testval=np.median(trace[param][burnin:]), param=param)
In [15]:
# PyMC3 model
dual_mcmc = Model()
with dual_mcmc:
    # priors for unknown model parameters
    # SB0: no prior information
    x1_0 = Uniform('x1_0', 0.1, 1000)
    x2_0 = Uniform('x2_0', -1, 1)
    x3_0 = Uniform('x3_0', 0.1, 300)
    x4_0 = Uniform('x4_0', 0.1, 10)

    # SB1: prior information coming from posterior of the previous model
    x1_1 = from_posterior('x1_1')
    x2_1 = from_posterior('x2_1')
    x3_1 = from_posterior('x3_1')
    x4_1 = from_posterior('x4_1')
    d1   = Uniform('d1', 0, 20)

    # simulated water level time series
    h_sim = dual_gr4j(x1_0, x2_0, x3_0, x4_0, x1_1, x2_1, x3_1, x4_1, d1)

    # likelihood (sampling distribution) of observed water level
    # it consists of the measurements errors, but the GR4J model's errors are also in there
    h_obs = Normal('h_obs', mu=h_sim, sd=0.1, observed=h[warmup:])
In [16]:
with dual_mcmc:
    step = pm.Metropolis([x1_0, x2_0, x3_0, x4_0, x1_1, x2_1, x3_1, x4_1, d1])
    new_trace = sample(sample_nb, step=step, njobs=1)
100%|██████████| 100/100 [00:29<00:00,  3.49it/s]
In [17]:
_ = pm.traceplot(new_trace)
In [18]:
inf_param0 = [np.median(new_trace[burnin:]['x' + str(i + 1) + '_0']) for i in range(4)]
inf_param1 = [np.median(new_trace[burnin:]['x' + str(i + 1) + '_1']) for i in range(4)] + [np.median(new_trace[burnin:]['d1'])]
df = DataFrame(columns=['X1', 'X2', 'X3', 'X4', 'delay'])
df.loc['True parameters for SB0'] = true_param0 + ['']
df.loc['Inferred param. for SB0'] = inf_param0 + ['']
df.loc['True parameters for SB1'] = true_param1
df.loc['Inferred param. for SB1'] = inf_param1
df
Out[18]:
X1 X2 X3 X4 delay
True parameters for SB0 300.000000 0.500000 50.000000 5.000000
Inferred param. for SB0 378.913362 -0.113497 42.025996 4.869276
True parameters for SB1 100.000000 -0.500000 30.000000 3.000000 3.3
Inferred param. for SB1 109.156344 0.149166 29.816628 2.845177 3.44742

The inferred parameters are now much closer to the true parameters. X2 is always tricky because it reflects a transfer of water from the subbasin to its neighbors, and this information is lost when mapping the discharge to the water level. But overall the estimations have improved a lot.

In the figure below, wa can see that the discharge estimations are much more precise for the subbasins SB0 and SB1.

In [19]:
df = DataFrame(index=peq.index)
df['q_sim0'] = q_mod0.set_x(inf_param0).run()[0]
df['q_sim1'] = q_mod1.set_x(inf_param1[:4]).run()[0]
df['q_sim1_del'] = delay(inf_param1[4]).run(df.q_sim1.values)
df['q_sim'] = (df.q_sim0 + df.q_sim1_del) / 2
df['q0'] = q0
df['q1'] = q1
df['q'] = q
df = df.loc['1953/02':'1953/05']

plt.figure(figsize=(12, 5))
plt.fill_between(df.index, df.q_sim0, df.q0, color='b', label='SB0')
plt.fill_between(df.index, df.q_sim1, df.q1, color='g', label='SB1')
plt.fill_between(df.index, df.q_sim, df.q, color='r', label='B')
plt.legend()
plt.show()

Conclusion

In this notebook I showed that the Bayesian framework can be effectively used for hydrological model parameterization. PyMC3 provides a simple API for quickly setting up a Bayesian model, and I have barely scratched the surface of what can be done with it. I have been impressed by the efficiency of its sampling algorithms, with as few as 100 samples. For GROSS, this has a huge impact on the overall performances. Also, the hierarchical structure of a river network can now be taken into account by passing belief information on the hydrological model parameters downstream.