
hydroweb

Satellite-based water level

Water levels can be measured from space using satellite radar altimeters. When applied to rivers, such measurements carry information about streamflow. Hydroweb collects altimetry data from various satellites (Envisat, Jason-2 and Jason-3). The data is freely available, but you need to create an account. I downloaded all the data for the Amazon river basin. There are two types of data, research and operational; I will first take a look at them together.

In [1]:
import os
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap
import pandas as pd
from pandas import DataFrame
import rasterio
from tqdm import tqdm
import pickle

%matplotlib inline
In [2]:
def get_heights():
    heights = {'research': [], 'operational': []}
    data_path = '../data/amazon/hydroweb/'
    files = os.listdir(data_path)
    for name in tqdm(files):
        with open(data_path + name, 'rt') as f:
            data = f.read().split('\n')
        # The first header line holds the station coordinates and type.
        line = data[0]
        latlon = line[line.find(';lat=') + 1:line.find(';ref=')]
        vs_type = line[line.find(';type=') + 6:line.find(';diff=')]
        # Skip the remaining header lines, which all start with '#'.
        i = 1
        while data[i][0] == '#':
            i += 1
        df = pd.read_csv(data_path + name, sep=';', skiprows=i, header=None,
                         names=['1', 'date', 'time', latlon, '5', '6', '7'])
        df = df.drop(['1', '5', '6', '7'], axis=1)
        df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
        s = df.set_index(['datetime'])[latlon]
        heights[vs_type].append(s)
    return heights
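
As a side note, the comment-skipping loop above can also be delegated to pandas, which ignores lines starting with a given character. Here is a minimal sketch on synthetic data; the column layout and header lines are made up for illustration, not the exact Hydroweb format:

```python
import io
import pandas as pd

# Hypothetical file resembling a Hydroweb record: '#' header lines
# followed by ';'-separated measurement rows.
text = """#station;lat=-3.5;lon=-60.2
#another header line
0;2009-01-05;06:12;12.34;0;0;0
0;2009-01-15;18:47;12.90;0;0;0
"""

# comment='#' makes read_csv skip the header lines without counting them.
df = pd.read_csv(io.StringIO(text), sep=';', comment='#', header=None,
                 names=['1', 'date', 'time', 'height', '5', '6', '7'])
df['datetime'] = pd.to_datetime(df['date'] + ' ' + df['time'])
s = df.set_index('datetime')['height']
```

This avoids having to compute the number of rows to skip by hand.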
In [3]:
heights = get_heights()
100%|██████████| 840/840 [00:03<00:00, 278.80it/s]
In [4]:
def plot_in_space(heights, alpha):
    lat0, lat1, lon0, lon1 = -30, 15, -90, -30
    plt.close('all')
    fig = plt.figure()
    ax = fig.add_subplot(111)
    m = Basemap(projection='cyl', resolution='l', area_thresh=10000, llcrnrlat=lat0, urcrnrlat=lat1, llcrnrlon=lon0, urcrnrlon=lon1)
    m.drawcoastlines()
    m.drawcountries()
    lat = [float(s.name[4:s.name.find(';')]) for s in heights]
    lon = [float(s.name[s.name.find(';lon=') + 5:]) for s in heights]
    ax.scatter(lon, lat, color='red', alpha=alpha)
    plt.show()
In [5]:
plot_in_space(heights['research'] + heights['operational'], 0.1)

Spatially, there is actually quite a lot of data. Let's see what we have temporally by plotting the normalized water levels.

In [6]:
def plot_in_time(heights, alpha):
    plt.close('all')
    plt.figure(figsize=(12, 3))
    for h in heights:
        # astype(float) returns a copy: h.values can be a view into the
        # Series, so in-place operations would modify the original data.
        s = h.values.astype(float)
        s -= np.min(s)
        s /= np.max(s)
        plt.plot(h.index, s, color='blue', alpha=alpha)
    plt.show()
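
One caveat with this kind of min-max normalization: `h.values` can return a view into the Series' underlying buffer, so in-place NumPy operations on it would silently overwrite the original water levels. A small standalone illustration:

```python
import numpy as np
import pandas as pd

h = pd.Series([2.0, 4.0, 6.0])

# Copy before normalizing; without the copy, the in-place operations
# below would write back into h through the shared buffer.
s = h.values.copy()
s -= s.min()
s /= s.max()

assert list(h) == [2.0, 4.0, 6.0]  # the original Series is untouched
```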
In [7]:
plot_in_time(heights['research'] + heights['operational'], 0.01)

As you can see in the figure above, a lot of data lies between 2003 and 2011, and much less from 2011 on. The former corresponds to the research data, and the latter to the operational data. Let's look at the sampling period of the research data:

In [8]:
def get_sampling_period(heights):
    # Mean time between consecutive measurements, averaged over all stations.
    delta = [np.mean(np.diff(h.index.values)) for h in heights]
    return np.mean(delta).astype('timedelta64[D]')
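
As a quick sanity check of this logic, the mean of the consecutive index differences on a regular synthetic series recovers the expected period (the 10-day series below is made up, not Hydroweb data):

```python
import numpy as np
import pandas as pd

# A synthetic water level series sampled every 10 days.
idx = pd.date_range('2009-01-01', periods=20, freq='10D')
h = pd.Series(np.arange(20), index=idx)

# np.diff gives the consecutive differences of the datetime index.
delta = np.diff(h.index.values).mean().astype('timedelta64[D]')
print(delta)  # 10 days
```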
In [9]:
print(get_sampling_period(heights['research']))
35 days

The research data is sparsely sampled in time: not even one measurement per month on average. Now let's look at the operational data:

In [10]:
print(get_sampling_period(heights['operational']))
10 days

The operational data is sampled much more frequently, but there are far fewer virtual stations:

In [11]:
plot_in_space(heights['operational'], 0.5)

The research data was acquired by the Envisat satellite, and the operational data by the Jason-2/3 satellites, whose different characteristics explain the differences in spatial and temporal sampling. But since the larger the river, the better the quality of the water level measurements, it may not be a good idea to keep the research data, which covers much more territory but also much smaller rivers. So let's focus on the operational data, whose 11 virtual station water levels are shown here:

In [12]:
plot_in_time(heights['operational'], 0.3)

It is also a good idea to keep a relatively small number of virtual stations, because the positions given by Hydroweb don't necessarily match the HydroSHEDS drainage network and flow accumulation data, which we will use for GROSS. Some visual adjustment is required to locate each virtual station on the flow accumulation map. To do that, I plotted the flow accumulation map centered on each virtual station. The left column shows the original position of the virtual station, marked as a white dot; most of the time, it does not fall on the river. So I manually applied a correction that shifts the position by a number of x/y pixels. The right column shows the corrected position. Sometimes the correction is trivial, because there is only one river around, but when the virtual station is near a confluence we need to look carefully at where it is supposed to be. Fortunately, Hydroweb provides a map of the virtual stations overlaid on satellite imagery, which makes it possible to match them with the flow accumulation map.
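
The pixel arithmetic behind this correction is just the raster's affine transform: the inverse maps (lon, lat) to (col, row), the correction shifts the pixel, and the forward transform maps back. For a north-up grid this reduces to simple scaling and translation, sketched below with an assumed 15-arc-second resolution and an assumed upper-left corner (the real values come from the HydroSHEDS file's transform, `acc_aff` below):

```python
res = 1 / 240          # 15 arc-seconds in degrees (assumed resolution)
x0, y0 = -90.0, 15.0   # upper-left corner of the grid (assumed)

def world_to_pixel(lon, lat):
    # Mirrors ~acc_aff * (lon, lat): geographic -> pixel indices.
    return int(round((lon - x0) / res)), int(round((y0 - lat) / res))

def pixel_to_world(col, row):
    # Mirrors acc_aff * (col, row): pixel indices -> geographic.
    return x0 + col * res, y0 - row * res

col, row = world_to_pixel(-60.0, -3.0)
# Shift by a pixel correction (dx, dy), then map back.
lon2, lat2 = pixel_to_world(col + 2, row - 1)
```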

In [13]:
# rasterio.Env replaces the deprecated rasterio.drivers context manager,
# and src.transform replaces the deprecated src.affine attribute.
with rasterio.Env():
    with rasterio.open('tmp/sa_acc_15s/sa_acc_15s/w001001.adf') as src:
        data, = src.read()
        acc_aff = src.transform
In [14]:
lat = [float(s.name[4:s.name.find(';')]) for s in heights['operational']]
lon = [float(s.name[s.name.find(';lon=') + 5:]) for s in heights['operational']]
# Manual (dy, dx) pixel corrections, found by visual inspection.
correction = [[13, 0], [2, 0], [-9, 0], [1, -1], [2, 0], [-1, -1], [0, -1], [17, 0], [-4, 0], [-5, 9], [0, 0]]
new_latlon = []
for i, lonlat in enumerate(zip(lon, lat)):
    # Inverse affine transform: (lon, lat) -> (col, row) pixel indices.
    x, y = [int(round(j, 5)) for j in ~acc_aff * lonlat]
    plt.close('all')
    f, (ax1, ax2) = plt.subplots(1, 2)
    f.set_size_inches(10, 10)
    d = 20  # half-width of the window around the station, in pixels
    ax1.imshow(data[y-d:y+d, x-d:x+d], interpolation='nearest')
    ax1.scatter(d, d, color='white')
    ax1.set_title('lat, lon = ' + str(lonlat[1]) + ', ' + str(lonlat[0]))
    dy, dx = correction[i]
    # Forward affine transform: corrected pixel -> (lon, lat).
    lonlat2 = acc_aff * (x + dx, y + dy)
    new_latlon.append(lonlat2[::-1])
    ax2.imshow(data[y-d+dy:y+d+dy, x-d+dx:x+d+dx], interpolation='nearest')
    ax2.scatter(d, d, color='white')
    dlat = round(lonlat2[1] - lonlat[1], 5)
    dlon = round(lonlat2[0] - lonlat[0], 5)
    ax2.set_title('dlat, dlon = ' + str(dlat) + ', ' + str(dlon))
    plt.show()

Let's check that the flow accumulation at the position of each virtual station is reasonably high.

In [15]:
for latlon in new_latlon:
    x, y = [int(round(i, 5)) for i in ~acc_aff * latlon[::-1]]
    print(data[y, x])
1819885
13755332
24849260
53104
8506136
10397620
840321
1771412
20724865
3331555
21354315

Which looks okay. We will use these virtual station positions in the basin partitioning, which will be explained in another notebook.

In [16]:
with open('tmp/vs_latlon.pkl', 'wb') as f:
    pickle.dump(new_latlon, f)
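
Reading the saved positions back in the other notebook is symmetric: open the file in binary read mode and `pickle.load` it. A self-contained round-trip sketch, with hypothetical coordinates standing in for `new_latlon`:

```python
import os
import pickle
import tempfile

# Hypothetical virtual-station positions (lat, lon).
latlon = [(-3.29583, -60.61029), (-2.0, -55.5)]

path = os.path.join(tempfile.gettempdir(), 'vs_latlon.pkl')
with open(path, 'wb') as f:
    pickle.dump(latlon, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)
```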