Flagging Clogs: Finding U or C¶

As seen above, data must be processed for anomolous precipitation and other quality issues before it can be used to identify clogs. Once complete, clog QA can begin.

The first step is to apply the work above to apply one of 2 flags:

U: this flag indicates undercatch, or a period where a gauge measured less rain than likely fell. This could be because of a clog that blocked measurement or a gauge float that was stuck.
C: this flag indicates cumulative precip. Counter intuitively, this identifies times when more precip was measured because of a clog. For example, a clog may slowly drip partial precipitation into the gauge even after the storm has passed or it could suddenly burst, dumping 2 or 3 weeks of precipitation in a single 5 minute period. Either way, rain free or low rain periods will show more precip than actually occurred.

This will require some initial work to restructure the data to process more than one site at a time.

[1]:

import pandas as pd
import matplotlib.pyplot as plt
from numpy import nan, arange

# Jupyter magic to make plots display interactive
# must install ipympl (Ipython-matplotlib) and nodejs
from ipywidgets.embed import embed_minimal_html

%matplotlib ipympl

# expand all plots to comfortable viewing size
plt.rcParams['figure.figsize'] = [8, 5]

import sys
sys.path.append("../")
%load_ext autoreload
%autoreload explicit
%aimport post_gce_qc.cross_probe_qc
from post_gce_qc import qaqc, data_transfer, cross_probe_qc, main

Get Some Data structured¶

Use the main process to get a dictionary of processed data.

[2]:

all_flags = main.main(2018, 2022, probes={'all_params'})

Loading all PPT data from ../config.yaml

Load data from UPL_02

All quality checks and quality assurance rules applied to UPL_02
------------------

Load data from CEN_01

All quality checks and quality assurance rules applied to CEN_01
------------------

Load data from CEN_02

All quality checks and quality assurance rules applied to CEN_02
------------------

Load data from CS2_02

All quality checks and quality assurance rules applied to CS2_02
------------------

[3]:

all_flags

[3]:

{'UPL_02': <post_gce_qc.qaqc.ApplyFlags at 0x16c6673d0>,
 'CEN_01': <post_gce_qc.qaqc.ApplyFlags at 0x340df68c0>,
 'CEN_02': <post_gce_qc.qaqc.ApplyFlags at 0x340ca5600>,
 'CS2_02': <post_gce_qc.qaqc.ApplyFlags at 0x3409c6c50>}

[4]:

all_flags['CEN_01'].data

[4]:

	tank_height	precip	adj_precip
Date
2017-10-01 00:00:00	239.5	0.0	0.0
2017-10-01 00:05:00	239.699997	0.2	0.2
2017-10-01 00:10:00	239.800003	0.3	0.3
2017-10-01 00:15:00	239.899994	0.1	0.1
2017-10-01 00:20:00	239.800003	0.0	0.0
...	...	...	...
2022-09-30 23:35:00	29.469999	0.0	0.0
2022-09-30 23:40:00	29.459999	0.0	0.0
2022-09-30 23:45:00	29.48	0.0	0.0
2022-09-30 23:50:00	29.469999	0.0	0.0
2022-09-30 23:55:00	29.48	0.0	0.0

525888 rows × 3 columns

[7]:

cp = cross_probe_qc.BuildXTable()
adj = all_flags['CEN_02'].data['adj_precip']

[8]:

acc_tbl = pd.DataFrame()
for probe in all_flags:
    flags = all_flags[probe].data
    acc_tbl[probe] = flags['adj_precip'].groupby(pd.Grouper(freq='YE-SEP')).cumsum()

[9]:

acc_tbl

[9]:

	UPL_02	CEN_01	CEN_02	CS2_02
Date
2017-10-01 00:00:00	0.0	0.0	0.0	0.0
2017-10-01 00:05:00	0.33	0.2	0.2	<NA>
2017-10-01 00:10:00	0.66	0.5	0.7	<NA>
2017-10-01 00:15:00	0.66	0.6	0.7	0.254
2017-10-01 00:20:00	0.98	0.6	0.9	<NA>
...	...	...	...	...
2022-09-30 23:35:00	2510.410156	2240.721924	2169.219971	<NA>
2022-09-30 23:40:00	2510.410156	2240.721924	2169.219971	<NA>
2022-09-30 23:45:00	2510.410156	2240.721924	2169.219971	<NA>
2022-09-30 23:50:00	2510.410156	2240.721924	2169.219971	<NA>
2022-09-30 23:55:00	2510.410156	2240.721924	2169.219971	<NA>

525888 rows × 4 columns

[10]:

acc_tbl.dropna().plot(grid=True, legend=True)

[10]:

<Axes: xlabel='Date'>

I can tell that I have my work cut out for me: some years uplo and CS2 are close, and some years CS2 and CENT are close. Plus, there is an obvious problem with the 2019 data at UPLO. But since the NOAH IV at CS2 is on a 15 minute schedule, let’s fill the NAN’s.

[10]:

pd.options.display.max_rows = 70
pd.options.display.min_rows = 20
acc_tbl.CS2_02

[10]:

Date
2017-10-01 00:00:00      0.0
2017-10-01 00:05:00     <NA>
2017-10-01 00:10:00     <NA>
2017-10-01 00:15:00    0.254
2017-10-01 00:20:00     <NA>
2017-10-01 00:25:00     <NA>
2017-10-01 00:30:00    0.254
2017-10-01 00:35:00     <NA>
2017-10-01 00:40:00     <NA>
2017-10-01 00:45:00    0.254
                       ...
2022-09-30 23:10:00     <NA>
2022-09-30 23:15:00     <NA>
2022-09-30 23:20:00     <NA>
2022-09-30 23:25:00     <NA>
2022-09-30 23:30:00     <NA>
2022-09-30 23:35:00     <NA>
2022-09-30 23:40:00     <NA>
2022-09-30 23:45:00     <NA>
2022-09-30 23:50:00     <NA>
2022-09-30 23:55:00     <NA>
Name: CS2_02, Length: 525888, dtype: float[pyarrow]

[11]:

acc_tbl = pd.DataFrame()
for probe in all_flags:
    flags = all_flags[probe].data
    acc_tbl[probe] = flags['adj_precip'].groupby(pd.Grouper(freq='YE-SEP')).cumsum()
    if acc_tbl[probe].isna().any(): acc_tbl[probe] = acc_tbl[probe].ffill()

[12]:

pd.options.display.min_rows = 30
acc_tbl

[12]:

	UPL_02	CEN_01	CEN_02	CS2_02
Date
2017-10-01 00:00:00	0.0	0.0	0.0	0.0
2017-10-01 00:05:00	0.33	0.2	0.2	0.0
2017-10-01 00:10:00	0.66	0.5	0.7	0.0
2017-10-01 00:15:00	0.66	0.6	0.7	0.254
2017-10-01 00:20:00	0.98	0.6	0.9	0.254
2017-10-01 00:25:00	0.98	0.6	1.0	0.254
2017-10-01 00:30:00	0.98	0.6	1.0	0.254
2017-10-01 00:35:00	1.31	0.6	1.0	0.254
2017-10-01 00:40:00	1.31	0.8	1.0	0.254
2017-10-01 00:45:00	1.31	1.0	1.4	0.254
2017-10-01 00:50:00	1.31	1.0	1.4	0.254
2017-10-01 00:55:00	1.31	1.0	1.4	0.254
2017-10-01 01:00:00	1.31	1.0	1.4	0.254
2017-10-01 01:05:00	1.31	1.0	1.4	0.254
2017-10-01 01:10:00	1.31	1.0	1.4	0.254
...	...	...	...	...
2022-09-30 22:45:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 22:50:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 22:55:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 23:00:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 23:05:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 23:10:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 23:15:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 23:20:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 23:25:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 23:30:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 23:35:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 23:40:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 23:45:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 23:50:00	2510.410156	2240.721924	2169.219971	1464.310059
2022-09-30 23:55:00	2510.410156	2240.721924	2169.219971	1464.310059

525888 rows × 4 columns

[13]:

# calculate the ratio of CEN SA to the other sites

base_site = acc_tbl['CEN_01']
cross_site = acc_tbl[acc_tbl.columns[~acc_tbl.columns.isin(['CEN_01'])]]

ratio = pd.DataFrame()
for col in cross_site:
    ratio[col] = (base_site - cross_site[col])/base_site

ratio.tail()

[13]:

	UPL_02	CEN_02	CS2_02
Date
2022-09-30 23:35:00	-0.120358	0.03191	0.346501
2022-09-30 23:40:00	-0.120358	0.03191	0.346501
2022-09-30 23:45:00	-0.120358	0.03191	0.346501
2022-09-30 23:50:00	-0.120358	0.03191	0.346501
2022-09-30 23:55:00	-0.120358	0.03191	0.346501

Attempts to perform base - cross by base.subtract(cross) or by base-cross, subtracting multiple columns from a single column, caused a memory error and took an excessively long time. Some attempts to improve performance below.

[13]:

%%timeit
# baseline for loop
base_site = acc_tbl['CEN_01']
cross_site = acc_tbl[acc_tbl.columns[~acc_tbl.columns.isin(['CEN_01'])]]

ratio = pd.DataFrame()
for col in cross_site:
    ratio[col] = (base_site - cross_site[col])/base_site

1.43 ms ± 12.7 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

[14]:

%%timeit
# pre-allocating before for loop
base_site = acc_tbl['CEN_01']
cross_site = acc_tbl[acc_tbl.columns[~acc_tbl.columns.isin(['CEN_01'])]]

ratio = pd.DataFrame(index=cross_site.index, columns=cross_site.columns)
for col in cross_site:
    ratio[col] = (base_site - cross_site[col])/base_site

10 ms ± 306 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

[15]:

%%timeit
# pre allocate, but use a list of columns
base_site = acc_tbl['CEN_01']
cross_site = acc_tbl[acc_tbl.columns[~acc_tbl.columns.isin(['CEN_01'])]]

ratio = pd.DataFrame(index=cross_site.index, columns=list(cross_site.columns))
for col in cross_site:
    ratio[col] = (base_site - cross_site[col])/base_site

9.8 ms ± 177 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

[16]:

%%timeit
# pre-allocated index only (no columns)
base_site = acc_tbl['CEN_01']
cross_site = acc_tbl[acc_tbl.columns[~acc_tbl.columns.isin(['CEN_01'])]]

ratio = pd.DataFrame(index=cross_site.index)
for col in cross_site:
    ratio[col] = (base_site - cross_site[col])/base_site

891 μs ± 13.4 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

[18]:

%%timeit
# pre-allocate index-only and use function inside for loop
def calc_ratio(base, cross):
    return (base - cross)/base

base_site = acc_tbl['CEN_01']
cross_site = acc_tbl[acc_tbl.columns[~acc_tbl.columns.isin(['CEN_01'])]]

ratio = pd.DataFrame(index=cross_site.index)
for col in cross_site:
    ratio[col] = calc_ratio(base_site, cross_site[col])

873 μs ± 9.09 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

[17]:

%%timeit
# try to vectorize
base_site = acc_tbl['CEN_01']
cross_site = - acc_tbl[acc_tbl.columns[~acc_tbl.columns.isin(['CEN_01'])]]

dif = cross_site.add(base_site, axis=0)
ratio = dif.divide(base_site, axis=0)

722 μs ± 8.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

[19]:

%%timeit
# pre-allocate and use DataFrame.apply() instead of for loop
def calc_ratio(base, cross):
    return (base - cross)/base

base_site = acc_tbl['CEN_01']
cross_site = acc_tbl[acc_tbl.columns[~acc_tbl.columns.isin(['CEN_01'])]]

ratio = cross_site.apply(lambda x: calc_ratio(base_site, x), axis=0)

742 μs ± 4.24 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

[22]:

import swifter

[23]:

%%timeit
# pre-allocate and use swifter with DataFrame.apply()
def calc_ratio(base, cross):
    return (base - cross)/base

base_site = acc_tbl['CEN_01']
cross_site = acc_tbl[acc_tbl.columns[~acc_tbl.columns.isin(['CEN_01'])]]

ratio = cross_site.swifter.apply(lambda x: calc_ratio(base_site, x), axis=0)

56.5 s ± 646 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[19]:

from numba import jit
@jit(cache=True, parallel=True)
def calc_ratio(base, cross):
    return (base - cross)/base

[20]:

%%timeit
# pre-allocate and use DataFrame.apply() but compile function in numba
base_site = acc_tbl['CEN_01'].to_numpy()
cross_site = acc_tbl[acc_tbl.columns[~acc_tbl.columns.isin(['CEN_01'])]]

ratio = cross_site.apply(lambda x: calc_ratio(base_site, x.to_numpy()), axis=0)

1.12 ms ± 50.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

[21]:

%%timeit
base_site = acc_tbl['CEN_01'].to_numpy()
cross_site = acc_tbl[acc_tbl.columns[~acc_tbl.columns.isin(['CEN_01'])]]

ratio = pd.DataFrame(index=cross_site.index)
for col in cross_site:
    ratio[col] = calc_ratio(base_site, cross_site[col].to_numpy())

1.24 ms ± 42.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Let’s start by checking against CENT shelter since I’m very familiar with the response from Capturing-clogs-from-ACC-ratio. Then take a look at the data overall.

[14]:

def calc_ratio(base, cross):
    return (base - cross)/base

base_site = acc_tbl['CEN_01']
cross_site = acc_tbl[acc_tbl.columns[~acc_tbl.columns.isin(['CEN_01'])]]

ratio = cross_site.apply(lambda x: calc_ratio(base_site, x), axis=0)

plt.figure()
ratio['CEN_02'].plot(grid=True, legend=True)

[14]:

<Axes: xlabel='Date'>

Now that the data is restructured, I’ll start by looking at CENT SA, since I’ve already done a lot of work with it. It is important to be able to identify:

periods where there is a clog
Whether the clog is creating an increase (C), a decrease (U), or no change (i.e. dry periods) in precipitation
Periods where the paired sensor is clogged and cannot provide a comparison

[15]:

ratio.plot(grid=True, legend=True)

[15]:

<Axes: xlabel='Date'>

CENT 02 (Shelter gauge)¶

Fix overflagging issue¶

[16]:

base_site = 'CEN_01'
base_flag = all_flags[base_site]

[17]:

site = 'CEN_02'
c_rat = ratio[site]

norm_ratio = 0.025
min_accum = 34
drop_threshold = 0.01
drop_window = '8D'

above_min = (acc_tbl[[site, base_site]] > min_accum).all(axis=1)
below_normal = c_rat < norm_ratio
# static function
dropping_ratio = qaqc.QaRules.find_drops(ratio, drop_threshold, col=site, wind=drop_window)

clog = above_min & (dropping_ratio | below_normal)

[18]:

plt.figure()
ratio[site].plot(grid=True, legend=True)
ratio.loc[clog, site].plot(grid=True, legend=True, linestyle='', marker='o')

[18]:

<Axes: xlabel='Date'>

[19]:

acc_tbl[[base_site, site]].plot(grid=True, legend=True)

[19]:

<Axes: xlabel='Date'>

For some reason CEN01 gets behind in Dec and doesn’t catch up. First I’ll see if the raw data fixes this. But I can see a few flat lines with jumps in February and April, so this must be Dec specific. I’ll look at the flag explanations to see what they say.

[20]:

unadjacc = cross_probe_qc.BuildXTable.calc_wy_acc(base_flag.data.precip)

plt.figure()
ax1 = unadjacc.plot(grid=True, legend=True)
acc_tbl[[base_site, site]].plot(grid=True, legend=True, ax=ax1)

[20]:

<Axes: xlabel='Date'>

[22]:

acc_tbl[site].plot(grid=True, legend=True)

[22]:

<Axes: xlabel='Date'>

[21]:

find_qa = base_flag.event.explanation != ''
base_flag.event[find_qa]

[21]:

	prov_flag	QaRule_flag	event_code	explanation
Date
2017-10-19 14:45:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2017-10-19 14:50:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2017-10-19 14:55:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2017-10-19 15:00:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2017-10-19 15:05:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2017-10-19 15:10:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2017-10-19 15:15:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2017-10-19 15:20:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2017-10-19 15:25:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2017-10-19 15:30:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2018-01-20 05:00:00	<NA>	E	INTPRO	QaRule AutoFlag: duplicate;
2018-01-20 05:05:00	<NA>	E	INTPRO	QaRule AutoFlag: duplicate;
2018-01-20 05:10:00	<NA>	E	INTPRO	QaRule AutoFlag: duplicate;
2018-01-20 05:15:00	<NA>	E	INTPRO	QaRule AutoFlag: duplicate;
2018-01-20 05:20:00	<NA>	E	INTPRO	QaRule AutoFlag: duplicate;
...	...	...	...	...
2022-03-13 02:00:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2022-03-13 02:05:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2022-03-13 02:10:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2022-03-13 02:15:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2022-03-13 02:20:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2022-03-13 02:25:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2022-03-13 02:30:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2022-03-13 02:35:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2022-03-13 02:40:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2022-03-13 02:45:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2022-03-13 02:50:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2022-03-13 02:55:00	MM	E	INTPRO	QaRule AutoFlag: duplicate;
2022-09-06 12:50:00	<NA>			QaRule AutoFlag: tank_empty;
2022-09-06 12:55:00	<NA>			QaRule AutoFlag: tank_empty;
2022-09-06 13:00:00	<NA>			QaRule AutoFlag: tank_empty;

1365 rows × 4 columns

[22]:

base_flag.data[find_qa]

[22]:

	tank_height	precip	adj_precip
Date
2017-10-19 14:45:00	346.5	0.3	0.0
2017-10-19 14:50:00	346.5	0.3	0.0
2017-10-19 14:55:00	346.5	0.3	0.0
2017-10-19 15:00:00	346.5	0.3	0.0
2017-10-19 15:05:00	346.5	0.3	0.0
2017-10-19 15:10:00	346.5	0.3	0.0
2017-10-19 15:15:00	346.5	0.3	0.0
2017-10-19 15:20:00	346.5	0.3	0.0
2017-10-19 15:25:00	346.5	0.3	0.0
2017-10-19 15:30:00	346.5	0.3	0.0
2018-01-20 05:00:00	209.800003	0.2	0.0
2018-01-20 05:05:00	209.800003	0.2	0.0
2018-01-20 05:10:00	209.800003	0.2	0.0
2018-01-20 05:15:00	209.800003	0.2	0.0
2018-01-20 05:20:00	209.800003	0.2	0.0
...	...	...	...
2022-03-13 02:00:00	39.169998	0.26	0.0
2022-03-13 02:05:00	39.169998	0.26	0.0
2022-03-13 02:10:00	39.169998	0.26	0.0
2022-03-13 02:15:00	39.169998	0.26	0.0
2022-03-13 02:20:00	39.169998	0.26	0.0
2022-03-13 02:25:00	39.169998	0.26	0.0
2022-03-13 02:30:00	39.169998	0.26	0.0
2022-03-13 02:35:00	39.169998	0.26	0.0
2022-03-13 02:40:00	39.169998	0.26	0.0
2022-03-13 02:45:00	39.169998	0.26	0.0
2022-03-13 02:50:00	39.169998	0.26	0.0
2022-03-13 02:55:00	39.169998	0.26	0.0
2022-09-06 12:50:00	0.0	0.0	0.0
2022-09-06 12:55:00	8.97	6.7	0.0
2022-09-06 13:00:00	7.248	4.978	0.0

1365 rows × 3 columns

[23]:

plt.figure()
strt, end = pd.to_datetime('12/25/2018'), pd.to_datetime('12/26/2018')
ax1 = base_flag.data.loc[strt:end, 'tank_height'].plot(grid=True, legend=True, linestyle='-.')
ax2 = plt.twinx(ax1)
base_flag.data.loc[strt:end, ['precip', 'adj_precip']].plot(grid=True, legend=True, ax=ax2)

[23]:

<Axes: xlabel='Date'>

There is no adj_precip until after the spike. And that huge chunk of precip is missing. Let’s check that out, but probably need to unflag something.

[24]:

strt, end = pd.to_datetime('12/25/2018 0900'), pd.to_datetime('12/25/2018 1000')
pd.set_option('display.max_rows', None)
base_flag.data[strt:end]

[24]:

	tank_height	precip	adj_precip
Date
2018-12-25 09:00:00	134.300003	0.0	<NA>
2018-12-25 09:05:00	134.399994	0.0	<NA>
2018-12-25 09:10:00	134.300003	0.0	<NA>
2018-12-25 09:15:00	134.300003	0.0	<NA>
2018-12-25 09:20:00	307.399994	173.100006	<NA>
2018-12-25 09:25:00	307.299988	173.0	0.0
2018-12-25 09:30:00	307.299988	0.0	0.0
2018-12-25 09:35:00	307.299988	0.0	0.0
2018-12-25 09:40:00	307.399994	0.1	0.1
2018-12-25 09:45:00	307.399994	0.0	0.0
2018-12-25 09:50:00	307.299988	0.0	0.0
2018-12-25 09:55:00	307.399994	0.0	0.0
2018-12-25 10:00:00	307.299988	0.0	0.0

[25]:

base_flag.event[strt:end]

[25]:

	prov_flag	QaRule_flag	event_code	explanation
Date
2018-12-25 09:00:00	M
2018-12-25 09:05:00	M
2018-12-25 09:10:00	M
2018-12-25 09:15:00	M
2018-12-25 09:20:00	RM
2018-12-25 09:25:00	<NA>	E	INTPRO	QaRule AutoFlag: duplicate;
2018-12-25 09:30:00	<NA>
2018-12-25 09:35:00	<NA>
2018-12-25 09:40:00	<NA>
2018-12-25 09:45:00	<NA>
2018-12-25 09:50:00	<NA>
2018-12-25 09:55:00	<NA>
2018-12-25 10:00:00	<NA>

OK, so GCE flagged the first value as Missing, which applied a 0 to the precip. But :meth:post_gce_qc.qaqc.QaRules caught the second value as a duplicate. So both were removed. I will address this with a manual flag.

[26]:

#del all_flags, ratio
all_flags = main.main(2018, 2022, probes={'all_params'})
bct = cross_probe_qc.BuildXTable()
tbl = bct.assemble_cross_table(all_flags, ppt_col='adj_precip')
acc_tbl = tbl.apply(bct.calc_wy_acc)

Loading all PPT data from ../config.yaml

Load data from UPL_02

All quality checks and quality assurance rules applied to UPL_02
------------------

Load data from CEN_01

All quality checks and quality assurance rules applied to CEN_01
------------------

Load data from CEN_02

All quality checks and quality assurance rules applied to CEN_02
------------------

Load data from CS2_02

All quality checks and quality assurance rules applied to CS2_02
------------------

[27]:

def calc_ratio(base, cross):
    return (base - cross)/base

base_site = acc_tbl['CEN_01']
cross_site = acc_tbl[acc_tbl.columns[~acc_tbl.columns.isin(['CEN_01'])]]

ratio = cross_site.apply(lambda x: calc_ratio(base_site, x), axis=0)

[28]:

base_site = 'CEN_01'
base_flag = all_flags[base_site]

strt, end = pd.to_datetime('12/25/2018 0900'), pd.to_datetime('12/25/2018 1000')
base_flag.event[strt:end]

[28]:

	prov_flag	QaRule_flag	event_code	explanation
Date
2018-12-25 09:00:00	M
2018-12-25 09:05:00	M
2018-12-25 09:10:00	M
2018-12-25 09:15:00	M
2018-12-25 09:20:00	RM			ManualFlag: Remove GCE M flag during clog period to include rapid unclog and catch up of cumulative precip.;
2018-12-25 09:25:00	<NA>	E	INTPRO	QaRule AutoFlag: duplicate;
2018-12-25 09:30:00	<NA>
2018-12-25 09:35:00	<NA>
2018-12-25 09:40:00	<NA>
2018-12-25 09:45:00	<NA>
2018-12-25 09:50:00	<NA>
2018-12-25 09:55:00	<NA>
2018-12-25 10:00:00	<NA>

[29]:

base_flag.data[strt:end]

[29]:

	tank_height	precip	adj_precip
Date
2018-12-25 09:00:00	134.300003	0.0	<NA>
2018-12-25 09:05:00	134.399994	0.0	<NA>
2018-12-25 09:10:00	134.300003	0.0	<NA>
2018-12-25 09:15:00	134.300003	0.0	<NA>
2018-12-25 09:20:00	307.399994	173.100006	173.100006
2018-12-25 09:25:00	307.299988	173.0	0.0
2018-12-25 09:30:00	307.299988	0.0	0.0
2018-12-25 09:35:00	307.299988	0.0	0.0
2018-12-25 09:40:00	307.399994	0.1	0.1
2018-12-25 09:45:00	307.399994	0.0	0.0
2018-12-25 09:50:00	307.299988	0.0	0.0
2018-12-25 09:55:00	307.399994	0.0	0.0
2018-12-25 10:00:00	307.299988	0.0	0.0

Undercatch or Cumulative: WY 2019¶

This WY has already had substantial work. I’ll start by using the criteria developed and writing a code to assign a flag for these events: U-undercatch, or C-cumulative precip.

[30]:

base_site = 'CEN_01'
base_flag = all_flags[base_site]
site = 'CEN_02'
c_rat = ratio[site]

norm_ratio = 0.025
min_accum = 34
drop_threshold = 0.01
drop_window = '8D'

above_min = (acc_tbl[[site, base_site]] > min_accum).all(axis=1)
below_normal = c_rat < norm_ratio
# static function
dropping_ratio = qaqc.QaRules.find_drops(ratio, drop_threshold, col=site, wind=drop_window)

clog = above_min & (dropping_ratio | below_normal)


plt.figure()
ax1 = ratio[site].plot(grid=True, legend=True)
ratio.loc[clog, site].plot(grid=True, legend=True, linestyle='', marker='o', ax=ax1)

[30]:

<Axes: xlabel='Date'>

10/29/18 example¶

This is a good example to start with. It get’s behind, it catches most of the way up, then it has this slow drip catch up. Some good undercatch followed by some mixed catch up. Perfect for a test.

[32]:

# add flags for clog
base_flag.flags['U'] |= clog
# grab a test day
tday = base_flag.get_flagged_days()[20]
# use clog as event DataFrame
ax1 = base_flag.plot_flagged_day(tday, base_site, auto_qa_event=pd.DataFrame(data=clog, columns=['clog']), paired_tank=all_flags[site].data['tank_height'])

[141]:

for f in range(10,16,1):
    plt.close(f)

[33]:

pd.options.display.max_rows = 100
pd.options.display.min_rows = 50

Clogs are the the big purple dots. The large catch up value is obvious, but we probably need to take a closer look at the small precipitation amounts afterwards and during the early part of the clog.

[34]:

strt, end = pd.to_datetime('10/28/18'), pd.to_datetime('10/31/18')
tbl.loc[strt:end].plot(legend=True, grid=True, linestyle='', marker='.')
tbl.loc[clog, base_site].plot(grid=True, legend=True, linestyle='', marker='o')

[34]:

<Axes: xlabel='Date'>

[37]:

strt, end = pd.to_datetime('10/29/18'), pd.to_datetime('10/31/18 0300')
tbl.loc[strt:end]

[37]:

	UPL_02	CEN_01	CEN_02	CS2_02
Date
2018-10-29 00:00:00	0.0	0.5	0.0	0.254
2018-10-29 00:05:00	0.0	0.0	0.0	0.254
2018-10-29 00:10:00	0.0	0.0	0.0	0.254
2018-10-29 00:15:00	0.0	0.0	0.2	1.016
2018-10-29 00:20:00	0.0	0.2	0.5	1.016
2018-10-29 00:25:00	0.0	0.3	0.1	1.016
2018-10-29 00:30:00	0.0	0.0	0.0	0.508
2018-10-29 00:35:00	0.0	0.1	0.0	0.508
2018-10-29 00:40:00	0.0	0.1	0.2	0.508
2018-10-29 00:45:00	0.0	0.0	0.0	0.0
2018-10-29 00:50:00	0.0	0.0	0.0	0.0
2018-10-29 00:55:00	0.0	0.3	0.0	0.0
2018-10-29 01:00:00	0.3	0.0	0.0	1.016
2018-10-29 01:05:00	0.0	0.0	0.0	1.016
2018-10-29 01:10:00	0.0	0.0	0.0	1.016
2018-10-29 01:15:00	0.4	0.2	0.0	0.762
2018-10-29 01:20:00	0.3	0.3	0.3	0.762
2018-10-29 01:25:00	0.0	0.1	0.4	0.762
2018-10-29 01:30:00	0.3	0.0	0.3	1.27
2018-10-29 01:35:00	0.0	0.3	0.1	1.27
2018-10-29 01:40:00	0.0	0.3	0.4	1.27
2018-10-29 01:45:00	0.4	0.4	0.3	0.762
2018-10-29 01:50:00	0.3	0.3	0.0	0.762
2018-10-29 01:55:00	0.3	0.0	0.3	0.762
2018-10-29 02:00:00	0.0	0.1	0.0	0.254
...	...	...	...	...
2018-10-31 01:00:00	0.0	0.0	0.0	0.0
2018-10-31 01:05:00	0.0	0.0	0.0	0.0
2018-10-31 01:10:00	0.0	0.0	0.0	0.0
2018-10-31 01:15:00	0.0	0.0	0.0	0.254
2018-10-31 01:20:00	0.0	0.0	0.0	0.254
2018-10-31 01:25:00	0.0	0.0	0.0	0.254
2018-10-31 01:30:00	0.0	1.0	0.0	0.508
2018-10-31 01:35:00	0.0	0.1	0.1	0.508
2018-10-31 01:40:00	0.0	0.3	0.4	0.508
2018-10-31 01:45:00	0.0	0.0	0.3	1.016
2018-10-31 01:50:00	0.0	0.3	0.0	1.016
2018-10-31 01:55:00	0.0	0.1	0.3	1.016
2018-10-31 02:00:00	0.0	0.7	0.0	0.508
2018-10-31 02:05:00	0.0	0.1	0.0	0.508
2018-10-31 02:10:00	0.0	0.2	0.4	0.508
2018-10-31 02:15:00	1.3	0.3	0.5	0.254
2018-10-31 02:20:00	0.0	0.0	0.0	0.254
2018-10-31 02:25:00	0.0	0.1	0.0	0.254
2018-10-31 02:30:00	0.3	0.0	0.0	0.508
2018-10-31 02:35:00	0.0	0.2	0.0	0.508
2018-10-31 02:40:00	0.0	0.1	0.2	0.508
2018-10-31 02:45:00	0.4	0.1	0.0	0.254
2018-10-31 02:50:00	0.3	0.1	0.3	0.254
2018-10-31 02:55:00	0.2	0.2	0.0	0.254
2018-10-31 03:00:00	0.0	0.0	0.2	0.254

613 rows × 4 columns

[35]:

tbl.loc[strt:end].plot(legend=True, grid=True, linestyle='', marker='.')
tbl.loc[clog, base_site].plot(grid=True, legend=True, linestyle='', marker='o')

[35]:

<Axes: xlabel='Date'>

At the 5 minute level they can be really different. I think an hour is a good bottom number to start with for a rolling average. Since there are many sites to be compared to, it would be ideal to only run the calculation on the base site. So That will be a starting point.

[152]:

plt.close(12)

[36]:

df_roll = base_flag.data.adj_precip.rolling(window='1H', center=True, min_periods=None)
df_mean =  df_roll.mean(engine='numba').interpolate(method='linear')

tbl.loc[strt:end, [base_site, site]].plot(legend=True, grid=True, linestyle='', marker='.')
tbl.loc[clog, base_site].plot(grid=True, legend=True, linestyle='', marker='o')

for nstd in [0.5, 1, 1.5, 2, 2.5]:
    df_std = df_mean + (nstd * df_roll.std(engine='numba').interpolate(method='linear'))
    df_std.plot(legend=True, grid=True)

/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/3224208315.py:1: FutureWarning: 'H' is deprecated and will be removed in a future version, please use 'h' instead.
  df_roll = base_flag.data.adj_precip.rolling(window='1H', center=True, min_periods=None)

The bottom line (mean plus 1/2 Std) doesn’t seem to contain even most of the base precipitation amounts. And the top line (mean + 2.5 std) seems to almost always contain all of the comparison precipitation, even in that short lived clog before the big clog. Let’s get a few less lines onthe screen. An hour seems to give a pretty nice envelope. Let’s see what it looks like to apply C or U based on this.

[172]:

plt.close(14)

[172]:

plt.close(14)

[37]:

#plt.figure()
tbl.loc[strt:end, [base_site, site]].plot(legend=True, grid=True, linestyle='', marker='.')
tbl.loc[clog, base_site].plot(grid=True, legend=True, linestyle='', marker='o')

for nstd in [ 1, 1.5, 2]:
    df_std = df_mean + (nstd * df_roll.std(engine='numba').interpolate(method='linear'))
    df_std.plot(legend=True, grid=True)
    u = tbl[site] > df_std
    tbl.loc[u, base_site].plot(grid=True, legend=True, linestyle='', marker='$U$')
    c = tbl[site] < df_std
    tbl.loc[c, base_site].plot(grid=True, legend=True, linestyle='', marker='$C$')

Clearly this needs to be filtered to only look during clogs. B/c that is a mess.

[38]:

#plt.figure()
tbl.loc[strt:end, [base_site, site]].plot(legend=True, grid=True, linestyle='', marker='.')
tbl.loc[clog, base_site].plot(grid=True, legend=True, linestyle='', marker='o')

for nstd in [ 1, 1.5]:
    df_std = df_mean + (nstd * df_roll.std(engine='numba').interpolate(method='linear'))
    df_std.plot(legend=True, grid=True)
    non0 = df_std>0
    u = clog & (tbl[site] >= df_std)
    tbl.loc[u, site].plot(grid=True, legend=True, linestyle='', marker='$U$')
    c = clog & (tbl[site] < df_std) & non0
    tbl.loc[c, base_site].plot(grid=True, legend=True, linestyle='', marker='$C$')

The case of zero precipitation from either sensor looks really weird. A brief 5 minute pause where it doesn’t rain or there is a slight shift in the clogged tank and the all of these 0 precipitation moments get tagged with a cumulative precipitation flag. I think there needs to be a running mean for both sensors. I’ll try to adjust the non0 variable first, but I think I’ll need to calculate both running means.

There is also a precision issue here: our sensors are mostly only accurate to 0.28 mm, and we’re parsing some 0.1mm values here…but that is a whole other topic.

[39]:

#plt.figure()
tbl.loc[strt:end, [base_site, site]].plot(legend=True, grid=True, linestyle='', marker='.')
tbl.loc[clog, base_site].plot(grid=True, legend=True, linestyle='', marker='o')

for nstd in [ 1, 1.5]:
    df_std = df_mean + (nstd * df_roll.std(engine='numba').interpolate(method='linear'))
    df_std.plot(legend=True, grid=True)
    non0 = tbl[base_site]>0
    u = clog & (tbl[site] >= df_std)
    tbl.loc[u, site].plot(grid=True, legend=True, linestyle='', marker='$U$')
    c = clog & (tbl[site] < df_std) & non0
    tbl.loc[c, base_site].plot(grid=True, legend=True, linestyle='', marker='$C$')

This attempted to switch the non0 parameter to the actual precipitation, not it’s running average.

That got rid of the ‘C’ on 0, but it still has all these ridiculous individual C’s when there is a momentary pause in CEN_02. That doesn’t seem clear or reasonable.

[186]:

for f in range(17,20):
    plt.close(f)

[40]:

#plt.figure()
xprb = tbl.loc[strt:end, [base_site, site]]
ax1 = xprb[strt:end].plot(legend=True, grid=True, linestyle='', marker='.')
xprb.loc[clog[strt:end], base_site].plot(grid=True, legend=True, linestyle='', marker='o', ax=ax1)

# calc rolling mean and std for pair
df_roll = xprb.rolling(window='1h', center=True, min_periods=None)
df_mean =  df_roll.mean(engine='numba').interpolate(method='linear')
df_std = df_mean + (1.5 * df_roll.std(engine='numba').interpolate(method='linear'))

df_std.plot(legend=True, grid=True, ax=ax1)

# set criteria for undercatch
non0 = df_std[site]>0
u = clog & (df_std[site] >= df_std[base_site]) & non0
xprb.loc[u, base_site].plot(grid=True, legend=True, linestyle='', marker='$U$', ax=ax1)

# set criteria for catchup cumulative totals
non0 = xprb[base_site]>0
c = clog & (df_std[site] < df_std[base_site]) & non0
xprb.loc[c, base_site].plot(grid=True, legend=True, linestyle='', marker='$C$', ax=ax1)

/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/652453100.py:15: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  u = clog & (df_std[site] >= df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/652453100.py:20: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog & (df_std[site] < df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/652453100.py:20: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog & (df_std[site] < df_std[base_site]) & non0

[40]:

<Axes: xlabel='Date'>

[81]:

for f in range(17, 26):
    plt.close(f)

That looks a lot better.

When both sensors read no rain for a few hours it doesn’t get flagged (Green dot shows clog event without flag).
Lot of low precipitation values flagged as ‘U’
when the purple line disappears, any precip is flagged as ‘C’

Now let’s look at more examples from Capturing-clogs-from-ACC-ratio

12/18/18 example¶

[44]:

tday = base_flag.get_flagged_days()[25]

[45]:

base_flag.flags.drop(columns=[''], inplace=True)

Ok, so the legends are a screwed up. It’s sort of a hacky way to call that function. But this is a clear multi day clog. All U followed by a hug C. Let’s see if we can apply the criteria and still have it work right.

[56]:

base_flag.flags = base_flag.flags.drop('', axis=1)

[57]:

# add flags for clog
base_flag.flags['U'] |= clog

clog_event = pd.DataFrame(data=clog, columns=['clog'])
clog_event[['drain_event', 'neg_delta_tank']] = False
# grab a test day
tday = base_flag.get_flagged_days()[26]
# use clog as event DataFrame
ax1 = base_flag.plot_flagged_day(tday, base_site, tdelta='8D',  auto_qa_event=clog_event,
                                 paired_tank=all_flags['CEN_02'].data['tank_height'])

[44]:

plt.close(19)

[58]:

clg_plt = {'clog': {'ax': ax1[1], 'linestyle': '', 'marker': 'o', 'color': 'navy'}}
end = tday + pd.to_timedelta('8D')
all_flags['CEN_01'].data.loc[tday:end, 'tank_height'][clog].plot(grid=True, **clg_plt['clog'])

[58]:

<Axes: title={'center': 'CEN_01 - 2018-12-19 00:00:00'}, xlabel='Date'>

Okay, super hacky way to apply that plotting function (all clog values set to U among other things). But there is a clear, multi-day clog followed by a big cumulative total of all precipitation during the previous clog. Let’s see if the criteria still work.

It looks like one problem, right off the bat, is that the catchup isn’t flagged as part of the clog.

[51]:

site = 'CEN_02'

[59]:

strt, end = pd.to_datetime('12/18/18'), pd.to_datetime('12/26/18')
xprb = tbl.loc[strt:end, [base_site, site]]
ax1 = xprb[strt:end].plot(legend=True, grid=True, linestyle='', marker='.')
xprb.loc[clog[strt:end], base_site].plot(grid=True, legend=True, linestyle='', marker='o', ax=ax1)


# calc rolling mean and std for pair
df_roll = xprb.rolling(window='1h', center=True, min_periods=None)
df_mean =  df_roll.mean(engine='numba').interpolate(method='linear')
df_std = df_mean + (1.5 * df_roll.std(engine='numba').interpolate(method='linear'))

df_std.plot(legend=True, grid=True, ax=ax1)

# set criteria for undercatch
non0 = df_std[site]>0
u = clog & (df_std[site] >= df_std[base_site]) & non0
xprb.loc[u, base_site].plot(grid=True, legend=True, linestyle='', marker='$U$', ax=ax1)

# set criteria for catchup cumulative totals
non0 = xprb[base_site]>0
c = clog & (df_std[site] < df_std[base_site]) & non0
xprb.loc[c, base_site].plot(grid=True, legend=True, linestyle='', marker='$C$', ax=ax1)

/opt/homebrew/Caskroom/miniconda/base/envs/datasci/lib/python3.10/site-packages/pandas/plotting/_matplotlib/core.py:580: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
  fig = self.plt.figure(figsize=self.figsize)
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/386367701.py:16: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  u = clog & (df_std[site] >= df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/386367701.py:21: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog & (df_std[site] < df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/386367701.py:21: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog & (df_std[site] < df_std[base_site]) & non0

[59]:

<Axes: xlabel='Date'>

[60]:

ax2 = xprb[strt:end].plot(legend=True, grid=True, linestyle='', marker='.')
xprb.loc[clog[strt:end], base_site].plot(grid=True, legend=True, linestyle='', marker='o', ax=ax2)
xprb.loc[u, base_site].plot(grid=True, legend=True, linestyle='', marker='$U$', ax=ax2)
xprb.loc[c, base_site].plot(grid=True, legend=True, linestyle='', marker='$C$', ax=ax2)
df_std.plot(legend=True, grid=True, ax=ax2)

[60]:

<Axes: xlabel='Date'>

That’s pretty great, but with one big problem…the cumulative precip at the end was missed…

This kind of makes sense: clogs are defined as below a fixed threshold or with a dropping ratio. The big 175 mm catchup at the end would be both a rising ratio and would come back above the fixed threshold. In development, it made more sense to err on the side of letting clogs start a little late or end a little early rather than over-flagging. Could this be as simple as extending by 1 timestep? A clog event immediately followed by this big amount should be fairly clear, but it should still get a flag.

Everything else looks pretty good. There are moments of unflagged data within the clog event, where it wasn’t raining, so no precipitation was missed. Everything else seems to have an Undercatch flag, which seems perfect.

Also, because of the running average, a single precip measurement in our comparison gauge flags all of our base precipitation for an hour.

[117]:

plt.close(20)

[61]:

end = pd.to_datetime('12/25/18 1000')
pd.options.display.min_rows = 30
clog[strt:end]

[61]:

Date
2018-12-18 00:00:00    False
2018-12-18 00:05:00    False
2018-12-18 00:10:00    False
2018-12-18 00:15:00    False
2018-12-18 00:20:00    False
2018-12-18 00:25:00    False
2018-12-18 00:30:00    False
2018-12-18 00:35:00    False
2018-12-18 00:40:00    False
2018-12-18 00:45:00    False
2018-12-18 00:50:00    False
2018-12-18 00:55:00    False
2018-12-18 01:00:00    False
2018-12-18 01:05:00    False
2018-12-18 01:10:00    False
                       ...
2018-12-25 08:50:00     True
2018-12-25 08:55:00     True
2018-12-25 09:00:00     True
2018-12-25 09:05:00     True
2018-12-25 09:10:00     True
2018-12-25 09:15:00     True
2018-12-25 09:20:00    False
2018-12-25 09:25:00    False
2018-12-25 09:30:00    False
2018-12-25 09:35:00    False
2018-12-25 09:40:00    False
2018-12-25 09:45:00    False
2018-12-25 09:50:00    False
2018-12-25 09:55:00    False
2018-12-25 10:00:00    False
Length: 2137, dtype: bool[pyarrow]

[62]:

xprb[strt:end]

[62]:

	CEN_01	CEN_02
Date
2018-12-18 00:00:00	0.2	0.0
2018-12-18 00:05:00	0.1	0.3
2018-12-18 00:10:00	0.2	0.0
2018-12-18 00:15:00	0.0	0.2
2018-12-18 00:20:00	0.2	0.0
2018-12-18 00:25:00	0.1	0.0
2018-12-18 00:30:00	0.2	0.5
2018-12-18 00:35:00	0.3	0.0
2018-12-18 00:40:00	0.1	0.5
2018-12-18 00:45:00	0.2	0.0
2018-12-18 00:50:00	0.2	0.3
2018-12-18 00:55:00	0.3	0.3
2018-12-18 01:00:00	0.2	0.0
2018-12-18 01:05:00	0.4	0.3
2018-12-18 01:10:00	0.1	0.0
...	...	...
2018-12-25 08:50:00	0.0	0.0
2018-12-25 08:55:00	0.0	0.0
2018-12-25 09:00:00	0.0	0.0
2018-12-25 09:05:00	0.0	0.0
2018-12-25 09:10:00	0.0	0.0
2018-12-25 09:15:00	0.0	0.0
2018-12-25 09:20:00	173.100006	0.0
2018-12-25 09:25:00	0.0	0.0
2018-12-25 09:30:00	0.0	0.0
2018-12-25 09:35:00	0.0	0.0
2018-12-25 09:40:00	0.1	0.0
2018-12-25 09:45:00	0.0	0.0
2018-12-25 09:50:00	0.0	0.0
2018-12-25 09:55:00	0.0	0.0
2018-12-25 10:00:00	0.0	0.0

2137 rows × 2 columns

[63]:

clog.shift(1)[strt:end]

[63]:

Date
2018-12-18 00:00:00    False
2018-12-18 00:05:00    False
2018-12-18 00:10:00    False
2018-12-18 00:15:00    False
2018-12-18 00:20:00    False
2018-12-18 00:25:00    False
2018-12-18 00:30:00    False
2018-12-18 00:35:00    False
2018-12-18 00:40:00    False
2018-12-18 00:45:00    False
2018-12-18 00:50:00    False
2018-12-18 00:55:00    False
2018-12-18 01:00:00    False
2018-12-18 01:05:00    False
2018-12-18 01:10:00    False
                       ...
2018-12-25 08:50:00     True
2018-12-25 08:55:00     True
2018-12-25 09:00:00     True
2018-12-25 09:05:00     True
2018-12-25 09:10:00     True
2018-12-25 09:15:00     True
2018-12-25 09:20:00     True
2018-12-25 09:25:00    False
2018-12-25 09:30:00    False
2018-12-25 09:35:00    False
2018-12-25 09:40:00    False
2018-12-25 09:45:00    False
2018-12-25 09:50:00    False
2018-12-25 09:55:00    False
2018-12-25 10:00:00    False
Length: 2137, dtype: bool[pyarrow]

[64]:

# set criteria for undercatch
non0 = df_std[site]>0
u = clog & (df_std[site] >= df_std[base_site]) & non0
xprb.loc[u, base_site].plot(grid=True, legend=True, linestyle='', marker='$U$', ax=ax1)

# set criteria for catchup cumulative totals
non0 = xprb[base_site]>0
c = clog.shift(1) & (df_std[site] < df_std[base_site]) & non0

/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/1431783776.py:3: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  u = clog & (df_std[site] >= df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/1431783776.py:8: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(1) & (df_std[site] < df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/1431783776.py:8: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(1) & (df_std[site] < df_std[base_site]) & non0

[65]:

strt, end = pd.to_datetime('12/18/18'), pd.to_datetime('12/26/18')
ax3 = xprb[strt:end].plot(legend=True, grid=True, linestyle='', marker='.')
xprb.loc[clog[strt:end], base_site].plot(grid=True, legend=True, linestyle='', marker='o', ax=ax3)
xprb.loc[u, base_site].plot(grid=True, legend=True, linestyle='', marker='$U$', ax=ax3)
xprb.loc[c, base_site].plot(grid=True, legend=True, linestyle='', marker='$C$', ax=ax3)
df_std.plot(legend=True, grid=True, ax=ax3)

[65]:

<Axes: xlabel='Date'>

That works on this one. But I should really look at a broader set to see if it holds up. I like only shifting clog for C, not for U. But will 1 be enough in every case? When VARA is clogged, we often shake for at least 3 x 5 min intervals to get the float and water clog to the top.

Let’s try 3:

That will catch our CENT catch up, but will also hopefully work at VARA
That will prevent C in the first 10 minutes of a clog; i.e. in other words there must be undercatch before there can be cumulative precipitation

Test shifting clog on other examples¶

10/26/18¶

[66]:

strt, end = pd.to_datetime('12/18/18'), pd.to_datetime('12/26/18')

[67]:

strt, end = pd.to_datetime('10/26/18 0000'), pd.to_datetime('10/27/18 0000')

[68]:

xprb = tbl.loc[strt:end, [base_site, site]]
# calc rolling mean and std for pair
df_roll = xprb.rolling(window='1h', center=True, min_periods=None)
df_mean =  df_roll.mean(engine='numba').interpolate(method='linear')
df_std = df_mean + (1.5 * df_roll.std(engine='numba').interpolate(method='linear'))

# set criteria for undercatch
non0 = df_std[site]>0
u = clog & (df_std[site] >= df_std[base_site]) & non0

# set criteria for catchup cumulative totals
non0 = xprb[base_site]>0
c = clog.shift(3) & (df_std[site] < df_std[base_site]) & non0

/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/4257960171.py:9: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  u = clog & (df_std[site] >= df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/4257960171.py:13: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(3) & (df_std[site] < df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/4257960171.py:13: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(3) & (df_std[site] < df_std[base_site]) & non0

[69]:

# add flags for clog
base_flag.flags['U'] |= u
base_flag.flags['C'] |= c

clog_event = pd.DataFrame(data=clog, columns=['clog'])
clog_event[['drain_event', 'neg_delta_tank']] = False
# grab a test day
# use clog as event DataFrame
ax1 = base_flag.plot_flagged_day(strt, base_site, tdelta='2D',  auto_qa_event=clog_event,
                                 paired_tank=all_flags['CEN_02'].data['tank_height'])

/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/2413713732.py:2: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  base_flag.flags['U'] |= u
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/2413713732.py:3: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  base_flag.flags['C'] |= c

10/30/18 - 10/31¶

[70]:

strt, end = pd.to_datetime('10/30/18 0000'), pd.to_datetime('11/1/18 0000')

[71]:

xprb = tbl.loc[strt:end, [base_site, site]]
# calc rolling mean and std for pair
df_roll = xprb.rolling(window='1h', center=True, min_periods=None)
df_mean =  df_roll.mean(engine='numba').interpolate(method='linear')
df_std = df_mean + (1.5 * df_roll.std(engine='numba').interpolate(method='linear'))

# set criteria for undercatch
non0 = df_std[site]>0
u = clog & (df_std[site] >= df_std[base_site]) & non0

# set criteria for catchup cumulative totals
non0 = xprb[base_site]>0
c = clog.shift(3) & (df_std[site] < df_std[base_site]) & non0

/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/4257960171.py:9: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  u = clog & (df_std[site] >= df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/4257960171.py:13: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(3) & (df_std[site] < df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/4257960171.py:13: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(3) & (df_std[site] < df_std[base_site]) & non0

[72]:

# add flags for clog
base_flag.flags['U'] |= u
base_flag.flags['C'] |= c

clog_event = pd.DataFrame(data=clog, columns=['clog'])
clog_event[['drain_event', 'neg_delta_tank']] = False
# grab a test day
# use clog as event DataFrame
ax1 = base_flag.plot_flagged_day(strt, base_site, tdelta='2D',  auto_qa_event=clog_event,
                                 paired_tank=all_flags['CEN_02'].data['tank_height'])

[73]:

ax3 = xprb[strt:end].plot(legend=True, grid=True, linestyle='', marker='.')
xprb.loc[clog[strt:end], base_site].plot(grid=True, legend=True, linestyle='', marker='o', ax=ax3)
xprb.loc[u, base_site].plot(grid=True, legend=True, linestyle='', marker='$U$', ax=ax3)
xprb.loc[c, base_site].plot(grid=True, legend=True, linestyle='', marker='$C$', ax=ax3)
df_std.plot(legend=True, grid=True, ax=ax3)

[73]:

<Axes: xlabel='Date'>

11/21/18¶

[74]:

strt, end = pd.to_datetime('11/21/18 0000'), pd.to_datetime('11/22/18 0000')

[75]:

xprb = tbl.loc[strt:end, [base_site, site]]
# calc rolling mean and std for pair
df_roll = xprb.rolling(window='1h', center=True, min_periods=None)
df_mean =  df_roll.mean(engine='numba').interpolate(method='linear')
df_std = df_mean + (1.5 * df_roll.std(engine='numba').interpolate(method='linear'))

# set criteria for undercatch
non0 = df_std[site]>0
u = clog & (df_std[site] >= df_std[base_site]) & non0

# set criteria for catchup cumulative totals
non0 = xprb[base_site]>0
c = clog.shift(3) & (df_std[site] < df_std[base_site]) & non0

/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/4257960171.py:9: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  u = clog & (df_std[site] >= df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/4257960171.py:13: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(3) & (df_std[site] < df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/4257960171.py:13: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(3) & (df_std[site] < df_std[base_site]) & non0

[76]:

# add flags for clog
base_flag.flags['U'] |= u
base_flag.flags['C'] |= c

clog_event = pd.DataFrame(data=clog, columns=['clog'])
clog_event[['drain_event', 'neg_delta_tank']] = False
# grab a test day
# use clog as event DataFrame
ax1 = base_flag.plot_flagged_day(strt, base_site, tdelta='2D',  auto_qa_event=clog_event,
                                 paired_tank=all_flags['CEN_02'].data['tank_height'])

That even has our wonderful double precipitation flagged as E and Set0. The first value is given a C for cumulative catchup precipitation, and the second one is flagged as an estimated 0 with an INTPRO (internal processing creates artificial numbers) event flagged as Estimated and reset to 0.

11/27/18¶

[84]:

strt, end = pd.to_datetime('11/27/18 0000'), pd.to_datetime('11/28/18 0000')

[85]:

xprb = tbl.loc[strt:end, [base_site, site]]
# calc rolling mean and std for pair
df_roll = xprb.rolling(window='1h', center=True, min_periods=None)
df_mean =  df_roll.mean(engine='numba').interpolate(method='linear')
df_std = df_mean + (1.5 * df_roll.std(engine='numba').interpolate(method='linear'))

# set criteria for undercatch
non0 = df_std[site]>0
u = clog & (df_std[site] >= df_std[base_site]) & non0

# set criteria for catchup cumulative totals
non0 = xprb[base_site]>0
c = clog.shift(3) & (df_std[site] < df_std[base_site]) & non0

/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/4257960171.py:9: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  u = clog & (df_std[site] >= df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/4257960171.py:13: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(3) & (df_std[site] < df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/4257960171.py:13: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(3) & (df_std[site] < df_std[base_site]) & non0

[79]:

# add flags for clog
base_flag.flags['U'] |= u
base_flag.flags['C'] |= c

clog_event = pd.DataFrame(data=clog, columns=['clog'])
clog_event[['drain_event', 'neg_delta_tank']] = False
# grab a test day
# use clog as event DataFrame
ax1 = base_flag.plot_flagged_day(strt, base_site, tdelta='2D',  auto_qa_event=clog_event,
                                 paired_tank=all_flags['CEN_02'].data['tank_height'])

[80]:

clg_plt = {'clog': {'ax': ax1[1], 'linestyle': '', 'marker': 'o', 'color': 'navy'}}
end = strt + pd.to_timedelta('2D')
all_flags['CEN_01'].data.loc[strt:end, 'tank_height'][clog].plot(grid=True, **clg_plt['clog'])

[80]:

<Axes: title={'center': 'CEN_01 - 2018-11-27 00:00:00'}, xlabel='Date'>

[86]:

ax3 = xprb[strt:end].plot(legend=True, grid=True, linestyle='', marker='.')
xprb.loc[clog[strt:end], base_site].plot(grid=True, legend=True, linestyle='', marker='o', ax=ax3)
xprb.loc[u, base_site].plot(grid=True, legend=True, linestyle='', marker='$U$', ax=ax3)
xprb.loc[c, base_site].plot(grid=True, legend=True, linestyle='', marker='$C$', ax=ax3)
df_std.plot(legend=True, grid=True, ax=ax3)

[86]:

<Axes: xlabel='Date'>

Well, this is an example of the C flag being overapplied because of the shift. We could knock it back to a single timestep shift, but I think that this is more likely an outlier to be an outlier, meaning that I think a shift of one will end up requiring me to flag every drain at UPLO for 2 years. So I’ll manually unflag this, unless I find that the error is happening a lot.

2/12/19 - 2/21/19¶

[81]:

strt, end = pd.to_datetime('2/12/19 0000'), pd.to_datetime('2/22/19 0000')

[82]:

xprb = tbl.loc[strt:end, [base_site, site]]
# calc rolling mean and std for pair
df_roll = xprb.rolling(window='1h', center=True, min_periods=None)
df_mean =  df_roll.mean(engine='numba').interpolate(method='linear')
df_std = df_mean + (1.5 * df_roll.std(engine='numba').interpolate(method='linear'))

# set criteria for undercatch
non0 = df_std[site]>0
u = clog & (df_std[site] >= df_std[base_site]) & non0

# set criteria for catchup cumulative totals
non0 = xprb[base_site]>0
c = clog.shift(1) & (df_std[site] < df_std[base_site]) & non0

/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/2902830835.py:9: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  u = clog & (df_std[site] >= df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/2902830835.py:13: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(1) & (df_std[site] < df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/2902830835.py:13: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(1) & (df_std[site] < df_std[base_site]) & non0

[83]:

# add flags for clog
base_flag.flags['U'] |= u
base_flag.flags['C'] |= c

clog_event = pd.DataFrame(data=clog, columns=['clog'])
clog_event[['drain_event', 'neg_delta_tank']] = False
# grab a test day
# use clog as event DataFrame
ax1 = base_flag.plot_flagged_day(strt, base_site, tdelta='10D',  auto_qa_event=clog_event,
                                 paired_tank=all_flags['CEN_02'].data['tank_height'])

4/7/19 - 4/16/19¶

[87]:

strt, end = pd.to_datetime('4/7/19 0000'), pd.to_datetime('4/17/19 0000')

[88]:

xprb = tbl.loc[strt:end, [base_site, site]]
# calc rolling mean and std for pair
df_roll = xprb.rolling(window='1h', center=True, min_periods=None)
df_mean =  df_roll.mean(engine='numba').interpolate(method='linear')
df_std = df_mean + (1.5 * df_roll.std(engine='numba').interpolate(method='linear'))

# set criteria for undercatch
non0 = df_std[site]>0
u = clog & (df_std[site] >= df_std[base_site]) & non0

# set criteria for catchup cumulative totals
non0 = xprb[base_site]>0
c = clog.shift(1) & (df_std[site] < df_std[base_site]) & non0

/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/2902830835.py:9: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  u = clog & (df_std[site] >= df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/2902830835.py:13: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(1) & (df_std[site] < df_std[base_site]) & non0
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_80827/2902830835.py:13: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
  c = clog.shift(1) & (df_std[site] < df_std[base_site]) & non0

[89]:

# add flags for clog
base_flag.flags['U'] |= u
base_flag.flags['C'] |= c

clog_event = pd.DataFrame(data=clog, columns=['clog'])
clog_event[['drain_event', 'neg_delta_tank']] = False
# grab a test day
# use clog as event DataFrame
ax1 = base_flag.plot_flagged_day(strt, base_site, tdelta='10D',  auto_qa_event=clog_event,
                                 paired_tank=all_flags['CEN_02'].data['tank_height'])

[90]:

ax3 = xprb[strt:end].plot(legend=True, grid=True, linestyle='', marker='.')
xprb.loc[clog[strt:end], base_site].plot(grid=True, legend=True, linestyle='', marker='o', ax=ax3)
xprb.loc[u, base_site].plot(grid=True, legend=True, linestyle='', marker='$U$', ax=ax3)
xprb.loc[c, base_site].plot(grid=True, legend=True, linestyle='', marker='$C$', ax=ax3)
df_std.plot(legend=True, grid=True, ax=ax3)

[90]:

<Axes: xlabel='Date'>

It looks like with the shift clog used in the C flagging, this is properly catching all cases. In the graph directly above, the two low C values are still over 1 mm, which is very high, but obscured by the 0-300mm scale. The paired tank also seems flat at the time, so any precipitation is part of the catch up. And the C’s only persisted after a clog has ended in one case. This looks like it is ready to be programmed for production.

[ ]: