Using multiple probes to catch clogs

Most sites have more than one rain gauge, so intuitively, both rain gauges should measure about the same amount of precipitation. However, if one gauge is clogged, how can we check the other gauge? What about the 3 sites that only have one rain gauge?

Up to this point, testing has compared two gauges within the same site. While relationships across different sites will not be as tightly coupled as they are within a site, there should still be a measurable relationship. Below is work to develop a way to use that relationship, creating a composite clog warning from multiple sites. Expected issues to be worked out:

  • Cross site relationships

  • Making sure a clear sign of a clog isn’t ignored without allowing flase positives.

  • Ways to account for greater variance and rogue thunderstorms that only hit one site.

  • Allowing greater variance in ratios between sites (likely using longer averaging windows)

Let’s get the environment set up and load some data.

[1]:
import pandas as pd
import matplotlib.pyplot as plt
from numpy import nan, arange

# Jupyter magic to make plots display interactive
# must install ipympl (Ipython-matplotlib) and nodejs
from ipywidgets.embed import embed_minimal_html

%matplotlib ipympl

# expand all plots to comfortable viewing size
plt.rcParams['figure.figsize'] = [8, 5]

import sys
sys.path.append("../")
%load_ext autoreload
%autoreload explicit
%aimport post_gce_qc.cross_probe_qc
from post_gce_qc import qaqc, data_transfer, cross_probe_qc, main
[2]:
all_flags = main.main(2018, 2022, probes={'all_params'})
Loading all PPT data from ../config.yaml

Load data from UPL_02

All quality checks and quality assurance rules applied to UPL_02
------------------

Load data from CEN_01

All quality checks and quality assurance rules applied to CEN_01
------------------

Load data from CEN_02

All quality checks and quality assurance rules applied to CEN_02
------------------

Load data from CS2_02

All quality checks and quality assurance rules applied to CS2_02
------------------

[3]:
# create cross-probe table
xppt = cross_probe_qc.BuildXTable.assemble_cross_table(all_flags, ppt_col='adj_precip')
xacc = cross_probe_qc.BuildXTable.assemble_wy_acc(xppt)

# calculate ratios for CEN_01
xprobe = cross_probe_qc.XProbesQc(xppt.index, 'CEN_01')
xprobe.calc_accum_ratio(xacc)
[4]:
param = qaqc._load_yaml('../qa_param.yaml')['CEN_01']

xprobe.flag_x_clogs(xppt, xacc, param['auto_flag']['flag_x_clogs'])

xprobe.clog
[4]:
CEN_02 CS2_02
Date
2017-10-01 00:00:00 False False
2017-10-01 00:05:00 False False
2017-10-01 00:10:00 False False
2017-10-01 00:15:00 False False
2017-10-01 00:20:00 False False
... ... ...
2022-09-30 23:35:00 False False
2022-09-30 23:40:00 False False
2022-09-30 23:45:00 False False
2022-09-30 23:50:00 False False
2022-09-30 23:55:00 False False

525888 rows × 2 columns

[5]:
xprobe.ratio.plot(legend=True, grid=True)
[5]:
<Axes: xlabel='Date'>

Looking at the initial relationship with this set of 3 gauges, there are a lot of positives:

First Look

Pros

  1. All the relationships are pretty flat in 2018 and 2021.

  2. All gauges agree that CENT SA was clogged in early 2019

  3. UPLO SH, which is known to have wild diurnal oscillations in mid-late 2019 and goes off the deep end by 2020, giving us a false positive to work with.

  4. CS2MET and CENT are tracking pretty nicely

Cons

But there are a few issues as well:

  1. Everything looks wild in 2022

  2. UPLO has a prolonged period each fall where the data looks very messy

  3. CS2MET also has some prolonged messy periods, particularly in 2020

Strategy

I’ll work through each of these sites until I can identify likely clogs from a 2 site comparison. Then I will try to develop a composite or decision maker to finalize what is a clog.

I’ll take CEN SH as a given since the clog QA has so fare been developoed using the pair of gauges at CEN. Next I’ll look at CS2MET since it has already had a fair amount of QA.

Finding Clogs Using CS2MET

Setting levels

Let’s start by listing the parameters used by CENT SH and see if we can work backwards to get a standard way to find/define them.

CEN_02:
    find_paired_clog:
      min_accum: 34
      lowest_normal_ratio: 0.025
      rolling_window: 8D
      window_precision: 0.01
    flag_paired_clog:
      precision_val: 0.1
      window: 1h
      n_std: 1.5

Lowest Normal Level

Let’s see what percentile we used at CENT.

[6]:
crat = xprobe.ratio['CEN_02']

sum(crat < 0.025) / len(crat)
[6]:
0.07269418583424607
[7]:
lowest_normal_ratio = xprobe.ratio['CS2_02'].quantile(0.07)

xprobe.ratio[['CS2_02','CEN_02']].plot(legend=True, grid=True)
plt.gca().axhline(lowest_normal_ratio, color='k')
[7]:
<matplotlib.lines.Line2D at 0x32a56f220>
[8]:
plt.gca().axhline(0.025, color='c')
[8]:
<matplotlib.lines.Line2D at 0x3299f1180>
[9]:
lowest_normal_ratio = xprobe.ratio['CS2_02'].quantile(0.10)

plt.gca().axhline(lowest_normal_ratio, color='k')
[9]:
<matplotlib.lines.Line2D at 0x32a75ff40>
[10]:
lowest_normal_ratio = xprobe.ratio['CS2_02'].quantile(0.090)

plt.gca().axhline(lowest_normal_ratio, color='k')
[10]:
<matplotlib.lines.Line2D at 0x32a75ed70>
[11]:
lowest_normal_ratio = xprobe.ratio['CS2_02'].quantile(0.060)

plt.gca().axhline(lowest_normal_ratio, color='k')
[11]:
<matplotlib.lines.Line2D at 0x32a75ebc0>
[12]:
lowest_normal_ratio = xprobe.ratio['CS2_02'].quantile(0.050)

plt.gca().axhline(lowest_normal_ratio, color='k')
[12]:
<matplotlib.lines.Line2D at 0x329b22470>
[13]:
lowest_normal_ratio = xprobe.ratio['CS2_02'].quantile(0.080)

plt.gca().axhline(lowest_normal_ratio, color='k')
[13]:
<matplotlib.lines.Line2D at 0x329e81600>
[14]:
xprobe.ratio['CS2_02'].quantile(0.050)
[14]:
-0.14318625628948212

There is a lot of variation bewteen years here; WY 20 and 21 have much lower relationships and really drive the low threshold. If I bump up to the 9th percentile, then almost all of 2020 will be flagged. I have to bump it down to the 5th percentile to unflag most of that year. Some of the places where the ratio is below that value at CS2MET are kown clog periods for CEN SA, which is encouraging that this will be useful going forward. However this is clearly the weaker filter next to the filter for negative slope (i.e. dropping ratio).

Minimum accumulation

This should be fairly straightforward: for each WY, what is the accumulation when the ratio first goes above the minimum ratio. Let’s see if that works out.

[15]:
def first_min(df, min_val, max_val):
    valid = (df < min_val) & (df > max_val)
    return df[valid].index.min()

first_date = xprobe.ratio['CS2_02'].groupby(pd.Grouper(freq='YE-SEP')).apply(lambda x: first_min(x, 0.025, -0.1))
[16]:
xacc.loc[first_date, 'CS2_02']
[16]:
Date
2017-10-12 11:00:00     43.942001
2018-10-05 09:00:00         0.254
2019-10-03 16:50:00          5.08
2020-11-06 00:55:00    154.940002
2021-10-05 18:15:00         3.556
Name: CS2_02, dtype: float[pyarrow]

This is proving harder than expected. Let’s take a crack at it visually.

[17]:
plt.figure()
ax1 = plt.subplot(211)
ax2=plt.subplot(212)
xprobe.ratio['CS2_02'].plot(grid=True, legend=True, ax=ax1)
xacc[['CS2_02', 'CEN_01']].plot(grid=True, legend=True, ax=ax2)
[17]:
<Axes: xlabel='Date'>
[18]:
plt.figure()
ax1 = plt.subplot(211)
ax2=plt.subplot(212)
xprobe.ratio['CS2_02'].plot(grid=True, legend=True, ax=ax1)
xacc[['CS2_02', 'CEN_01']].plot(grid=True, legend=True, ax=ax2)
[18]:
<Axes: xlabel='Date'>
[32]:
plt.figure()
ax1 = plt.subplot(211)
ax2=plt.subplot(212)
xprobe.ratio['CS2_02'].plot(grid=True, legend=True, ax=ax1)
xacc[['CS2_02', 'CEN_01']].plot(grid=True, legend=True, ax=ax2)
[32]:
<Axes: xlabel='Date'>

This seems to vary a lot year to year. Notably, in 2 years during this range CS2MET get’s higher annual precipitation, and in 3 years CENT’s annual total is higher. Most years look like 140 would be a reasonable number. But 2017 it could be as low as 10, but not over 100. There’s an argument to be made for 50. I don’t want to rule out good data, but I also don’t want to be too permissive. Let’s look up close at the remaining 2 years.

[20]:
plt.figure()
ax1 = plt.subplot(211)
ax2=plt.subplot(212)
xprobe.ratio['CS2_02'].plot(grid=True, legend=True, ax=ax1)
xacc[['CS2_02', 'CEN_01']].plot(grid=True, legend=True, ax=ax2)
[20]:
<Axes: xlabel='Date'>
[21]:
plt.figure()
ax1 = plt.subplot(211)
ax2=plt.subplot(212)
xprobe.ratio['CS2_02'].plot(grid=True, legend=True, ax=ax1)
xacc[['CS2_02', 'CEN_01']].plot(grid=True, legend=True, ax=ax2)
[21]:
<Axes: xlabel='Date'>

50 seems reasonable. Let’s see how it does.

Window Precision

Let’s look at this as fraction of lowest level as well as percentile and see what we get. Starting with the numbers for CENT SH:

[22]:
0.01/0.025
[22]:
0.39999999999999997
[23]:
crat = xprobe.ratio['CEN_02']

sum(crat < 0.01) / len(crat)
[23]:
0.052275389436534016

So CENT used the 5th percentile, or 40% of below average. So for CS2MET that would be:

[24]:
xprobe.ratio['CS2_02'].quantile(0.050)
[24]:
-0.14318625628948212
[25]:
-0.143*0.4
[25]:
-0.0572

The 5th percentile is very large. My instinct is that the 40% rule might be better; it is already 5 x the value between gauges at CENT (i.e. less sensitive). But let’s take a look. We’ll assume 8D is still good for starters.

[28]:
drops = qaqc.QaRules.find_drops(xprobe.ratio, 0.057, col='CS2_02', wind='8D')
[36]:
plt.close(9)
[37]:
plt.figure()
ax1= xprobe.ratio['CS2_02'].plot(grid=True, legend=True)
xprobe.ratio.loc[drops, 'CS2_02'].plot(grid=True, ax=ax1, marker='o', color='k', linestyle='')
[37]:
<Axes: xlabel='Date'>

It looks like it caught the bulk of them. But it’s pretty messy without the low value ratios weeded out. Let’s try the whole package, not just the drops.

[199]:
cs2pair = xprobe.get_Probe2ProbeXQc_inst(xppt, xacc, probe='CS2_02')
xprobe.set_probe_clog_event('CS2_02', cs2pair, min_accum=50, lowest_normal_ratio=-0.14, window_precision=0.057)

plt.figure()
xprobe.ratio['CS2_02'].plot(legend=True)
xprobe.ratio.loc[xprobe.clog['CS2_02'], 'CS2_02'].plot(grid=True, linestyle='', marker='.')
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_90338/3505410169.py:4: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
  plt.figure()
[199]:
<Axes: xlabel='Date'>

0.057 misses a significant clog in 2019, let’s look at the CENT SH threshold of 0.01 to see the other side of the spectrum.

[195]:
cs2pair = xprobe.get_Probe2ProbeXQc_inst(xppt, xacc, probe='CS2_02')
xprobe.set_probe_clog_event('CS2_02', cs2pair, min_accum=50, lowest_normal_ratio=-0.14, window_precision=0.01)

plt.figure()
xprobe.ratio['CS2_02'].plot(legend=True)
xprobe.ratio.loc[xprobe.clog['CS2_02'], 'CS2_02'].plot(grid=True, linestyle='', marker='.')
[195]:
<Axes: xlabel='Date'>
[196]:
cs2pair = xprobe.get_Probe2ProbeXQc_inst(xppt, xacc, probe='CS2_02')
xprobe.set_probe_clog_event('CS2_02', cs2pair, min_accum=50, lowest_normal_ratio=-0.14, window_precision=0.03)

plt.figure()
xprobe.ratio['CS2_02'].plot(legend=True)
xprobe.ratio.loc[xprobe.clog['CS2_02'], 'CS2_02'].plot(grid=True, linestyle='', marker='.')
[196]:
<Axes: xlabel='Date'>

Iteratively, 0.03 is better, but let’s look at another year to be sure. 0.057 way under-flagged, but 0.01 (used at CEN SH) way over-flagged.

[85]:
plt.close(13)
[77]:
cs2pair = xprobe.get_Probe2ProbeXQc_inst(xppt, xacc, probe='CS2_02')
xprobe.set_probe_clog_event('CS2_02', cs2pair, min_accum=50, lowest_normal_ratio=-0.15, window_precision=0.03)

ax1 = xprobe.ratio.plot(legend=True)
xprobe.ratio.loc[xprobe.clog['CS2_02'], 'CS2_02'].plot(grid=True, linestyle='', marker='.', ax=ax1)
[77]:
<Axes: xlabel='Date'>

It seems like UPLO agrees with these cases, but that CEN SH does not. So this is a good example of where we are correctly capturing a clog relative to CS2MET, but it seems likely that CENT really didn’t get rain, just the South-Ridge (UPLO/CS2/PRI). We might just need to fix that with weighting, but let’s see if we can make it better by lengthening the window.

Rolling Window Size

[87]:
cs2pair = xprobe.get_Probe2ProbeXQc_inst(xppt, xacc, probe='CS2_02')
xprobe.set_probe_clog_event('CS2_02', cs2pair, min_accum=50, lowest_normal_ratio=-0.15, window_precision=0.03, rolling_window='4D')

ax1 = xprobe.ratio.plot(legend=True)
xprobe.ratio.loc[xprobe.clog['CS2_02'], 'CS2_02'].plot(grid=True, linestyle='', marker='.', ax=ax1)
[87]:
<Axes: xlabel='Date'>

A few iterations found longer windows improved nothing and shorter windows whitle away at the start and end time make a shorter and shorter clog, but only slightly (run_avg > (data + precision)), and it makes the clogging we want to capture worse. I think this will come down to a weighting scheme.

Weighting Clogs From Other Sites

Integers math is faster and integers take less memory, so INT will be used for weighting. There are 7 sites, with 10 gauges (not counting shelter tipping buckets), so a scale of 100 will be used.

  • A clog score of 66 will be needed to flag a clog

  • Sites will split their weight between all gauges; i.e. UPLO will split a weight of 20 as 10 each between SH and SA

  • Gauges within a site should have the highest weight

  • Gauges within a site should not need many other gauges to flag a clog

  • Some zones cannot be allowed to identify clogs at CENT alone

    • south ridge

    • low elevation

    • high elevation

A few notes on observed relationships

  • On average, annual precipitation at CENT is closest to VARA

  • Storms tend to either:

    • hit all sites the same

    • hit sites with amounts increasing with elevation

    • hit the south ridge or the north ridge hardest

    • split between any of the above at an elevation band

  • In storms driven by ridge position, CENT often matches VARA;

  • In elevation driven storms it matches the pattern of VARA and UPLO, but at lower amounts

  • I have no experience with the relationship to HI15

Elev

south ridge

Roswell ridge

north ridge

total

High elevation

{20} UPLO

{40} VARA

60

Mid Elevation

{7} MACK

{58} CENT

{35} HI15

100

Low Elevation

{7+7} PRIM/CS2

14

Total

41

58

75

By limiting CENT’s within site weight to 58, it keeps either PRIM or CS2 from being able to push the total into a clog by itself.

Let’s take a look first at how well it does in WY19, which has been thoroughly studied, and then let’s look at how the weighting handles things in the false positive seen above in WY20.

[93]:
xprobe.clog.loc[pd.to_datetime('12/23/18'):pd.to_datetime('12/23/19')]
[93]:
CEN_02 CS2_02
Date
2018-12-23 00:00:00 True True
2018-12-23 00:05:00 True True
2018-12-23 00:10:00 True True
2018-12-23 00:15:00 True True
2018-12-23 00:20:00 True True
... ... ...
2019-12-22 23:40:00 False False
2019-12-22 23:45:00 False False
2019-12-22 23:50:00 False False
2019-12-22 23:55:00 False False
2019-12-23 00:00:00 False False

105121 rows × 2 columns

[157]:
plt.close(15)
[158]:
cs2pair = xprobe.get_Probe2ProbeXQc_inst(xppt, xacc, probe='CS2_02')
xprobe.set_probe_clog_event('CS2_02', cs2pair, min_accum=50, lowest_normal_ratio=-0.15, window_precision=0.03, rolling_window='8D')

ax1 = xprobe.ratio.plot(legend=True)
xprobe.ratio.loc[xprobe.clog['CS2_02'], 'CS2_02'].plot(grid=True, linestyle='', marker='o', ax=ax1)
xprobe.ratio.loc[xprobe.clog['CEN_02'], 'CEN_02'].plot(grid=True, linestyle='', marker='.', ax=ax1)
[158]:
<Axes: xlabel='Date'>

That looks pretty good. Now we need to weight the scores.

Unfortunately, multiplication isn’t allowed in PyArrow Booleans, so first I need to convert the data somehow.

[146]:
%timeit xprobe.clog.astype('bool8')
510 μs ± 445 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[134]:
%load_ext memory_profiler
[144]:
%memit xprobe.clog.astype('bool8')
peak memory: 4523.14 MiB, increment: 0.00 MiB
[135]:
%timeit xprobe.clog.astype(pd.Int8Dtype())
1.34 ms ± 14.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[ ]:
import pyarrow as pa
[136]:
%memit xprobe.clog.astype(pd.Int8Dtype())
peak memory: 4527.98 MiB, increment: 7.72 MiB
[138]:
%timeit xprobe.clog.astype('int8[pyarrow]')
486 μs ± 3.41 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[139]:
%memit xprobe.clog.astype('int8[pyarrow]')
peak memory: 4507.25 MiB, increment: 0.00 MiB
[121]:
%timeit xprobe.clog.astype(pd.ArrowDtype(pa.int8()))
488 μs ± 5.81 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[137]:
%memit xprobe.clog.astype(pd.ArrowDtype(pa.int8()))
peak memory: 4507.06 MiB, increment: 0.00 MiB
[175]:
weight_x_clogs = {'CEN_02': 58,
                  'CS2_02': 7}

clogs = xprobe.clog.astype('int8[pyarrow]')
for site, wt in weight_x_clogs.items():
    clogs[site] *= wt

clog_score = clogs.sum(axis=1)
[161]:
strt, end = pd.to_datetime('12/18/18'), pd.to_datetime('12/26/18')
[173]:
def clog_plot(probes, strt, end):
    cs2pair = xprobe.get_Probe2ProbeXQc_inst(xppt, xacc, probe=probes[0])
    xprobe.set_probe_clog_event(probes[0], cs2pair, min_accum=50, lowest_normal_ratio=-0.15, window_precision=0.03, rolling_window='8D')

    plt.figure()
    ax1 = plt.subplot(211)
    xprobe.ratio[strt:end].plot(legend=True, ax=ax1)
    xprobe.ratio[strt:end].loc[xprobe.clog.loc[strt:end,probes[0]], probes[0]].plot(grid=True, linestyle='', marker='o', ax=ax1)
    xprobe.ratio[strt:end].loc[xprobe.clog.loc[strt:end, probes[1]], probes[1]].plot(grid=True, linestyle='', marker='.', ax=ax1)

    ax2=plt.subplot(212)
    clog_score[strt:end].plot(ax=ax2, grid=True)
[ ]:

[ ]:

[185]:
strt, end = pd.to_datetime('4/6/19'), pd.to_datetime('4/17/19')
clog_plot(['CS2_02', 'CEN_02'], strt, end)
[187]:
strt, end = pd.to_datetime('4/6/19'), pd.to_datetime('4/17/19')
clog_plot(['CS2_02', 'CEN_02'], strt, end)
[188]:
strt, end = pd.to_datetime('2/12/19'), pd.to_datetime('2/21/19')
clog_plot(['CS2_02', 'CEN_02'], strt, end)
[190]:
strt, end = pd.to_datetime('11/21/18'), pd.to_datetime('11/27/18')
clog_plot(['CS2_02', 'CEN_02'], strt, end)
[191]:
strt, end = pd.to_datetime('10/26/18'), pd.to_datetime('11/01/18')
clog_plot(['CS2_02', 'CEN_02'], strt, end)

How heavily should within site gauges be weighted.

With examples from WY19, CS2MET is parameterized so that it usually starts flagging clogs a little later than CEN SH. It also misses very small clogs, because the ratio drop doesn’t exceed the threshold. This seems about right for a site that is so different and far away. The bulk of the clog period is signaled by both sensors, accumulating a total score >60. CEN SH continues to give the most precise identification of clogs. Should CENT Shelter be weighted to signal a clog without any other gauges? Currently, the weighting requires an additional gauge to trigger a clog (score of 66). With just CS2MET, this would delay initiation of clog events and flagging. With additional sites will this improve? At the 5 minute level, it may be impractical to expect storms to start and stop at the same time between sites that are miles apart. However, we still need a system that can ID clogs when sensors go down. So it is important to be able to draw from multiple sensors, so there is always at least one or 2 valid comparisons.

In WY19 there only seems to be two potential false positive in the graphs directly above. Both will be ignored because of the weighting. The first appears to be a storm cycle that hit CS2MET but not CENT. The second is in a period where real clogging exists, but CS2MET does not stop and start, but continues the clog well after. Upon close inspection, CS2MET has a much larger overall drop, so the running average will be higher. These examples seem to highlight the variability across sites, and the need to give CS2MET a low weight.

Testing U and C flag weights

Let’s see what flag weighting looks like. This will apply the same weights in the same way, but will assign an ultimate flag.

[291]:
cs2pair = xprobe.get_Probe2ProbeXQc_inst(xppt, xacc, probe='CS2_02')
xprobe.set_probe_clog_event('CS2_02', cs2pair, min_accum=50, lowest_normal_ratio=-0.15, window_precision=0.03, rolling_window='8D')
xprobe.set_probe_clog_flag('CS2_02', cs2pair, precision_val=0.254 ,window='1h', n_std=1.5)
[206]:
xprobe.clog['CS2_02']
[206]:
Date
2017-10-01 00:00:00    False
2017-10-01 00:05:00    False
2017-10-01 00:10:00    False
2017-10-01 00:15:00    False
2017-10-01 00:20:00    False
                       ...
2022-09-30 23:35:00    False
2022-09-30 23:40:00    False
2022-09-30 23:45:00    False
2022-09-30 23:50:00    False
2022-09-30 23:55:00    False
Name: CS2_02, Length: 525888, dtype: bool[pyarrow]
[292]:
all_flags['CEN_01'].flags['C'] = xprobe.C['CS2_02']
all_flags['CEN_01'].flags['U'] = xprobe.U['CS2_02']

clog_event = pd.DataFrame(columns=['clog', 'duplicate', 'drain_event'], index=xprobe.clog['CS2_02'].index, data=False)
clog_event.loc[xprobe.clog['CS2_02']>0,'clog'] = True

clog_event
[292]:
clog duplicate drain_event
Date
2017-10-01 00:00:00 False False False
2017-10-01 00:05:00 False False False
2017-10-01 00:10:00 False False False
2017-10-01 00:15:00 False False False
2017-10-01 00:20:00 False False False
... ... ... ...
2022-09-30 23:35:00 False False False
2022-09-30 23:40:00 False False False
2022-09-30 23:45:00 False False False
2022-09-30 23:50:00 False False False
2022-09-30 23:55:00 False False False

525888 rows × 3 columns

[228]:
all_flags['CEN_01'].apply_QaRules_flags(clog_event, all_flags['CEN_01'].flags)
[296]:
for f in range(25,80):
    plt.close(f)
[ ]:

[252]:
%aimport post_gce_qc.qaqc
[295]:
day = pd.to_datetime('10/20/17')
all_flags['CEN_01'].plot_flagged_day(day,  'CEN_01', tdelta='5D', auto_qa_event=clog_event, paired_tank=all_flags['CS2_02'].data.tank_height)
[295]:
(<Axes: xlabel='Date'>,
 <Axes: title={'center': 'CEN_01 - 2017-10-20 00:00:00'}, xlabel='Date'>)
[293]:
day = pd.to_datetime('12/19/18')
all_flags['CEN_01'].plot_flagged_day(day,  'CEN_01', tdelta='8D', auto_qa_event=clog_event, paired_tank=all_flags['CS2_02'].data.tank_height)
[293]:
(<Axes: xlabel='Date'>,
 <Axes: title={'center': 'CEN_01 - 2018-12-19 00:00:00'}, xlabel='Date'>)

In the first example, we see a lot of C and unflagged precip throughout. In the second example, the precipitation was added twice. We see the second value is flagged Set0 and E since we estimate the real value to be 0. And the first value got a C, which is what I wanted to see. The first case, however, prompts some questions. It isn’t entirely surprising, since we saw above that the timing of clogs is a little off in the comparison. But maybe a longer interval would help. Let’s try 2 hours.

[299]:
plt.close(83)
[308]:
cenpair = xprobe.get_Probe2ProbeXQc_inst(xppt, xacc, 'CEN_02')
xprobe.set_probe_clog_flag('CEN_02', cenpair, precision_val=0.1 ,window='1h', n_std=1.5)

all_flags['CEN_01'].flags['C'] = xprobe.C['CEN_02']
all_flags['CEN_01'].flags['U'] = xprobe.U['CEN_02']

clog_event = pd.DataFrame(columns=['clog', 'duplicate', 'drain_event'], index=xprobe.clog['CS2_02'].index, data=False)
clog_event.loc[xprobe.clog['CEN_02']>0,'clog'] = True

all_flags['CEN_01'].apply_QaRules_flags(clog_event, all_flags['CEN_01'].flags)

day = pd.to_datetime('10/20/17')
all_flags['CEN_01'].plot_flagged_day(day,  'CEN_01', tdelta='5D', auto_qa_event=clog_event, paired_tank=all_flags['CEN_02'].data.tank_height)
[308]:
(<Axes: xlabel='Date'>,
 <Axes: title={'center': 'CEN_01 - 2017-10-20 00:00:00'}, xlabel='Date'>)

Now let’s compare to what CEN SH is doing during the same period.

[300]:
xprobe.set_probe_clog_flag('CS2_02', cs2pair, precision_val=0.254 ,window='2h', n_std=1.5)

all_flags['CEN_01'].flags['C'] = xprobe.C['CS2_02']
all_flags['CEN_01'].flags['U'] = xprobe.U['CS2_02']
all_flags['CEN_01'].apply_QaRules_flags(clog_event, all_flags['CEN_01'].flags)

day = pd.to_datetime('10/20/17')
all_flags['CEN_01'].plot_flagged_day(day,  'CEN_01', tdelta='5D', auto_qa_event=clog_event, paired_tank=all_flags['CS2_02'].data.tank_height)
[300]:
(<Axes: xlabel='Date'>,
 <Axes: title={'center': 'CEN_01 - 2017-10-20 00:00:00'}, xlabel='Date'>)

So, that is a messy example: paired with CS2MET it goes back and forth between U, C, and no flag. However, it would need a lot of other sites to agree to makes those flags final, since CEN02 quite definitively shows no clog. So the mess will be excluded by the weighting.

I want to look at a broader set of examples. This may be an outlier that we can dismiss, since it appears to simply be a period where CS2MET is outpacing CENT, but CENT is not clogged. So as waves of precipitation surge CS2MET ahead, it shows undercatch, and then they have similar precipitation rates for a while and nothing gets flagged.

[254]:
for f in range(25, 36):
    plt.close(f)
[309]:
flag_day = all_flags['CEN_01'].get_flagged_days()

for day in flag_day:
    end = day+pd.to_timedelta('1D')
    if clog_event['clog'][day:end].any() and end.year <2019:
        all_flags['CEN_01'].plot_flagged_day(day,  'CEN_01', auto_qa_event=clog_event, paired_tank=all_flags['CS2_02'].data.tank_height)
[311]:
plt.close(112)
[312]:
day = pd.to_datetime('12/18/18 1530')
all_flags['CEN_01'].plot_flagged_day(day,  'CEN_01', tdelta='8D', auto_qa_event=clog_event, paired_tank=all_flags['CS2_02'].data.tank_height)
[312]:
(<Axes: xlabel='Date'>,
 <Axes: title={'center': 'CEN_01 - 2018-12-18 15:30:00'}, xlabel='Date'>)

That’s not perfect, but on whole, most of those looked pretty reasonable. The true clogs are performing well: directly above is a good example where there is rain-free period that has no flagging, but a clog event. Then there is a rainy period that is all ‘U’ and then a sudden catch up that gets a ‘C’.The example below starts with has a long period of catch up during a rain free period, all flagged with ‘C’ and no flag when neither

[454]:
day = pd.to_datetime('11/4/17 1800')
all_flags['CEN_01'].plot_flagged_day(day,  'CEN_01', tdelta='4D', auto_qa_event=clog_event, paired_tank=all_flags['CS2_02'].data.tank_height)
[454]:
(<Axes: xlabel='Date'>,
 <Axes: title={'center': 'CEN_01 - 2017-11-04 18:00:00'}, xlabel='Date'>)

The only thing left to do is to weight the flags just like the clog event. I’ll err towards U. I’ll also need to format them into DataFrames that can be used by :meth:qaqc.ApplyFlags. Let’s see if I can do that without costing too much.

[334]:
%timeit pd.DataFrame({'U':U,'C':C})
52.4 μs ± 428 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
[333]:
%timeit pd.DataFrame(columns=['U','C'], data=[U,C])
457 ms ± 8.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[341]:
%%timeit
flag = pd.DataFrame(index=U.index, dtype='boolean[pyarrow]')
flag['U']=U
flag['C']=C
169 μs ± 1.62 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
[ ]:

[351]:
%%timeit
flag = pd.DataFrame(index=U.index, columns=['U', 'C'], dtype='boolean[pyarrow]')
flag['U']=U
flag['C']=C
1.38 ms ± 24 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[346]:
%%timeit
flag = pd.DataFrame({'U':False, 'C':False}, index=U.index, dtype='boolean[pyarrow]')
flag['U']=U
flag['C']=C
1.23 ms ± 3.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[343]:
eventwt, Uwt, Cwt, = xprobe.get_weight_x_clog({'CS2_02':58})
clog, U_C = xprobe.get_flag_x_clog(eventwt, Uwt, Cwt)

# test with at least one weighted true value
clog[0] = True

all_flags['CEN_01'].apply_QaRules_flags(clog, U_C)
[ ]:

Reverse Relationship to Identify Known CS2MET Clogs

There were two known events at CS2MET in WY19 where the gauge overflowed. Both of these should show up as undercatch. Hopefully this is a simple exercise taking parameters from this relationship and reversing the sign or taking the inverse.

[401]:
xprobe = cross_probe_qc.XProbesQc(all_flags['CS2_02'].data.index, 'CS2_02')
xprobe.set_accum_ratio(xacc)

param = {'CEN_01':{'clog_pair_flagging_wrap':{"min_accum": 50,
                                    'lowest_normal_ratio': 0.143,
                                    'window_precision': 0.03,
                                    'precision_val': 0.1,
                                    'window': '1h',
                                    'n_std': 1.5}}}

xprobe.set_x_clogs(xppt, xacc, param)

flags = pd.DataFrame({'U':xprobe.U['CEN_01'], 'C':xprobe.C['CEN_01']})
event = pd.DataFrame({'clog':xprobe.clog['CEN_01']})

all_flags['CS2_02'].apply_QaRules_flags(event, flags)


strt = pd.to_datetime('2/23/19')
all_flags['CS2_02'].plot_flagged_day(strt,  'CS2_02', tdelta='6D', auto_qa_event=event, paired_tank=all_flags['CEN_01'].data.tank_height)
[401]:
(<Axes: xlabel='Date'>,
 <Axes: title={'center': 'CS2_02 - 2019-02-23 00:00:00'}, xlabel='Date'>)
[400]:
plt.close(114)
[402]:
strt = pd.to_datetime('2/1/19')
all_flags['CS2_02'].plot_flagged_day(strt,  'CS2_02', tdelta='30D', auto_qa_event=event, paired_tank=all_flags['CEN_01'].data.tank_height)
[402]:
(<Axes: xlabel='Date'>,
 <Axes: title={'center': 'CS2_02 - 2019-02-01 00:00:00'}, xlabel='Date'>)

Over flagging

It looks like it flagged most values. I still think that the rolling window sizes and the threshold for window precision are probably the same as they were with the relationship in the other direction. I bet it’s the lowest_normal_ratio getting us in trouble.

[403]:
plt.figure()
xprobe.ratio.CEN_01.plot(grid=True)
[403]:
<Axes: xlabel='Date'>

The lowest normal ratio was set at 0.143, that is way to high. Let’s pick a better number and try again. We can see that the gauge got substantially behind in April of 2019, but it never catches up. An event like that sets the value low for the season, so we’ll have to go off of that number and depend mostly on dropping ratio.

[432]:
plt.close(118)
[419]:
del xprobe, flags, event
xprobe = cross_probe_qc.XProbesQc(all_flags['CS2_02'].data.index, 'CS2_02')
xprobe.set_accum_ratio(xacc)

param = {'CEN_01':{'clog_pair_flagging_wrap':{"min_accum": 50,
                                    'lowest_normal_ratio': -0.09,
                                    'window_precision': 0.03,
                                    'precision_val': 0.1,
                                    'window': '1h',
                                    'n_std': 1.5}}}

xprobe.set_x_clogs(xppt, xacc, param)

flags = pd.DataFrame({'U':xprobe.U['CEN_01'], 'C':xprobe.C['CEN_01']})
event = pd.DataFrame({'clog':xprobe.clog['CEN_01']})

all_flags['CS2_02'].apply_QaRules_flags(event, flags)


strt = pd.to_datetime('2/21/19')
all_flags['CS2_02'].plot_flagged_day(strt,  'CS2_02', tdelta='8D', auto_qa_event=event, paired_tank=all_flags['CEN_01'].data.tank_height)
[419]:
(<Axes: xlabel='Date'>,
 <Axes: title={'center': 'CS2_02 - 2019-02-21 00:00:00'}, xlabel='Date'>)

Wow, that looks like catchup (delayed accumulation) from a clog at CEN_01. We haven’t parameterized the relationship with CEN02, but let’s just give it a try and see if it does any better, since both of our CS2MET clogs appear to overlap with clogs at CEN01 in 2019.

[433]:
del xprobe, flags, event
all_flags['CS2_02'].flags[['U','C']] = False
all_flags['CS2_02'].event['clog'] = False

xprobe = cross_probe_qc.XProbesQc(all_flags['CS2_02'].data.index, 'CS2_02')
xprobe.set_accum_ratio(xacc)

param = {'CEN_02':{'clog_pair_flagging_wrap':{"min_accum": 50,
                                    'lowest_normal_ratio': -0.09,
                                    'window_precision': 0.03,
                                    'precision_val': 0.1,
                                    'window': '1h',
                                    'n_std': 1.5}}}

xprobe.set_x_clogs(xppt, xacc, param)

flags = pd.DataFrame({'U':xprobe.U['CEN_02'], 'C':xprobe.C['CEN_02']})
event = pd.DataFrame({'clog':xprobe.clog['CEN_02']})

all_flags['CS2_02'].apply_QaRules_flags(event, flags)


strt = pd.to_datetime('2/21/19')
all_flags['CS2_02'].plot_flagged_day(strt,  'CS2_02', tdelta='8D', auto_qa_event=event, paired_tank=all_flags['CEN_02'].data.tank_height)
[433]:
(<Axes: xlabel='Date'>,
 <Axes: title={'center': 'CS2_02 - 2019-02-21 00:00:00'}, xlabel='Date'>)
[430]:
event.loc[pd.to_datetime('2/25/19'):pd.to_datetime('2/26/19')]
[430]:
clog
Date
2019-02-25 00:00:00 False
2019-02-25 00:15:00 False
2019-02-25 00:30:00 False
2019-02-25 00:45:00 False
2019-02-25 01:00:00 False
... ...
2019-02-25 23:00:00 False
2019-02-25 23:15:00 False
2019-02-25 23:30:00 False
2019-02-25 23:45:00 False
2019-02-26 00:00:00 False

97 rows × 1 columns

[437]:
strt = pd.to_datetime('4/8/19')
all_flags['CS2_02'].plot_flagged_day(strt,  'CS2_02', tdelta='4D', auto_qa_event=event, paired_tank=all_flags['CEN_02'].data.tank_height)
[437]:
(<Axes: xlabel='Date'>,
 <Axes: title={'center': 'CS2_02 - 2019-04-08 00:00:00'}, xlabel='Date'>)
[436]:
plt.close(121)

CEN02 seems to have caught the second clog well, but not the first. It looks like the clog happens at the very end of the storm, near peak accumulation. From the graph it is fairly intuitive that this wouldn’t be identified as a clog. The diurnal fluctuations have been manually flagged.

Delayed Accumulation Could Signal a Clog in Reverse

Delayed accumulation in the paired probe will have a negative slope, so it is important to check that it isn’t being captured as a clog.

[458]:
del xprobe, flags, event
all_flags['CS2_02'].flags[['U','C']] = False
all_flags['CS2_02'].event['clog'] = False

xprobe = cross_probe_qc.XProbesQc(all_flags['CEN_01'].data.index, 'CS2_02')
xprobe.set_accum_ratio(xacc)

param = {'CEN_01':{'clog_pair_flagging_wrap':{"min_accum": 50,
                                    'lowest_normal_ratio': -0.09,
                                    'window_precision': 0.03,
                                    'precision_val': 0.1,
                                    'window': '1h',
                                    'n_std': 1.5}}}

xprobe.set_x_clogs(xppt, xacc, param)

flags = pd.DataFrame({'U':xprobe.U['CEN_01'], 'C':xprobe.C['CEN_01']})
event = pd.DataFrame({'clog':xprobe.clog['CEN_01']})

all_flags['CS2_02'].apply_QaRules_flags(event, flags)
[464]:
strt = pd.to_datetime('12/25/18')
all_flags['CS2_02'].plot_flagged_day(strt,  'CS2_02', tdelta='8D', auto_qa_event=event, paired_tank=all_flags['CEN_01'].data.tank_height)
[464]:
(<Axes: xlabel='Date'>,
 <Axes: title={'center': 'CS2_02 - 2018-12-25 00:00:00'}, xlabel='Date'>)
[465]:
plt.figure()
ax1 = plt.subplot(211)
xprobe.ratio['CEN_01'].plot(grid=True, ax=ax1)

ax2 = plt.subplot(212)
xprobe.ratio['CEN_01'].rolling(window='8D', closed='both').mean(engine='numba').plot(grid=True, ax=ax2)
[465]:
<Axes: xlabel='Date'>

This is a big problem! And because the rolling window (bottom figure) is so long (8days), the false clog lasts for a long time. I could try to filter out large changes in only a few timesteps, however I think we would still see this problem with gradual changes. The classic example is Mack, which often melts the snow in it’s funnel over several days, so that it’s total precipitation eventually matches other sites, but it is delayed over multiple days. That wouldn’t be filtered by a search for large changes, however, I am fairly certain it would create the same problem.

Can Weighting Fix This?

This may point out a critical part of weighting: that a single probe should never be able to trigger a clog by itself, even within a site. So, if we had looked at CEN01 vs CEN02 instead, we will see the same phenomenon. But, by requiring an additional probe to confirm a clog, the weighting should prevent this from being flagged.

The downside of this, is that within site probes provide the best timing for the beginning and end of a clog. So, by waiting until another probe also shows a clog, it shrinks the duration of the clog, slightly underflagging.

There are also possible scenarios where the weighting scheme fails… since CEN01 and CEN02 are so heavily weighted, any other probe could confirm. So, if we see one of the bogus clogs from a site like CS2MET during the 7 days after this delayed accumulation, where it simply caught more rain in the storm track, it would now have enough weight to flag it as a clog.

Weighting + Manual Unflagging

At the moment, I don’t see a clear path to an analytical solution. I think this situation will continue to create false positives, but the weighting should prevent it from flagging bodus periods. Hopefully, errors will be few enough that it will be reasonable to clean them up with manual flags (or unflags).

Rolling Avg at Second Scale

I will try a multi week or month long rolling average to try and pick out when the downward trend comes after an upward trend, as opposed to before it.

[ ]: