Finding Artificial Precip

In final evaluation of Capturing-clogs-from-ACC-ratio, a large amount of precipitation was discovered during a clog that turned out to be artificial. At first glance, it looked like a slow drip past a clog, but on closer inspection, it was an artifact of our GCE processing. It is unclear if the clog triggered the process. However, it has some overlap with the searches for duplicate values that were part of cleaning the data before searching for clogs.

Below the data is loaded into this Jupyter Lab before investigating the artificial precip.

[1]:
# setup environment.
import pandas as pd
import matplotlib.pyplot as plt
from numpy import nan, arange

# Jupyter magic to make plots display interactive
# must install ipympl (Ipython-matplotlib) and nodejs
from ipywidgets.embed import embed_minimal_html
%matplotlib widget

# expand all plots to comfortable viewing size
plt.rcParams['figure.figsize'] = [8, 5]

import sys
sys.path.append("../post_gce_qc/")
%load_ext autoreload
%autoreload explicit
%aimport qaqc
from qaqc import *
[2]:
# load data
#################
prov = qaqc.LoadProvisionalData('../config.yaml')
prov.load_ppt_data()

cnsh = prov.pivot_on_probe(prov.df, 'CEN', '02')
cnsa = prov.pivot_on_probe(prov.df, 'CEN', '01')

cn_ppt = cnsa.merge(cnsh, left_index=True, right_index=True, suffixes=('_SA', '_SH'))
[3]:
# clean data for comparison
##############################
sh_precision = 0.3810
sa_precision = 0.2794

probe = {'paired_sensor': 'CEN_02',
        'precision': sa_precision,
        'tank_min': 0,
        'tank_max': 412,
        'drain_event_max_recharge': 2.67,
        'use_pair': False,
        'drain_event_window': 3,
        'ppt_runavg_window': 4,
        'drain_event_nstd': 2}

qc = qaqc.QaRules(cn_ppt, qa_params=probe)
qc.flag_double_precip(ppt_col='TOT_SA', tank_col='INST_SA')
qc.flag_empty_tank(tank_col='INST_SA', pause_nsteps=2)

new_acc = qc.reset_wy_acc(ppt_col='TOT_SA')

ratio = (new_acc - cn_ppt['ACC_SH'])/new_acc
[4]:
# filter just our test year
df = cn_ppt#[pd.to_datetime('10/1/18'):pd.to_datetime('9/30/19')]
#ratio = ratio[pd.to_datetime('10/1/18'):pd.to_datetime('9/30/19')]
ratio.name='pair_ratio'
df_ratio = pd.DataFrame(ratio)

# find clogs
#######################
min_accum = 34
above_min_accum = (cn_ppt[['ACC_SH', 'ACC_SA']] > min_accum).all(axis=1)

below_normal = ratio < 0.025
dropping_ratio = qc.find_drops(df_ratio, 0.01, col='pair_ratio', wind=f'8D')

clog = (below_normal | dropping_ratio) & above_min_accum
[5]:
# create flag object for plotting
#################################
flag = ApplyFlags(df.index)

# add our new FSDB_auto_flag to this instance of ApplyFlags
clog_flag = pd.DataFrame(clog, columns=['U'], index=pd.to_datetime(df.index))
clog_event = pd.DataFrame(clog, columns=['clog'], index=df.index)
clog_event[['drain_event', 'neg_delta_tank', 'duplicate']] = False

flag.apply_FSDB_flags(clog_event, clog_flag)

# add the original data to the instance to plot
flag.import_provisional_data(qc.df_orig, tank_col='INST_SA', ppt_col='TOT_SA',
                             ppt_flag_col='TOT_Flag_SA')

False precip with flat tank

Where we left off, there was around an inch of precip before the clog was flagged on 4/5/19. It’s accumulating a constant amount of precip for 20 hours straight!! 0.01 inches was measured every 5 minutes from 4/6/19 0955 - 4/9/19 2045 without any gaps. It may require some cleaning before dealing with the clog.

[6]:
strt = pd.to_datetime('4/5/19')
end = strt + pd.to_timedelta('60h')

ax1 = df_ratio[strt:end].plot(legend=True, grid=True, color='g')
ax2=plt.twinx(ax1)
df.loc[strt:end,['ACC_SA']].plot(ax=ax2, legend=True, grid=True)

strt_diff = df.loc[strt,'ACC_SH']-df.loc[strt,'ACC_SA']
(df.loc[strt:end,'ACC_SH']-strt_diff).plot(ax=ax2, legend=True, grid=True)

df.loc[strt:end,'ACC_SA'][clog].plot(ax=ax2, grid=True, linestyle='', marker='o', color='navy')
ratio[strt:end][clog].plot(ax=ax1, grid=True, linestyle='', marker='o', color='navy')

rolling_rat = ratio[strt:end].rolling('8D').mean(engine=numba)
rolling_rat.plot(ax=ax1, grid=True, color='g', linestyle=':', legend=True)
rolling_rat[clog].plot(ax=ax1, grid=True, linestyle='', marker='o', color='navy')
[6]:
<Axes: xlabel='Date'>
[7]:
ax2.legend(['ACC_SA', 'ACC_SH', 'clog']).set_loc('lower left')

A closer look at the ACC and ratio show that the constant 0.01” accumulation swamp out the clog by creating an initial positive change in the ratio, delaying the clog flag. Need to dig into the raw data to see how this can be flagged, or even what’s causing it.

[8]:
strt = pd.to_datetime('4/6/19 0955')
end = strt + pd.to_timedelta(f'{5*100}min')#pd.to_datetime('4/9/19 2045')

pd.set_option('display.max_rows', 101)
cnsa[strt:end]
[8]:
INST INST_Flag TOT TOT_Flag ACC ACC_Flag
Date
2019-04-06 09:55:00 128.899994 <NA> 0.2 <NA> 1598.300049 <NA>
2019-04-06 10:00:00 128.899994 M 0.2 M 1598.5 M
2019-04-06 10:05:00 128.899994 M 0.2 M 1598.699951 M
2019-04-06 10:10:00 128.899994 M 0.2 M 1598.900024 M
2019-04-06 10:15:00 128.899994 M 0.2 M 1599.099976 M
2019-04-06 10:20:00 128.899994 M 0.2 M 1599.300049 M
2019-04-06 10:25:00 128.899994 M 0.2 M 1599.5 M
2019-04-06 10:30:00 128.899994 M 0.2 M 1599.699951 M
2019-04-06 10:35:00 128.899994 M 0.2 M 1599.900024 M
2019-04-06 10:40:00 128.899994 M 0.2 M 1600.099976 M
2019-04-06 10:45:00 128.899994 M 0.2 M 1600.300049 M
2019-04-06 10:50:00 128.899994 M 0.2 M 1600.5 M
2019-04-06 10:55:00 128.899994 M 0.2 M 1600.699951 M
2019-04-06 11:00:00 128.899994 M 0.2 M 1600.900024 M
2019-04-06 11:05:00 128.899994 M 0.2 M 1601.099976 M
2019-04-06 11:10:00 128.899994 M 0.2 M 1601.300049 M
2019-04-06 11:15:00 128.899994 M 0.2 M 1601.5 M
2019-04-06 11:20:00 128.899994 M 0.2 M 1601.699951 M
2019-04-06 11:25:00 128.899994 M 0.2 M 1601.900024 M
2019-04-06 11:30:00 128.899994 M 0.2 M 1602.099976 M
2019-04-06 11:35:00 128.899994 M 0.2 M 1602.300049 M
2019-04-06 11:40:00 128.899994 M 0.2 M 1602.5 M
2019-04-06 11:45:00 128.899994 M 0.2 M 1602.699951 M
2019-04-06 11:50:00 128.899994 M 0.2 M 1602.900024 M
2019-04-06 11:55:00 128.899994 M 0.2 M 1603.099976 M
2019-04-06 12:00:00 128.899994 M 0.2 M 1603.300049 M
2019-04-06 12:05:00 128.899994 M 0.2 M 1603.5 M
2019-04-06 12:10:00 128.899994 M 0.2 M 1603.699951 M
2019-04-06 12:15:00 128.899994 M 0.2 M 1603.900024 M
2019-04-06 12:20:00 128.899994 M 0.2 M 1604.099976 M
2019-04-06 12:25:00 128.899994 M 0.2 M 1604.300049 M
2019-04-06 12:30:00 128.899994 M 0.2 M 1604.5 M
2019-04-06 12:35:00 128.899994 M 0.2 M 1604.699951 M
2019-04-06 12:40:00 128.899994 M 0.2 M 1604.900024 M
2019-04-06 12:45:00 128.899994 M 0.2 M 1605.099976 M
2019-04-06 12:50:00 128.899994 M 0.2 M 1605.300049 M
2019-04-06 12:55:00 128.899994 M 0.2 M 1605.5 M
2019-04-06 13:00:00 128.899994 M 0.2 M 1605.699951 M
2019-04-06 13:05:00 128.899994 M 0.2 M 1605.900024 M
2019-04-06 13:10:00 128.899994 M 0.2 M 1606.099976 M
2019-04-06 13:15:00 128.899994 M 0.2 M 1606.300049 M
2019-04-06 13:20:00 128.899994 M 0.2 M 1606.5 M
2019-04-06 13:25:00 128.899994 M 0.2 M 1606.699951 M
2019-04-06 13:30:00 128.899994 M 0.2 M 1606.900024 M
2019-04-06 13:35:00 128.899994 M 0.2 M 1607.099976 M
2019-04-06 13:40:00 128.899994 M 0.2 M 1607.300049 M
2019-04-06 13:45:00 128.899994 M 0.2 M 1607.5 M
2019-04-06 13:50:00 128.899994 M 0.2 M 1607.699951 M
2019-04-06 13:55:00 128.899994 M 0.2 M 1607.900024 M
2019-04-06 14:00:00 128.899994 M 0.2 M 1608.099976 M
2019-04-06 14:05:00 128.899994 M 0.2 M 1608.300049 M
2019-04-06 14:10:00 128.899994 M 0.2 M 1608.5 M
2019-04-06 14:15:00 128.899994 M 0.2 M 1608.699951 M
2019-04-06 14:20:00 128.899994 M 0.2 M 1608.900024 M
2019-04-06 14:25:00 128.899994 M 0.2 M 1609.099976 M
2019-04-06 14:30:00 128.899994 M 0.2 M 1609.300049 M
2019-04-06 14:35:00 128.899994 M 0.2 M 1609.5 M
2019-04-06 14:40:00 128.899994 M 0.2 M 1609.699951 M
2019-04-06 14:45:00 128.899994 M 0.2 M 1609.900024 M
2019-04-06 14:50:00 128.899994 M 0.2 M 1610.099976 M
2019-04-06 14:55:00 128.899994 M 0.2 M 1610.300049 M
2019-04-06 15:00:00 128.899994 M 0.2 M 1610.5 M
2019-04-06 15:05:00 128.899994 M 0.2 M 1610.699951 M
2019-04-06 15:10:00 128.899994 M 0.2 M 1610.900024 M
2019-04-06 15:15:00 128.899994 M 0.2 M 1611.099976 M
2019-04-06 15:20:00 128.899994 M 0.2 M 1611.300049 M
2019-04-06 15:25:00 128.899994 M 0.2 M 1611.5 M
2019-04-06 15:30:00 128.899994 M 0.2 M 1611.699951 M
2019-04-06 15:35:00 128.899994 M 0.2 M 1611.900024 M
2019-04-06 15:40:00 128.899994 M 0.2 M 1612.099976 M
2019-04-06 15:45:00 128.899994 M 0.2 M 1612.300049 M
2019-04-06 15:50:00 128.899994 M 0.2 M 1612.5 M
2019-04-06 15:55:00 128.899994 M 0.2 M 1612.699951 M
2019-04-06 16:00:00 128.899994 M 0.2 M 1612.900024 M
2019-04-06 16:05:00 128.899994 M 0.2 M 1613.099976 M
2019-04-06 16:10:00 128.899994 M 0.2 M 1613.300049 M
2019-04-06 16:15:00 128.899994 M 0.2 M 1613.5 M
2019-04-06 16:20:00 128.899994 M 0.2 M 1613.699951 M
2019-04-06 16:25:00 128.899994 M 0.2 M 1613.900024 M
2019-04-06 16:30:00 128.899994 M 0.2 M 1614.099976 M
2019-04-06 16:35:00 128.899994 M 0.2 M 1614.300049 M
2019-04-06 16:40:00 128.899994 M 0.2 M 1614.5 M
2019-04-06 16:45:00 128.899994 M 0.2 M 1614.699951 M
2019-04-06 16:50:00 128.899994 M 0.2 M 1614.900024 M
2019-04-06 16:55:00 128.899994 M 0.2 M 1615.099976 M
2019-04-06 17:00:00 128.899994 M 0.2 M 1615.300049 M
2019-04-06 17:05:00 128.899994 M 0.2 M 1615.5 M
2019-04-06 17:10:00 128.899994 M 0.2 M 1615.699951 M
2019-04-06 17:15:00 128.899994 M 0.2 M 1615.900024 M
2019-04-06 17:20:00 128.899994 M 0.2 M 1616.099976 M
2019-04-06 17:25:00 128.899994 M 0.2 M 1616.300049 M
2019-04-06 17:30:00 128.899994 M 0.2 M 1616.5 M
2019-04-06 17:35:00 128.899994 M 0.2 M 1616.699951 M
2019-04-06 17:40:00 128.899994 M 0.2 M 1616.900024 M
2019-04-06 17:45:00 128.899994 M 0.2 M 1617.099976 M
2019-04-06 17:50:00 128.899994 M 0.2 M 1617.300049 M
2019-04-06 17:55:00 128.899994 M 0.2 M 1617.5 M
2019-04-06 18:00:00 128.899994 M 0.2 M 1617.699951 M
2019-04-06 18:05:00 128.899994 M 0.2 M 1617.900024 M
2019-04-06 18:10:00 128.899994 M 0.2 M 1618.099976 M
2019-04-06 18:15:00 128.899994 M 0.2 M 1618.300049 M
[588]:
strt = pd.to_datetime('4/6/19 0955')
end = pd.to_datetime('4/9/19 2045')

cnsa[strt:end].describe()
[588]:
INST TOT ACC
count 995.0 995.0 995.0
mean 128.899994 0.2 1697.698467
std 0.0 0.0 57.474289
min 128.899994 0.2 1598.300049
25% 128.899994 0.2 1647.999023
50% 128.899994 0.2 1697.697998
75% 128.899994 0.2 1747.39801
max 128.899994 0.2 1797.097046
[10]:
cnsa.loc[strt:end, 'TOT'].sum()
[10]:
199.00000296533108

The tank value is 100% static! This was flagged ‘M’, the tank values were deleted, and had pd.ffill() applied during load. Tank value does not change because it is missing data that was forward filled from the last known value. Yet 199 mm of precip accumulate! This is totally bad data! So the question is how to filter it.

[11]:
#strt = pd.to_datetime('4/5/19')
end = strt + pd.to_timedelta('11D')

ax1 = df_ratio[strt:end].plot(legend=True, grid=True, color='g')
ax2=plt.twinx(ax1)
df.loc[strt:end,['ACC_SA']].plot(ax=ax2, legend=True, grid=True)

strt_diff = df.loc[strt,'ACC_SH']-df.loc[strt,'ACC_SA']
(df.loc[strt:end,'ACC_SH']-strt_diff).plot(ax=ax2, legend=True, grid=True)

df.loc[strt:end,'ACC_SA'][clog].plot(ax=ax2, grid=True, linestyle='', marker='o', color='navy', legend=True)
ratio[strt:end][clog].plot(ax=ax1, grid=True, linestyle='', marker='o', color='navy', legend=True)

rolling_rat = ratio[strt:end].rolling('8D').mean(engine=numba)
rolling_rat.plot(ax=ax1, grid=True, color='g', linestyle=':', legend=True)
rolling_rat[clog].plot(ax=ax1, grid=True, linestyle='', marker='o', color='navy', legend=True)
[11]:
<Axes: xlabel='Date'>
[12]:
ax2.legend().set_loc('lower left')

Compare to raw data

fpath = input()
[562]:
raw = pd.read_csv(fpath, header=1, parse_dates=True, index_col='TIMESTAMP',
                                encoding='latin-1', engine='pyarrow', dtype_backend='pyarrow')
raw = raw.iloc[2:]
raw.index = pd.to_datetime(raw.index)
[563]:
pd.options.display.min_rows = 45
pd.merge(raw.loc[strt:end, 'SA_PRECIP'],cnsa.loc[strt:end, 'TOT'], left_index=True, right_index=True)
[563]:
SA_PRECIP TOT
TIMESTAMP
2019-04-06 09:55:00 128.9 0.2
2019-04-06 10:00:00 129.1 0.2
2019-04-06 10:05:00 129.3 0.2
2019-04-06 10:10:00 129.7 0.2
2019-04-06 10:15:00 129.8 0.2
2019-04-06 10:20:00 129.8 0.2
2019-04-06 10:25:00 129.9 0.2
2019-04-06 10:30:00 130.3 0.2
2019-04-06 10:35:00 130.4 0.2
2019-04-06 10:40:00 130.7 0.2
2019-04-06 10:45:00 131 0.2
2019-04-06 10:50:00 130.9 0.2
2019-04-06 10:55:00 131 0.2
2019-04-06 11:00:00 131.1 0.2
2019-04-06 11:05:00 131.1 0.2
2019-04-06 11:10:00 131.2 0.2
2019-04-06 11:15:00 131.2 0.2
2019-04-06 11:20:00 131.2 0.2
2019-04-06 11:25:00 131.2 0.2
2019-04-06 11:30:00 131.3 0.2
... ... ...
2019-04-09 19:10:00 133.5 0.2
2019-04-09 19:15:00 133.5 0.2
2019-04-09 19:20:00 133.5 0.2
2019-04-09 19:25:00 133.5 0.2
2019-04-09 19:30:00 133.6 0.2
2019-04-09 19:35:00 133.5 0.2
2019-04-09 19:40:00 133.5 0.2
2019-04-09 19:45:00 133.5 0.2
2019-04-09 19:50:00 133.5 0.2
2019-04-09 19:55:00 133.5 0.2
2019-04-09 20:00:00 133.5 0.2
2019-04-09 20:05:00 133.4 0.2
2019-04-09 20:10:00 133.5 0.2
2019-04-09 20:15:00 133.5 0.2
2019-04-09 20:20:00 133.5 0.2
2019-04-09 20:25:00 133.5 0.2
2019-04-09 20:30:00 133.5 0.2
2019-04-09 20:35:00 133.5 0.2
2019-04-09 20:40:00 133.5 0.2
2019-04-09 20:45:00 133.5 0.2

995 rows × 2 columns

[592]:
plt.figure()
raw.loc[strt:end, 'SA_PRECIP'].astype('float[pyarrow]').plot(grid=True, legend=True)
tot = cnsa.loc[strt:end, 'TOT'].cumsum()
tot.index = tot.index.astype('datetime64[ns]')
ax2 = plt.twinx()
tot.plot(ax=ax2, grid=True, color='orange')
plt.legend(loc='lower right')
[592]:
<matplotlib.legend.Legend at 0x2d9189b2bb0>

So, the raw data isn’t perfectly flat, but we have 3.6 mm of tank change and 199 mm of precip. We can see that, right away, the precip is doubling the incriment that the tank is increasing by, and within 5 timesteps, the tank pauses it’s accumulation, but the precip still remains constant. So, while not absolutely flat, it shows the same problem. It would be nice if the raw tank values weren’t deleted. Let’s see if we can filter out this issue.

Filtering for a flat tank

Max == min

The extreme example above might have a simple solution.

[13]:
dly_tank = cnsa['INST'].rolling('1D')
dly_max = dly_tank.max(engine=numba)
dly_min = dly_tank.min(engine=numba)

flat_tank = dly_max == dly_min
raining = cnsa['TOT'] > 0

cnsa[flat_tank&raining]
[13]:
INST INST_Flag TOT TOT_Flag ACC ACC_Flag
Date
2019-04-07 09:50:00 128.899994 M 0.2 M 1655.698975 M
2019-04-07 09:55:00 128.899994 M 0.2 M 1655.899048 M
2019-04-07 10:00:00 128.899994 M 0.2 M 1656.098999 M
2019-04-07 10:05:00 128.899994 M 0.2 M 1656.29895 M
2019-04-07 10:10:00 128.899994 M 0.2 M 1656.499023 M
... ... ... ... ... ... ...
2019-04-09 20:25:00 128.899994 M 0.2 M 1796.296997 M
2019-04-09 20:30:00 128.899994 M 0.2 M 1796.496948 M
2019-04-09 20:35:00 128.899994 M 0.2 M 1796.697021 M
2019-04-09 20:40:00 128.899994 M 0.2 M 1796.896973 M
2019-04-09 20:45:00 128.899994 M 0.2 M 1797.097046 M

708 rows × 6 columns

[14]:
# test with Shelter
dly_tank = cnsh['INST'].rolling('1D')
dly_max = dly_tank.max(engine=numba)
dly_min = dly_tank.min(engine=numba)

flat_tank = dly_max == dly_min
raining = cnsh['TOT'] > 0

cnsh[flat_tank&raining]
[14]:
INST INST_Flag TOT TOT_Flag ACC ACC_Flag
Date

That caught the problem right away. A few issues though:

  1. This caught the one and only known case, but I still don’t know what caused it, and it didn’t catch any other cases. So are there cases we still don’t know about?

  2. This starts a day late. It will need to be adjusted back by the size of the window (‘1D’).

  3. I already have a qa rule to catch flat tanks…I’m suspicious that I need another one. Maybe I should take another look at QaRules.flag_duplicate_precip(), plus an unused rule QaRules.flag_overaccum_precip()

Let me try another method that might work better with our diurnal fluctuations we see at UPLO and VARA.

Max==min w/ a buffer

[15]:
dly_tank = cnsa['INST'].rolling('1D')
dly_max = dly_tank.max(engine=numba)
dly_min = dly_tank.min(engine=numba)

flat_tank = abs(dly_max - dly_min) < 0.288
raining = cnsa['TOT'] > 0

cnsa[flat_tank&raining]
[15]:
INST INST_Flag TOT TOT_Flag ACC ACC_Flag
Date
2017-10-01 00:05:00 239.699997 <NA> 0.2 Q 0.2 Q
2018-02-20 13:55:00 62.82 <NA> 0.09 <NA> 1216.300049 <NA>
2018-10-01 21:55:00 16.76 <NA> 0.01 <NA> 0.19 <NA>
2019-04-07 09:45:00 128.899994 M 0.2 M 1655.499023 M
2019-04-07 09:50:00 128.899994 M 0.2 M 1655.698975 M
... ... ... ... ... ... ...
2019-04-09 20:40:00 128.899994 M 0.2 M 1796.896973 M
2019-04-09 20:45:00 128.899994 M 0.2 M 1797.097046 M
2019-09-30 22:25:00 64.290001 <NA> 0.12 <NA> 2423.826904 <NA>
2019-11-21 14:20:00 59.259998 <NA> 0.04 <NA> 166.479996 <NA>
2022-09-19 00:15:00 12.14 <NA> 0.03 <NA> 2238.02002 <NA>

715 rows × 6 columns

This still catches the April 2019 event and the rows not shown are all during that event. I’ll run down the newly identified times and see what they look like.

[16]:
cnsa.head(15)
[16]:
INST INST_Flag TOT TOT_Flag ACC ACC_Flag
Date
2017-10-01 00:00:00 239.5 <NA> 0.0 <NA> 0.0 <NA>
2017-10-01 00:05:00 239.699997 <NA> 0.2 Q 0.2 Q
2017-10-01 00:10:00 239.800003 <NA> 0.3 <NA> 0.5 <NA>
2017-10-01 00:15:00 239.899994 <NA> 0.1 <NA> 0.6 <NA>
2017-10-01 00:20:00 239.800003 <NA> 0.0 <NA> 0.6 <NA>
2017-10-01 00:25:00 239.899994 <NA> 0.0 <NA> 0.6 <NA>
2017-10-01 00:30:00 239.899994 <NA> 0.0 <NA> 0.6 <NA>
2017-10-01 00:35:00 239.899994 <NA> 0.0 <NA> 0.6 <NA>
2017-10-01 00:40:00 240.100006 <NA> 0.2 <NA> 0.8 <NA>
2017-10-01 00:45:00 240.300003 <NA> 0.2 <NA> 1.0 <NA>
2017-10-01 00:50:00 240.100006 <NA> 0.0 <NA> 1.0 <NA>
2017-10-01 00:55:00 240.300003 <NA> 0.0 <NA> 1.0 <NA>
2017-10-01 01:00:00 240.199997 <NA> 0.0 <NA> 1.0 <NA>
2017-10-01 01:05:00 240.300003 <NA> 0.0 <NA> 1.0 <NA>
2017-10-01 01:10:00 240.300003 <NA> 0.0 <NA> 1.0 <NA>
[17]:
pair_tank = qc.df_orig['INST_SH']

ax = flag.plot_flagged_day(pd.to_datetime('2/19/2018 1355'), 'CEN_01', tdelta='1D', auto_qa_event=clog_event, paired_tank=pair_tank)
[18]:
pair_tank = qc.df_orig['INST_SH']

ax = flag.plot_flagged_day(pd.to_datetime('9/30/2018 2055'), 'CEN_01', tdelta='1D', auto_qa_event=clog_event, paired_tank=pair_tank)

The first one seems like it is just because there aren’t enough values. I will set a minnumum number of datapoints to fix that. The second one looks like it’s just the first rain after a long rain-free period. The third one is the most telling. There is minimal bounce in the tank values, but simplepre.m is outputting some small precip values. So, after a full day where the tank value doesn’t change anymore than it’s precision, the little bit of precip being spit out by Fox’s algorythm is caught by the filter. A great example of why this filter will not work.

The flat tank is only important because, in our first example, there were a long series of NAN values that were filled with a forward fill. If we move away from the flat tank criteria, this filter would be more focused on precip when there was tank bounce that may or may not be valid. Since we already have a fully fleshed out method for dealing with duplicates, let’s see if we can make that logic work to capture this instance.

QaRules.flag_duplicate_precip

This was developed above in Remove-duplicate-values, a sub-section of GCE-flagging-during-clogs, which was focused on doubling of precip after clogs dumped large amounts into the tanks all at once. It usually was flagged by GCE as an ‘F’ following a ‘J’.

The logic was: duplicate = large_ppt & repeat_val & flat_tank

where large_ppt was defined as:

precision = self.precision * int(large_ppt_size)
large_ppt = rainfall > large_ppt_size

and both flat_tank and repeat_val were Series.diff() < precision.

I’ll start by looking at what the filter grabed when I created the clean ACC data above with large_ppt = 1* precision, which is hopefully the duplicate values I already know about. Then I’ll start expanding and see if it will capture the April 2019 event.

[19]:
qc.df_orig[qc.qa_events.duplicate]
[19]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2018-11-21 17:00:00 217.199997 <NA> 5.1 F 205.630005 F 236.800003 <NA> 0.0 <NA> 192.649994 <NA>
2018-12-25 09:25:00 307.299988 <NA> 173.0 F 826.309998 F 311.399994 <NA> 0.0 <NA> 628.419983 <NA>
2019-06-28 00:20:00 141.300003 <NA> 21.200001 F 2247.336914 F 170.600006 <NA> 0.0 <NA> 1778.619995 <NA>
2019-09-08 13:20:00 35.189999 <NA> 4.48 F 2278.616943 F 59.5 <NA> 0.0 <NA> 1807.150024 <NA>

This had been locked so that large_ppt_size was an integer. Therefore 1mm was the minimum. I adjusted this to be increments of precision. This lets me adjust the definition of what is large ppt more finely to test this. ** large_ppt relabeled min_ppt.**

minimum_ppt > 0

Criteria:

# flat_tank and repeat_val use 1*precision
large_ppt = rainfall > 0
flat_tank & repeat_val & large_ppt
[20]:
pd.options.display.min_rows = 20
# reset
qc.qa_events.duplicate = False
# rerun
qc.flag_double_precip(ppt_col='TOT_SA', tank_col='INST_SA', min_ppt_nprecision=0,
                      flat_tank_nprecision=1, flat_ppt_nprecision=1)

print(qc.df_orig[qc.qa_events.duplicate].describe())
qc.df_orig[qc.qa_events.duplicate].head(20)
          INST_SA    TOT_SA       ACC_SA     INST_SH    TOT_SH       ACC_SH
count     24031.0   24031.0      24031.0     24031.0   24031.0      24031.0
mean   139.274711   0.15218  1055.776442  166.930365   0.10182   964.259268
std      99.73674  1.125805   635.093782   99.344013  0.162233   569.434429
min         7.499      0.01         0.08       22.74       0.0          0.0
25%     58.919998       0.1   454.029999   83.699997       0.0   432.779999
50%    116.900002      0.12  1089.650024  142.600006       0.0   987.150024
75%    205.300003       0.2  1544.019958  236.399994       0.2  1421.699951
max    498.200012     173.0  2423.826904  507.100006       1.7  2220.330078
[20]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-01 00:05:00 239.699997 <NA> 0.2 Q 0.2 Q 259.799988 <NA> 0.2 Q 0.2 Q
2017-10-01 00:10:00 239.800003 <NA> 0.3 <NA> 0.5 <NA> 260.100006 <NA> 0.5 <NA> 0.7 <NA>
2017-10-01 00:15:00 239.899994 <NA> 0.1 <NA> 0.6 <NA> 260.100006 <NA> 0.0 <NA> 0.7 <NA>
2017-10-01 00:40:00 240.100006 <NA> 0.2 <NA> 0.8 <NA> 260.399994 <NA> 0.0 <NA> 1.0 <NA>
2017-10-01 00:45:00 240.300003 <NA> 0.2 <NA> 1.0 <NA> 260.799988 <NA> 0.4 <NA> 1.4 <NA>
2017-10-01 01:55:00 240.399994 <NA> 0.1 <NA> 1.1 <NA> 260.799988 <NA> 0.0 <NA> 1.6 <NA>
2017-10-01 02:25:00 241.300003 <NA> 0.2 <NA> 2.0 <NA> 261.299988 <NA> 0.0 <NA> 2.2 <NA>
2017-10-01 02:50:00 241.399994 <NA> 0.1 <NA> 2.1 <NA> 261.700012 <NA> 0.0 <NA> 2.4 <NA>
2017-10-01 02:55:00 241.5 <NA> 0.1 <NA> 2.2 <NA> 261.799988 <NA> 0.0 <NA> 2.4 <NA>
2017-10-01 03:05:00 241.699997 <NA> 0.2 <NA> 2.4 <NA> 262.0 <NA> 0.0 <NA> 2.6 <NA>
2017-10-01 03:30:00 241.800003 <NA> 0.1 <NA> 2.5 <NA> 262.200012 <NA> 0.0 <NA> 3.1 <NA>
2017-10-01 03:50:00 242.399994 <NA> 0.2 <NA> 3.1 <NA> 263.0 <NA> 0.5 <NA> 3.6 <NA>
2017-10-01 04:00:00 242.899994 <NA> 0.2 <NA> 3.6 <NA> 263.200012 <NA> 0.2 <NA> 3.8 <NA>
2017-10-01 04:15:00 243.0 <NA> 0.1 <NA> 3.7 <NA> 263.399994 <NA> 0.1 <NA> 4.0 <NA>
2017-10-01 05:05:00 243.100006 <NA> 0.1 <NA> 3.8 <NA> 263.700012 <NA> 0.0 <NA> 4.3 <NA>
2017-10-01 06:05:00 243.199997 <NA> 0.1 <NA> 3.9 <NA> 263.5 <NA> 0.0 <NA> 4.3 <NA>
2017-10-01 06:25:00 243.399994 <NA> 0.2 <NA> 4.1 <NA> 263.899994 <NA> 0.2 <NA> 4.5 <NA>
2017-10-01 06:55:00 244.600006 <NA> 0.1 <NA> 5.3 <NA> 265.5 <NA> 0.3 <NA> 6.1 <NA>
2017-10-01 07:00:00 244.699997 <NA> 0.1 <NA> 5.4 <NA> 265.5 <NA> 0.0 <NA> 6.1 <NA>
2017-10-01 07:10:00 244.899994 <NA> 0.2 <NA> 5.6 <NA> 265.200012 <NA> 0.0 <NA> 6.1 <NA>

precip > 0 capture 24,310 rows, and the first 20 appear to be perfectly valid precip. It’s just real gradual. This threshold is clearly catching a lot of false positives. Although, if the tank is changing less than the precision of the sensor, it does make you wonder why we’re accumulating so much precip in this mode.

minimum_ppt > 0.2794

Criteria:

# flat_tank and repeat_val use 1*precision
large_ppt = rainfall > 0.2794
flat_tank & repeat_val & large_ppt
[21]:
pd.options.display.min_rows = 20
# reset
qc.qa_events.duplicate = False
# rerun
qc.flag_double_precip(ppt_col='TOT_SA', tank_col='INST_SA',  min_ppt_nprecision=1,
                      flat_tank_nprecision=1, flat_ppt_nprecision=1)
qc.df_orig[qc.qa_events.duplicate]
[21]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-01 00:10:00 239.800003 <NA> 0.3 <NA> 0.5 <NA> 260.100006 <NA> 0.5 <NA> 0.7 <NA>
2017-10-19 14:45:00 346.5 MM 0.3 M 107.5 M 352.299988 M 0.2 M 103.699997 M
2017-10-19 14:50:00 346.5 MM 0.3 M 107.800003 M 352.299988 M 0.2 M 103.900002 M
2017-10-19 14:55:00 346.5 MM 0.3 M 108.099998 M 352.299988 M 0.2 M 104.099998 M
2017-10-19 15:00:00 346.5 MM 0.3 M 108.400002 M 352.299988 M 0.2 M 104.300003 M
2017-10-19 15:05:00 346.5 MM 0.3 M 108.699997 M 352.299988 M 0.2 M 104.5 M
2017-10-19 15:10:00 346.5 MM 0.3 M 109.0 M 352.299988 M 0.2 M 104.699997 M
2017-10-19 15:15:00 346.5 MM 0.3 M 109.300003 M 352.299988 M 0.2 M 104.900002 M
2017-10-19 15:20:00 346.5 MM 0.3 M 109.599998 M 352.299988 M 0.2 M 105.099998 M
2017-10-19 15:25:00 346.5 MM 0.3 M 109.900002 M 352.299988 M 0.2 M 105.300003 M
2017-10-19 15:30:00 346.5 MM 0.3 M 110.199997 M 352.299988 M 0.2 M 105.5 M
2018-11-21 17:00:00 217.199997 <NA> 5.1 F 205.630005 F 236.800003 <NA> 0.0 <NA> 192.649994 <NA>
2018-12-25 09:25:00 307.299988 <NA> 173.0 F 826.309998 F 311.399994 <NA> 0.0 <NA> 628.419983 <NA>
2019-06-28 00:20:00 141.300003 <NA> 21.200001 F 2247.336914 F 170.600006 <NA> 0.0 <NA> 1778.619995 <NA>
2019-09-08 13:20:00 35.189999 <NA> 4.48 F 2278.616943 F 59.5 <NA> 0.0 <NA> 1807.150024 <NA>
[22]:
clog_event['duplicate'] = qc.qa_events.duplicate

flag.apply_FSDB_flags(clog_event, clog_flag)

# add the original data to the instance to plot
flag.import_provisional_data(qc.df_orig, tank_col='INST_SA', ppt_col='TOT_SA',
                             ppt_flag_col='TOT_Flag_SA')

pair_tank = qc.df_orig['INST_SH']

ax = flag.plot_flagged_day(pd.to_datetime('10/19/17 1400'), 'CEN_01', tdelta='90min', auto_qa_event=clog_event, paired_tank=pair_tank)
[23]:
strt = pd.to_datetime('10/19/17 1400')
this_event = cn_ppt.loc[strt:pd.to_datetime('10/19/17 1530'), ['ACC_SA', 'ACC_SH']]

strt_diff = this_event.loc[strt, 'ACC_SA'] - this_event.loc[strt, 'ACC_SH']

this_event['ACC_SA'] -= strt_diff
this_event.plot(grid=True, legend=True, marker='*')

[23]:
<Axes: xlabel='Date'>
[24]:
ax2 = plt.twinx(plt.gca())

strt = pd.to_datetime('10/19/17 1400')
this_event = cn_ppt.loc[strt:pd.to_datetime('10/19/17 1530'), ['INST_SA', 'INST_SH']]

strt_diff = this_event.loc[strt, 'INST_SA'] - this_event.loc[strt, 'INST_SH']

this_event['INST_SA'] -= strt_diff
this_event.plot(ax=ax2, linestyle='--', marker='.', grid=True, legend=True).legend(loc='lower right')
[24]:
<matplotlib.legend.Legend at 0x2d94fc32a30>

precip > 0.2794 captures our original group, and a cluster from 2017 that look like it might need to be flagged. However, the April 2019 case is already excluded, because it accumulates at a rate of 0.2, and precision is 0.2794. So this is underflagging already. Let’s try half of precision.

minimum_ppt > 0.2794 / 2

Criteria:

# flat_tank and repeat_val use 1*precision
large_ppt = rainfall > (0.2794/2)
flat_tank & repeat_val & large_ppt
[25]:
pd.options.display.min_rows = 15
# reset
qc.qa_events.duplicate = False
# reflag

qc.flag_double_precip(ppt_col='TOT_SA', tank_col='INST_SA', min_ppt_nprecision=0.5,
                      flat_tank_nprecision=1, flat_ppt_nprecision=1)
qc.df_orig[qc.qa_events.duplicate]
[25]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-01 00:05:00 239.699997 <NA> 0.2 Q 0.2 Q 259.799988 <NA> 0.2 Q 0.2 Q
2017-10-01 00:10:00 239.800003 <NA> 0.3 <NA> 0.5 <NA> 260.100006 <NA> 0.5 <NA> 0.7 <NA>
2017-10-01 00:40:00 240.100006 <NA> 0.2 <NA> 0.8 <NA> 260.399994 <NA> 0.0 <NA> 1.0 <NA>
2017-10-01 00:45:00 240.300003 <NA> 0.2 <NA> 1.0 <NA> 260.799988 <NA> 0.4 <NA> 1.4 <NA>
2017-10-01 02:25:00 241.300003 <NA> 0.2 <NA> 2.0 <NA> 261.299988 <NA> 0.0 <NA> 2.2 <NA>
2017-10-01 03:05:00 241.699997 <NA> 0.2 <NA> 2.4 <NA> 262.0 <NA> 0.0 <NA> 2.6 <NA>
2017-10-01 03:50:00 242.399994 <NA> 0.2 <NA> 3.1 <NA> 263.0 <NA> 0.5 <NA> 3.6 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ...
2022-09-29 00:55:00 24.35 <NA> 0.17 <NA> 2250.22998 <NA> 40.610001 <NA> 0.02 <NA> 2216.139893 <NA>
2022-09-29 02:30:00 24.879999 <NA> 0.16 <NA> 2250.76001 <NA> 40.860001 <NA> 0.12 <NA> 2216.389893 <NA>
2022-09-29 13:20:00 27.48 <NA> 0.16 <NA> 2253.360107 <NA> 44.450001 <NA> 0.11 <NA> 2219.97998 <NA>
2022-09-29 15:25:00 28.190001 <NA> 0.24 <NA> 2254.070068 <NA> 44.41 <NA> 0.0 <NA> 2220.01001 <NA>
2022-09-29 15:30:00 28.34 <NA> 0.15 <NA> 2254.219971 <NA> 44.32 <NA> 0.0 <NA> 2220.01001 <NA>
2022-09-29 15:35:00 28.49 <NA> 0.15 <NA> 2254.370117 <NA> 44.32 <NA> 0.0 <NA> 2220.01001 <NA>
2022-09-29 20:25:00 29.34 <NA> 0.14 <NA> 2255.219971 <NA> 44.650002 <NA> 0.06 <NA> 2220.179932 <NA>

9236 rows × 12 columns

OK, so if any precip captrues 24,000 rows, and precip greater than precision (0.2794) captures 15 rows, half precision does just a bit better than splitting the difference at 9,236. That might still be too much for me to drill down into. Maybe a list of dates will be more clear.

[26]:
qc.df_orig[qc.qa_events.duplicate].index
[26]:
Index([2017-10-01 00:05:00, 2017-10-01 00:10:00, 2017-10-01 00:40:00,
       2017-10-01 00:45:00, 2017-10-01 02:25:00, 2017-10-01 03:05:00,
       2017-10-01 03:50:00, 2017-10-01 04:00:00, 2017-10-01 06:25:00,
       2017-10-01 07:10:00,
       ...
       2022-09-21 10:25:00, 2022-09-21 17:35:00, 2022-09-21 19:40:00,
       2022-09-29 00:55:00, 2022-09-29 02:30:00, 2022-09-29 13:20:00,
       2022-09-29 15:25:00, 2022-09-29 15:30:00, 2022-09-29 15:35:00,
       2022-09-29 20:25:00],
      dtype='timestamp[s][pyarrow]', name='Date', length=9236)

minimum_ppt > 0.196

Criteria:

# flat_tank and repeat_val use 1*precision
large_ppt = rainfall > 0.196
flat_tank & repeat_val & large_ppt
[27]:
pd.options.display.min_rows = 5
# reset
qc.qa_events.duplicate = False
# reflag

qc.flag_double_precip(ppt_col='TOT_SA', tank_col='INST_SA', min_ppt_nprecision=0.7,
                      flat_tank_nprecision=1, flat_ppt_nprecision=1)
qc.df_orig[qc.qa_events.duplicate]
[27]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-01 00:05:00 239.699997 <NA> 0.2 Q 0.2 Q 259.799988 <NA> 0.2 Q 0.2 Q
2017-10-01 00:10:00 239.800003 <NA> 0.3 <NA> 0.5 <NA> 260.100006 <NA> 0.5 <NA> 0.7 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ...
2022-09-21 19:40:00 19.91 <NA> 0.25 <NA> 2245.790039 <NA> 37.380001 <NA> 0.19 <NA> 2212.909912 <NA>
2022-09-29 15:25:00 28.190001 <NA> 0.24 <NA> 2254.070068 <NA> 44.41 <NA> 0.0 <NA> 2220.01001 <NA>

9119 rows × 12 columns

That’s not getting much better. It seems like to include the April 2019 event, I end up including all sorts of other individual moments that seem like they’re mostly false positives. Plus, sometimes it grabs a single timestamp from a day, which is more likely a result of the algorithm “catching up” to tank accumulation. Of course, I don’t really understand what in GCE causes this error, so I’m guessing a little, but just one timestamp in a day seems like it is part of a bigger and complicated picture for at least that hour’s precip. This method doesn’t seem to be working at finding consistent periods of repeat artifical values.

One key that seems to be missing here is looking over a longer period of time than just one timestep to the next. This method was developed to find individual timesteps that were duplicated. This precip question seems to need a longer looking approach, like the rolling window (tested below).

However, the focus has been on tuning the size of the precip. With a modificaiton, I could tune how flat the tank must be and how close the precip values.

Identical precision for flat and repeat

In the specific case of artificial precipitation, as found in April 2019, the tank is precisely flat, because it is the same value filled forward. And for some strange reason, the precipitation is also completely identical. The original method was looking for nearly identical tank values and nearly duplicated precipitation. This case is actually repeating, so I can make it exact. I had to adjust the function to allow for >= 0 difference.

Criteria:

# flat_tank and repeat_val use 0*precision
repeat_val = (abs(rainfall.diff()) <= 0) & (rainfall > 0)
flat_tank = df[tank_col].diff() <= 0

minimum_ppt = rainfall > 0

flat_tank & repeat_val & minimum_ppt
[85]:
#pd.set_option('display.min_rows', 100)
pd.options.display.max_rows = 35
[157]:
# reset
qc.qa_events.duplicate = False
# reflag

qc.flag_repeating_val_precip(ppt_col='TOT_SA', tank_col='INST_SA', flat_ppt_nprecision=0, flat_tank_nprecision=0,
                            min_ppt_nprecision=0)
qc.df_orig[qc.qa_events.duplicate]
[157]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-19 14:45:00 346.5 MM 0.3 M 107.5 M 352.299988 M 0.2 M 103.699997 M
2017-10-19 14:50:00 346.5 MM 0.3 M 107.800003 M 352.299988 M 0.2 M 103.900002 M
2017-10-19 14:55:00 346.5 MM 0.3 M 108.099998 M 352.299988 M 0.2 M 104.099998 M
2017-10-19 15:00:00 346.5 MM 0.3 M 108.400002 M 352.299988 M 0.2 M 104.300003 M
2017-10-19 15:05:00 346.5 MM 0.3 M 108.699997 M 352.299988 M 0.2 M 104.5 M
2017-10-19 15:10:00 346.5 MM 0.3 M 109.0 M 352.299988 M 0.2 M 104.699997 M
2017-10-19 15:15:00 346.5 MM 0.3 M 109.300003 M 352.299988 M 0.2 M 104.900002 M
2017-10-19 15:20:00 346.5 MM 0.3 M 109.599998 M 352.299988 M 0.2 M 105.099998 M
2017-10-19 15:25:00 346.5 MM 0.3 M 109.900002 M 352.299988 M 0.2 M 105.300003 M
2017-10-19 15:30:00 346.5 MM 0.3 M 110.199997 M 352.299988 M 0.2 M 105.5 M
... ... ... ... ... ... ... ... ... ... ... ... ...
2022-03-13 02:10:00 39.169998 MM 0.26 M 1269.619995 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:15:00 39.169998 MM 0.26 M 1269.880005 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:20:00 39.169998 MM 0.26 M 1270.140015 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:25:00 39.169998 MM 0.26 M 1270.400024 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:30:00 39.169998 MM 0.26 M 1270.660034 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:35:00 39.169998 MM 0.26 M 1270.920044 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:40:00 39.169998 MM 0.26 M 1271.180054 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:45:00 39.169998 MM 0.26 M 1271.439941 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:50:00 39.169998 MM 0.26 M 1271.699951 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:55:00 39.169998 MM 0.26 M 1271.959961 M 69.540001 M 0.0 M 1182.130005 M

1309 rows × 12 columns

[158]:
qc.df_orig.loc[qc.qa_events.duplicate,'TOT_SA'].sum()
[158]:
224.22000300884247

That looks like it is working really well. It always has a string of several values and they look suspicious. And I don’t see any singular timestamps. The only downside is, if we start getting real tank values from provisional, it will stop working. I tried adjusting it to allow even 0.05*0.279 bounce in the tank value, and it adds a bunch singular timestamps without the timestamp before or after. So… this works on this particular version of GCE output, but will have a future break point.

It makes sense to see if we can put a QA rule in place that will continue to catch this problem, but will succeed against the future break point.

I want to plot one of the examples where the data wasn’t flagged missing before moving on.

[30]:
clog_event['duplicate'] = qc.qa_events.duplicate

flag.apply_FSDB_flags(clog_event, clog_flag)

# add the original data to the instance to plot
flag.import_provisional_data(qc.df_orig, tank_col='INST_SA', ppt_col='TOT_SA',
                             ppt_flag_col='TOT_Flag_SA')

pair_tank = qc.df_orig['INST_SH']

ax = flag.plot_flagged_day(pd.to_datetime('6/3/20 1200'), 'CEN_01', tdelta='2D', auto_qa_event=clog_event, paired_tank=pair_tank)
[31]:
plt.grid(True)

This is a little confusing. The tank value isn’t constant, so let’s double check this.

[32]:
ax = flag.plot_flagged_day(pd.to_datetime('6/4/20 0900'), 'CEN_01', tdelta='4h', auto_qa_event=clog_event, paired_tank=pair_tank)
[33]:
strt, end = pd.to_datetime('6/4/20 0800'), pd.to_datetime('6/4/20 1000')
qc.df_orig.loc[strt:end, ['INST_SA','INST_Flag_SA','TOT_SA','TOT_Flag_SA']]
[33]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA
Date
2020-06-04 08:00:00 19.5 <NA> 0.05 <NA>
2020-06-04 08:05:00 19.5 <NA> 0.05 <NA>
2020-06-04 08:10:00 19.5 <NA> 0.05 <NA>
2020-06-04 08:15:00 19.5 <NA> 0.05 <NA>
2020-06-04 08:20:00 19.5 <NA> 0.05 <NA>
2020-06-04 08:25:00 19.5 <NA> 0.05 <NA>
2020-06-04 08:30:00 19.5 <NA> 0.05 <NA>
2020-06-04 08:35:00 19.5 <NA> 0.05 <NA>
2020-06-04 08:40:00 19.5 <NA> 0.05 <NA>
2020-06-04 08:45:00 19.629999 <NA> 0.05 <NA>
2020-06-04 08:50:00 19.75 <NA> 0.05 <NA>
2020-06-04 08:55:00 19.629999 <NA> 0.05 <NA>
2020-06-04 09:00:00 19.75 <NA> 0.05 <NA>
2020-06-04 09:05:00 19.75 <NA> 0.05 <NA>
2020-06-04 09:10:00 19.75 <NA> 0.05 <NA>
2020-06-04 09:15:00 19.629999 <NA> 0.05 <NA>
2020-06-04 09:20:00 19.629999 <NA> 0.05 <NA>
2020-06-04 09:25:00 19.75 <NA> 0.05 <NA>
2020-06-04 09:30:00 19.75 <NA> 0.05 <NA>
2020-06-04 09:35:00 19.5 <NA> 0.05 <NA>
2020-06-04 09:40:00 19.5 <NA> 0.05 <NA>
2020-06-04 09:45:00 19.629999 <NA> 0.05 <NA>
2020-06-04 09:50:00 19.629999 <NA> 0.05 <NA>
2020-06-04 09:55:00 19.629999 <NA> 0.05 <NA>
2020-06-04 10:00:00 19.75 <NA> 0.05 <NA>
[34]:
cnsa.loc[pd.to_datetime('6/3/20 1200'):pd.to_datetime('6/4/20 1200'), 'TOT'].sum()
[34]:
10.950000163167715

We’re only catching parts! Every time the tank shifts 0.12, we stop flagging. We need to figure out how to make that continuous. 0.12 does seem to be the SD of the reading with a flat tank. But, as already stated above, that leads to all sorts of lone time stamps. Which, may highlight problems with simple_pre.m, but are not the case we are trying to run down. So I need to find a way to broaden the criteria for a flat tank to allow at least 0.12 mm of bounce, but only capture values that are consequtive.

Around 1/2” during a rain free period is worth adressing.

Requiring consecutive values

Hopefully with a requirement for consequitive values I can increase the precision for assessing when a tank is flat to capture the sensor bounce.


Shift 1

I’ll shift before and after and see if it also is True. I’ll start by making sure it works with the flat precision before trying to expand.

[159]:
after_repeat = qc.qa_events.duplicate.shift(1)
before_repeat = qc.qa_events.duplicate.shift(-1)

duplicates = qc.qa_events.duplicate & (after_repeat | before_repeat )

strt, end = pd.to_datetime('4/9/19 2015'), pd.to_datetime('4/9/19 2115')
qc.df_orig.loc[strt:end, ['INST_SA','INST_Flag_SA','TOT_SA','TOT_Flag_SA']]
[159]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA
Date
2019-04-09 20:15:00 128.899994 M 0.2 M
2019-04-09 20:20:00 128.899994 M 0.2 M
2019-04-09 20:25:00 128.899994 M 0.2 M
2019-04-09 20:30:00 128.899994 M 0.2 M
2019-04-09 20:35:00 128.899994 M 0.2 M
2019-04-09 20:40:00 128.899994 M 0.2 M
2019-04-09 20:45:00 128.899994 M 0.2 M
2019-04-09 20:50:00 128.899994 M 0.0 M
2019-04-09 20:55:00 128.899994 M 0.0 M
2019-04-09 21:00:00 128.899994 M 0.0 M
2019-04-09 21:05:00 128.899994 M 0.0 M
2019-04-09 21:10:00 128.899994 M 0.0 M
2019-04-09 21:15:00 128.899994 M 0.0 M
[160]:
duplicates[strt:end]
[160]:
Date
2019-04-09 20:15:00     True
2019-04-09 20:20:00     True
2019-04-09 20:25:00     True
2019-04-09 20:30:00     True
2019-04-09 20:35:00     True
2019-04-09 20:40:00     True
2019-04-09 20:45:00     True
2019-04-09 20:50:00    False
2019-04-09 20:55:00    False
2019-04-09 21:00:00    False
2019-04-09 21:05:00    False
2019-04-09 21:10:00    False
2019-04-09 21:15:00    False
Name: duplicate, dtype: bool[pyarrow]

Let’s make sure it catches the start of repeats cleanly too, then I’ll see how it does globally.

[161]:
strt, end = pd.to_datetime('4/6/19 0900'), pd.to_datetime('4/6/19 1030')
qc.df_orig.loc[strt:end, ['INST_SA','INST_Flag_SA','TOT_SA','TOT_Flag_SA']]
[161]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA
Date
2019-04-06 09:00:00 126.199997 <NA> 0.6 <NA>
2019-04-06 09:05:00 126.300003 <NA> 0.1 <NA>
2019-04-06 09:10:00 126.400002 <NA> 0.1 <NA>
2019-04-06 09:15:00 126.5 <NA> 0.1 <NA>
2019-04-06 09:20:00 126.699997 <NA> 0.2 <NA>
2019-04-06 09:25:00 126.900002 <NA> 0.2 <NA>
2019-04-06 09:30:00 127.300003 <NA> 0.4 <NA>
2019-04-06 09:35:00 127.800003 <NA> 0.5 <NA>
2019-04-06 09:40:00 128.300003 <NA> 0.5 <NA>
2019-04-06 09:45:00 128.600006 <NA> 0.3 <NA>
2019-04-06 09:50:00 128.699997 <NA> 0.1 <NA>
2019-04-06 09:55:00 128.899994 <NA> 0.2 <NA>
2019-04-06 10:00:00 128.899994 M 0.2 M
2019-04-06 10:05:00 128.899994 M 0.2 M
2019-04-06 10:10:00 128.899994 M 0.2 M
2019-04-06 10:15:00 128.899994 M 0.2 M
2019-04-06 10:20:00 128.899994 M 0.2 M
2019-04-06 10:25:00 128.899994 M 0.2 M
2019-04-06 10:30:00 128.899994 M 0.2 M
[162]:
duplicates[strt:end]
[162]:
Date
2019-04-06 09:00:00    False
2019-04-06 09:05:00    False
2019-04-06 09:10:00    False
2019-04-06 09:15:00    False
2019-04-06 09:20:00    False
2019-04-06 09:25:00    False
2019-04-06 09:30:00    False
2019-04-06 09:35:00    False
2019-04-06 09:40:00    False
2019-04-06 09:45:00    False
2019-04-06 09:50:00    False
2019-04-06 09:55:00    False
2019-04-06 10:00:00     True
2019-04-06 10:05:00     True
2019-04-06 10:10:00     True
2019-04-06 10:15:00     True
2019-04-06 10:20:00     True
2019-04-06 10:25:00     True
2019-04-06 10:30:00     True
Name: duplicate, dtype: bool[pyarrow]

Let’s look at what’s selected as a whole.

[124]:
qc.df_orig[duplicates]
[124]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-01 08:15:00 245.100006 <NA> 0.1 <NA> 5.8 <NA> 265.5 <NA> 0.0 <NA> 6.1 <NA>
2017-10-01 08:20:00 245.199997 <NA> 0.1 <NA> 5.9 <NA> 265.799988 <NA> 0.3 <NA> 6.4 <NA>
2017-10-13 03:05:00 324.600006 T 0.1 <NA> 85.300003 <NA> 343.700012 <NA> 0.0 <NA> 84.300003 <NA>
2017-10-13 03:10:00 324.700012 T 0.1 <NA> 85.400002 <NA> 343.700012 <NA> 0.0 <NA> 84.300003 <NA>
2017-10-13 03:15:00 324.799988 T 0.1 <NA> 85.5 <NA> 343.799988 <NA> 0.1 <NA> 84.400002 <NA>
2017-10-13 06:45:00 328.700012 T 0.1 <NA> 89.400002 <NA> 347.200012 <NA> 0.3 <NA> 87.800003 <NA>
2017-10-13 06:50:00 328.799988 T 0.1 <NA> 89.5 <NA> 347.200012 <NA> 0.0 <NA> 87.800003 <NA>
2017-10-19 14:45:00 346.5 MM 0.3 M 107.5 M 352.299988 M 0.2 M 103.699997 M
2017-10-19 14:50:00 346.5 MM 0.3 M 107.800003 M 352.299988 M 0.2 M 103.900002 M
2017-10-19 14:55:00 346.5 MM 0.3 M 108.099998 M 352.299988 M 0.2 M 104.099998 M
2017-10-19 15:00:00 346.5 MM 0.3 M 108.400002 M 352.299988 M 0.2 M 104.300003 M
2017-10-19 15:05:00 346.5 MM 0.3 M 108.699997 M 352.299988 M 0.2 M 104.5 M
2017-10-19 15:10:00 346.5 MM 0.3 M 109.0 M 352.299988 M 0.2 M 104.699997 M
2017-10-19 15:15:00 346.5 MM 0.3 M 109.300003 M 352.299988 M 0.2 M 104.900002 M
2017-10-19 15:20:00 346.5 MM 0.3 M 109.599998 M 352.299988 M 0.2 M 105.099998 M
2017-10-19 15:25:00 346.5 MM 0.3 M 109.900002 M 352.299988 M 0.2 M 105.300003 M
2017-10-19 15:30:00 346.5 MM 0.3 M 110.199997 M 352.299988 M 0.2 M 105.5 M
... ... ... ... ... ... ... ... ... ... ... ... ...
2022-05-14 05:05:00 309.700012 <NA> 0.1 <NA> 1902.439941 <NA> 321.299988 <NA> 0.3 <NA> 1773.589966 <NA>
2022-05-14 05:10:00 309.799988 <NA> 0.1 <NA> 1902.540039 <NA> 321.600006 <NA> 0.3 <NA> 1773.890015 <NA>
2022-05-30 09:05:00 86.099998 <NA> 0.1 <NA> 2000.290039 <NA> 113.699997 <NA> 0.5 <NA> 1944.400024 <NA>
2022-05-30 09:10:00 86.199997 <NA> 0.1 <NA> 2000.390015 <NA> 113.599998 <NA> 0.0 <NA> 1944.400024 <NA>
2022-05-30 13:40:00 88.099998 <NA> 0.1 <NA> 2002.290039 <NA> 115.5 <NA> 0.3 <NA> 1946.199951 <NA>
2022-05-30 13:45:00 88.199997 <NA> 0.1 <NA> 2002.390015 <NA> 115.5 <NA> 0.0 <NA> 1946.199951 <NA>
2022-06-04 22:55:00 114.400002 <NA> 0.1 <NA> 2028.589966 <NA> 141.199997 <NA> 0.3 <NA> 1971.900024 <NA>
2022-06-04 23:00:00 114.5 <NA> 0.1 <NA> 2028.689941 <NA> 141.100006 <NA> 0.0 <NA> 1971.900024 <NA>
2022-06-05 09:30:00 128.5 <NA> 0.1 <NA> 2042.689941 <NA> 154.199997 <NA> 0.0 <NA> 1984.900024 <NA>
2022-06-05 09:35:00 128.600006 <NA> 0.1 <NA> 2042.790039 <NA> 154.300003 <NA> 0.1 <NA> 1985.0 <NA>
2022-06-12 20:10:00 250.899994 <NA> 0.1 <NA> 2165.090088 <NA> 270.700012 <NA> 0.0 <NA> 2101.399902 <NA>
2022-06-12 20:15:00 251.0 <NA> 0.1 <NA> 2165.189941 <NA> 270.299988 <NA> 0.0 <NA> 2101.399902 <NA>
2022-06-13 20:35:00 263.200012 <NA> 0.1 <NA> 2177.389893 <NA> 281.600006 <NA> 0.0 <NA> 2112.5 <NA>
2022-06-13 20:40:00 263.299988 <NA> 0.1 <NA> 2177.48999 <NA> 281.899994 <NA> 0.1 <NA> 2112.600098 <NA>
2022-06-13 20:45:00 263.399994 <NA> 0.1 <NA> 2177.590088 <NA> 281.5 <NA> 0.0 <NA> 2112.600098 <NA>
2022-06-14 09:05:00 266.399994 <NA> 0.1 <NA> 2180.590088 <NA> 283.799988 <NA> 0.0 <NA> 2114.899902 <NA>
2022-06-14 09:10:00 266.5 <NA> 0.1 <NA> 2180.689941 <NA> 283.899994 <NA> 0.0 <NA> 2114.899902 <NA>

1798 rows × 12 columns

Just 2 in a row seems… a little small. I don’t love all the small events this is catching. It becomes less about catching an artificial repeat and more about questioning simple_pre.m. Which is not a task I’ve taken on.


3 in a row (shift 2)

The requirement can be both 1 and 2 spaces behind (or in front) of the value also have to be true. That would mean a minimum of 3. A couple speed testst to determine the fastest way to calculate this, and then see if we still capture the right start and stop times.

[107]:
%timeit qc.qa_events['duplicate'].shift([1,2]).all()
2.27 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
[108]:
%%timeit
after_repeat = qc.qa_events['duplicate'].shift(1)
after_repeat2 = after_repeat & qc.qa_events.duplicate.shift(2)
443 µs ± 23.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[109]:
%%timeit
after_repeat = qc.qa_events['duplicate'].shift(1)
after_repeat &= qc.qa_events['duplicate'].shift(2)
581 µs ± 37.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[164]:
after_repeat = qc.qa_events.duplicate.shift(1)
after_repeat2 = after_repeat & qc.qa_events.duplicate.shift(2)
before_repeat = qc.qa_events.duplicate.shift(-1)
before_repeat2 = before_repeat & qc.qa_events.duplicate.shift(-2)

duplicates = qc.qa_events.duplicate & (after_repeat2 | before_repeat2 )

strt, end = pd.to_datetime('4/9/19 2015'), pd.to_datetime('4/9/19 2115')
print(duplicates[strt:end])
strt, end = pd.to_datetime('4/6/19 0900'), pd.to_datetime('4/6/19 1030')
duplicates[strt:end]
Date
2019-04-09 20:15:00     True
2019-04-09 20:20:00     True
2019-04-09 20:25:00     True
2019-04-09 20:30:00     True
2019-04-09 20:35:00     True
2019-04-09 20:40:00     True
2019-04-09 20:45:00     True
2019-04-09 20:50:00    False
2019-04-09 20:55:00    False
2019-04-09 21:00:00    False
2019-04-09 21:05:00    False
2019-04-09 21:10:00    False
2019-04-09 21:15:00    False
Name: duplicate, dtype: bool[pyarrow]
[164]:
Date
2019-04-06 09:00:00    False
2019-04-06 09:05:00    False
2019-04-06 09:10:00    False
2019-04-06 09:15:00    False
2019-04-06 09:20:00    False
2019-04-06 09:25:00    False
2019-04-06 09:30:00    False
2019-04-06 09:35:00    False
2019-04-06 09:40:00    False
2019-04-06 09:45:00    False
2019-04-06 09:50:00    False
2019-04-06 09:55:00    False
2019-04-06 10:00:00     True
2019-04-06 10:05:00     True
2019-04-06 10:10:00     True
2019-04-06 10:15:00     True
2019-04-06 10:20:00     True
2019-04-06 10:25:00     True
2019-04-06 10:30:00     True
Name: duplicate, dtype: bool[pyarrow]

That confirms that it still works to capture the beginning and end of the big event that started this analysis. Now to see if the whole process works.

  1. widen the precision for defining a flat tank

  2. look to capture all of the 6/13/22 event

  3. check that other events captured are consecutive

[42]:
0.12 / qc.precision
[42]:
0.4294917680744453
[43]:
0.43*qc.precision
[43]:
0.12014199999999998
[170]:
pd.set_option('display.min_rows', 30)
# reset
qc.qa_events.duplicate = False
# reflag

qc.flag_repeating_val_precip(ppt_col='TOT_SA', tank_col='INST_SA', flat_ppt_nprecision=0, flat_tank_nprecision=0.43,
                            min_ppt_nprecision=0)

after_repeat = qc.qa_events.duplicate.shift(1)
after_repeat2 = after_repeat & qc.qa_events.duplicate.shift(2)
before_repeat = qc.qa_events.duplicate.shift(-1)
before_repeat2 = before_repeat & qc.qa_events.duplicate.shift(-2)

duplicates = qc.qa_events.duplicate & (after_repeat2 | before_repeat2 )

qc.df_orig[duplicates]
[170]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-13 03:05:00 324.600006 T 0.1 <NA> 85.300003 <NA> 343.700012 <NA> 0.0 <NA> 84.300003 <NA>
2017-10-13 03:15:00 324.799988 T 0.1 <NA> 85.5 <NA> 343.799988 <NA> 0.1 <NA> 84.400002 <NA>
2017-10-19 14:45:00 346.5 MM 0.3 M 107.5 M 352.299988 M 0.2 M 103.699997 M
2017-10-19 14:50:00 346.5 MM 0.3 M 107.800003 M 352.299988 M 0.2 M 103.900002 M
2017-10-19 14:55:00 346.5 MM 0.3 M 108.099998 M 352.299988 M 0.2 M 104.099998 M
2017-10-19 15:00:00 346.5 MM 0.3 M 108.400002 M 352.299988 M 0.2 M 104.300003 M
2017-10-19 15:05:00 346.5 MM 0.3 M 108.699997 M 352.299988 M 0.2 M 104.5 M
2017-10-19 15:10:00 346.5 MM 0.3 M 109.0 M 352.299988 M 0.2 M 104.699997 M
2017-10-19 15:15:00 346.5 MM 0.3 M 109.300003 M 352.299988 M 0.2 M 104.900002 M
2017-10-19 15:20:00 346.5 MM 0.3 M 109.599998 M 352.299988 M 0.2 M 105.099998 M
... ... ... ... ... ... ... ... ... ... ... ... ...
2022-03-13 02:50:00 39.169998 MM 0.26 M 1271.699951 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:55:00 39.169998 MM 0.26 M 1271.959961 M 69.540001 M 0.0 M 1182.130005 M
2022-04-19 12:15:00 354.100006 <NA> 0.1 <NA> 1586.890015 <NA> 362.600006 <NA> 0.2 <NA> 1475.040039 <NA>
2022-04-19 12:25:00 354.299988 <NA> 0.1 <NA> 1587.089966 <NA> 362.700012 <NA> 0.1 <NA> 1475.140015 <NA>
2022-04-20 15:55:00 370.299988 <NA> 0.1 <NA> 1603.089966 <NA> 377.799988 <NA> 0.0 <NA> 1490.439941 <NA>
2022-04-20 16:05:00 370.5 <NA> 0.1 <NA> 1603.290039 <NA> 378.100006 <NA> 0.0 <NA> 1490.73999 <NA>
2022-05-06 19:20:00 182.399994 <NA> 0.1 <NA> 1775.140015 <NA> 201.199997 <NA> 0.0 <NA> 1653.48999 <NA>
2022-05-06 19:30:00 182.600006 <NA> 0.1 <NA> 1775.339966 <NA> 201.399994 <NA> 0.2 <NA> 1653.689941 <NA>
2022-06-13 20:35:00 263.200012 <NA> 0.1 <NA> 2177.389893 <NA> 281.600006 <NA> 0.0 <NA> 2112.5 <NA>
2022-06-13 20:45:00 263.399994 <NA> 0.1 <NA> 2177.590088 <NA> 281.5 <NA> 0.0 <NA> 2112.600098 <NA>

1359 rows × 12 columns

We only add 50 new flags, which is good, but there is some weirdness here…It is ocassionally grabbing every other timestep…

[96]:
pd.set_option('display.max_rows', 20)

strt, end = pd.to_datetime('6/13/22 2000'), pd.to_datetime('6/13/22 2100')

dups = pd.DataFrame({'qaDuplicates':qc.qa_events['duplicate'],'after1':after_repeat,
                     'after2':after_repeat2, 'before1':before_repeat,
                     'before2':before_repeat2, 'combined':duplicates,} )

dups[strt:end]
[96]:
qaDuplicates after1 after2 before1 before2 combined
Date
2022-06-13 20:00:00 False False False False False False
2022-06-13 20:05:00 False False False False False False
2022-06-13 20:10:00 False False False False False False
2022-06-13 20:15:00 False False False False False False
2022-06-13 20:20:00 False False False False False False
2022-06-13 20:25:00 False False False False False False
2022-06-13 20:30:00 False False False True True False
2022-06-13 20:35:00 True False False True True True
2022-06-13 20:40:00 True True False True False False
2022-06-13 20:45:00 True True True False False True
2022-06-13 20:50:00 False True True False False False
2022-06-13 20:55:00 False False False True False False
2022-06-13 21:00:00 True False False False False False

Ok, so basically, if the rule is it has to be preceded or followed by 2 conesqutive True’s, that is the case for the first and the last, but not the middle when there is only three. Let’s change some & cases to | cases and see if that get’s resolved.

[167]:
pd.set_option('display.max_rows', 20)

strt, end = pd.to_datetime('6/13/22 2000'), pd.to_datetime('6/13/22 2100')

after_repeat = qc.qa_events.duplicate.shift(1)
after_repeat2 = after_repeat | qc.qa_events.duplicate.shift(2)
before_repeat = qc.qa_events.duplicate.shift(-1)
before_repeat2 = before_repeat | qc.qa_events.duplicate.shift(-2)

duplicates = qc.qa_events.duplicate & (after_repeat2 | before_repeat2 )


dups = pd.DataFrame({'qaDuplicates':qc.qa_events['duplicate'],'after1':after_repeat,
                     'after2':after_repeat2, 'before1':before_repeat,
                     'before2':before_repeat2, 'combined':duplicates,} )

dups[strt:end]
[167]:
qaDuplicates after1 after2 before1 before2 combined
Date
2022-06-13 20:00:00 False False False False False False
2022-06-13 20:05:00 False False False False False False
2022-06-13 20:10:00 False False False False False False
2022-06-13 20:15:00 False False False False False False
2022-06-13 20:20:00 False False False False False False
2022-06-13 20:25:00 False False False False True False
2022-06-13 20:30:00 False False False True True False
2022-06-13 20:35:00 True False False True True True
2022-06-13 20:40:00 True True True True True True
2022-06-13 20:45:00 True True True False False True
2022-06-13 20:50:00 False True True False True False
2022-06-13 20:55:00 False False True True True False
2022-06-13 21:00:00 True False False False False False

That’s working better.

[169]:
pd.set_option('display.min_rows', 15)

qc.df_orig[duplicates]
[169]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-01 08:15:00 245.100006 <NA> 0.1 <NA> 5.8 <NA> 265.5 <NA> 0.0 <NA> 6.1 <NA>
2017-10-01 08:20:00 245.199997 <NA> 0.1 <NA> 5.9 <NA> 265.799988 <NA> 0.3 <NA> 6.4 <NA>
2017-10-13 03:05:00 324.600006 T 0.1 <NA> 85.300003 <NA> 343.700012 <NA> 0.0 <NA> 84.300003 <NA>
2017-10-13 03:10:00 324.700012 T 0.1 <NA> 85.400002 <NA> 343.700012 <NA> 0.0 <NA> 84.300003 <NA>
2017-10-13 03:15:00 324.799988 T 0.1 <NA> 85.5 <NA> 343.799988 <NA> 0.1 <NA> 84.400002 <NA>
2017-10-13 06:45:00 328.700012 T 0.1 <NA> 89.400002 <NA> 347.200012 <NA> 0.3 <NA> 87.800003 <NA>
2017-10-13 06:50:00 328.799988 T 0.1 <NA> 89.5 <NA> 347.200012 <NA> 0.0 <NA> 87.800003 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ...
2022-06-12 20:10:00 250.899994 <NA> 0.1 <NA> 2165.090088 <NA> 270.700012 <NA> 0.0 <NA> 2101.399902 <NA>
2022-06-12 20:15:00 251.0 <NA> 0.1 <NA> 2165.189941 <NA> 270.299988 <NA> 0.0 <NA> 2101.399902 <NA>
2022-06-13 20:35:00 263.200012 <NA> 0.1 <NA> 2177.389893 <NA> 281.600006 <NA> 0.0 <NA> 2112.5 <NA>
2022-06-13 20:40:00 263.299988 <NA> 0.1 <NA> 2177.48999 <NA> 281.899994 <NA> 0.1 <NA> 2112.600098 <NA>
2022-06-13 20:45:00 263.399994 <NA> 0.1 <NA> 2177.590088 <NA> 281.5 <NA> 0.0 <NA> 2112.600098 <NA>
2022-06-14 09:05:00 266.399994 <NA> 0.1 <NA> 2180.590088 <NA> 283.799988 <NA> 0.0 <NA> 2114.899902 <NA>
2022-06-14 09:10:00 266.5 <NA> 0.1 <NA> 2180.689941 <NA> 283.899994 <NA> 0.0 <NA> 2114.899902 <NA>

1807 rows × 12 columns

Changing the condition to or let’s in sequences of just 2 in a row. That definitely isn’t what we set out to look for here. And some of them are just small little blips. Not something we’ve decided to filter, and nothing that is completely unreasonable at the measurement precision being used.

Going back to the original strategy using & but make it 3 before or 3 after fixes this case. Of course, it will continue to have the problem seen above: if we require it to be preceded or followed by 3, then a string of 4 will only flag the first and last.

And if it is changed to be proceeded & followed the beginning and end are always trimmed off. Plus the added problem of, in the case looked at above on 6/13/22, only selecting the middle timestep and discarding all the others.

Time for a new approach.


duplcates.diff() as DataFrame (xor)

When preforming difference between a row and it’s preceeding row, in Pandas and NumPy boolean values become ^ which is a logical XOR. That means that only one value can be True. E.g.

  • True ^ True = False

  • False ^ False = False

  • True ^ False = True

So that would mean that anytime the value is False, it hasn’t changed. That might work well.

Let’s try some things. For starters, take a look at our duplicate filter falgs:

[193]:
pd.set_option('display.max_rows', 25)
strt, end = pd.to_datetime('4/6/19 0845'), pd.to_datetime('4/6/19 1030')

qc.qa_events.duplicate[strt:end]
[193]:
Date
2019-04-06 08:45:00    False
2019-04-06 08:50:00    False
2019-04-06 08:55:00    False
2019-04-06 09:00:00    False
2019-04-06 09:05:00    False
2019-04-06 09:10:00     True
2019-04-06 09:15:00     True
2019-04-06 09:20:00    False
2019-04-06 09:25:00    False
2019-04-06 09:30:00    False
2019-04-06 09:35:00    False
2019-04-06 09:40:00    False
2019-04-06 09:45:00    False
2019-04-06 09:50:00    False
2019-04-06 09:55:00    False
2019-04-06 10:00:00     True
2019-04-06 10:05:00     True
2019-04-06 10:10:00     True
2019-04-06 10:15:00     True
2019-04-06 10:20:00     True
2019-04-06 10:25:00     True
2019-04-06 10:30:00     True
Name: duplicate, dtype: bool[pyarrow]
[194]:
qc.qa_events.duplicate[strt:end].diff()
[194]:
Date
2019-04-06 08:45:00     <NA>
2019-04-06 08:50:00    False
2019-04-06 08:55:00    False
2019-04-06 09:00:00    False
2019-04-06 09:05:00    False
2019-04-06 09:10:00     True
2019-04-06 09:15:00    False
2019-04-06 09:20:00     True
2019-04-06 09:25:00    False
2019-04-06 09:30:00    False
2019-04-06 09:35:00    False
2019-04-06 09:40:00    False
2019-04-06 09:45:00    False
2019-04-06 09:50:00    False
2019-04-06 09:55:00    False
2019-04-06 10:00:00     True
2019-04-06 10:05:00    False
2019-04-06 10:10:00    False
2019-04-06 10:15:00    False
2019-04-06 10:20:00    False
2019-04-06 10:25:00    False
2019-04-06 10:30:00    False
Name: duplicate, dtype: bool[pyarrow]

So a False first, second, and third differential will mean I’m requiring 4 in a row as a minimum…? For example, 0910 and 0915 will both become True. And then I can just use a ~ to transform everything to True based instead of True duplicate and False consecutive. But this is looking at following values only, so I’ll need to also look at preceeding values.

[209]:
next_the_same = qc.qa_events.duplicate[strt:end].diff()
two_the_same = next_the_same.diff()
three_the_same = two_the_same.diff()
three_the_same.diff()
[209]:
Date
2019-04-06 08:45:00     <NA>
2019-04-06 08:50:00     <NA>
2019-04-06 08:55:00     <NA>
2019-04-06 09:00:00     <NA>
2019-04-06 09:05:00    False
2019-04-06 09:10:00     True
2019-04-06 09:15:00     True
2019-04-06 09:20:00    False
2019-04-06 09:25:00    False
2019-04-06 09:30:00     True
2019-04-06 09:35:00     True
2019-04-06 09:40:00    False
2019-04-06 09:45:00    False
2019-04-06 09:50:00    False
2019-04-06 09:55:00    False
2019-04-06 10:00:00     True
2019-04-06 10:05:00     True
2019-04-06 10:10:00     True
2019-04-06 10:15:00     True
2019-04-06 10:20:00    False
2019-04-06 10:25:00    False
2019-04-06 10:30:00    False
Name: duplicate, dtype: bool[pyarrow]
[199]:
qc.qa_events.duplicate[strt:end].diff(periods=-1)
[199]:
Date
2019-04-06 08:45:00    False
2019-04-06 08:50:00    False
2019-04-06 08:55:00    False
2019-04-06 09:00:00    False
2019-04-06 09:05:00     True
2019-04-06 09:10:00    False
2019-04-06 09:15:00     True
2019-04-06 09:20:00    False
2019-04-06 09:25:00    False
2019-04-06 09:30:00    False
2019-04-06 09:35:00    False
2019-04-06 09:40:00    False
2019-04-06 09:45:00    False
2019-04-06 09:50:00    False
2019-04-06 09:55:00     True
2019-04-06 10:00:00    False
2019-04-06 10:05:00    False
2019-04-06 10:10:00    False
2019-04-06 10:15:00    False
2019-04-06 10:20:00    False
2019-04-06 10:25:00    False
2019-04-06 10:30:00     <NA>
Name: duplicate, dtype: bool[pyarrow]
[215]:
req_consecutive = 3
cnt = 0
next_the_same = last_the_same = qc.qa_events.duplicate[strt:end]
while cnt < req_consecutive:
    next_the_same = next_the_same.diff()
    last_the_same = last_the_same.diff(periods=-1)
    cnt += 1

print(~next_the_same)
print(~last_the_same)
~next_the_same|~last_the_same
Date
2019-04-06 08:45:00     <NA>
2019-04-06 08:50:00     <NA>
2019-04-06 08:55:00     <NA>
2019-04-06 09:00:00     True
2019-04-06 09:05:00     True
2019-04-06 09:10:00    False
2019-04-06 09:15:00     True
2019-04-06 09:20:00     True
2019-04-06 09:25:00     True
2019-04-06 09:30:00    False
2019-04-06 09:35:00     True
2019-04-06 09:40:00     True
2019-04-06 09:45:00     True
2019-04-06 09:50:00     True
2019-04-06 09:55:00     True
2019-04-06 10:00:00    False
2019-04-06 10:05:00     True
2019-04-06 10:10:00    False
2019-04-06 10:15:00     True
2019-04-06 10:20:00     True
2019-04-06 10:25:00     True
2019-04-06 10:30:00     True
Name: duplicate, dtype: bool[pyarrow]
Date
2019-04-06 08:45:00     True
2019-04-06 08:50:00     True
2019-04-06 08:55:00    False
2019-04-06 09:00:00     True
2019-04-06 09:05:00     True
2019-04-06 09:10:00     True
2019-04-06 09:15:00    False
2019-04-06 09:20:00     True
2019-04-06 09:25:00     True
2019-04-06 09:30:00     True
2019-04-06 09:35:00     True
2019-04-06 09:40:00     True
2019-04-06 09:45:00    False
2019-04-06 09:50:00     True
2019-04-06 09:55:00    False
2019-04-06 10:00:00     True
2019-04-06 10:05:00     True
2019-04-06 10:10:00     True
2019-04-06 10:15:00     True
2019-04-06 10:20:00     <NA>
2019-04-06 10:25:00     <NA>
2019-04-06 10:30:00     <NA>
Name: duplicate, dtype: bool[pyarrow]
[215]:
Date
2019-04-06 08:45:00    True
2019-04-06 08:50:00    True
2019-04-06 08:55:00    <NA>
2019-04-06 09:00:00    True
2019-04-06 09:05:00    True
2019-04-06 09:10:00    True
2019-04-06 09:15:00    True
2019-04-06 09:20:00    True
2019-04-06 09:25:00    True
2019-04-06 09:30:00    True
2019-04-06 09:35:00    True
2019-04-06 09:40:00    True
2019-04-06 09:45:00    True
2019-04-06 09:50:00    True
2019-04-06 09:55:00    True
2019-04-06 10:00:00    True
2019-04-06 10:05:00    True
2019-04-06 10:10:00    True
2019-04-06 10:15:00    True
2019-04-06 10:20:00    True
2019-04-06 10:25:00    True
2019-04-06 10:30:00    True
Name: duplicate, dtype: bool[pyarrow]

That works with even numbers of differentials, but not with odd numbers. It fails on our little blip at 0910 and 0915. Plus, XOR is fairly confusing. Even if I’m the one to revisit this, I’ll have trouble keeping it straight, so clarity and future maintenance could easily be a problematic.


duplicates.diff() as int (1/-1)

If the booleans are converted to int first then True = 1 and False = 0.

True - True = 0
False - False = 0
False - True = -1
True - False = 1

So, let’s look at how the differential’s progress in both directions.

[228]:
duplicates = qc.qa_events.duplicate[strt:end].astype(int)

dups = pd.DataFrame({'qaDuplicates':qc.qa_events.loc[strt:end, 'duplicate'],'duplicates_AsInt':duplicates,
                     'diff_1':duplicates.diff(), 'diff_2':duplicates.diff().diff(),
                     'diff_3':duplicates.diff().diff().diff(),
                     'diff_4':duplicates.diff().diff().diff().diff(),
                     '-diff_1':duplicates.diff(periods=-1), '-diff_2':duplicates.diff(periods=-1).diff(periods=-1),
                     '-diff_3':duplicates.diff(periods=-1).diff(periods=-1).diff(periods=-1),
                     '-diff_4':duplicates.diff(periods=-1).diff(periods=-1).diff(periods=-1).diff(periods=-1),} )
dups
[228]:
qaDuplicates duplicates_AsInt diff_1 diff_2 diff_3 diff_4 -diff_1 -diff_2 -diff_3 -diff_4
Date
2019-04-06 08:45:00 False 0 NaN NaN NaN NaN 0.0 0.0 0.0 0.0
2019-04-06 08:50:00 False 0 0.0 NaN NaN NaN 0.0 0.0 0.0 1.0
2019-04-06 08:55:00 False 0 0.0 0.0 NaN NaN 0.0 0.0 -1.0 -3.0
2019-04-06 09:00:00 False 0 0.0 0.0 0.0 NaN 0.0 1.0 2.0 2.0
2019-04-06 09:05:00 False 0 0.0 0.0 0.0 0.0 -1.0 -1.0 0.0 2.0
2019-04-06 09:10:00 True 1 1.0 1.0 1.0 1.0 0.0 -1.0 -2.0 -3.0
2019-04-06 09:15:00 True 1 0.0 -1.0 -2.0 -3.0 1.0 1.0 1.0 1.0
2019-04-06 09:20:00 False 0 -1.0 -1.0 0.0 2.0 0.0 0.0 0.0 0.0
2019-04-06 09:25:00 False 0 0.0 1.0 2.0 2.0 0.0 0.0 0.0 0.0
2019-04-06 09:30:00 False 0 0.0 0.0 -1.0 -3.0 0.0 0.0 0.0 0.0
2019-04-06 09:35:00 False 0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
2019-04-06 09:40:00 False 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
2019-04-06 09:45:00 False 0 0.0 0.0 0.0 0.0 0.0 0.0 -1.0 -3.0
2019-04-06 09:50:00 False 0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 3.0
2019-04-06 09:55:00 False 0 0.0 0.0 0.0 0.0 -1.0 -1.0 -1.0 -1.0
2019-04-06 10:00:00 True 1 1.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0
2019-04-06 10:05:00 True 1 0.0 -1.0 -2.0 -3.0 0.0 0.0 0.0 0.0
2019-04-06 10:10:00 True 1 0.0 0.0 1.0 3.0 0.0 0.0 0.0 0.0
2019-04-06 10:15:00 True 1 0.0 0.0 0.0 -1.0 0.0 0.0 0.0 NaN
2019-04-06 10:20:00 True 1 0.0 0.0 0.0 0.0 0.0 0.0 NaN NaN
2019-04-06 10:25:00 True 1 0.0 0.0 0.0 0.0 0.0 NaN NaN NaN
2019-04-06 10:30:00 True 1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN

So, if the criteria is the Ndifferential == 0 with a logical OR between preceeding and following, then this is looking pretty good. Let’s grab one of our examples from above to make sure it filters more than just two points.

[379]:
strt, end = pd.to_datetime('6/13/22 2000'), pd.to_datetime('6/13/22 2125')

duplicates = qc.qa_events.duplicate[strt:end].astype(int)

dups = pd.DataFrame({'qaDuplicates':qc.qa_events.loc[strt:end, 'duplicate'],'duplicates_AsInt':duplicates,
                     'diff_1':duplicates.diff(), 'diff_2':duplicates.diff().diff(),
                     'diff_3':duplicates.diff().diff().diff(),
                     'diff_4':duplicates.diff().diff().diff().diff(),
                     '-diff_1':duplicates.diff(periods=-1), '-diff_2':duplicates.diff(periods=-1).diff(periods=-1),
                     '-diff_3':duplicates.diff(periods=-1).diff(periods=-1).diff(periods=-1),
                     '-diff_4':duplicates.diff(periods=-1).diff(periods=-1).diff(periods=-1).diff(periods=-1),} )
dups
[379]:
qaDuplicates duplicates_AsInt diff_1 diff_2 diff_3 diff_4 -diff_1 -diff_2 -diff_3 -diff_4
Date
2022-06-13 20:00:00 False 0 NaN NaN NaN NaN 0.0 0.0 0.0 0.0
2022-06-13 20:05:00 False 0 0.0 NaN NaN NaN 0.0 0.0 0.0 0.0
2022-06-13 20:10:00 False 0 0.0 0.0 NaN NaN 0.0 0.0 0.0 0.0
2022-06-13 20:15:00 False 0 0.0 0.0 0.0 NaN 0.0 0.0 0.0 1.0
2022-06-13 20:20:00 False 0 0.0 0.0 0.0 0.0 0.0 0.0 -1.0 -3.0
2022-06-13 20:25:00 False 0 0.0 0.0 0.0 0.0 0.0 1.0 2.0 3.0
2022-06-13 20:30:00 False 0 0.0 0.0 0.0 0.0 -1.0 -1.0 -1.0 -2.0
2022-06-13 20:35:00 True 1 1.0 1.0 1.0 1.0 0.0 0.0 1.0 3.0
2022-06-13 20:40:00 True 1 0.0 -1.0 -2.0 -3.0 0.0 -1.0 -2.0 -2.0
2022-06-13 20:45:00 True 1 0.0 0.0 1.0 3.0 1.0 1.0 0.0 -3.0
2022-06-13 20:50:00 False 0 -1.0 -1.0 -1.0 -2.0 0.0 1.0 3.0 6.0
2022-06-13 20:55:00 False 0 0.0 1.0 2.0 3.0 -1.0 -2.0 -3.0 -4.0
2022-06-13 21:00:00 True 1 1.0 1.0 0.0 -2.0 1.0 1.0 1.0 1.0
2022-06-13 21:05:00 False 0 -1.0 -2.0 -3.0 -3.0 0.0 0.0 0.0 0.0
2022-06-13 21:10:00 False 0 0.0 1.0 3.0 6.0 0.0 0.0 0.0 NaN
2022-06-13 21:15:00 False 0 0.0 0.0 -1.0 -4.0 0.0 0.0 NaN NaN
2022-06-13 21:20:00 False 0 0.0 0.0 0.0 1.0 0.0 NaN NaN NaN
2022-06-13 21:25:00 False 0 0.0 0.0 0.0 0.0 NaN NaN NaN NaN

I think that’s working as intended, but has the same problem as the methods that used .shift().

  • first differential would allow the 3 in a row.

  • second differential would allow the first and last value, but not the middle.

  • third differential blocks all. Weird exception here because there is a random detached True value 3 time steps away.


Answer from web

I found an answer on stack overflow. I found that it does seem to work, but not with pyarrow booleans. I change the type, then we’re all good. I did hope to avoid any sort of complicated running averages, but this seems vairly fast.

I walk through the method below.

[380]:
test =  qc.qa_events.duplicate[strt:end].astype('boolean')


pd.DataFrame(data={'Duplicates':test, 'Duplicates.cumsum()': test.cumsum(),
                  'Duplicates.diff()': test.diff(), 'duplicates.diff().cumsum()':test.diff().cumsum()})
[380]:
Duplicates Duplicates.cumsum() Duplicates.diff() duplicates.diff().cumsum()
Date
2022-06-13 20:00:00 False 0 <NA> <NA>
2022-06-13 20:05:00 False 0 False 0
2022-06-13 20:10:00 False 0 False 0
2022-06-13 20:15:00 False 0 False 0
2022-06-13 20:20:00 False 0 False 0
2022-06-13 20:25:00 False 0 False 0
2022-06-13 20:30:00 False 0 False 0
2022-06-13 20:35:00 True 1 True 1
2022-06-13 20:40:00 True 2 False 1
2022-06-13 20:45:00 True 3 False 1
2022-06-13 20:50:00 False 3 True 2
2022-06-13 20:55:00 False 3 False 2
2022-06-13 21:00:00 True 4 True 3
2022-06-13 21:05:00 False 4 True 4
2022-06-13 21:10:00 False 4 False 4
2022-06-13 21:15:00 False 4 False 4
2022-06-13 21:20:00 False 4 False 4
2022-06-13 21:25:00 False 4 False 4

In the pandas/Numpy boolean type, cumsum can operate on the column, but increases whenever there is a True value and is constant when the value is False. By taking the diff, which is True whenver consecutive rows are different, cumsum simply numbers any group of values, incrementing to the next group number when the bolean values changed.

[381]:
grp = test.diff().cumsum()
test.groupby(grp).transform('size')
[381]:
Date
2022-06-13 20:00:00    <NA>
2022-06-13 20:05:00       6
2022-06-13 20:10:00       6
2022-06-13 20:15:00       6
2022-06-13 20:20:00       6
2022-06-13 20:25:00       6
2022-06-13 20:30:00       6
2022-06-13 20:35:00       3
2022-06-13 20:40:00       3
2022-06-13 20:45:00       3
2022-06-13 20:50:00       2
2022-06-13 20:55:00       2
2022-06-13 21:00:00       1
2022-06-13 21:05:00       5
2022-06-13 21:10:00       5
2022-06-13 21:15:00       5
2022-06-13 21:20:00       5
2022-06-13 21:25:00       5
Name: duplicate, dtype: Int64

When grouped by the cumsum values, the size of each group can then be measured and returned. This way we now have a length for each group. The example then goes further to make all series of False values negative and True values positive. But in tests, it is faster to just compare to our duplicate column using and &.

A quick test with a different test case.

[399]:
strt, end = pd.to_datetime('1/20/18 0415'), pd.to_datetime('1/20/18 0555')

test =  qc.qa_events.duplicate[strt:end].astype('boolean')
grp = test.diff().cumsum()
pd.DataFrame(data={'grp_count':test.groupby(grp).transform('size'),
                   'duplicates':test,
                   'INST':qc.df_orig.loc[strt:end,'INST_SA'],
                   'TOT':qc.df_orig.loc[strt:end,'TOT_SA']})
[399]:
grp_count duplicates INST TOT
Date
2018-01-20 04:15:00 <NA> False 209.600006 0.1
2018-01-20 04:20:00 8 False 209.800003 0.2
2018-01-20 04:25:00 8 False 209.800003 0.0
2018-01-20 04:30:00 8 False 209.800003 0.0
2018-01-20 04:35:00 8 False 209.800003 0.0
2018-01-20 04:40:00 8 False 209.800003 0.0
2018-01-20 04:45:00 8 False 209.800003 0.0
2018-01-20 04:50:00 8 False 209.800003 0.0
2018-01-20 04:55:00 8 False 210.0 0.2
2018-01-20 05:00:00 5 True 209.800003 0.2
2018-01-20 05:05:00 5 True 209.800003 0.2
2018-01-20 05:10:00 5 True 209.800003 0.2
2018-01-20 05:15:00 5 True 209.800003 0.2
2018-01-20 05:20:00 5 True 209.800003 0.2
2018-01-20 05:25:00 7 False 210.100006 0.0
2018-01-20 05:30:00 7 False 210.100006 0.0
2018-01-20 05:35:00 7 False 210.300003 0.0
2018-01-20 05:40:00 7 False 210.300003 0.0
2018-01-20 05:45:00 7 False 210.300003 0.0
2018-01-20 05:50:00 7 False 210.300003 0.0
2018-01-20 05:55:00 7 False 210.300003 0.0

Weird example, but hopefully we can catch that random 210 tanke value by broadening the flat tank precision.

[285]:
test = qc.qa_events.duplicate
[287]:
%%timeit
g = test.astype('boolean').diff().cumsum()

q = test.groupby(g).transform('size') * np.where(test, 1, -1)
q >= 3
77.6 ms ± 3.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[289]:
%%timeit
g = test.astype('boolean').diff().cumsum()

q = test.groupby(g).transform('size')
test & (q >= 3)
64.8 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

TOTALLY FLAT TANK (NO BOUNCE EVER!)

[293]:
pd.set_option('display.min_rows', 4)
# reset
qc.qa_events.duplicate = False
# reflag

qc.flag_repeating_val_precip(ppt_col='TOT_SA', tank_col='INST_SA', flat_ppt_nprecision=0, flat_tank_nprecision=0,
                            min_ppt_nprecision=0)

qc.df_orig[duplicates]
[293]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-19 14:45:00 346.5 MM 0.3 M 107.5 M 352.299988 M 0.2 M 103.699997 M
2017-10-19 14:50:00 346.5 MM 0.3 M 107.800003 M 352.299988 M 0.2 M 103.900002 M
... ... ... ... ... ... ... ... ... ... ... ... ...
2022-03-13 02:50:00 39.169998 MM 0.26 M 1271.699951 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:55:00 39.169998 MM 0.26 M 1271.959961 M 69.540001 M 0.0 M 1182.130005 M

1321 rows × 12 columns

Allow Normal Oscillation

[402]:
pd.set_option('display.max_rows', 40)
# reset
qc.qa_events.duplicate = False
qc.qa_flags.Set0 = False
# reflag

qc.flag_repeating_val_precip(ppt_col='TOT_SA', tank_col='INST_SA', flat_ppt_nprecision=0, flat_tank_nprecision=1,
                            min_ppt_nprecision=0)

duplicate_val = qc.qa_events['duplicate'].astype('boolean')
number_continuous_grps = duplicate_val.diff().cumsum()
how_many_in_group = duplicate_val.groupby(number_continuous_grps).transform('size')


duplicates = duplicate_val & (how_many_in_group >= 4)

qc.df_orig[duplicates]
[402]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-19 14:45:00 346.5 MM 0.3 M 107.5 M 352.299988 M 0.2 M 103.699997 M
2017-10-19 14:50:00 346.5 MM 0.3 M 107.800003 M 352.299988 M 0.2 M 103.900002 M
2017-10-19 14:55:00 346.5 MM 0.3 M 108.099998 M 352.299988 M 0.2 M 104.099998 M
2017-10-19 15:00:00 346.5 MM 0.3 M 108.400002 M 352.299988 M 0.2 M 104.300003 M
2017-10-19 15:05:00 346.5 MM 0.3 M 108.699997 M 352.299988 M 0.2 M 104.5 M
2017-10-19 15:10:00 346.5 MM 0.3 M 109.0 M 352.299988 M 0.2 M 104.699997 M
2017-10-19 15:15:00 346.5 MM 0.3 M 109.300003 M 352.299988 M 0.2 M 104.900002 M
2017-10-19 15:20:00 346.5 MM 0.3 M 109.599998 M 352.299988 M 0.2 M 105.099998 M
2017-10-19 15:25:00 346.5 MM 0.3 M 109.900002 M 352.299988 M 0.2 M 105.300003 M
2017-10-19 15:30:00 346.5 MM 0.3 M 110.199997 M 352.299988 M 0.2 M 105.5 M
2018-01-20 05:00:00 209.800003 <NA> 0.2 <NA> 969.440002 <NA> 210.199997 <NA> 0.0 <NA> 905.469971 <NA>
2018-01-20 05:05:00 209.800003 <NA> 0.2 <NA> 969.640015 <NA> 210.399994 <NA> 0.0 <NA> 905.469971 <NA>
2018-01-20 05:10:00 209.800003 <NA> 0.2 <NA> 969.840027 <NA> 210.199997 <NA> 0.0 <NA> 905.469971 <NA>
2018-01-20 05:15:00 209.800003 <NA> 0.2 <NA> 970.039978 <NA> 210.399994 <NA> 0.0 <NA> 905.469971 <NA>
2018-01-20 05:20:00 209.800003 <NA> 0.2 <NA> 970.23999 <NA> 210.600006 <NA> 0.2 <NA> 905.669983 <NA>
2018-12-17 20:15:00 112.699997 <NA> 0.2 <NA> 458.609985 <NA> 122.400002 <NA> 0.0 <NA> 439.320007 <NA>
2018-12-17 20:20:00 112.900002 <NA> 0.2 <NA> 458.809998 <NA> 122.699997 <NA> 0.0 <NA> 439.320007 <NA>
2018-12-17 20:25:00 113.099998 <NA> 0.2 <NA> 459.01001 <NA> 123.0 <NA> 0.0 <NA> 439.320007 <NA>
2018-12-17 20:30:00 113.300003 <NA> 0.2 <NA> 459.209991 <NA> 123.5 <NA> 0.0 <NA> 439.320007 <NA>
2019-04-06 10:00:00 128.899994 M 0.2 M 1598.5 M 174.300003 <NA> 0.0 <NA> 1369.140015 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ...
2020-06-04 17:10:00 19.610001 <NA> 0.05 <NA> 1491.689941 <NA> 38.939999 <NA> 0.0 <NA> 1380.25 <NA>
2020-06-04 17:15:00 19.610001 <NA> 0.05 <NA> 1491.73999 <NA> 38.779999 <NA> 0.0 <NA> 1380.25 <NA>
2020-06-04 17:20:00 19.74 <NA> 0.05 <NA> 1491.790039 <NA> 38.779999 <NA> 0.0 <NA> 1380.25 <NA>
2020-06-04 17:25:00 19.74 <NA> 0.05 <NA> 1491.839966 <NA> 38.779999 <NA> 0.0 <NA> 1380.25 <NA>
2021-10-26 22:55:00 126.0 <NA> 0.1 <NA> 120.5 <NA> 165.199997 <NA> 0.0 <NA> 129.830002 <NA>
2021-10-26 23:00:00 126.099998 <NA> 0.1 <NA> 120.599998 <NA> 165.399994 <NA> 0.2 <NA> 130.029999 <NA>
2021-10-26 23:05:00 126.199997 <NA> 0.1 <NA> 120.699997 <NA> 165.600006 <NA> 0.2 <NA> 130.229996 <NA>
2021-10-26 23:10:00 126.300003 <NA> 0.1 <NA> 120.800003 <NA> 165.5 <NA> 0.0 <NA> 130.229996 <NA>
2022-03-13 02:00:00 39.169998 MM 0.26 M 1269.099976 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:05:00 39.169998 MM 0.26 M 1269.359985 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:10:00 39.169998 MM 0.26 M 1269.619995 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:15:00 39.169998 MM 0.26 M 1269.880005 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:20:00 39.169998 MM 0.26 M 1270.140015 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:25:00 39.169998 MM 0.26 M 1270.400024 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:30:00 39.169998 MM 0.26 M 1270.660034 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:35:00 39.169998 MM 0.26 M 1270.920044 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:40:00 39.169998 MM 0.26 M 1271.180054 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:45:00 39.169998 MM 0.26 M 1271.439941 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:50:00 39.169998 MM 0.26 M 1271.699951 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:55:00 39.169998 MM 0.26 M 1271.959961 M 69.540001 M 0.0 M 1182.130005 M

1370 rows × 12 columns

[363]:
times = qc.df_orig[duplicates].index.astype('datetime64[ns]')
np.unique(times.date)
[363]:
array([datetime.date(2017, 10, 19), datetime.date(2018, 1, 20),
       datetime.date(2018, 12, 17), datetime.date(2019, 4, 6),
       datetime.date(2019, 4, 7), datetime.date(2019, 4, 8),
       datetime.date(2019, 4, 9), datetime.date(2019, 11, 15),
       datetime.date(2020, 2, 5), datetime.date(2020, 6, 3),
       datetime.date(2020, 6, 4), datetime.date(2021, 10, 26),
       datetime.date(2022, 3, 13)], dtype=object)

Now, let’s look at the problem cases

[309]:
plt.rcParams['figure.max_open_warning']=100
[329]:
duplicates.name = 'duplicate'
duplicates = pd.DataFrame(duplicates)

qa_events = qc.qa_events
qa_events['duplicates'] = duplicates

flag.apply_FSDB_flags(qa_events, qc.qa_flags)


ax = flag.plot_flagged_day(pd.to_datetime('6/3/20 1200'), 'CEN_01', tdelta='2D', auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)
[330]:
ax = flag.plot_flagged_day(pd.to_datetime('6/4/20 0800'), 'CEN_01', tdelta='4h', auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)
[331]:
ax = flag.plot_flagged_day(pd.to_datetime('12/17/18 1900'), 'CEN_01', tdelta='3h', auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)

The red Set0 flags are set by the QaRules.flag_repeating_val_precip without adding a requirement for consecutive values. The green X’s are set after adding the consecutive value requirement. We can see a lot of improvement as a result. But there are still a couple values flagged here that probably shouldn’t be. Since this is a small case, it might be a good candidtate for a manual unflag. Or a good reason to adjust the threshold to 6 in a row.

[332]:
ax = flag.plot_flagged_day(pd.to_datetime('10/26/21 2130'), 'CEN_01', tdelta='3h', auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)

This situation seems similar to 12/17/18. A lot of improvement. May be worth using 6 consecutive values, a full 30 min, as the requirement.

[335]:
ax = flag.plot_flagged_day(pd.to_datetime('1/20/18 0400'), 'CEN_01', tdelta='3h', auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)
[341]:
strt = pd.to_datetime('1/20/18 0400')
end =  pd.to_datetime('1/20/18 0400') + pd.to_timedelta('3h')

print(f"total provisional precip 1/20/18 0400- 0700: {qc.df_orig.loc[strt:end, 'TOT_SA'].sum()}")
print(f"total QC'd precip 1/20/18 0400- 0700:        {qc.df_orig.loc[strt:end, 'TOT_SA'][duplicates['duplicate']].sum()}")
print(f"tank change over period:                     {qc.df_orig.loc[end,'INST_SA']-qc.df_orig.loc[strt, 'INST_SA']}")
total provisional precip 1/20/18 0400- 0700: 2.2000000402331352
total QC'd precip 1/20/18 0400- 0700:        1.0000000149011612
tank change over period:                     1.399993896484375

I’m not sure why that is coming in a point late, but the flag does seem justified here. There is 1.4 mm of tank change over 3 hours, but we see 2.2 mm of precip accumulating. This doesn’t grab all of that, but it does reduce the total for the period to a more reasonable number.

Based on these cases, 6 consecutive values will be used.

[369]:
ax = flag.plot_flagged_day(pd.to_datetime('11/15/19 0900'), 'CEN_01', tdelta='9h',
                           auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)
[373]:
ax = flag.plot_flagged_day(pd.to_datetime('2/5/20 1200'), 'CEN_01', tdelta='4h',
                           auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)

11/15/19 looks like the filter is working, but 2/5/20 looks like another reason to switch the consecutive requirement to 6.

[405]:
pd.set_option('display.min_row', 10)
duplicates = duplicate_val & (how_many_in_group >= 5)

times = qc.df_orig[duplicates].index.astype('datetime64[ns]')
np.unique(times.date)
[405]:
array([datetime.date(2017, 10, 19), datetime.date(2018, 1, 20),
       datetime.date(2019, 4, 6), datetime.date(2019, 4, 7),
       datetime.date(2019, 4, 8), datetime.date(2019, 4, 9),
       datetime.date(2019, 11, 15), datetime.date(2020, 6, 3),
       datetime.date(2020, 6, 4), datetime.date(2022, 3, 13)],
      dtype=object)
[408]:
qc.df_orig.loc[duplicates, 'TOT_SA'].sum()
[408]:
226.55000304803252

That removed all of the dates I wanted to remove, but kept all of the ones I wanted to keep. And it is a really solid amount of precip. Further, it should continue to work even if GCE starts sending real tank values for the missing periods.

Comparing tank change to precip

This section has some earlier attempts to see if some simple and obvious methods work. All of them were hard to make work right.

However, there are some potential benefits to looking at it this way:

  1. It could keep working even if GCE stops sending large chuncks of missing tank values

  2. There’s the potential to catch places where simplepre.m is preforming poorly

  3. This may be useful in filtering diurnal fluctuations found at other sites (UPLO and VARA)

Rolling sum of tank

1 Day

[49]:
pd.options.display.min_rows = 50

dly_tank_chg = cnsa['INST'].diff().rolling('1D').sum()
dly_precip = cnsa['TOT'].rolling('1D').sum()

not_drain = dly_tank_chg >= 0
raining = cnsa['TOT']>0

mismatch =  (dly_tank_chg < (dly_precip-(3*sa_precision))) & not_drain & raining
cnsa[mismatch]
[49]:
INST INST_Flag TOT TOT_Flag ACC ACC_Flag
Date
2017-10-19 14:55:00 346.5 MM 0.3 M 108.099998 M
2017-10-19 15:00:00 346.5 MM 0.3 M 108.400002 M
2017-10-19 15:05:00 346.5 MM 0.3 M 108.699997 M
2017-10-19 15:10:00 346.5 MM 0.3 M 109.0 M
2017-10-19 15:15:00 346.5 MM 0.3 M 109.300003 M
2017-10-19 15:20:00 346.5 MM 0.3 M 109.599998 M
2017-10-19 15:25:00 346.5 MM 0.3 M 109.900002 M
2017-10-19 15:30:00 346.5 MM 0.3 M 110.199997 M
2017-11-20 20:35:00 243.0 <NA> 0.2 <NA> 565.340027 <NA>
2018-01-20 05:15:00 209.800003 <NA> 0.2 <NA> 970.039978 <NA>
2018-01-20 05:20:00 209.800003 <NA> 0.2 <NA> 970.23999 <NA>
2018-01-20 06:00:00 210.300003 <NA> 0.3 F 970.539978 F
2018-01-20 06:35:00 210.5 <NA> 0.2 <NA> 970.73999 <NA>
2018-01-20 06:45:00 210.699997 <NA> 0.2 <NA> 970.940002 <NA>
2018-01-20 07:10:00 210.800003 <NA> 0.1 <NA> 971.039978 <NA>
2018-01-20 09:55:00 211.0 <NA> 0.2 <NA> 971.23999 <NA>
2018-01-20 10:00:00 211.100006 <NA> 0.1 <NA> 971.340027 <NA>
2018-01-20 12:05:00 211.199997 <NA> 0.1 <NA> 971.440002 <NA>
2018-01-20 12:10:00 211.5 <NA> 0.3 <NA> 971.73999 <NA>
2018-01-20 12:50:00 211.600006 <NA> 0.1 <NA> 971.840027 <NA>
2018-01-20 13:25:00 211.699997 <NA> 0.1 <NA> 971.940002 <NA>
2018-01-20 13:40:00 211.899994 <NA> 0.2 <NA> 972.140015 <NA>
2018-01-20 16:35:00 212.100006 <NA> 0.1 <NA> 972.340027 <NA>
2018-01-20 16:40:00 212.199997 <NA> 0.1 <NA> 972.440002 <NA>
2018-01-20 18:15:00 212.399994 <NA> 0.2 <NA> 972.640015 <NA>
... ... ... ... ... ... ...
2022-03-13 09:25:00 51.459999 <NA> 0.13 <NA> 1284.25 <NA>
2022-03-13 09:30:00 51.580002 <NA> 0.12 <NA> 1284.369995 <NA>
2022-03-13 09:45:00 52.09 <NA> 0.51 <NA> 1284.880005 <NA>
2022-03-13 09:50:00 52.599998 <NA> 0.51 <NA> 1285.390015 <NA>
2022-03-13 10:40:00 53.860001 <NA> 1.26 F 1286.650024 F
2022-03-13 11:10:00 53.990002 <NA> 0.13 <NA> 1286.780029 <NA>
2022-03-13 11:15:00 54.110001 <NA> 0.12 <NA> 1286.900024 <NA>
2022-03-13 11:25:00 54.240002 <NA> 0.13 <NA> 1287.030029 <NA>
2022-03-13 11:30:00 54.369999 <NA> 0.13 <NA> 1287.160034 <NA>
2022-03-13 11:35:00 54.490002 <NA> 0.12 <NA> 1287.280029 <NA>
2022-03-13 11:40:00 55.130001 <NA> 0.64 <NA> 1287.920044 <NA>
2022-03-13 12:30:00 55.25 <NA> 0.12 <NA> 1288.040039 <NA>
2022-03-13 12:45:00 55.509998 <NA> 0.26 <NA> 1288.300049 <NA>
2022-03-13 13:00:00 55.630001 <NA> 0.12 <NA> 1288.420044 <NA>
2022-03-13 13:05:00 56.139999 <NA> 0.51 <NA> 1288.930054 <NA>
2022-03-13 14:20:00 56.259998 <NA> 0.12 <NA> 1289.050049 <NA>
2022-03-13 14:30:00 56.389999 <NA> 0.13 <NA> 1289.180054 <NA>
2022-03-13 15:00:00 56.639999 <NA> 0.25 <NA> 1289.430054 <NA>
2022-03-13 15:30:00 56.900002 <NA> 0.26 <NA> 1289.689941 <NA>
2022-03-13 15:45:00 57.400002 <NA> 0.5 <NA> 1290.189941 <NA>
2022-03-13 16:00:00 57.529999 <NA> 0.13 <NA> 1290.319946 <NA>
2022-03-13 17:05:00 57.91 <NA> 0.38 <NA> 1290.699951 <NA>
2022-03-13 18:35:00 58.040001 <NA> 0.13 <NA> 1290.829956 <NA>
2022-03-13 19:55:00 58.16 <NA> 0.12 <NA> 1290.949951 <NA>
2022-03-13 22:50:00 58.290001 <NA> 0.13 <NA> 1291.079956 <NA>

1502 rows × 6 columns

[50]:
plt.figure()
dly_precip.plot(legend=True, grid=True)
dly_tank_chg.plot(legend=True, grid=True)
dly_precip[mismatch].plot(grid=True, linestyle='', marker='o', color='navy', legend=True)

plt.legend(['sum of daily precip', 'sum of daily tank.diff()', 'flag mismatch'])
[50]:
<matplotlib.legend.Legend at 0x2d94904b1c0>
[51]:
strt = pd.to_datetime('1/19/18 1200')
end = strt + pd.to_timedelta('2d')

plt.figure()
dly_precip[strt:end].plot(legend=True, grid=True)
dly_tank_chg[strt:end].plot(legend=True, grid=True)
dly_precip[strt:end][mismatch].plot(grid=True, linestyle='', marker='o', color='navy', legend=True)

ax1 = plt.gca()
cnsa.loc[strt:end, 'TOT'].plot(ax=ax1, grid=True, linestyle=':')
plt.legend(['sum of daily precip', 'sum of daily tank.diff()', 'flag mismatch', 'actual precip',])
ax2=plt.twinx(ax1)
cnsa.loc[strt:end, 'INST'].plot(ax=ax2, grid=True, linestyle=':')
plt.legend(['actual tank value'])


[51]:
<matplotlib.legend.Legend at 0x2d952d84610>
[52]:
strt = pd.to_datetime('3/13/22')
end = strt + pd.to_timedelta('2d')

plt.figure()
dly_precip[strt:end].plot(legend=True, grid=True)
dly_tank_chg[strt:end].plot(legend=True, grid=True)
dly_precip[strt:end][mismatch].plot(grid=True, linestyle='', marker='o', color='navy', legend=True)

ax1 = plt.gca()
cnsa.loc[strt:end, 'TOT'].plot(ax=ax1, grid=True, linestyle=':')
plt.legend(['sum of daily precip', 'sum of daily tank.diff()', 'flag mismatch', 'actual precip',])
ax2=plt.twinx(ax1)
cnsa.loc[strt:end, 'INST'].plot(ax=ax2, grid=True, linestyle=':')
plt.legend(['actual tank value'])


[52]:
<matplotlib.legend.Legend at 0x2d952ef3190>
[53]:
strt = pd.to_datetime('12/25/18')
end = strt + pd.to_timedelta('2D')

plt.figure()
dly_precip[strt:end].plot(legend=True, grid=True)
dly_tank_chg[strt:end].plot(legend=True, grid=True)
dly_precip[strt:end][mismatch].plot(grid=True, linestyle='', marker='o', color='navy', legend=True)

cnsa.loc[strt:end, ['INST', 'TOT']].plot(ax=plt.gca(), grid=True, linestyle=':')

plt.legend(['sum of daily precip', 'sum of daily tank.diff()', 'flag mismatch', 'actual tank value', 'actual precip'])
[53]:
<matplotlib.legend.Legend at 0x2d952e21490>
[54]:
plt.close(20)
[55]:
strt = pd.to_datetime('4/6/19')
end = strt + pd.to_timedelta('4D')

plt.figure()
dly_precip[strt:end].plot(legend=True, grid=True)
dly_tank_chg[strt:end].plot(legend=True, grid=True)
dly_precip[strt:end][mismatch].plot(grid=True, linestyle='', marker='o', color='navy', legend=True)

ax1 = plt.gca()
cnsa.loc[strt:end, 'TOT'].plot(ax=ax1, grid=True, linestyle=':')
plt.legend(['sum of daily precip', 'sum of daily tank.diff()', 'flag mismatch', 'actual precip',])
ax2=plt.twinx(ax1)
cnsa.loc[strt:end, 'INST'].plot(ax=ax2, grid=True, linestyle=':')
plt.legend(['actual tank value'])
[55]:
<matplotlib.legend.Legend at 0x2d9596bd5e0>

This isn’t catching repeat values we want. The sum pegs everything to a drain or an unclog with some huge value for 24 hours. The cumulative catch up in Dec 2019 is a great example. Then we have 2 examples where our QaRules.flag_duplicate_precip catches small repeats, less than an hour and totaling a mm or two. But with this method it doesn’t clog the first 10-30 min but goes on to flag the whole rest of the day. It is workign correctly for 4/9/2019 at least. Maybe a small window would work.

1Hr

[417]:
pd.options.display.min_rows = 50

hrly_tank_chg = cnsa['INST'].diff().rolling('1h').sum()
hrly_precip = cnsa['TOT'].rolling('1h').sum()

not_drain = hrly_tank_chg >= 0
raining = cnsa['TOT']>0

mismatch =  (hrly_tank_chg < (hrly_precip-(2*sa_precision))) & not_drain & raining
cnsa[mismatch]
[417]:
INST INST_Flag TOT TOT_Flag ACC ACC_Flag
Date
2017-10-19 14:50:00 346.5 MM 0.3 M 107.800003 M
2017-10-19 14:55:00 346.5 MM 0.3 M 108.099998 M
2017-10-19 15:00:00 346.5 MM 0.3 M 108.400002 M
2017-10-19 15:05:00 346.5 MM 0.3 M 108.699997 M
2017-10-19 15:10:00 346.5 MM 0.3 M 109.0 M
2017-10-19 15:15:00 346.5 MM 0.3 M 109.300003 M
2017-10-19 15:20:00 346.5 MM 0.3 M 109.599998 M
2017-10-19 15:25:00 346.5 MM 0.3 M 109.900002 M
2017-10-19 15:30:00 346.5 MM 0.3 M 110.199997 M
2017-11-19 21:35:00 182.300003 <NA> 0.4 <NA> 504.640015 <NA>
2017-11-28 13:40:00 355.799988 <NA> 0.3 <NA> 678.140015 <NA>
2017-11-28 13:45:00 356.0 <NA> 0.2 <NA> 678.340027 <NA>
2018-01-09 08:00:00 76.18 <NA> 0.38 <NA> 835.419983 <NA>
2018-01-20 05:10:00 209.800003 <NA> 0.2 <NA> 969.840027 <NA>
2018-01-20 05:15:00 209.800003 <NA> 0.2 <NA> 970.039978 <NA>
2018-01-20 05:20:00 209.800003 <NA> 0.2 <NA> 970.23999 <NA>
2018-01-20 06:00:00 210.300003 <NA> 0.3 F 970.539978 F
2018-01-29 21:20:00 82.099998 <NA> 0.4 <NA> 1120.030029 <NA>
2018-06-09 17:05:00 94.800003 <NA> 0.1 <NA> 1769.01001 <NA>
2018-06-09 17:10:00 94.900002 <NA> 0.1 <NA> 1769.109985 <NA>
... ... ... ... ... ... ...
2021-09-18 03:20:00 199.899994 <NA> 1.1 <NA> 1715.910034 <NA>
2021-10-05 18:05:00 344.0 <NA> 0.1 <NA> 3.1 <NA>
2021-10-05 18:10:00 344.100006 <NA> 0.1 <NA> 3.2 <NA>
2021-11-04 11:35:00 39.279999 <NA> 0.63 <NA> 191.460007 <NA>
2022-01-13 07:30:00 16.299999 <NA> 0.38 <NA> 1032.119995 <NA>
2022-01-30 19:35:00 54.330002 <NA> 0.63 <NA> 1070.150024 <NA>
2022-03-13 02:10:00 39.169998 MM 0.26 M 1269.619995 M
2022-03-13 02:15:00 39.169998 MM 0.26 M 1269.880005 M
2022-03-13 02:20:00 39.169998 MM 0.26 M 1270.140015 M
2022-03-13 02:25:00 39.169998 MM 0.26 M 1270.400024 M
2022-03-13 02:30:00 39.169998 MM 0.26 M 1270.660034 M
2022-03-13 02:35:00 39.169998 MM 0.26 M 1270.920044 M
2022-03-13 02:40:00 39.169998 MM 0.26 M 1271.180054 M
2022-03-13 02:45:00 39.169998 MM 0.26 M 1271.439941 M
2022-03-13 02:50:00 39.169998 MM 0.26 M 1271.699951 M
2022-03-13 02:55:00 39.169998 MM 0.26 M 1271.959961 M
2022-03-13 03:00:00 40.07 <NA> 0.9 <NA> 1272.859985 <NA>
2022-03-13 03:15:00 40.200001 <NA> 0.13 <NA> 1272.98999 <NA>
2022-03-13 11:35:00 54.490002 <NA> 0.12 <NA> 1287.280029 <NA>
2022-05-28 11:50:00 27.74 <NA> 0.26 <NA> 1941.930054 <NA>

1225 rows × 6 columns

[414]:
pd.options.display.min_rows = 20
flag.flags['Set0'] = mismatch
flag.event['duplicate'] = mismatch

flag.get_flagged_days()
[414]:
DatetimeIndex(['2017-10-19', '2017-11-19', '2017-11-28', '2018-01-09',
               '2018-01-20', '2018-01-29', '2018-06-09', '2018-11-04',
               '2018-11-21', '2018-12-16', '2018-12-25', '2019-04-01',
               '2019-04-06', '2019-04-07', '2019-04-08', '2019-04-09',
               '2019-06-07', '2019-06-28', '2019-09-08', '2019-11-15',
               '2019-11-30', '2020-03-13', '2020-06-03', '2020-06-04',
               '2020-10-10', '2020-10-23', '2020-12-06', '2021-01-08',
               '2021-01-12', '2021-02-15', '2021-03-05', '2021-03-09',
               '2021-03-28', '2021-04-30', '2021-09-10', '2021-09-18',
               '2021-10-05', '2021-11-04', '2022-01-13', '2022-01-30',
               '2022-03-13', '2022-05-28'],
              dtype='datetime64[ns]', freq=None)

This is very sensitive to the precision multiplier. If hrly_tank_chg < (hrly_precip-0*sa_precision) is used, it selects 17,000 records: all sorts of diconnected values in varying increments, event when the records are consecutive. This strays from our purpose of “repeating” values, which seems to be a specific artifact.

When increased to hrly_tank_chg < (hrly_precip-3*sa_precision), only a thousand records are filtered, although, they still have a lot of the same problems. A factor of 2 seems best, but it still is blantantly filtering a number of ‘F’ flagged values, which we know to be ‘catch-up’ precip. This seems better at catching anomolies than periods of repeating artificial values.

QaRules.flag_overaccum_precip

This function captures all of the isntances of duplcation that QaRules.flag_duplicate_precip does, but, instead of looking for flat tank and flat precip, it looks for places where the change in tank height doesn’t match the precip:

precip = df[ppt_col]
tank_change = df[tank_col].diff()
ppt_minus_tank = abs(precip - tank_change)

raining = precip > 0
overaccum_ppt = ppt_minus_tank > overaccum_threshold * self.precision

flag = raining & overaccum_ppt

Maybe this can be tuned to capture the artificial/repeating precip. And it should be more durable to changes in GCE processing.

Let’s sratrt by looking what it captures as it is currently parameterized.

[58]:
pd.options.display.min_rows = 20

# rerun
ai = qc.flag_overaccum_precip(ppt_col='TOT_SA', tank_col='INST_SA', overaccum_threshold=5)
qc.df_orig[ai]
[58]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2018-11-21 17:00:00 217.199997 <NA> 5.1 F 205.630005 F 236.800003 <NA> 0.0 <NA> 192.649994 <NA>
2018-12-25 09:25:00 307.299988 <NA> 173.0 F 826.309998 F 311.399994 <NA> 0.0 <NA> 628.419983 <NA>
2019-06-28 00:20:00 141.300003 <NA> 21.200001 F 2247.336914 F 170.600006 <NA> 0.0 <NA> 1778.619995 <NA>
2019-09-08 13:20:00 35.189999 <NA> 4.48 F 2278.616943 F 59.5 <NA> 0.0 <NA> 1807.150024 <NA>
2022-09-06 12:55:00 8.97 <NA> 6.7 J 2228.149902 J 24.49 <NA> 24.49 E 2175.530029 E
2022-09-06 13:00:00 7.248 <NA> 4.978 F 2233.12793 F 24.66 <NA> 24.66 F 2200.189941 F
[59]:
pd.options.display.max_rows = 15
pd.options.display.min_rows = 15
# rerun
ai = qc.flag_overaccum_precip(ppt_col='TOT_SA', tank_col='INST_SA', overaccum_threshold=3)
qc.df_orig[ai]
[59]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-11-19 20:40:00 179.699997 <NA> 1.2 F 502.040009 F 179.0 <NA> 0.2 <NA> 462.779999 <NA>
2018-06-16 18:05:00 111.599998 <NA> 1.2 F 1785.810059 F 130.0 <NA> 0.3 <NA> 1697.359985 <NA>
2018-11-21 17:00:00 217.199997 <NA> 5.1 F 205.630005 F 236.800003 <NA> 0.0 <NA> 192.649994 <NA>
2018-12-25 09:25:00 307.299988 <NA> 173.0 F 826.309998 F 311.399994 <NA> 0.0 <NA> 628.419983 <NA>
2019-03-06 07:05:00 355.700012 <NA> 1.2 F 1415.310059 F 356.899994 Q 0.3 <NA> 1195.329956 <NA>
2019-06-28 00:20:00 141.300003 <NA> 21.200001 F 2247.336914 F 170.600006 <NA> 0.0 <NA> 1778.619995 <NA>
2019-08-09 06:50:00 15.87 <NA> 1.06 F 2254.697021 F 40.459999 <NA> 0.0 <NA> 1785.52002 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ...
2021-10-05 17:15:00 342.5 <NA> 1.0 F 1.6 F 373.899994 <NA> 0.0 <NA> 1.8 <NA>
2021-11-04 10:40:00 32.950001 <NA> 1.14 F 185.130005 F 60.869999 <NA> 0.17 <NA> 192.520004 <NA>
2021-12-23 04:35:00 39.189999 <NA> 2.28 F 693.349976 F 67.120003 <NA> 0.33 <NA> 657.570007 <NA>
2022-09-06 12:55:00 8.97 <NA> 6.7 J 2228.149902 J 24.49 <NA> 24.49 E 2175.530029 E
2022-09-06 13:00:00 7.248 <NA> 4.978 F 2233.12793 F 24.66 <NA> 24.66 F 2200.189941 F
2022-09-17 18:10:00 9.42 <NA> 1.36 F 2235.300049 F 26.26 <NA> 1.59 F 2201.790039 F
2022-09-29 00:45:00 24.15 <NA> 2.09 F 2250.030029 F 40.59 <NA> 0.05 <NA> 2216.120117 <NA>

22 rows × 12 columns

This is getting big “catch up” values, almost all of them flagged ‘F’. We need to get those 0.2 and 0.05 repeating values. Much like QaRules.flag_duplicate_precip, we need to make the threshold really small to catch those, we but are already getting all sorts real precip that we don’t want to capture with this filter.

[60]:
pd.options.display.min_rows = 8

# rerun
ai = qc.flag_overaccum_precip(ppt_col='TOT_SA', tank_col='INST_SA', overaccum_threshold=0)
qc.df_orig[ai]
[60]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-01 00:05:00 239.699997 <NA> 0.2 Q 0.2 Q 259.799988 <NA> 0.2 Q 0.2 Q
2017-10-01 00:10:00 239.800003 <NA> 0.3 <NA> 0.5 <NA> 260.100006 <NA> 0.5 <NA> 0.7 <NA>
2017-10-01 00:15:00 239.899994 <NA> 0.1 <NA> 0.6 <NA> 260.100006 <NA> 0.0 <NA> 0.7 <NA>
2017-10-01 00:40:00 240.100006 <NA> 0.2 <NA> 0.8 <NA> 260.399994 <NA> 0.0 <NA> 1.0 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ...
2022-09-30 01:05:00 29.6 <NA> 0.03 <NA> 2255.47998 <NA> 44.720001 <NA> 0.0 <NA> 2220.27002 <NA>
2022-09-30 02:30:00 29.610001 <NA> 0.01 <NA> 2255.48999 <NA> 44.650002 <NA> 0.0 <NA> 2220.300049 <NA>
2022-09-30 11:40:00 29.629999 <NA> 0.02 <NA> 2255.51001 <NA> 44.77 <NA> 0.0 <NA> 2220.300049 <NA>
2022-09-30 12:30:00 29.639999 <NA> 0.01 <NA> 2255.52002 <NA> 44.720001 <NA> 0.0 <NA> 2220.330078 <NA>

34838 rows × 12 columns

With 34,838 instances, including all the ‘F’-flagged values that catch the precip up to the tank accumulation during dry periods, this clearly isn’t working. This was designed to catch big differences, not little 0.2mm differences that repeat forever.

Only where precip > tank

Maybe if I smooth out tank values, or at least stop using the abs() of tank.diff(). With the absolute value, we are catching places where the precip increases more than the tank, but also places where the tank increases more than the precip, which is not the artificial precip problem we are after.

[61]:
pd.options.display.max_rows = 20
pd.options.display.min_rows = 20

ppt_col, tank_col, overaccum_threshold = 'TOT_SA', 'INST_SA', 0

precip = df[ppt_col]
tank_change = df[tank_col].diff()
# no abs! But!!! 0ppt - -5tankchg = 5, so those need to be tossed
tank_increase = tank_change > 0

ppt_minus_tank = precip - tank_change

raining = precip > 0
overaccum_ppt = ppt_minus_tank > overaccum_threshold * qc.precision

ai = raining & overaccum_ppt & tank_increase

qc.df_orig[ai]
[61]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-01 00:05:00 239.699997 <NA> 0.2 Q 0.2 Q 259.799988 <NA> 0.2 Q 0.2 Q
2017-10-01 00:10:00 239.800003 <NA> 0.3 <NA> 0.5 <NA> 260.100006 <NA> 0.5 <NA> 0.7 <NA>
2017-10-01 00:15:00 239.899994 <NA> 0.1 <NA> 0.6 <NA> 260.100006 <NA> 0.0 <NA> 0.7 <NA>
2017-10-01 00:45:00 240.300003 <NA> 0.2 <NA> 1.0 <NA> 260.799988 <NA> 0.4 <NA> 1.4 <NA>
2017-10-01 01:55:00 240.399994 <NA> 0.1 <NA> 1.1 <NA> 260.799988 <NA> 0.0 <NA> 1.6 <NA>
2017-10-01 02:25:00 241.300003 <NA> 0.2 <NA> 2.0 <NA> 261.299988 <NA> 0.0 <NA> 2.2 <NA>
2017-10-01 02:50:00 241.399994 <NA> 0.1 <NA> 2.1 <NA> 261.700012 <NA> 0.0 <NA> 2.4 <NA>
2017-10-01 03:05:00 241.699997 <NA> 0.2 <NA> 2.4 <NA> 262.0 <NA> 0.0 <NA> 2.6 <NA>
2017-10-01 03:50:00 242.399994 <NA> 0.2 <NA> 3.1 <NA> 263.0 <NA> 0.5 <NA> 3.6 <NA>
2017-10-01 04:00:00 242.899994 <NA> 0.2 <NA> 3.6 <NA> 263.200012 <NA> 0.2 <NA> 3.8 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ...
2022-09-29 03:10:00 26.879999 <NA> 0.01 <NA> 2252.76001 <NA> 41.110001 <NA> 0.0 <NA> 2216.76001 <NA>
2022-09-29 05:20:00 26.959999 <NA> 0.01 <NA> 2252.840088 <NA> 42.41 <NA> 0.01 <NA> 2217.939941 <NA>
2022-09-29 15:25:00 28.190001 <NA> 0.24 <NA> 2254.070068 <NA> 44.41 <NA> 0.0 <NA> 2220.01001 <NA>
2022-09-29 15:30:00 28.34 <NA> 0.15 <NA> 2254.219971 <NA> 44.32 <NA> 0.0 <NA> 2220.01001 <NA>
2022-09-29 15:35:00 28.49 <NA> 0.15 <NA> 2254.370117 <NA> 44.32 <NA> 0.0 <NA> 2220.01001 <NA>
2022-09-29 16:20:00 28.559999 <NA> 0.02 <NA> 2254.439941 <NA> 44.27 <NA> 0.0 <NA> 2220.01001 <NA>
2022-09-29 20:25:00 29.34 <NA> 0.14 <NA> 2255.219971 <NA> 44.650002 <NA> 0.06 <NA> 2220.179932 <NA>
2022-09-29 20:30:00 29.379999 <NA> 0.04 <NA> 2255.26001 <NA> 44.639999 <NA> 0.0 <NA> 2220.179932 <NA>
2022-09-29 20:50:00 29.51 <NA> 0.08 <NA> 2255.389893 <NA> 44.720001 <NA> 0.0 <NA> 2220.25 <NA>
2022-09-29 23:50:00 29.559999 <NA> 0.03 <NA> 2255.439941 <NA> 44.68 <NA> 0.0 <NA> 2220.25 <NA>

14603 rows × 12 columns

These timestamps don’t make a lot of sense to me. Let’s plot a day to confirm, but I think this filter isn’t working right.

[62]:
del flag
[63]:
flag = ApplyFlags(df.index)

# add our new FSDB_auto_flag to this instance of ApplyFlags
dbl_flag = pd.DataFrame(ai, columns=['Set0'], index=pd.to_datetime(df.index))
dbl_event = pd.DataFrame(ai, columns=['duplicate'], index=df.index)
dbl_event[['drain_event', 'neg_delta_tank', 'clog']] = False

flag.apply_FSDB_flags(dbl_event, dbl_flag)


# add the original data to the instance to plot
flag.import_provisional_data(qc.df_orig, tank_col='INST_SA', ppt_col='TOT_SA',
                             ppt_flag_col='TOT_Flag_SA')

pair_tank = qc.df_orig['INST_SH']

ax = flag.plot_flagged_day(pd.to_datetime('9/29/22'), 'CEN_01', tdelta='1D', auto_qa_event=dbl_event, paired_tank=pair_tank)
[64]:
ax = flag.plot_flagged_day(pd.to_datetime('9/20/22 0700'), 'CEN_01', tdelta='12h', auto_qa_event=dbl_event, paired_tank=pair_tank)

That looks totally random. While I agree that we have some precip when the tank is effectively flat, I don’t think there is any reason we can consistently classify the points above as artificial repeats.

As is, this method can only be tuned to check large precip amounts against tank changes.

Smooth tank first

Instead of looking for large differences, let’s reverse the process for finding drains:

  1. Smooth the tank,

  2. Look for increases

  3. Look at what precip is doing when the tank isn’t rising

[492]:
tank_col = 'INST_SA'

tank_window = df[tank_col].rolling('45min')

tank_smooth = tank_window.median()
#tank_upr_bounce = tank_window.quantile(0.90)
rising_tank = df[tank_col] > (tank_smooth + qc.precision*.5)

raining = df[ppt_col] > 0
phantom_precip = ~rising_tank & raining

df[phantom_precip]
[492]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-01 00:05:00 239.699997 <NA> 0.2 Q 0.2 Q 259.799988 <NA> 0.2 Q 0.2 Q
2017-10-01 00:10:00 239.800003 <NA> 0.3 <NA> 0.5 <NA> 260.100006 <NA> 0.5 <NA> 0.7 <NA>
2017-10-01 01:55:00 240.399994 <NA> 0.1 <NA> 1.1 <NA> 260.799988 <NA> 0.0 <NA> 1.6 <NA>
2017-10-01 03:30:00 241.800003 <NA> 0.1 <NA> 2.5 <NA> 262.200012 <NA> 0.0 <NA> 3.1 <NA>
2017-10-01 05:05:00 243.100006 <NA> 0.1 <NA> 3.8 <NA> 263.700012 <NA> 0.0 <NA> 4.3 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ...
2022-09-30 00:45:00 29.57 <NA> 0.01 <NA> 2255.449951 <NA> 44.700001 <NA> 0.0 <NA> 2220.27002 <NA>
2022-09-30 01:05:00 29.6 <NA> 0.03 <NA> 2255.47998 <NA> 44.720001 <NA> 0.0 <NA> 2220.27002 <NA>
2022-09-30 02:30:00 29.610001 <NA> 0.01 <NA> 2255.48999 <NA> 44.650002 <NA> 0.0 <NA> 2220.300049 <NA>
2022-09-30 11:40:00 29.629999 <NA> 0.02 <NA> 2255.51001 <NA> 44.77 <NA> 0.0 <NA> 2220.300049 <NA>
2022-09-30 12:30:00 29.639999 <NA> 0.01 <NA> 2255.52002 <NA> 44.720001 <NA> 0.0 <NA> 2220.330078 <NA>

4990 rows × 12 columns

[493]:
df.loc[phantom_precip & (how_many_in_group >= 5), 'TOT_SA'].describe()
[493]:
count      1353.0
mean     0.166563
std      0.062929
min          0.01
25%           0.2
50%           0.2
75%           0.2
max          0.37
Name: TOT_SA, dtype: double[pyarrow]

Yet again, this is just too wide a net. It catches all sorts of small precip values during periods where the tank isn’t doing much. I tried a bunch of different time steps and precision values, but, while this may be a strong inditement of the accuracy of simple_pre.m, it does not function as a filter for the artificial repeating values.

It’s worth seeing what this looks like if I require cosnecutive values, like I did with QaRules.falg_duplicate_precip

[598]:
rising_tank = df[tank_col] > (tank_smooth + qc.precision*1)

phantom_precip = ~rising_tank & raining

duplicate_val = phantom_precip.astype('boolean')
number_continuous_grps = duplicate_val.diff().cumsum()
how_many_in_group = duplicate_val.groupby(number_continuous_grps).transform('size')


df[phantom_precip & (how_many_in_group >= 5)]
[598]:
INST_SA INST_Flag_SA TOT_SA TOT_Flag_SA ACC_SA ACC_Flag_SA INST_SH INST_Flag_SH TOT_SH TOT_Flag_SH ACC_SH ACC_Flag_SH
Date
2017-10-19 15:00:00 346.5 MM 0.3 M 108.400002 M 352.299988 M 0.2 M 104.300003 M
2017-10-19 15:05:00 346.5 MM 0.3 M 108.699997 M 352.299988 M 0.2 M 104.5 M
2017-10-19 15:10:00 346.5 MM 0.3 M 109.0 M 352.299988 M 0.2 M 104.699997 M
2017-10-19 15:15:00 346.5 MM 0.3 M 109.300003 M 352.299988 M 0.2 M 104.900002 M
2017-10-19 15:20:00 346.5 MM 0.3 M 109.599998 M 352.299988 M 0.2 M 105.099998 M
2017-10-19 15:25:00 346.5 MM 0.3 M 109.900002 M 352.299988 M 0.2 M 105.300003 M
2017-10-19 15:30:00 346.5 MM 0.3 M 110.199997 M 352.299988 M 0.2 M 105.5 M
2017-10-19 15:40:00 18.049999 <NA> 0.37 <NA> 110.57 <NA> 30.1 <NA> 0.32 <NA> 105.82 <NA>
2017-10-19 15:45:00 18.309999 <NA> 0.26 <NA> 110.830002 <NA> 30.1 <NA> 0.0 <NA> 105.82 <NA>
2017-10-19 15:50:00 18.559999 <NA> 0.25 <NA> 111.080002 <NA> 30.1 <NA> 0.0 <NA> 105.82 <NA>
2017-10-19 15:55:00 18.809999 <NA> 0.25 <NA> 111.330002 <NA> 29.780001 <NA> 0.0 <NA> 105.82 <NA>
2017-10-19 16:00:00 19.07 <NA> 0.26 <NA> 111.589996 <NA> 30.43 <NA> 0.33 <NA> 106.150002 <NA>
2018-01-20 04:55:00 210.0 <NA> 0.2 <NA> 969.23999 <NA> 210.100006 <NA> 0.0 <NA> 905.469971 <NA>
2018-01-20 05:00:00 209.800003 <NA> 0.2 <NA> 969.440002 <NA> 210.199997 <NA> 0.0 <NA> 905.469971 <NA>
2018-01-20 05:05:00 209.800003 <NA> 0.2 <NA> 969.640015 <NA> 210.399994 <NA> 0.0 <NA> 905.469971 <NA>
2018-01-20 05:10:00 209.800003 <NA> 0.2 <NA> 969.840027 <NA> 210.199997 <NA> 0.0 <NA> 905.469971 <NA>
2018-01-20 05:15:00 209.800003 <NA> 0.2 <NA> 970.039978 <NA> 210.399994 <NA> 0.0 <NA> 905.469971 <NA>
2018-01-20 05:20:00 209.800003 <NA> 0.2 <NA> 970.23999 <NA> 210.600006 <NA> 0.2 <NA> 905.669983 <NA>
2019-04-06 10:10:00 128.899994 M 0.2 M 1598.900024 M 174.800003 <NA> 0.4 <NA> 1369.640015 <NA>
2019-04-06 10:15:00 128.899994 M 0.2 M 1599.099976 M 174.899994 <NA> 0.1 <NA> 1369.73999 <NA>
... ... ... ... ... ... ... ... ... ... ... ... ...
2020-06-04 17:05:00 19.74 <NA> 0.05 <NA> 1491.640015 <NA> 38.939999 <NA> 0.0 <NA> 1380.25 <NA>
2020-06-04 17:10:00 19.610001 <NA> 0.05 <NA> 1491.689941 <NA> 38.939999 <NA> 0.0 <NA> 1380.25 <NA>
2020-06-04 17:15:00 19.610001 <NA> 0.05 <NA> 1491.73999 <NA> 38.779999 <NA> 0.0 <NA> 1380.25 <NA>
2020-06-04 17:20:00 19.74 <NA> 0.05 <NA> 1491.790039 <NA> 38.779999 <NA> 0.0 <NA> 1380.25 <NA>
2020-06-04 17:25:00 19.74 <NA> 0.05 <NA> 1491.839966 <NA> 38.779999 <NA> 0.0 <NA> 1380.25 <NA>
2021-10-23 01:50:00 68.769997 <NA> 0.01 <NA> 63.27 <NA> 112.699997 <NA> 0.0 <NA> 77.529999 <NA>
2021-10-23 01:55:00 68.779999 <NA> 0.01 <NA> 63.279999 <NA> 112.699997 <NA> 0.0 <NA> 77.529999 <NA>
2021-10-23 02:00:00 68.910004 <NA> 0.13 <NA> 63.41 <NA> 112.900002 <NA> 0.0 <NA> 77.529999 <NA>
2021-10-23 02:05:00 69.019997 <NA> 0.11 <NA> 63.52 <NA> 112.900002 <NA> 0.0 <NA> 77.529999 <NA>
2021-10-23 02:10:00 69.029999 <NA> 0.01 <NA> 63.529999 <NA> 113.199997 <NA> 0.3 <NA> 77.830002 <NA>
2022-03-13 02:10:00 39.169998 MM 0.26 M 1269.619995 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:15:00 39.169998 MM 0.26 M 1269.880005 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:20:00 39.169998 MM 0.26 M 1270.140015 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:25:00 39.169998 MM 0.26 M 1270.400024 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:30:00 39.169998 MM 0.26 M 1270.660034 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:35:00 39.169998 MM 0.26 M 1270.920044 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:40:00 39.169998 MM 0.26 M 1271.180054 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:45:00 39.169998 MM 0.26 M 1271.439941 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:50:00 39.169998 MM 0.26 M 1271.699951 M 69.540001 M 0.0 M 1182.130005 M
2022-03-13 02:55:00 39.169998 MM 0.26 M 1271.959961 M 69.540001 M 0.0 M 1182.130005 M

1364 rows × 12 columns

With the consequtive requirement, this does quite well. I’m quite confident with how well QaRules.flag_duplicate_precip works above, this matches it pretty closely. It grabs 4 fewer timesteps, but one extra day and a smidge more precip (~0.01”)

To compare:

# days flagged by `QaRules.falg_duplicate_precip`
array([datetime.date(2017, 10, 19), datetime.date(2018, 1, 20),
       datetime.date(2019, 4, 6), datetime.date(2019, 4, 7),
       datetime.date(2019, 4, 8), datetime.date(2019, 4, 9),
       datetime.date(2019, 11, 15), datetime.date(2020, 6, 3),
       datetime.date(2020, 6, 4), datetime.date(2022, 3, 13)],
      dtype=object)

# timesteps flagged by `QaRules.falg_duplicate_precip`
1370 rows

# precip flagged by `QaRules.falg_duplicate_precip`
226.55000304803252
[429]:
pd.set_option('display.min_row', 10)
flag.flags[:] = False

flag.flags['Set0'] = phantom_precip & (how_many_in_group >= 5)
flag.get_flagged_days()
[429]:
DatetimeIndex(['2017-10-19', '2018-01-20', '2019-04-06', '2019-04-07',
               '2019-04-08', '2019-04-09', '2019-11-15', '2020-06-03',
               '2020-06-04', '2021-10-23', '2022-03-13'],
              dtype='datetime64[ns]', freq=None)
[448]:
df.loc[phantom_precip & (how_many_in_group >= 5), 'TOT_SA'].sum()
[448]:
226.76000300608575

Let’s take a quick look at the extra day caught by this method.

[584]:
duplicates = pd.DataFrame((phantom_precip & (how_many_in_group >= 5)))
duplicates.name = 'duplicate'

qa_events[:] = False
qa_events['duplicate'] = duplicates

flag.apply_FSDB_flags(qa_events, flag.flags)


ax = flag.plot_flagged_day(pd.to_datetime('10/23/2021 0100'), 'CEN_01', tdelta='90min', auto_qa_event=qa_events, paired_tank=pair_tank)
plt.grid(True)
[585]:
strt, end = pd.to_datetime('10/23/2021 0100'), pd.to_datetime('10/23/2021 0230')

(tank_smooth.loc[strt:end]).plot(ax=ax[1], color='g', marker='d')
(tank_smooth.loc[strt:end]+1*sa_precision).plot(ax=ax[1], color='g', style='d:')
[585]:
<Axes: title={'center': 'CEN_01 - 2021-10-23 01:00:00'}, xlabel='Date'>
[586]:
lbl = ax[1].get_legend_handles_labels()[1]
lbl[-2:] = ['Smoothed tank', 'Smoothed tank + precision']
ax[1].legend(labels=lbl)
[586]:
<matplotlib.legend.Legend at 0x2d915ecae50>
[457]:
print(f"Tank change: {df.loc[end, 'INST_SA'] - df.loc[strt, 'INST_SA']:.4f}")
print(f"Tot precip : {df.loc[strt:end, 'TOT_SA'].sum():.4f}")
Tank change: 0.7700
Tot precip : 0.7600

This example doesn’t look like it needs to be flagged.

This seems like a very effective way to find a when the tank is flat. By smoothing the tank to the median with a precision buffer, rather than looking at absolute differences between noisy tank values every 5 minutes, the filter naturally grabs all of our repeating values that we’ve been finding.

In the search for these artificially repeating numbers, it makes sense to require the values to be consecutive. And, since this filter is naturally smoothing signal noise with a buffer the precip captured is already limited to sub-precision values. Which means there is no need to identify whether the values are repeating themselves or not, as long as they are consecutive. This is very effective, but fundamentally is looking more at whether or not we are accumulating signal noise. That is a question which is worth addressing, but has not been sytematically investigated here, but rather incidentally investigated as a byproduct of the search for artificially repeating values.

For now, the well developed method for searching for duplicates is dually useful both to find large, doubled values and small constant ones.