Finding Artificial Precip¶
In final evaluation of Capturing-clogs-from-ACC-ratio, a large amount of precipitation was discovered during a clog that turned out to be artificial. At first glance, it looked like a slow drip past a clog, but on closer inspection, it was an artifact of our GCE processing. It is unclear if the clog triggered the process. However, it has some overlap with the searches for duplicate values that were part of cleaning the data before searching for clogs.
Below the data is loaded into this Jupyter Lab before investigating the artificial precip.
[1]:
# setup environment.
import pandas as pd
import matplotlib.pyplot as plt
from numpy import nan, arange
# Jupyter magic to make plots display interactive
# must install ipympl (Ipython-matplotlib) and nodejs
from ipywidgets.embed import embed_minimal_html
%matplotlib widget
# expand all plots to comfortable viewing size
plt.rcParams['figure.figsize'] = [8, 5]
import sys
sys.path.append("../post_gce_qc/")
%load_ext autoreload
%autoreload explicit
%aimport qaqc
from qaqc import *
[2]:
# load data
#################
prov = qaqc.LoadProvisionalData('../config.yaml')
prov.load_ppt_data()
cnsh = prov.pivot_on_probe(prov.df, 'CEN', '02')
cnsa = prov.pivot_on_probe(prov.df, 'CEN', '01')
cn_ppt = cnsa.merge(cnsh, left_index=True, right_index=True, suffixes=('_SA', '_SH'))
[3]:
# clean data for comparison
##############################
sh_precision = 0.3810
sa_precision = 0.2794
probe = {'paired_sensor': 'CEN_02',
'precision': sa_precision,
'tank_min': 0,
'tank_max': 412,
'drain_event_max_recharge': 2.67,
'use_pair': False,
'drain_event_window': 3,
'ppt_runavg_window': 4,
'drain_event_nstd': 2}
qc = qaqc.QaRules(cn_ppt, qa_params=probe)
qc.flag_double_precip(ppt_col='TOT_SA', tank_col='INST_SA')
qc.flag_empty_tank(tank_col='INST_SA', pause_nsteps=2)
new_acc = qc.reset_wy_acc(ppt_col='TOT_SA')
ratio = (new_acc - cn_ppt['ACC_SH'])/new_acc
[4]:
# filter just our test year
df = cn_ppt#[pd.to_datetime('10/1/18'):pd.to_datetime('9/30/19')]
#ratio = ratio[pd.to_datetime('10/1/18'):pd.to_datetime('9/30/19')]
ratio.name='pair_ratio'
df_ratio = pd.DataFrame(ratio)
# find clogs
#######################
min_accum = 34
above_min_accum = (cn_ppt[['ACC_SH', 'ACC_SA']] > min_accum).all(axis=1)
below_normal = ratio < 0.025
dropping_ratio = qc.find_drops(df_ratio, 0.01, col='pair_ratio', wind=f'8D')
clog = (below_normal | dropping_ratio) & above_min_accum
[5]:
# create flag object for plotting
#################################
flag = ApplyFlags(df.index)
# add our new FSDB_auto_flag to this instance of ApplyFlags
clog_flag = pd.DataFrame(clog, columns=['U'], index=pd.to_datetime(df.index))
clog_event = pd.DataFrame(clog, columns=['clog'], index=df.index)
clog_event[['drain_event', 'neg_delta_tank', 'duplicate']] = False
flag.apply_FSDB_flags(clog_event, clog_flag)
# add the original data to the instance to plot
flag.import_provisional_data(qc.df_orig, tank_col='INST_SA', ppt_col='TOT_SA',
ppt_flag_col='TOT_Flag_SA')
False precip with flat tank¶
Where we left off, there was around an inch of precip before the clog was flagged on 4/5/19. It’s accumulating a constant amount of precip for 20 hours straight!! 0.01 inches was measured every 5 minutes from 4/6/19 0955 - 4/9/19 2045 without any gaps. It may require some cleaning before dealing with the clog.
[6]:
strt = pd.to_datetime('4/5/19')
end = strt + pd.to_timedelta('60h')
ax1 = df_ratio[strt:end].plot(legend=True, grid=True, color='g')
ax2=plt.twinx(ax1)
df.loc[strt:end,['ACC_SA']].plot(ax=ax2, legend=True, grid=True)
strt_diff = df.loc[strt,'ACC_SH']-df.loc[strt,'ACC_SA']
(df.loc[strt:end,'ACC_SH']-strt_diff).plot(ax=ax2, legend=True, grid=True)
df.loc[strt:end,'ACC_SA'][clog].plot(ax=ax2, grid=True, linestyle='', marker='o', color='navy')
ratio[strt:end][clog].plot(ax=ax1, grid=True, linestyle='', marker='o', color='navy')
rolling_rat = ratio[strt:end].rolling('8D').mean(engine=numba)
rolling_rat.plot(ax=ax1, grid=True, color='g', linestyle=':', legend=True)
rolling_rat[clog].plot(ax=ax1, grid=True, linestyle='', marker='o', color='navy')
[6]:
<Axes: xlabel='Date'>
[7]:
ax2.legend(['ACC_SA', 'ACC_SH', 'clog']).set_loc('lower left')
A closer look at the ACC and ratio show that the constant 0.01” accumulation swamp out the clog by creating an initial positive change in the ratio, delaying the clog flag. Need to dig into the raw data to see how this can be flagged, or even what’s causing it.
[8]:
strt = pd.to_datetime('4/6/19 0955')
end = strt + pd.to_timedelta(f'{5*100}min')#pd.to_datetime('4/9/19 2045')
pd.set_option('display.max_rows', 101)
cnsa[strt:end]
[8]:
| INST | INST_Flag | TOT | TOT_Flag | ACC | ACC_Flag | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2019-04-06 09:55:00 | 128.899994 | <NA> | 0.2 | <NA> | 1598.300049 | <NA> |
| 2019-04-06 10:00:00 | 128.899994 | M | 0.2 | M | 1598.5 | M |
| 2019-04-06 10:05:00 | 128.899994 | M | 0.2 | M | 1598.699951 | M |
| 2019-04-06 10:10:00 | 128.899994 | M | 0.2 | M | 1598.900024 | M |
| 2019-04-06 10:15:00 | 128.899994 | M | 0.2 | M | 1599.099976 | M |
| 2019-04-06 10:20:00 | 128.899994 | M | 0.2 | M | 1599.300049 | M |
| 2019-04-06 10:25:00 | 128.899994 | M | 0.2 | M | 1599.5 | M |
| 2019-04-06 10:30:00 | 128.899994 | M | 0.2 | M | 1599.699951 | M |
| 2019-04-06 10:35:00 | 128.899994 | M | 0.2 | M | 1599.900024 | M |
| 2019-04-06 10:40:00 | 128.899994 | M | 0.2 | M | 1600.099976 | M |
| 2019-04-06 10:45:00 | 128.899994 | M | 0.2 | M | 1600.300049 | M |
| 2019-04-06 10:50:00 | 128.899994 | M | 0.2 | M | 1600.5 | M |
| 2019-04-06 10:55:00 | 128.899994 | M | 0.2 | M | 1600.699951 | M |
| 2019-04-06 11:00:00 | 128.899994 | M | 0.2 | M | 1600.900024 | M |
| 2019-04-06 11:05:00 | 128.899994 | M | 0.2 | M | 1601.099976 | M |
| 2019-04-06 11:10:00 | 128.899994 | M | 0.2 | M | 1601.300049 | M |
| 2019-04-06 11:15:00 | 128.899994 | M | 0.2 | M | 1601.5 | M |
| 2019-04-06 11:20:00 | 128.899994 | M | 0.2 | M | 1601.699951 | M |
| 2019-04-06 11:25:00 | 128.899994 | M | 0.2 | M | 1601.900024 | M |
| 2019-04-06 11:30:00 | 128.899994 | M | 0.2 | M | 1602.099976 | M |
| 2019-04-06 11:35:00 | 128.899994 | M | 0.2 | M | 1602.300049 | M |
| 2019-04-06 11:40:00 | 128.899994 | M | 0.2 | M | 1602.5 | M |
| 2019-04-06 11:45:00 | 128.899994 | M | 0.2 | M | 1602.699951 | M |
| 2019-04-06 11:50:00 | 128.899994 | M | 0.2 | M | 1602.900024 | M |
| 2019-04-06 11:55:00 | 128.899994 | M | 0.2 | M | 1603.099976 | M |
| 2019-04-06 12:00:00 | 128.899994 | M | 0.2 | M | 1603.300049 | M |
| 2019-04-06 12:05:00 | 128.899994 | M | 0.2 | M | 1603.5 | M |
| 2019-04-06 12:10:00 | 128.899994 | M | 0.2 | M | 1603.699951 | M |
| 2019-04-06 12:15:00 | 128.899994 | M | 0.2 | M | 1603.900024 | M |
| 2019-04-06 12:20:00 | 128.899994 | M | 0.2 | M | 1604.099976 | M |
| 2019-04-06 12:25:00 | 128.899994 | M | 0.2 | M | 1604.300049 | M |
| 2019-04-06 12:30:00 | 128.899994 | M | 0.2 | M | 1604.5 | M |
| 2019-04-06 12:35:00 | 128.899994 | M | 0.2 | M | 1604.699951 | M |
| 2019-04-06 12:40:00 | 128.899994 | M | 0.2 | M | 1604.900024 | M |
| 2019-04-06 12:45:00 | 128.899994 | M | 0.2 | M | 1605.099976 | M |
| 2019-04-06 12:50:00 | 128.899994 | M | 0.2 | M | 1605.300049 | M |
| 2019-04-06 12:55:00 | 128.899994 | M | 0.2 | M | 1605.5 | M |
| 2019-04-06 13:00:00 | 128.899994 | M | 0.2 | M | 1605.699951 | M |
| 2019-04-06 13:05:00 | 128.899994 | M | 0.2 | M | 1605.900024 | M |
| 2019-04-06 13:10:00 | 128.899994 | M | 0.2 | M | 1606.099976 | M |
| 2019-04-06 13:15:00 | 128.899994 | M | 0.2 | M | 1606.300049 | M |
| 2019-04-06 13:20:00 | 128.899994 | M | 0.2 | M | 1606.5 | M |
| 2019-04-06 13:25:00 | 128.899994 | M | 0.2 | M | 1606.699951 | M |
| 2019-04-06 13:30:00 | 128.899994 | M | 0.2 | M | 1606.900024 | M |
| 2019-04-06 13:35:00 | 128.899994 | M | 0.2 | M | 1607.099976 | M |
| 2019-04-06 13:40:00 | 128.899994 | M | 0.2 | M | 1607.300049 | M |
| 2019-04-06 13:45:00 | 128.899994 | M | 0.2 | M | 1607.5 | M |
| 2019-04-06 13:50:00 | 128.899994 | M | 0.2 | M | 1607.699951 | M |
| 2019-04-06 13:55:00 | 128.899994 | M | 0.2 | M | 1607.900024 | M |
| 2019-04-06 14:00:00 | 128.899994 | M | 0.2 | M | 1608.099976 | M |
| 2019-04-06 14:05:00 | 128.899994 | M | 0.2 | M | 1608.300049 | M |
| 2019-04-06 14:10:00 | 128.899994 | M | 0.2 | M | 1608.5 | M |
| 2019-04-06 14:15:00 | 128.899994 | M | 0.2 | M | 1608.699951 | M |
| 2019-04-06 14:20:00 | 128.899994 | M | 0.2 | M | 1608.900024 | M |
| 2019-04-06 14:25:00 | 128.899994 | M | 0.2 | M | 1609.099976 | M |
| 2019-04-06 14:30:00 | 128.899994 | M | 0.2 | M | 1609.300049 | M |
| 2019-04-06 14:35:00 | 128.899994 | M | 0.2 | M | 1609.5 | M |
| 2019-04-06 14:40:00 | 128.899994 | M | 0.2 | M | 1609.699951 | M |
| 2019-04-06 14:45:00 | 128.899994 | M | 0.2 | M | 1609.900024 | M |
| 2019-04-06 14:50:00 | 128.899994 | M | 0.2 | M | 1610.099976 | M |
| 2019-04-06 14:55:00 | 128.899994 | M | 0.2 | M | 1610.300049 | M |
| 2019-04-06 15:00:00 | 128.899994 | M | 0.2 | M | 1610.5 | M |
| 2019-04-06 15:05:00 | 128.899994 | M | 0.2 | M | 1610.699951 | M |
| 2019-04-06 15:10:00 | 128.899994 | M | 0.2 | M | 1610.900024 | M |
| 2019-04-06 15:15:00 | 128.899994 | M | 0.2 | M | 1611.099976 | M |
| 2019-04-06 15:20:00 | 128.899994 | M | 0.2 | M | 1611.300049 | M |
| 2019-04-06 15:25:00 | 128.899994 | M | 0.2 | M | 1611.5 | M |
| 2019-04-06 15:30:00 | 128.899994 | M | 0.2 | M | 1611.699951 | M |
| 2019-04-06 15:35:00 | 128.899994 | M | 0.2 | M | 1611.900024 | M |
| 2019-04-06 15:40:00 | 128.899994 | M | 0.2 | M | 1612.099976 | M |
| 2019-04-06 15:45:00 | 128.899994 | M | 0.2 | M | 1612.300049 | M |
| 2019-04-06 15:50:00 | 128.899994 | M | 0.2 | M | 1612.5 | M |
| 2019-04-06 15:55:00 | 128.899994 | M | 0.2 | M | 1612.699951 | M |
| 2019-04-06 16:00:00 | 128.899994 | M | 0.2 | M | 1612.900024 | M |
| 2019-04-06 16:05:00 | 128.899994 | M | 0.2 | M | 1613.099976 | M |
| 2019-04-06 16:10:00 | 128.899994 | M | 0.2 | M | 1613.300049 | M |
| 2019-04-06 16:15:00 | 128.899994 | M | 0.2 | M | 1613.5 | M |
| 2019-04-06 16:20:00 | 128.899994 | M | 0.2 | M | 1613.699951 | M |
| 2019-04-06 16:25:00 | 128.899994 | M | 0.2 | M | 1613.900024 | M |
| 2019-04-06 16:30:00 | 128.899994 | M | 0.2 | M | 1614.099976 | M |
| 2019-04-06 16:35:00 | 128.899994 | M | 0.2 | M | 1614.300049 | M |
| 2019-04-06 16:40:00 | 128.899994 | M | 0.2 | M | 1614.5 | M |
| 2019-04-06 16:45:00 | 128.899994 | M | 0.2 | M | 1614.699951 | M |
| 2019-04-06 16:50:00 | 128.899994 | M | 0.2 | M | 1614.900024 | M |
| 2019-04-06 16:55:00 | 128.899994 | M | 0.2 | M | 1615.099976 | M |
| 2019-04-06 17:00:00 | 128.899994 | M | 0.2 | M | 1615.300049 | M |
| 2019-04-06 17:05:00 | 128.899994 | M | 0.2 | M | 1615.5 | M |
| 2019-04-06 17:10:00 | 128.899994 | M | 0.2 | M | 1615.699951 | M |
| 2019-04-06 17:15:00 | 128.899994 | M | 0.2 | M | 1615.900024 | M |
| 2019-04-06 17:20:00 | 128.899994 | M | 0.2 | M | 1616.099976 | M |
| 2019-04-06 17:25:00 | 128.899994 | M | 0.2 | M | 1616.300049 | M |
| 2019-04-06 17:30:00 | 128.899994 | M | 0.2 | M | 1616.5 | M |
| 2019-04-06 17:35:00 | 128.899994 | M | 0.2 | M | 1616.699951 | M |
| 2019-04-06 17:40:00 | 128.899994 | M | 0.2 | M | 1616.900024 | M |
| 2019-04-06 17:45:00 | 128.899994 | M | 0.2 | M | 1617.099976 | M |
| 2019-04-06 17:50:00 | 128.899994 | M | 0.2 | M | 1617.300049 | M |
| 2019-04-06 17:55:00 | 128.899994 | M | 0.2 | M | 1617.5 | M |
| 2019-04-06 18:00:00 | 128.899994 | M | 0.2 | M | 1617.699951 | M |
| 2019-04-06 18:05:00 | 128.899994 | M | 0.2 | M | 1617.900024 | M |
| 2019-04-06 18:10:00 | 128.899994 | M | 0.2 | M | 1618.099976 | M |
| 2019-04-06 18:15:00 | 128.899994 | M | 0.2 | M | 1618.300049 | M |
[588]:
strt = pd.to_datetime('4/6/19 0955')
end = pd.to_datetime('4/9/19 2045')
cnsa[strt:end].describe()
[588]:
| INST | TOT | ACC | |
|---|---|---|---|
| count | 995.0 | 995.0 | 995.0 |
| mean | 128.899994 | 0.2 | 1697.698467 |
| std | 0.0 | 0.0 | 57.474289 |
| min | 128.899994 | 0.2 | 1598.300049 |
| 25% | 128.899994 | 0.2 | 1647.999023 |
| 50% | 128.899994 | 0.2 | 1697.697998 |
| 75% | 128.899994 | 0.2 | 1747.39801 |
| max | 128.899994 | 0.2 | 1797.097046 |
[10]:
cnsa.loc[strt:end, 'TOT'].sum()
[10]:
199.00000296533108
The tank value is 100% static! This was flagged ‘M’, the tank values were deleted, and had pd.ffill() applied during load. Tank value does not change because it is missing data that was forward filled from the last known value. Yet 199 mm of precip accumulate! This is totally bad data! So the question is how to filter it.
[11]:
#strt = pd.to_datetime('4/5/19')
end = strt + pd.to_timedelta('11D')
ax1 = df_ratio[strt:end].plot(legend=True, grid=True, color='g')
ax2=plt.twinx(ax1)
df.loc[strt:end,['ACC_SA']].plot(ax=ax2, legend=True, grid=True)
strt_diff = df.loc[strt,'ACC_SH']-df.loc[strt,'ACC_SA']
(df.loc[strt:end,'ACC_SH']-strt_diff).plot(ax=ax2, legend=True, grid=True)
df.loc[strt:end,'ACC_SA'][clog].plot(ax=ax2, grid=True, linestyle='', marker='o', color='navy', legend=True)
ratio[strt:end][clog].plot(ax=ax1, grid=True, linestyle='', marker='o', color='navy', legend=True)
rolling_rat = ratio[strt:end].rolling('8D').mean(engine=numba)
rolling_rat.plot(ax=ax1, grid=True, color='g', linestyle=':', legend=True)
rolling_rat[clog].plot(ax=ax1, grid=True, linestyle='', marker='o', color='navy', legend=True)
[11]:
<Axes: xlabel='Date'>
[12]:
ax2.legend().set_loc('lower left')
Compare to raw data¶
fpath = input()[562]:
raw = pd.read_csv(fpath, header=1, parse_dates=True, index_col='TIMESTAMP',
encoding='latin-1', engine='pyarrow', dtype_backend='pyarrow')
raw = raw.iloc[2:]
raw.index = pd.to_datetime(raw.index)
[563]:
pd.options.display.min_rows = 45
pd.merge(raw.loc[strt:end, 'SA_PRECIP'],cnsa.loc[strt:end, 'TOT'], left_index=True, right_index=True)
[563]:
| SA_PRECIP | TOT | |
|---|---|---|
| TIMESTAMP | ||
| 2019-04-06 09:55:00 | 128.9 | 0.2 |
| 2019-04-06 10:00:00 | 129.1 | 0.2 |
| 2019-04-06 10:05:00 | 129.3 | 0.2 |
| 2019-04-06 10:10:00 | 129.7 | 0.2 |
| 2019-04-06 10:15:00 | 129.8 | 0.2 |
| 2019-04-06 10:20:00 | 129.8 | 0.2 |
| 2019-04-06 10:25:00 | 129.9 | 0.2 |
| 2019-04-06 10:30:00 | 130.3 | 0.2 |
| 2019-04-06 10:35:00 | 130.4 | 0.2 |
| 2019-04-06 10:40:00 | 130.7 | 0.2 |
| 2019-04-06 10:45:00 | 131 | 0.2 |
| 2019-04-06 10:50:00 | 130.9 | 0.2 |
| 2019-04-06 10:55:00 | 131 | 0.2 |
| 2019-04-06 11:00:00 | 131.1 | 0.2 |
| 2019-04-06 11:05:00 | 131.1 | 0.2 |
| 2019-04-06 11:10:00 | 131.2 | 0.2 |
| 2019-04-06 11:15:00 | 131.2 | 0.2 |
| 2019-04-06 11:20:00 | 131.2 | 0.2 |
| 2019-04-06 11:25:00 | 131.2 | 0.2 |
| 2019-04-06 11:30:00 | 131.3 | 0.2 |
| ... | ... | ... |
| 2019-04-09 19:10:00 | 133.5 | 0.2 |
| 2019-04-09 19:15:00 | 133.5 | 0.2 |
| 2019-04-09 19:20:00 | 133.5 | 0.2 |
| 2019-04-09 19:25:00 | 133.5 | 0.2 |
| 2019-04-09 19:30:00 | 133.6 | 0.2 |
| 2019-04-09 19:35:00 | 133.5 | 0.2 |
| 2019-04-09 19:40:00 | 133.5 | 0.2 |
| 2019-04-09 19:45:00 | 133.5 | 0.2 |
| 2019-04-09 19:50:00 | 133.5 | 0.2 |
| 2019-04-09 19:55:00 | 133.5 | 0.2 |
| 2019-04-09 20:00:00 | 133.5 | 0.2 |
| 2019-04-09 20:05:00 | 133.4 | 0.2 |
| 2019-04-09 20:10:00 | 133.5 | 0.2 |
| 2019-04-09 20:15:00 | 133.5 | 0.2 |
| 2019-04-09 20:20:00 | 133.5 | 0.2 |
| 2019-04-09 20:25:00 | 133.5 | 0.2 |
| 2019-04-09 20:30:00 | 133.5 | 0.2 |
| 2019-04-09 20:35:00 | 133.5 | 0.2 |
| 2019-04-09 20:40:00 | 133.5 | 0.2 |
| 2019-04-09 20:45:00 | 133.5 | 0.2 |
995 rows × 2 columns
[592]:
plt.figure()
raw.loc[strt:end, 'SA_PRECIP'].astype('float[pyarrow]').plot(grid=True, legend=True)
tot = cnsa.loc[strt:end, 'TOT'].cumsum()
tot.index = tot.index.astype('datetime64[ns]')
ax2 = plt.twinx()
tot.plot(ax=ax2, grid=True, color='orange')
plt.legend(loc='lower right')
[592]:
<matplotlib.legend.Legend at 0x2d9189b2bb0>
So, the raw data isn’t perfectly flat, but we have 3.6 mm of tank change and 199 mm of precip. We can see that, right away, the precip is doubling the incriment that the tank is increasing by, and within 5 timesteps, the tank pauses it’s accumulation, but the precip still remains constant. So, while not absolutely flat, it shows the same problem. It would be nice if the raw tank values weren’t deleted. Let’s see if we can filter out this issue.
Filtering for a flat tank¶
Max == min¶
The extreme example above might have a simple solution.
[13]:
dly_tank = cnsa['INST'].rolling('1D')
dly_max = dly_tank.max(engine=numba)
dly_min = dly_tank.min(engine=numba)
flat_tank = dly_max == dly_min
raining = cnsa['TOT'] > 0
cnsa[flat_tank&raining]
[13]:
| INST | INST_Flag | TOT | TOT_Flag | ACC | ACC_Flag | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2019-04-07 09:50:00 | 128.899994 | M | 0.2 | M | 1655.698975 | M |
| 2019-04-07 09:55:00 | 128.899994 | M | 0.2 | M | 1655.899048 | M |
| 2019-04-07 10:00:00 | 128.899994 | M | 0.2 | M | 1656.098999 | M |
| 2019-04-07 10:05:00 | 128.899994 | M | 0.2 | M | 1656.29895 | M |
| 2019-04-07 10:10:00 | 128.899994 | M | 0.2 | M | 1656.499023 | M |
| ... | ... | ... | ... | ... | ... | ... |
| 2019-04-09 20:25:00 | 128.899994 | M | 0.2 | M | 1796.296997 | M |
| 2019-04-09 20:30:00 | 128.899994 | M | 0.2 | M | 1796.496948 | M |
| 2019-04-09 20:35:00 | 128.899994 | M | 0.2 | M | 1796.697021 | M |
| 2019-04-09 20:40:00 | 128.899994 | M | 0.2 | M | 1796.896973 | M |
| 2019-04-09 20:45:00 | 128.899994 | M | 0.2 | M | 1797.097046 | M |
708 rows × 6 columns
[14]:
# test with Shelter
dly_tank = cnsh['INST'].rolling('1D')
dly_max = dly_tank.max(engine=numba)
dly_min = dly_tank.min(engine=numba)
flat_tank = dly_max == dly_min
raining = cnsh['TOT'] > 0
cnsh[flat_tank&raining]
[14]:
| INST | INST_Flag | TOT | TOT_Flag | ACC | ACC_Flag | |
|---|---|---|---|---|---|---|
| Date |
That caught the problem right away. A few issues though:
This caught the one and only known case, but I still don’t know what caused it, and it didn’t catch any other cases. So are there cases we still don’t know about?
This starts a day late. It will need to be adjusted back by the size of the window (‘1D’).
I already have a qa rule to catch flat tanks…I’m suspicious that I need another one. Maybe I should take another look at
QaRules.flag_duplicate_precip(), plus an unused ruleQaRules.flag_overaccum_precip()
Let me try another method that might work better with our diurnal fluctuations we see at UPLO and VARA.
Max==min w/ a buffer¶
[15]:
dly_tank = cnsa['INST'].rolling('1D')
dly_max = dly_tank.max(engine=numba)
dly_min = dly_tank.min(engine=numba)
flat_tank = abs(dly_max - dly_min) < 0.288
raining = cnsa['TOT'] > 0
cnsa[flat_tank&raining]
[15]:
| INST | INST_Flag | TOT | TOT_Flag | ACC | ACC_Flag | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2017-10-01 00:05:00 | 239.699997 | <NA> | 0.2 | Q | 0.2 | Q |
| 2018-02-20 13:55:00 | 62.82 | <NA> | 0.09 | <NA> | 1216.300049 | <NA> |
| 2018-10-01 21:55:00 | 16.76 | <NA> | 0.01 | <NA> | 0.19 | <NA> |
| 2019-04-07 09:45:00 | 128.899994 | M | 0.2 | M | 1655.499023 | M |
| 2019-04-07 09:50:00 | 128.899994 | M | 0.2 | M | 1655.698975 | M |
| ... | ... | ... | ... | ... | ... | ... |
| 2019-04-09 20:40:00 | 128.899994 | M | 0.2 | M | 1796.896973 | M |
| 2019-04-09 20:45:00 | 128.899994 | M | 0.2 | M | 1797.097046 | M |
| 2019-09-30 22:25:00 | 64.290001 | <NA> | 0.12 | <NA> | 2423.826904 | <NA> |
| 2019-11-21 14:20:00 | 59.259998 | <NA> | 0.04 | <NA> | 166.479996 | <NA> |
| 2022-09-19 00:15:00 | 12.14 | <NA> | 0.03 | <NA> | 2238.02002 | <NA> |
715 rows × 6 columns
This still catches the April 2019 event and the rows not shown are all during that event. I’ll run down the newly identified times and see what they look like.
[16]:
cnsa.head(15)
[16]:
| INST | INST_Flag | TOT | TOT_Flag | ACC | ACC_Flag | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2017-10-01 00:00:00 | 239.5 | <NA> | 0.0 | <NA> | 0.0 | <NA> |
| 2017-10-01 00:05:00 | 239.699997 | <NA> | 0.2 | Q | 0.2 | Q |
| 2017-10-01 00:10:00 | 239.800003 | <NA> | 0.3 | <NA> | 0.5 | <NA> |
| 2017-10-01 00:15:00 | 239.899994 | <NA> | 0.1 | <NA> | 0.6 | <NA> |
| 2017-10-01 00:20:00 | 239.800003 | <NA> | 0.0 | <NA> | 0.6 | <NA> |
| 2017-10-01 00:25:00 | 239.899994 | <NA> | 0.0 | <NA> | 0.6 | <NA> |
| 2017-10-01 00:30:00 | 239.899994 | <NA> | 0.0 | <NA> | 0.6 | <NA> |
| 2017-10-01 00:35:00 | 239.899994 | <NA> | 0.0 | <NA> | 0.6 | <NA> |
| 2017-10-01 00:40:00 | 240.100006 | <NA> | 0.2 | <NA> | 0.8 | <NA> |
| 2017-10-01 00:45:00 | 240.300003 | <NA> | 0.2 | <NA> | 1.0 | <NA> |
| 2017-10-01 00:50:00 | 240.100006 | <NA> | 0.0 | <NA> | 1.0 | <NA> |
| 2017-10-01 00:55:00 | 240.300003 | <NA> | 0.0 | <NA> | 1.0 | <NA> |
| 2017-10-01 01:00:00 | 240.199997 | <NA> | 0.0 | <NA> | 1.0 | <NA> |
| 2017-10-01 01:05:00 | 240.300003 | <NA> | 0.0 | <NA> | 1.0 | <NA> |
| 2017-10-01 01:10:00 | 240.300003 | <NA> | 0.0 | <NA> | 1.0 | <NA> |
[17]:
pair_tank = qc.df_orig['INST_SH']
ax = flag.plot_flagged_day(pd.to_datetime('2/19/2018 1355'), 'CEN_01', tdelta='1D', auto_qa_event=clog_event, paired_tank=pair_tank)
[18]:
pair_tank = qc.df_orig['INST_SH']
ax = flag.plot_flagged_day(pd.to_datetime('9/30/2018 2055'), 'CEN_01', tdelta='1D', auto_qa_event=clog_event, paired_tank=pair_tank)
The first one seems like it is just because there aren’t enough values. I will set a minnumum number of datapoints to fix that. The second one looks like it’s just the first rain after a long rain-free period. The third one is the most telling. There is minimal bounce in the tank values, but simplepre.m is outputting some small precip values. So, after a full day where the tank value doesn’t change anymore than it’s precision, the little bit of precip being spit out by Fox’s algorythm is
caught by the filter. A great example of why this filter will not work.
The flat tank is only important because, in our first example, there were a long series of NAN values that were filled with a forward fill. If we move away from the flat tank criteria, this filter would be more focused on precip when there was tank bounce that may or may not be valid. Since we already have a fully fleshed out method for dealing with duplicates, let’s see if we can make that logic work to capture this instance.
QaRules.flag_duplicate_precip¶
This was developed above in Remove-duplicate-values, a sub-section of GCE-flagging-during-clogs, which was focused on doubling of precip after clogs dumped large amounts into the tanks all at once. It usually was flagged by GCE as an ‘F’ following a ‘J’.
The logic was: duplicate = large_ppt & repeat_val & flat_tank
where large_ppt was defined as:
precision = self.precision * int(large_ppt_size)
large_ppt = rainfall > large_ppt_size
and both flat_tank and repeat_val were Series.diff() < precision.
I’ll start by looking at what the filter grabed when I created the clean ACC data above with large_ppt = 1* precision, which is hopefully the duplicate values I already know about. Then I’ll start expanding and see if it will capture the April 2019 event.
[19]:
qc.df_orig[qc.qa_events.duplicate]
[19]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2018-11-21 17:00:00 | 217.199997 | <NA> | 5.1 | F | 205.630005 | F | 236.800003 | <NA> | 0.0 | <NA> | 192.649994 | <NA> |
| 2018-12-25 09:25:00 | 307.299988 | <NA> | 173.0 | F | 826.309998 | F | 311.399994 | <NA> | 0.0 | <NA> | 628.419983 | <NA> |
| 2019-06-28 00:20:00 | 141.300003 | <NA> | 21.200001 | F | 2247.336914 | F | 170.600006 | <NA> | 0.0 | <NA> | 1778.619995 | <NA> |
| 2019-09-08 13:20:00 | 35.189999 | <NA> | 4.48 | F | 2278.616943 | F | 59.5 | <NA> | 0.0 | <NA> | 1807.150024 | <NA> |
This had been locked so that large_ppt_size was an integer. Therefore 1mm was the minimum. I adjusted this to be increments of precision. This lets me adjust the definition of what is large ppt more finely to test this. ** large_ppt relabeled min_ppt.**
minimum_ppt > 0¶
Criteria:
# flat_tank and repeat_val use 1*precision
large_ppt = rainfall > 0
flat_tank & repeat_val & large_ppt
[20]:
pd.options.display.min_rows = 20
# reset
qc.qa_events.duplicate = False
# rerun
qc.flag_double_precip(ppt_col='TOT_SA', tank_col='INST_SA', min_ppt_nprecision=0,
flat_tank_nprecision=1, flat_ppt_nprecision=1)
print(qc.df_orig[qc.qa_events.duplicate].describe())
qc.df_orig[qc.qa_events.duplicate].head(20)
INST_SA TOT_SA ACC_SA INST_SH TOT_SH ACC_SH
count 24031.0 24031.0 24031.0 24031.0 24031.0 24031.0
mean 139.274711 0.15218 1055.776442 166.930365 0.10182 964.259268
std 99.73674 1.125805 635.093782 99.344013 0.162233 569.434429
min 7.499 0.01 0.08 22.74 0.0 0.0
25% 58.919998 0.1 454.029999 83.699997 0.0 432.779999
50% 116.900002 0.12 1089.650024 142.600006 0.0 987.150024
75% 205.300003 0.2 1544.019958 236.399994 0.2 1421.699951
max 498.200012 173.0 2423.826904 507.100006 1.7 2220.330078
[20]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-01 00:05:00 | 239.699997 | <NA> | 0.2 | Q | 0.2 | Q | 259.799988 | <NA> | 0.2 | Q | 0.2 | Q |
| 2017-10-01 00:10:00 | 239.800003 | <NA> | 0.3 | <NA> | 0.5 | <NA> | 260.100006 | <NA> | 0.5 | <NA> | 0.7 | <NA> |
| 2017-10-01 00:15:00 | 239.899994 | <NA> | 0.1 | <NA> | 0.6 | <NA> | 260.100006 | <NA> | 0.0 | <NA> | 0.7 | <NA> |
| 2017-10-01 00:40:00 | 240.100006 | <NA> | 0.2 | <NA> | 0.8 | <NA> | 260.399994 | <NA> | 0.0 | <NA> | 1.0 | <NA> |
| 2017-10-01 00:45:00 | 240.300003 | <NA> | 0.2 | <NA> | 1.0 | <NA> | 260.799988 | <NA> | 0.4 | <NA> | 1.4 | <NA> |
| 2017-10-01 01:55:00 | 240.399994 | <NA> | 0.1 | <NA> | 1.1 | <NA> | 260.799988 | <NA> | 0.0 | <NA> | 1.6 | <NA> |
| 2017-10-01 02:25:00 | 241.300003 | <NA> | 0.2 | <NA> | 2.0 | <NA> | 261.299988 | <NA> | 0.0 | <NA> | 2.2 | <NA> |
| 2017-10-01 02:50:00 | 241.399994 | <NA> | 0.1 | <NA> | 2.1 | <NA> | 261.700012 | <NA> | 0.0 | <NA> | 2.4 | <NA> |
| 2017-10-01 02:55:00 | 241.5 | <NA> | 0.1 | <NA> | 2.2 | <NA> | 261.799988 | <NA> | 0.0 | <NA> | 2.4 | <NA> |
| 2017-10-01 03:05:00 | 241.699997 | <NA> | 0.2 | <NA> | 2.4 | <NA> | 262.0 | <NA> | 0.0 | <NA> | 2.6 | <NA> |
| 2017-10-01 03:30:00 | 241.800003 | <NA> | 0.1 | <NA> | 2.5 | <NA> | 262.200012 | <NA> | 0.0 | <NA> | 3.1 | <NA> |
| 2017-10-01 03:50:00 | 242.399994 | <NA> | 0.2 | <NA> | 3.1 | <NA> | 263.0 | <NA> | 0.5 | <NA> | 3.6 | <NA> |
| 2017-10-01 04:00:00 | 242.899994 | <NA> | 0.2 | <NA> | 3.6 | <NA> | 263.200012 | <NA> | 0.2 | <NA> | 3.8 | <NA> |
| 2017-10-01 04:15:00 | 243.0 | <NA> | 0.1 | <NA> | 3.7 | <NA> | 263.399994 | <NA> | 0.1 | <NA> | 4.0 | <NA> |
| 2017-10-01 05:05:00 | 243.100006 | <NA> | 0.1 | <NA> | 3.8 | <NA> | 263.700012 | <NA> | 0.0 | <NA> | 4.3 | <NA> |
| 2017-10-01 06:05:00 | 243.199997 | <NA> | 0.1 | <NA> | 3.9 | <NA> | 263.5 | <NA> | 0.0 | <NA> | 4.3 | <NA> |
| 2017-10-01 06:25:00 | 243.399994 | <NA> | 0.2 | <NA> | 4.1 | <NA> | 263.899994 | <NA> | 0.2 | <NA> | 4.5 | <NA> |
| 2017-10-01 06:55:00 | 244.600006 | <NA> | 0.1 | <NA> | 5.3 | <NA> | 265.5 | <NA> | 0.3 | <NA> | 6.1 | <NA> |
| 2017-10-01 07:00:00 | 244.699997 | <NA> | 0.1 | <NA> | 5.4 | <NA> | 265.5 | <NA> | 0.0 | <NA> | 6.1 | <NA> |
| 2017-10-01 07:10:00 | 244.899994 | <NA> | 0.2 | <NA> | 5.6 | <NA> | 265.200012 | <NA> | 0.0 | <NA> | 6.1 | <NA> |
precip > 0 capture 24,310 rows, and the first 20 appear to be perfectly valid precip. It’s just real gradual. This threshold is clearly catching a lot of false positives. Although, if the tank is changing less than the precision of the sensor, it does make you wonder why we’re accumulating so much precip in this mode.
minimum_ppt > 0.2794¶
Criteria:
# flat_tank and repeat_val use 1*precision
large_ppt = rainfall > 0.2794
flat_tank & repeat_val & large_ppt
[21]:
pd.options.display.min_rows = 20
# reset
qc.qa_events.duplicate = False
# rerun
qc.flag_double_precip(ppt_col='TOT_SA', tank_col='INST_SA', min_ppt_nprecision=1,
flat_tank_nprecision=1, flat_ppt_nprecision=1)
qc.df_orig[qc.qa_events.duplicate]
[21]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-01 00:10:00 | 239.800003 | <NA> | 0.3 | <NA> | 0.5 | <NA> | 260.100006 | <NA> | 0.5 | <NA> | 0.7 | <NA> |
| 2017-10-19 14:45:00 | 346.5 | MM | 0.3 | M | 107.5 | M | 352.299988 | M | 0.2 | M | 103.699997 | M |
| 2017-10-19 14:50:00 | 346.5 | MM | 0.3 | M | 107.800003 | M | 352.299988 | M | 0.2 | M | 103.900002 | M |
| 2017-10-19 14:55:00 | 346.5 | MM | 0.3 | M | 108.099998 | M | 352.299988 | M | 0.2 | M | 104.099998 | M |
| 2017-10-19 15:00:00 | 346.5 | MM | 0.3 | M | 108.400002 | M | 352.299988 | M | 0.2 | M | 104.300003 | M |
| 2017-10-19 15:05:00 | 346.5 | MM | 0.3 | M | 108.699997 | M | 352.299988 | M | 0.2 | M | 104.5 | M |
| 2017-10-19 15:10:00 | 346.5 | MM | 0.3 | M | 109.0 | M | 352.299988 | M | 0.2 | M | 104.699997 | M |
| 2017-10-19 15:15:00 | 346.5 | MM | 0.3 | M | 109.300003 | M | 352.299988 | M | 0.2 | M | 104.900002 | M |
| 2017-10-19 15:20:00 | 346.5 | MM | 0.3 | M | 109.599998 | M | 352.299988 | M | 0.2 | M | 105.099998 | M |
| 2017-10-19 15:25:00 | 346.5 | MM | 0.3 | M | 109.900002 | M | 352.299988 | M | 0.2 | M | 105.300003 | M |
| 2017-10-19 15:30:00 | 346.5 | MM | 0.3 | M | 110.199997 | M | 352.299988 | M | 0.2 | M | 105.5 | M |
| 2018-11-21 17:00:00 | 217.199997 | <NA> | 5.1 | F | 205.630005 | F | 236.800003 | <NA> | 0.0 | <NA> | 192.649994 | <NA> |
| 2018-12-25 09:25:00 | 307.299988 | <NA> | 173.0 | F | 826.309998 | F | 311.399994 | <NA> | 0.0 | <NA> | 628.419983 | <NA> |
| 2019-06-28 00:20:00 | 141.300003 | <NA> | 21.200001 | F | 2247.336914 | F | 170.600006 | <NA> | 0.0 | <NA> | 1778.619995 | <NA> |
| 2019-09-08 13:20:00 | 35.189999 | <NA> | 4.48 | F | 2278.616943 | F | 59.5 | <NA> | 0.0 | <NA> | 1807.150024 | <NA> |
[22]:
clog_event['duplicate'] = qc.qa_events.duplicate
flag.apply_FSDB_flags(clog_event, clog_flag)
# add the original data to the instance to plot
flag.import_provisional_data(qc.df_orig, tank_col='INST_SA', ppt_col='TOT_SA',
ppt_flag_col='TOT_Flag_SA')
pair_tank = qc.df_orig['INST_SH']
ax = flag.plot_flagged_day(pd.to_datetime('10/19/17 1400'), 'CEN_01', tdelta='90min', auto_qa_event=clog_event, paired_tank=pair_tank)
[23]:
strt = pd.to_datetime('10/19/17 1400')
this_event = cn_ppt.loc[strt:pd.to_datetime('10/19/17 1530'), ['ACC_SA', 'ACC_SH']]
strt_diff = this_event.loc[strt, 'ACC_SA'] - this_event.loc[strt, 'ACC_SH']
this_event['ACC_SA'] -= strt_diff
this_event.plot(grid=True, legend=True, marker='*')
[23]:
<Axes: xlabel='Date'>
[24]:
ax2 = plt.twinx(plt.gca())
strt = pd.to_datetime('10/19/17 1400')
this_event = cn_ppt.loc[strt:pd.to_datetime('10/19/17 1530'), ['INST_SA', 'INST_SH']]
strt_diff = this_event.loc[strt, 'INST_SA'] - this_event.loc[strt, 'INST_SH']
this_event['INST_SA'] -= strt_diff
this_event.plot(ax=ax2, linestyle='--', marker='.', grid=True, legend=True).legend(loc='lower right')
[24]:
<matplotlib.legend.Legend at 0x2d94fc32a30>
precip > 0.2794 captures our original group, and a cluster from 2017 that look like it might need to be flagged. However, the April 2019 case is already excluded, because it accumulates at a rate of 0.2, and precision is 0.2794. So this is underflagging already. Let’s try half of precision.
minimum_ppt > 0.2794 / 2¶
Criteria:
# flat_tank and repeat_val use 1*precision
large_ppt = rainfall > (0.2794/2)
flat_tank & repeat_val & large_ppt
[25]:
pd.options.display.min_rows = 15
# reset
qc.qa_events.duplicate = False
# reflag
qc.flag_double_precip(ppt_col='TOT_SA', tank_col='INST_SA', min_ppt_nprecision=0.5,
flat_tank_nprecision=1, flat_ppt_nprecision=1)
qc.df_orig[qc.qa_events.duplicate]
[25]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-01 00:05:00 | 239.699997 | <NA> | 0.2 | Q | 0.2 | Q | 259.799988 | <NA> | 0.2 | Q | 0.2 | Q |
| 2017-10-01 00:10:00 | 239.800003 | <NA> | 0.3 | <NA> | 0.5 | <NA> | 260.100006 | <NA> | 0.5 | <NA> | 0.7 | <NA> |
| 2017-10-01 00:40:00 | 240.100006 | <NA> | 0.2 | <NA> | 0.8 | <NA> | 260.399994 | <NA> | 0.0 | <NA> | 1.0 | <NA> |
| 2017-10-01 00:45:00 | 240.300003 | <NA> | 0.2 | <NA> | 1.0 | <NA> | 260.799988 | <NA> | 0.4 | <NA> | 1.4 | <NA> |
| 2017-10-01 02:25:00 | 241.300003 | <NA> | 0.2 | <NA> | 2.0 | <NA> | 261.299988 | <NA> | 0.0 | <NA> | 2.2 | <NA> |
| 2017-10-01 03:05:00 | 241.699997 | <NA> | 0.2 | <NA> | 2.4 | <NA> | 262.0 | <NA> | 0.0 | <NA> | 2.6 | <NA> |
| 2017-10-01 03:50:00 | 242.399994 | <NA> | 0.2 | <NA> | 3.1 | <NA> | 263.0 | <NA> | 0.5 | <NA> | 3.6 | <NA> |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-09-29 00:55:00 | 24.35 | <NA> | 0.17 | <NA> | 2250.22998 | <NA> | 40.610001 | <NA> | 0.02 | <NA> | 2216.139893 | <NA> |
| 2022-09-29 02:30:00 | 24.879999 | <NA> | 0.16 | <NA> | 2250.76001 | <NA> | 40.860001 | <NA> | 0.12 | <NA> | 2216.389893 | <NA> |
| 2022-09-29 13:20:00 | 27.48 | <NA> | 0.16 | <NA> | 2253.360107 | <NA> | 44.450001 | <NA> | 0.11 | <NA> | 2219.97998 | <NA> |
| 2022-09-29 15:25:00 | 28.190001 | <NA> | 0.24 | <NA> | 2254.070068 | <NA> | 44.41 | <NA> | 0.0 | <NA> | 2220.01001 | <NA> |
| 2022-09-29 15:30:00 | 28.34 | <NA> | 0.15 | <NA> | 2254.219971 | <NA> | 44.32 | <NA> | 0.0 | <NA> | 2220.01001 | <NA> |
| 2022-09-29 15:35:00 | 28.49 | <NA> | 0.15 | <NA> | 2254.370117 | <NA> | 44.32 | <NA> | 0.0 | <NA> | 2220.01001 | <NA> |
| 2022-09-29 20:25:00 | 29.34 | <NA> | 0.14 | <NA> | 2255.219971 | <NA> | 44.650002 | <NA> | 0.06 | <NA> | 2220.179932 | <NA> |
9236 rows × 12 columns
OK, so if any precip captrues 24,000 rows, and precip greater than precision (0.2794) captures 15 rows, half precision does just a bit better than splitting the difference at 9,236. That might still be too much for me to drill down into. Maybe a list of dates will be more clear.
[26]:
qc.df_orig[qc.qa_events.duplicate].index
[26]:
Index([2017-10-01 00:05:00, 2017-10-01 00:10:00, 2017-10-01 00:40:00,
2017-10-01 00:45:00, 2017-10-01 02:25:00, 2017-10-01 03:05:00,
2017-10-01 03:50:00, 2017-10-01 04:00:00, 2017-10-01 06:25:00,
2017-10-01 07:10:00,
...
2022-09-21 10:25:00, 2022-09-21 17:35:00, 2022-09-21 19:40:00,
2022-09-29 00:55:00, 2022-09-29 02:30:00, 2022-09-29 13:20:00,
2022-09-29 15:25:00, 2022-09-29 15:30:00, 2022-09-29 15:35:00,
2022-09-29 20:25:00],
dtype='timestamp[s][pyarrow]', name='Date', length=9236)
minimum_ppt > 0.196¶
Criteria:
# flat_tank and repeat_val use 1*precision
large_ppt = rainfall > 0.196
flat_tank & repeat_val & large_ppt
[27]:
pd.options.display.min_rows = 5
# reset
qc.qa_events.duplicate = False
# reflag
qc.flag_double_precip(ppt_col='TOT_SA', tank_col='INST_SA', min_ppt_nprecision=0.7,
flat_tank_nprecision=1, flat_ppt_nprecision=1)
qc.df_orig[qc.qa_events.duplicate]
[27]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-01 00:05:00 | 239.699997 | <NA> | 0.2 | Q | 0.2 | Q | 259.799988 | <NA> | 0.2 | Q | 0.2 | Q |
| 2017-10-01 00:10:00 | 239.800003 | <NA> | 0.3 | <NA> | 0.5 | <NA> | 260.100006 | <NA> | 0.5 | <NA> | 0.7 | <NA> |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-09-21 19:40:00 | 19.91 | <NA> | 0.25 | <NA> | 2245.790039 | <NA> | 37.380001 | <NA> | 0.19 | <NA> | 2212.909912 | <NA> |
| 2022-09-29 15:25:00 | 28.190001 | <NA> | 0.24 | <NA> | 2254.070068 | <NA> | 44.41 | <NA> | 0.0 | <NA> | 2220.01001 | <NA> |
9119 rows × 12 columns
That’s not getting much better. It seems like to include the April 2019 event, I end up including all sorts of other individual moments that seem like they’re mostly false positives. Plus, sometimes it grabs a single timestamp from a day, which is more likely a result of the algorithm “catching up” to tank accumulation. Of course, I don’t really understand what in GCE causes this error, so I’m guessing a little, but just one timestamp in a day seems like it is part of a bigger and complicated picture for at least that hour’s precip. This method doesn’t seem to be working at finding consistent periods of repeat artifical values.
One key that seems to be missing here is looking over a longer period of time than just one timestep to the next. This method was developed to find individual timesteps that were duplicated. This precip question seems to need a longer looking approach, like the rolling window (tested below).
However, the focus has been on tuning the size of the precip. With a modificaiton, I could tune how flat the tank must be and how close the precip values.
Identical precision for flat and repeat¶
In the specific case of artificial precipitation, as found in April 2019, the tank is precisely flat, because it is the same value filled forward. And for some strange reason, the precipitation is also completely identical. The original method was looking for nearly identical tank values and nearly duplicated precipitation. This case is actually repeating, so I can make it exact. I had to adjust the function to allow for >= 0 difference.
Criteria:
# flat_tank and repeat_val use 0*precision
repeat_val = (abs(rainfall.diff()) <= 0) & (rainfall > 0)
flat_tank = df[tank_col].diff() <= 0
minimum_ppt = rainfall > 0
flat_tank & repeat_val & minimum_ppt
[85]:
#pd.set_option('display.min_rows', 100)
pd.options.display.max_rows = 35
[157]:
# reset
qc.qa_events.duplicate = False
# reflag
qc.flag_repeating_val_precip(ppt_col='TOT_SA', tank_col='INST_SA', flat_ppt_nprecision=0, flat_tank_nprecision=0,
min_ppt_nprecision=0)
qc.df_orig[qc.qa_events.duplicate]
[157]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-19 14:45:00 | 346.5 | MM | 0.3 | M | 107.5 | M | 352.299988 | M | 0.2 | M | 103.699997 | M |
| 2017-10-19 14:50:00 | 346.5 | MM | 0.3 | M | 107.800003 | M | 352.299988 | M | 0.2 | M | 103.900002 | M |
| 2017-10-19 14:55:00 | 346.5 | MM | 0.3 | M | 108.099998 | M | 352.299988 | M | 0.2 | M | 104.099998 | M |
| 2017-10-19 15:00:00 | 346.5 | MM | 0.3 | M | 108.400002 | M | 352.299988 | M | 0.2 | M | 104.300003 | M |
| 2017-10-19 15:05:00 | 346.5 | MM | 0.3 | M | 108.699997 | M | 352.299988 | M | 0.2 | M | 104.5 | M |
| 2017-10-19 15:10:00 | 346.5 | MM | 0.3 | M | 109.0 | M | 352.299988 | M | 0.2 | M | 104.699997 | M |
| 2017-10-19 15:15:00 | 346.5 | MM | 0.3 | M | 109.300003 | M | 352.299988 | M | 0.2 | M | 104.900002 | M |
| 2017-10-19 15:20:00 | 346.5 | MM | 0.3 | M | 109.599998 | M | 352.299988 | M | 0.2 | M | 105.099998 | M |
| 2017-10-19 15:25:00 | 346.5 | MM | 0.3 | M | 109.900002 | M | 352.299988 | M | 0.2 | M | 105.300003 | M |
| 2017-10-19 15:30:00 | 346.5 | MM | 0.3 | M | 110.199997 | M | 352.299988 | M | 0.2 | M | 105.5 | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-03-13 02:10:00 | 39.169998 | MM | 0.26 | M | 1269.619995 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:15:00 | 39.169998 | MM | 0.26 | M | 1269.880005 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:20:00 | 39.169998 | MM | 0.26 | M | 1270.140015 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:25:00 | 39.169998 | MM | 0.26 | M | 1270.400024 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:30:00 | 39.169998 | MM | 0.26 | M | 1270.660034 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:35:00 | 39.169998 | MM | 0.26 | M | 1270.920044 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:40:00 | 39.169998 | MM | 0.26 | M | 1271.180054 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:45:00 | 39.169998 | MM | 0.26 | M | 1271.439941 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:50:00 | 39.169998 | MM | 0.26 | M | 1271.699951 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:55:00 | 39.169998 | MM | 0.26 | M | 1271.959961 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
1309 rows × 12 columns
[158]:
qc.df_orig.loc[qc.qa_events.duplicate,'TOT_SA'].sum()
[158]:
224.22000300884247
That looks like it is working really well. It always has a string of several values and they look suspicious. And I don’t see any singular timestamps. The only downside is, if we start getting real tank values from provisional, it will stop working. I tried adjusting it to allow even 0.05*0.279 bounce in the tank value, and it adds a bunch singular timestamps without the timestamp before or after. So… this works on this particular version of GCE output, but will have a future break point.
It makes sense to see if we can put a QA rule in place that will continue to catch this problem, but will succeed against the future break point.
I want to plot one of the examples where the data wasn’t flagged missing before moving on.
[30]:
clog_event['duplicate'] = qc.qa_events.duplicate
flag.apply_FSDB_flags(clog_event, clog_flag)
# add the original data to the instance to plot
flag.import_provisional_data(qc.df_orig, tank_col='INST_SA', ppt_col='TOT_SA',
ppt_flag_col='TOT_Flag_SA')
pair_tank = qc.df_orig['INST_SH']
ax = flag.plot_flagged_day(pd.to_datetime('6/3/20 1200'), 'CEN_01', tdelta='2D', auto_qa_event=clog_event, paired_tank=pair_tank)
[31]:
plt.grid(True)
This is a little confusing. The tank value isn’t constant, so let’s double check this.
[32]:
ax = flag.plot_flagged_day(pd.to_datetime('6/4/20 0900'), 'CEN_01', tdelta='4h', auto_qa_event=clog_event, paired_tank=pair_tank)
[33]:
strt, end = pd.to_datetime('6/4/20 0800'), pd.to_datetime('6/4/20 1000')
qc.df_orig.loc[strt:end, ['INST_SA','INST_Flag_SA','TOT_SA','TOT_Flag_SA']]
[33]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | |
|---|---|---|---|---|
| Date | ||||
| 2020-06-04 08:00:00 | 19.5 | <NA> | 0.05 | <NA> |
| 2020-06-04 08:05:00 | 19.5 | <NA> | 0.05 | <NA> |
| 2020-06-04 08:10:00 | 19.5 | <NA> | 0.05 | <NA> |
| 2020-06-04 08:15:00 | 19.5 | <NA> | 0.05 | <NA> |
| 2020-06-04 08:20:00 | 19.5 | <NA> | 0.05 | <NA> |
| 2020-06-04 08:25:00 | 19.5 | <NA> | 0.05 | <NA> |
| 2020-06-04 08:30:00 | 19.5 | <NA> | 0.05 | <NA> |
| 2020-06-04 08:35:00 | 19.5 | <NA> | 0.05 | <NA> |
| 2020-06-04 08:40:00 | 19.5 | <NA> | 0.05 | <NA> |
| 2020-06-04 08:45:00 | 19.629999 | <NA> | 0.05 | <NA> |
| 2020-06-04 08:50:00 | 19.75 | <NA> | 0.05 | <NA> |
| 2020-06-04 08:55:00 | 19.629999 | <NA> | 0.05 | <NA> |
| 2020-06-04 09:00:00 | 19.75 | <NA> | 0.05 | <NA> |
| 2020-06-04 09:05:00 | 19.75 | <NA> | 0.05 | <NA> |
| 2020-06-04 09:10:00 | 19.75 | <NA> | 0.05 | <NA> |
| 2020-06-04 09:15:00 | 19.629999 | <NA> | 0.05 | <NA> |
| 2020-06-04 09:20:00 | 19.629999 | <NA> | 0.05 | <NA> |
| 2020-06-04 09:25:00 | 19.75 | <NA> | 0.05 | <NA> |
| 2020-06-04 09:30:00 | 19.75 | <NA> | 0.05 | <NA> |
| 2020-06-04 09:35:00 | 19.5 | <NA> | 0.05 | <NA> |
| 2020-06-04 09:40:00 | 19.5 | <NA> | 0.05 | <NA> |
| 2020-06-04 09:45:00 | 19.629999 | <NA> | 0.05 | <NA> |
| 2020-06-04 09:50:00 | 19.629999 | <NA> | 0.05 | <NA> |
| 2020-06-04 09:55:00 | 19.629999 | <NA> | 0.05 | <NA> |
| 2020-06-04 10:00:00 | 19.75 | <NA> | 0.05 | <NA> |
[34]:
cnsa.loc[pd.to_datetime('6/3/20 1200'):pd.to_datetime('6/4/20 1200'), 'TOT'].sum()
[34]:
10.950000163167715
We’re only catching parts! Every time the tank shifts 0.12, we stop flagging. We need to figure out how to make that continuous. 0.12 does seem to be the SD of the reading with a flat tank. But, as already stated above, that leads to all sorts of lone time stamps. Which, may highlight problems with simple_pre.m, but are not the case we are trying to run down. So I need to find a way to broaden the criteria for a flat tank to allow at least 0.12 mm of bounce, but only capture values that
are consequtive.
Around 1/2” during a rain free period is worth adressing.
Requiring consecutive values¶
Hopefully with a requirement for consequitive values I can increase the precision for assessing when a tank is flat to capture the sensor bounce.
Shift 1¶
I’ll shift before and after and see if it also is True. I’ll start by making sure it works with the flat precision before trying to expand.
[159]:
after_repeat = qc.qa_events.duplicate.shift(1)
before_repeat = qc.qa_events.duplicate.shift(-1)
duplicates = qc.qa_events.duplicate & (after_repeat | before_repeat )
strt, end = pd.to_datetime('4/9/19 2015'), pd.to_datetime('4/9/19 2115')
qc.df_orig.loc[strt:end, ['INST_SA','INST_Flag_SA','TOT_SA','TOT_Flag_SA']]
[159]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | |
|---|---|---|---|---|
| Date | ||||
| 2019-04-09 20:15:00 | 128.899994 | M | 0.2 | M |
| 2019-04-09 20:20:00 | 128.899994 | M | 0.2 | M |
| 2019-04-09 20:25:00 | 128.899994 | M | 0.2 | M |
| 2019-04-09 20:30:00 | 128.899994 | M | 0.2 | M |
| 2019-04-09 20:35:00 | 128.899994 | M | 0.2 | M |
| 2019-04-09 20:40:00 | 128.899994 | M | 0.2 | M |
| 2019-04-09 20:45:00 | 128.899994 | M | 0.2 | M |
| 2019-04-09 20:50:00 | 128.899994 | M | 0.0 | M |
| 2019-04-09 20:55:00 | 128.899994 | M | 0.0 | M |
| 2019-04-09 21:00:00 | 128.899994 | M | 0.0 | M |
| 2019-04-09 21:05:00 | 128.899994 | M | 0.0 | M |
| 2019-04-09 21:10:00 | 128.899994 | M | 0.0 | M |
| 2019-04-09 21:15:00 | 128.899994 | M | 0.0 | M |
[160]:
duplicates[strt:end]
[160]:
Date
2019-04-09 20:15:00 True
2019-04-09 20:20:00 True
2019-04-09 20:25:00 True
2019-04-09 20:30:00 True
2019-04-09 20:35:00 True
2019-04-09 20:40:00 True
2019-04-09 20:45:00 True
2019-04-09 20:50:00 False
2019-04-09 20:55:00 False
2019-04-09 21:00:00 False
2019-04-09 21:05:00 False
2019-04-09 21:10:00 False
2019-04-09 21:15:00 False
Name: duplicate, dtype: bool[pyarrow]
Let’s make sure it catches the start of repeats cleanly too, then I’ll see how it does globally.
[161]:
strt, end = pd.to_datetime('4/6/19 0900'), pd.to_datetime('4/6/19 1030')
qc.df_orig.loc[strt:end, ['INST_SA','INST_Flag_SA','TOT_SA','TOT_Flag_SA']]
[161]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | |
|---|---|---|---|---|
| Date | ||||
| 2019-04-06 09:00:00 | 126.199997 | <NA> | 0.6 | <NA> |
| 2019-04-06 09:05:00 | 126.300003 | <NA> | 0.1 | <NA> |
| 2019-04-06 09:10:00 | 126.400002 | <NA> | 0.1 | <NA> |
| 2019-04-06 09:15:00 | 126.5 | <NA> | 0.1 | <NA> |
| 2019-04-06 09:20:00 | 126.699997 | <NA> | 0.2 | <NA> |
| 2019-04-06 09:25:00 | 126.900002 | <NA> | 0.2 | <NA> |
| 2019-04-06 09:30:00 | 127.300003 | <NA> | 0.4 | <NA> |
| 2019-04-06 09:35:00 | 127.800003 | <NA> | 0.5 | <NA> |
| 2019-04-06 09:40:00 | 128.300003 | <NA> | 0.5 | <NA> |
| 2019-04-06 09:45:00 | 128.600006 | <NA> | 0.3 | <NA> |
| 2019-04-06 09:50:00 | 128.699997 | <NA> | 0.1 | <NA> |
| 2019-04-06 09:55:00 | 128.899994 | <NA> | 0.2 | <NA> |
| 2019-04-06 10:00:00 | 128.899994 | M | 0.2 | M |
| 2019-04-06 10:05:00 | 128.899994 | M | 0.2 | M |
| 2019-04-06 10:10:00 | 128.899994 | M | 0.2 | M |
| 2019-04-06 10:15:00 | 128.899994 | M | 0.2 | M |
| 2019-04-06 10:20:00 | 128.899994 | M | 0.2 | M |
| 2019-04-06 10:25:00 | 128.899994 | M | 0.2 | M |
| 2019-04-06 10:30:00 | 128.899994 | M | 0.2 | M |
[162]:
duplicates[strt:end]
[162]:
Date
2019-04-06 09:00:00 False
2019-04-06 09:05:00 False
2019-04-06 09:10:00 False
2019-04-06 09:15:00 False
2019-04-06 09:20:00 False
2019-04-06 09:25:00 False
2019-04-06 09:30:00 False
2019-04-06 09:35:00 False
2019-04-06 09:40:00 False
2019-04-06 09:45:00 False
2019-04-06 09:50:00 False
2019-04-06 09:55:00 False
2019-04-06 10:00:00 True
2019-04-06 10:05:00 True
2019-04-06 10:10:00 True
2019-04-06 10:15:00 True
2019-04-06 10:20:00 True
2019-04-06 10:25:00 True
2019-04-06 10:30:00 True
Name: duplicate, dtype: bool[pyarrow]
Let’s look at what’s selected as a whole.
[124]:
qc.df_orig[duplicates]
[124]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-01 08:15:00 | 245.100006 | <NA> | 0.1 | <NA> | 5.8 | <NA> | 265.5 | <NA> | 0.0 | <NA> | 6.1 | <NA> |
| 2017-10-01 08:20:00 | 245.199997 | <NA> | 0.1 | <NA> | 5.9 | <NA> | 265.799988 | <NA> | 0.3 | <NA> | 6.4 | <NA> |
| 2017-10-13 03:05:00 | 324.600006 | T | 0.1 | <NA> | 85.300003 | <NA> | 343.700012 | <NA> | 0.0 | <NA> | 84.300003 | <NA> |
| 2017-10-13 03:10:00 | 324.700012 | T | 0.1 | <NA> | 85.400002 | <NA> | 343.700012 | <NA> | 0.0 | <NA> | 84.300003 | <NA> |
| 2017-10-13 03:15:00 | 324.799988 | T | 0.1 | <NA> | 85.5 | <NA> | 343.799988 | <NA> | 0.1 | <NA> | 84.400002 | <NA> |
| 2017-10-13 06:45:00 | 328.700012 | T | 0.1 | <NA> | 89.400002 | <NA> | 347.200012 | <NA> | 0.3 | <NA> | 87.800003 | <NA> |
| 2017-10-13 06:50:00 | 328.799988 | T | 0.1 | <NA> | 89.5 | <NA> | 347.200012 | <NA> | 0.0 | <NA> | 87.800003 | <NA> |
| 2017-10-19 14:45:00 | 346.5 | MM | 0.3 | M | 107.5 | M | 352.299988 | M | 0.2 | M | 103.699997 | M |
| 2017-10-19 14:50:00 | 346.5 | MM | 0.3 | M | 107.800003 | M | 352.299988 | M | 0.2 | M | 103.900002 | M |
| 2017-10-19 14:55:00 | 346.5 | MM | 0.3 | M | 108.099998 | M | 352.299988 | M | 0.2 | M | 104.099998 | M |
| 2017-10-19 15:00:00 | 346.5 | MM | 0.3 | M | 108.400002 | M | 352.299988 | M | 0.2 | M | 104.300003 | M |
| 2017-10-19 15:05:00 | 346.5 | MM | 0.3 | M | 108.699997 | M | 352.299988 | M | 0.2 | M | 104.5 | M |
| 2017-10-19 15:10:00 | 346.5 | MM | 0.3 | M | 109.0 | M | 352.299988 | M | 0.2 | M | 104.699997 | M |
| 2017-10-19 15:15:00 | 346.5 | MM | 0.3 | M | 109.300003 | M | 352.299988 | M | 0.2 | M | 104.900002 | M |
| 2017-10-19 15:20:00 | 346.5 | MM | 0.3 | M | 109.599998 | M | 352.299988 | M | 0.2 | M | 105.099998 | M |
| 2017-10-19 15:25:00 | 346.5 | MM | 0.3 | M | 109.900002 | M | 352.299988 | M | 0.2 | M | 105.300003 | M |
| 2017-10-19 15:30:00 | 346.5 | MM | 0.3 | M | 110.199997 | M | 352.299988 | M | 0.2 | M | 105.5 | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-05-14 05:05:00 | 309.700012 | <NA> | 0.1 | <NA> | 1902.439941 | <NA> | 321.299988 | <NA> | 0.3 | <NA> | 1773.589966 | <NA> |
| 2022-05-14 05:10:00 | 309.799988 | <NA> | 0.1 | <NA> | 1902.540039 | <NA> | 321.600006 | <NA> | 0.3 | <NA> | 1773.890015 | <NA> |
| 2022-05-30 09:05:00 | 86.099998 | <NA> | 0.1 | <NA> | 2000.290039 | <NA> | 113.699997 | <NA> | 0.5 | <NA> | 1944.400024 | <NA> |
| 2022-05-30 09:10:00 | 86.199997 | <NA> | 0.1 | <NA> | 2000.390015 | <NA> | 113.599998 | <NA> | 0.0 | <NA> | 1944.400024 | <NA> |
| 2022-05-30 13:40:00 | 88.099998 | <NA> | 0.1 | <NA> | 2002.290039 | <NA> | 115.5 | <NA> | 0.3 | <NA> | 1946.199951 | <NA> |
| 2022-05-30 13:45:00 | 88.199997 | <NA> | 0.1 | <NA> | 2002.390015 | <NA> | 115.5 | <NA> | 0.0 | <NA> | 1946.199951 | <NA> |
| 2022-06-04 22:55:00 | 114.400002 | <NA> | 0.1 | <NA> | 2028.589966 | <NA> | 141.199997 | <NA> | 0.3 | <NA> | 1971.900024 | <NA> |
| 2022-06-04 23:00:00 | 114.5 | <NA> | 0.1 | <NA> | 2028.689941 | <NA> | 141.100006 | <NA> | 0.0 | <NA> | 1971.900024 | <NA> |
| 2022-06-05 09:30:00 | 128.5 | <NA> | 0.1 | <NA> | 2042.689941 | <NA> | 154.199997 | <NA> | 0.0 | <NA> | 1984.900024 | <NA> |
| 2022-06-05 09:35:00 | 128.600006 | <NA> | 0.1 | <NA> | 2042.790039 | <NA> | 154.300003 | <NA> | 0.1 | <NA> | 1985.0 | <NA> |
| 2022-06-12 20:10:00 | 250.899994 | <NA> | 0.1 | <NA> | 2165.090088 | <NA> | 270.700012 | <NA> | 0.0 | <NA> | 2101.399902 | <NA> |
| 2022-06-12 20:15:00 | 251.0 | <NA> | 0.1 | <NA> | 2165.189941 | <NA> | 270.299988 | <NA> | 0.0 | <NA> | 2101.399902 | <NA> |
| 2022-06-13 20:35:00 | 263.200012 | <NA> | 0.1 | <NA> | 2177.389893 | <NA> | 281.600006 | <NA> | 0.0 | <NA> | 2112.5 | <NA> |
| 2022-06-13 20:40:00 | 263.299988 | <NA> | 0.1 | <NA> | 2177.48999 | <NA> | 281.899994 | <NA> | 0.1 | <NA> | 2112.600098 | <NA> |
| 2022-06-13 20:45:00 | 263.399994 | <NA> | 0.1 | <NA> | 2177.590088 | <NA> | 281.5 | <NA> | 0.0 | <NA> | 2112.600098 | <NA> |
| 2022-06-14 09:05:00 | 266.399994 | <NA> | 0.1 | <NA> | 2180.590088 | <NA> | 283.799988 | <NA> | 0.0 | <NA> | 2114.899902 | <NA> |
| 2022-06-14 09:10:00 | 266.5 | <NA> | 0.1 | <NA> | 2180.689941 | <NA> | 283.899994 | <NA> | 0.0 | <NA> | 2114.899902 | <NA> |
1798 rows × 12 columns
Just 2 in a row seems… a little small. I don’t love all the small events this is catching. It becomes less about catching an artificial repeat and more about questioning simple_pre.m. Which is not a task I’ve taken on.
3 in a row (shift 2)¶
The requirement can be both 1 and 2 spaces behind (or in front) of the value also have to be true. That would mean a minimum of 3. A couple speed testst to determine the fastest way to calculate this, and then see if we still capture the right start and stop times.
[107]:
%timeit qc.qa_events['duplicate'].shift([1,2]).all()
2.27 ms ± 116 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
[108]:
%%timeit
after_repeat = qc.qa_events['duplicate'].shift(1)
after_repeat2 = after_repeat & qc.qa_events.duplicate.shift(2)
443 µs ± 23.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[109]:
%%timeit
after_repeat = qc.qa_events['duplicate'].shift(1)
after_repeat &= qc.qa_events['duplicate'].shift(2)
581 µs ± 37.2 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[164]:
after_repeat = qc.qa_events.duplicate.shift(1)
after_repeat2 = after_repeat & qc.qa_events.duplicate.shift(2)
before_repeat = qc.qa_events.duplicate.shift(-1)
before_repeat2 = before_repeat & qc.qa_events.duplicate.shift(-2)
duplicates = qc.qa_events.duplicate & (after_repeat2 | before_repeat2 )
strt, end = pd.to_datetime('4/9/19 2015'), pd.to_datetime('4/9/19 2115')
print(duplicates[strt:end])
strt, end = pd.to_datetime('4/6/19 0900'), pd.to_datetime('4/6/19 1030')
duplicates[strt:end]
Date
2019-04-09 20:15:00 True
2019-04-09 20:20:00 True
2019-04-09 20:25:00 True
2019-04-09 20:30:00 True
2019-04-09 20:35:00 True
2019-04-09 20:40:00 True
2019-04-09 20:45:00 True
2019-04-09 20:50:00 False
2019-04-09 20:55:00 False
2019-04-09 21:00:00 False
2019-04-09 21:05:00 False
2019-04-09 21:10:00 False
2019-04-09 21:15:00 False
Name: duplicate, dtype: bool[pyarrow]
[164]:
Date
2019-04-06 09:00:00 False
2019-04-06 09:05:00 False
2019-04-06 09:10:00 False
2019-04-06 09:15:00 False
2019-04-06 09:20:00 False
2019-04-06 09:25:00 False
2019-04-06 09:30:00 False
2019-04-06 09:35:00 False
2019-04-06 09:40:00 False
2019-04-06 09:45:00 False
2019-04-06 09:50:00 False
2019-04-06 09:55:00 False
2019-04-06 10:00:00 True
2019-04-06 10:05:00 True
2019-04-06 10:10:00 True
2019-04-06 10:15:00 True
2019-04-06 10:20:00 True
2019-04-06 10:25:00 True
2019-04-06 10:30:00 True
Name: duplicate, dtype: bool[pyarrow]
That confirms that it still works to capture the beginning and end of the big event that started this analysis. Now to see if the whole process works.
widen the precision for defining a flat tank
look to capture all of the 6/13/22 event
check that other events captured are consecutive
[42]:
0.12 / qc.precision
[42]:
0.4294917680744453
[43]:
0.43*qc.precision
[43]:
0.12014199999999998
[170]:
pd.set_option('display.min_rows', 30)
# reset
qc.qa_events.duplicate = False
# reflag
qc.flag_repeating_val_precip(ppt_col='TOT_SA', tank_col='INST_SA', flat_ppt_nprecision=0, flat_tank_nprecision=0.43,
min_ppt_nprecision=0)
after_repeat = qc.qa_events.duplicate.shift(1)
after_repeat2 = after_repeat & qc.qa_events.duplicate.shift(2)
before_repeat = qc.qa_events.duplicate.shift(-1)
before_repeat2 = before_repeat & qc.qa_events.duplicate.shift(-2)
duplicates = qc.qa_events.duplicate & (after_repeat2 | before_repeat2 )
qc.df_orig[duplicates]
[170]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-13 03:05:00 | 324.600006 | T | 0.1 | <NA> | 85.300003 | <NA> | 343.700012 | <NA> | 0.0 | <NA> | 84.300003 | <NA> |
| 2017-10-13 03:15:00 | 324.799988 | T | 0.1 | <NA> | 85.5 | <NA> | 343.799988 | <NA> | 0.1 | <NA> | 84.400002 | <NA> |
| 2017-10-19 14:45:00 | 346.5 | MM | 0.3 | M | 107.5 | M | 352.299988 | M | 0.2 | M | 103.699997 | M |
| 2017-10-19 14:50:00 | 346.5 | MM | 0.3 | M | 107.800003 | M | 352.299988 | M | 0.2 | M | 103.900002 | M |
| 2017-10-19 14:55:00 | 346.5 | MM | 0.3 | M | 108.099998 | M | 352.299988 | M | 0.2 | M | 104.099998 | M |
| 2017-10-19 15:00:00 | 346.5 | MM | 0.3 | M | 108.400002 | M | 352.299988 | M | 0.2 | M | 104.300003 | M |
| 2017-10-19 15:05:00 | 346.5 | MM | 0.3 | M | 108.699997 | M | 352.299988 | M | 0.2 | M | 104.5 | M |
| 2017-10-19 15:10:00 | 346.5 | MM | 0.3 | M | 109.0 | M | 352.299988 | M | 0.2 | M | 104.699997 | M |
| 2017-10-19 15:15:00 | 346.5 | MM | 0.3 | M | 109.300003 | M | 352.299988 | M | 0.2 | M | 104.900002 | M |
| 2017-10-19 15:20:00 | 346.5 | MM | 0.3 | M | 109.599998 | M | 352.299988 | M | 0.2 | M | 105.099998 | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-03-13 02:50:00 | 39.169998 | MM | 0.26 | M | 1271.699951 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:55:00 | 39.169998 | MM | 0.26 | M | 1271.959961 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-04-19 12:15:00 | 354.100006 | <NA> | 0.1 | <NA> | 1586.890015 | <NA> | 362.600006 | <NA> | 0.2 | <NA> | 1475.040039 | <NA> |
| 2022-04-19 12:25:00 | 354.299988 | <NA> | 0.1 | <NA> | 1587.089966 | <NA> | 362.700012 | <NA> | 0.1 | <NA> | 1475.140015 | <NA> |
| 2022-04-20 15:55:00 | 370.299988 | <NA> | 0.1 | <NA> | 1603.089966 | <NA> | 377.799988 | <NA> | 0.0 | <NA> | 1490.439941 | <NA> |
| 2022-04-20 16:05:00 | 370.5 | <NA> | 0.1 | <NA> | 1603.290039 | <NA> | 378.100006 | <NA> | 0.0 | <NA> | 1490.73999 | <NA> |
| 2022-05-06 19:20:00 | 182.399994 | <NA> | 0.1 | <NA> | 1775.140015 | <NA> | 201.199997 | <NA> | 0.0 | <NA> | 1653.48999 | <NA> |
| 2022-05-06 19:30:00 | 182.600006 | <NA> | 0.1 | <NA> | 1775.339966 | <NA> | 201.399994 | <NA> | 0.2 | <NA> | 1653.689941 | <NA> |
| 2022-06-13 20:35:00 | 263.200012 | <NA> | 0.1 | <NA> | 2177.389893 | <NA> | 281.600006 | <NA> | 0.0 | <NA> | 2112.5 | <NA> |
| 2022-06-13 20:45:00 | 263.399994 | <NA> | 0.1 | <NA> | 2177.590088 | <NA> | 281.5 | <NA> | 0.0 | <NA> | 2112.600098 | <NA> |
1359 rows × 12 columns
We only add 50 new flags, which is good, but there is some weirdness here…It is ocassionally grabbing every other timestep…
[96]:
pd.set_option('display.max_rows', 20)
strt, end = pd.to_datetime('6/13/22 2000'), pd.to_datetime('6/13/22 2100')
dups = pd.DataFrame({'qaDuplicates':qc.qa_events['duplicate'],'after1':after_repeat,
'after2':after_repeat2, 'before1':before_repeat,
'before2':before_repeat2, 'combined':duplicates,} )
dups[strt:end]
[96]:
| qaDuplicates | after1 | after2 | before1 | before2 | combined | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2022-06-13 20:00:00 | False | False | False | False | False | False |
| 2022-06-13 20:05:00 | False | False | False | False | False | False |
| 2022-06-13 20:10:00 | False | False | False | False | False | False |
| 2022-06-13 20:15:00 | False | False | False | False | False | False |
| 2022-06-13 20:20:00 | False | False | False | False | False | False |
| 2022-06-13 20:25:00 | False | False | False | False | False | False |
| 2022-06-13 20:30:00 | False | False | False | True | True | False |
| 2022-06-13 20:35:00 | True | False | False | True | True | True |
| 2022-06-13 20:40:00 | True | True | False | True | False | False |
| 2022-06-13 20:45:00 | True | True | True | False | False | True |
| 2022-06-13 20:50:00 | False | True | True | False | False | False |
| 2022-06-13 20:55:00 | False | False | False | True | False | False |
| 2022-06-13 21:00:00 | True | False | False | False | False | False |
Ok, so basically, if the rule is it has to be preceded or followed by 2 conesqutive True’s, that is the case for the first and the last, but not the middle when there is only three. Let’s change some & cases to | cases and see if that get’s resolved.
[167]:
pd.set_option('display.max_rows', 20)
strt, end = pd.to_datetime('6/13/22 2000'), pd.to_datetime('6/13/22 2100')
after_repeat = qc.qa_events.duplicate.shift(1)
after_repeat2 = after_repeat | qc.qa_events.duplicate.shift(2)
before_repeat = qc.qa_events.duplicate.shift(-1)
before_repeat2 = before_repeat | qc.qa_events.duplicate.shift(-2)
duplicates = qc.qa_events.duplicate & (after_repeat2 | before_repeat2 )
dups = pd.DataFrame({'qaDuplicates':qc.qa_events['duplicate'],'after1':after_repeat,
'after2':after_repeat2, 'before1':before_repeat,
'before2':before_repeat2, 'combined':duplicates,} )
dups[strt:end]
[167]:
| qaDuplicates | after1 | after2 | before1 | before2 | combined | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2022-06-13 20:00:00 | False | False | False | False | False | False |
| 2022-06-13 20:05:00 | False | False | False | False | False | False |
| 2022-06-13 20:10:00 | False | False | False | False | False | False |
| 2022-06-13 20:15:00 | False | False | False | False | False | False |
| 2022-06-13 20:20:00 | False | False | False | False | False | False |
| 2022-06-13 20:25:00 | False | False | False | False | True | False |
| 2022-06-13 20:30:00 | False | False | False | True | True | False |
| 2022-06-13 20:35:00 | True | False | False | True | True | True |
| 2022-06-13 20:40:00 | True | True | True | True | True | True |
| 2022-06-13 20:45:00 | True | True | True | False | False | True |
| 2022-06-13 20:50:00 | False | True | True | False | True | False |
| 2022-06-13 20:55:00 | False | False | True | True | True | False |
| 2022-06-13 21:00:00 | True | False | False | False | False | False |
That’s working better.
[169]:
pd.set_option('display.min_rows', 15)
qc.df_orig[duplicates]
[169]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-01 08:15:00 | 245.100006 | <NA> | 0.1 | <NA> | 5.8 | <NA> | 265.5 | <NA> | 0.0 | <NA> | 6.1 | <NA> |
| 2017-10-01 08:20:00 | 245.199997 | <NA> | 0.1 | <NA> | 5.9 | <NA> | 265.799988 | <NA> | 0.3 | <NA> | 6.4 | <NA> |
| 2017-10-13 03:05:00 | 324.600006 | T | 0.1 | <NA> | 85.300003 | <NA> | 343.700012 | <NA> | 0.0 | <NA> | 84.300003 | <NA> |
| 2017-10-13 03:10:00 | 324.700012 | T | 0.1 | <NA> | 85.400002 | <NA> | 343.700012 | <NA> | 0.0 | <NA> | 84.300003 | <NA> |
| 2017-10-13 03:15:00 | 324.799988 | T | 0.1 | <NA> | 85.5 | <NA> | 343.799988 | <NA> | 0.1 | <NA> | 84.400002 | <NA> |
| 2017-10-13 06:45:00 | 328.700012 | T | 0.1 | <NA> | 89.400002 | <NA> | 347.200012 | <NA> | 0.3 | <NA> | 87.800003 | <NA> |
| 2017-10-13 06:50:00 | 328.799988 | T | 0.1 | <NA> | 89.5 | <NA> | 347.200012 | <NA> | 0.0 | <NA> | 87.800003 | <NA> |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-06-12 20:10:00 | 250.899994 | <NA> | 0.1 | <NA> | 2165.090088 | <NA> | 270.700012 | <NA> | 0.0 | <NA> | 2101.399902 | <NA> |
| 2022-06-12 20:15:00 | 251.0 | <NA> | 0.1 | <NA> | 2165.189941 | <NA> | 270.299988 | <NA> | 0.0 | <NA> | 2101.399902 | <NA> |
| 2022-06-13 20:35:00 | 263.200012 | <NA> | 0.1 | <NA> | 2177.389893 | <NA> | 281.600006 | <NA> | 0.0 | <NA> | 2112.5 | <NA> |
| 2022-06-13 20:40:00 | 263.299988 | <NA> | 0.1 | <NA> | 2177.48999 | <NA> | 281.899994 | <NA> | 0.1 | <NA> | 2112.600098 | <NA> |
| 2022-06-13 20:45:00 | 263.399994 | <NA> | 0.1 | <NA> | 2177.590088 | <NA> | 281.5 | <NA> | 0.0 | <NA> | 2112.600098 | <NA> |
| 2022-06-14 09:05:00 | 266.399994 | <NA> | 0.1 | <NA> | 2180.590088 | <NA> | 283.799988 | <NA> | 0.0 | <NA> | 2114.899902 | <NA> |
| 2022-06-14 09:10:00 | 266.5 | <NA> | 0.1 | <NA> | 2180.689941 | <NA> | 283.899994 | <NA> | 0.0 | <NA> | 2114.899902 | <NA> |
1807 rows × 12 columns
Changing the condition to or let’s in sequences of just 2 in a row. That definitely isn’t what we set out to look for here. And some of them are just small little blips. Not something we’ve decided to filter, and nothing that is completely unreasonable at the measurement precision being used.
Going back to the original strategy using & but make it 3 before or 3 after fixes this case. Of course, it will continue to have the problem seen above: if we require it to be preceded or followed by 3, then a string of 4 will only flag the first and last.
And if it is changed to be proceeded & followed the beginning and end are always trimmed off. Plus the added problem of, in the case looked at above on 6/13/22, only selecting the middle timestep and discarding all the others.
Time for a new approach.
duplcates.diff() as DataFrame (xor)¶
When preforming difference between a row and it’s preceeding row, in Pandas and NumPy boolean values become ^ which is a logical XOR. That means that only one value can be True. E.g.
True ^ True = False
False ^ False = False
True ^ False = True
So that would mean that anytime the value is False, it hasn’t changed. That might work well.
Let’s try some things. For starters, take a look at our duplicate filter falgs:
[193]:
pd.set_option('display.max_rows', 25)
strt, end = pd.to_datetime('4/6/19 0845'), pd.to_datetime('4/6/19 1030')
qc.qa_events.duplicate[strt:end]
[193]:
Date
2019-04-06 08:45:00 False
2019-04-06 08:50:00 False
2019-04-06 08:55:00 False
2019-04-06 09:00:00 False
2019-04-06 09:05:00 False
2019-04-06 09:10:00 True
2019-04-06 09:15:00 True
2019-04-06 09:20:00 False
2019-04-06 09:25:00 False
2019-04-06 09:30:00 False
2019-04-06 09:35:00 False
2019-04-06 09:40:00 False
2019-04-06 09:45:00 False
2019-04-06 09:50:00 False
2019-04-06 09:55:00 False
2019-04-06 10:00:00 True
2019-04-06 10:05:00 True
2019-04-06 10:10:00 True
2019-04-06 10:15:00 True
2019-04-06 10:20:00 True
2019-04-06 10:25:00 True
2019-04-06 10:30:00 True
Name: duplicate, dtype: bool[pyarrow]
[194]:
qc.qa_events.duplicate[strt:end].diff()
[194]:
Date
2019-04-06 08:45:00 <NA>
2019-04-06 08:50:00 False
2019-04-06 08:55:00 False
2019-04-06 09:00:00 False
2019-04-06 09:05:00 False
2019-04-06 09:10:00 True
2019-04-06 09:15:00 False
2019-04-06 09:20:00 True
2019-04-06 09:25:00 False
2019-04-06 09:30:00 False
2019-04-06 09:35:00 False
2019-04-06 09:40:00 False
2019-04-06 09:45:00 False
2019-04-06 09:50:00 False
2019-04-06 09:55:00 False
2019-04-06 10:00:00 True
2019-04-06 10:05:00 False
2019-04-06 10:10:00 False
2019-04-06 10:15:00 False
2019-04-06 10:20:00 False
2019-04-06 10:25:00 False
2019-04-06 10:30:00 False
Name: duplicate, dtype: bool[pyarrow]
So a False first, second, and third differential will mean I’m requiring 4 in a row as a minimum…? For example, 0910 and 0915 will both become True. And then I can just use a ~ to transform everything to True based instead of True duplicate and False consecutive. But this is looking at following values only, so I’ll need to also look at preceeding values.
[209]:
next_the_same = qc.qa_events.duplicate[strt:end].diff()
two_the_same = next_the_same.diff()
three_the_same = two_the_same.diff()
three_the_same.diff()
[209]:
Date
2019-04-06 08:45:00 <NA>
2019-04-06 08:50:00 <NA>
2019-04-06 08:55:00 <NA>
2019-04-06 09:00:00 <NA>
2019-04-06 09:05:00 False
2019-04-06 09:10:00 True
2019-04-06 09:15:00 True
2019-04-06 09:20:00 False
2019-04-06 09:25:00 False
2019-04-06 09:30:00 True
2019-04-06 09:35:00 True
2019-04-06 09:40:00 False
2019-04-06 09:45:00 False
2019-04-06 09:50:00 False
2019-04-06 09:55:00 False
2019-04-06 10:00:00 True
2019-04-06 10:05:00 True
2019-04-06 10:10:00 True
2019-04-06 10:15:00 True
2019-04-06 10:20:00 False
2019-04-06 10:25:00 False
2019-04-06 10:30:00 False
Name: duplicate, dtype: bool[pyarrow]
[199]:
qc.qa_events.duplicate[strt:end].diff(periods=-1)
[199]:
Date
2019-04-06 08:45:00 False
2019-04-06 08:50:00 False
2019-04-06 08:55:00 False
2019-04-06 09:00:00 False
2019-04-06 09:05:00 True
2019-04-06 09:10:00 False
2019-04-06 09:15:00 True
2019-04-06 09:20:00 False
2019-04-06 09:25:00 False
2019-04-06 09:30:00 False
2019-04-06 09:35:00 False
2019-04-06 09:40:00 False
2019-04-06 09:45:00 False
2019-04-06 09:50:00 False
2019-04-06 09:55:00 True
2019-04-06 10:00:00 False
2019-04-06 10:05:00 False
2019-04-06 10:10:00 False
2019-04-06 10:15:00 False
2019-04-06 10:20:00 False
2019-04-06 10:25:00 False
2019-04-06 10:30:00 <NA>
Name: duplicate, dtype: bool[pyarrow]
[215]:
req_consecutive = 3
cnt = 0
next_the_same = last_the_same = qc.qa_events.duplicate[strt:end]
while cnt < req_consecutive:
next_the_same = next_the_same.diff()
last_the_same = last_the_same.diff(periods=-1)
cnt += 1
print(~next_the_same)
print(~last_the_same)
~next_the_same|~last_the_same
Date
2019-04-06 08:45:00 <NA>
2019-04-06 08:50:00 <NA>
2019-04-06 08:55:00 <NA>
2019-04-06 09:00:00 True
2019-04-06 09:05:00 True
2019-04-06 09:10:00 False
2019-04-06 09:15:00 True
2019-04-06 09:20:00 True
2019-04-06 09:25:00 True
2019-04-06 09:30:00 False
2019-04-06 09:35:00 True
2019-04-06 09:40:00 True
2019-04-06 09:45:00 True
2019-04-06 09:50:00 True
2019-04-06 09:55:00 True
2019-04-06 10:00:00 False
2019-04-06 10:05:00 True
2019-04-06 10:10:00 False
2019-04-06 10:15:00 True
2019-04-06 10:20:00 True
2019-04-06 10:25:00 True
2019-04-06 10:30:00 True
Name: duplicate, dtype: bool[pyarrow]
Date
2019-04-06 08:45:00 True
2019-04-06 08:50:00 True
2019-04-06 08:55:00 False
2019-04-06 09:00:00 True
2019-04-06 09:05:00 True
2019-04-06 09:10:00 True
2019-04-06 09:15:00 False
2019-04-06 09:20:00 True
2019-04-06 09:25:00 True
2019-04-06 09:30:00 True
2019-04-06 09:35:00 True
2019-04-06 09:40:00 True
2019-04-06 09:45:00 False
2019-04-06 09:50:00 True
2019-04-06 09:55:00 False
2019-04-06 10:00:00 True
2019-04-06 10:05:00 True
2019-04-06 10:10:00 True
2019-04-06 10:15:00 True
2019-04-06 10:20:00 <NA>
2019-04-06 10:25:00 <NA>
2019-04-06 10:30:00 <NA>
Name: duplicate, dtype: bool[pyarrow]
[215]:
Date
2019-04-06 08:45:00 True
2019-04-06 08:50:00 True
2019-04-06 08:55:00 <NA>
2019-04-06 09:00:00 True
2019-04-06 09:05:00 True
2019-04-06 09:10:00 True
2019-04-06 09:15:00 True
2019-04-06 09:20:00 True
2019-04-06 09:25:00 True
2019-04-06 09:30:00 True
2019-04-06 09:35:00 True
2019-04-06 09:40:00 True
2019-04-06 09:45:00 True
2019-04-06 09:50:00 True
2019-04-06 09:55:00 True
2019-04-06 10:00:00 True
2019-04-06 10:05:00 True
2019-04-06 10:10:00 True
2019-04-06 10:15:00 True
2019-04-06 10:20:00 True
2019-04-06 10:25:00 True
2019-04-06 10:30:00 True
Name: duplicate, dtype: bool[pyarrow]
That works with even numbers of differentials, but not with odd numbers. It fails on our little blip at 0910 and 0915. Plus, XOR is fairly confusing. Even if I’m the one to revisit this, I’ll have trouble keeping it straight, so clarity and future maintenance could easily be a problematic.
duplicates.diff() as int (1/-1)¶
If the booleans are converted to int first then True = 1 and False = 0.
True - True = 0
False - False = 0
False - True = -1
True - False = 1
So, let’s look at how the differential’s progress in both directions.
[228]:
duplicates = qc.qa_events.duplicate[strt:end].astype(int)
dups = pd.DataFrame({'qaDuplicates':qc.qa_events.loc[strt:end, 'duplicate'],'duplicates_AsInt':duplicates,
'diff_1':duplicates.diff(), 'diff_2':duplicates.diff().diff(),
'diff_3':duplicates.diff().diff().diff(),
'diff_4':duplicates.diff().diff().diff().diff(),
'-diff_1':duplicates.diff(periods=-1), '-diff_2':duplicates.diff(periods=-1).diff(periods=-1),
'-diff_3':duplicates.diff(periods=-1).diff(periods=-1).diff(periods=-1),
'-diff_4':duplicates.diff(periods=-1).diff(periods=-1).diff(periods=-1).diff(periods=-1),} )
dups
[228]:
| qaDuplicates | duplicates_AsInt | diff_1 | diff_2 | diff_3 | diff_4 | -diff_1 | -diff_2 | -diff_3 | -diff_4 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||
| 2019-04-06 08:45:00 | False | 0 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 2019-04-06 08:50:00 | False | 0 | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 1.0 |
| 2019-04-06 08:55:00 | False | 0 | 0.0 | 0.0 | NaN | NaN | 0.0 | 0.0 | -1.0 | -3.0 |
| 2019-04-06 09:00:00 | False | 0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 1.0 | 2.0 | 2.0 |
| 2019-04-06 09:05:00 | False | 0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 | 0.0 | 2.0 |
| 2019-04-06 09:10:00 | True | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | -1.0 | -2.0 | -3.0 |
| 2019-04-06 09:15:00 | True | 1 | 0.0 | -1.0 | -2.0 | -3.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2019-04-06 09:20:00 | False | 0 | -1.0 | -1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2019-04-06 09:25:00 | False | 0 | 0.0 | 1.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2019-04-06 09:30:00 | False | 0 | 0.0 | 0.0 | -1.0 | -3.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2019-04-06 09:35:00 | False | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2019-04-06 09:40:00 | False | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2019-04-06 09:45:00 | False | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -3.0 |
| 2019-04-06 09:50:00 | False | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 |
| 2019-04-06 09:55:00 | False | 0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 | -1.0 |
| 2019-04-06 10:00:00 | True | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2019-04-06 10:05:00 | True | 1 | 0.0 | -1.0 | -2.0 | -3.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2019-04-06 10:10:00 | True | 1 | 0.0 | 0.0 | 1.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2019-04-06 10:15:00 | True | 1 | 0.0 | 0.0 | 0.0 | -1.0 | 0.0 | 0.0 | 0.0 | NaN |
| 2019-04-06 10:20:00 | True | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN |
| 2019-04-06 10:25:00 | True | 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN |
| 2019-04-06 10:30:00 | True | 1 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN |
So, if the criteria is the Ndifferential == 0 with a logical OR between preceeding and following, then this is looking pretty good. Let’s grab one of our examples from above to make sure it filters more than just two points.
[379]:
strt, end = pd.to_datetime('6/13/22 2000'), pd.to_datetime('6/13/22 2125')
duplicates = qc.qa_events.duplicate[strt:end].astype(int)
dups = pd.DataFrame({'qaDuplicates':qc.qa_events.loc[strt:end, 'duplicate'],'duplicates_AsInt':duplicates,
'diff_1':duplicates.diff(), 'diff_2':duplicates.diff().diff(),
'diff_3':duplicates.diff().diff().diff(),
'diff_4':duplicates.diff().diff().diff().diff(),
'-diff_1':duplicates.diff(periods=-1), '-diff_2':duplicates.diff(periods=-1).diff(periods=-1),
'-diff_3':duplicates.diff(periods=-1).diff(periods=-1).diff(periods=-1),
'-diff_4':duplicates.diff(periods=-1).diff(periods=-1).diff(periods=-1).diff(periods=-1),} )
dups
[379]:
| qaDuplicates | duplicates_AsInt | diff_1 | diff_2 | diff_3 | diff_4 | -diff_1 | -diff_2 | -diff_3 | -diff_4 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||
| 2022-06-13 20:00:00 | False | 0 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 2022-06-13 20:05:00 | False | 0 | 0.0 | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 2022-06-13 20:10:00 | False | 0 | 0.0 | 0.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 |
| 2022-06-13 20:15:00 | False | 0 | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.0 | 1.0 |
| 2022-06-13 20:20:00 | False | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -3.0 |
| 2022-06-13 20:25:00 | False | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 3.0 |
| 2022-06-13 20:30:00 | False | 0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 | -2.0 |
| 2022-06-13 20:35:00 | True | 1 | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| 2022-06-13 20:40:00 | True | 1 | 0.0 | -1.0 | -2.0 | -3.0 | 0.0 | -1.0 | -2.0 | -2.0 |
| 2022-06-13 20:45:00 | True | 1 | 0.0 | 0.0 | 1.0 | 3.0 | 1.0 | 1.0 | 0.0 | -3.0 |
| 2022-06-13 20:50:00 | False | 0 | -1.0 | -1.0 | -1.0 | -2.0 | 0.0 | 1.0 | 3.0 | 6.0 |
| 2022-06-13 20:55:00 | False | 0 | 0.0 | 1.0 | 2.0 | 3.0 | -1.0 | -2.0 | -3.0 | -4.0 |
| 2022-06-13 21:00:00 | True | 1 | 1.0 | 1.0 | 0.0 | -2.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| 2022-06-13 21:05:00 | False | 0 | -1.0 | -2.0 | -3.0 | -3.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2022-06-13 21:10:00 | False | 0 | 0.0 | 1.0 | 3.0 | 6.0 | 0.0 | 0.0 | 0.0 | NaN |
| 2022-06-13 21:15:00 | False | 0 | 0.0 | 0.0 | -1.0 | -4.0 | 0.0 | 0.0 | NaN | NaN |
| 2022-06-13 21:20:00 | False | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | NaN | NaN | NaN |
| 2022-06-13 21:25:00 | False | 0 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | NaN | NaN | NaN |
I think that’s working as intended, but has the same problem as the methods that used .shift().
first differential would allow the 3 in a row.
second differential would allow the first and last value, but not the middle.
third differential blocks all. Weird exception here because there is a random detached True value 3 time steps away.
Answer from web¶
I found an answer on stack overflow. I found that it does seem to work, but not with pyarrow booleans. I change the type, then we’re all good. I did hope to avoid any sort of complicated running averages, but this seems vairly fast.
I walk through the method below.
[380]:
test = qc.qa_events.duplicate[strt:end].astype('boolean')
pd.DataFrame(data={'Duplicates':test, 'Duplicates.cumsum()': test.cumsum(),
'Duplicates.diff()': test.diff(), 'duplicates.diff().cumsum()':test.diff().cumsum()})
[380]:
| Duplicates | Duplicates.cumsum() | Duplicates.diff() | duplicates.diff().cumsum() | |
|---|---|---|---|---|
| Date | ||||
| 2022-06-13 20:00:00 | False | 0 | <NA> | <NA> |
| 2022-06-13 20:05:00 | False | 0 | False | 0 |
| 2022-06-13 20:10:00 | False | 0 | False | 0 |
| 2022-06-13 20:15:00 | False | 0 | False | 0 |
| 2022-06-13 20:20:00 | False | 0 | False | 0 |
| 2022-06-13 20:25:00 | False | 0 | False | 0 |
| 2022-06-13 20:30:00 | False | 0 | False | 0 |
| 2022-06-13 20:35:00 | True | 1 | True | 1 |
| 2022-06-13 20:40:00 | True | 2 | False | 1 |
| 2022-06-13 20:45:00 | True | 3 | False | 1 |
| 2022-06-13 20:50:00 | False | 3 | True | 2 |
| 2022-06-13 20:55:00 | False | 3 | False | 2 |
| 2022-06-13 21:00:00 | True | 4 | True | 3 |
| 2022-06-13 21:05:00 | False | 4 | True | 4 |
| 2022-06-13 21:10:00 | False | 4 | False | 4 |
| 2022-06-13 21:15:00 | False | 4 | False | 4 |
| 2022-06-13 21:20:00 | False | 4 | False | 4 |
| 2022-06-13 21:25:00 | False | 4 | False | 4 |
In the pandas/Numpy boolean type, cumsum can operate on the column, but increases whenever there is a True value and is constant when the value is False. By taking the diff, which is True whenver consecutive rows are different, cumsum simply numbers any group of values, incrementing to the next group number when the bolean values changed.
[381]:
grp = test.diff().cumsum()
test.groupby(grp).transform('size')
[381]:
Date
2022-06-13 20:00:00 <NA>
2022-06-13 20:05:00 6
2022-06-13 20:10:00 6
2022-06-13 20:15:00 6
2022-06-13 20:20:00 6
2022-06-13 20:25:00 6
2022-06-13 20:30:00 6
2022-06-13 20:35:00 3
2022-06-13 20:40:00 3
2022-06-13 20:45:00 3
2022-06-13 20:50:00 2
2022-06-13 20:55:00 2
2022-06-13 21:00:00 1
2022-06-13 21:05:00 5
2022-06-13 21:10:00 5
2022-06-13 21:15:00 5
2022-06-13 21:20:00 5
2022-06-13 21:25:00 5
Name: duplicate, dtype: Int64
When grouped by the cumsum values, the size of each group can then be measured and returned. This way we now have a length for each group. The example then goes further to make all series of False values negative and True values positive. But in tests, it is faster to just compare to our duplicate column using and &.
A quick test with a different test case.
[399]:
strt, end = pd.to_datetime('1/20/18 0415'), pd.to_datetime('1/20/18 0555')
test = qc.qa_events.duplicate[strt:end].astype('boolean')
grp = test.diff().cumsum()
pd.DataFrame(data={'grp_count':test.groupby(grp).transform('size'),
'duplicates':test,
'INST':qc.df_orig.loc[strt:end,'INST_SA'],
'TOT':qc.df_orig.loc[strt:end,'TOT_SA']})
[399]:
| grp_count | duplicates | INST | TOT | |
|---|---|---|---|---|
| Date | ||||
| 2018-01-20 04:15:00 | <NA> | False | 209.600006 | 0.1 |
| 2018-01-20 04:20:00 | 8 | False | 209.800003 | 0.2 |
| 2018-01-20 04:25:00 | 8 | False | 209.800003 | 0.0 |
| 2018-01-20 04:30:00 | 8 | False | 209.800003 | 0.0 |
| 2018-01-20 04:35:00 | 8 | False | 209.800003 | 0.0 |
| 2018-01-20 04:40:00 | 8 | False | 209.800003 | 0.0 |
| 2018-01-20 04:45:00 | 8 | False | 209.800003 | 0.0 |
| 2018-01-20 04:50:00 | 8 | False | 209.800003 | 0.0 |
| 2018-01-20 04:55:00 | 8 | False | 210.0 | 0.2 |
| 2018-01-20 05:00:00 | 5 | True | 209.800003 | 0.2 |
| 2018-01-20 05:05:00 | 5 | True | 209.800003 | 0.2 |
| 2018-01-20 05:10:00 | 5 | True | 209.800003 | 0.2 |
| 2018-01-20 05:15:00 | 5 | True | 209.800003 | 0.2 |
| 2018-01-20 05:20:00 | 5 | True | 209.800003 | 0.2 |
| 2018-01-20 05:25:00 | 7 | False | 210.100006 | 0.0 |
| 2018-01-20 05:30:00 | 7 | False | 210.100006 | 0.0 |
| 2018-01-20 05:35:00 | 7 | False | 210.300003 | 0.0 |
| 2018-01-20 05:40:00 | 7 | False | 210.300003 | 0.0 |
| 2018-01-20 05:45:00 | 7 | False | 210.300003 | 0.0 |
| 2018-01-20 05:50:00 | 7 | False | 210.300003 | 0.0 |
| 2018-01-20 05:55:00 | 7 | False | 210.300003 | 0.0 |
Weird example, but hopefully we can catch that random 210 tanke value by broadening the flat tank precision.
[285]:
test = qc.qa_events.duplicate
[287]:
%%timeit
g = test.astype('boolean').diff().cumsum()
q = test.groupby(g).transform('size') * np.where(test, 1, -1)
q >= 3
77.6 ms ± 3.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
[289]:
%%timeit
g = test.astype('boolean').diff().cumsum()
q = test.groupby(g).transform('size')
test & (q >= 3)
64.8 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
TOTALLY FLAT TANK (NO BOUNCE EVER!)
[293]:
pd.set_option('display.min_rows', 4)
# reset
qc.qa_events.duplicate = False
# reflag
qc.flag_repeating_val_precip(ppt_col='TOT_SA', tank_col='INST_SA', flat_ppt_nprecision=0, flat_tank_nprecision=0,
min_ppt_nprecision=0)
qc.df_orig[duplicates]
[293]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-19 14:45:00 | 346.5 | MM | 0.3 | M | 107.5 | M | 352.299988 | M | 0.2 | M | 103.699997 | M |
| 2017-10-19 14:50:00 | 346.5 | MM | 0.3 | M | 107.800003 | M | 352.299988 | M | 0.2 | M | 103.900002 | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-03-13 02:50:00 | 39.169998 | MM | 0.26 | M | 1271.699951 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:55:00 | 39.169998 | MM | 0.26 | M | 1271.959961 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
1321 rows × 12 columns
Allow Normal Oscillation
[402]:
pd.set_option('display.max_rows', 40)
# reset
qc.qa_events.duplicate = False
qc.qa_flags.Set0 = False
# reflag
qc.flag_repeating_val_precip(ppt_col='TOT_SA', tank_col='INST_SA', flat_ppt_nprecision=0, flat_tank_nprecision=1,
min_ppt_nprecision=0)
duplicate_val = qc.qa_events['duplicate'].astype('boolean')
number_continuous_grps = duplicate_val.diff().cumsum()
how_many_in_group = duplicate_val.groupby(number_continuous_grps).transform('size')
duplicates = duplicate_val & (how_many_in_group >= 4)
qc.df_orig[duplicates]
[402]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-19 14:45:00 | 346.5 | MM | 0.3 | M | 107.5 | M | 352.299988 | M | 0.2 | M | 103.699997 | M |
| 2017-10-19 14:50:00 | 346.5 | MM | 0.3 | M | 107.800003 | M | 352.299988 | M | 0.2 | M | 103.900002 | M |
| 2017-10-19 14:55:00 | 346.5 | MM | 0.3 | M | 108.099998 | M | 352.299988 | M | 0.2 | M | 104.099998 | M |
| 2017-10-19 15:00:00 | 346.5 | MM | 0.3 | M | 108.400002 | M | 352.299988 | M | 0.2 | M | 104.300003 | M |
| 2017-10-19 15:05:00 | 346.5 | MM | 0.3 | M | 108.699997 | M | 352.299988 | M | 0.2 | M | 104.5 | M |
| 2017-10-19 15:10:00 | 346.5 | MM | 0.3 | M | 109.0 | M | 352.299988 | M | 0.2 | M | 104.699997 | M |
| 2017-10-19 15:15:00 | 346.5 | MM | 0.3 | M | 109.300003 | M | 352.299988 | M | 0.2 | M | 104.900002 | M |
| 2017-10-19 15:20:00 | 346.5 | MM | 0.3 | M | 109.599998 | M | 352.299988 | M | 0.2 | M | 105.099998 | M |
| 2017-10-19 15:25:00 | 346.5 | MM | 0.3 | M | 109.900002 | M | 352.299988 | M | 0.2 | M | 105.300003 | M |
| 2017-10-19 15:30:00 | 346.5 | MM | 0.3 | M | 110.199997 | M | 352.299988 | M | 0.2 | M | 105.5 | M |
| 2018-01-20 05:00:00 | 209.800003 | <NA> | 0.2 | <NA> | 969.440002 | <NA> | 210.199997 | <NA> | 0.0 | <NA> | 905.469971 | <NA> |
| 2018-01-20 05:05:00 | 209.800003 | <NA> | 0.2 | <NA> | 969.640015 | <NA> | 210.399994 | <NA> | 0.0 | <NA> | 905.469971 | <NA> |
| 2018-01-20 05:10:00 | 209.800003 | <NA> | 0.2 | <NA> | 969.840027 | <NA> | 210.199997 | <NA> | 0.0 | <NA> | 905.469971 | <NA> |
| 2018-01-20 05:15:00 | 209.800003 | <NA> | 0.2 | <NA> | 970.039978 | <NA> | 210.399994 | <NA> | 0.0 | <NA> | 905.469971 | <NA> |
| 2018-01-20 05:20:00 | 209.800003 | <NA> | 0.2 | <NA> | 970.23999 | <NA> | 210.600006 | <NA> | 0.2 | <NA> | 905.669983 | <NA> |
| 2018-12-17 20:15:00 | 112.699997 | <NA> | 0.2 | <NA> | 458.609985 | <NA> | 122.400002 | <NA> | 0.0 | <NA> | 439.320007 | <NA> |
| 2018-12-17 20:20:00 | 112.900002 | <NA> | 0.2 | <NA> | 458.809998 | <NA> | 122.699997 | <NA> | 0.0 | <NA> | 439.320007 | <NA> |
| 2018-12-17 20:25:00 | 113.099998 | <NA> | 0.2 | <NA> | 459.01001 | <NA> | 123.0 | <NA> | 0.0 | <NA> | 439.320007 | <NA> |
| 2018-12-17 20:30:00 | 113.300003 | <NA> | 0.2 | <NA> | 459.209991 | <NA> | 123.5 | <NA> | 0.0 | <NA> | 439.320007 | <NA> |
| 2019-04-06 10:00:00 | 128.899994 | M | 0.2 | M | 1598.5 | M | 174.300003 | <NA> | 0.0 | <NA> | 1369.140015 | <NA> |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-06-04 17:10:00 | 19.610001 | <NA> | 0.05 | <NA> | 1491.689941 | <NA> | 38.939999 | <NA> | 0.0 | <NA> | 1380.25 | <NA> |
| 2020-06-04 17:15:00 | 19.610001 | <NA> | 0.05 | <NA> | 1491.73999 | <NA> | 38.779999 | <NA> | 0.0 | <NA> | 1380.25 | <NA> |
| 2020-06-04 17:20:00 | 19.74 | <NA> | 0.05 | <NA> | 1491.790039 | <NA> | 38.779999 | <NA> | 0.0 | <NA> | 1380.25 | <NA> |
| 2020-06-04 17:25:00 | 19.74 | <NA> | 0.05 | <NA> | 1491.839966 | <NA> | 38.779999 | <NA> | 0.0 | <NA> | 1380.25 | <NA> |
| 2021-10-26 22:55:00 | 126.0 | <NA> | 0.1 | <NA> | 120.5 | <NA> | 165.199997 | <NA> | 0.0 | <NA> | 129.830002 | <NA> |
| 2021-10-26 23:00:00 | 126.099998 | <NA> | 0.1 | <NA> | 120.599998 | <NA> | 165.399994 | <NA> | 0.2 | <NA> | 130.029999 | <NA> |
| 2021-10-26 23:05:00 | 126.199997 | <NA> | 0.1 | <NA> | 120.699997 | <NA> | 165.600006 | <NA> | 0.2 | <NA> | 130.229996 | <NA> |
| 2021-10-26 23:10:00 | 126.300003 | <NA> | 0.1 | <NA> | 120.800003 | <NA> | 165.5 | <NA> | 0.0 | <NA> | 130.229996 | <NA> |
| 2022-03-13 02:00:00 | 39.169998 | MM | 0.26 | M | 1269.099976 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:05:00 | 39.169998 | MM | 0.26 | M | 1269.359985 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:10:00 | 39.169998 | MM | 0.26 | M | 1269.619995 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:15:00 | 39.169998 | MM | 0.26 | M | 1269.880005 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:20:00 | 39.169998 | MM | 0.26 | M | 1270.140015 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:25:00 | 39.169998 | MM | 0.26 | M | 1270.400024 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:30:00 | 39.169998 | MM | 0.26 | M | 1270.660034 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:35:00 | 39.169998 | MM | 0.26 | M | 1270.920044 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:40:00 | 39.169998 | MM | 0.26 | M | 1271.180054 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:45:00 | 39.169998 | MM | 0.26 | M | 1271.439941 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:50:00 | 39.169998 | MM | 0.26 | M | 1271.699951 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:55:00 | 39.169998 | MM | 0.26 | M | 1271.959961 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
1370 rows × 12 columns
[363]:
times = qc.df_orig[duplicates].index.astype('datetime64[ns]')
np.unique(times.date)
[363]:
array([datetime.date(2017, 10, 19), datetime.date(2018, 1, 20),
datetime.date(2018, 12, 17), datetime.date(2019, 4, 6),
datetime.date(2019, 4, 7), datetime.date(2019, 4, 8),
datetime.date(2019, 4, 9), datetime.date(2019, 11, 15),
datetime.date(2020, 2, 5), datetime.date(2020, 6, 3),
datetime.date(2020, 6, 4), datetime.date(2021, 10, 26),
datetime.date(2022, 3, 13)], dtype=object)
Now, let’s look at the problem cases
[309]:
plt.rcParams['figure.max_open_warning']=100
[329]:
duplicates.name = 'duplicate'
duplicates = pd.DataFrame(duplicates)
qa_events = qc.qa_events
qa_events['duplicates'] = duplicates
flag.apply_FSDB_flags(qa_events, qc.qa_flags)
ax = flag.plot_flagged_day(pd.to_datetime('6/3/20 1200'), 'CEN_01', tdelta='2D', auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)
[330]:
ax = flag.plot_flagged_day(pd.to_datetime('6/4/20 0800'), 'CEN_01', tdelta='4h', auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)
[331]:
ax = flag.plot_flagged_day(pd.to_datetime('12/17/18 1900'), 'CEN_01', tdelta='3h', auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)
The red Set0 flags are set by the QaRules.flag_repeating_val_precip without adding a requirement for consecutive values. The green X’s are set after adding the consecutive value requirement. We can see a lot of improvement as a result. But there are still a couple values flagged here that probably shouldn’t be. Since this is a small case, it might be a good candidtate for a manual unflag. Or a good reason to adjust the threshold to 6 in a row.
[332]:
ax = flag.plot_flagged_day(pd.to_datetime('10/26/21 2130'), 'CEN_01', tdelta='3h', auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)
This situation seems similar to 12/17/18. A lot of improvement. May be worth using 6 consecutive values, a full 30 min, as the requirement.
[335]:
ax = flag.plot_flagged_day(pd.to_datetime('1/20/18 0400'), 'CEN_01', tdelta='3h', auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)
[341]:
strt = pd.to_datetime('1/20/18 0400')
end = pd.to_datetime('1/20/18 0400') + pd.to_timedelta('3h')
print(f"total provisional precip 1/20/18 0400- 0700: {qc.df_orig.loc[strt:end, 'TOT_SA'].sum()}")
print(f"total QC'd precip 1/20/18 0400- 0700: {qc.df_orig.loc[strt:end, 'TOT_SA'][duplicates['duplicate']].sum()}")
print(f"tank change over period: {qc.df_orig.loc[end,'INST_SA']-qc.df_orig.loc[strt, 'INST_SA']}")
total provisional precip 1/20/18 0400- 0700: 2.2000000402331352
total QC'd precip 1/20/18 0400- 0700: 1.0000000149011612
tank change over period: 1.399993896484375
I’m not sure why that is coming in a point late, but the flag does seem justified here. There is 1.4 mm of tank change over 3 hours, but we see 2.2 mm of precip accumulating. This doesn’t grab all of that, but it does reduce the total for the period to a more reasonable number.
Based on these cases, 6 consecutive values will be used.
[369]:
ax = flag.plot_flagged_day(pd.to_datetime('11/15/19 0900'), 'CEN_01', tdelta='9h',
auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)
[373]:
ax = flag.plot_flagged_day(pd.to_datetime('2/5/20 1200'), 'CEN_01', tdelta='4h',
auto_qa_event=duplicates, paired_tank=pair_tank)
plt.grid(True)
11/15/19 looks like the filter is working, but 2/5/20 looks like another reason to switch the consecutive requirement to 6.
[405]:
pd.set_option('display.min_row', 10)
duplicates = duplicate_val & (how_many_in_group >= 5)
times = qc.df_orig[duplicates].index.astype('datetime64[ns]')
np.unique(times.date)
[405]:
array([datetime.date(2017, 10, 19), datetime.date(2018, 1, 20),
datetime.date(2019, 4, 6), datetime.date(2019, 4, 7),
datetime.date(2019, 4, 8), datetime.date(2019, 4, 9),
datetime.date(2019, 11, 15), datetime.date(2020, 6, 3),
datetime.date(2020, 6, 4), datetime.date(2022, 3, 13)],
dtype=object)
[408]:
qc.df_orig.loc[duplicates, 'TOT_SA'].sum()
[408]:
226.55000304803252
That removed all of the dates I wanted to remove, but kept all of the ones I wanted to keep. And it is a really solid amount of precip. Further, it should continue to work even if GCE starts sending real tank values for the missing periods.
Comparing tank change to precip¶
This section has some earlier attempts to see if some simple and obvious methods work. All of them were hard to make work right.
However, there are some potential benefits to looking at it this way:
It could keep working even if GCE stops sending large chuncks of missing tank values
There’s the potential to catch places where
simplepre.mis preforming poorlyThis may be useful in filtering diurnal fluctuations found at other sites (UPLO and VARA)
Rolling sum of tank¶
1 Day¶
[49]:
pd.options.display.min_rows = 50
dly_tank_chg = cnsa['INST'].diff().rolling('1D').sum()
dly_precip = cnsa['TOT'].rolling('1D').sum()
not_drain = dly_tank_chg >= 0
raining = cnsa['TOT']>0
mismatch = (dly_tank_chg < (dly_precip-(3*sa_precision))) & not_drain & raining
cnsa[mismatch]
[49]:
| INST | INST_Flag | TOT | TOT_Flag | ACC | ACC_Flag | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2017-10-19 14:55:00 | 346.5 | MM | 0.3 | M | 108.099998 | M |
| 2017-10-19 15:00:00 | 346.5 | MM | 0.3 | M | 108.400002 | M |
| 2017-10-19 15:05:00 | 346.5 | MM | 0.3 | M | 108.699997 | M |
| 2017-10-19 15:10:00 | 346.5 | MM | 0.3 | M | 109.0 | M |
| 2017-10-19 15:15:00 | 346.5 | MM | 0.3 | M | 109.300003 | M |
| 2017-10-19 15:20:00 | 346.5 | MM | 0.3 | M | 109.599998 | M |
| 2017-10-19 15:25:00 | 346.5 | MM | 0.3 | M | 109.900002 | M |
| 2017-10-19 15:30:00 | 346.5 | MM | 0.3 | M | 110.199997 | M |
| 2017-11-20 20:35:00 | 243.0 | <NA> | 0.2 | <NA> | 565.340027 | <NA> |
| 2018-01-20 05:15:00 | 209.800003 | <NA> | 0.2 | <NA> | 970.039978 | <NA> |
| 2018-01-20 05:20:00 | 209.800003 | <NA> | 0.2 | <NA> | 970.23999 | <NA> |
| 2018-01-20 06:00:00 | 210.300003 | <NA> | 0.3 | F | 970.539978 | F |
| 2018-01-20 06:35:00 | 210.5 | <NA> | 0.2 | <NA> | 970.73999 | <NA> |
| 2018-01-20 06:45:00 | 210.699997 | <NA> | 0.2 | <NA> | 970.940002 | <NA> |
| 2018-01-20 07:10:00 | 210.800003 | <NA> | 0.1 | <NA> | 971.039978 | <NA> |
| 2018-01-20 09:55:00 | 211.0 | <NA> | 0.2 | <NA> | 971.23999 | <NA> |
| 2018-01-20 10:00:00 | 211.100006 | <NA> | 0.1 | <NA> | 971.340027 | <NA> |
| 2018-01-20 12:05:00 | 211.199997 | <NA> | 0.1 | <NA> | 971.440002 | <NA> |
| 2018-01-20 12:10:00 | 211.5 | <NA> | 0.3 | <NA> | 971.73999 | <NA> |
| 2018-01-20 12:50:00 | 211.600006 | <NA> | 0.1 | <NA> | 971.840027 | <NA> |
| 2018-01-20 13:25:00 | 211.699997 | <NA> | 0.1 | <NA> | 971.940002 | <NA> |
| 2018-01-20 13:40:00 | 211.899994 | <NA> | 0.2 | <NA> | 972.140015 | <NA> |
| 2018-01-20 16:35:00 | 212.100006 | <NA> | 0.1 | <NA> | 972.340027 | <NA> |
| 2018-01-20 16:40:00 | 212.199997 | <NA> | 0.1 | <NA> | 972.440002 | <NA> |
| 2018-01-20 18:15:00 | 212.399994 | <NA> | 0.2 | <NA> | 972.640015 | <NA> |
| ... | ... | ... | ... | ... | ... | ... |
| 2022-03-13 09:25:00 | 51.459999 | <NA> | 0.13 | <NA> | 1284.25 | <NA> |
| 2022-03-13 09:30:00 | 51.580002 | <NA> | 0.12 | <NA> | 1284.369995 | <NA> |
| 2022-03-13 09:45:00 | 52.09 | <NA> | 0.51 | <NA> | 1284.880005 | <NA> |
| 2022-03-13 09:50:00 | 52.599998 | <NA> | 0.51 | <NA> | 1285.390015 | <NA> |
| 2022-03-13 10:40:00 | 53.860001 | <NA> | 1.26 | F | 1286.650024 | F |
| 2022-03-13 11:10:00 | 53.990002 | <NA> | 0.13 | <NA> | 1286.780029 | <NA> |
| 2022-03-13 11:15:00 | 54.110001 | <NA> | 0.12 | <NA> | 1286.900024 | <NA> |
| 2022-03-13 11:25:00 | 54.240002 | <NA> | 0.13 | <NA> | 1287.030029 | <NA> |
| 2022-03-13 11:30:00 | 54.369999 | <NA> | 0.13 | <NA> | 1287.160034 | <NA> |
| 2022-03-13 11:35:00 | 54.490002 | <NA> | 0.12 | <NA> | 1287.280029 | <NA> |
| 2022-03-13 11:40:00 | 55.130001 | <NA> | 0.64 | <NA> | 1287.920044 | <NA> |
| 2022-03-13 12:30:00 | 55.25 | <NA> | 0.12 | <NA> | 1288.040039 | <NA> |
| 2022-03-13 12:45:00 | 55.509998 | <NA> | 0.26 | <NA> | 1288.300049 | <NA> |
| 2022-03-13 13:00:00 | 55.630001 | <NA> | 0.12 | <NA> | 1288.420044 | <NA> |
| 2022-03-13 13:05:00 | 56.139999 | <NA> | 0.51 | <NA> | 1288.930054 | <NA> |
| 2022-03-13 14:20:00 | 56.259998 | <NA> | 0.12 | <NA> | 1289.050049 | <NA> |
| 2022-03-13 14:30:00 | 56.389999 | <NA> | 0.13 | <NA> | 1289.180054 | <NA> |
| 2022-03-13 15:00:00 | 56.639999 | <NA> | 0.25 | <NA> | 1289.430054 | <NA> |
| 2022-03-13 15:30:00 | 56.900002 | <NA> | 0.26 | <NA> | 1289.689941 | <NA> |
| 2022-03-13 15:45:00 | 57.400002 | <NA> | 0.5 | <NA> | 1290.189941 | <NA> |
| 2022-03-13 16:00:00 | 57.529999 | <NA> | 0.13 | <NA> | 1290.319946 | <NA> |
| 2022-03-13 17:05:00 | 57.91 | <NA> | 0.38 | <NA> | 1290.699951 | <NA> |
| 2022-03-13 18:35:00 | 58.040001 | <NA> | 0.13 | <NA> | 1290.829956 | <NA> |
| 2022-03-13 19:55:00 | 58.16 | <NA> | 0.12 | <NA> | 1290.949951 | <NA> |
| 2022-03-13 22:50:00 | 58.290001 | <NA> | 0.13 | <NA> | 1291.079956 | <NA> |
1502 rows × 6 columns
[50]:
plt.figure()
dly_precip.plot(legend=True, grid=True)
dly_tank_chg.plot(legend=True, grid=True)
dly_precip[mismatch].plot(grid=True, linestyle='', marker='o', color='navy', legend=True)
plt.legend(['sum of daily precip', 'sum of daily tank.diff()', 'flag mismatch'])
[50]:
<matplotlib.legend.Legend at 0x2d94904b1c0>
[51]:
strt = pd.to_datetime('1/19/18 1200')
end = strt + pd.to_timedelta('2d')
plt.figure()
dly_precip[strt:end].plot(legend=True, grid=True)
dly_tank_chg[strt:end].plot(legend=True, grid=True)
dly_precip[strt:end][mismatch].plot(grid=True, linestyle='', marker='o', color='navy', legend=True)
ax1 = plt.gca()
cnsa.loc[strt:end, 'TOT'].plot(ax=ax1, grid=True, linestyle=':')
plt.legend(['sum of daily precip', 'sum of daily tank.diff()', 'flag mismatch', 'actual precip',])
ax2=plt.twinx(ax1)
cnsa.loc[strt:end, 'INST'].plot(ax=ax2, grid=True, linestyle=':')
plt.legend(['actual tank value'])
[51]:
<matplotlib.legend.Legend at 0x2d952d84610>
[52]:
strt = pd.to_datetime('3/13/22')
end = strt + pd.to_timedelta('2d')
plt.figure()
dly_precip[strt:end].plot(legend=True, grid=True)
dly_tank_chg[strt:end].plot(legend=True, grid=True)
dly_precip[strt:end][mismatch].plot(grid=True, linestyle='', marker='o', color='navy', legend=True)
ax1 = plt.gca()
cnsa.loc[strt:end, 'TOT'].plot(ax=ax1, grid=True, linestyle=':')
plt.legend(['sum of daily precip', 'sum of daily tank.diff()', 'flag mismatch', 'actual precip',])
ax2=plt.twinx(ax1)
cnsa.loc[strt:end, 'INST'].plot(ax=ax2, grid=True, linestyle=':')
plt.legend(['actual tank value'])
[52]:
<matplotlib.legend.Legend at 0x2d952ef3190>
[53]:
strt = pd.to_datetime('12/25/18')
end = strt + pd.to_timedelta('2D')
plt.figure()
dly_precip[strt:end].plot(legend=True, grid=True)
dly_tank_chg[strt:end].plot(legend=True, grid=True)
dly_precip[strt:end][mismatch].plot(grid=True, linestyle='', marker='o', color='navy', legend=True)
cnsa.loc[strt:end, ['INST', 'TOT']].plot(ax=plt.gca(), grid=True, linestyle=':')
plt.legend(['sum of daily precip', 'sum of daily tank.diff()', 'flag mismatch', 'actual tank value', 'actual precip'])
[53]:
<matplotlib.legend.Legend at 0x2d952e21490>
[54]:
plt.close(20)
[55]:
strt = pd.to_datetime('4/6/19')
end = strt + pd.to_timedelta('4D')
plt.figure()
dly_precip[strt:end].plot(legend=True, grid=True)
dly_tank_chg[strt:end].plot(legend=True, grid=True)
dly_precip[strt:end][mismatch].plot(grid=True, linestyle='', marker='o', color='navy', legend=True)
ax1 = plt.gca()
cnsa.loc[strt:end, 'TOT'].plot(ax=ax1, grid=True, linestyle=':')
plt.legend(['sum of daily precip', 'sum of daily tank.diff()', 'flag mismatch', 'actual precip',])
ax2=plt.twinx(ax1)
cnsa.loc[strt:end, 'INST'].plot(ax=ax2, grid=True, linestyle=':')
plt.legend(['actual tank value'])
[55]:
<matplotlib.legend.Legend at 0x2d9596bd5e0>
This isn’t catching repeat values we want. The sum pegs everything to a drain or an unclog with some huge value for 24 hours. The cumulative catch up in Dec 2019 is a great example. Then we have 2 examples where our QaRules.flag_duplicate_precip catches small repeats, less than an hour and totaling a mm or two. But with this method it doesn’t clog the first 10-30 min but goes on to flag the whole rest of the day. It is workign correctly for 4/9/2019 at least. Maybe a small window would work.
1Hr¶
[417]:
pd.options.display.min_rows = 50
hrly_tank_chg = cnsa['INST'].diff().rolling('1h').sum()
hrly_precip = cnsa['TOT'].rolling('1h').sum()
not_drain = hrly_tank_chg >= 0
raining = cnsa['TOT']>0
mismatch = (hrly_tank_chg < (hrly_precip-(2*sa_precision))) & not_drain & raining
cnsa[mismatch]
[417]:
| INST | INST_Flag | TOT | TOT_Flag | ACC | ACC_Flag | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2017-10-19 14:50:00 | 346.5 | MM | 0.3 | M | 107.800003 | M |
| 2017-10-19 14:55:00 | 346.5 | MM | 0.3 | M | 108.099998 | M |
| 2017-10-19 15:00:00 | 346.5 | MM | 0.3 | M | 108.400002 | M |
| 2017-10-19 15:05:00 | 346.5 | MM | 0.3 | M | 108.699997 | M |
| 2017-10-19 15:10:00 | 346.5 | MM | 0.3 | M | 109.0 | M |
| 2017-10-19 15:15:00 | 346.5 | MM | 0.3 | M | 109.300003 | M |
| 2017-10-19 15:20:00 | 346.5 | MM | 0.3 | M | 109.599998 | M |
| 2017-10-19 15:25:00 | 346.5 | MM | 0.3 | M | 109.900002 | M |
| 2017-10-19 15:30:00 | 346.5 | MM | 0.3 | M | 110.199997 | M |
| 2017-11-19 21:35:00 | 182.300003 | <NA> | 0.4 | <NA> | 504.640015 | <NA> |
| 2017-11-28 13:40:00 | 355.799988 | <NA> | 0.3 | <NA> | 678.140015 | <NA> |
| 2017-11-28 13:45:00 | 356.0 | <NA> | 0.2 | <NA> | 678.340027 | <NA> |
| 2018-01-09 08:00:00 | 76.18 | <NA> | 0.38 | <NA> | 835.419983 | <NA> |
| 2018-01-20 05:10:00 | 209.800003 | <NA> | 0.2 | <NA> | 969.840027 | <NA> |
| 2018-01-20 05:15:00 | 209.800003 | <NA> | 0.2 | <NA> | 970.039978 | <NA> |
| 2018-01-20 05:20:00 | 209.800003 | <NA> | 0.2 | <NA> | 970.23999 | <NA> |
| 2018-01-20 06:00:00 | 210.300003 | <NA> | 0.3 | F | 970.539978 | F |
| 2018-01-29 21:20:00 | 82.099998 | <NA> | 0.4 | <NA> | 1120.030029 | <NA> |
| 2018-06-09 17:05:00 | 94.800003 | <NA> | 0.1 | <NA> | 1769.01001 | <NA> |
| 2018-06-09 17:10:00 | 94.900002 | <NA> | 0.1 | <NA> | 1769.109985 | <NA> |
| ... | ... | ... | ... | ... | ... | ... |
| 2021-09-18 03:20:00 | 199.899994 | <NA> | 1.1 | <NA> | 1715.910034 | <NA> |
| 2021-10-05 18:05:00 | 344.0 | <NA> | 0.1 | <NA> | 3.1 | <NA> |
| 2021-10-05 18:10:00 | 344.100006 | <NA> | 0.1 | <NA> | 3.2 | <NA> |
| 2021-11-04 11:35:00 | 39.279999 | <NA> | 0.63 | <NA> | 191.460007 | <NA> |
| 2022-01-13 07:30:00 | 16.299999 | <NA> | 0.38 | <NA> | 1032.119995 | <NA> |
| 2022-01-30 19:35:00 | 54.330002 | <NA> | 0.63 | <NA> | 1070.150024 | <NA> |
| 2022-03-13 02:10:00 | 39.169998 | MM | 0.26 | M | 1269.619995 | M |
| 2022-03-13 02:15:00 | 39.169998 | MM | 0.26 | M | 1269.880005 | M |
| 2022-03-13 02:20:00 | 39.169998 | MM | 0.26 | M | 1270.140015 | M |
| 2022-03-13 02:25:00 | 39.169998 | MM | 0.26 | M | 1270.400024 | M |
| 2022-03-13 02:30:00 | 39.169998 | MM | 0.26 | M | 1270.660034 | M |
| 2022-03-13 02:35:00 | 39.169998 | MM | 0.26 | M | 1270.920044 | M |
| 2022-03-13 02:40:00 | 39.169998 | MM | 0.26 | M | 1271.180054 | M |
| 2022-03-13 02:45:00 | 39.169998 | MM | 0.26 | M | 1271.439941 | M |
| 2022-03-13 02:50:00 | 39.169998 | MM | 0.26 | M | 1271.699951 | M |
| 2022-03-13 02:55:00 | 39.169998 | MM | 0.26 | M | 1271.959961 | M |
| 2022-03-13 03:00:00 | 40.07 | <NA> | 0.9 | <NA> | 1272.859985 | <NA> |
| 2022-03-13 03:15:00 | 40.200001 | <NA> | 0.13 | <NA> | 1272.98999 | <NA> |
| 2022-03-13 11:35:00 | 54.490002 | <NA> | 0.12 | <NA> | 1287.280029 | <NA> |
| 2022-05-28 11:50:00 | 27.74 | <NA> | 0.26 | <NA> | 1941.930054 | <NA> |
1225 rows × 6 columns
[414]:
pd.options.display.min_rows = 20
flag.flags['Set0'] = mismatch
flag.event['duplicate'] = mismatch
flag.get_flagged_days()
[414]:
DatetimeIndex(['2017-10-19', '2017-11-19', '2017-11-28', '2018-01-09',
'2018-01-20', '2018-01-29', '2018-06-09', '2018-11-04',
'2018-11-21', '2018-12-16', '2018-12-25', '2019-04-01',
'2019-04-06', '2019-04-07', '2019-04-08', '2019-04-09',
'2019-06-07', '2019-06-28', '2019-09-08', '2019-11-15',
'2019-11-30', '2020-03-13', '2020-06-03', '2020-06-04',
'2020-10-10', '2020-10-23', '2020-12-06', '2021-01-08',
'2021-01-12', '2021-02-15', '2021-03-05', '2021-03-09',
'2021-03-28', '2021-04-30', '2021-09-10', '2021-09-18',
'2021-10-05', '2021-11-04', '2022-01-13', '2022-01-30',
'2022-03-13', '2022-05-28'],
dtype='datetime64[ns]', freq=None)
This is very sensitive to the precision multiplier. If hrly_tank_chg < (hrly_precip-0*sa_precision) is used, it selects 17,000 records: all sorts of diconnected values in varying increments, event when the records are consecutive. This strays from our purpose of “repeating” values, which seems to be a specific artifact.
When increased to hrly_tank_chg < (hrly_precip-3*sa_precision), only a thousand records are filtered, although, they still have a lot of the same problems. A factor of 2 seems best, but it still is blantantly filtering a number of ‘F’ flagged values, which we know to be ‘catch-up’ precip. This seems better at catching anomolies than periods of repeating artificial values.
QaRules.flag_overaccum_precip¶
This function captures all of the isntances of duplcation that QaRules.flag_duplicate_precip does, but, instead of looking for flat tank and flat precip, it looks for places where the change in tank height doesn’t match the precip:
precip = df[ppt_col]
tank_change = df[tank_col].diff()
ppt_minus_tank = abs(precip - tank_change)
raining = precip > 0
overaccum_ppt = ppt_minus_tank > overaccum_threshold * self.precision
flag = raining & overaccum_ppt
Maybe this can be tuned to capture the artificial/repeating precip. And it should be more durable to changes in GCE processing.
Let’s sratrt by looking what it captures as it is currently parameterized.
[58]:
pd.options.display.min_rows = 20
# rerun
ai = qc.flag_overaccum_precip(ppt_col='TOT_SA', tank_col='INST_SA', overaccum_threshold=5)
qc.df_orig[ai]
[58]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2018-11-21 17:00:00 | 217.199997 | <NA> | 5.1 | F | 205.630005 | F | 236.800003 | <NA> | 0.0 | <NA> | 192.649994 | <NA> |
| 2018-12-25 09:25:00 | 307.299988 | <NA> | 173.0 | F | 826.309998 | F | 311.399994 | <NA> | 0.0 | <NA> | 628.419983 | <NA> |
| 2019-06-28 00:20:00 | 141.300003 | <NA> | 21.200001 | F | 2247.336914 | F | 170.600006 | <NA> | 0.0 | <NA> | 1778.619995 | <NA> |
| 2019-09-08 13:20:00 | 35.189999 | <NA> | 4.48 | F | 2278.616943 | F | 59.5 | <NA> | 0.0 | <NA> | 1807.150024 | <NA> |
| 2022-09-06 12:55:00 | 8.97 | <NA> | 6.7 | J | 2228.149902 | J | 24.49 | <NA> | 24.49 | E | 2175.530029 | E |
| 2022-09-06 13:00:00 | 7.248 | <NA> | 4.978 | F | 2233.12793 | F | 24.66 | <NA> | 24.66 | F | 2200.189941 | F |
[59]:
pd.options.display.max_rows = 15
pd.options.display.min_rows = 15
# rerun
ai = qc.flag_overaccum_precip(ppt_col='TOT_SA', tank_col='INST_SA', overaccum_threshold=3)
qc.df_orig[ai]
[59]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-11-19 20:40:00 | 179.699997 | <NA> | 1.2 | F | 502.040009 | F | 179.0 | <NA> | 0.2 | <NA> | 462.779999 | <NA> |
| 2018-06-16 18:05:00 | 111.599998 | <NA> | 1.2 | F | 1785.810059 | F | 130.0 | <NA> | 0.3 | <NA> | 1697.359985 | <NA> |
| 2018-11-21 17:00:00 | 217.199997 | <NA> | 5.1 | F | 205.630005 | F | 236.800003 | <NA> | 0.0 | <NA> | 192.649994 | <NA> |
| 2018-12-25 09:25:00 | 307.299988 | <NA> | 173.0 | F | 826.309998 | F | 311.399994 | <NA> | 0.0 | <NA> | 628.419983 | <NA> |
| 2019-03-06 07:05:00 | 355.700012 | <NA> | 1.2 | F | 1415.310059 | F | 356.899994 | Q | 0.3 | <NA> | 1195.329956 | <NA> |
| 2019-06-28 00:20:00 | 141.300003 | <NA> | 21.200001 | F | 2247.336914 | F | 170.600006 | <NA> | 0.0 | <NA> | 1778.619995 | <NA> |
| 2019-08-09 06:50:00 | 15.87 | <NA> | 1.06 | F | 2254.697021 | F | 40.459999 | <NA> | 0.0 | <NA> | 1785.52002 | <NA> |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-10-05 17:15:00 | 342.5 | <NA> | 1.0 | F | 1.6 | F | 373.899994 | <NA> | 0.0 | <NA> | 1.8 | <NA> |
| 2021-11-04 10:40:00 | 32.950001 | <NA> | 1.14 | F | 185.130005 | F | 60.869999 | <NA> | 0.17 | <NA> | 192.520004 | <NA> |
| 2021-12-23 04:35:00 | 39.189999 | <NA> | 2.28 | F | 693.349976 | F | 67.120003 | <NA> | 0.33 | <NA> | 657.570007 | <NA> |
| 2022-09-06 12:55:00 | 8.97 | <NA> | 6.7 | J | 2228.149902 | J | 24.49 | <NA> | 24.49 | E | 2175.530029 | E |
| 2022-09-06 13:00:00 | 7.248 | <NA> | 4.978 | F | 2233.12793 | F | 24.66 | <NA> | 24.66 | F | 2200.189941 | F |
| 2022-09-17 18:10:00 | 9.42 | <NA> | 1.36 | F | 2235.300049 | F | 26.26 | <NA> | 1.59 | F | 2201.790039 | F |
| 2022-09-29 00:45:00 | 24.15 | <NA> | 2.09 | F | 2250.030029 | F | 40.59 | <NA> | 0.05 | <NA> | 2216.120117 | <NA> |
22 rows × 12 columns
This is getting big “catch up” values, almost all of them flagged ‘F’. We need to get those 0.2 and 0.05 repeating values. Much like QaRules.flag_duplicate_precip, we need to make the threshold really small to catch those, we but are already getting all sorts real precip that we don’t want to capture with this filter.
[60]:
pd.options.display.min_rows = 8
# rerun
ai = qc.flag_overaccum_precip(ppt_col='TOT_SA', tank_col='INST_SA', overaccum_threshold=0)
qc.df_orig[ai]
[60]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-01 00:05:00 | 239.699997 | <NA> | 0.2 | Q | 0.2 | Q | 259.799988 | <NA> | 0.2 | Q | 0.2 | Q |
| 2017-10-01 00:10:00 | 239.800003 | <NA> | 0.3 | <NA> | 0.5 | <NA> | 260.100006 | <NA> | 0.5 | <NA> | 0.7 | <NA> |
| 2017-10-01 00:15:00 | 239.899994 | <NA> | 0.1 | <NA> | 0.6 | <NA> | 260.100006 | <NA> | 0.0 | <NA> | 0.7 | <NA> |
| 2017-10-01 00:40:00 | 240.100006 | <NA> | 0.2 | <NA> | 0.8 | <NA> | 260.399994 | <NA> | 0.0 | <NA> | 1.0 | <NA> |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-09-30 01:05:00 | 29.6 | <NA> | 0.03 | <NA> | 2255.47998 | <NA> | 44.720001 | <NA> | 0.0 | <NA> | 2220.27002 | <NA> |
| 2022-09-30 02:30:00 | 29.610001 | <NA> | 0.01 | <NA> | 2255.48999 | <NA> | 44.650002 | <NA> | 0.0 | <NA> | 2220.300049 | <NA> |
| 2022-09-30 11:40:00 | 29.629999 | <NA> | 0.02 | <NA> | 2255.51001 | <NA> | 44.77 | <NA> | 0.0 | <NA> | 2220.300049 | <NA> |
| 2022-09-30 12:30:00 | 29.639999 | <NA> | 0.01 | <NA> | 2255.52002 | <NA> | 44.720001 | <NA> | 0.0 | <NA> | 2220.330078 | <NA> |
34838 rows × 12 columns
With 34,838 instances, including all the ‘F’-flagged values that catch the precip up to the tank accumulation during dry periods, this clearly isn’t working. This was designed to catch big differences, not little 0.2mm differences that repeat forever.
Only where precip > tank¶
Maybe if I smooth out tank values, or at least stop using the abs() of tank.diff(). With the absolute value, we are catching places where the precip increases more than the tank, but also places where the tank increases more than the precip, which is not the artificial precip problem we are after.
[61]:
pd.options.display.max_rows = 20
pd.options.display.min_rows = 20
ppt_col, tank_col, overaccum_threshold = 'TOT_SA', 'INST_SA', 0
precip = df[ppt_col]
tank_change = df[tank_col].diff()
# no abs! But!!! 0ppt - -5tankchg = 5, so those need to be tossed
tank_increase = tank_change > 0
ppt_minus_tank = precip - tank_change
raining = precip > 0
overaccum_ppt = ppt_minus_tank > overaccum_threshold * qc.precision
ai = raining & overaccum_ppt & tank_increase
qc.df_orig[ai]
[61]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-01 00:05:00 | 239.699997 | <NA> | 0.2 | Q | 0.2 | Q | 259.799988 | <NA> | 0.2 | Q | 0.2 | Q |
| 2017-10-01 00:10:00 | 239.800003 | <NA> | 0.3 | <NA> | 0.5 | <NA> | 260.100006 | <NA> | 0.5 | <NA> | 0.7 | <NA> |
| 2017-10-01 00:15:00 | 239.899994 | <NA> | 0.1 | <NA> | 0.6 | <NA> | 260.100006 | <NA> | 0.0 | <NA> | 0.7 | <NA> |
| 2017-10-01 00:45:00 | 240.300003 | <NA> | 0.2 | <NA> | 1.0 | <NA> | 260.799988 | <NA> | 0.4 | <NA> | 1.4 | <NA> |
| 2017-10-01 01:55:00 | 240.399994 | <NA> | 0.1 | <NA> | 1.1 | <NA> | 260.799988 | <NA> | 0.0 | <NA> | 1.6 | <NA> |
| 2017-10-01 02:25:00 | 241.300003 | <NA> | 0.2 | <NA> | 2.0 | <NA> | 261.299988 | <NA> | 0.0 | <NA> | 2.2 | <NA> |
| 2017-10-01 02:50:00 | 241.399994 | <NA> | 0.1 | <NA> | 2.1 | <NA> | 261.700012 | <NA> | 0.0 | <NA> | 2.4 | <NA> |
| 2017-10-01 03:05:00 | 241.699997 | <NA> | 0.2 | <NA> | 2.4 | <NA> | 262.0 | <NA> | 0.0 | <NA> | 2.6 | <NA> |
| 2017-10-01 03:50:00 | 242.399994 | <NA> | 0.2 | <NA> | 3.1 | <NA> | 263.0 | <NA> | 0.5 | <NA> | 3.6 | <NA> |
| 2017-10-01 04:00:00 | 242.899994 | <NA> | 0.2 | <NA> | 3.6 | <NA> | 263.200012 | <NA> | 0.2 | <NA> | 3.8 | <NA> |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-09-29 03:10:00 | 26.879999 | <NA> | 0.01 | <NA> | 2252.76001 | <NA> | 41.110001 | <NA> | 0.0 | <NA> | 2216.76001 | <NA> |
| 2022-09-29 05:20:00 | 26.959999 | <NA> | 0.01 | <NA> | 2252.840088 | <NA> | 42.41 | <NA> | 0.01 | <NA> | 2217.939941 | <NA> |
| 2022-09-29 15:25:00 | 28.190001 | <NA> | 0.24 | <NA> | 2254.070068 | <NA> | 44.41 | <NA> | 0.0 | <NA> | 2220.01001 | <NA> |
| 2022-09-29 15:30:00 | 28.34 | <NA> | 0.15 | <NA> | 2254.219971 | <NA> | 44.32 | <NA> | 0.0 | <NA> | 2220.01001 | <NA> |
| 2022-09-29 15:35:00 | 28.49 | <NA> | 0.15 | <NA> | 2254.370117 | <NA> | 44.32 | <NA> | 0.0 | <NA> | 2220.01001 | <NA> |
| 2022-09-29 16:20:00 | 28.559999 | <NA> | 0.02 | <NA> | 2254.439941 | <NA> | 44.27 | <NA> | 0.0 | <NA> | 2220.01001 | <NA> |
| 2022-09-29 20:25:00 | 29.34 | <NA> | 0.14 | <NA> | 2255.219971 | <NA> | 44.650002 | <NA> | 0.06 | <NA> | 2220.179932 | <NA> |
| 2022-09-29 20:30:00 | 29.379999 | <NA> | 0.04 | <NA> | 2255.26001 | <NA> | 44.639999 | <NA> | 0.0 | <NA> | 2220.179932 | <NA> |
| 2022-09-29 20:50:00 | 29.51 | <NA> | 0.08 | <NA> | 2255.389893 | <NA> | 44.720001 | <NA> | 0.0 | <NA> | 2220.25 | <NA> |
| 2022-09-29 23:50:00 | 29.559999 | <NA> | 0.03 | <NA> | 2255.439941 | <NA> | 44.68 | <NA> | 0.0 | <NA> | 2220.25 | <NA> |
14603 rows × 12 columns
These timestamps don’t make a lot of sense to me. Let’s plot a day to confirm, but I think this filter isn’t working right.
[62]:
del flag
[63]:
flag = ApplyFlags(df.index)
# add our new FSDB_auto_flag to this instance of ApplyFlags
dbl_flag = pd.DataFrame(ai, columns=['Set0'], index=pd.to_datetime(df.index))
dbl_event = pd.DataFrame(ai, columns=['duplicate'], index=df.index)
dbl_event[['drain_event', 'neg_delta_tank', 'clog']] = False
flag.apply_FSDB_flags(dbl_event, dbl_flag)
# add the original data to the instance to plot
flag.import_provisional_data(qc.df_orig, tank_col='INST_SA', ppt_col='TOT_SA',
ppt_flag_col='TOT_Flag_SA')
pair_tank = qc.df_orig['INST_SH']
ax = flag.plot_flagged_day(pd.to_datetime('9/29/22'), 'CEN_01', tdelta='1D', auto_qa_event=dbl_event, paired_tank=pair_tank)
[64]:
ax = flag.plot_flagged_day(pd.to_datetime('9/20/22 0700'), 'CEN_01', tdelta='12h', auto_qa_event=dbl_event, paired_tank=pair_tank)
That looks totally random. While I agree that we have some precip when the tank is effectively flat, I don’t think there is any reason we can consistently classify the points above as artificial repeats.
As is, this method can only be tuned to check large precip amounts against tank changes.
Smooth tank first¶
Instead of looking for large differences, let’s reverse the process for finding drains:
Smooth the tank,
Look for increases
Look at what precip is doing when the tank isn’t rising
[492]:
tank_col = 'INST_SA'
tank_window = df[tank_col].rolling('45min')
tank_smooth = tank_window.median()
#tank_upr_bounce = tank_window.quantile(0.90)
rising_tank = df[tank_col] > (tank_smooth + qc.precision*.5)
raining = df[ppt_col] > 0
phantom_precip = ~rising_tank & raining
df[phantom_precip]
[492]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-01 00:05:00 | 239.699997 | <NA> | 0.2 | Q | 0.2 | Q | 259.799988 | <NA> | 0.2 | Q | 0.2 | Q |
| 2017-10-01 00:10:00 | 239.800003 | <NA> | 0.3 | <NA> | 0.5 | <NA> | 260.100006 | <NA> | 0.5 | <NA> | 0.7 | <NA> |
| 2017-10-01 01:55:00 | 240.399994 | <NA> | 0.1 | <NA> | 1.1 | <NA> | 260.799988 | <NA> | 0.0 | <NA> | 1.6 | <NA> |
| 2017-10-01 03:30:00 | 241.800003 | <NA> | 0.1 | <NA> | 2.5 | <NA> | 262.200012 | <NA> | 0.0 | <NA> | 3.1 | <NA> |
| 2017-10-01 05:05:00 | 243.100006 | <NA> | 0.1 | <NA> | 3.8 | <NA> | 263.700012 | <NA> | 0.0 | <NA> | 4.3 | <NA> |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2022-09-30 00:45:00 | 29.57 | <NA> | 0.01 | <NA> | 2255.449951 | <NA> | 44.700001 | <NA> | 0.0 | <NA> | 2220.27002 | <NA> |
| 2022-09-30 01:05:00 | 29.6 | <NA> | 0.03 | <NA> | 2255.47998 | <NA> | 44.720001 | <NA> | 0.0 | <NA> | 2220.27002 | <NA> |
| 2022-09-30 02:30:00 | 29.610001 | <NA> | 0.01 | <NA> | 2255.48999 | <NA> | 44.650002 | <NA> | 0.0 | <NA> | 2220.300049 | <NA> |
| 2022-09-30 11:40:00 | 29.629999 | <NA> | 0.02 | <NA> | 2255.51001 | <NA> | 44.77 | <NA> | 0.0 | <NA> | 2220.300049 | <NA> |
| 2022-09-30 12:30:00 | 29.639999 | <NA> | 0.01 | <NA> | 2255.52002 | <NA> | 44.720001 | <NA> | 0.0 | <NA> | 2220.330078 | <NA> |
4990 rows × 12 columns
[493]:
df.loc[phantom_precip & (how_many_in_group >= 5), 'TOT_SA'].describe()
[493]:
count 1353.0
mean 0.166563
std 0.062929
min 0.01
25% 0.2
50% 0.2
75% 0.2
max 0.37
Name: TOT_SA, dtype: double[pyarrow]
Yet again, this is just too wide a net. It catches all sorts of small precip values during periods where the tank isn’t doing much. I tried a bunch of different time steps and precision values, but, while this may be a strong inditement of the accuracy of simple_pre.m, it does not function as a filter for the artificial repeating values.
It’s worth seeing what this looks like if I require cosnecutive values, like I did with QaRules.falg_duplicate_precip
[598]:
rising_tank = df[tank_col] > (tank_smooth + qc.precision*1)
phantom_precip = ~rising_tank & raining
duplicate_val = phantom_precip.astype('boolean')
number_continuous_grps = duplicate_val.diff().cumsum()
how_many_in_group = duplicate_val.groupby(number_continuous_grps).transform('size')
df[phantom_precip & (how_many_in_group >= 5)]
[598]:
| INST_SA | INST_Flag_SA | TOT_SA | TOT_Flag_SA | ACC_SA | ACC_Flag_SA | INST_SH | INST_Flag_SH | TOT_SH | TOT_Flag_SH | ACC_SH | ACC_Flag_SH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Date | ||||||||||||
| 2017-10-19 15:00:00 | 346.5 | MM | 0.3 | M | 108.400002 | M | 352.299988 | M | 0.2 | M | 104.300003 | M |
| 2017-10-19 15:05:00 | 346.5 | MM | 0.3 | M | 108.699997 | M | 352.299988 | M | 0.2 | M | 104.5 | M |
| 2017-10-19 15:10:00 | 346.5 | MM | 0.3 | M | 109.0 | M | 352.299988 | M | 0.2 | M | 104.699997 | M |
| 2017-10-19 15:15:00 | 346.5 | MM | 0.3 | M | 109.300003 | M | 352.299988 | M | 0.2 | M | 104.900002 | M |
| 2017-10-19 15:20:00 | 346.5 | MM | 0.3 | M | 109.599998 | M | 352.299988 | M | 0.2 | M | 105.099998 | M |
| 2017-10-19 15:25:00 | 346.5 | MM | 0.3 | M | 109.900002 | M | 352.299988 | M | 0.2 | M | 105.300003 | M |
| 2017-10-19 15:30:00 | 346.5 | MM | 0.3 | M | 110.199997 | M | 352.299988 | M | 0.2 | M | 105.5 | M |
| 2017-10-19 15:40:00 | 18.049999 | <NA> | 0.37 | <NA> | 110.57 | <NA> | 30.1 | <NA> | 0.32 | <NA> | 105.82 | <NA> |
| 2017-10-19 15:45:00 | 18.309999 | <NA> | 0.26 | <NA> | 110.830002 | <NA> | 30.1 | <NA> | 0.0 | <NA> | 105.82 | <NA> |
| 2017-10-19 15:50:00 | 18.559999 | <NA> | 0.25 | <NA> | 111.080002 | <NA> | 30.1 | <NA> | 0.0 | <NA> | 105.82 | <NA> |
| 2017-10-19 15:55:00 | 18.809999 | <NA> | 0.25 | <NA> | 111.330002 | <NA> | 29.780001 | <NA> | 0.0 | <NA> | 105.82 | <NA> |
| 2017-10-19 16:00:00 | 19.07 | <NA> | 0.26 | <NA> | 111.589996 | <NA> | 30.43 | <NA> | 0.33 | <NA> | 106.150002 | <NA> |
| 2018-01-20 04:55:00 | 210.0 | <NA> | 0.2 | <NA> | 969.23999 | <NA> | 210.100006 | <NA> | 0.0 | <NA> | 905.469971 | <NA> |
| 2018-01-20 05:00:00 | 209.800003 | <NA> | 0.2 | <NA> | 969.440002 | <NA> | 210.199997 | <NA> | 0.0 | <NA> | 905.469971 | <NA> |
| 2018-01-20 05:05:00 | 209.800003 | <NA> | 0.2 | <NA> | 969.640015 | <NA> | 210.399994 | <NA> | 0.0 | <NA> | 905.469971 | <NA> |
| 2018-01-20 05:10:00 | 209.800003 | <NA> | 0.2 | <NA> | 969.840027 | <NA> | 210.199997 | <NA> | 0.0 | <NA> | 905.469971 | <NA> |
| 2018-01-20 05:15:00 | 209.800003 | <NA> | 0.2 | <NA> | 970.039978 | <NA> | 210.399994 | <NA> | 0.0 | <NA> | 905.469971 | <NA> |
| 2018-01-20 05:20:00 | 209.800003 | <NA> | 0.2 | <NA> | 970.23999 | <NA> | 210.600006 | <NA> | 0.2 | <NA> | 905.669983 | <NA> |
| 2019-04-06 10:10:00 | 128.899994 | M | 0.2 | M | 1598.900024 | M | 174.800003 | <NA> | 0.4 | <NA> | 1369.640015 | <NA> |
| 2019-04-06 10:15:00 | 128.899994 | M | 0.2 | M | 1599.099976 | M | 174.899994 | <NA> | 0.1 | <NA> | 1369.73999 | <NA> |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2020-06-04 17:05:00 | 19.74 | <NA> | 0.05 | <NA> | 1491.640015 | <NA> | 38.939999 | <NA> | 0.0 | <NA> | 1380.25 | <NA> |
| 2020-06-04 17:10:00 | 19.610001 | <NA> | 0.05 | <NA> | 1491.689941 | <NA> | 38.939999 | <NA> | 0.0 | <NA> | 1380.25 | <NA> |
| 2020-06-04 17:15:00 | 19.610001 | <NA> | 0.05 | <NA> | 1491.73999 | <NA> | 38.779999 | <NA> | 0.0 | <NA> | 1380.25 | <NA> |
| 2020-06-04 17:20:00 | 19.74 | <NA> | 0.05 | <NA> | 1491.790039 | <NA> | 38.779999 | <NA> | 0.0 | <NA> | 1380.25 | <NA> |
| 2020-06-04 17:25:00 | 19.74 | <NA> | 0.05 | <NA> | 1491.839966 | <NA> | 38.779999 | <NA> | 0.0 | <NA> | 1380.25 | <NA> |
| 2021-10-23 01:50:00 | 68.769997 | <NA> | 0.01 | <NA> | 63.27 | <NA> | 112.699997 | <NA> | 0.0 | <NA> | 77.529999 | <NA> |
| 2021-10-23 01:55:00 | 68.779999 | <NA> | 0.01 | <NA> | 63.279999 | <NA> | 112.699997 | <NA> | 0.0 | <NA> | 77.529999 | <NA> |
| 2021-10-23 02:00:00 | 68.910004 | <NA> | 0.13 | <NA> | 63.41 | <NA> | 112.900002 | <NA> | 0.0 | <NA> | 77.529999 | <NA> |
| 2021-10-23 02:05:00 | 69.019997 | <NA> | 0.11 | <NA> | 63.52 | <NA> | 112.900002 | <NA> | 0.0 | <NA> | 77.529999 | <NA> |
| 2021-10-23 02:10:00 | 69.029999 | <NA> | 0.01 | <NA> | 63.529999 | <NA> | 113.199997 | <NA> | 0.3 | <NA> | 77.830002 | <NA> |
| 2022-03-13 02:10:00 | 39.169998 | MM | 0.26 | M | 1269.619995 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:15:00 | 39.169998 | MM | 0.26 | M | 1269.880005 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:20:00 | 39.169998 | MM | 0.26 | M | 1270.140015 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:25:00 | 39.169998 | MM | 0.26 | M | 1270.400024 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:30:00 | 39.169998 | MM | 0.26 | M | 1270.660034 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:35:00 | 39.169998 | MM | 0.26 | M | 1270.920044 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:40:00 | 39.169998 | MM | 0.26 | M | 1271.180054 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:45:00 | 39.169998 | MM | 0.26 | M | 1271.439941 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:50:00 | 39.169998 | MM | 0.26 | M | 1271.699951 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
| 2022-03-13 02:55:00 | 39.169998 | MM | 0.26 | M | 1271.959961 | M | 69.540001 | M | 0.0 | M | 1182.130005 | M |
1364 rows × 12 columns
With the consequtive requirement, this does quite well. I’m quite confident with how well QaRules.flag_duplicate_precip works above, this matches it pretty closely. It grabs 4 fewer timesteps, but one extra day and a smidge more precip (~0.01”)
To compare:
# days flagged by `QaRules.falg_duplicate_precip`
array([datetime.date(2017, 10, 19), datetime.date(2018, 1, 20),
datetime.date(2019, 4, 6), datetime.date(2019, 4, 7),
datetime.date(2019, 4, 8), datetime.date(2019, 4, 9),
datetime.date(2019, 11, 15), datetime.date(2020, 6, 3),
datetime.date(2020, 6, 4), datetime.date(2022, 3, 13)],
dtype=object)
# timesteps flagged by `QaRules.falg_duplicate_precip`
1370 rows
# precip flagged by `QaRules.falg_duplicate_precip`
226.55000304803252
[429]:
pd.set_option('display.min_row', 10)
flag.flags[:] = False
flag.flags['Set0'] = phantom_precip & (how_many_in_group >= 5)
flag.get_flagged_days()
[429]:
DatetimeIndex(['2017-10-19', '2018-01-20', '2019-04-06', '2019-04-07',
'2019-04-08', '2019-04-09', '2019-11-15', '2020-06-03',
'2020-06-04', '2021-10-23', '2022-03-13'],
dtype='datetime64[ns]', freq=None)
[448]:
df.loc[phantom_precip & (how_many_in_group >= 5), 'TOT_SA'].sum()
[448]:
226.76000300608575
Let’s take a quick look at the extra day caught by this method.
[584]:
duplicates = pd.DataFrame((phantom_precip & (how_many_in_group >= 5)))
duplicates.name = 'duplicate'
qa_events[:] = False
qa_events['duplicate'] = duplicates
flag.apply_FSDB_flags(qa_events, flag.flags)
ax = flag.plot_flagged_day(pd.to_datetime('10/23/2021 0100'), 'CEN_01', tdelta='90min', auto_qa_event=qa_events, paired_tank=pair_tank)
plt.grid(True)
[585]:
strt, end = pd.to_datetime('10/23/2021 0100'), pd.to_datetime('10/23/2021 0230')
(tank_smooth.loc[strt:end]).plot(ax=ax[1], color='g', marker='d')
(tank_smooth.loc[strt:end]+1*sa_precision).plot(ax=ax[1], color='g', style='d:')
[585]:
<Axes: title={'center': 'CEN_01 - 2021-10-23 01:00:00'}, xlabel='Date'>
[586]:
lbl = ax[1].get_legend_handles_labels()[1]
lbl[-2:] = ['Smoothed tank', 'Smoothed tank + precision']
ax[1].legend(labels=lbl)
[586]:
<matplotlib.legend.Legend at 0x2d915ecae50>
[457]:
print(f"Tank change: {df.loc[end, 'INST_SA'] - df.loc[strt, 'INST_SA']:.4f}")
print(f"Tot precip : {df.loc[strt:end, 'TOT_SA'].sum():.4f}")
Tank change: 0.7700
Tot precip : 0.7600
This example doesn’t look like it needs to be flagged.
This seems like a very effective way to find a when the tank is flat. By smoothing the tank to the median with a precision buffer, rather than looking at absolute differences between noisy tank values every 5 minutes, the filter naturally grabs all of our repeating values that we’ve been finding.
In the search for these artificially repeating numbers, it makes sense to require the values to be consecutive. And, since this filter is naturally smoothing signal noise with a buffer the precip captured is already limited to sub-precision values. Which means there is no need to identify whether the values are repeating themselves or not, as long as they are consecutive. This is very effective, but fundamentally is looking more at whether or not we are accumulating signal noise. That is a question which is worth addressing, but has not been sytematically investigated here, but rather incidentally investigated as a byproduct of the search for artificially repeating values.
For now, the well developed method for searching for duplicates is dually useful both to find large, doubled values and small constant ones.