Testing Across the Data¶
This needs to be tested across the data broadly to make sure it works right. The first step is to get the method into an efficient and compact function.
Refactoring¶
This needs to be made into an efficient collection of methods. It depends heavily on daily values. Is there an efficient way to get the daily values? I will attempt to do as much as possible through indexed functions and try to avoid big complicated loops. It will be a challenge to split the functionallity so it’s not one big loop like we ended the last section with.
[1]:
# Jupyter magic to make plots display interactive
# must install ipympl (Ipython-matplotlib) and nodejs
from ipywidgets.embed import embed_minimal_html
%matplotlib ipympl
import pandas as pd
import matplotlib.pyplot as plt
from numpy import nan, arange
import scipy
# expand all plots to comfortable viewing size
plt.rcParams['figure.figsize'] = [8, 5]
import sys
sys.path.append("../")
from post_gce_qc import qaqc, data_transfer, cross_probe_qc, main
[2]:
# laod data
prov = data_transfer.LoadProvisionalData(file_n='../config.yaml')
prov.load_ppt_data(strtyr=2018, endyr=2022)
df = prov.pivot_on_probe(prov.df, 'UPL', '02')
# apply rules
#-------------------------
param = qaqc._load_yaml('../qa_param.yaml')['UPL_02']
qc = qaqc.QaRules(df, qa_params=param)
dfqa = qc.df_orig
[3]:
day_keys = pd.to_datetime(df.index).normalize()
start_tank = dfqa.INST.groupby(day_keys).transform('first')
#dly_chg = midnight_tank.diff()
end_tank = df.INST.groupby(day_keys).transform('last')
dly_chg = start_tank - end_tank
start_tank.loc[pd.to_datetime('10/1/17 2350'):pd.to_datetime('10/2/17 0010')]
[3]:
Date
2017-10-01 23:50:00 49.290001
2017-10-01 23:55:00 49.290001
2017-10-02 00:00:00 55.509998
2017-10-02 00:05:00 55.509998
2017-10-02 00:10:00 55.509998
Name: INST, dtype: float[pyarrow]
[4]:
dly_chg.loc[pd.to_datetime('10/1/17 2350'):pd.to_datetime('10/2/17 0010')]
[4]:
Date
2017-10-01 23:50:00 -6.219997
2017-10-01 23:55:00 -6.219997
2017-10-02 00:00:00 -1.630001
2017-10-02 00:05:00 -1.630001
2017-10-02 00:10:00 -1.630001
Name: INST, dtype: float[pyarrow]
[5]:
dly_ppt = df.TOT.groupby(day_keys).transform('sum')
dly_ppt.loc[pd.to_datetime('10/1/17 2350'):pd.to_datetime('10/2/17 0010')]
[5]:
Date
2017-10-01 23:50:00 7.04
2017-10-01 23:55:00 7.04
2017-10-02 00:00:00 1.14
2017-10-02 00:05:00 1.14
2017-10-02 00:10:00 1.14
Name: TOT, dtype: float[pyarrow]
Well, it’s good to know that the precip never matches the tank…
[6]:
qaqc.QaRules.round_to_precision(dly_chg, param['precision']).loc[pd.to_datetime('10/1/17 2350'):pd.to_datetime('10/2/17 0010')]
[6]:
Date
2017-10-01 23:50:00 -6.1468
2017-10-01 23:55:00 -6.1468
2017-10-02 00:00:00 -1.6764
2017-10-02 00:05:00 -1.6764
2017-10-02 00:10:00 -1.6764
Name: INST, dtype: double[pyarrow]
[302]:
def calc_median_filter(df, col='INST', window=289):
return scipy.ndimage.median_filter(df[col], size=window, mode='nearest')
def find_tank_flux(df, precision, col='INST', window=289, nprecision=5):
trend = calc_median_filter(df, col=col, window=289)
flux = trend - df[col]
return abs(flux) > nprecision * precision
def calc_dly_tank_change(df, day_keys, precision, tank_col='INST'):
start_tank = df[tank_col].groupby(day_keys).transform('first')
end_tank = df[tank_col].groupby(day_keys).transform('last')
dly_chg = end_tank - start_tank
return qaqc.QaRules.round_to_precision(dly_chg, precision)
def calc_dly_precip(df, day_keys, precip_col='TOT', exclude=None):
if exclude is None:
# exclude none/ include all
exclude = df[precip_col] < 0.
return df.loc[~exclude, precip_col].groupby(day_keys).transform('sum')
def find_dly_over_accum(df, precision, nprecision=2, tank_col='INST', precip_col='TOT', get_val=False):
day_key = pd.to_datetime(df.index).normalize()
dly_chg = calc_dly_tank_change(df, day_key, precision, tank_col=tank_col)
dly_ppt = calc_dly_precip(df, day_key, precip_col=precip_col)
if not get_val:
return dly_ppt > (dly_chg + nprecision * precision)
elif get_val:
return dly_chg, dly_ppt
def flag_precip_during_tank_flux(df, precision, tank_col='INST', ppt_col='TOT', nprecision=5, window=289):
flux = find_tank_flux(df, precision, col=tank_col, window=window, nprecision=nprecision)
dly_over_accum = find_dly_over_accum(df, precision, nprecision=nprecision, tank_col=tank_col, precip_col=ppt_col)
raining = df[ppt_col] > 0.
diurnal_flux = flux & dly_over_accum & raining
#self.qa_flags['Set0'] |= diurnal_flux
#self.qa_flags['E'] |= diurnal_flux
#self.qa_events['diurnal_flux'] |= diurnal_flux
return diurnal_flux
[8]:
%timeit flag_precip_during_tank_flux(df, 0.288)
611 ms ± 9.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
That is insanely slow. To do that for all of the gauges will take almost 4s. Time to dig in and try to make it more efficient.
[36]:
%load_ext line_profiler
[10]:
%lprun -f flag_precip_during_tank_flux flag_precip_during_tank_flux(df, 0.288)
Timer unit: 1e-09 s
Total time: 0.800321 s
File: /var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_83689/2322213253.py
Function: flag_precip_during_tank_flux at line 37
Line # Hits Time Per Hit % Time Line Contents
==============================================================
37 def flag_precip_during_tank_flux(df, precision, tank_col='INST', ppt_col='TOT', nprecision=5, window=289):
38
39 1 16206000.0 2e+07 2.0 flux = find_tank_flux(df, precision, col=tank_col, window=window, nprecision=nprecision)
40 1 783763000.0 8e+08 97.9 dly_over_accum = find_dly_over_accum(df, precision, nprecision=nprecision, tank_col=tank_col, precip_col=ppt_col)
41 1 172000.0 172000.0 0.0 raining = df[ppt_col] > 0.
42
43 1 180000.0 180000.0 0.0 diurnal_flux = flux & dly_over_accum & raining
44
45 #self.qa_flags['Set0'] |= diurnal_flux
46 #self.qa_flags['E'] |= diurnal_flux
47 #self.qa_events['diurnal_flux'] |= diurnal_flux
48
49 1 0.0 0.0 0.0 return diurnal_flux
[11]:
%lprun -f find_dly_over_accum find_dly_over_accum(df, 0.288)
Timer unit: 1e-09 s
Total time: 0.764574 s
File: /var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_83689/2322213253.py
Function: find_dly_over_accum at line 26
Line # Hits Time Per Hit % Time Line Contents
==============================================================
26 def find_dly_over_accum(df, precision, nprecision=2, tank_col='INST', precip_col='TOT', get_val=False):
27 1 746889000.0 7e+08 97.7 day_key = pd.to_datetime(df.index).normalize()
28
29 1 9697000.0 1e+07 1.3 dly_chg = calc_dly_tank_change(df, day_key, precision, tank_col=tank_col)
30 1 7560000.0 8e+06 1.0 dly_ppt = calc_dly_precip(df, day_key, precip_col=precip_col)
31
32 1 0.0 0.0 0.0 if not get_val:
33 1 428000.0 428000.0 0.1 return dly_ppt > (dly_chg + nprecision * precision)
34 elif get_val:
35 return dly_chg, dly_ppt
[12]:
%timeit day_key = pd.to_datetime(df.index).normalize()
578 ms ± 5.99 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[13]:
%timeit day_key = df.index.to_series().dt.date
580 μs ± 3.36 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
[14]:
%timeit day_key = pd.to_datetime(df.index).date
610 ms ± 5.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[15]:
%timeit df['INST'].groupby(pd.Grouper(freq='1D')).transform('first')
4.89 ms ± 33.1 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
[16]:
%timeit df['INST'].groupby(pd.Grouper(freq='1D'))
2.43 ms ± 59.9 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
[304]:
def find_dly_over_accum(df, precision, nprecision=2, tank_col='INST', precip_col='TOT', get_val=False):
day_grp = pd.Grouper(freq='1D') # df.index.to_series().dt.date #
dly_chg = calc_dly_tank_change(df, day_grp, precision, tank_col=tank_col)
dly_ppt = calc_dly_precip(df, day_grp, precip_col=precip_col)
if not get_val:
return dly_ppt > (dly_chg + nprecision * precision)
elif get_val:
return dly_chg, dly_ppt
A daily groupby - transform is 128 times faster, but even after converting the index to a series, accessing dt.date is another order of magnitude faster (1250x, or almost 10x faster than groupby-transform). But the function turns out marginally faster using the Grouper().
[18]:
%timeit flag_precip_during_tank_flux(df, 0.288)
37.9 ms ± 517 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That’s looking better.
Pro-rating data¶
Now I need to write something that does the prorating. This is likely to go in the qaqc.ApplyFlags class, since qaqc.QaRules only handles booleans and does not handle or store any changes to the data.
[9]:
diurnal_flux = flag_precip_during_tank_flux(df, 0.288)
day_grp = pd.Grouper(freq='1D')
# get days where diurnal flux occurs
#auto_qa_event['diurnal_flux']
diurnal_flux.groupby(day_grp).transform('max')[pd.to_datetime('9/25/19 0000'):pd.to_datetime('9/26/19 0000')]
[9]:
Date
2019-09-25 00:00:00 True
2019-09-25 00:05:00 True
2019-09-25 00:10:00 True
2019-09-25 00:15:00 True
2019-09-25 00:20:00 True
...
2019-09-25 23:40:00 True
2019-09-25 23:45:00 True
2019-09-25 23:50:00 True
2019-09-25 23:55:00 True
2019-09-26 00:00:00 True
Length: 289, dtype: bool[pyarrow]
[12]:
#auto_qa_event['diurnal_flux']
over_accum_dly = diurnal_flux.groupby(day_grp).transform('max')
# calc daily tank change and precip while tank is NOT fluxing
dly_chg = calc_dly_tank_change(df, day_grp, 0.288, tank_col='INST')
likely_dly_ppt = calc_dly_precip(df, day_grp, precip_col='TOT', exclude=diurnal_flux)
# only prorate data on days where it is overaccumulating and there are non-zero precip amounts outside of the tank flux
rain_day = likely_dly_ppt > 0
prorate = rain_day & over_accum_dly
# multiply all precip by a value of 1
ratio = pd.Series(1.0, name='ratio', index=dly_chg.index)
ratio
/var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_84183/611514667.py:10: FutureWarning: Operation between non boolean Series with different indexes will no longer return a boolean result in a future version. Cast both Series to object type to maintain the prior behavior.
prorate = rain_day & over_accum_dly
[12]:
Date
2017-10-01 00:00:00 1.0
2017-10-01 00:05:00 1.0
2017-10-01 00:10:00 1.0
2017-10-01 00:15:00 1.0
2017-10-01 00:20:00 1.0
...
2022-09-30 23:35:00 1.0
2022-09-30 23:40:00 1.0
2022-09-30 23:45:00 1.0
2022-09-30 23:50:00 1.0
2022-09-30 23:55:00 1.0
Name: ratio, Length: 525888, dtype: float64
Let’s track down that future warning.
[23]:
over_accum_dly
[23]:
Date
2017-10-01 00:00:00 False
2017-10-01 00:05:00 False
2017-10-01 00:10:00 False
2017-10-01 00:15:00 False
2017-10-01 00:20:00 False
...
2022-09-30 23:35:00 False
2022-09-30 23:40:00 False
2022-09-30 23:45:00 False
2022-09-30 23:50:00 False
2022-09-30 23:55:00 False
Length: 525888, dtype: bool[pyarrow]
[24]:
rain_day
[24]:
Date
2017-10-01 00:00:00 True
2017-10-01 00:05:00 True
2017-10-01 00:10:00 True
2017-10-01 00:15:00 True
2017-10-01 00:20:00 True
...
2022-09-30 23:35:00 True
2022-09-30 23:40:00 True
2022-09-30 23:45:00 True
2022-09-30 23:50:00 True
2022-09-30 23:55:00 True
Name: TOT, Length: 525217, dtype: bool[pyarrow]
They have a different length by at least a few days.
[25]:
likely_dly_ppt[diurnal_flux] = 0
[26]:
likely_dly_ppt
[26]:
Date
2017-10-01 00:00:00 7.04
2017-10-01 00:05:00 7.04
2017-10-01 00:10:00 7.04
2017-10-01 00:15:00 7.04
2017-10-01 00:20:00 7.04
...
2022-09-30 23:35:00 0.04
2022-09-30 23:40:00 0.04
2022-09-30 23:45:00 0.04
2022-09-30 23:50:00 0.04
2022-09-30 23:55:00 0.04
Name: TOT, Length: 525217, dtype: float[pyarrow]
[305]:
def calc_dly_precip(df, day_keys, precip_col='TOT', exclude=None):
if exclude is None:
# exclude none/ include all
exclude = df[precip_col] < 0.
elif exclude is not None:
df = df.copy(deep=True)
# these values are already flagged as E and set to value of 0.
# this references the same memory, and will sets values to 0 for the underlying df input
df.loc[exclude, precip_col] = 0
return df[precip_col].groupby(day_keys).transform('sum')
[15]:
# create a new
dly_chg[dly_chg < 2* param['precision']] = 0
ratio[prorate] = dly_chg[prorate] / likely_dly_ppt[prorate]
new_df = pd.DataFrame(index=df.index, columns=['precip', 'adj_precip'])
new_df['precip'] = new_df['adj_precip'] = df['TOT']
new_df['adj_precip'] = ratio * new_df['adj_precip']
ratio[pd.to_datetime('9/25/19 0000'):pd.to_datetime('9/26/19 0000')]
[15]:
Date
2019-09-25 00:00:00 0.000000
2019-09-25 00:05:00 0.000000
2019-09-25 00:10:00 0.000000
2019-09-25 00:15:00 0.000000
2019-09-25 00:20:00 0.000000
...
2019-09-25 23:40:00 0.000000
2019-09-25 23:45:00 0.000000
2019-09-25 23:50:00 0.000000
2019-09-25 23:55:00 0.000000
2019-09-26 00:00:00 0.544865
Name: ratio, Length: 289, dtype: float64
[18]:
%timeit find_dly_over_accum(df, 0.288, nprecision=5, tank_col='INST', precip_col='TOT')
20.8 ms ± 363 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
[19]:
%timeit over_accum_dly = diurnal_flux.groupby(day_grp).transform('max')
5.18 ms ± 43.6 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
[348]:
def prorate_precip_during_tank_flux(df, precision, diurnal_flux, tank_col='INST', precip_col='TOT'):
# when is the tank fluctuating (5 min)
# diurnal_flux = flag_precip_during_tank_flux(df, precision, tank_col=tank_col, ppt_col=precip_col)
# group data by day
day_grp = pd.Grouper(freq='1D')
# what days have tank fluctuations
over_accum_dly = diurnal_flux.groupby(day_grp).transform('max')
# get dly change in tank level and
dly_chg = calc_dly_tank_change(df, day_grp, 0.288, tank_col=tank_col)
# set daily change to zero if near precision limit
dly_chg[dly_chg < 2* precision] = 0
# get dly precip excluding 5 min fluctuations
likely_dly_ppt = calc_dly_precip(df, day_grp, precip_col=precip_col, exclude=diurnal_flux)
# what days have rain (can't divide by 0)
rain_day = likely_dly_ppt > 0
# select days to prorate where it is raining, and accumulated more than the tank
prorate = rain_day & over_accum_dly
# create a ratio of 1 for all precip values
ratio = pd.Series(1.0, name='ratio', index=dly_chg.index)
# create a different ratio for values to be prorated
ratio[prorate] = dly_chg[prorate] / likely_dly_ppt[prorate]
new_df = pd.DataFrame(index=df.index, columns=['precip', 'adj_precip'])
new_df['precip'] = df[precip_col].copy(deep=True)
new_df['adj_precip'] = df[precip_col].copy(deep=True)
new_df['adj_precip'] *= ratio
return new_df, ratio
[33]:
prorate_precip_during_tank_flux(dfqa, 0.288)
[37]:
%lprun -f prorate_precip_during_tank_flux prorate_precip_during_tank_flux(dfqa, 0.288)
Timer unit: 1e-09 s
Total time: 0.112648 s
File: /var/folders/vs/y0_kk_gj2jxcb2z5xvlgv9g80000gq/T/ipykernel_84183/2154619301.py
Function: prorate_precip_during_tank_flux at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def prorate_precip_during_tank_flux(df, precision):
2 # when is the tank fluctuating (5 min)
3 1 69927000.0 7e+07 62.1 diurnal_flux = flag_precip_during_tank_flux(df, precision)
4
5 # group data by day
6 1 96000.0 96000.0 0.1 day_grp = pd.Grouper(freq='1D')
7
8 # what days have tank fluctuations
9 1 6845000.0 7e+06 6.1 over_accum_dly = diurnal_flux.groupby(day_grp).transform('max')
10
11 # get dly change in tank level and
12 1 13130000.0 1e+07 11.7 dly_chg = calc_dly_tank_change(df, day_grp, 0.288, tank_col='INST')
13 # set daily change to zero if near precision limit
14 1 1832000.0 2e+06 1.6 dly_chg[dly_chg < 2* precision] = 0
15
16 # get dly precip excluding 5 min fluctuations
17 1 8976000.0 9e+06 8.0 likely_dly_ppt = calc_dly_precip(df, day_grp, precip_col='TOT', exclude=diurnal_flux)
18 # what days have rain
19 1 117000.0 117000.0 0.1 rain_day = likely_dly_ppt > 0
20
21 # select days to prorate where it is raining, and accumulated more than the tank
22 1 48000.0 48000.0 0.0 prorate = rain_day & over_accum_dly
23
24 # create a ratio of 1 for all precip values
25 1 161000.0 161000.0 0.1 ratio = pd.Series(1.0, name='ratio', index=dly_chg.index)
26 # create a different ratio for values to be prorated
27 1 4280000.0 4e+06 3.8 ratio[prorate] = dly_chg[prorate] / likely_dly_ppt[prorate]
28
29 1 5196000.0 5e+06 4.6 new_df = pd.DataFrame(index=df.index, columns=['precip', 'adj_precip'])
30 1 569000.0 569000.0 0.5 new_df['precip'] = new_df['adj_precip'] = df['TOT']
31
32 1 1471000.0 1e+06 1.3 new_df['adj_precip'] *= ratio
The heavy line (62%) won’t be in the final code because it will have been calculated already. The 4.6% line also won’t be there and the 8% line has already been worked on. That leaves the lines that are 11.7 and 3.8%.
[39]:
%%timeit
# create a ratio of 1 for all precip values
ratio = pd.Series(1.0, name='ratio', index=dly_chg.index)
# create a different ratio for values to be prorated
ratio[prorate] = dly_chg[prorate] / likely_dly_ppt[prorate]
1.47 s ± 22 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[42]:
%%timeit
ratio = (dly_chg.div(likely_dly_ppt)
.where(prorate, other=1.0)
.to_frame('ratio'))
3.98 s ± 49.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
[44]:
import numpy as np
[48]:
%%timeit
ratio = pd.DataFrame({
'ratio': np.where(prorate,
dly_chg / likely_dly_ppt,
1.0)
}, index=df.index)
3.99 s ± 72.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I think that’s about as good as it’s going to be.
UPLO¶
UPLO has a limited number of time periods where this was an issue, so a few areas should stand out and be able to be quickly verified. Let’s run it through the combined function and take a look.
[60]:
diurnal_flux = flag_precip_during_tank_flux(qc.df_orig, param['precision'], tank_col='INST', ppt_col='TOT', nprecision=4, window=289)
qc.qa_flags['Set0'] |= diurnal_flux
qc.qa_flags['E'] |= diurnal_flux
qc.qa_events['diurnal_flux'] |= diurnal_flux
[64]:
flux = find_tank_flux(dfqa, 0.288, col='INST', window=289, nprecision=5)
dly_over_accum = find_dly_over_accum(dfqa, 0.288, nprecision=5, tank_col='INST', precip_col='TOT')
raining = dfqa['TOT'] > 0.
[69]:
plt.close('all')
[70]:
plt.figure()
dfqa.loc[:, 'INST'].plot(legend=True, label='tank height')
dfqa.loc[flux, 'INST'].plot(linestyle='', marker='o', grid=True, legend=True, label=' flux')
dfqa.loc[dly_over_accum, 'INST'].plot(linestyle='', marker='.', grid=True, legend=True, label='dly over_acc')
dfqa.loc[raining, 'INST'].plot(linestyle='', marker='x', grid=True, legend=True, label='raining')
[70]:
<Axes: xlabel='Date'>
Well that’s disappointing. The large oscilations during the days are being caught, and the days are being flagged for overaccumulation. But the flux flag is ending before the precip accumulates. This must be the delayed sort of accumulation that we see sometimes with simple_pre.m. Let’s try dropping the threshold for when flux is happening.
[71]:
flux = find_tank_flux(dfqa, 0.288, col='INST', window=289, nprecision=3)
dly_over_accum = find_dly_over_accum(dfqa, 0.288, nprecision=5, tank_col='INST', precip_col='TOT')
raining = dfqa['TOT'] > 0.
[76]:
plt.figure()
dfqa.loc[:, 'INST'].plot(legend=True, label='tank height')
dfqa.loc[flux, 'INST'].plot(linestyle='', marker='o', grid=True, legend=True, label=' flux')
dfqa.loc[dly_over_accum, 'INST'].plot(linestyle='', marker='.', grid=True, legend=True, label='dly over_acc')
dfqa.loc[raining, 'INST'].plot(linestyle='', marker='x', grid=True, legend=True, label='raining')
[76]:
<Axes: xlabel='Date'>
Dropping nprecision from 5 to 3 only flagged an extra one or two points each day, but it added a bunch of false flags on the real precip at the start of this period. I want to try decreasing nprecision just a little more to see what it would take to get the precip flagged.
[77]:
flux = find_tank_flux(dfqa, 0.288, col='INST', window=289, nprecision=2)
dly_over_accum = find_dly_over_accum(dfqa, 0.288, nprecision=5, tank_col='INST', precip_col='TOT')
raining = dfqa['TOT'] > 0.
[78]:
plt.figure()
dfqa.loc[:, 'INST'].plot(legend=True, label='tank height')
dfqa.loc[flux, 'INST'].plot(linestyle='', marker='o', grid=True, legend=True, label=' flux')
dfqa.loc[dly_over_accum, 'INST'].plot(linestyle='', marker='.', grid=True, legend=True, label='dly over_acc')
dfqa.loc[raining, 'INST'].plot(linestyle='', marker='x', grid=True, legend=True, label='raining')
[78]:
<Axes: xlabel='Date'>
This is not being fixed just by adjusting the nprecision levels. I think the prorating will become important here: when the daily change in tank is 0, the daily totals should be prorated to 0, i.e. multiplied by a ratio of 0 / ppt. So the flagged overaccum days are critical. This also should be less of a problem when the flux is positive, which is more common, because in the positive direction the precip shouldn’t trail the flux, but happen in the middle.
It also might be worth adding a delay of up to 3 time periods to try and match the delay in simple_pre.m. Let’s redo this with a compromise nprecision of 4 and test a look ahead.
[316]:
def find_tank_flux(df, precision, col='INST', window=289, nprecision=5, look_ahead=2):
trend = calc_median_filter(df, col=col, window=289)
flux = trend - df[col]
is_flux = abs(flux) > nprecision * precision
for ahead in range(1,look_ahead+1):
is_flux |= is_flux.shift(ahead)
return is_flux
[99]:
flux = find_tank_flux(dfqa, 0.288, col='INST', window=289, nprecision=4, look_ahead=3)
dly_over_accum = find_dly_over_accum(dfqa, 0.288, nprecision=5, tank_col='INST', precip_col='TOT')
raining = dfqa['TOT'] > 0.
[101]:
plt.figure()
dfqa.loc[:, 'INST'].plot(legend=True, label='tank height')
dfqa.loc[flux, 'INST'].plot(linestyle='', marker='o', grid=True, legend=True, label=' flux')
dfqa.loc[dly_over_accum, 'INST'].plot(linestyle='', marker='.', grid=True, legend=True, label='dly over_acc')
dfqa.loc[raining, 'INST'].plot(linestyle='', marker='x', grid=True, legend=True, label='raining')
[101]:
<Axes: xlabel='Date'>
nprecision of 4 is still flagging that large precip event. 5 is the right number for that parameter. It looks like we’re grabbing a little more of the precip with the look ahead. Let’s try it one more time with a 5.
[102]:
flux = find_tank_flux(dfqa, 0.288, col='INST', window=289, nprecision=5, look_ahead=3)
dly_over_accum = find_dly_over_accum(dfqa, 0.288, nprecision=5, tank_col='INST', precip_col='TOT')
raining = dfqa['TOT'] > 0.
[104]:
plt.figure()
dfqa.loc[:, 'INST'].plot(legend=True, label='tank height')
dfqa.loc[flux, 'INST'].plot(linestyle='', marker='o', grid=True, legend=True, label=' flux')
dfqa.loc[dly_over_accum, 'INST'].plot(linestyle='', marker='.', grid=True, legend=True, label='dly over_acc')
dfqa.loc[raining, 'INST'].plot(linestyle='', marker='x', grid=True, legend=True, label='raining')
[104]:
<Axes: xlabel='Date'>
At 5 we lost all the false flags, but we also lost a lot of the precip we were catching. Zooming in closely on each day, increasing the look ahead would only buy us 1 more flag on bad precip… Let’s try lookahead 4 and nprecision 4.5. See if that does a little better.
[106]:
flux = find_tank_flux(dfqa, 0.288, col='INST', window=289, nprecision=4.5, look_ahead=4)
dly_over_accum = find_dly_over_accum(dfqa, 0.288, nprecision=5, tank_col='INST', precip_col='TOT')
raining = dfqa['TOT'] > 0.
[107]:
plt.figure()
dfqa.loc[:, 'INST'].plot(legend=True, label='tank height')
dfqa.loc[flux, 'INST'].plot(linestyle='', marker='o', grid=True, legend=True, label=' flux')
dfqa.loc[dly_over_accum, 'INST'].plot(linestyle='', marker='.', grid=True, legend=True, label='dly over_acc')
dfqa.loc[raining, 'INST'].plot(linestyle='', marker='x', grid=True, legend=True, label='raining')
[107]:
<Axes: xlabel='Date'>
Well, at least it’s getting better than half on those two days! The pro-rating will probably zero it out anyhow. Let’s find our problem days.
[108]:
plt.figure()
dfqa.loc[:, 'INST'].plot(legend=True, label='tank height')
dfqa.loc[flux, 'INST'].plot(linestyle='', marker='o', grid=True, legend=True, label=' flux')
dfqa.loc[dly_over_accum, 'INST'].plot(linestyle='', marker='.', grid=True, legend=True, label='dly over_acc')
dfqa.loc[raining, 'INST'].plot(linestyle='', marker='x', grid=True, legend=True, label='raining')
[108]:
<Axes: xlabel='Date'>
Everything under nprecision 5 seems problematic. But it is working nicely on the diurnal fluxes…One last try. 4.75.
[109]:
flux = find_tank_flux(dfqa, 0.288, col='INST', window=289, nprecision=4.75, look_ahead=4)
[110]:
plt.figure()
dfqa.loc[:, 'INST'].plot(legend=True, label='tank height')
dfqa.loc[flux, 'INST'].plot(linestyle='', marker='o', grid=True, legend=True, label=' flux')
dfqa.loc[dly_over_accum, 'INST'].plot(linestyle='', marker='.', grid=True, legend=True, label='dly over_acc')
dfqa.loc[raining, 'INST'].plot(linestyle='', marker='x', grid=True, legend=True, label='raining')
[110]:
<Axes: xlabel='Date'>
That does an admirable job; it got rid of the false flags on the precip, and added at least 1 precip that wasn’t flagged before. I think that’s as good as we’re getting. The rest is up to the prorating. We need to be careful with the prorating because of it’s power to zero out and redistribute precip.
First add the look-ahead to the flag function. Then check where the combined flux and overaccum are. If it is not a small concentrated number, check the overaccum thershold.
[310]:
def flag_precip_during_tank_flux(df, precision, tank_col='INST', ppt_col='TOT', fluxprecision=5, accprecision=5,
window=289, lookahead=3):
flux = find_tank_flux(df, precision, col=tank_col, window=window, nprecision=fluxprecision, look_ahead=lookahead)
dly_over_accum = find_dly_over_accum(df, precision, nprecision=accprecision, tank_col=tank_col, precip_col=ppt_col)
raining = df[ppt_col] > 0.
diurnal_flux = flux & dly_over_accum & raining
#self.qa_flags['Set0'] |= diurnal_flux
#self.qa_flags['E'] |= diurnal_flux
#self.qa_events['diurnal_flux'] |= diurnal_flux
return diurnal_flux
[117]:
plt.close(14)
[123]:
diurnal_flux = flag_precip_during_tank_flux(qc.df_orig, param['precision'], tank_col='INST', ppt_col='TOT',
fluxprecision=4.75, accprecision=5, lookahead=4, window=289)
qc.qa_flags['Set0'] |= diurnal_flux
qc.qa_flags['E'] |= diurnal_flux
qc.qa_events['diurnal_flux'] = diurnal_flux
[121]:
plt.close(14)
[122]:
plt.figure()
dfqa.loc[:, 'INST'].plot(legend=True, label='tank height')
dfqa.loc[diurnal_flux, 'INST'].plot(linestyle='', marker='.', grid=True, legend=True, label='diurnal flux')
[122]:
<Axes: xlabel='Date'>
OK, so 37 moments to check out and confirm. Let’s do this.
[165]:
qc.qa_events['diurnal_flux'] = diurnal_flux
flags = main.apply_all_flags(qc.df_orig, qc.qa_flags, qc.qa_events, param)
trend = calc_median_filter(dfqa, col='INST', window=289)
# the plotting function is expecting a very specific format for a trend.
trend = pd.DataFrame(data=trend, index=dfqa.index, columns=['mean'])
[166]:
plt.close(15)
%load_ext autoreload
%autoreload explicit
%aimport post_gce_qc.qaqc
import post_gce_qc.qaqc
The autoreload extension is already loaded. To reload it, use:
%reload_ext autoreload
[167]:
day = pd.to_datetime('6/5/22')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[167]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2022-06-05 00:00:00'}, xlabel='Date'>)
[168]:
day_key = pd.Grouper(freq='1D')
dly_chg = calc_dly_tank_change(dfqa, day_key, 0.2794, tank_col='INST')
dly_ppt = calc_dly_precip(dfqa, day_key, precip_col='TOT')
[169]:
dly_ppt[day]
[169]:
2.7300000190734863
[170]:
dly_chg[day]
[170]:
-395.90979999999996
Ooops. Well that’s fixable. Tank changes < 0 need to be ignored. I’ll write that into the function.
[307]:
def find_dly_over_accum(df, precision, nprecision=2, tank_col='INST', precip_col='TOT', get_val=False):
day_grp = pd.Grouper(freq='1D') # df.index.to_series().dt.date #
dly_chg = calc_dly_tank_change(df, day_grp, precision, tank_col=tank_col)
dly_ppt = calc_dly_precip(df, day_grp, precip_col=precip_col)
no_drain = dly_chg > 0
if not get_val:
return (dly_ppt > (dly_chg + nprecision * precision)) & no_drain
elif get_val:
return dly_chg, dly_ppt
[186]:
qc = qaqc.QaRules(df, qa_params=param)
dfqa = qc.df_orig
diurnal_flux = flag_precip_during_tank_flux(qc.df_orig, param['precision'], tank_col='INST', ppt_col='TOT',
fluxprecision=4.75, accprecision=5, lookahead=4, window=289)
qc.qa_flags['Set0'] |= diurnal_flux
qc.qa_flags['E'] |= diurnal_flux
qc.qa_events['diurnal_flux'] = diurnal_flux
flags = main.apply_all_flags(qc.df_orig, qc.qa_flags, qc.qa_events, param)
[183]:
plt.figure()
dfqa.loc[:, 'INST'].plot(legend=True, label='tank height')
dfqa.loc[diurnal_flux, 'INST'].plot(linestyle='', marker='.', grid=True, legend=True, label='diurnal flux')
[183]:
<Axes: xlabel='Date'>
That drops the total number of moments to inspect down to 14! Excellent improvement. Bummer that we can’t ever correct on drain days, but you can’t use the midnight to mignight tank values on those days, so you can never get a new precip total.
[187]:
plt.close(17)
[188]:
day = pd.to_datetime('10/7/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[188]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-10-07 00:00:00'}, xlabel='Date'>)
That’s fantastic, it caught all of the bad precip!
[189]:
plt.close(18)
[190]:
day = pd.to_datetime('10/6/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[190]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-10-06 00:00:00'}, xlabel='Date'>)
[191]:
dly_chg[day]
[191]:
-0.2794
[192]:
dly_ppt[day]
[192]:
2.200000047683716
Ughh, I stumbled on that one. It’s not flagging anything because it’s a negative tank difference. But it’s only negative by one precision level. All precip should be zeroed out here. Let’s add yet another filter.
[436]:
def find_dly_over_accum(df, precision, nprecision=2, tank_col='INST', precip_col='TOT', get_val=False):
day_grp = pd.Grouper(freq='1D') # df.index.to_series().dt.date #
dly_chg = calc_dly_tank_change(df, day_grp, precision, tank_col=tank_col)
dly_ppt = calc_dly_precip(df, day_grp, precip_col=precip_col)
# any tank change that is within 2 precision levels is considered flat
dly_chg[abs(dly_chg) < 2* precision] = 0
# no negative changes (larger than 2 precision levels)
no_drain = dly_chg >= 0
if not get_val:
return (dly_ppt > (dly_chg + nprecision * precision)) & no_drain
elif get_val:
return dly_chg, dly_ppt
[435]:
qc = qaqc.QaRules(df, qa_params=param)
dfqa = qc.df_orig
diurnal_flux = flag_precip_during_tank_flux(qc.df_orig, param['precision'], tank_col='INST', ppt_col='TOT',
fluxprecision=4.75, accprecision=5, lookahead=4, window=289)
qc.qa_flags['Set0'] |= diurnal_flux
qc.qa_flags['E'] |= diurnal_flux
qc.qa_events['diurnal_flux'] = diurnal_flux
flags = main.apply_all_flags(qc.df_orig, qc.qa_flags, qc.qa_events, param)
[195]:
day = pd.to_datetime('10/6/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[195]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-10-06 00:00:00'}, xlabel='Date'>)
Excellent. That fixed this one.
[196]:
day = pd.to_datetime('10/5/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[196]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-10-05 00:00:00'}, xlabel='Date'>)
[201]:
dly_chg[day]
[201]:
0.2794
[202]:
dly_ppt[day]
[202]:
3.4000000953674316
Not perfect. But it caught the really large (>1mm) value. And it will prorate the rest down to 0.
[199]:
plt.close(24)
[200]:
day = pd.to_datetime('9/30/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='2D', auto_qa_event=qc.qa_events, running_tank=trend)
[200]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-09-30 00:00:00'}, xlabel='Date'>)
[203]:
dly_chg[day]
[203]:
0.2794
Not sure aboutu the blips in the middle, but that will still get prorated to 0.
[207]:
plt.close(26)
[208]:
day = pd.to_datetime('9/23/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='5D', auto_qa_event=qc.qa_events, running_tank=trend)
[208]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-09-23 00:00:00'}, xlabel='Date'>)
[209]:
dly_chg[day]
[209]:
5.0291999999999994
This stretch has been looked at to death. It’s as good as it will get. The middle two days will be prorated down to 0. The worst moments in the other 2 days will be removed. The last day actually looks like it’s nearly all real precip that remains.
[210]:
day = pd.to_datetime('9/21/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[210]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-09-21 00:00:00'}, xlabel='Date'>)
[211]:
dly_chg[day]
[211]:
0.2794
All of that will prorate to 0 ppt.
[212]:
day = pd.to_datetime('9/14/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[212]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-09-14 00:00:00'}, xlabel='Date'>)
[213]:
dly_chg[day]
[213]:
0.2794
It’s becoming clear how influential the prorating is. This is yet another case where all values will be prorated to 0. It is doing a nice job of catching the oscillation, though. This characteristic is most important on days that had real precip, but also had an oscillation. So it is good to get confirmation that it is working well.
[224]:
plt.close(28)
[223]:
day = pd.to_datetime('8/22/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[223]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-08-22 00:00:00'}, xlabel='Date'>)
[215]:
dly_chg[day]
[215]:
132.4356
[220]:
strt, end = pd.to_datetime('8/22/19 0930'), pd.to_datetime('8/22/19 1040')
qc.df_orig.loc[strt:end]
[220]:
| INST | INST_Flag | TOT | TOT_Flag | ACC | ACC_Flag | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2019-08-22 09:30:00 | 131.100006 | <NA> | 0.0 | <NA> | 2202.48999 | <NA> |
| 2019-08-22 09:35:00 | 14.28 | R | 0.0 | R | 2202.48999 | R |
| 2019-08-22 09:40:00 | 167.100006 | R | 0.0 | <NA> | 2355.310059 | <NA> |
| 2019-08-22 09:45:00 | 166.699997 | <NA> | 0.0 | <NA> | 2355.310059 | <NA> |
| 2019-08-22 09:50:00 | 202.100006 | R | 0.0 | J | 2390.310059 | J |
| 2019-08-22 09:55:00 | 202.100006 | <NA> | 0.0 | <NA> | 2390.310059 | <NA> |
| 2019-08-22 10:00:00 | 201.699997 | <NA> | 0.0 | <NA> | 2390.310059 | <NA> |
| 2019-08-22 10:05:00 | 202.0 | <NA> | 0.0 | <NA> | 2390.310059 | <NA> |
| 2019-08-22 10:10:00 | 202.0 | <NA> | 0.0 | <NA> | 2390.310059 | <NA> |
| 2019-08-22 10:15:00 | 201.699997 | <NA> | 0.0 | <NA> | 2390.310059 | <NA> |
| 2019-08-22 10:20:00 | 14.25 | R | 0.0 | R | 2390.310059 | R |
| 2019-08-22 10:25:00 | 70.18 | R | 0.0 | <NA> | 2446.23999 | <NA> |
| 2019-08-22 10:30:00 | 261.899994 | R | 191.720001 | J | 2637.959961 | J |
| 2019-08-22 10:35:00 | 262.200012 | <NA> | 0.3 | <NA> | 2638.26001 | <NA> |
| 2019-08-22 10:40:00 | 262.200012 | <NA> | 0.0 | <NA> | 2638.26001 | <NA> |
See NotesDB ID 8140. This was a day where the tank gauge was removed and cleaned. These methods weren’t intended to capture that, but it is a happy accident that it did. Any prorating will simply be applied to 0 precip amounts. I’ll put in a manual flag all the same. Should be M for missing with NAN values for this period.
[225]:
day = pd.to_datetime('7/6/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[225]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-07-06 00:00:00'}, xlabel='Date'>)
[226]:
dly_chg[day]
[226]:
0.5588
That will still prorate to 0. it caught but 3 of the precip values during the dip, though.
[230]:
plt.close(31)
[231]:
day = pd.to_datetime('7/11/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[231]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-07-11 00:00:00'}, xlabel='Date'>)
[232]:
dly_chg[day]
[232]:
1.3969999999999998
[233]:
dly_ppt[day]
[233]:
3.4000000953674316
That seems like it got rid of all the precip that happened during the flux. It does seem like some of that trailing precip is questionable, but prorating will cut the total by > 60%. So it still seems like a win.
[241]:
day = pd.to_datetime('3/30/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[241]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-03-30 00:00:00'}, xlabel='Date'>)
[242]:
dly_chg[day]
[242]:
-0.2794
Prorate to 0!
[244]:
day = pd.to_datetime('3/10/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[244]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-03-10 00:00:00'}, xlabel='Date'>)
[245]:
day = pd.to_datetime('3/2/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[245]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-03-02 00:00:00'}, xlabel='Date'>)
[246]:
day = pd.to_datetime('2/21/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[246]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-02-21 00:00:00'}, xlabel='Date'>)
[247]:
dly_chg[day]
[247]:
1.6764
So there was a small increase that day, but that eliminates the precip during the dip really well, and gets rid of the more extreme high amounts. The prorating should look nice.
[248]:
day = pd.to_datetime('2/18/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[248]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-02-18 00:00:00'}, xlabel='Date'>)
[249]:
day = pd.to_datetime('9/11/18')
flags.plot_flagged_day(day, 'UPL02', tdelta='2D', auto_qa_event=qc.qa_events, running_tank=trend)
[249]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2018-09-11 00:00:00'}, xlabel='Date'>)
[250]:
day = pd.to_datetime('2/15/18')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[250]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2018-02-15 00:00:00'}, xlabel='Date'>)
[251]:
dly_chg[day]
[251]:
3.9116
[252]:
dly_ppt[day]
[252]:
7.900000095367432
Perfect, the prorating will remove 4 mm of precip, and 3.5mm of it was all in that false oscillation. Good catch. The prorating should work well there.
[236]:
day = pd.to_datetime('7/16/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[236]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-07-16 00:00:00'}, xlabel='Date'>)
Nothing to flag but some inscurtible and small precip after the flux has stabilized.
[237]:
day = pd.to_datetime('7/10/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[237]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-07-10 00:00:00'}, xlabel='Date'>)
[239]:
dly_chg[day]
[239]:
-1.1176
It’s hard to understand how simple_pre.m calculated precip at the start of the dip… But the dip isn’t deep enough to trigger the flux protocol anyways. Plus it has a tank drop, so it will be excluded altogether.
[240]:
day = pd.to_datetime('7/3/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[240]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-07-03 00:00:00'}, xlabel='Date'>)
[243]:
day = pd.to_datetime('3/28/19')
flags.plot_flagged_day(day, 'UPL02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[243]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL02 - 2019-03-28 00:00:00'}, xlabel='Date'>)
[343]:
plt.close(43)
[344]:
new_df, ratio = prorate_precip_during_tank_flux(flags.data, param['precision'], precip_col='adj_precip', tank_col='tank_height')
new_acc = cross_probe_qc.BuildXTable.calc_wy_acc(new_df)
new_acc.plot(grid=True, legend=True)
[344]:
<Axes: xlabel='Date'>
[336]:
ratio.describe()
[336]:
count 525888.0
mean 1.0
std 0.0
min 1.0
25% 1.0
50% 1.0
75% 1.0
max 1.0
Name: ratio, dtype: float64
Well that didn’t work. #### Debug prorating
[340]:
df = flags.data
precision = param['precision']
tank_col = 'tank_height'
precip_col = 'adj_precip'
# when is the tank fluctuating (5 min)
diurnal_flux = flag_precip_during_tank_flux(df, precision, tank_col='tank_height', ppt_col='precip',
fluxprecision=4.75, accprecision=5, lookahead=4, window=289)
# group data by day
day_grp = pd.Grouper(freq='1D')
# what days have tank fluctuations
over_accum_dly = diurnal_flux.groupby(day_grp).transform('max')
[341]:
diurnal_flux.describe()
[341]:
count 525888
unique 2
top False
freq 525661
dtype: object
Important lessons! Since all of the flux periods where the tank is oscillating are set to 0, you can’t run the process on adjusted precip, where all of the ppt is already set to 0. It must be run on raw precip.
[345]:
new_df, ratio = prorate_precip_during_tank_flux(flags.data, param['precision'], precip_col='precip', tank_col='tank_height')
new_acc = cross_probe_qc.BuildXTable.calc_wy_acc(new_df)
new_acc.plot(grid=True, legend=True)
[345]:
<Axes: xlabel='Date'>
[347]:
ratio.describe()
[347]:
count 525888.000000
mean 1.037466
std 1.911580
min 0.000000
25% 1.000000
50% 1.000000
75% 1.000000
max 82.619999
Name: ratio, dtype: float64
Second important lesson: while the diurnal flux can only be identified on the raw precip, you can’t try to prorate raw numbers with all of their crazy spikes. Above you see the huge tank shift when the tank was clean leads to a ratio of 82* the precip that’s there.
[362]:
def prorate_precip_during_tank_flux(df, precision, diurnal_flux, tank_col='INST', precip_col='TOT', max_prorate=15):
# when is the tank fluctuating (5 min)
# diurnal_flux = flag_precip_during_tank_flux(df, precision, tank_col=tank_col, ppt_col=precip_col)
# group data by day
day_grp = pd.Grouper(freq='1D')
# what days have tank fluctuations
over_accum_dly = diurnal_flux.groupby(day_grp).transform('max')
# get dly change in tank level and
dly_chg = calc_dly_tank_change(df, day_grp, 0.288, tank_col=tank_col)
# set daily change to zero if near precision limit
dly_chg[dly_chg < 2* precision] = 0
# get dly precip excluding 5 min fluctuations
likely_dly_ppt = calc_dly_precip(df, day_grp, precip_col=precip_col, exclude=diurnal_flux)
# what days have rain (can't divide by 0)
rain_day = likely_dly_ppt > 0
# select days to prorate where it is raining, and accumulated more than the tank
prorate = rain_day & over_accum_dly
# create a ratio of 1 for all precip values
ratio = pd.Series(1.0, name='ratio', index=dly_chg.index)
# create a different ratio for values to be prorated
ratio[prorate] = dly_chg[prorate] / likely_dly_ppt[prorate]
ratio[ratio>max_prorate] = 0
new_df = pd.DataFrame(index=df.index, columns=['precip', 'adj_precip'])
new_df['precip'] = df[precip_col].copy(deep=True)
new_df['adj_precip'] = df[precip_col].copy(deep=True)
new_df['adj_precip'] *= ratio
return new_df, ratio
[375]:
plt.close(46)
[363]:
new_df, ratio = prorate_precip_during_tank_flux(flags.data, param['precision'], diurnal_flux,
precip_col='adj_precip', tank_col='tank_height')
new_df.loc[qc.qa_flags['Set0'], 'adj_precip']= 0
new_acc = cross_probe_qc.BuildXTable.calc_wy_acc(new_df)
[ ]:
[376]:
plt.figure()
ax1 = dfqa.ACC.plot(grid=True, legend=True, label='raw data')
new_acc.plot(grid=True, legend=True, ax=ax1, label=['adj_precip', 'prorated_precip'])
[376]:
<Axes: xlabel='Date'>
[379]:
plt.figure()
ax1 = dfqa.ACC.plot(grid=True, legend=True, label='raw data')
new_acc.plot(grid=True, legend=True, ax=ax1, label=['adj_precip', 'prorated_precip'])
[379]:
<Axes: xlabel='Date'>
[377]:
totals = new_acc.groupby(pd.Grouper(freq='YE-SEP')).max()
totals['adj_precip'] - totals['precip']
[377]:
2018-09-30 2.343909
2019-09-30 -10.315977
2020-09-30 -1.252017
2021-09-30 0.000101
2022-09-30 3.025997
Freq: YE-SEP, dtype: double[pyarrow]
[380]:
rawtotals = dfqa.ACC.groupby(pd.Grouper(freq='YE-SEP')).max()
totals['adj_precip'] - rawtotals
[380]:
2018-09-30 -10.765954
2019-09-30 -503.835997
2020-09-30 -17.451969
2021-09-30 0.002054
2022-09-30 -1.553837
Freq: YE-SEP, dtype: double[pyarrow]
Adding a funky max ratio isn’t very clean, but it fixed that problem when the sensor was removed (tank was clean). We can see here the difference is in two stages: first the flux periods are zeroed out. Then they are prorated. We can see that the prorating has modest impacts that range from -10 to 3 mm a year. However, zeroing out periods where the tank is fluxing removed 503 mm from 2019 and 17 mm from the early parts of WY 20 (fall of 2019). Of that 503 mm, only 191 mm are from the day the tank was cleaned, so thats 313 mm of diurnal fluctuations being removed just by zeroing out the flux periods.
I didn’t expect the prorating to have any positive moments, but the fact that the prorating creates small, subtle changes outside of the flux periods seems like a good thing.
[463]:
df = prov.pivot_on_probe(prov.df, 'UPL', '01')
# apply rules
#-------------------------
param = qaqc._load_yaml('../qa_param.yaml')['UPL_01']
qc = qaqc.QaRules(df, qa_params=param)
dfqa = qc.df_orig
[464]:
diurnal_flux = flag_precip_during_tank_flux(qc.df_orig, param['precision'], tank_col='INST', ppt_col='TOT',
fluxprecision=4.75, accprecision=5, lookahead=4, window=289)
qc.qa_flags['Set0'] |= diurnal_flux
qc.qa_flags['E'] |= diurnal_flux
qc.qa_events['diurnal_flux'] = diurnal_flux
[465]:
plt.close(61)
[466]:
plt.figure()
dfqa.loc[:, 'INST'].plot(legend=True, label='tank height')
dfqa.loc[diurnal_flux, 'INST'].plot(linestyle='', marker='.', grid=True, legend=True, label='diurnal flux')
[466]:
<Axes: xlabel='Date'>
[450]:
day = pd.to_datetime('11/3/21 0000')
flags.plot_flagged_day(day, 'UPL01', tdelta='2D', auto_qa_event=qc.qa_events, running_tank=trend)
[450]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL01 - 2021-11-03 00:00:00'}, xlabel='Date'>)
This is a problem with rain following a drain event. There is already a filter for days when the tank is drained, however this event occurs within 24 hours of a drain, but after midnight on the next day. That would require some sort of filter on the flux side, which doesn’t run midnight to midnight. We don’t have max and min values for the rolling window.
[441]:
day_key = pd.Grouper(freq='1D')
dly_chg = calc_dly_tank_change(dfqa, day_key, 0.2794, tank_col='INST')
dly_ppt = calc_dly_precip(dfqa, day_key, precip_col='TOT')
trend = calc_median_filter(dfqa, col='INST', window=289)
# the plotting function is expecting a very specific format for a trend.
trend = pd.DataFrame(data=trend, index=dfqa.index, columns=['mean'])
[443]:
day += pd.to_timedelta('1D')
dly_chg[day]
[443]:
24.028399999999998
[444]:
dly_ppt[day]
[444]:
27.209999084472656
That’s more than double the allowable difference between tank and precip. Which just adds to the mystery of simple_pre.m on a rainy day like this. So, ignoring that our precip math outpaces the tank change, the problem is, I would have to take a rolling first and last measurement of my median window. And at VARA, there are days where the diurnal fluctuation is > 100mm. So if all moments where there is a big negative between first and last are dropped, we won’t flag any of the worst days at
VARA. Of course, I don’t know for sure that this will work on those days as is anyhow.
At the moment, I think it makes more sense to manually unflag this event, since it is the first and only one found.
[452]:
day = pd.to_datetime('3/23/21')
flags.plot_flagged_day(day, 'UPL01', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[452]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL01 - 2021-03-23 00:00:00'}, xlabel='Date'>)
[471]:
plt.close(62)
[469]:
day_key = pd.Grouper(freq='1D')
dly_chg = calc_dly_tank_change(dfqa, day_key, 0.2794, tank_col='INST')
dly_ppt = calc_dly_precip(dfqa, day_key, precip_col='TOT')
trend = calc_median_filter(dfqa, col='INST', window=289)
# the plotting function is expecting a very specific format for a trend.
trend = pd.DataFrame(data=trend, index=dfqa.index, columns=['mean'])
[472]:
flags = main.apply_all_flags(qc.df_orig, qc.qa_flags, qc.qa_events, param)
day = pd.to_datetime('2/28/21')
flags.plot_flagged_day(day, 'UPL01', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[472]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL01 - 2021-02-28 00:00:00'}, xlabel='Date'>)
If anything, these last two are underflagging. But they will trigger a proration to the tank change amount, so they will get further filtered.
[475]:
plt.close(63)
[476]:
day = pd.to_datetime('9/5/19')
flags.plot_flagged_day(day, 'UPL01', tdelta='5D', auto_qa_event=qc.qa_events, running_tank=trend)
[476]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL01 - 2019-09-05 00:00:00'}, xlabel='Date'>)
Wow, big spike, for no aparent reason. Not exactly the sort of fluctuation we had in mind, but agood catch. But more worrying is the fixed precip at 35mm. Let’s make sure our repeating value method gets the rest.
[477]:
end = day + pd.to_timedelta('3D')
qc.qa_events.loc[day:end, 'duplicate'].describe()
[477]:
count 865
unique 1
top False
freq 865
Name: duplicate, dtype: object
[483]:
dfqa.loc[day:end, 'TOT'].sum()
[483]:
23288.80000000447
[478]:
qc.flag_repeating_val_precip(min_ppt_nprecision=0, flat_tank_nprecision=1, flat_ppt_nprecision=0,
min_number_repeating=5)
qc.qa_events.loc[day:end, 'duplicate'].describe()
[478]:
count 865
unique 2
top True
freq 672
Name: duplicate, dtype: object
[487]:
dfqa.loc[~qc.qa_events.duplicate&~qc.qa_events.diurnal_flux, 'TOT'][day:end].sum()
[487]:
70.30000000447035
[492]:
new_ppt, ratio = prorate_precip_during_tank_flux(dfqa, precision, diurnal_flux, tank_col='INST', precip_col='TOT')
[498]:
new_ppt.loc[day:end, 'adj_precip'][~qc.qa_events.duplicate].sum()
[498]:
35.80000000447035
OK, that goes from 23,289 mm precip down to 70. And the number of timesteps flagged as duplicate is 672 out of 865. So I think that is mostly working between the two methods. The prorating then takes it down to a single 35 mm measurement. I’ll add a manual flag for good measure, but I think the combination of quality checks are working.
[474]:
flags = main.apply_all_flags(qc.df_orig, qc.qa_flags, qc.qa_events, param)
day = pd.to_datetime('6/20/18')
flags.plot_flagged_day(day, 'UPL01', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[474]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'UPL01 - 2018-06-20 00:00:00'}, xlabel='Date'>)
50 mm precip is always suspicious. This looks like a good event to flag in a time of year that doesn’t often get much rain. But I’m hard pressed to explain what was actually happening to the sensor at the time. This seems to catch all sorts of unwanted spikes. Which is a nice bonus.
CENT¶
There aren’t any known periods of large oscillations at either probe at CENT, so this should be a pretty quick check.
[381]:
df = prov.pivot_on_probe(prov.df, 'CEN', '01')
# apply rules
#-------------------------
param = qaqc._load_yaml('../qa_param.yaml')['CEN_01']
qc = qaqc.QaRules(df, qa_params=param)
dfqa = qc.df_orig
[382]:
diurnal_flux = flag_precip_during_tank_flux(qc.df_orig, param['precision'], tank_col='INST', ppt_col='TOT',
fluxprecision=4.75, accprecision=5, lookahead=4, window=289)
qc.qa_flags['Set0'] |= diurnal_flux
qc.qa_flags['E'] |= diurnal_flux
qc.qa_events['diurnal_flux'] = diurnal_flux
[383]:
plt.figure()
dfqa.loc[:, 'INST'].plot(legend=True, label='tank height')
dfqa.loc[diurnal_flux, 'INST'].plot(linestyle='', marker='.', grid=True, legend=True, label='diurnal flux')
[383]:
<Axes: xlabel='Date'>
CENT stand alone has no fulctuations!
[499]:
df = prov.pivot_on_probe(prov.df, 'CEN', '02')
# apply rules
#-------------------------
param = qaqc._load_yaml('../qa_param.yaml')['CEN_02']
qc = qaqc.QaRules(df, qa_params=param)
dfqa = qc.df_orig
[500]:
diurnal_flux = flag_precip_during_tank_flux(qc.df_orig, param['precision'], tank_col='INST', ppt_col='TOT',
fluxprecision=5.75, accprecision=5, lookahead=4, window=289)
qc.qa_flags['Set0'] |= diurnal_flux
qc.qa_flags['E'] |= diurnal_flux
qc.qa_events['diurnal_flux'] = diurnal_flux
[447]:
plt.figure()
dfqa.loc[:, 'INST'].plot(legend=True, label='tank height')
dfqa.loc[diurnal_flux, 'INST'].plot(linestyle='', marker='.', grid=True, legend=True, label='diurnal flux')
[447]:
<Axes: xlabel='Date'>
[407]:
plt.close(50)
[501]:
day_key = pd.Grouper(freq='1D')
dly_chg = calc_dly_tank_change(dfqa, day_key, 0.2794, tank_col='INST')
dly_ppt = calc_dly_precip(dfqa, day_key, precip_col='TOT')
trend = calc_median_filter(dfqa, col='INST', window=289)
# the plotting function is expecting a very specific format for a trend.
trend = pd.DataFrame(data=trend, index=dfqa.index, columns=['mean'])
[408]:
flags = main.apply_all_flags(qc.df_orig, qc.qa_flags, qc.qa_events, param)
day = pd.to_datetime('10/13/17')
flags.plot_flagged_day(day, 'CEN02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[408]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'CEN02 - 2017-10-13 00:00:00'}, xlabel='Date'>)
[410]:
dly_chg[day]
[410]:
11.175999999999998
[411]:
dly_ppt[day]
[411]:
21.80000114440918
[506]:
flags = main.apply_all_flags(qc.df_orig, qc.qa_flags, qc.qa_events, param)
[507]:
day = pd.to_datetime('10/13/17')
flags.plot_flagged_day(day, 'CEN02', tdelta='7D', auto_qa_event=qc.qa_events, running_tank=trend)
[507]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'CEN02 - 2017-10-13 00:00:00'}, xlabel='Date'>)
This is a value shift following reboot. The OS was replaced on the logger in an attempt to fix an SA heater bug stemming from the OS. This will be unflagged manually, since there is no way to tell it to ignore spikes like this while paying attention to other spikes.
[510]:
day = pd.to_datetime('11/15/19')
flags.plot_flagged_day(day, 'CEN02', tdelta='1D', auto_qa_event=qc.qa_events, running_tank=trend)
[510]:
(<Axes: xlabel='Date'>,
<Axes: title={'center': 'CEN02 - 2019-11-15 00:00:00'}, xlabel='Date'>)
[509]:
plt.close(67)
[518]:
day =pd.to_datetime('11/15/19 1650')
dfqa.loc[day:].head(20)
[518]:
| INST | INST_Flag | TOT | TOT_Flag | ACC | ACC_Flag | |
|---|---|---|---|---|---|---|
| Date | ||||||
| 2019-11-15 16:50:00 | 72.059998 | M | 0.0 | M | 127.099998 | M |
| 2019-11-15 16:55:00 | 72.059998 | M | 0.0 | M | 127.099998 | M |
| 2019-11-15 17:00:00 | 72.059998 | M | 0.0 | M | 127.099998 | M |
| 2019-11-15 17:05:00 | 73.370003 | <NA> | 0.98 | <NA> | 128.080002 | <NA> |
| 2019-11-15 17:10:00 | 66.029999 | <NA> | 0.0 | R | 128.080002 | R |
| 2019-11-15 17:15:00 | 66.029999 | <NA> | 0.0 | <NA> | 128.080002 | <NA> |
| 2019-11-15 17:20:00 | 66.019997 | <NA> | 0.0 | <NA> | 128.080002 | <NA> |
| 2019-11-15 17:25:00 | 66.019997 | <NA> | 0.0 | <NA> | 128.080002 | <NA> |
| 2019-11-15 17:30:00 | 66.339996 | <NA> | 0.31 | <NA> | 128.389999 | <NA> |
| 2019-11-15 17:35:00 | 66.330002 | <NA> | 0.0 | <NA> | 128.389999 | <NA> |
| 2019-11-15 17:40:00 | 66.489998 | <NA> | 0.15 | <NA> | 128.539993 | <NA> |
| 2019-11-15 17:45:00 | 66.330002 | <NA> | 0.0 | <NA> | 128.539993 | <NA> |
| 2019-11-15 17:50:00 | 66.339996 | <NA> | 0.0 | <NA> | 128.539993 | <NA> |
| 2019-11-15 17:55:00 | 66.18 | <NA> | 0.0 | <NA> | 128.539993 | <NA> |
| 2019-11-15 18:00:00 | 66.339996 | <NA> | 0.0 | <NA> | 128.539993 | <NA> |
| 2019-11-15 18:05:00 | 66.18 | <NA> | 0.0 | <NA> | 128.539993 | <NA> |
| 2019-11-15 18:10:00 | 66.339996 | <NA> | 0.0 | <NA> | 128.539993 | <NA> |
| 2019-11-15 18:15:00 | 66.330002 | <NA> | 0.0 | <NA> | 128.539993 | <NA> |
| 2019-11-15 18:20:00 | 66.330002 | <NA> | 0.0 | <NA> | 128.539993 | <NA> |
| 2019-11-15 18:25:00 | 66.349998 | <NA> | 0.0 | <NA> | 128.539993 | <NA> |
This is another example of an OS replacement causing a level shift. This will be manually unflagged, since there is no criteria to select fluctuations that gets around this.
[ ]: