Summary of Flag Quality Checks¶
As checks are performed throughout this process, flags are assigned. Before flags become final, additional checks ensure that they are internally consistent and do not create confusion or contradictions. Flags can be layered on top of each other during processing, but only one can be assigned to each timestep in the final data.
Accumulating Flags¶
Flags are accumulated in stages.
1. Each probe has a list of rules and checks to be applied. These quality assurance checks are performed by QaRules, and any resulting flags are applied to the data by ApplyFlags.
2. Next, provisional flags are applied to the data. These flags are not allowed to be assigned where there is an existing flag. This allows more specific flags assigned by QaRules to take precedence.
3. Any manual flags are assigned. Manual flags wipe any concurrent flags when they are assigned.
4. Some additional rules are applied to the cleaned dataset. These include checks that require flags to be assigned first, like remove_GCE_F_flags, and checks that require the data to be near final form, such as round_precip_to_min_increment.
5. Cross-probe checks are performed on the cleaned dataset to identify clogs, and clog flags are assigned to the data.
6. Manual flags are reassigned, allowing them to overwrite any clog flags.
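The two assignment behaviours in the stages above (provisional flags only fill gaps, manual flags wipe whatever is there) can be illustrated with a minimal pandas sketch. This is not the package's implementation: the helper names `fill_where_unflagged` and `overwrite_with_manual`, and the use of a single string Series per flag source, are assumptions made only for illustration.

```python
import pandas as pd

def fill_where_unflagged(flags: pd.Series, new: pd.Series) -> pd.Series:
    """Assign `new` flags only where no flag exists yet (empty string)."""
    out = flags.copy()
    empty = out.eq("") & new.ne("")
    out[empty] = new[empty]
    return out

def overwrite_with_manual(flags: pd.Series, manual: pd.Series) -> pd.Series:
    """Hypothetical helper: a manual flag replaces any concurrent flag."""
    out = flags.copy()
    has_manual = manual.notna()
    out[has_manual] = manual[has_manual]
    return out

idx = pd.date_range("2022-01-01", periods=4, freq="5min")
qa_rule = pd.Series(["E", "", "", "U"], index=idx)        # stage 1: QaRules flags
provisional = pd.Series(["Q", "Q", "", ""], index=idx)    # stage 2: provisional flags
manual = pd.Series([None, None, None, "C"], index=idx, dtype="object")  # stage 3

staged = fill_where_unflagged(qa_rule, provisional)  # Q only lands on the unflagged step
staged = overwrite_with_manual(staged, manual)       # manual C replaces the U
print(staged.tolist())
```

In this toy series the provisional Q at the first timestep is dropped because QaRules already assigned an E, while the manual C at the last timestep replaces the earlier U.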
Checking Flags¶
Through this process, multiple checks may have layered different flags on top of each other, so checks are performed to make the flags internally consistent and choose a single final flag. Checks include:
- All NA values are flagged Missing and all Missing values are set as NA.
  - affirm_NaN_flagged_M: Add M flags or reset values to NA to make the data internally consistent.
- All values that have been reset to 0 are flagged as an estimate.
  - affirm_zero_flagged_E: Add E flags (if missing) anywhere that the data has been reset to 0, unless there is an alternate manual flag.
- All CLOG events only have a U, C, or no flag.
  - affirm_CLOG_flagged_UC: Selects timesteps with the event_code CLOG and allows only U, C, or no flag. Provisional data often, but not always, identifies a clog with a manual Q flag. This ensures that Q flags do not override a U or a C, and prevents Q flags from filling periods where a majority of probes agree that there was no precip to be missed by a clog.
- Only one flag can exist for each timestep.
  - affirm_only_one_flag: Overlapping flags are resolved in order of precedence, clearing all other flags:
    1. Manual flags take precedence over any other flag.
    2. Missing (M) flags take next precedence. If the data isn’t there, no other flag would have any meaning.
    3. Undercatch (U) flags are more descriptive than a flag like Questionable (Q).
    4. Cumulative (C), also known as delayed precip. This is the flip side of a U flag, occurring when a clog is released or begins to melt. If the amount of precip is ambiguous and could indicate a clog release or continued undercatch, an undercatch takes precedence. This is more descriptive than a Q flag.
    5. Estimate (E) flags. This should be limited to data that has been intentionally filled, but not during a clog (where a U or C would be applied).
    6. Questionable (Q) flags. These identify data that is suspicious, but give no explanation about what makes the data suspicious or untrustworthy. This is the most generic flag and gives little information.
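The precedence order above can be sketched on a boolean flag table shaped like the ones shown later in this notebook. This is only an illustration under stated assumptions, not the code behind affirm_only_one_flag: the function name `keep_highest_precedence` is hypothetical, manual flags are assumed to have been applied already, and action columns such as Set0 and SetNA are left out.

```python
import pandas as pd

# Precedence from the list above, highest first. Manual flags are assumed
# to have been applied beforehand and are not modelled here.
PRECEDENCE = ["M", "U", "C", "E", "Q"]

def keep_highest_precedence(flags: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which each row keeps only its highest-precedence flag."""
    out = flags.copy()
    already_flagged = pd.Series(False, index=out.index)
    for col in PRECEDENCE:
        # clear this flag wherever a higher-precedence flag was already kept
        out[col] = out[col] & ~already_flagged
        already_flagged |= out[col]
    return out

idx = pd.to_datetime(["2019-04-17 05:00", "2019-04-17 05:05", "2019-04-17 05:10"])
flags = pd.DataFrame(
    {
        "M": [False, True, False],
        "U": [False, False, True],
        "C": [False, False, False],
        "E": [True, False, False],
        "Q": [True, True, True],
    },
    index=idx,
)
resolved = keep_highest_precedence(flags)
print(resolved)
```

Each row of `resolved` ends up with exactly one flag: E, M, and U respectively, with every overlapping Q cleared.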
During the development of these checks, not all probes had been parameterized for cross probe checks to identify clogs. So there are many examples where it is expected that clog flags will be applied in the future.
How Common is This in The Data¶
[1]:
# must install ipympl (Ipython-matplotlib) and nodejs
from ipywidgets.embed import embed_minimal_html
from ipywidgets import Layout
import matplotlib.pyplot as plt
# Jupyter magic to make plots display interactive
%matplotlib ipympl
# expand all plots to comfortable viewing size
plt.rcParams['figure.figsize'] = [8, 5]
Layout(width='400px', height='300px')
import pandas as pd
from numpy import nan, arange, floor
import sys
sys.path.append("../../")
from post_gce_qc import qaqc, data_transfer, cross_probe_qc, main
Get data and run all applicable QaRules.
[2]:
# get data
prov = data_transfer.LoadProvisionalData(strtyr=2019, endyr=2024, file_n='../../config_new.yaml',
fname_base='MS00413_PPT_L1_5min_')
prov.load_ppt_data()
df = prov.pivot_on_probe(prov.df, 'VAR', '02')
param = qaqc._load_yaml(file_n='../../qa_param.yaml')['VAR_02']
# run all QaRules on data
qa_flags, qa_events = main.qc_provisional(df, param)
Apply all flags across multiple sources.
[3]:
# apply flags from all QA
var_flags = qaqc.ApplyFlags(df.index, param['precision'])
# import provisional data and flags
var_flags.import_provisional_data(df)
[4]:
# apply QaRules flags to the data
var_flags.apply_QaRules_flags(qa_events, qa_flags)
# apply GCE flags where applicable
var_flags.apply_GCE_flags()
# apply all manual flags, overwriting any other flags
var_flags.apply_manual_flags(param['manual_flags'])
Change the data based on flagging.
[5]:
var_flags.apply_NAN_val()
var_flags.apply_0_val()
var_flags.remove_GCE_E_flags()
var_flags.remove_GCE_F_flags()
var_flags.prorate_precip_during_tank_flux()
var_flags.round_precip_to_min_increment(scrape_remainder_window=6)
Multiple Flags¶
[6]:
is_flag_col = ~var_flags.flags.columns.isin(['Set0', 'SetNA'])
col = var_flags.flags.columns[is_flag_col]
n_flags = var_flags.flags[col].sum(axis=1)
var_flags.flags[n_flags>1]
[6]:
| Date | Q | U | C | * | SetNA | Set0 | E | M |
|---|---|---|---|---|---|---|---|---|
| 2019-04-17 05:00:00 | True | False | False | False | False | False | True | False |
| 2019-04-17 07:20:00 | True | False | False | False | False | False | True | False |
| 2019-04-17 08:35:00 | True | False | False | False | False | False | True | False |
| 2019-04-17 08:45:00 | True | False | False | False | False | False | True | False |
| 2019-04-17 08:55:00 | True | False | False | False | False | False | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-09-12 19:40:00 | True | False | False | False | False | False | True | False |
| 2024-09-12 19:50:00 | True | False | False | False | False | False | True | False |
| 2024-09-12 20:00:00 | True | False | False | False | False | False | True | False |
| 2024-09-12 20:10:00 | True | False | False | False | False | False | True | False |
| 2024-09-12 21:20:00 | True | False | False | False | False | False | True | False |
1357 rows × 8 columns
Both E and Q flags have been assigned. Let’s see why each was assigned.
[7]:
var_flags.event[n_flags>1]
[7]:
| Date | prov_flag | tank_flag | QaRule_flag | manual_flag | final_flag | event_code | explanation |
|---|---|---|---|---|---|---|---|
| 2019-04-17 05:00:00 | <NA> | Q | E |  |  | INTPRO | ApplyFlags AutoFlag: prorate precip during diu... |
| 2019-04-17 07:20:00 | <NA> | Q | E |  |  | INTPRO | ApplyFlags AutoFlag: prorate precip during diu... |
| 2019-04-17 08:35:00 | <NA> | Q | E |  |  | INTPRO | ApplyFlags AutoFlag: prorate precip during diu... |
| 2019-04-17 08:45:00 | <NA> | Q | E |  |  | INTPRO | ApplyFlags AutoFlag: prorate precip during diu... |
| 2019-04-17 08:55:00 | <NA> | Q | E |  |  | INTPRO | ApplyFlags AutoFlag: prorate precip during diu... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-09-12 19:40:00 | <NA> | Q | E |  |  | INTPRO | ApplyFlags AutoFlag: prorate precip during diu... |
| 2024-09-12 19:50:00 | <NA> | Q | E |  |  | INTPRO | ApplyFlags AutoFlag: prorate precip during diu... |
| 2024-09-12 20:00:00 | <NA> | Q | E |  |  | INTPRO | ApplyFlags AutoFlag: prorate precip during diu... |
| 2024-09-12 20:10:00 | <NA> | Q | E |  |  | INTPRO | ApplyFlags AutoFlag: prorate precip during diu... |
| 2024-09-12 21:20:00 | <NA> | Q | E |  |  | INTPRO | ApplyFlags AutoFlag: prorate precip during diu... |
1357 rows × 7 columns
OK, the tank was flagged questionable in provisional with a broad manual flag, but diurnal flux was found, which led to values being prorated and flagged E. The E with an INTPRO event_code makes it clear that there was a processing error and values were estimated. Let’s select only one final flag.
[8]:
var_flags.affirm_only_one_flag()
352: UserWarning: More than one flag assigned at the same time. Only one flag is retained by precedence.
[9]:
var_flags.flags[n_flags>1]
[9]:
| Date | Q | U | C | * | SetNA | Set0 | E | M |
|---|---|---|---|---|---|---|---|---|
| 2019-04-17 05:00:00 | False | False | False | False | False | False | True | False |
| 2019-04-17 07:20:00 | False | False | False | False | False | False | True | False |
| 2019-04-17 08:35:00 | False | False | False | False | False | False | True | False |
| 2019-04-17 08:45:00 | False | False | False | False | False | False | True | False |
| 2019-04-17 08:55:00 | False | False | False | False | False | False | True | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2024-09-12 19:40:00 | False | False | False | False | False | False | True | False |
| 2024-09-12 19:50:00 | False | False | False | False | False | False | True | False |
| 2024-09-12 20:00:00 | False | False | False | False | False | False | True | False |
| 2024-09-12 20:10:00 | False | False | False | False | False | False | True | False |
| 2024-09-12 21:20:00 | False | False | False | False | False | False | True | False |
1357 rows × 8 columns
The E flag is more descriptive about what happened and takes precedence. The Q flags have been erased.
M Flags Match NA Data¶
It is important that all NA data has an M flag, and that all M flags have NA data. This is key to internal consistency.
During development several cases were found, but the underlying cause was fixed, so there are no remaining examples in the data. To see examples, read the full account of flag QA development.
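Since no live examples remain in the data, the kind of mismatch this check looks for can be shown on a toy series. The sketch below only reports disagreements between NA values and M flags; it does not repair them, and the function name `find_m_na_mismatch` and the sample values are assumptions for illustration, not part of the package.

```python
import numpy as np
import pandas as pd

def find_m_na_mismatch(values: pd.Series, m_flag: pd.Series) -> pd.DataFrame:
    """Report timesteps where NA values and M flags disagree."""
    is_na = values.isna()
    return pd.DataFrame(
        {
            "na_without_M": is_na & ~m_flag,   # value is NA but not flagged M
            "M_without_na": ~is_na & m_flag,   # flagged M but value still present
        }
    )

idx = pd.date_range("2022-07-03 16:35", periods=4, freq="5min")
precip = pd.Series([0.0, np.nan, 0.2, np.nan], index=idx)
m_flag = pd.Series([False, True, True, False], index=idx)

report = find_m_na_mismatch(precip, m_flag)
print(report[report.any(axis=1)])
```

A repair step would then either add the missing M flag or reset the value to NA, which is the behaviour described for affirm_NaN_flagged_M.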
Flag E Where Value Set to 0¶
The Estimate (E) flag is used when prorating is performed or where values are set to 0. Values are set to 0 when there is great confidence that no rain could have fallen at that time. Sometimes this is due to direct (manual) observation that no precip fell over a short period, and sometimes it is due to an anomalous condition, such as diurnal fluctuation, an empty tank, or artificially repeating values, where it is assumed that no actual precip fell. For this reason, the data should be flagged E any time a value is changed to 0. A manual flag from direct observation is an outlying condition, not the norm, and it is more important to be internally consistent and provide clear markers in the data wherever a 0 value is being assumed.
[10]:
mismatch = var_flags.flags['Set0'] & ~var_flags.flags['E']
var_flags.flags[mismatch]
[10]:
| Date | Q | U | C | * | SetNA | Set0 | E | M |
|---|---|---|---|---|---|---|---|---|
| 2022-07-03 16:35:00 | False | False | False | False | False | True | False | False |
| 2022-07-03 16:40:00 | False | False | False | False | False | True | False | False |
| 2022-07-03 16:45:00 | False | False | False | False | False | True | False | False |
| 2022-07-13 16:35:00 | False | False | False | False | False | True | False | False |
[11]:
var_flags.event[mismatch]
[11]:
| Date | prov_flag | tank_flag | QaRule_flag | manual_flag | final_flag | event_code | explanation |
|---|---|---|---|---|---|---|---|
| 2022-07-03 16:35:00 | MM | M | M |  |  | INSREM | ManualFlag: sensor reinstallation. tank raised... |
| 2022-07-03 16:40:00 | MM | M | M |  |  | INSREM | ManualFlag: sensor reinstallation. tank raised... |
| 2022-07-03 16:45:00 | MM | M | M |  |  | INSREM | ManualFlag: sensor reinstallation. tank raised... |
| 2022-07-13 16:35:00 | <NA> | Q |  |  |  | MAINTE | ManualFlag: sensor standpipe drained for repai... |
Here we see that the provisional flag of missing has been transferred to the QaRule flag, because if there is no data, no other flag makes sense. We also see that provisional flagged the tank Q. But these flags are superseded by the manual flag. The manual flag is responsible for resetting the value to 0, effectively creating data where there was none. While the checksheets do confirm that these events occurred during a dry period, in both cases the sensor was removed or the tank was empty, so there is inherently no way to know whether there was any precipitation. Having an E flag when an instrument is removed is internally consistent from the user’s perspective, since they have no way to know that we have a direct observation for this period. So let’s see how this situation looks after it has been checked by its method.
[12]:
var_flags.affirm_zero_flagged_E()
304: UserWarning: Precip set to 0 without E flag or manual flag. E flag added
[13]:
var_flags.flags[mismatch]
[13]:
| Date | Q | U | C | * | SetNA | Set0 | E | M |
|---|---|---|---|---|---|---|---|---|
| 2022-07-03 16:35:00 | False | False | False | False | False | True | True | False |
| 2022-07-03 16:40:00 | False | False | False | False | False | True | True | False |
| 2022-07-03 16:45:00 | False | False | False | False | False | True | True | False |
| 2022-07-13 16:35:00 | False | False | False | False | False | True | True | False |
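The rule applied here can be approximated with a short sketch on a boolean flag table that has Set0 and E columns, as in the tables above. The function name `add_E_where_set0` and the `has_manual_flag` mask are hypothetical stand-ins; how the package actually tracks an alternate manual flag is not shown in this notebook.

```python
import pandas as pd

def add_E_where_set0(flags: pd.DataFrame, has_manual_flag: pd.Series) -> pd.DataFrame:
    """Add an E flag wherever a value was set to 0 without an E or manual flag."""
    out = flags.copy()
    needs_E = out["Set0"] & ~out["E"] & ~has_manual_flag
    out.loc[needs_E, "E"] = True
    return out

idx = pd.date_range("2022-07-03 16:35", periods=4, freq="5min")
flags = pd.DataFrame(
    {
        "Set0": [True, True, False, True],
        "E": [False, True, False, False],
    },
    index=idx,
)
# assumption: only the last timestep carries an alternate (non-E) manual flag
has_manual_flag = pd.Series([False, False, False, True], index=idx)

checked = add_E_where_set0(flags, has_manual_flag)
print(checked)
```

Only the first timestep gains an E flag: the second already has one, the third was never set to 0, and the fourth is exempt because of its manual flag.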
Flagging During Clog Events¶
This package applies a complex clog detection methodology, always requiring at least two probes to confirm a clog. Once a clog is confirmed, a running average of precip (usually 1 hour) is compared to other probes to determine whether the clogged probe is experiencing undercatch or delayed accumulation. If the other probes and the clogged probe are all experiencing a dry period, no flag is assigned. Provisional flagging, or flags from other rules, can interfere with this carefully designed flagging system. So, flags other than U or C should be discarded during a clog: first, because U or C should supersede any other flag, and second, because other flags should not fill in dry periods within a clog.
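The running-average comparison described above can be sketched in a heavily simplified form. This is not the package's clog methodology: it assumes the clog has already been confirmed, compares against a single reference series rather than several probes, and the function name `classify_clog_period`, the `tol` parameter, and the sample values are all hypothetical.

```python
import pandas as pd

def classify_clog_period(clogged: pd.Series, reference: pd.Series,
                         window: str = "60min", tol: float = 0.0) -> pd.Series:
    """Label each timestep of a confirmed clog as 'U', 'C', or '' (dry)."""
    clog_avg = clogged.rolling(window).mean()
    ref_avg = reference.rolling(window).mean()
    labels = pd.Series("", index=clogged.index)
    labels[clog_avg < ref_avg - tol] = "U"   # catching less than the reference
    labels[clog_avg > ref_avg + tol] = "C"   # releasing delayed precip
    return labels

idx = pd.date_range("2021-10-12 12:00", periods=6, freq="30min")
reference = pd.Series([0.0, 0.0, 1.0, 1.0, 0.0, 0.0], index=idx)
clogged = pd.Series([0.0, 0.0, 0.0, 0.0, 2.0, 0.0], index=idx)

labels = classify_clog_period(clogged, reference)
print(labels.tolist())
```

The first two timesteps stay unflagged because both series are dry, which is the behaviour that other flags such as Q must not override.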
Let’s look at an example from CEN 01 (stand alone), where substantial work has been done to check for clogs. First we need to clean all the data so that we can then perform a cross comparison.
[15]:
params = qaqc._load_yaml('../../qa_param.yaml')
# 1. load data
all_probes = main.load_data(2019, 2024, fname_base='MS00413_PPT_L1_5min_', data_path='../../config_new.yaml')
probes = params.keys()
# 2. QA all data
all_flags = {}
for probe in probes:
    site = probe[:3]
    nprobe = probe[-2:]
    # 2a. select data and qa rules to apply
    df = all_probes.pivot_on_probe(all_probes.df, site, nprobe, keep_col_name=['Value', 'Flag_Value'],
                                   probeid_col='Parameter')
    param = params[probe]
    # 2b. run rules and tests on data
    qa_flags, qa_events = main.qc_provisional(df, param)
    # 2c. combine all flags
    flags = main.apply_all_flags(df, qa_flags, qa_events, param)
    all_flags[probe] = flags
# 3. build pivot table of cleaned data for cross site comparison
xppt = cross_probe_qc.BuildXTable.assemble_cross_table(all_flags, ppt_col='adj_precip')
xacc = cross_probe_qc.BuildXTable.assemble_wy_acc(xppt)
for probe in probes:
    param_auto = params[probe]['auto_flag']
    # 4. perform cross site comparison
    if 'flag_x_clogs' in param_auto:
        main.qc_cross_probe(xacc, xppt, param_auto, probe, all_flags[probe])
Loading all PPT data from ../../config_new.yaml
214: UserWarning: No existing flags found. qaqc.ApplyFlags.apply_GCE_flags was designed to fill in where there are not other flags. Consider running qaqc.ApplyFlags.apply_QaRules_flags first.
214: UserWarning: No existing flags found. qaqc.ApplyFlags.apply_GCE_flags was designed to fill in where there are not other flags. Consider running qaqc.ApplyFlags.apply_QaRules_flags first.
214: UserWarning: No existing flags found. qaqc.ApplyFlags.apply_GCE_flags was designed to fill in where there are not other flags. Consider running qaqc.ApplyFlags.apply_QaRules_flags first.
Performing cross probe on CEN_01
Performing cross probe on CEN_02
Performing cross probe on CS2_02
Performing cross probe on PRI_03
[18]:
clogs = all_flags['CEN_01'].event.event_code == 'CLOG'
all_flags['CEN_01'].event.loc[clogs, 'QaRule_flag'].unique()
[18]:
<ArrowExtensionArray>
['', 'MU', 'M', 'C', 'U']
Length: 5, dtype: string[pyarrow]
[20]:
all_flags['CEN_01'].event.loc[clogs, 'prov_flag'].unique()
[20]:
<ArrowExtensionArray>
[<NA>, 'J', 'WM', 'MM', 'W', 'R']
Length: 6, dtype: string[pyarrow]
[19]:
all_flags['CEN_01'].event.loc[clogs, 'manual_flag'].unique()
[19]:
<ArrowExtensionArray>
['U', 'C', '']
Length: 3, dtype: string[pyarrow]
There are a number of missing flags during clogs. Let’s take a look at them.
[22]:
m = all_flags['CEN_01'].event['QaRule_flag'].str.contains('M')
all_flags['CEN_01'].event[m&clogs]
[22]:
| Date | prov_flag | tank_flag | QaRule_flag | manual_flag | final_flag | event_code | explanation |
|---|---|---|---|---|---|---|---|
| 2018-12-18 06:50:00 | <NA> | M | MU |  |  | CLOG | ManualFlag: remove gce m flag during clog peri... |
| 2018-12-18 06:55:00 | <NA> | M | MU |  |  | CLOG | ManualFlag: remove gce m flag during clog peri... |
| 2018-12-18 07:00:00 | <NA> | M | MU |  |  | CLOG | ManualFlag: remove gce m flag during clog peri... |
| 2018-12-18 07:05:00 | <NA> | M | MU |  |  | CLOG | ManualFlag: remove gce m flag during clog peri... |
| 2018-12-18 07:10:00 | <NA> | M | MU |  |  | CLOG | ManualFlag: remove gce m flag during clog peri... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2021-10-12 16:40:00 | MM | M | M | U |  | CLOG | ManualFlag: valve left open at gauge. not enou... |
| 2021-10-12 16:45:00 | MM | M | M | U |  | CLOG | ManualFlag: valve left open at gauge. not enou... |
| 2021-10-12 16:50:00 | MM | M | M | U |  | CLOG | ManualFlag: valve left open at gauge. not enou... |
| 2021-10-12 16:55:00 | MM | M | M | U |  | CLOG | ManualFlag: valve left open at gauge. not enou... |
| 2021-10-12 17:00:00 | MM | M | M | U |  | CLOG | ManualFlag: valve left open at gauge. not enou... |
7738 rows × 7 columns
We can see that the provisional data was flagged as missing during the clog. This communicates to the user of provisional data that there is a gap in the data, ensuring that it isn’t interpreted as a rain-free period. All missing data is transferred to QaRule, since no information is available if the data is missing. However, this data was manually unflagged and filled with zero precip values. The tank value was filled forward from the last known value. That allows a more complex assessment of the data, which identified this period as a clog. This is a more descriptive flag for the user of final data, informing them that the sensor was clogged and undercatching. It also explains the final, sudden increase in the tank when it unclogged as delayed precip, the cumulative total since the last good value, so that it is not interpreted as a massive downpour of multiple days of precip all at once.
In this case, no clog analysis could be performed until the data was restored with the manual flag. In turn, the manual flag has already overwritten the M flag, clearing those values. However, there are other cases where clogs are flagged with a Q in provisional data, so there is no manual flag. This would produce two conflicting flags (U vs Q, or C vs Q) and would fill all dry periods with a Q flag. By applying this check, we keep dry periods unflagged and provide an additional check to prevent Q flags from overwriting U or C flags during a clog.
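The effect of this check can be illustrated on a small boolean flag table paired with an event_code series. This is a sketch under stated assumptions rather than the code behind affirm_CLOG_flagged_UC: the function name `keep_only_UC_during_clog` is hypothetical, and the action columns Set0 and SetNA are assumed to be left untouched.

```python
import pandas as pd

def keep_only_UC_during_clog(flags: pd.DataFrame, event_code: pd.Series) -> pd.DataFrame:
    """Clear every flag except U and C on timesteps whose event_code is CLOG."""
    out = flags.copy()
    in_clog = event_code.eq("CLOG")
    # assumption: Set0/SetNA record value changes, not flags, so they are kept
    other_cols = [c for c in out.columns if c not in ("U", "C", "Set0", "SetNA")]
    out.loc[in_clog, other_cols] = False
    return out

idx = pd.date_range("2021-10-12 16:40", periods=4, freq="5min")
flags = pd.DataFrame(
    {
        "Q": [True, True, False, True],
        "U": [True, False, False, False],
        "C": [False, False, False, False],
        "E": [False, False, False, False],
    },
    index=idx,
)
event_code = pd.Series(["CLOG", "CLOG", "CLOG", "MAINTE"], index=idx)

cleaned = keep_only_UC_during_clog(flags, event_code)
print(cleaned)
```

The Q that overlapped a U is removed, the Q that would have filled a dry period inside the clog is removed, and the Q outside the clog (event_code MAINTE) is left alone.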