qaqc Module

class qaqc.ApplyFlags(df_index, flags=False)[source]

Bases: object

Apply flags from multiple sources to create a final set of flagged data.

Take flags from the provisional data, QaRules(), and the manual flags in ../qa_param.yaml and combine them into a single set of data flags. Document why each record was flagged and provide tools to evaluate the final flag set.

apply_0_val()[source]
apply_FSDB_flags(auto_qa_event, flags)[source]

Add FSDB flags. Flags are an accumulation of rows that are True.

Flags are accumulated with |=, i.e. flags = flags | new_flags. Explanations and event codes are populated based on the auto_qa_event generated by QaRules().

Parameters:
  • auto_qa_event – A boolean DataFrame with a column for each event the data was tested for.

  • flags – A boolean DataFrame with a column for each flag.

Returns:

None. Updates the instance variables self.event and self.flags.
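The accumulation described above can be sketched as follows. This is a minimal illustration with pandas, not the module's implementation; the flag column names `Q` and `NA` and the example values are hypothetical.

```python
import pandas as pd

idx = pd.date_range("2018-10-01", periods=4, freq="5min")

# Flags already accumulated (hypothetical flag columns).
flags = pd.DataFrame({"Q": [True, False, False, False],
                      "NA": [False, False, False, False]}, index=idx)

# Flags from another source, sharing the same index and columns.
new_flags = pd.DataFrame({"Q": [False, True, False, False],
                          "NA": [False, False, True, False]}, index=idx)

# Element-wise OR: a row stays flagged once any source has flagged it.
flags |= new_flags
```

Because the operation is an element-wise OR, applying sources in a different order gives the same final flag set.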

apply_NAN_val()[source]
apply_event_code()[source]
apply_manual_flags(manual_notes)[source]
create_flag_log(probe, output_dir='./processed_data')[source]
get_flagged_days()[source]
import_provisional_data(prov_df, tank_col='INST', ppt_col='TOT', ppt_flag_col='TOT_Flag')[source]
plot_flagged_day(day, site, tdelta='1D', **kwargs)[source]
static read_manual_flags(site, qa_yaml='../qa_param.yaml')[source]
remove_GCE_F_flags()[source]

Where provisional processing has placed an F flag immediately following a J flag, the precip value from the record flagged J is duplicated in the record flagged F. The J flag signifies large precip jumps, so if not removed, this artifact adds more than 200 mm of precip to CENT in WY19.

Returns:

class qaqc.LoadProvisionalData(file_n='./config.yaml')[source]

Bases: object

static find_probe(df, search_list=[], search_col='Parameter')[source]

Find probe name in GCE output.

GCE output is in a flat format. As of April 2023 there are the following columns:

Example::

    Date                 Parameter                 Value    Flag_Value
    2018-09-30 23:55:00  UPL_PRECIP_INST_455_0_01  45.050   <NA>
    2018-09-30 23:55:00  UPL_PRECIP_TOT_455_0_01   0.000    <NA>
    2018-09-30 23:55:00  UPL_PRECIP_ACC_455_0_01   2372.92  <NA>

So to find all the data for probe 1 at UPLO, you need to query the flat file for 3 different parameters. This function examines the unique component names, currently in the Parameter column, and searches for a list of identifiers, such as site and probe number or site and probe height. It returns a list of Parameter names that can be used to query all data for a given probe.

Parameters:
  • df – Pandas dataframe containing GCE output

  • search_list – list of strings to identify a probe. Searches for Parameters that contain the whole list.

  • search_col – Column to search for probe names

Returns:

list of Parameter names that contain data for search probe.
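A minimal sketch of the search described above, assuming a plain substring match on every entry of the search list. The function name `find_probe_sketch` and the sample rows are hypothetical; the actual matching logic may differ.

```python
import pandas as pd

def find_probe_sketch(df, search_list, search_col="Parameter"):
    """Return the unique names in search_col that contain every search string."""
    names = df[search_col].unique()
    return [name for name in names if all(term in name for term in search_list)]

gce = pd.DataFrame({
    "Parameter": ["UPL_PRECIP_INST_455_0_01", "UPL_PRECIP_TOT_455_0_01",
                  "UPL_PRECIP_ACC_455_0_01", "CEN_PRECIP_TOT_625_0_02"],
    "Value": [45.050, 0.000, 2372.92, 0.000],
})

# Site "UPL" and probe suffix "_01" together select the three UPL components.
upl_params = find_probe_sketch(gce, ["UPL", "_01"])
```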

load_ppt_data(strtyr=2018, endyr=2022, fname_base='MS043PPT_PPT_L1_5min_', **kwargs)[source]

Load GCE files for multiple water years.

Multiple years are concatenated together. All files must be in data_dir defined for this instance.

Any keyword argument accepted by pandas.read_csv() can be supplied to this function and will be passed to pandas.read_csv().

Warning

Data is filtered by Water Year (WY). Even if a file contains data spanning a different date range, it will be trimmed to f"10/1/{y - 1}":f"9/30/{y}".

Warning

Assumes the filename has the format <fname_base><year>.csv. The method only works with the year as a suffix.
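The water-year trimming and filename convention in the warnings above can be sketched like this. It is an illustration only: the helper names are hypothetical, ISO date strings are used instead of the month/day/year form, and the data frame is assumed to have a sorted DatetimeIndex.

```python
import pandas as pd

def water_year_filenames(strtyr, endyr, fname_base="MS043PPT_PPT_L1_5min_"):
    """Build one filename per year, with the year as the suffix."""
    return [f"{fname_base}{year}.csv" for year in range(strtyr, endyr + 1)]

def trim_to_water_year(df, year):
    """Keep rows from 1 October of the previous year through 30 September of year."""
    return df.loc[f"{year - 1}-10-01":f"{year}-09-30"]

# Daily records that spill over both ends of water year 2019.
idx = pd.date_range("2018-09-29", "2019-10-02", freq="D")
daily = pd.DataFrame({"TOT": 0.0}, index=idx)

wy2019 = trim_to_water_year(daily, 2019)
names = water_year_filenames(2018, 2019)
```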

Parameters:
  • strtyr – int. First year to import.

  • endyr – int. Last year to import. If same as first year, only one year is imported.

  • fname_base – str. Filename without year (year must be suffix)

Returns:

classmethod pivot_on_probe(df, site, probe_num, keep_col_name=['Value', 'Flag_Value'], probeid_col='Parameter')[source]

Create pivot table of data for a single probe from GCE flat file.

As of April 2023, GCE precip output is a flat file with separate labels for 3 components. This method finds the 3 components for the requested probe and returns a pivot table. Components:

  • INST - The instantaneous measure of tank height

  • TOT - The total precipitation measured since the last timestep

  • ACC - The accumulated precip, a cumulative sum of WY to date.

Example::

    Flat format

    Date                 Parameter                 Value    Flag_Value
    2018-09-30 23:55:00  CEN_PRECIP_INST_625_0_02  44.150   <NA>
    2018-09-30 23:55:00  CEN_PRECIP_TOT_625_0_02   0.000    <NA>
    2018-09-30 23:55:00  CEN_PRECIP_ACC_625_0_02   1739.51  <NA>

    Pivot

    Date                 INST   INST_Flag  TOT   TOT_Flag  ACC      ACC_Flag
    2018-09-30 23:55:00  44.15  <NA>       0.00  <NA>      1739.51  <NA>

Parameters:
  • df – Pandas DataFrame of a GCE flat file

  • site – str containing 3 character site ID

  • probe_num – str containing 2 character probe num

  • keep_col_name – list of column names to keep in final output

  • probeid_col – str. Column name containing site ID.

Returns:

Pandas DataFrame. Pivot table of data and flags for a single probe.
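The reshaping from flat format to pivot can be sketched with pandas as follows. This is a hedged illustration, not the method's code: it assumes the component name (INST, TOT, ACC) is the third underscore-separated token of the Parameter string, and the helper column `component` is hypothetical.

```python
import pandas as pd

flat = pd.DataFrame({
    "Date": pd.to_datetime(["2018-09-30 23:55:00"] * 3),
    "Parameter": ["CEN_PRECIP_INST_625_0_02",
                  "CEN_PRECIP_TOT_625_0_02",
                  "CEN_PRECIP_ACC_625_0_02"],
    "Value": [44.150, 0.000, 1739.51],
    "Flag_Value": [pd.NA, pd.NA, pd.NA],
})

# Assumption: the third token of the parameter name is the component.
flat["component"] = flat["Parameter"].str.split("_").str[2]

# Pivot values and flags separately, then join them side by side.
values = flat.pivot(index="Date", columns="component", values="Value")
flag_cols = flat.pivot(index="Date", columns="component",
                       values="Flag_Value").add_suffix("_Flag")
wide = pd.concat([values, flag_cols], axis=1)

# Order the columns as in the documented pivot layout.
wide = wide[["INST", "INST_Flag", "TOT", "TOT_Flag", "ACC", "ACC_Flag"]]
```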

class qaqc.QaRules(df, qa_params)[source]

Bases: object

A set of QA rules to run on a single sensor.

Rules that identify problems in the data or low confidence in the accuracy of the values, together with the metrics and calculations needed to evaluate those rules. Raw data is input and 2 outputs are produced:

  1. A DataFrame of flags accumulated by the rules applied. This shares a DateTime index with the data and is in the format of a single boolean column for each flag, identifying rows where flag conditions exist (flag is True).

  2. A boolean DataFrame of events or conditions. E.g. column ‘overflow’ is true wherever the rain gauge overflowed.

calc_run_avg_rainfall(rainfall_col='TOT', wind=4, nstd=2)[source]

Calculate the average and standard deviation for a rolling window of precipitation amounts.

Drain events are removed from data and running values are filled using linear interpolation. All running values are also rounded to the minimum precision.

Parameters:
  • rainfall_col – str. Column name for precip accumulated since last time step

  • wind – size of running window. If int, number of timesteps. If str, must be a valid Pandas frequency.

  • nstd – number of standard deviations to add to running average.

Returns:

2 Pandas Series: 1) running avg; 2) running avg +N std deviation of running average.
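A minimal sketch of the two returned series, using pandas rolling windows. It deliberately omits the drain-event removal, interpolation, and rounding steps described above, and the function name `run_avg_sketch` is hypothetical.

```python
import pandas as pd

def run_avg_sketch(rain, wind=4, nstd=2):
    """Return (running average, running average + nstd * running std)."""
    rolling = rain.rolling(wind)
    run_avg = rolling.mean()
    upper = run_avg + nstd * rolling.std()
    return run_avg, upper

# Alternating precipitation amounts, evaluated with a 2-step window.
rain = pd.Series([0.0, 2.0, 0.0, 2.0, 0.0, 2.0])
avg, upper = run_avg_sketch(rain, wind=2, nstd=2)
```

The first `wind - 1` entries are NaN because the window is not yet full, which is one reason a fill step is needed in practice.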

static find_drops(df, precision, col='INST', wind=3)[source]

Return boolean Series True where value is below the running average.

Primarily used to identify precip tank drain events and any subsequent recharge of mineral oil or antifreeze.

Note

The whole DataFrame is input, instead of a slice with just one column, to make it clearer what the function is doing.

Parameters:
  • df – Pandas DataFrame

  • precision – float. minimum precision of measurement

  • col – str. Name of column to assess in DataFrame

  • wind – Size of rolling window to use. Integer interpreted as number of timesteps. Any Pandas DateTime frequency that evenly divides into multiyear data is also accepted. For example, month is not accepted because not all months have the same number of days.

Returns:

Pandas Series of boolean values with index of df.
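One way to express "below the running average" is sketched here. How the real method uses `precision` is not stated, so this sketch assumes a value must fall below the running average by more than one precision step; the function name and sample values are hypothetical.

```python
import pandas as pd

def find_drops_sketch(df, precision, col="INST", wind=3):
    """True where the value is more than `precision` below its running average."""
    run_avg = df[col].rolling(wind).mean()
    # Comparisons against NaN (incomplete windows) evaluate to False.
    return df[col] < (run_avg - precision)

tank = pd.DataFrame({"INST": [100.0, 100.0, 100.0, 60.0, 60.0, 60.0, 60.0]})
drops = find_drops_sketch(tank, precision=0.1)
```

The mask stays True until the running average catches up with the new, lower tank level.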

static find_neg_delta(df, col='INST', threshold=-25)[source]

Return boolean Series True where value dropped more than the threshold.

Primarily used for identifying the time stamp where the precip storage tank is drained.

Parameters:
  • df – Pandas DataFrame

  • col – str of column name to search

  • threshold – float. Threshold to use to define a drain.

Returns:

Pandas Series of boolean values with index of df.
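A sketch of the threshold test, assuming the drop is measured as the difference between consecutive records. The function name `find_neg_delta_sketch` and the sample tank levels are hypothetical.

```python
import pandas as pd

def find_neg_delta_sketch(df, col="INST", threshold=-25):
    """True where the step-to-step change is more negative than the threshold."""
    return df[col].diff() < threshold

tank = pd.DataFrame({"INST": [100.0, 101.0, 60.0, 61.0]})
drained = find_neg_delta_sketch(tank)
```

Only the record where the level fell by 41 is marked; the first record has no predecessor, so its difference is NaN and compares as False.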

flag_drains(runavg_rainfall, rainfall_col='TOT')[source]

Flag data during drain events.

When the tank level is dropping, all precip values are changed to nan unless the value is <= running average of precipitation, which is flagged Q.

flag_* functions assign values to instance without return. Changes value of self.qa_flags to True where a flag’s conditions are met.

Parameters:
  • runavg_rainfall – a Pandas Series containing the running average of precipitation (see calc_run_avg_rainfall()).

  • rainfall_col – str of column containing precipitation data.

flag_duplicate_precip(ppt_col='TOT', tank_col='INST', large_ppt_size=1)[source]

Large amounts of precip were found duplicated in consecutive records. For example 173.1 and 173.0 in consecutive 5 min intervals.

This method identifies duplicates by looking for large precip that occurs where the tank level is flat and nearly duplicates the previous value. This has similar purpose to ApplyFlags.remove_GCE_F_flags(), but uses a numerical method.

flag_* functions assign values to instance without return. Changes value of self.qa_flags to True where a flag’s conditions are met.

Parameters:
  • ppt_col – str with column name

  • tank_col – str with column name

  • large_ppt_size – int. Multiple of precision

flag_empty_tank(tank_col='INST', pause_nsteps=2)[source]

A tank value <0 is not possible and means the sensor float is in a dead zone where it cannot be read. This can be a result of a logger reboot or sensor removal rather than an actual measurement. If the tank value is <0, the next measurement will be falsely counted as precip. Due to the ‘F’ flag functionality in simple_pre.m, it is necessary to filter for at least 2 timesteps after a tank measurement becomes <0.

flag_duplicate_precip() and ApplyFlags.remove_GCE_F_flags() catch the second timestep of values resulting from this case, but they leave the first timestep. This method removes both values.

flag_* functions assign values to instance without return. Changes value of self.qa_flags to True where a flag’s conditions are met.

Parameters:
  • tank_col – str. Name of column with tank level

  • pause_nsteps – int. number of time steps after zero tank to delete.
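The masking described for an empty tank can be sketched as below. This is an assumption-laden illustration: it flags the negative record itself plus the following `pause_nsteps` records, and the function name is hypothetical.

```python
import pandas as pd

def empty_tank_mask_sketch(df, tank_col="INST", pause_nsteps=2):
    """True where the tank is below zero and for pause_nsteps records after."""
    negative = df[tank_col] < 0
    mask = negative.copy()
    for step in range(1, pause_nsteps + 1):
        # Carry each negative reading forward by `step` records.
        mask |= negative.shift(step, fill_value=False)
    return mask

tank = pd.DataFrame({"INST": [5.0, -1.0, 5.0, 5.0, 5.0]})
mask = empty_tank_mask_sketch(tank)
```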

flag_overaccum_precip(overaccum_threshold=5, tank_col='INST', ppt_col='TOT')[source]

The change in tank level should match the amount of precip.

The algorithm that evaluates the tank level for precipitation looks for increases in 3 consecutive measurements to begin a rain event, so this must be carefully parameterized to prevent overflagging. It is also unclear what flags should emanate from this metric.

Warning

With a test case of CENT, an over-accumulation threshold of 5 x precision captures all the events captured by flag_duplicate_precip() plus one additional event. In that case, ApplyFlags.remove_GCE_F_flags() captures all the same events as this method. So this is a more accurate numerical approach to capturing the conditions occurring when a “J” precedes an “F” flag. But the criteria in this method have the potential to capture other events that are not necessarily duplicates. I don’t want to nebulously flag duplicates as Q when they are an artifact, nor do I want to remove things captured by these criteria, so as of 5/17/23 this function is not used.

Parameters:
  • overaccum_threshold – float. Number multiplied by probe precision defining the threshold for ppt overcount

  • tank_col – str. Name of column with tank level

  • ppt_col – str. Name of column with precipitation data

Returns:

flag_propagate_from_tank(tankflag_col='INST_Flag', ppt_col='TOT')[source]

Makes sure that Estimated and Missing flags are applied to both tank and precip values.

flag_* functions assign values to instance without return. Adds str to flags in self.df_orig where a flag’s conditions are met.

Parameters:
  • tankflag_col

  • ppt_col

flag_recharge(runstd_rainfall, rainfall_col='TOT')[source]

Flag any recharge added to tank following a drain.

When some tanks are drained, they must be recharged with mineral oil and/or antifreeze. Mineral oil is used to prevent evaporation, primarily in summer months, while antifreeze is used to liquify unheated or frozen gauges. The addition of these liquids should not be counted as rain. The following criteria are applied:

  1. Drain events: a probe specific window following a drop in tank level (set_drain_event()).

  2. NA: it is impossible or highly unlikely that the value is rainfall. Meets both of:
    1. value > precip running avg +N std deviation of running avg

    2. value > probe max recharge

  3. Q: questionable whether the value is rainfall or recharge. Meets exactly one of the NA criteria.
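The criteria above can be sketched as boolean masks. This is a minimal illustration under stated assumptions, not the method itself: the function name, the `in_drain_event` mask, and the `max_recharge` argument are hypothetical stand-ins for values the class holds internally.

```python
import pandas as pd

def classify_recharge_sketch(rain, run_std, max_recharge, in_drain_event):
    """Return (na_flag, q_flag) masks for records inside a drain event."""
    above_std = rain > run_std        # criterion 1
    above_max = rain > max_recharge   # criterion 2
    # NA: both criteria are met.
    na_flag = in_drain_event & above_std & above_max
    # Q: exactly one criterion is met.
    q_flag = in_drain_event & (above_std ^ above_max)
    return na_flag, q_flag

rain = pd.Series([0.5, 3.0, 12.0, 12.0])
run_std = pd.Series([1.0, 1.0, 1.0, 1.0])
in_drain_event = pd.Series([True, True, True, False])

na_flag, q_flag = classify_recharge_sketch(rain, run_std, 10.0, in_drain_event)
```

The last record exceeds both limits but lies outside a drain event, so it receives neither flag.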

flag_* functions assign values to instance without return. Changes value of self.qa_flags to True where a flag’s conditions are met.

Parameters:
  • runstd_rainfall – a Pandas Series containing precipitation running avg +N std deviation of running average (see calc_run_avg_rainfall()).

  • rainfall_col – str of column containing precipitation data.

flag_tank_overflow(max_tank_depth, tank_col='INST')[source]

Flag values where tank is above maximum fill. Gauge will not be able to record additional precip until the tank level drops.

flag_* functions assign values to instance without return. Changes value of self.qa_flags to True where a flag’s conditions are met.

Parameters:
  • max_tank_depth – float. Maximum fill point of tank.

  • tank_col – str with column name

reset_wy_acc(rainfall_col='TOT')[source]

Reset the water year accumulation (cumulative sum) for adjusted values. To be performed occasionally if this column of data is needed.

Parameters:

cumsum_col

Returns:

static round_to_precision(df_col, precision)[source]

Round all values to the nearest precision of the instrument. For example, all tipping bucket measurements rounded to a number of whole bucket tips (no partial tips).

Parameters:
  • df_col – A single column of data. Array math will be performed on each row of any input format.

  • precision – a numerical value that is the minimum step of the data.

Returns:

A column or array of data that has been rounded to whole steps equal to precision.
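Rounding to whole instrument steps can be sketched as divide, round, multiply. The function name `round_to_precision_sketch` and the 0.25 step size are hypothetical, chosen so the arithmetic is exact in floating point.

```python
import pandas as pd

def round_to_precision_sketch(values, precision):
    """Round each value to the nearest whole multiple of precision."""
    return (values / precision).round() * precision

readings = pd.Series([0.3, 0.4, 0.74])
rounded = round_to_precision_sketch(readings, 0.25)
```

With step sizes that are not exactly representable in binary, the results can carry small floating-point residue, so comparisons should use a tolerance.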

set_drain_event(tank_col='INST', event_window=3, neg_delta_threshold=-25)[source]

Find the drain events, defined by a running tank average greater than tank level, and any moments where the tank has a negative change in depth.

set_* functions assign values to instance without return.

Parameters:
  • tank_col – str containing name of the column with tank level.

  • event_window – int. What size window to use to calculate the event running average.

  • neg_delta_threshold – int. What size threshold to use to define negative changes in tank depth