Anomaly Detection is based on the concept of describe the statistical behaivior of known good data, and than use this as a reference to identify bad data by uncommon characteristics.
From a list of data files, analyze all and characterize each measurement by a series of features, like for example the gradient or the difference with the climatology. Than, rank all files based on how unexpected is each feature, i.e. a measurement with a spike too intense, or too different from the climatology would show up first.
rank_files(datadir, varname, cfg=None)
datadir: root directory with the data to be evaluated
varname: Variable to be evaluated, like TEMP
cfg: Q.C. rule to be considered
Return a list of all files inside datadir ordered by the probablity of being all good data.
Calibrate Anomaly Detection¶
Calibrate the parameters for anomaly detection to best reproduce a preset Q.C. rule (for example: GTSPP). Since the anomaly detection consider simultaneously several features together to make a final decision, it should achieve more consistent results. A measurement with several tests too close to the traditional Q.C. thresholds would be approved by the traditional approach, but would raise suspicious, or even fail, in the anomaly detection approach.
calibrate_anomaly_detection(datadir, varname, cfg=None)