Anomaly Detection

Overview

In this scenario the goal is to identify irregular values in an outcome variable prospectively in a homogeneous population (i.e. when no treatment / intervention is planned). As an example, we may wish to detect failure of any one machine in a cluster, and to do so, we wish to create a synthetic unit for each machine which is composed of a weighted average of other machines in the cluster. In particular, there may be variation of the workload across the cluster and where workload may vary across the cluster by (possibly unobserved) differences in machine hardware, cluster architecture, scheduler versions, networking architecture, job type, etc.

Like the Prospective Treatment Effects scenario, Feature data consist of of unit attributes (covariates) and a subset of the pre-intervention values from the outcome of interest, and target data consist of the remaining pre-intervention values for the outcome of interest, and Cross fold validation is conducted using the entire dataset, and Cross validation and gradient folds are determined randomly.

Example

In this scenario, we'll need a matrix with past observations of the outcome (target) of interest (targets), with one row per unit of observation, and one column per time period, ordered from left to right. Additionally we may have another matrix of additional features with one row per unit and one column per feature (features). Armed with this we may wish to construct a synthetic control model to help decide weather future observations (additional_observations) deviate from their synthetic predictions.

The strategy will be to divide the targets matrix into two parts (before and after column t), one of which will be used as features, and other which will be treated as outcomes for the purpose of fitting the weights which make up the synthetic controls model.

from numpy import hstack
from SparseSC import fit

# Let X be the features plus some of the targets
X = hstack([features, targets[:,:t])

# And let Y be the remaining targets
Y = targets[:,t:]

# fit the model:
fitted_model = fit(X=X,
                   Y=Y,
                   model_type="full")

The model_type="full" allows produces a model in which every unit can serve as a control for every other unit, unless of course the parameter custom_donor_pool is specified.

Now with our fitted synthetic control model, as soon as new set of targets outcomes are observed for each unit, we can create synthetic outcomes using our fitted model using the predict() method:

synthetic_controls = fitted_model.predict(additional_observations)

Note that while the call to fit() is computationally intensive, the call to model.predict() is fast and can be used for real time anomaly detection.

Model Details:

This model yields a synthetic unit for every unit in the dataset, and synthetic units are composted of the remaining units not included in the same gradient fold.

Type Units used to fit V & penalties Donor pool for W
(prospective) full All units All units