Welcome to SparseSC’s 0.2.0 documentation!¶
Overview¶
SparseSC is a package that implements an ML-enhanced version of Synthetic Control Methodologies. Typically this is used to estimate causal effects from binary treatments on observational panel (longitudinal) data. The functions fit()
and fit_fast()
provide basic fitting of the model. If you are estimating treatment effects, fitting and diagnostic information can be done via estimate_effects()
.
Data¶
Though the fitting methods do not require such structure, the typical setup is where we have panel data of an outcome variable \(Y\) for \(T\) time periods for \(N\) observation units (customer, computers, etc.). We may additionally have some baseline characteristics \(X\) about the units. In the treatment effect setting, we will also have a discrete change in treatment status (e.g. some policy change) at time, \(T_0\), for a select group of units. When there is treatment, we can think of the pre-treatment data as [\(X\), \(Y_{pre}\)] and post-treatment data as [\(Y_{post}\)].
Synthetic Controls¶
Synthetic Controls was originally designed as a treatment effect estimator for a small number of regions (e.g. states). It is a type of matching estimator. For each unit it will find a weighted average of untreated units that is similar on key pre-treatment data (\(X\), \(Y_{pre}\)). The goal of Synthetic controls is find out which variables are important to match on (specified by the diagonal matrix V
where element \(v_{k,k}\) is the weight for the \(k\)-th pre-treatment variable) and then given those a vector of per-unit weights that combine the control units into its synthtic control. The synthetic control acts as the counterfactual for a unit, and the estimate of a treatment effect is the difference between the observed outcome in the post-treatment period and the synthetic control's outcome.
One way to think of SC is as an improvement upon difference-in-difference (DiD) estimation. Typical DiD will compare a treated unit to the average of the control units. But often the treated unit does not look like a typical control (e.g. it might have a different growth rate), in which case the 'parallel trend' assumption of DiD is not valid. SC remedies this by choosing a smarter linear combination, rather than the simple average, to weight more heavily the more similar units.
The authors show if endogeneity of treatment is driven by a factor model with vectors components \(f_t\cdot\lambda_i\) where \(\lambda_i\) might be correlated with treatment (a simple example would be that there are groups with different typical time trends) and the synthetic control is able to reproduce the treated unit's pre-treatment history, then as the pre-history grows the size of the expected bias of the estimated treatment effect approaches zero (NB: this is not quite consistency). Essentially, if there are endogenous factors that affect treatment and future outcomes then you should be able to control for them by matching on past outcomes. The matching that SC provides can therefore deal with some problems in estimation that DiD can not handle.
SparseSC Solution Structure¶
Given a specific set of variable weights, \(V\) and control variable matrix of data to match on \(M^C\), unit-weights for unit \(i\) is \(W_i=\arg\min_{W}\sum_k(M_{ik}-W\cdot M^C_{\cdot,k})^2\cdot v_{kk}\). Synthetic Controls typically also restricts the weight vector to be non-negative and sum to one. These restrictions may aid interpretability, though they are not econometrically necessary and may harm performance (e.g. they make it difficult to model units on the convex hull of the matching-space).
The determination of \(V\) can be done jointly or separately.
- Jointly: We find \(V\) such that resulting SCs have outcomes that are 'optimal' in some sense. They originally minimized squared prediction error on \(Y_{pre}\). In the standard Stata
synth
command this is thenested
option. - Separately: We perform some non-matching analysis to determine \(V\). They originally found those variables that were the best linear predictors of \(Y_{pre}\). In the standard Stata
synth
command this is the default option.
The separate solution is much faster and often the only feasible method with larger data.
Inference¶
Inference is typically done by placebo tests (randomization inference). The procedure is replicated for the control units and we get a distribution of placebo effects from the differences for the control units. We then compare main effect to this placebo distribution to evaluate statistical significance.
Multiple treated units¶
If there are multiple (N1
) treated units then the average treatment effect averages over these individual effects. The elements of the placebo distribution are each then averages over N1
differences for untreated units. If the full placebo distribution is too large to compute then it can be randomly sampled to the desired precision.
If units experience treatment at different times, then we define consistent pre and post period (T0
, T1
) and compute difference for each event, then align these according to the standard windows and compute effects.
SparseSC¶
SparseSC makes a number of changes to Synthetic Controls. To avoid overfitting, ensure uniqueness of the solution, and allow for estimation on large datasets we:
- Automatically find a low-dimensional space to match in. We explicitly target a 'sparse' space so as to limit non-parametric bias that is typical in matching estimators. Using ML to find this space also removes ad-hoc decisions by analysts that can affect results. We provide two strategies:
- Feature selection - We impose an \(L_1\) penalization on the existing [\(X\), \(Y_{pre}\)] variables.
- Feature learning/generation - We generate a low dimensional space \(M\) to match on using time-series neural-networks (LSTM). This exploits the time-series nature of the outcome.
- Impose an \(L_2\) penalization on weights as they deviate from \(1/N_{controls}\). That is, in the absence of signal, we should resort to a simple comparison to the control group. This also implicitly penalizes weights whose sums is different than one.
Additional changes include:
- We fit \(V\) once for the whole sample rather than for each unit. This reduces overfitting and speeds execution.
- For optimizing
V
we target squared prediction error on \(Y_{post}\) for the control units, thus clearly separating features from targets. In the original formulation \(Y_{pre}\) were often features used to predict \(Y_{pre}\). This caused problems as \(X\)'s could not be included with all \(Y_{pre}\). Analysts were thus forced to hand-choose their matching varibles, which could introduce bias and lack of robustness.
SparseSC Solution Structure¶
As with the original SC, there are choices about whether to calculate all parameters on the main matching objective or whether to get approximate/fast estimates of them using non-matching formulations. To complicate matters, we now typically have to solve for two new penalty parameters v_pen
and w_pen
in addition to V
and W
. The methods are:
- Full joint (done by
fit()
): We optimize overv_pen
,w_pen
andV
, so that the resulting SC for controls have smallest squared prediction error on \(Y_{post}\). - Separate (done by
fit_fast()
): We note that we can efficiently estimatew_pen
on main matching objective, since, givenV
, we can reformulate the finding problem into a Ridge Regression and use efficient LOO cross-validation (e.g.RidgeCV
) to estimatew_pen
. We will estimateV
using an alternative, non-matching objective (such as aMultiTaskLasso
of using \(X,Y_{pre}\) to predict \(Y_{post}\)). This setup also allows for feature generation to select the match space. There are two variants depending on how we handlev_pen
:- Mixed. Choose
v_pen
based on the resulting down-stream main matching objective. - Full separate: Choose
v_pen
base on approximate objective (e.g.,MultiTaskLassoCV
).
- Mixed. Choose
The Fully Separate solution is fast and often quite good so we recommend starting there, and if need be, advancing to the Mixed and then Fully Joint optimizations.
Custom donor pools¶
There are two main reasons why one might want to limit the per-unit donor pools:
- Interference/contamination (violations of SUTVA)
- limiting 'interpolation bias' (if there is a strong non-linearity in how features map to outcomes, then even if the synthetic control is similar in feature space, its outcome might not be).
For all entry-points, one can specify the custom_donor_pool
which is an (N,N0)
boolean matrix of allowable donors for each unit.
Here is the default donor pool
base_custom_donor_pool = (1-np.eye(N))[:,control_units].astype('bool')
Interference Example¶
Suppose the units you were dealing with are located in physical space and there is some concern that a treatment will spill over to adjacent units. Below is a way to construct the custom_donor_pool
from an adjacency matrix A
.
custom_donor_pool = (1 - A)[:,control_units].astype('bool')
custom_donor_pool = np.logical_and(custom_donor_pool, base_custom_donor_pool)
Interpolation Bias Example¶
Suppose we want to use 'calipers' to ensure that units are only matched to those with a maximum difference of thresh
on a key metric \(B\) (which is \(N\)-vector)
custom_donor_pool = (np.abs(B[:,np.newaxis] - B) < thresh)[:,control_units].astype('bool')
custom_donor_pool = np.logical_and(custom_donor_pool, base_custom_donor_pool)
Model types¶
There are two main model-types (corresponding to different cuts of the data) that can be used to estimate treatment effects.
- Retrospective: The goal is to minimize squared prediction error of the control units on \(Y_{post}\) and the full-pre history of the outcome is used as features in fitting. This is the default and was used in the descriptive elements above.
- Prospective: We make an artificial split in time before any treatment actually happens (\(Y_{pre}=[Y_{train},Y_{test}]\)). The goal is to minimize squared prediction error of all units on \(Y_{test}\) and \(Y_{train}\) for all units is used as features in fitting.
Given the same amount of features, the two will only differ when there are a non-trivial number of treated units. In this case the prospective model may provide lower prediction error for the treated units, though at the cost of less pre-history data used for fitting. When there are a trivial number of units, the retrospective design will be the most efficient.
See more details about these and two additional model types at model-types.
Fitting Sparse Synthetic Controls¶
The fit()
and fit_fast()
functions can be used to create a set of weights and returns a
fitted model which can be used to create synthetic units using it's
.predict()
method:
from SparseSC import fit
# Fit the model:
fitted_model = fit(features,targets,...)
# Get the fitted synthetic controls for `targets`:
in_sample_predictions = fitted_model.predict()
# Make predictions for a held out set of fetures (targets_hat)
# using the fitted synthetic controls model:
additional_predictions = fitted_model.predict(targets_additional)
Note that targets
and features
here are depend on the model type and are not the
typical analysts outcome and covariates.
The two methods differ in terms of there choices about whether to calculate all parameters on the main matching objective or whether to get approximate/fast estimates of them using non-matching formulations.
- Full joint (done by
fit()
): We optimize overv_pen
,w_pen
andV
, so that the resulting SC for controls have smallest squared prediction error on \(Y_{post}\). - Separate (done by
fit_fast()
): We note that we can efficiently estimatew_pen
on main matching objective, since, givenV
, we can reformulate the finding problem into a Ridge Regression and use efficient LOO cross-validation (e.g.RidgeCV
) to estimatew_pen
. We will estimateV
using an alternative, non-matching objective (such as aMultiTaskLasso
of using \(X,Y_{pre}\) to predict \(Y_{post}\)). This setup also allows for feature generation to select the match space. There are two variants depend on how we handlev_pen
:- Mixed. Choose
v_pen
based on the resulting down-stream main matching objective. - Full separate: Choose
v_pen
base on approximate objective (e.g.,MultiTaskLassoCV
). The Fully Separate solution is fast and often quite good so we recommend starting there, and if need be, advancing to the Mixed and then Fully Joint optimizations.
- Mixed. Choose
Feature and Target Data¶
When estimating synthetic controls, units of observation are divided into control and treated units. Data collected on these units may include observations of the outcome of interest, as well as other characteristics of the units (termed "covariates", herein). Outcomes may be observed both before and after an intervention on the treated units.
To maintain independence of the fitted synthetic controls and the post-intervention outcomes of interest of treated units, the post-intervention outcomes from treated units are not used in the fitting process. There are two cuts from the remaining data that may be used to fit synthetic controls, and each has it's advantages and disadvantages.
In the call to fit()
and fit_fast()
, parameters features
and targets
should be numeric matrices
containing data on the features and target variables, respectively, with
one row per unit of observation, and one column per feature or target
variable.
Data and Model Type¶
There area 4 model types that can be fit using the fit functions which
can be selected by passing one of the following values to the model_type
parameter:
"retrospective"
: In this model, data are assumed to be collected retrospectively, sometime after an intervention or event has taken place in a subset of the subjects/units, typically with the intent of estimating the effect of the intervention.In this model,
targets
should contain target variables recorded after the event of interest andfeatures
may contain a combination of target variables recorded prior to the event of interest and other predictors / covariates known prior to the event. In addition, the rows infeatures
andtargets
which contain units that were affected by the intervention ("treated units") should be indicated using thetreated_units
parameter."prospective"
: In a prospective analysis, a subset of units have been designated to receive a treatment but the treatment has not yet occurred and the designation of the treatment may be correlated with a (possibly unobserved) feature of the treatment units. In this scenario, all data are collected prior to the treatment intervention, and data on the outcome of interested are divided in two, typically divided in two subsets taken before and after a particular point in time.In this model,
targets
should contain only target variables andfeatures
may contain a combination of target variables and other predictors / covariates. The parameterstreated_units
should be used to indicate the units which will or will not receive treatment."prospective-restricted"
: This is motivated by the same example as the previous sample. It requires a larger set of treated units for similar levels of precision, with the benefit of substantially faster running time."full"
: This model is motivated by the need for prospective failure detection, and is not used in the context of a historical event or treatment intervention.like the
prospective
models, data on the outcome of interested are divided in two, typically divided in two subsets taken before and after a particular point in time, andtargets
should contain only target variables andfeatures
may contain a combination of target variables and other predictors / covariates. The parametertreated_units
is unused.
A more through discussoin of the model types can be found Model Types Page.
Fit (Joint)¶
Penalty Parameters¶
The fitted synthetic control weights depend on the penalties applied to the V and W
matrices (v_pen
and w_pen
, respectively), and the fit()
function will
attempt to find an optimal pair of penalty parameters. Users can modify the selection
process or simply provide their own values for the penalty parameters, for
example to optimize these parameters on their own, with one of the
following methods:
1. Passing v_pen
and w_pen
as floats:¶
When single values are passed in the to the v_pen
and w_pen
, a fitted
synthetic control model is returned using the provided penalties.
2. Passing v_pen
as a value and w_pen
as a vector, or vice versa:¶
When either v_pen
or w_pen
are passed a vector of values, fit()
will iterate over the vector of values and return the model with an optimal
out of sample prediction error using cross validation. The choice of model
can be controlled with the choice
parameter which has the options of
"min"
(default) which selects the model with the smallest out of sample
error, "1se"
which implements the 'one standard-error' rule, or a
function which implements a custom selection rule.
Note that passing vectors to both v_pen
and w_pen
is assumed to be
inefficient and fit
will raise an error. If you wish to evaluate over a N x N
grid of penalties, use:
from intertools import product
fitted_models = [ fit(..., v_pen=v, w_pen=w) for v,w in product(v_pen,w_pen)]
3. Modifying the default search¶
By default fit()
picks an arbitrary value for w_pen
and creates a grid
of values for v_pen
over which to search, picks the optimal for v_pen
from the set of parameters, and then repeats the process alternating
between a fixed v_pen
and array of values w_pen
and vice versa until
stopping rule is reached.
The grid over which each penalty parameter is searched is determined by the
value of the other (fixed) penalty parameter. For example, for a given
value of w_pen
there is a maximum value of v_pen
which does not result
in a null model (i.e. when the V matrix would be identically 0 and W would
be identically 1/N), and the same logic applies in both scenarios (i.e.
when w_pen
is fixed).
The search grid is therefor bounded between 0 and the maximum referenced
above. By default the grid consists of 20 points log-linearly spaced
between 0 and the maximum. The number of points in the grid can be
controlled with the grid_length
parameters, and the bounds are controlled
via the grid_min
and grid_max
parameters. Alternatively, an array of
values between 0 and 1 can be passed to the grid
parameter and will be
multiplied by the relevant grid_max
to determine the search grid at each
iteration of the alternating coordinate descent.
Finally, the parameter stopping_rule
determines how long the coordinate
descent will alternate between searching over a grid of V and W penalties.
(see the Big list of parameters for details)
Advanced Topics¶
Constraining the V matrix¶
In the current implementation, the V matrix is a diagonal matrix, and the individual elements of V are constrained to be positive, as negative values would be interpreted as two units would considered to more similar when their observed values for a particular feature are more different.
Additionally, the V matrix may be constrained to the standard simplex.
which tends to minimize out of sample of error relative to the model
constrained to the nonnegative
orthant in some cases. V is
constrained to the either the simplex or the nonnegative orthant by passing
either "simplex"
or "orthant"
to the constrain
parameter.
Note that with both penalty parameters and V just constrained to the non-negative orthant, then there is an extra degree of freedom. Typically then it is more efficient to constrain V to the simplex. When solving using Azure Batch then only a single penalty parameter is varied so then V should only be constrained to the non-negative orthant.
The "simplex"
constraint is the default when not using Azure batch.
Fold Parameters¶
The data are split into folds both purpose of calculating the cross fold
validation (out-of-sample) errors and for K-fold gradient descent, a
technique used to speed up the model fitting process. The parameters
cv_fold
and gradient_fold
can be passed either an integer number of
folds or an list-of-lists which indicate the units (rows) which are
allocated to each fold.
In the case that an integer is passed, the scikit-learn function
kfold
is used internally to split the data into random folds. For consistency
across calls to fit, the cv_seed
and gradient_seed
parameters are
passed to Kfold(..., random_state=seed)
.
Performance Notes¶
The function get_max_lambda()
requires a single calculation of the
gradient using all of the available data. In contrast, SC.CV_score()
performs gradient descent within each validation-fold of the data.
Furthermore, in the 'pre-only' scenario the gradient is calculated once for
each iteration of the gradient descent, whereas in the 'controls-only'
scenario the gradient is calculated once for each control unit.
Specifically, each control unit is excluded from the set of units that can
be used to predict it's own post-intervention outcomes, resulting in
leave-one-out gradient descent.
For large sample sizes in the 'controls-only' scenario, it may be
sufficient to divide the non-held out control units into "gradient folds", such
that controls within the same gradient-fold are not used to predict the
post-intervention outcomes of other control units in the same fold. This
result's in K-fold gradient descent, which improves the speed of
calculating the overall gradient by a factor slightly greater than c/k
(where c
is the number of control units) with an even greater reduction
in memory usage.
K-fold gradient descent is enabled by passing the parameter grad_splits
to CV_score()
, and for consistency across calls to CV_score()
it is
recommended to also pass a value to the parameter random_state
, which is
used in selecting the gradient folds.
Parallelization¶
If you have the BLAS/LAPACK libraries installed and available to Python,
you should not need to do any further optimization to ensure that maximum
number of processors are used during the execution of fit()
. If
not, seting the parameter paralell=True
when you call
fit()
which will split the work across N - 2 sub-processes where N
is the number of cores in your
machine.
Note that setting paralell=True
when the BLAS/LAPACK are available will
tend to increase running times. Also, this is considered an experimenatl
stub. While it works, parallel processing spends most of the time passing
repeatedly sending a relatively small amount of data, which could be (but
currently is not) initialized in each worker at the start. If this a
priority for your team, feel free to submit a PR or feature request.
Gradient Descent in feature space¶
Currently a custom gradient descent method called cdl_search
(imported
from SparseSC.optimizers.cd_line_search import
. ) is used which which
performs the constrained gradient descent. An alternate gradient descent
function may be supplied to the method
parameter, and any additional
keyword arguments passed to fit()
are passed along to whichever gradient
descent function is used. (see the Big list of
parameters for details)
Fit_fast (Separate)¶
The interface here is very similar except that V
and v_pen
are determined
by modified problems. This is specified by the match_space_maker
option and the main
choice is whether to use the fully separate solution method via the default MTLassoCV_MatchSpace
or to try the mixed method (slower, but better) via MTLassoMixed_MatchSpace
Synthetic Differences in Differences¶
The estimate_effects()
function can be used to conduct
DID style
analyses where counter-factual observations are constructed using Sparse
Synthetic Controls.
import SparseSC
# Fit the model:
fitted_estimates = SparseSC.estimate_effects(outcomes,unit_treatment_periods,covariates=X,fast=True,...)
# Print summary of the model including effect size estimates,
# p-values, and confidendence intervals:
print(fitted_estimates)
# Extract model attributes:
fitted_estimates.pl_res_post.avg_joint_effect.p_value
fitted_estimates.pl_res_post.avg_joint_effect.CI
# access the fitted Synthetic Controls model:
fitted_model = fitted_estimates.fit
The returned object is of class SparseSCEstResults
.
Feature and Target Data¶
When estimating synthetic controls, units of observation are divided into control and treated units. Data collected on these units may include observations of the outcome of interest, as well as other characteristics of the units (termed "covariates", herein). Outcomes may be observed both before and after an intervention on the treated units.
To maintain independence of the fitted synthetic controls and the post-intervention outcomes of interest of treated units, the post-intervention outcomes from treated units are not used in the fitting process. There are two cuts from the remaining data that may be used to fit synthetic controls, and each has it's advantages and disadvantages.
In the call to estimate_effects()
, outcomes
should
be numeric matrices containing data on the target variables collected prior
to (after) the treatment / intervention ( respectively), and the optional
parameter covariates
may be a matrix of additional features. All matrices
should have one row per unit and one column per observation.
In addition, the rows in covariates
and outcomes
which contain units that were affected
by the intervention ("treated units") should be indicated using the
treated_units
parameter, which may be a vector of booleans or integers
indicating the rows which belong to treat units.
Statistical parameters¶
The confidence level may be specified with the level
parameter, and the
maximum number of simulations used to produce the placebo distribution may
be set with the max_n_pl
parameter.
Additional parameters¶
Additional keyword arguments are passed on to the call to fit()
, which is
responsible for fitting the Synthetic Controls used to create the
counterfactuals.
Model Types¶
There are three distinct types of model fitting with respect choosing optimal values of the penalty parameters and the collection of units that are used for cross validation.
Recall that a synthetic treatment unit is defined as a weighted average of control units where weights are determined from the targets and are chosen to minimize the difference between the each the treated unit and its synthetic unit in the absence of an intervention. This methodology can be combined with cross-fold validation a in several ways in order to accommodate a variety of use cases.
Retrospective Treatment Effects¶
model_type = "retrospective"
In a retrospective analysis, a subset of units have received a treatment which possibly correlates with features of the units and values for the target variable have been collected both before and after an intervention. For example, a software update may have been applied to machines in a cluster which were experiences unusually latency, and retrospectively an analyst wishes to understand the effect of the update on another outcome such as memory utilization.
In this scenario, for each treated unit we wish to create a synthetic unit composed only of untreated units. The units are divided into a training set consisting of just the control units, and a test set consisting of the treated units. Within the training set, feature data consist of anything known prior to a treatment or intervention, such as pre-intervention values from the outcome of interest as well as unit level attributes, sometimes called "covariatess". Likewise, target data consist of observations from the outcome of interest collected after the treatment or intervention is initiated.
Cross-fold validation is done within the training set, holding out a single fold, identifying feature weights within the remaining folds, creating synthetic units for each held out unit defined as a weighted average of the non-held-out units. Out-of-sample prediction errors are calculated for each training fold and summed across folds. The set of penalty parameters that minimizes the Cross-Fold out-of-sample error are chosen. Feature weights are calculated within the training set for the chosen penalties, and finally individual synthetic units are calculated for each treated unit which is a weighted average of the control units.
This model yields a synthetic unit for each treated unit composed of control units.
Prospective Treatment Effects¶
model_type = "prospective"
In a prospective analysis, a subset of units have been designated to receive a treatment but the treatment has not yet occurred and the designation of the treatment may be correlated with a (possibly unobserved) feature of the treatment units. For example, a software update may have been planned for machines in a cluster which are experiencing unusually latency, and there is a desire to understand the impact of the software on memory utilization.
Like the retrospective scenario, for each treated unit we wish to create a synthetic unit composed only of untreated units.
Feature data consist of of unit attributes (covariates) and a subset of the pre-intervention values from the outcome of interest, and target data consist of the remaining pre-intervention values for the outcome of interest
Cross fold validation is conducted using the entire dataset without regard to intent to treat. However, treated units allocated to a single gradient fold, ensuring that synthetic treated units are composed of only the control units.
Cross-fold validation is done within the test set, holding out a single fold, identifying feature weights within the remaining treatment folds combined with the control units, synthetic units for each held out unit defined as a weighted average of the full set of control units.
Out-of-sample prediction errors are calculated for each treatment fold and the sum of these defines the Cross-Fold out-of-sample error. The set of penalty parameters that minimizes the Cross-Fold out-of-sample error are chosen. Feature weights are calculated within the training set for the chosen penalties, and finally individual synthetic units are calculated for each full unit which is a weighted average of the control units.
This model yields a synthetic unit for each treated unit composed of control units.
Prospective Treatment Effects training¶
model_type = "prospective-restricted"
This is motivated by the same example as the previous sample. It requires a larger set of treated units for similar levels of precision, with the benefit of substantially faster running time.
The units are divided into a training set consisting of just the control units, and a test set consisting of the unit which will be treated. feature data will consist of of unit attributes (covariates) and a subset of the pre-intervention values from the outcome of interest, and target data consist of the remaining pre-intervention values for the outcome of interest
Cross-fold validation is done within the test set, holding out a single fold, identifying feature weights within the remaining treatment folds combined with the control units, synthetic units for each held out unit defined as a weighted average of the full set of control units.
Out-of-sample prediction errors are calculated for each treatment fold and the sum of these defines the Cross-Fold out-of-sample error. The set of penalty parameters that minimizes the Cross-Fold out-of-sample error are chosen. Feature weights are calculated within the training set for the chosen penalties, and finally individual synthetic units are calculated for each full unit which is a weighted average of the control units.
This model yields a synthetic unit for each treated unit composed of control units.
Not that this model will tend to have wider confidence intervals and small estimated treatments given the sample it is fit on.
Prospective Failure Detection¶
model_type = "full"
In this scenario the goal is to identify irregular values in an outcome variable prospectively in a homogeneous population (i.e. when no treatment / intervention is planned). As an example, we may wish to detect failure of any one machine in a cluster, and to do so, we wish to create a synthetic unit for each machine which is composed of a weighted average of other machines in the cluster. In particular, there may be variation of the workload across the cluster and where workload may vary across the cluster by (possibly unobserved) differences in machine hardware, cluster architecture, scheduler versions, networking architecture, job type, etc.
Like the Prospective Treatment Effects scenario, Feature data consist of of unit attributes (covariates) and a subset of the pre-intervention values from the outcome of interest, and target data consist of the remaining pre-intervention values for the outcome of interest, and Cross fold validation is conducted using the entire dataset, and Cross validation and gradient folds are determined randomly.
This model yields a synthetic unit for every unit in the dataset, and synthetic units are composted of the remaining units not included in the same gradient fold.
Summary¶
Here is a summary of the main differences between the model types.
Type | Units used to fit V & penalties | Donor pool for W |
---|---|---|
retrospective | Controls | Controls |
prospective | All | Controls |
prospective-restricted | Treated | Controls |
(prospective) full | All | All |
A tree view of differences:
- Treatment date: The prospective studies differ from the retrospective study in that they can use all units for fitting.
- (Prospective studies) Treated units: The intended-to-treat (ITT) studies differ from the full in that the "treated" units can't be used for donors.
- (Prospective-ITT studies): The restrictive model differs in that it tries to maximize predictive power for just the treated units.
API Reference¶
Estimate Treatment Effects¶
-
SparseSC.estimate_effects.
estimate_effects
(outcomes, unit_treatment_periods, T0=None, T1=None, covariates=None, max_n_pl=10000, ret_pl=False, ret_CI=False, level=0.95, fast=True, model_type='retrospective', T2=None, cf_folds=10, cf_seed=110011, **kwargs) Determines statistical significance for average and individual effects
Parameters: - outcomes (np.array or pd.DataFrame with shape (N,T)) -- Outcomes
- unit_treatment_periods (np.array or pd.Series with shape (N)) -- Vector of treatment periods for each unit (if a unit is never treated then use np.NaN if vector refers to time periods by numerical index and np.datetime64('NaT') if using DateTime to refer to time periods (and thne Y must be pd.DataFrame with columns in DateTime too)) If using a prospective-based design this is the true treatment periods (and fit will be called with pseudo-treatment periods that are T1 periods earlier).
- T0 (int, Optional (default is pre-period for first treatment)) -- pre-history length to match over.
- T1 (int, Optional (Default is post-period for last treatment)) -- post-history length to fit over.
- covariates (np.array or pd.DataFrame with shape (N,K), Optional) -- Additional pre-treatment features
- max_n_pl (int, Optional) -- The full number of placebos is choose(N0,N1). If N1=1 then this is N0. This number grows quickly with N1 so we can set a maximum that we compute and this is drawn at random.
- ret_pl (bool) -- Return the matrix of placebos (different from the SC of the controls when N1>1)
- ret_CI -- Whether to return confidence intervals (requires more memory during execution)
- level (float (between 0 and 1)) -- Level for confidence intervals
- fast (bool) -- Whether to use the fast approximate solution (fit_fast() rather than fit())
- model_type -- Model type
- T2 -- If model='prospective' then the period of which to evaluate the effect
- kwargs -- Additional parameters passed to fit() or fit_fast()
Returns: An instance of SparseSCEstResults with the fitted results
Raises: ValueError -- when invalid parameters are passed
Keyword Args: Passed on to fit() or fit_fast()
-
class
SparseSC.estimate_effects.
SparseSCEstResults
(outcomes, fits, unit_treatment_periods, unit_treatment_periods_idx, unit_treatment_periods_idx_fit, T0, T1, pl_res_pre, pl_res_post, pl_res_post_scaled, max_n_pl, covariates=None, ind_CI=None, model_type='retrospective', T2=None, pl_res_post_fit=None) Bases:
object
Holds estimation info
-
CI
p-value for the current model if relevant, else None.
-
__init__
(outcomes, fits, unit_treatment_periods, unit_treatment_periods_idx, unit_treatment_periods_idx_fit, T0, T1, pl_res_pre, pl_res_post, pl_res_post_scaled, max_n_pl, covariates=None, ind_CI=None, model_type='retrospective', T2=None, pl_res_post_fit=None) Parameters: - outcomes -- Outcome for the whole sample
- fits (dictionary of period->SparseSCFit) -- The fit() return objects
- unit_treatment_periods -- Vector or treatment periods for each unit (if a unit is never treated then use np.NaN if vector refers to time periods by numerical index and np.datetime64('NaT') if using DateTime to refer to time periods (and thne Y must be pd.DataFrame with columns in DateTime too))
- unit_treatment_periods_idx -- the conversion of unit_treatment_periods to indexes (helpful if use had datetime index)
- unit_treatment_periods_idx_fit -- the treatment period indexes passed to fit (helpful if prospective-based design)
- T0 -- Pre-history to match over
- T1 -- post-history to evaluate over
- pl_res_pre (PlaceboResults) -- Statistics for the average fit of the treated units in the pre-period (used for diagnostics)
- pl_res_post (PlaceboResults) -- Statistics for the average treatment effect in the post-period
- pl_res_post_scaled (PlaceboResults) -- Statistics for the average scaled treatment effect (difference divided by pre-treatment RMS fit) in the post-period.
- max_n_pl -- maximum number of of placebos effects used for inference
- covariates -- Nxk matrix of full baseline covariates (or None)
- ind_CI (dictionary of period->CI_int. Each CI_int is for the full sample (not necessarily T0+T1)) -- Confidence intervals for SC predictions at the unit level (not averaged over N1). Used for graphing rather than treatment effect statistics
- model_type -- Model type string
- T2 -- T2 (if prospective-type design)
- pl_res_post_fit -- If prospective-type designs, the PlaceboResults for target period used for fit (still before actual treatment)
-
fit
Handy accessor if there is only one treatment-period
-
get_V
(treatment_period=None) Returns V split across potential pre-treatment outcomes and baseline features
-
get_W
(treatment_period=None) Get W (np.ndarray 2D or pd.DataFrame depends on what was passed in) for one of the treatment periods
-
get_sc
(treatment_period=None) Returns and NxT matrix of synthetic controls. For units not eligible (those previously treated or between treatment_period and treatment_period+T1) The results is left empty.
-
get_tr_time_info
(treatment_period) Returns treatment info: a) indexes for the time index (helpful if user used an np datetime index) b) treatment period used in the call to fit (helpful if model is a prospective type)
Parameters: treatment_period -- treatment period in the user's view (could be TimeIndex) Returns: (reatment_period_idx, treatment_period_idx_fit, treatment_period_fit)
-
p_value
p-value for the current model if relevant, else None.
-
-
class
SparseSC.utils.metrics_utils.
PlaceboResults
(effect_vec, avg_joint_effect, rms_joint_effect, N_placebo) Bases:
object
Holds statistics for a vector of effects, include the full vector and two choices of aggregates (average and RMS)
-
__init__
(effect_vec, avg_joint_effect, rms_joint_effect, N_placebo) Parameters: - effect_vec (EstResultCI) -- Statistics for a vector of time-specific effects.
- avg_joint_effect (EstResultCI) -- Statistics for the average effect.
- rms_joint_effect (EstResultCI) -- Statistics for the RMS effect
- N_placebo (EstResultCI) -- Number of placebos used for the statistis
-
-
class
SparseSC.utils.metrics_utils.
EstResultCI
(effect, p, ci=None, placebos=None) Bases:
object
Holds an estimation result (effect + statistical significance)
-
__init__
(effect, p, ci=None, placebos=None) Parameters: - effect (scalar, vector, or pd.Series) -- Effect
- p (Scalar or vector) -- p-value
- ci (CI_int) -- Confidence interval
- placebos (matrix) -- Full matrix of placebos
-
-
class
SparseSC.utils.metrics_utils.
CI_int
(ci_low, ci_high, level) Bases:
object
Class to hold informatino for a confidence interval (for single point or for a vector)
-
__init__
(ci_low, ci_high, level) Parameters: - ci_low (scalar, vector, or pd.Series) -- Low-bound
- ci_high (scalar, vector, or pd.Series) -- High-bound
- level (float) -- Level (1-alpha) for the CI interval
-
-
class
SparseSC.utils.metrics_utils.
AA_results
(diffs_pre, diffs_post, level=0.95, sym_CI=True) Bases:
object
Constructs simple results object from AA test that has already been fit (model_type="full") and diffs constructed.
-
__init__
(diffs_pre, diffs_post, level=0.95, sym_CI=True) Provides typical stats for sims of estimation :param diffs_pre: Pre-treatment diffs :param diffs_post: vector of CI lower bounds :param level: level :param sym_CI: Return symmetric CIs. Will also set "treatment" effect to 0 rather than mean of placebo diffs. :returns: AA_results
-
Fit a Synthetic Controls Model (Slow, Joint)¶
-
SparseSC.fit.
fit
(features, targets, treated_units=None, w_pen=None, v_pen=None, grid=None, grid_min=1e-06, grid_max=1, grid_length=20, stopping_rule=2, gradient_folds=10, w_pen_inner=False, match_space_maker=None, **kwargs) Parameters: - features (matrix of floats) -- Matrix of features
- targets (matrix of floats) -- Matrix of targets
- model_type (str, default =
"retrospective"
) -- Type of model being fit. One of"retrospective"
,"prospective"
,"prospective-restricted"
or"full"
- treated_units (int[], Optional) -- An iterable indicating the rows of X and Y which contain data from treated units.
- w_pen (float | float[], optional) -- Penalty applied to the difference
between the current weights and the null weights (1/n). default
provided by :func:
w_pen_guestimate
. - v_pen (float | float[], optional) -- penalty
(penalties) applied to the magnitude of the covariate weights.
Defaults to
[ Lambda_c_max * g for g in grid]
, where Lambda_c_max is determined viaget_max_v_pen()
. - grid (float | float[], optional) -- only used when v_pen is not provided.
Defaults to
np.exp(np.linspace(np.log(grid_min),np.log(grid_max),grid_length))
- grid_min (float, default = 1e-6) -- Lower bound for
grid
whenv_pen
andgrid
are not provided. Must be in the range(0,1)
- grid_max (float, default = 1) -- Upper bound for
grid
whenv_pen
andgrid
are not provided. Must be in the range(0,1]
- grid_length (int, default = 20) -- number of points in the
grid
parameter whenv_pen
andgrid
are not provided - stopping_rule (int, float, or function) -- A stopping rule less than one is interpreted as the
percent improvement in the out-of-sample squared prediction error required
between the current and previous iteration in order to continue with the
coordinate descent. A stopping rule of one or greater is interpreted as
the number of iterations of the coordinate descent (rounded down to the
nearest Int). Alternatively,
stopping_rule
may be a function which will be passed the current model fit, the previous model fit, and the iteration number (depending on it's signature), and should return a truthy value if the coordinate descent should stop and a falsey value if the coordinate descent should stop. - choice (str or function. default =
"min"
) -- Method for choosing from among the v_pen. Only used when v_pen is an iterable. Defaults to"min"
which selects the v_pen parameter associated with the lowest cross validation error. - cv_folds (int or (int[],int[])[], default = 10) -- An integer number of Cross Validation folds passed to
sklearn.model_selection.KFold()
, or an explicit list of train validation folds. TODO: These folds are calculated withKFold(...,shuffle=False)
, but instead, it should be assigned a random state. - gradient_folds (int or (int[],int[])[]) -- (default = 10) An integer
number of Gradient folds passed to
sklearn.model_selection.KFold()
, or an explicit list of train validation folds, to be used model_type is one either"foo"
"bar"
. - cv_seed (int, default = 10101) -- passed to
sklearn.model_selection.KFold()
to allow for consistent cross validation folds across calls - gradient_seed (int, default = 10101) -- passed to
sklearn.model_selection.KFold()
to allow for consistent gradient folds across calls when model_type is one either"foo"
"bar"
with and gradient_folds is an integer. - progress (boolean, default =
True
) -- Controls the level of verbosity. If True, the messages indication the progress are printed to the console (stdout). - kwargs -- Additional arguments passed to the optimizer (i.e.
method
or scipy.optimize.minimize). See below. - custom_donor_pool (boolean, default =
None
) -- By default all control units are allowed to be donors for all units. There are cases where this is not desired and so the user can pass in a matrix specifying a unit-specific donor pool (NxC matrix of booleans). Common reasons for restricting the allowability: (a) When we would like to reduce interpolation bias by restricting the donor pool to those units similar along certain features. (b) If units are not completely independent (for example there may be contamination between neighboring units). This is a violation of the Single Unit Treatment Value Assumption (SUTVA). Note: These are not used in the fitting stage (of V and penalties) just in final unit weight determination.
Keyword Args: - method (str or callable) -- The method or function
- responsible for performing gradient descent in the covariate
space. If a string, it is passed as the
method
argument toscipy.optimize.minimize()
. Otherwise,method
must be a function with a signature compatible withscipy.optimize.minimize()
(method(fun,x0,grad,**kwargs)
) which returns an object havingx
andfun
attributes. (Default =SparseSC.optimizers.cd_line_search.cdl_search()
)
- learning_rate (float, Default = 0.2) -- The initial learning rate
- which determines the initial step size, which is set to
learning_rate * null_model_error / gradient
. Must be between 0 and 1.
- learning_rate_adjustment (float, Default = 0.9) -- Adjustment factor
- applied to the learning rate applied between iterations when the
optimal step size returned by
scipy.optimize.line_search()
is greater less than 1, else the step size is adjusted by1/learning_rate_adjustment
. Must be between 0 and 1,
- tol (float, Default = 0.0001) -- Tolerance used for the stopping
- rule based on the proportion of the in-sample residual error reduced in the last step of the gradient descent.
Returns: A
SparseSCFit
object containing details of the fitted model.Return type: SparseSCFit
Raises: ValueError -- when
treated_units
is not None and not aniterable
, or when model_type is not one of the allowed values
-
class
SparseSC.fit.
SparseSCFit
(features, targets, control_units, treated_units, model_type, V, sc_weights, targets_sc=None, fitted_v_pen=None, fitted_w_pen=None, initial_v_pen=None, initial_w_pen=None, score=None, scores=None, selected_score=None, match_space_trans=None, match_space=None, match_space_desc=None) Bases:
object
A class representing the results of a Synthetic Control model instance.
-
__init__
(features, targets, control_units, treated_units, model_type, V, sc_weights, targets_sc=None, fitted_v_pen=None, fitted_w_pen=None, initial_v_pen=None, initial_w_pen=None, score=None, scores=None, selected_score=None, match_space_trans=None, match_space=None, match_space_desc=None)
-
get_weights
(include_trivial_donors=True) getter for the sc_weights. By default, the trivial
-
predict
(targets=None, include_trivial_donors=True) predict method
Parameters: - targets ((optional) matrix of floats) -- Matrix of targets
- include_trivial_donors (boolean) -- Should donors for whom selected
predictors and all targets equal to zero be included in the weights for
non-trivial units. These units will typically have a weight of 1 /
total number of units as they do not contribute to the gradient.
Default =
`False`
Returns: matrix of predicted outcomes
Return type: matrix of floats
Raises: ValueError -- When
targets.shape[0]
is inconsistent with the fitted model.
-
sc_weights
getter for the sc_weights. By default, the trivial
-
show
() display goodness of figures illustrating goodness of fit
-
summary
() A summary of the model fit / penalty selection
This illustrates that (a) the gird function could / should be better, and (b) currently more than two iterations is typically useless.
-
Fit a Synthetic Controls Model (Fast, Separate)¶
-
SparseSC.fit_fast.
fit_fast
(features, targets, model_type='restrospective', treated_units=None, w_pens=None, custom_donor_pool=None, match_space_maker=None, w_pen_inner=True, avoid_NxN_mats=False, verbose=0, targets_aux=None, **kwargs) Parameters: - features (matrix of floats) -- Matrix of features
- targets (matrix of floats) -- Matrix of targets
- model_type (str, default =
"retrospective"
) -- Type of model being fit. One of"retrospective"
,"prospective"
,"prospective-restricted"
or"full"
- treated_units (int[], default=np.logspace(start=-5, stop=5, num=40) (sklearn.RidgeCV can't automatically pick)) -- An iterable indicating the rows of X and Y which contain data from treated units.
- w_pens (float[], default=np.logspace(start=-5, stop=5, num=40)) -- Penalization values to try when searching for unit weights.
- treated_units -- An iterable indicating the rows of X and Y which contain data from treated units.
- custom_donor_pool (boolean, default =
None
) -- By default all control units are allowed to be donors for all units. There are cases where this is not desired and so the user can pass in a matrix specifying a unit-specific donor pool (NxC matrix of booleans). Common reasons for restricting the allowability: (a) When we would like to reduce interpolation bias by restricting the donor pool to those units similar along certain features. (b) If units are not completely independent (for example there may be contamination between neighboring units). This is a violation of the Single Unit Treatment Value Assumption (SUTVA). Note: These are not used in the fitting stage (of V and penalties) just in final unit weight determination. - match_space_maker -- Function with signature MatchSpace_transformer, V_vector, best_v_pen, V desc = match_space_maker(X, Y, fit_model_wrapper) where we can call fit_model_wrapper(MatchSpace_transformer, V_vector). Default is MTLassoCV_MatchSpace_factory().
- avoid_NxN_mats (bool, default=False) -- There are several points where typically a matrices on the order of NxN would be made (either N or N_c). With a large number of units these can be quite big. These can be avoided. One consequence is that the full unit-level weights will not be kept and just the built Synthetic Control outcome will be return.
- verbose (int, default=0) -- Verbosity level. 0 means no printouts. 1 will note times of completing each of the 3 main stages and some loop progress bars. 2 will print memory snapshots (Optionally out to a file if the env var SparseSC_log_file is set).
- kwargs -- Additional parameters so that one can easily switch between fit() and fit_fast()
Returns: A
SparseSCFit
object containing details of the fitted model.Return type: SparseSCFit
Raises: ValueError -- when
treated_units
is not None and not aniterable
, or when model_type is not one of the allowed values
-
SparseSC.utils.match_space.
Fixed_V_factory
(V) Return a MatchSpace function with user-supplied V over raw X.
Parameters: V -- V Matrix on the raw features Returns: a function with the signature MatchSpace fn, V vector, best_v_pen, V = function(X,Y)
-
SparseSC.utils.match_space.
MTLassoCV_MatchSpace_factory
(v_pens=None, n_v_cv=5, sample_frac=1, Y_col_block_size=None, se_factor=None, normalize=True) Return a MatchSpace function that will fit a MultiTaskLassoCV for Y ~ X
Parameters: - v_pens -- Penalties to evaluate (default is to automatically determince)
- n_v_cv -- Number of Cross-Validation folds
- sample_frac -- Fraction of the data to sample
- se_factor -- Allows taking a different penalty than the min mse. Similar to the lambda.1se rule, if not None, it will take the max lambda that has mse < min_mse + se_factor*(MSE standard error).
Returns: MatchSpace fn, V vector, best_v_pen, V
-
SparseSC.utils.match_space.
MTLSTMMixed_MatchSpace_factory
(T0=None, K_fixed=0, M_sizes=None, dropout_rate=0.2, epochs=2, verbose=0, hidden_length=100) Return a MatchSpace function that will fit an LSTM of [X_fixed, X_time_varying, Y_pre] ~ Y with the hidden-layer size optimized to reduce errors on goal units
Parameters: - T0 -- length of Y_pre
- K_fixed -- Number of fixed unit-covariates (rest will assume to be time-varying)
- M_sizes -- list of sizes of hidden layer (match-space) sizes to try. Default is range(1, 2*int(np.log(Y.shape[0])))
- dropout_rate --
- epochs --
- verbose --
- hidden_length --
Returns: MatchSpace fn, V vector, best_M_size, V
-
SparseSC.utils.match_space.
MTLassoMixed_MatchSpace_factory
(v_pens=None, n_v_cv=5) Return a MatchSpace function that will fit a MultiTaskLassoCV for Y ~ X with the penalization optimized to reduce errors on goal units
Parameters: - v_pens -- Penalties to evaluate (default is to automatically determince)
- n_v_cv -- Number of Cross-Validation folds
Returns: MatchSpace fn, V vector, best_v_pen, V
Running Jobs in Parallel with Azure Batch¶
Fitting a Sparse Synthetic Controls model can result in a very long running time. Fortunately much of the work can be done in parallel and executed in the cloud, and the SparseSC package comes with an Azure Batch utility which can be used to fit a Synthetic controls model using Azure Batch. There are code examples for Windows CMD, Bash and Powershell.
Setup¶
Running SparseSC with Azure Batch requires the super_batch
library which
can be installed with:
pip install git+https://github.com/jdthorpe/batch-config.git
Also note that this module has only been tested with Python 3.7.
You will also need pyyaml
and psutil
.
Create the Required Azure resources¶
Running SparseSC with Azure Batch requires a an Azure account and handful of resources and credentials. These can be set up by following along with section 4 of the super-batch README.
Prepare parameters for the Batch Job¶
The parameters required to run a batch job can be created using fit()
by
providing a directory where the parameters files should be stored:
from SparseSC import fit
batch_dir = "/path/to/my/batch/data/"
# initialize the batch parameters in the directory `batch_dir`
fit(x, y, ... , batchDir = batch_dir)
Executing the Batch Job¶
In the following Python script, a Batch configuration is created and the batch job is executed with Azure Batch. Note that in the following script, the various Batch Account and Storage Account credentials are taken from system environment varables, as in the super-batch readme.
import os
from datetime import datetime
from super_batch import Client
from SparseSC.utils.AzureBatch import (
DOCKER_IMAGE_NAME,
create_job,
)
# Batch job names must be unique, and a timestamp is one way to keep it uniquie across runs
timestamp = datetime.utcnow().strftime("%H%M%S")
batch_dir = "/path/to/my/batch/data/"
batch_client = Client(
# Name of the VM pool
POOL_ID= name,
# number of standard nodes
POOL_NODE_COUNT=5,
# number of low priority nodes
POOL_LOW_PRIORITY_NODE_COUNT=5,
# VM type
POOL_VM_SIZE= "STANDARD_A1_v2",
# Job ID. Note that this must be unique.
JOB_ID= name + timestamp,
# Name of the storage container for storing parameters and results
CONTAINER_NAME= name,
# local directory with the parameters, and where the results will go
BATCH_DIRECTORY= batch_dir,
# Keep the pool around after the run, which saves time when doing
# multiple batch jobs, as it typically takes a few minutes to spin up a
# pool of VMs. (Optional. Default = False)
DELETE_POOL_WHEN_DONE=False,
# Keeping the job details can be useful for debugging:
# (Optional. Default = False)
DELETE_JOB_WHEN_DONE=False
)
create_job(batch_client, batch_dir)
# run the batch job
batch_client.run()
# aggregate the results into a fitted model instance
fitted_model = aggregate_batch_results(batch_dir)
Cleaning Up¶
When you are done fitting your model with Azure Batch be sure to clean up your Azure Resources in order to prevent unexpected charges on your Azure account.
Solving¶
The Azure batch will just vary one of the penalty parameters. You should therefore not specify the simplex constraint for the V matrix as then it will be missing one degree of freedom.
FAQ¶
What if I get disconnected while the batch job is running?
Once the pool and the job are created, they will keep running until the job completes, or your delete the resources. You can reconnect create the
batch_client
as in the example above and then reconnect to the job and download the results with:batch_client.load_results() fitted_model = aggregate_batch_results(batch_dir)
In fact, if you'd rather not wait for the job to compelte, you can add the parameter
batch_client.run(... ,wait=False)
and therun_batch_job
will return as soon as the job and pool configuration have been createdn in Azure.batch_client.run()
orbatch_client.load_results()
complain that the results are in complete. What happened?Typically this means that one or more of the jobs failed, and a common reason for the job to fail is that the VM runs out of memory while running the batch job. Failed Jobs can be viewed in either the Azure Batch Explorer or the Azure Portal. The
POOL_VM_SIZE
use above ("STANDARD_A1_v2") is one of the smallest (and cheapest) VMs available on Azure. Upgrading to a VM with more memory can help in this situation.Why does
aggregate_batch_results()
take so long?Each batch job runs a single gradient descent in V space using a subset (Cross Validation fold) of the data and with a single pair of penalty parameters, and return the out of sample error for the held out samples.
aggregate_batch_results()
very quickly aggregates these out of sample errors and chooses the optimal penalty parameters given thechoice
parameter provided tofit()
oraggregate_batch_results()
. Finally, with the selected parameters, a final gradient descent is run using the full dataset which will be larger than the and take longer as the rate limiting step ( scipy.linalg.solve ) has a running time ofO(N^3)
. While it is possible to run this step in parallel as well, it hasn't yet been implemented.
Developer Notes¶
Python environments¶
You can create Anaconda environments using
conda env create -f test/SparseSC_36.yml
You can can do update
rather than create
to update existing ones (to avoid potential bugs make sure the env isn't currently active).
Note: When regenerating these files (conda env export > test/SparseSC_*.yml
) make sure to remove the final prefix
line since that's computer specific. You can do this automatically on Linux by inserting | grep -v "prefix"
and on Windows by inserting | findstr -v "prefix"
.
Building the docs¶
Requires Python >=3.6 and packages: sphinx
, recommonmark
, sphinx-markdown-tables
.
Use (n)make htmldocs
and an index HTML file is madeat docs/build/html/index.html
.
To build a mini-RTD environment to test building docs:
- You can make a new environment with Python 3.7 (
conda create -n SparseSC_37_rtd python=3.7
) - update
pip
(likely fine). pip install --upgrade --no-cache-dir -r docs/rtd-base.txt
. This file is loosely kept in sync by looking at the install commands on the rtd run.pip install --exists-action=w --no-cache-dir -r docs/rtd-requirements.txt
. This file doesn't list the full environment versions because that causes headaches when the rtd base environment got updated. It downgrades Sphinx to a known good version that allows markdown files to have math in code quotes (GH Issues) (there might be higher ones that also work, didn't try).
Running examples¶
The Jupyter notebooks require matplotlib
, jupyter
, and notebook
.
Testing¶
We use the built-in unittest
. Can run from makefile using the tests
target or you can run python directly from the repo root using the following types of commands:
python -m unittest test/test_fit.py #file (only Python >=3.5)
python -m unittest test.test_fit #module
python -m unittest test.test_fit.TestFit #class
python -m unittest test.test_fit.TestFit.test_retrospective #function
Release Process¶
- Ensure the makefile target
check
(which does pylint, tests, doc building, and packaging) runs clean - If new version, check that it's been updated in
SparseSC/src/__init__.py
- Updated
Changelog.md
- Tag/Release in version control
Anomaly Detection¶
Overview¶
In this scenario the goal is to identify irregular values in an outcome variable prospectively in a homogeneous population (i.e. when no treatment / intervention is planned). As an example, we may wish to detect failure of any one machine in a cluster, and to do so, we wish to create a synthetic unit for each machine which is composed of a weighted average of other machines in the cluster. In particular, there may be variation of the workload across the cluster and where workload may vary across the cluster by (possibly unobserved) differences in machine hardware, cluster architecture, scheduler versions, networking architecture, job type, etc.
Like the Prospective Treatment Effects scenario, Feature data consist of of unit attributes (covariates) and a subset of the pre-intervention values from the outcome of interest, and target data consist of the remaining pre-intervention values for the outcome of interest, and Cross fold validation is conducted using the entire dataset, and Cross validation and gradient folds are determined randomly.
Example¶
In this scenario, we'll need a matrix with past observations of the outcome
(target) of interest (targets
), with one row per unit of observation, and
one column per time period, ordered from left to right. Additionally we
may have another matrix of additional features with one row per unit and
one column per feature (features
). Armed with this we may wish to construct a
synthetic control model to help decide weather future observations
(additional_observations
) deviate from their synthetic predictions.
The strategy will be to divide the targets
matrix into two parts (before
and after column t
), one of which will be used as features, and other
which will be treated as outcomes for the purpose of fitting the weights
which make up the synthetic controls model.
from numpy import hstack
from SparseSC import fit
# Let X be the features plus some of the targets
X = hstack([features, targets[:,:t])
# And let Y be the remaining targets
Y = targets[:,t:]
# fit the model:
fitted_model = fit(X=X,
Y=Y,
model_type="full")
The model_type="full"
allows produces a model in which every unit can
serve as a control for every other unit, unless of course the parameter
custom_donor_pool
is specified.
Now with our fitted synthetic control model, as soon as new set of targets
outcomes are observed for each unit, we can create synthetic outcomes using
our fitted model using the predict()
method:
synthetic_controls = fitted_model.predict(additional_observations)
Note that while the call to fit()
is computationally intensive, the call
to model.predict()
is fast and can be used for real time anomaly
detection.
Model Details:¶
This model yields a synthetic unit for every unit in the dataset, and synthetic units are composted of the remaining units not included in the same gradient fold.
Type | Units used to fit V & penalties | Donor pool for W |
---|---|---|
(prospective) full | All units | All units |