The functions in the library are standalone and can be imported and used from within any project and from the command line.
Loading of data from various sources is not in scope of this library.
Assumptions made by the analyses:
- Sample-size estimation:
  - Treatment does not affect variance
  - Variance in treatment and control is identical
  - Mean of delta is normally distributed
- Welch t-test:
  - Mean of means is t-distributed (or normally distributed)
- In general:
  - Sample represents underlying population
  - Entities are independent
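
These assumptions are what make the usual closed-form calculations valid. The following is a minimal sketch using only the Python standard library, not the library's own implementation; the function names `sample_size_per_variant` and `welch_t` are made up for illustration. The sample-size formula relies on identical variance in both groups and a normally distributed mean of delta, while the Welch statistic allows the two variances to differ.

```python
import math
from statistics import NormalDist, mean, variance


def sample_size_per_variant(sigma, delta, alpha=0.05, power=0.8):
    """Entities needed per variant to detect a difference `delta` (two-sided test).

    Assumes the same standard deviation `sigma` in treatment and control
    and a normally distributed mean of delta.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sigma ** 2 / delta ** 2)


def welch_t(treatment, control):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    v_t = variance(treatment) / len(treatment)  # squared standard error, treatment
    v_c = variance(control) / len(control)      # squared standard error, control
    t = (mean(treatment) - mean(control)) / math.sqrt(v_t + v_c)
    dof = (v_t + v_c) ** 2 / (
        v_t ** 2 / (len(treatment) - 1) + v_c ** 2 / (len(control) - 1)
    )
    return t, dof
```

For example, `sample_size_per_variant(sigma=1.0, delta=0.5)` returns 63 with the default 5% significance level and 80% power.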
Main user stories:
As a Data Scientist, I want to perform all the basic analysis routines that are typical of the analysis of an A/B Test (a.k.a. Between-Subject Randomised Controlled Trial) while retaining access to the raw data, so that I can also perform custom analyses and answer the questions of stakeholders with little effort.
As an analyst from a different department, I want to bring my own data and easily use this library to analyse it. In other words, as long as the data is in a format compatible with `expan.ExperimentData` (documented below), importing it into the library and then performing analyses on it should be almost trivial.
- Data to be analysed is loaded into the `ExperimentData` class.
- Features and KPIs are stored separately (but are exposed as a single object, `metrics`, by dynamically joining the two).
- An `ExperimentData` therefore contains:
  - 2 pandas DataFrame objects (`kpis` and `features`)
  - a dictionary for metadata
  - a property (`metrics`) which dynamically returns another DataFrame
  - a set of functions and properties to simplify access to the data somewhat
- Analysis functionality is provided on a subclass of this (`expan.Experiment`).
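
The structure described above can be sketched as follows. This is a hypothetical stand-in written for illustration, not the actual `expan.ExperimentData` class: the class name `ExperimentDataSketch`, the column names, and the choice of a left merge on `variant` and `entity` are all assumptions.

```python
import pandas as pd


class ExperimentDataSketch:
    """Hypothetical stand-in for the container described above (not expan's class)."""

    def __init__(self, kpis, features, metadata):
        self.kpis = kpis          # data measured after the start of the treatment
        self.features = features  # data not expected to be influenced by the treatment
        self.metadata = dict(metadata)

    @property
    def metrics(self):
        # Join features onto KPIs on the fly; the joined frame is not stored.
        return self.kpis.merge(self.features, on=["variant", "entity"], how="left")


# Illustrative rows only.
kpis = pd.DataFrame({
    "variant": ["A", "A", "B"],
    "entity": ["ec0231efh", "ec0231efh", "f387534e2"],
    "time_since_treatment": [0, 1, 0],
    "orders": [1, 2, 0],
})
features = pd.DataFrame({
    "variant": ["A", "B"],
    "entity": ["ec0231efh", "f387534e2"],
    "age": [32, 65],
})
data = ExperimentDataSketch(kpis, features, {"experiment": "Generic Website Improvement"})
print(data.metrics)
```

Because the features are stored once per entity, they are repeated in `metrics` for every `time_since_treatment` row of that entity while `kpis` and `features` themselves stay unchanged.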
Underlined column names refer to indices; bold is any column or row name; and square brackets indicate [an optional item].
This is a dictionary of information describing the experiment to be analysed.
key | example value | explanation |
---|---|---|
experiment | "Generic Website Improvement" | Name of the experiment, as known to stakeholders. Can be anything meaningful to you. |
[experiment_id] | "a9a9e987a9f99d3_2015-01-01T12:00:00.123" | This uniquely identifies the experiment. Could be a concatenation of the experiment name and the experiment start timestamp. |
sources | ["our_mysql","website_logs"] | Names of the data sources used in the preparation of this data. |
baseline_variant | “No Change” | The variant against which all others will be measured. |
[retrieval_time] | 2015-10-21H18:28CEST | Time that data was fetched from original sources... perhaps this should be a list with an entry per source? |
[primary_KPI] | "orders" | Overall Evaluation Criteria |
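
A metadata dictionary following this table might look like the sketch below. The helper `missing_metadata_keys` is hypothetical, and treating the keys without square brackets as required is an assumption drawn from the table rather than a documented rule.

```python
# Keys without square brackets in the table are treated as required here (an assumption).
REQUIRED_KEYS = {"experiment", "sources", "baseline_variant"}


def missing_metadata_keys(metadata):
    """Return the required keys that are absent, sorted for stable output."""
    return sorted(REQUIRED_KEYS - set(metadata))


metadata = {
    "experiment": "Generic Website Improvement",
    "experiment_id": "a9a9e987a9f99d3_2015-01-01T12:00:00.123",
    "sources": ["our_mysql", "website_logs"],
    "baseline_variant": "No Change",
    "primary_KPI": "orders",
}
print(missing_metadata_keys(metadata))
```

Extra keys are deliberately not rejected, since the metadata is meant to be a plain dictionary that analysts can extend.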

Example `kpis` DataFrame:

variant | entity | [time_since_treatment] | number of orders | PCII |
---|---|---|---|---|
A | ec0231efh | 0 | 1 | 23.23 |
A | ec0231efh | 1 | 2 | 250.32 |
B | f387534e2 | 0 | 0 | - |

Example `features` DataFrame:

variant | entity | treatment start time | age | PCII_365 |
---|---|---|---|---|
A | ec0231efh | 2015-02-23H12:00CEST | 32 | 932.92 |
B | f387534e2 | 2015-02-23H12:00CEST | 65 | 23.44 |

The Results object is based on a single pandas DataFrame object. Currently it has-a DataFrame, but could in the future be implemented so that it is-a DataFrame.
Similar to the input data, Results have metadata (a dictionary) and a DataFrame.
This is a dictionary describing the results. Some entries are derived directly from the metadata of the input data; the others are additional.
key | example value | explanation |
---|---|---|
experiment | “Generic Website Improvement 2015-01-01” | see ExperimentData metadata |
experiment_id | "a9a9e987a9f99d3_2015-01-01T12:00:00.123" | see ExperimentData metadata |
retrieval_time | see above | retrieval time of the data sources |
analysis_time | 2015-10-21H18:28CEST | Time that the analysis was performed. (not yet implemented) |
baseline_variant | “No Change” | Variant against which all results were computed. |
primary_KPI | "PCII" | The KPI used for OEC (Overall Evaluation Criteria). |
metric_units | | The underlying unit of each metric. (not yet implemented) Probably this can be combined with the full metric object? |
cost_of_treatment | {'A': 1, 'B': 1.5} | Cost of treatment per variant as a dict used to offset the uplift. |
expan_version | 1.0.1 | The version of expan that was used to compute the results. |
[analysts] | ["joe.bloggs@zalando.de"] | Identification of the data scientists running the analysis: probably email address is best here. Will be a list, but is optional. |
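
How the results metadata could be assembled from the input metadata is sketched below. The function `build_results_metadata` and the tuple `INHERITED_KEYS` are hypothetical; which keys are carried over is read off the table above and is an assumption, not the library's behaviour.

```python
from datetime import datetime, timezone

# Keys assumed to be carried over from the ExperimentData metadata.
INHERITED_KEYS = (
    "experiment",
    "experiment_id",
    "retrieval_time",
    "baseline_variant",
    "primary_KPI",
)


def build_results_metadata(data_metadata, expan_version, **additional):
    """Copy the inherited keys and add the analysis-specific entries."""
    results = {k: data_metadata[k] for k in INHERITED_KEYS if k in data_metadata}
    results["analysis_time"] = datetime.now(timezone.utc).isoformat()
    results["expan_version"] = expan_version
    results.update(additional)  # e.g. cost_of_treatment, analysts
    return results


results_metadata = build_results_metadata(
    {"experiment": "Generic Website Improvement", "baseline_variant": "No Change"},
    expan_version="1.0.1",
    cost_of_treatment={"A": 1, "B": 1.5},
)
```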
The binning objects are stored as a dictionary of 'Binning' objects in the Results structure, indicating how the subgroups were created.
The bin associated with a subgroup in the results dataframe is referenced by the string label.
subgroup_metric | binning | label_format_str (going to deprecate this) | label_example (not actually in results) |
---|---|---|---|
Age | <Binning Object Created on Age data> | '{lo},{hi}' | 20-30 |
CLV | <Binning Object Created on CLV data> | '{standard}' | [102.0,144.5) |
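
The idea of looking up a subgroup by its string label can be sketched as below. `BinningSketch` is a hypothetical class, not the library's `Binning` object, and the half-open intervals, the format-string placeholders, and the bin edges beyond those in the table are assumptions made for illustration.

```python
from bisect import bisect_right


class BinningSketch:
    """Hypothetical numeric binning: consecutive half-open bins [lo, hi)."""

    def __init__(self, edges):
        self.edges = sorted(edges)

    def label(self, value, format_str="{standard}"):
        """Return the string label of the bin holding `value`, or None if outside."""
        i = bisect_right(self.edges, value) - 1
        if i < 0 or i >= len(self.edges) - 1:
            return None
        lo, hi = self.edges[i], self.edges[i + 1]
        return format_str.format(lo=lo, hi=hi, standard=f"[{lo},{hi})")


# One binning per subgroup metric, keyed by the metric name.
binnings = {
    "Age": BinningSketch([20, 30, 40]),
    "CLV": BinningSketch([102.0, 144.5, 200.0]),
}
print(binnings["CLV"].label(120.0))          # [102.0,144.5)
print(binnings["Age"].label(25, "{lo}-{hi}"))  # 20-30
```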
Statistics highlighted in yellow will probably be derived, i.e. calculated on the fly by properties rather than stored in the DataFrame.
Column groups: index (metric, subgroup_metric, subgroup, time_since_treatment, statistic, pctile); variant columns (“Bamboozle”, “Spektakulatrix”, "No Change"); subgroup columns (subgroup_bin_index; think about this); comments (not in data).

metric | subgroup_metric | subgroup | time_since_treatment | statistic | pctile | “Bamboozle” | “Spektakulatrix” | "No Change" | subgroup_bin_index | comments (not in data) |
---|---|---|---|---|---|---|---|---|---|---|
PCII | Age | 20-30 | 0 | uplift | nan | 3.2 | 3.5 | 0 | 0 | the mean of the difference between variant and baseline (variant-baseline) |
PCII | Age | 20-30 | 0 | uplift_rel | nan | 16% | 17.5% | 0% | 0 | the uplift as proportion of baseline ((variant-baseline)/baseline) NB: probably won't be in the dataframe itself because it can be derived (so prob. implement as a property of the results class) |
PCII | Age | 20-30 | 0 | sample_size | nan | 10000 | 5000 | 1000 | 0 | sample size of each variant |
PCII | Age | 20-30 | 0 | uplift_pctile | 2.5 | -0.3 | 1.2 | nan | 0 | percentiles of the difference between variant and baseline (so 95% confidence intervals are represented by the 2.5 and 97.5 percentiles) |
PCII | Age | 20-30 | 0 | uplift_pctile | 97.5 | 7.8 | 7.4 | nan | 0 | |
PCII | Age | 20-30 | 0 | uplift_pctile | 4.3 | 0 | 0 | nan | 0 | any percentile can be represented, including some special ones, like those associated with 0 uplift or uplift of exactly treatment cost. |
PCII | Age | 20-30 | 0 | prob_uplift_over_0 | nan | 0.043 | nan | | 0 | could represent the probability of uplift being over 0 explicitly like this, equivalent to having the uplift_pctile statistic with a value of 0. Discussion is here (only internal to Zalando currently, sorry) |
PCII | Age | 20-30 | 0 | prob_uplift_over_cost | | | | | 0 | |
PCII | Age | 20-30 | 0 | variant_mean | nan | 23.2 | 23.5 | 20 | 0 | simply the mean of the variant, including baseline |
PCII | Age | 20-30 | 0 | pre_treatment_diff | nan | 2.63 | -1.23 | 0 | 0 | feature check result for numerical variables |
PCII | Age | 20-30 | 0 | pre_treatment_diff_pctile | 2.5 | -2.54 | -2.46 | -1.53 | 0 | feature check result for numerical variables |
PCII | Age | 20-30 | 0 | pre_treatment_diff_pctile | 97.5 | 5.34 | 0.64 | 1.56 | 0 | feature check result for numerical variables |
PCII | Age | 20-30 | 0 | chi_square_p | nan | 0.63 | 0.25 | 0.93 | 0 | feature check result for categorical variables |
PCII | Age | 20-30 | 10 | uplift | nan | 5.2 | 22.1 | 0 | 0 | |
PCII | Age | 20-30 | 10 | sample_size | nan | 10000 | 5000 | 1000 | 0 | |
PCII | Age | 20-30 | 10 | uplift_pctile | 2.5 | 0.9 | 10.0 | nan | 0 | |
PCII | Age | 20-30 | 10 | uplift_pctile | 97.5 | 7.8 | 30.0 | nan | 0 | |
PCII | Age | 20-30 | 10 | variant_mean | nan | 27.2 | 44.1 | 22 | 0 | |

- 'time_since_treatment' is currently only included if a trend analysis was done.
- '-' is used as a sentinel for NaNs in the index levels 'metric', 'subgroup_metric' and 'subgroup', because all-NaN index levels cause big problems with pandas reindexing etc.
  - We could think of dropping index levels if they are all NaNs, as is done for the time level.
- Variants are stored in first level of columns.
- Storing baseline_variant as a piece of metadata means we do not need a column for it, and we will most likely have no use case for combining results with different baseline variants. However, we store the baseline variant in the data as an explicit column because this will allow the same structure to be used for plotting the variants directly against each other, and allows for storing the absolute (within-variant) values as well as the uplift information in the same format.
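
Keeping the baseline as an explicit column is what makes derived statistics cheap to compute on demand. The sketch below shows a relative uplift exposed as a property instead of being stored; `ResultsSketch` is a hypothetical class with a simplified single-level index, not the library's Results object.

```python
import pandas as pd


class ResultsSketch:
    """Hypothetical results holder: has-a DataFrame plus a metadata dict."""

    def __init__(self, df, metadata):
        self.df = df
        self.metadata = metadata

    @property
    def uplift_rel(self):
        # (variant - baseline) / baseline, derived on demand instead of stored.
        baseline = self.metadata["baseline_variant"]
        return self.df.loc["uplift"] / self.df.loc["variant_mean", baseline]


# Values taken from the first rows of the example table (time_since_treatment 0).
df = pd.DataFrame(
    {
        "Bamboozle": [3.2, 23.2],
        "Spektakulatrix": [3.5, 23.5],
        "No Change": [0.0, 20.0],
    },
    index=pd.Index(["uplift", "variant_mean"], name="statistic"),
)
results = ResultsSketch(df, {"baseline_variant": "No Change"})
print(results.uplift_rel.round(3).to_dict())
```

With these inputs the property reproduces the 16%, 17.5% and 0% shown for `uplift_rel` in the example table, without adding a row to the stored DataFrame.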
- [Core] Should the feature and KPI data frames be combined?
  - No, we will store them separately and combine them when needed with a join. This join can be cached, but keeping them separate is more efficient, especially for time-dependent analysis where features do not change.
- [Core] Should features and KPIs be class objects (metric class as in Default Analyzer)?
  - Attributes specific to individual metrics can be captured with dictionaries in the metadata, where the metric name is the key to the dictionary (e.g. metadata['is_categorical']={'orders': False, 'gender': True}).
- [Core] Should the metadata and the core data frame be combined into one unified structure?
  - No. Metadata is global to the whole dataframe; it does not apply to individual elements in it. Also, it should be very easily understood and easy to manipulate: analysts should be able to store extra stuff in there as they like.
- [General] Connection to statistical monitoring module?
  - That should be something dealt with in the Analysis Service.
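
As an illustration of per-metric attributes kept in the metadata, the sketch below uses the `is_categorical` dictionary to decide which feature-check statistic applies to a metric. The function name `feature_check_statistic` and the default for unlisted metrics are assumptions; the two statistic names come from the results table.

```python
def feature_check_statistic(metric, metadata):
    """Pick the feature-check statistic for a metric from per-metric metadata."""
    # Metrics missing from the dictionary are treated as numerical (an assumption).
    is_categorical = metadata.get("is_categorical", {}).get(metric, False)
    return "chi_square_p" if is_categorical else "pre_treatment_diff"


metadata = {"is_categorical": {"orders": False, "gender": True}}
print(feature_check_statistic("gender", metadata))
print(feature_check_statistic("orders", metadata))
```

No metric class is needed for this: the attribute lives in a plain dictionary that analysts can inspect and extend.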
Name | Definition | Example |
---|---|---|
Metric | Metric is the generic term covering KPI and Feature. It describes anything that can be measured on a per entity level. | |
KPI | Key Performance Indicator. Used here to describe the data measured after the start of the treatment. It is used to identify the variables which are expected to be influenced by the treatment. | PCII accumulated after treatment start is a typical KPI for customer based experiments. |
Feature | A feature is data which is not expected to be influenced by the treatment. That includes but is not limited to all data that is known on an entity at the start of a treatment. | Age or gender are typical features in customer based experiments. |