Finished reading Trustworthy Online Controlled Experiments by Ron Kohavi, Diane Tang and Ya Xu.
This is a very comprehensive book about running experiments. The heavy focus is on the modern type of online experiment where you might want to use thousands or millions of users who interact with an app or website in order to detect sometimes very small behavioural effects.
This contrasts with many of the previous books on experiments and statistics I’ve read that often assume you have maybe 30 in-person participants you’re looking to study in a lab. It also focuses a bit less on the statistics and a bit more on the practicalities of design and implementation, although there’s plenty of stats to get your teeth into as well.
The authors also share plenty of examples, usually from the big tech companies, so you can see how the experts apply these lessons to make their decisions.
To be honest, you can get a decent idea of what’s included from the chapter titles. Here we go:
- Introduction and Motivation: Why experiment? What are the basic ingredients of doing so?
- Running and Analyzing Experiments: An End-to-End Example
- Twyman’s Law and Experimentation Trustworthiness: Twyman’s law is “Any figure that looks interesting or different is usually wrong”, so this is all about misinterpretation and validity.
- Experimentation Platform and Culture: Maturity models and tools.
- Speed Matters: An End-to-End Case Study: Why platform performance matters.
- Organizational Metrics: The principles and practice of selecting metrics to measure.
- Metrics for Experimentation and the Overall Evaluation Criterion: How to derive metrics you can measure in an experiment from the organisational metrics above.
- Institutional Memory and Meta-Analysis
- Ethics in Controlled Experiments
- Complementary Techniques: Alternatives or complements to experimenting.
- Observational Causal Studies: What to do when RCTs are impossible to carry out.
- Client-Side Experiments: Should you run client-side or server-side experiments?
- Instrumentation: Culture and practice.
- Choosing a Randomization Unit
- Ramping Experiment Exposure: Trading Off Speed, Quality, and Risk: Considerations when releasing your experiment.
- Scaling Experiment Analyses: The less manual the better, generally.
- The Statistics behind Online Controlled Experiments: T-tests et al.
- Variance Estimation and Improved Sensitivity: Pitfalls and Solutions
- The A/A Test: What is it? Why run it?
- Triggering for Improved Sensitivity: When to trigger a participant into an experiment.
- Sample Ratio Mismatch and Other Trust-Related Guardrail Metrics: Measures of how trustworthy your results are.
- Leakage and Interference between Variants: Problems and solutions when test and control groups interfere with each other.
- Measuring Long-Term Treatment Effects: When you want to measure results for much longer than the typical 1-2 week duration of an experiment - and when you don’t need to worry about it.
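To give a flavour of one of the trust checks listed above, here is a minimal sketch of a sample ratio mismatch test for a two-variant experiment. It is an illustration only, not code from the book: the function name, the user counts and the 0.001 alert threshold are all assumptions chosen for the example. It runs a chi-squared goodness-of-fit test with one degree of freedom, comparing the observed split of users against the split the experiment was designed to have.

```python
import math


def srm_p_value(control_users: int, treatment_users: int,
                expected_control_share: float = 0.5) -> float:
    """Return the p-value of a chi-squared test for sample ratio mismatch.

    Hypothetical helper: compares the observed number of users in each
    variant with the number expected under the designed split.
    """
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control

    # Chi-squared statistic over the two variants.
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)

    # With one degree of freedom the survival function of the chi-squared
    # distribution reduces to erfc(sqrt(chi2 / 2)), so no SciPy is needed.
    return math.erfc(math.sqrt(chi2 / 2))


# Made-up counts for a test designed as a 50/50 split.
p = srm_p_value(control_users=50_000, treatment_users=51_500)

# Assumed alert threshold; pick whatever your own platform uses.
if p < 0.001:
    print(f"Possible sample ratio mismatch (p = {p:.2g})")
else:
    print(f"No evidence of sample ratio mismatch (p = {p:.2g})")
```

A very small p-value here suggests that something in assignment or logging is skewing the split, in which case the experiment's other results probably shouldn't be trusted until the cause is found.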
All in all, highly recommended. I learned and formalised a decent amount even after being involved in this field for some time.
