Data analytics is one of the fastest-growing fields, but you can be even more successful in it with at least some knowledge of economics. In particular, the techniques described below can help you build a mini research lab inside your startup, so that you can produce the kind of business intelligence analysis a large company would, while still remaining agile and lean.
Having finished my doctorates in both a traditional economics department and a less traditional, computer science-ish department at Stanford, I’ve had the opportunity and pleasure to interact with a wide range of quantitative data science techniques. The two departments have different styles, but their approaches are highly complementary, something that is increasingly recognized by economists such as Susan Athey and Sendhil Mullainathan.
While the perspective and tools in computer science are highly effective for detecting and predicting patterns in large datasets, they are less suited for understanding the fundamental relationship between different phenomena.
For example, if a business is trying to understand how improvements in its corporate culture affect employee turnover, then it will need to do more than simply build a classifier that predicts turnover from different dimensions of culture.
There are several types of confounding factors that economists routinely encounter in causal inference. The first is reverse causality. If you see that a company has a strong culture and low turnover, is culture driving turnover or turnover driving culture? For example, if culture is in part a function of employee composition, then changes in turnover will necessarily affect the perceptions that the remaining employees have about the culture.
The second is omitted variables. A company with a strong corporate culture might have low turnover for other reasons. For example, if the company offers high-quality health benefits, then employees will be less likely to leave, since doing so would mean forfeiting those benefits.
One of the most effective and versatile tools for overcoming these challenges to causal inference is the use of “instrumental variables” (IV). An instrument is a variable that is correlated with the independent variable of interest, but uncorrelated with the unobserved determinants of the outcome that end up in the error term of the statistical model.
Applying instrumental variables amounts to running a two-stage regression, often called two-stage least squares (2SLS). The “first stage” projects the independent variable of interest on the instrument, together with any other controls included in the model. You then take the predicted values from that projection and estimate a “second stage” regression of the original outcome variable on those predicted values and the controls.
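To make the mechanics concrete, here is a minimal Python sketch of the two stages on simulated data, in which an unobserved confounder drives both culture and turnover. Every variable name here (culture_score, turnover, benefits_quality, and the instrument office_distance) is hypothetical, invented purely for illustration.

```python
# Minimal two-stage least squares (2SLS) by hand, on simulated data.
# All variable names are hypothetical and exist only for this illustration.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1_000

# An unobserved confounder drives both culture and turnover.
confounder = rng.normal(size=n)
office_distance = rng.normal(size=n)   # the instrument: shifts culture, not turnover directly
benefits_quality = rng.normal(size=n)  # an observed control
culture_score = 0.8 * office_distance + 0.5 * confounder + rng.normal(size=n)
turnover = -1.0 * culture_score + 0.3 * benefits_quality + 0.9 * confounder + rng.normal(size=n)

df = pd.DataFrame({
    "turnover": turnover,
    "culture_score": culture_score,
    "benefits_quality": benefits_quality,
    "office_distance": office_distance,
})

# First stage: project the endogenous variable on the instrument plus controls.
first_X = sm.add_constant(df[["office_distance", "benefits_quality"]])
first_stage = sm.OLS(df["culture_score"], first_X).fit()
df["culture_hat"] = first_stage.fittedvalues

# Second stage: regress the outcome on the predicted values plus controls.
second_X = sm.add_constant(df[["culture_hat", "benefits_quality"]])
second_stage = sm.OLS(df["turnover"], second_X).fit()
print(second_stage.params["culture_hat"])  # should land near the true effect of -1.0
```

One caveat: the standard errors from a hand-rolled second stage are not valid, because they ignore the estimation error in the first stage. For real work you would use a dedicated IV routine, such as IV2SLS in the linearmodels package, which handles the inference correctly.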
The effectiveness of the approach rests on two important assumptions. First, the instrument needs to be correlated with the independent variable of interest. If it is not, the predicted values from the first stage carry very little signal about the independent variable, producing a “weak” and unreliable second-stage estimate. Think of it as a signal-to-noise problem. Second, the instrument needs to be uncorrelated with the error in the statistical model. Unfortunately, this latter assumption is inherently untestable, since the error includes anything not explicitly controlled for, but data scientists have the opportunity to be creative and clever when conducting diagnostics. For example, if you can think of a potential confounding variable, try controlling for it and see whether the estimated coefficient changes much; that is the idea behind the “coefficient comparison test”.
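Continuing with the simulated df and first_stage objects from the sketch above, here is a rough sketch of both diagnostics: an F-test on the excluded instrument in the first stage (a common rule of thumb is to worry about weak instruments when the F statistic falls below roughly 10), and the informal coefficient comparison just described.

```python
# Continues from the 2SLS sketch above (df and first_stage already defined).
import statsmodels.api as sm

# 1. Instrument relevance: F-test on the excluded instrument in the first stage.
#    A first-stage F statistic below ~10 is a common warning sign of a weak instrument.
f_result = first_stage.f_test("office_distance = 0")
print("first-stage F on the instrument:", f_result.fvalue)

# 2. Informal coefficient comparison: run the naive OLS with and without a
#    candidate confounder and see how much the culture coefficient moves.
X_short = sm.add_constant(df[["culture_score"]])
X_long = sm.add_constant(df[["culture_score", "benefits_quality"]])
beta_short = sm.OLS(df["turnover"], X_short).fit().params["culture_score"]
beta_long = sm.OLS(df["turnover"], X_long).fit().params["culture_score"]
print("culture coefficient without vs. with the control:", beta_short, beta_long)
```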
Opportunities abound for computer scientists to integrate prediction algorithms into these causal inference approaches, and many methods are already surfacing. One possibility is to use flexible machine-learning classifiers to improve the quality of the first-stage prediction. Specifically, one of the challenges in many settings is the presence of something known as “heterogeneous treatment effects”: when a treatment, say corporate culture, has potentially different effects on different units in the sample, that can create problems for a one-size-fits-all specification. If a machine-learning model can identify a pattern that explains the heterogeneity, that pattern can be incorporated into the first stage.
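As a rough sketch of that idea, and again reusing the hypothetical simulated data from above, one could swap the linear first stage for a random forest and feed its predictions into the second stage. Note that naively plugging in flexible first-stage predictions can introduce overfitting bias, so in practice you would want sample splitting or cross-fitting; the snippet below is only meant to convey the shape of the approach.

```python
# A flexible first stage, continuing from the simulated df above.
# The forest can pick up heterogeneous relationships between the instrument,
# the controls, and the endogenous culture variable.
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor

features = df[["office_distance", "benefits_quality"]]
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(features, df["culture_score"])
df["culture_hat_ml"] = forest.predict(features)

# Second stage: same as before, but with the machine-learned predictions.
second_X_ml = sm.add_constant(df[["culture_hat_ml", "benefits_quality"]])
second_stage_ml = sm.OLS(df["turnover"], second_X_ml).fit()
print(second_stage_ml.params["culture_hat_ml"])
```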
So, where to go from here? There are two books I’d recommend highly to interested readers, each with a somewhat different selling point.
The first is Imbens and Rubin’s “Causal Inference for Statistics, Social, and Biomedical Sciences: An Introduction”. I had the opportunity to study under Guido Imbens in a course at Stanford, and he’s a very kind professor and an excellent thinker. The book takes a very scientific approach to causal inference, carefully laying out a formal framework for how we as data scientists should think about counterfactuals, as if we were conducting a laboratory experiment. It leans more toward the theory side.
The second is Angrist and Pischke’s “Mostly Harmless Econometrics”. These two scholars have successfully taken complex statistical principles and boiled them down into an engaging and practical text. Although the book is accessible at an undergraduate level, one of its differentiating features is that it remains useful for graduate courses and professional work in data science.
These are the two books that come most to mind, but I look forward to reading others’ tips or suggested books too!