May 2023: Data Literature Review

The Data CRT team does a monthly literature review to keep up with what’s happening in the data space, and holds monthly data meetups to learn from our broader community of data practitioners. We explored a few interesting topics that we’d love to share. Don’t miss Camille’s in-depth article review about causal inference!

👩‍🔬🧪 Experimentation as a first-class citizen: Though the tools that exist to support experimentation have increased and improved, sometimes the biggest hindrance is getting buy-in from stakeholders

  • Buy-in generally has to come from the top, and it can be hard to convince someone that they should experiment if they don’t believe they need to
  • Camille’s article review below highlights the challenge of adopting causal inference strategies more broadly, including the lack of structural support within organizations for implementing causal inference. This is due to factors including a dearth of established processes, missing incentive structures, few training opportunities, and the pressure on data scientists to deliver fast results in practical business environments

📈 Getting data is great, but having data that drives value is better: Folks in the past decade have been focused on getting data, but data folks are oftentimes too far from the business and failing to drive the value of which they’re capable

  • It’s easier today than ever to collect, store, and analyze data, but unless your data teams are fully integrated with the business and focused on the highest-leverage problems, your data efforts will not be as effective as they can be

Takeaways from a 2021 paper, “Causal Machine Learning and Business Decision Making,” published online by Paul Hünermund, Jermain Kaminski, and Carla Schmitt

This paper posits a mismatch between the questions being addressed with data-driven approaches and the tools being used to answer them, specifically with regard to causal inference. Standard machine learning techniques, which are correlation and prediction tools often used to make data-driven decisions, cannot by themselves determine causality and can therefore address only a small portion of the decisions an organization needs to make. They cannot predict the outcome of a new intervention strategy or business decision with any degree of confidence. To determine whether a particular action or intervention causes a particular outcome (“causal inference”), data practitioners should use methods that can identify causal relationships. The research draws on 15 interviews and a survey of 234 data practitioners “[…]engaged with machine learning and data science in their job”. For more detail on the research methods and results, see the original paper.
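To make the gap between correlation and causation concrete, here is a minimal simulated sketch. It is our own illustration, not something taken from the paper. The scenario (a promotion, customer "engagement", spend), the numbers, and the function names are all hypothetical. The promotion has no true effect on spend, but because engaged customers are more likely to receive it, the observational comparison shows a large gap. Randomizing who gets the promotion makes that gap disappear.

```python
import random
import statistics


def simulate(n, randomized, true_effect=0.0, seed=0):
    """Simulate (treated, spend) rows for a hypothetical promotion.

    'engagement' is a hidden confounder: it raises spend and, in the
    observational setting, also makes a customer more likely to be treated.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        engagement = rng.gauss(0, 1)
        if randomized:
            # Experiment: a coin flip decides who gets the promotion.
            treated = rng.random() < 0.5
        else:
            # Observational data: engaged customers are targeted more often.
            treated = engagement + rng.gauss(0, 1) > 0
        spend = 10 + 5 * engagement + true_effect * treated + rng.gauss(0, 1)
        rows.append((treated, spend))
    return rows


def mean_difference(rows):
    """Difference in mean spend between treated and untreated customers."""
    treated = [spend for is_treated, spend in rows if is_treated]
    control = [spend for is_treated, spend in rows if not is_treated]
    return statistics.mean(treated) - statistics.mean(control)


# The true effect of the promotion is zero in both settings.
observational_gap = mean_difference(simulate(20_000, randomized=False))
experimental_gap = mean_difference(simulate(20_000, randomized=True))

print(f"Observational gap: {observational_gap:.2f}")
print(f"Experimental gap:  {experimental_gap:.2f}")
```

A predictive model trained on the observational rows would happily learn that "promotion" predicts higher spend, which is accurate as a prediction and misleading as a guide to what happens if the promotion is rolled out to everyone.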

The types of questions businesses address with data science methods

The results of the survey indicate that the business questions data scientists address, and the methods they employ, depend on the methods with which they are already familiar. Executives can have broadly defined questions and no specific methods they hope to employ, so data scientists will interpret and answer those questions based on their existing toolkits. “Thus, the approaches to strategic problems, and the solutions proposed for them, depend on, and are limited to, the methods and capabilities available to the data scientists in the organization.”

Awareness of the difference between correlational and causal knowledge

A majority (60%) of respondents recognized the limitations of predictive models in determining causality, and were able to identify the risk that other people may interpret their correlative relationships as causal.

The results of the interviews indicate that a data practitioner’s daily work is largely determined by management views and opinions, so if determining causality is not of interest to management, it will not be pursued. 60% of respondents say that a better understanding of causal relationships is diffusing in their organization, and that diffusion seems to originate from the data science practitioners (bottom-up). 40% say they are new to the topic and interested in learning more.

Importance of causal inference

As an interviewee states, “The questions we deal with are generally larger than a specific context or a concrete data set. If I want my models to work in different scenarios, across data sets, I quickly arrive at such [causal inference] problems.”

In practice, interviewees say that causal inference is used for better predictions of business metrics, increasing operational efficiency, solving particularly complex problems, and evaluating the performance of specific decisions. Results of the survey indicate, however, that pure prediction is used most commonly in data science projects.

The results also indicate that experimentation is the most common method of causal inference, due to its ease of use. However, experimentation is impractical in many cases, and the data collected is not necessarily suitable for answering business questions. Observational causal inference methods are seen as relevant alternatives, although with some drawbacks, such as concerns about the applicability of these methods to practical business cases, lengthy and expensive deployment, difficulties with explainability, and underdeveloped tools/software.
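As a rough illustration of what an observational method can look like when an experiment is not possible, the sketch below uses stratification on an observed confounder. This is one simple technique among many, chosen by us for brevity, and it is not described in the paper. The data is simulated and every name and number (a "loyal" customer flag, a true effect of 2) is an assumption made for the example. The method only works here because the confounder is observed and is the only one. Real business data rarely offers that guarantee.

```python
import random
import statistics


def simulate(n, seed=1):
    """Simulate (loyal, treated, spend) rows; the true treatment effect is 2."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        loyal = rng.random() < 0.5  # observed confounder
        # Loyal customers are far more likely to receive the treatment.
        treated = rng.random() < (0.8 if loyal else 0.2)
        spend = 10 + 6 * loyal + 2 * treated + rng.gauss(0, 1)
        rows.append((loyal, treated, spend))
    return rows


def gap(rows):
    """Treated-minus-control difference in mean spend."""
    treated = [spend for _, is_treated, spend in rows if is_treated]
    control = [spend for _, is_treated, spend in rows if not is_treated]
    return statistics.mean(treated) - statistics.mean(control)


def stratified_gap(rows):
    """Average the within-stratum gaps, weighted by stratum size."""
    total = 0.0
    for level in (False, True):
        stratum = [row for row in rows if row[0] == level]
        total += len(stratum) / len(rows) * gap(stratum)
    return total


rows = simulate(20_000)
naive_estimate = gap(rows)
adjusted_estimate = stratified_gap(rows)

print(f"Naive estimate:    {naive_estimate:.2f}")
print(f"Adjusted estimate: {adjusted_estimate:.2f}")
```

The naive comparison mixes the treatment effect with the loyalty effect, while comparing treated and untreated customers within each loyalty group recovers something close to the simulated effect of 2.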

Future of causal inference in the industry

Results indicate four challenges to adopting causal inference strategies more broadly. First, industry examples of causal inference (including A/B testing) are missing from many business sectors. Second, practitioners lack awareness of applicable, standardized tools, and such tools are often unavailable. Third, the complexity of the topic makes it difficult to build a broader understanding of it. Finally, organizations lack structural support for implementing causal inference methods (a lack of established processes, missing incentive structures, missing training, and the pressure on data scientists to deliver fast results in practical business environments).