Econometrics in Python, Difference-in-differences — Multiple groups and periods (FE-DiD model)

Luis Garcia Fuentes
9 min read · Sep 3, 2021

The data science and analytics community gives well-merited love to ML algorithms that mine patterns in data, but not nearly as much to econometric techniques that aim to identify causal effects. This short blog aims to give some love to one of those econometric methods: the Difference-in-differences estimator.

how salty I am about the above

Now, much has been written about the basic form of the diff-in-diff model, and before we dive into its more complex cousin, you should be somewhat familiar with the basic model.

Diff-Diff estimation captures the effect of an intervention by comparing the performance of the treated group against its counterfactual, where the counterfactual is approximated through the control variables of a regression.

For a deep refresher, I will point you to Rohit’s article ‘Difference in Differences’. A quick refresher from me follows:

In essence, Diff-Diff allows you to observe the effect of an intervention on a treated group while trimming out ‘irrelevant’ effects.

That is, it is a statistical technique employed to see whether an intervention had the desired impact without having to run an expensive randomized controlled trial, relying instead on a quasi-experiment.

A quasi-experiment is easier to run (sometimes you find them occurring naturally in the wild; nature is beautiful, isn't it?) but carries with it a lot of ‘irrelevant’ effects, which diff-diff estimation can trim out for you. Per the almighty Google, a quasi-experiment is an empirical study that estimates the causal impact of an intervention on its target population without random assignment.

Let's say you run a pilot over some stores you own but not all of them: you got yourself a quasi-experiment (congrats!). Now your goal as an analyst (data scientist, data guru, data detective, data engineer, or whatever title your company gave you) is to identify whether the pilot helped the business, that is, to see if stores that went through the pilot program did better than they would otherwise have done.

Of course, you can’t look at the counterfactual event, that is, you can’t know what might have happened in the pilot stores had you not acted the way you did (because of physics). So you will need a plan B.

You could consider the non-pilot stores as a control group against which to compare the pilot results. However, the pilot stores were likely not randomly selected, so the pilot and non-pilot stores are not equal but for their participation in the pilot. The observed difference in performance between pilot and non-pilot stores therefore cannot be purely attributed to the pilot intervention: it encompasses both the effect of the pilot and the inherent differences between the two groups.

In plain English, the two groups were never equal, so it makes no sense to expect that performance differences were caused only by the existence of the pilot.

Why you can’t just compare pilot store results vs non-pilot results to observe the effect of the pilot

So you may think to yourself, ‘Well, that's a shame, but what if I just look at the performance of the pilot locations after we started the pilot and compare this to their performance before we started it? That must be the effect of the pilot!’, and you would be wrong. You cannot compare the post-intervention figures with the pre-intervention figures of the pilot stores, because the figures would likely have been different regardless of whether you had run a pilot or not.

The reason for this is that time tends to change things. The effect of time will be entangled with the effect of the pilot, and it would not be accurate to assume that the observed change is driven purely by the effect of the pilot.

Why you can’t compare the pilot location post-treatment to itself before intervention to measure the pilot effect.

At this point, if this is a refresher on the base Diff-in-Diff model for you, you will know that the model looks as follows:
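The standard two-group, two-period specification (reconstructed here, since the original displayed it as an image) is:

$$Y_{it} = \beta_0 + \beta_1\,\text{Treated}_i + \beta_2\,\text{Post}_t + \lambda\,(\text{Treated}_i \times \text{Post}_t) + \varepsilon_{it}$$

where Treated_i flags the treated group, Post_t flags the post-intervention period, and their interaction flags treated observations after the intervention.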

In the base Diff-Diff model (the equation above), it is through the coefficient λ that we capture the effect of the pilot on our KPI (Y_it), while ‘trimming’ out the effect of the time trend and the inherent differences that exist between the pilot locations and non-pilot locations (the power of regression).

However, the base Diff-Diff model works only if you have two time periods (pre-treatment and post-treatment) and only two groups (a treated one and a non-treated one).

This layout is already inadequate for our example, as we have many stores, some of which are pilot stores and some of which are not. Additionally, what if we also have more than two time periods, such as monthly sales data where 10 months cover the pre-pilot period and an additional 7 months where the pilot was active in some stores?

To handle the more complex scenario described above, we need to tweak the Diff-in-Diff model into its more generalized version: Difference-in-differences with multiple groups and periods.
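Reconstructing the generalized specification (also displayed as an image in the original), with the fixed effects written as dummy-variable sums:

$$Y_{st} = \alpha + \lambda\,\text{Pilot}_{st} + \sum_{s=2}^{S}\gamma_s\,\text{Store}_s + \sum_{t=2}^{T}\delta_t\,\text{Month}_t + \varepsilon_{st}$$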

Here λ continues to capture the effect of the pilot on our KPI of interest, but now we control for the fact that we have many time periods and many stores, some of which are part of the pilot and some of which are not.

We do this by employing fixed effects, both entity and time fixed effects. Entity fixed effects capture inherent, time-invariant differences between stores; these are the inherent characteristics described earlier. Time fixed effects instead capture shocks that were experienced equally by all locations during a given month, such as the seasonality of sales or the overall economic environment of that month. For a deeper dive into how fixed effects work, read the accurately named ‘Fixed Effects Models (very important stuff)’ lecture notes.

The sum notation describes the application of fixed effects through dummy variables, where a dummy for every location and every month (but one of each, to avoid perfect multicollinearity) is included. While each fixed-effect dummy has its own coefficient, we are not interested in these. In fact, many statistical packages will by default exclude these FE coefficients from the summary output.

‘Pilot’ is a dummy that equals 1 when store ‘s’ is a pilot store and time ‘t’ falls within the period when that store’s pilot is running. Note that this specification also allows pilots to start at different times, and even lets us scale the ‘strength’ of the pilot into a continuous index from 0 to 1, which, depending on how rigorously pilot activities were implemented across locations, may help us capture differences in pilot implementation and roll-out windows. A sketch of how such a dummy could be built follows.
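As an illustration, a Pilot dummy with staggered start dates could be built like this. This is a minimal sketch: the pilot_start mapping and the store names are hypothetical, and df is the data frame we load in the code further below.

# Hypothetical launch month per pilot store ('YYYY-MM' strings
# compare correctly in lexicographic order)
pilot_start = {'Store A': '2020-11', 'Store B': '2021-01'}

def make_pilot_dummy(row):
    # 1 only if the store is a pilot store AND the month is on/after its launch
    start = pilot_start.get(row['Storename'])
    return int(start is not None and row['MonthYr'] >= start)

df['Pilot'] = df.apply(make_pilot_dummy, axis=1)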

Of course, maybe you are tired of reading my explanation, so here is a cool prof/YouTuber chewing the hard content for you.

Like and subscribe?

You are back? Perfect! Now let us implement the code in Python to achieve this. First, I will provide the code and then explain each step.

import pandas as pd                                 # data handling
import statsmodels.formula.api as sm                # formula-based OLS
import statsmodels.stats.sandwich_covariance as sw  # clustered (sandwich) covariance helpers
import numpy as np                                  # log transform, arrays
import statsmodels                                  # base package

We first import the needed libraries, nothing special here.

After trying the ‘linearmodels’ and ‘statsmodels’ libraries, I have decided that statsmodels is the best library for this kind of work, primarily because of its built-in functionality for standard errors clustered across both time and groups (we will discuss this later) and the ease of coding them. For a deeper guide on other available Python libraries, you can visit Vincent Gregorie's blog.

[1] df = pd.read_excel('important_analysis.xlsx')
[2] df['Mcode'] = pd.factorize(df['MonthYr'], sort=True)[0] + 1
[3] df['StoreCode'] = pd.factorize(df['Storename'], sort=True)[0] + 1
[4] df['RevenueLN'] = np.log(df['Revenue'])

We now do some data cleaning. This is needed to get the data into the format that statsmodels expects (the fixed-effect identifiers need to be supplied as integer-encoded series, not as raw strings).

Note that the data should originally be loaded into a data frame that includes the columns Pilot (the dummy variable), MonthYr, Storename, and Revenue (the KPI).

We thus:

[1] Load the data into a data frame;

[2] We then encode the month column to transform it from a string into an integer: each month that previously appeared in the format ‘YYYY-MM’ is substituted with an integer, where the first month in the available data is indexed as 1. This makes ‘2020-01’, the first period in our data, appear as a 1;

[3] We then also encode each store name (again a string) into an integer (here the sorting is alphabetical, but that is not really needed). Now each store has its own unique number;

[4] And finally, we apply a natural-log transformation to our KPI or Y-variable, in this case monthly sales per location, to obtain a log-level specification. This allows us to interpret the coefficient λ as ‘being part of the pilot results in an approximately 100·λ% increase in monthly sales figures’ (see the quick check below).
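A quick check of that interpretation; the coefficient value here is made up purely for illustration:

import numpy as np

lam = 0.05                            # hypothetical estimated pilot coefficient
approx_pct = 100 * lam                # the usual approximation: ~5%
exact_pct = 100 * (np.exp(lam) - 1)   # exact implied effect in a log-level model
print(f"approx: {approx_pct:.2f}%, exact: {exact_pct:.2f}%")
# approx: 5.00%, exact: 5.13%

The approximation 100·λ% is close for small coefficients; for large ones, use the exact 100·(e^λ − 1)%.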

https://sites.google.com/site/curtiskephart/ta/econ113/interpreting-beta

Once we have taken care of those steps, we can run our regression!

that feeling you get (¬u¬)

It is important to note that we will need to apply clustered standard errors because of… maths… well, here is the part I half-remember from undergrad: OLS requires certain assumptions about your data to hold, one of which is that observations are sampled independently. Panel data breaks this assumption by construction: observations are not fully independent (the sales performance of a month tends to be somewhat related to the sales performance of the following month in the same store, for example). Our standard errors therefore need to be calculated using the clustered version of the SE formula, clustered across both sampling units of our fixed effects (time and entities).

And this is why I really like the statsmodels library, as we can easily handle that technical mess of clustered SEs with just some extra lines of code.

The code below executes the regression and shows a summary (in Jupyter) of the results.
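The original presented this step as a screenshot, so here is a minimal sketch of what it likely looked like, assuming the treatment dummy column is named ‘Pilot’. In statsmodels, two-way clustering is requested by passing a two-column array of integer group labels with the ‘cluster’ covariance type:

# Fixed-effects DiD: store and month dummies enter via C(),
# and the pilot effect is the coefficient on 'Pilot'
formula = 'RevenueLN ~ Pilot + C(StoreCode) + C(Mcode)'

# Two-way clustered SEs: an (n x 2) array of integer cluster labels,
# one column per clustering dimension (entity and time)
two_way_groups = np.array(df[['StoreCode', 'Mcode']])

result = sm.ols(formula, data=df).fit(
    cov_type='cluster',
    cov_kwds={'groups': two_way_groups},
)

result.summary()  # in Jupyter, the last expression renders the summary table

Note that the summary will also list every fixed-effect dummy coefficient; if you only care about the pilot, result.params['Pilot'] and result.pvalues['Pilot'] get you straight there.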

And hurray! You did it! Sit back, and hope you got a coefficient in the desired direction and a p-value below .05 🙄

Best of luck!

Update in 2023:

Since publishing this article, it has been read over 5K times. Thank you!

If you are a student, know that the contents of this article will be valid for your thesis. If you are a real-life practitioner and you have a multi-period DiD exercise in front of you, please read ‘Difference-in-Differences with multiple time periods’ by Brantly Callaway and Pedro H. C. Sant'Anna (2021).

In this paper, the authors propose a new method for estimating DiD effects in settings where there are multiple time periods before and after treatment, and where treatments are administered to different groups at different start times.

As you can imagine, this specification goes beyond any undergraduate course, and given that I graduated in 2019, it also postdates my formal training. However, do consider investigating it if you employ the contents of this article for real-life research!
