Causal Inference and Machine Learning

This post presents the repository for the Applied Statistics course I took as an undergraduate at PUCP. The course was designed to provide the tools needed by anyone starting out in Machine Learning.

Below I briefly summarize what you will find in each folder of the repository.

1. Introduction to Causal Inference

Script R - Script Python: We analyze whether there is a difference in pay between men and women (the gender pay gap). The gap may partly reflect discrimination against women in the labor market, and it may partly reflect a selection effect, i.e. women being relatively more likely to take occupations that pay somewhat less (e.g. school teaching). We focus on the subset of college-educated workers. We find that the difference in average log wage between men and women is $0.0815$, so the unconditional gender wage gap is about $8.15$% for the group of never-married workers (women are paid less on average in our sample). We also observe that never-married working women are relatively more educated than working men and have more work experience. Finally, the unconditional wage gap of about $8$% shrinks to about $5$% after controlling for worker characteristics.
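The comparison between an unconditional and a conditional gap can be sketched with a small simulation. This is only an illustration under stated assumptions: the data are simulated rather than taken from the course sample, and the variable names (`female`, `teaching`, `log_wage`) and all coefficient values are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients of y on X (X must already contain an intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical toy data, not the sample used in the course scripts.
female = rng.integers(0, 2, size=n).astype(float)
# Selection effect: women are more likely to work in a lower-paying occupation.
teaching = (rng.random(n) < 0.2 + 0.3 * female).astype(float)
log_wage = 3.0 - 0.05 * female - 0.10 * teaching + rng.normal(0, 0.3, size=n)

ones = np.ones(n)
# Unconditional gap: regress log wage on the gender dummy only.
gap_unconditional = ols(np.column_stack([ones, female]), log_wage)[1]
# Conditional gap: add the occupation control to the regression.
gap_conditional = ols(np.column_stack([ones, female, teaching]), log_wage)[1]

print(f"unconditional gap: {gap_unconditional:.4f}")
print(f"conditional gap:   {gap_conditional:.4f}")
```

In this toy setup part of the raw gap comes from occupation, so the coefficient on `female` moves toward zero once the control is included.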

Also, at the end of the Python script you will find a proof and a brief explanation of the Frisch-Waugh-Lovell theorem.
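As a numerical companion to the theorem, the following sketch checks that the coefficient on a regressor `d` in a full regression equals the coefficient from regressing residualized `y` on residualized `d`. The simulated data and the helper name `residualize` are assumptions made for illustration and are not taken from the course script.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Controls W (including an intercept), regressor of interest d, outcome y.
W = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
d = W @ np.array([0.5, 1.0, -1.0, 0.3]) + rng.normal(size=n)
y = 2.0 * d + W @ np.array([1.0, 0.5, 0.2, -0.7]) + rng.normal(size=n)

def residualize(v, W):
    """Residuals from the OLS regression of v on W."""
    coef, *_ = np.linalg.lstsq(W, v, rcond=None)
    return v - W @ coef

# Full regression of y on d and W: the first coefficient belongs to d.
beta_full = np.linalg.lstsq(np.column_stack([d, W]), y, rcond=None)[0][0]

# Frisch-Waugh-Lovell: partial W out of both y and d, then regress residual on residual.
d_res = residualize(d, W)
y_res = residualize(y, W)
beta_fwl = (d_res @ y_res) / (d_res @ d_res)

print(beta_full, beta_fwl)
```

The two numbers agree up to floating-point error, which is exactly what the theorem states.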

2. Partialling-Out using Lasso

Script R - Script Python: The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. We select white non-Hispanic individuals, aged 25 to 64 years, who work more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with hourly wage. First, we explain, as if to a fellow student, how sample splitting is used to evaluate the performance of prediction rules, and we show how to apply it to the wage data. After a thorough evaluation, we find that the flexible regression predicts better than the other specifications. Overall, the wage gap is about $8$% after controlling for worker characteristics.
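The sample-splitting idea can be sketched as follows: fit each prediction rule on one half of the data and score it on the other half. This sketch uses simulated data with a nonlinear relationship and compares a linear fit with a quadratic one. The data, the helper names `design` and `out_of_sample_r2`, and the choice of polynomial degree are assumptions, not the specifications used in the course scripts.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# Hypothetical data with a nonlinear relationship between x and y.
x = rng.uniform(0, 4, size=n)
y = 1.0 + 1.5 * x - 0.3 * x**2 + rng.normal(0, 0.3, size=n)

# Random split: first half for training, second half for testing.
idx = rng.permutation(n)
train, test = idx[: n // 2], idx[n // 2 :]

def design(x, degree):
    """Polynomial design matrix with columns 1, x, ..., x**degree."""
    return np.column_stack([x**k for k in range(degree + 1)])

def out_of_sample_r2(degree):
    """Fit on the training half, then report R^2 on the held-out half."""
    beta, *_ = np.linalg.lstsq(design(x[train], degree), y[train], rcond=None)
    resid = y[test] - design(x[test], degree) @ beta
    return 1.0 - np.mean(resid**2) / np.var(y[test])

r2_basic = out_of_sample_r2(degree=1)     # basic (linear) rule
r2_flexible = out_of_sample_r2(degree=2)  # more flexible rule

print(f"basic R^2:    {r2_basic:.3f}")
print(f"flexible R^2: {r2_flexible:.3f}")
```

Because both rules are scored on data they never saw, a more flexible model only wins if its extra terms capture real structure rather than noise.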

3. RCT data with Precision Adjustment

Script R - Script Python: First, we describe multicollinearity. In this lab, we then analyze the Pennsylvania re-employment bonus experiment, which was previously studied in “Sequential testing of duration data: the case of the Pennsylvania ‘reemployment bonus’ experiment” (Bilias, 2000), among others. These experiments were conducted in the 1980s by the U.S. Department of Labor to test the incentive effects of alternative compensation schemes for unemployment insurance (UI). In these experiments, UI claimants were randomly assigned either to a control group or to one of six treatment groups. Here we focus on treatment group 4, but feel free to explore the other treatment groups. In the control group, the current UI rules applied. Individuals in the treatment groups were offered a cash bonus if they found a job within a pre-specified period of time (the qualification period), provided that the job was retained for a specified duration. The treatments differed in the level of the bonus, the length of the qualification period, and whether the bonus declined over time within the qualification period.
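To illustrate what precision adjustment buys in a randomized experiment, the sketch below compares a simple difference in means with a regression that adds a centered covariate and its interaction with treatment. It uses repeated simulated experiments, not the Pennsylvania data. The function names, the single covariate `x`, and the effect size are hypothetical.

```python
import numpy as np

def simulate(rng, n=500, effect=-0.1):
    """One hypothetical RCT: random treatment t, pre-treatment covariate x, outcome y."""
    x = rng.normal(size=n)
    t = rng.integers(0, 2, size=n).astype(float)
    y = 2.0 + effect * t + 0.8 * x + rng.normal(0, 0.5, size=n)
    return y, t, x

def diff_in_means(y, t):
    """Classical two-sample estimate of the average treatment effect."""
    return y[t == 1].mean() - y[t == 0].mean()

def interactive_adjustment(y, t, x):
    """Regress y on t, centered x, and t * centered x; return the coefficient on t."""
    xc = x - x.mean()
    X = np.column_stack([np.ones_like(y), t, xc, t * xc])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

rng = np.random.default_rng(3)
draws = []
for _ in range(300):
    y, t, x = simulate(rng)
    draws.append((diff_in_means(y, t), interactive_adjustment(y, t, x)))
draws = np.array(draws)

mean_simple, mean_adjusted = draws.mean(axis=0)
sd_simple, sd_adjusted = draws.std(axis=0)
print(f"difference in means: mean {mean_simple:.3f}, sd {sd_simple:.3f}")
print(f"adjusted regression: mean {mean_adjusted:.3f}, sd {sd_adjusted:.3f}")
```

Both estimators are centered on the simulated effect because treatment is randomized. The adjusted one varies less across experiments because the covariate absorbs part of the outcome noise.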

4. Orthogonal Learning and Double Lasso

Script R - Script Python: First, we evaluate the reliability of the Naive Approach and the Orthogonal Approach for inference. We find that the Naive Approach, because it does not satisfy Neyman orthogonality, cannot be used for inference, and its beta estimate is even biased. Then, we explain the Double Lasso Approach. Finally, we apply partialling-out with Lasso to estimate a regression coefficient: specifically, we are interested in how the rates at which the economies of different countries grow are related to each country's initial wealth level, controlling for the country's institutional, educational, and other similar characteristics.
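The partialling-out step of the Double Lasso can be sketched as follows: run a Lasso of `y` on the controls and a Lasso of `d` on the controls, then regress the first set of residuals on the second. To keep the example dependency-free, it uses a small hand-written coordinate-descent Lasso (`lasso_cd`) instead of a library implementation. The simulated data, the penalty level `lam`, and all names are assumptions rather than the course's growth data or tuning choices.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate-descent Lasso for (1/2n)||y - Xb||^2 + lam * ||b||_1 (no intercept)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X**2).sum(axis=0) / n
    r = y.astype(float)                    # current residual y - X @ b (a copy of y)
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]            # remove coordinate j from the fit
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]  # soft threshold
            r -= X[:, j] * b[j]            # put the updated coordinate back
    return b

rng = np.random.default_rng(4)
n, p, alpha = 400, 50, 0.5                 # alpha is the hypothetical true effect of d

# Only the first two of the 50 controls matter for d and y.
X = rng.normal(size=(n, p))
d = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
y = alpha * d + 2.0 * X[:, 0] + X[:, 1] + rng.normal(size=n)

# Center everything so that no intercept is needed.
X = X - X.mean(axis=0)
d = d - d.mean()
y = y - y.mean()

lam = 0.1                                  # hypothetical penalty level, not tuned
y_res = y - X @ lasso_cd(X, y, lam)        # partial the controls out of y
d_res = d - X @ lasso_cd(X, d, lam)        # partial the controls out of d
alpha_hat = (d_res @ y_res) / (d_res @ d_res)

print(f"partialling-out estimate: {alpha_hat:.3f} (simulated truth: {alpha})")
```

Shrinkage errors from the two Lasso fits enter the final estimate only through their product, which is the practical meaning of Neyman orthogonality in this setting.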

5. Bootstrapping and Causal Tree

Script R - Script Python: We go step by step through the construction of a bootstrap procedure. We find some changes in the coefficients of interest when we estimate them with bootstrapping rather than in the traditional way. The differences are not large, since the number of replicates is fairly large (1000).
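A minimal sketch of the procedure, assuming a nonparametric pairs bootstrap of an OLS slope with 1000 replicates: resample rows with replacement, re-estimate the slope each time, and take the standard deviation of the replicates as the bootstrap standard error. The data are simulated and the variable names are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients of y on X (X must already contain an intercept column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(5)
n = 300

# Hypothetical data: a single regressor plus an intercept.
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

# Traditional estimate and its analytic (homoskedastic) standard error.
beta_hat = ols(X, y)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - X.shape[1])
se_analytic = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])

# Pairs bootstrap: resample (x_i, y_i) rows with replacement and re-estimate.
n_replicates = 1000
boot_slopes = np.empty(n_replicates)
for b in range(n_replicates):
    idx = rng.integers(0, n, size=n)
    boot_slopes[b] = ols(X[idx], y[idx])[1]
se_boot = boot_slopes.std(ddof=1)

print(f"slope: {beta_hat[1]:.3f}")
print(f"analytic SE:  {se_analytic:.4f}")
print(f"bootstrap SE: {se_boot:.4f}")
```

With well-behaved errors and this many replicates, the two standard errors should be close, and the bootstrap replicates can also be used for percentile confidence intervals.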

6. Causal Forest and Debiased Machine Learning

Script Python: We replicate the results in the paper by Athey and Wager (2018). We describe in detail how the tree was built, estimate the ATE, run the best linear predictor analysis, look at school-wise heterogeneity, and repeat the analysis ignoring clusters to see how the results change. Finally, we use the DML algorithm. We run the following regressions: OLS without the country characteristics; OLS including the country characteristics; DML using Lasso to predict y and d; DML using Post-Lasso to predict y and d; and DML using Random Forest to predict y and d. We then choose the best combination of models.
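The cross-fitting logic behind DML for a partially linear model can be sketched as follows: predict y and d from the controls on held-out folds, then regress the y residuals on the d residuals. To keep the example short and dependency-free, a closed-form ridge regression stands in for the Lasso, Post-Lasso, and Random Forest learners. The function names (`ridge_fit_predict`, `dml_plr`), the simulated data, and the penalty value are assumptions.

```python
import numpy as np

def ridge_fit_predict(X_train, y_train, X_test, penalty=1.0):
    """Closed-form ridge with an unpenalized intercept (stand-in nuisance learner)."""
    x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
    Xc = X_train - x_mean
    gram = Xc.T @ Xc + penalty * np.eye(X_train.shape[1])
    coef = np.linalg.solve(gram, Xc.T @ (y_train - y_mean))
    return y_mean + (X_test - x_mean) @ coef

def dml_plr(y, d, X, n_folds=5, seed=0):
    """Cross-fitted DML for y = theta * d + g(X) + e; returns (theta, standard error)."""
    n = len(y)
    folds = np.random.default_rng(seed).permutation(n) % n_folds
    y_res, d_res = np.empty(n), np.empty(n)
    for k in range(n_folds):
        test = folds == k
        train = ~test
        # Nuisance predictions for fold k come from models fit on the other folds.
        y_res[test] = y[test] - ridge_fit_predict(X[train], y[train], X[test])
        d_res[test] = d[test] - ridge_fit_predict(X[train], d[train], X[test])
    theta = (d_res @ y_res) / (d_res @ d_res)
    eps = y_res - theta * d_res
    se = np.sqrt(np.mean(d_res**2 * eps**2) / np.mean(d_res**2) ** 2 / n)
    return theta, se

rng = np.random.default_rng(6)
n, p, theta_true = 1000, 10, 0.5           # hypothetical simulation settings
X = rng.normal(size=(n, p))
d = X @ np.linspace(1.0, 0.1, p) + rng.normal(size=n)
y = theta_true * d + X @ np.linspace(0.5, -0.5, p) + rng.normal(size=n)

theta_hat, se = dml_plr(y, d, X)
print(f"DML estimate: {theta_hat:.3f} (SE {se:.3f}), simulated truth: {theta_true}")
```

Swapping the learner only requires replacing `ridge_fit_predict` with another function of the same shape, which is how different model combinations can be compared.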

Important Note: I created this repository myself. However, the replicated code in both R and Python was developed as a group with Sandra Martinez and Gianfranco Soria.

Andrea Clavo