Part II: Causal Inference with Models

教材的第二部分快速解读.

书的网址: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

Contents

11. Why model?
12. IP wighting and marginal structural model
13. Standardization and the parametric g-formula
14. G-estimation of structural nested models
15. Outcome regression and propensity scores
16. Instrumental variable estimation
17. Causal survival analysis
18. Variable selection for causal inference

Big Picturce:

非参数模型 –> parametric models –> structural mean models

  • Models for the marginal mean of a counterfactual outcome are referred to as marginal structural mean models.

  • Effect modification and marginal structural models 在上面的模型增加新的变量 for effect modification

  • Structural nested mean model 会有一个对比某个基准 potential outcome.

In Part II of this book we have described two different types of models for causal inference:

  • propensity models and

  • structural models.

Let us now compare them.

  • Propensity models are models for the probability of treatment \(A\) given the variables \(L\) used to try to achieve conditional exchangeability.

  • Structural models describe the relation between the treatment \(A\) and some component of the distribution (e.g., the mean) of the counterfactual outcome

一个简单例子

[7]:
ls causal_inference_python_code/
chapter11.ipynb  chapter13.ipynb  chapter15.ipynb  codebook.xls
chapter12.ipynb  chapter14.ipynb  chapter16.ipynb  README.md
[2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

数据集合介绍

[3]:
dat = pd.read_excel('NHEFS.xls')
dat.describe()
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
[3]:
seqn qsmk death yrdth modth dadth sbp dbp sex age ... birthcontrol pregnancies cholesterol hightax82 price71 price82 tax71 tax82 price71_82 tax71_82
count 1629.000000 1629.000000 1629.000000 318.000000 322.000000 322.000000 1552.000000 1548.000000 1629.000000 1629.000000 ... 1629.000000 726.00000 1613.000000 1537.000000 1537.000000 1537.000000 1537.000000 1537.000000 1537.000000 1537.000000
mean 16552.364641 0.262738 0.195212 87.569182 6.257764 15.872671 128.709407 77.744832 0.509515 43.915285 ... 1.084715 3.69146 219.973962 0.165908 2.138750 1.806095 1.058581 0.505983 0.332741 0.552614
std 7498.918195 0.440256 0.396485 2.659415 3.615304 8.905488 19.051560 10.634864 0.500063 12.170430 ... 0.947747 2.20560 45.444202 0.372119 0.229016 0.130641 0.216229 0.111894 0.155045 0.150321
min 233.000000 0.000000 0.000000 83.000000 1.000000 1.000000 87.000000 47.000000 0.000000 25.000000 ... 0.000000 1.00000 78.000000 0.000000 1.506592 1.451904 0.524902 0.219971 -0.202698 0.035995
25% 10607.000000 0.000000 0.000000 85.000000 3.000000 8.000000 116.000000 70.000000 0.000000 33.000000 ... 0.000000 2.00000 189.000000 0.000000 2.036621 1.739990 0.944946 0.439941 0.200989 0.460999
50% 20333.000000 0.000000 0.000000 88.000000 6.000000 15.500000 126.000000 77.000000 1.000000 44.000000 ... 1.000000 3.00000 216.000000 0.000000 2.167969 1.814941 1.049805 0.505981 0.335999 0.543945
75% 22719.000000 1.000000 0.000000 90.000000 10.000000 24.000000 140.000000 85.000000 1.000000 53.000000 ... 2.000000 5.00000 245.000000 0.000000 2.241699 1.867676 1.154785 0.571899 0.443787 0.621948
max 25061.000000 1.000000 1.000000 92.000000 12.000000 31.000000 229.000000 130.000000 1.000000 74.000000 ... 2.000000 15.00000 416.000000 1.000000 2.692871 2.103027 1.522461 0.747925 0.612061 0.884399

8 rows × 64 columns

[24]:
codebook = pd.read_excel('causal_inference_python_code/codebook.xls')
codebook.Description = codebook.Description.apply(lambda s: s.lower())
codebook.iloc[:, 1].tolist()
[24]:
['in your usual day, how active are you? in 1971, 0:very active, 1:moderately active, 2:inactive',
 'age in 1971',
 'how often do you drink?  in 1971  0: almost every day, 1: 2-3 times/week, 2: 1-4 times/month, 3: < 12 times/year, 4: no alcohol last year, 5: unknown',
 'when you drink, how much do you drink?  in 1971',
 'have you had 1 drink past year? in 1971,  1:ever, 0:never; 2:missing',
 'which do you most frequently drink? in 1971  1: beer, 2: wine, 3: liquor, 4: other/unknown',
 'use allergies medication in 1971,  1:ever, 0:never',
 'dx asthma in 1971, 1:ever, 0:never',
 'birth control pills past 6 months? in 1971 1:yes, 0:no, 2:missing',
 'check state code - second page',
 'use bowel trouble medication in 1971,  1:ever, 0:never, ; 2:missing',
 'dx chronic bronchitis/emphysema in 1971,  1:ever, 0:never',
 'serum cholesterol (mg/100ml) in 1971',
 'dx chronic cough in 1971,  1:ever, 0:never',
 'dx colitis in 1971,  1:ever, 0:never',
 'day of death',
 'diastolic blood pressure in 1982',
 'death by 1992, 1:yes, 0:no',
 'dx diabetes in 1971,  1:ever, 0:never, 2:missing',
 'amount of education by 1971: 1: 8th grade or less, 2: hs dropout, 3: hs, 4:college dropout, 5: college or more',
 'in recreation, how much exercise? in 1971, 0:much exercise,1:moderate exercise,2:little or no exercise',
 'dx hay fever in 1971, 1:ever, 0:never',
 'dx high blood pressure in 1971, 1:ever, 0:never, 2:missing',
 'use high blood pressure medication in 1971,  1:ever, 0:never, ; 2:missing',
 'use headache medication in 1971,  1:ever, 0:never',
 'dx hepatitis in 1971,  1:ever, 0:never',
 'dx heart failure in 1971, 1:ever, 0:never',
 'living in a highly taxed state in 1982, high taxed state of residence=1, 0 otherwise',
 'height in centimeters in 1971',
 'total family income in 1971  11:<$1000, 12: 1000-1999, 13: 2000-2999, 14: 3000-3999, 15: 4000-4999, 16: 5000-5999, 17: 6000-6999, 18: 7000-9999, 19: 10000-14999, 20: 15000-19999, 21: 20000-24999, 22: 25000+',
 'use infection medication in 1971,  1:ever, 0:never',
 'uselack of pep medication in 1971,  1:ever, 0:never',
 'marital status in 1971  1: under 17, 2: married, 3: widowed, 4: never married, 5: divorced, 6: separated, 8: unknown',
 'month of death',
 'use nerves medication in 1971,  1:ever, 0:never',
 'dx nervous breakdown in 1971,  1:ever, 0:never',
 'use other pains medication in 1971,  1:ever, 0:never',
 'dx peptic ulcer in 1971,  1:ever, 0:never',
 'do you eat dirt or clay, starch or other non standard food?  in 1971  1:ever, 0:never; 2:missing',
 'dx polio in 1971, 1:ever, 0:never',
 'total number of pregnancies? in 1971',
 'avg tobacco price in state of residence 1971 (us$2008)',
 'difference in avg tobacco price in state of residence 1971-1982 (us$2008)',
 'avg tobacco price in state of residence 1982 (us$2008)',
 'quit smoking between 1st questionnaire and 1982, 1:yes, 0:no',
 '0: white 1: black or other in 1971',
 'systolic blood pressure in 1982',
 'highest grade of regular school ever  in 1971',
 'unique personal identifier',
 '0: male 1: female',
 'number of cigarettes smoked per day in 1971',
 'increase in number of cigarettes/day between 1971 and 1982',
 'years of smoking',
 'tobacco tax in state of residence 1971 (us$2008)',
 'difference in tobacco tax in state of residence 1971-1982 (us$2008)',
 'tobacco tax in state of residence 1971 (us$2008)',
 'dx tuberculosis in 1971,  1:ever, 0:never',
 'dx malignant tumor/growth in 1971,  1:ever, 0:never',
 'use weak heart medication in 1971,  1:ever, 0:never',
 'weight in kilograms  in 1971',
 'weight in kilograms  in 1982',
 'weight change in kilograms',
 'use weight loss medication in 1971,  1:ever, 0:never',
 'year of death']
[31]:
codebook
[31]:
Variable name Description
0 active in your usual day, how active are you? in 1971...
1 age age in 1971
2 alcoholfreq how often do you drink? in 1971 0: almost ev...
3 alcoholhowmuch when you drink, how much do you drink? in 1971
4 alcoholpy have you had 1 drink past year? in 1971, 1:ev...
5 alcoholtype which do you most frequently drink? in 1971 1...
6 allergies use allergies medication in 1971, 1:ever, 0:n...
7 asthma dx asthma in 1971, 1:ever, 0:never
8 bithcontrol birth control pills past 6 months? in 1971 1:y...
9 birthplace check state code - second page
10 boweltrouble use bowel trouble medication in 1971, 1:ever,...
11 bronch dx chronic bronchitis/emphysema in 1971, 1:ev...
12 cholesterol serum cholesterol (mg/100ml) in 1971
13 chroniccough dx chronic cough in 1971, 1:ever, 0:never
14 colitis dx colitis in 1971, 1:ever, 0:never
15 dadth day of death
16 dbp diastolic blood pressure in 1982
17 death death by 1992, 1:yes, 0:no
18 diabetes dx diabetes in 1971, 1:ever, 0:never, 2:missing
19 education amount of education by 1971: 1: 8th grade or l...
20 exercise in recreation, how much exercise? in 1971, 0:m...
21 hayfever dx hay fever in 1971, 1:ever, 0:never
22 hbp dx high blood pressure in 1971, 1:ever, 0:neve...
23 hbpmed use high blood pressure medication in 1971, 1...
24 headache use headache medication in 1971, 1:ever, 0:never
25 hepatitis dx hepatitis in 1971, 1:ever, 0:never
26 hf dx heart failure in 1971, 1:ever, 0:never
27 hightax82 living in a highly taxed state in 1982, high t...
28 ht height in centimeters in 1971
29 income total family income in 1971 11:<$1000, 12: 10...
... ... ...
34 nerves use nerves medication in 1971, 1:ever, 0:never
35 nervousbreak dx nervous breakdown in 1971, 1:ever, 0:never
36 otherpain use other pains medication in 1971, 1:ever, 0...
37 pepticulcer dx peptic ulcer in 1971, 1:ever, 0:never
38 pica do you eat dirt or clay, starch or other non s...
39 polio dx polio in 1971, 1:ever, 0:never
40 pregnancies total number of pregnancies? in 1971
41 price71 avg tobacco price in state of residence 1971 (...
42 price71_82 difference in avg tobacco price in state of re...
43 price82 avg tobacco price in state of residence 1982 (...
44 qsmk quit smoking between 1st questionnaire and 198...
45 race 0: white 1: black or other in 1971
46 sbp systolic blood pressure in 1982
47 school highest grade of regular school ever in 1971
48 seqn unique personal identifier
49 sex 0: male 1: female
50 smokeintensity number of cigarettes smoked per day in 1971
51 smkintensity 82_71 increase in number of cigarettes/day between 1...
52 smokeyrs years of smoking
53 tax71 tobacco tax in state of residence 1971 (us$2008)
54 tax71_82 difference in tobacco tax in state of residenc...
55 tax82 tobacco tax in state of residence 1971 (us$2008)
56 tb dx tuberculosis in 1971, 1:ever, 0:never
57 tumor dx malignant tumor/growth in 1971, 1:ever, 0:...
58 weakheart use weak heart medication in 1971, 1:ever, 0:...
59 wt71 weight in kilograms in 1971
60 wt82 weight in kilograms in 1982
61 wt82_71 weight change in kilograms
62 wtloss use weight loss medication in 1971, 1:ever, 0...
63 yrdth year of death

64 rows × 2 columns

EDA

[25]:
dat.columns
[25]:
Index(['seqn', 'qsmk', 'death', 'yrdth', 'modth', 'dadth', 'sbp', 'dbp', 'sex',
       'age', 'race', 'income', 'marital', 'school', 'education', 'ht', 'wt71',
       'wt82', 'wt82_71', 'birthplace', 'smokeintensity', 'smkintensity82_71',
       'smokeyrs', 'asthma', 'bronch', 'tb', 'hf', 'hbp', 'pepticulcer',
       'colitis', 'hepatitis', 'chroniccough', 'hayfever', 'diabetes', 'polio',
       'tumor', 'nervousbreak', 'alcoholpy', 'alcoholfreq', 'alcoholtype',
       'alcoholhowmuch', 'pica', 'headache', 'otherpain', 'weakheart',
       'allergies', 'nerves', 'lackpep', 'hbpmed', 'boweltrouble', 'wtloss',
       'infection', 'active', 'exercise', 'birthcontrol', 'pregnancies',
       'cholesterol', 'hightax82', 'price71', 'price82', 'tax71', 'tax82',
       'price71_82', 'tax71_82'],
      dtype='object')
[28]:
d = dat[['sex', 'wt82']]
d.head(10)
[28]:
sex wt82
0 0 68.946040
1 0 61.234970
2 1 66.224486
3 0 64.410117
4 0 92.079251
5 1 103.419060
6 1 58.967008
7 1 58.967008
8 0 62.142155
9 0 72.121187
[38]:
d.boxplot(column='wt82', by='sex')
[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8650d6ec50>
_images/02-Part(II)有模型因果推断_12_1.png
[39]:
d.groupby('sex').describe()
[39]:
wt82
count mean std min 25% 50% 75% max
sex
0 762.0 80.131080 14.567845 38.101759 70.306817 78.925072 87.543327 136.531303
1 804.0 67.155366 15.022770 35.380205 57.039241 63.956524 73.481964 136.531303
[45]:
d_test = pd.DataFrame(data=[1,2,3,10,4],columns=['x'])
d_test
[45]:
x
0 1
1 2
2 3
3 10
4 4
[46]:
d_test.boxplot(column='x')
[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8651b33978>
_images/02-Part(II)有模型因果推断_15_1.png
[ ]: