Part II: Causal Inference with Models¶
教材的第二部分快速解读.
书的网址: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
Contents
11. Why model?
12. IP wighting and marginal structural model
13. Standardization and the parametric g-formula
14. G-estimation of structural nested models
15. Outcome regression and propensity scores
16. Instrumental variable estimation
17. Causal survival analysis
18. Variable selection for causal inference
Big Picturce:
非参数模型 –> parametric models –> structural mean models
Models for the marginal mean of a counterfactual outcome are referred to as marginal structural mean models.
Effect modification and marginal structural models 在上面的模型增加新的变量 for effect modification
Structural nested mean model 会有一个对比某个基准 potential outcome.
In Part II of this book we have described two different types of models for causal inference:
propensity models and
structural models.
Let us now compare them.
Propensity models are models for the probability of treatment \(A\) given the variables \(L\) used to try to achieve conditional exchangeability.
Structural models describe the relation between the treatment \(A\) and some component of the distribution (e.g., the mean) of the counterfactual outcome
一个简单例子¶
[7]:
ls causal_inference_python_code/
chapter11.ipynb chapter13.ipynb chapter15.ipynb codebook.xls
chapter12.ipynb chapter14.ipynb chapter16.ipynb README.md
[2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
数据集合介绍¶
[3]:
dat = pd.read_excel('NHEFS.xls')
dat.describe()
WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero
[3]:
seqn | qsmk | death | yrdth | modth | dadth | sbp | dbp | sex | age | ... | birthcontrol | pregnancies | cholesterol | hightax82 | price71 | price82 | tax71 | tax82 | price71_82 | tax71_82 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 1629.000000 | 1629.000000 | 1629.000000 | 318.000000 | 322.000000 | 322.000000 | 1552.000000 | 1548.000000 | 1629.000000 | 1629.000000 | ... | 1629.000000 | 726.00000 | 1613.000000 | 1537.000000 | 1537.000000 | 1537.000000 | 1537.000000 | 1537.000000 | 1537.000000 | 1537.000000 |
mean | 16552.364641 | 0.262738 | 0.195212 | 87.569182 | 6.257764 | 15.872671 | 128.709407 | 77.744832 | 0.509515 | 43.915285 | ... | 1.084715 | 3.69146 | 219.973962 | 0.165908 | 2.138750 | 1.806095 | 1.058581 | 0.505983 | 0.332741 | 0.552614 |
std | 7498.918195 | 0.440256 | 0.396485 | 2.659415 | 3.615304 | 8.905488 | 19.051560 | 10.634864 | 0.500063 | 12.170430 | ... | 0.947747 | 2.20560 | 45.444202 | 0.372119 | 0.229016 | 0.130641 | 0.216229 | 0.111894 | 0.155045 | 0.150321 |
min | 233.000000 | 0.000000 | 0.000000 | 83.000000 | 1.000000 | 1.000000 | 87.000000 | 47.000000 | 0.000000 | 25.000000 | ... | 0.000000 | 1.00000 | 78.000000 | 0.000000 | 1.506592 | 1.451904 | 0.524902 | 0.219971 | -0.202698 | 0.035995 |
25% | 10607.000000 | 0.000000 | 0.000000 | 85.000000 | 3.000000 | 8.000000 | 116.000000 | 70.000000 | 0.000000 | 33.000000 | ... | 0.000000 | 2.00000 | 189.000000 | 0.000000 | 2.036621 | 1.739990 | 0.944946 | 0.439941 | 0.200989 | 0.460999 |
50% | 20333.000000 | 0.000000 | 0.000000 | 88.000000 | 6.000000 | 15.500000 | 126.000000 | 77.000000 | 1.000000 | 44.000000 | ... | 1.000000 | 3.00000 | 216.000000 | 0.000000 | 2.167969 | 1.814941 | 1.049805 | 0.505981 | 0.335999 | 0.543945 |
75% | 22719.000000 | 1.000000 | 0.000000 | 90.000000 | 10.000000 | 24.000000 | 140.000000 | 85.000000 | 1.000000 | 53.000000 | ... | 2.000000 | 5.00000 | 245.000000 | 0.000000 | 2.241699 | 1.867676 | 1.154785 | 0.571899 | 0.443787 | 0.621948 |
max | 25061.000000 | 1.000000 | 1.000000 | 92.000000 | 12.000000 | 31.000000 | 229.000000 | 130.000000 | 1.000000 | 74.000000 | ... | 2.000000 | 15.00000 | 416.000000 | 1.000000 | 2.692871 | 2.103027 | 1.522461 | 0.747925 | 0.612061 | 0.884399 |
8 rows × 64 columns
[24]:
codebook = pd.read_excel('causal_inference_python_code/codebook.xls')
codebook.Description = codebook.Description.apply(lambda s: s.lower())
codebook.iloc[:, 1].tolist()
[24]:
['in your usual day, how active are you? in 1971, 0:very active, 1:moderately active, 2:inactive',
'age in 1971',
'how often do you drink? in 1971 0: almost every day, 1: 2-3 times/week, 2: 1-4 times/month, 3: < 12 times/year, 4: no alcohol last year, 5: unknown',
'when you drink, how much do you drink? in 1971',
'have you had 1 drink past year? in 1971, 1:ever, 0:never; 2:missing',
'which do you most frequently drink? in 1971 1: beer, 2: wine, 3: liquor, 4: other/unknown',
'use allergies medication in 1971, 1:ever, 0:never',
'dx asthma in 1971, 1:ever, 0:never',
'birth control pills past 6 months? in 1971 1:yes, 0:no, 2:missing',
'check state code - second page',
'use bowel trouble medication in 1971, 1:ever, 0:never, ; 2:missing',
'dx chronic bronchitis/emphysema in 1971, 1:ever, 0:never',
'serum cholesterol (mg/100ml) in 1971',
'dx chronic cough in 1971, 1:ever, 0:never',
'dx colitis in 1971, 1:ever, 0:never',
'day of death',
'diastolic blood pressure in 1982',
'death by 1992, 1:yes, 0:no',
'dx diabetes in 1971, 1:ever, 0:never, 2:missing',
'amount of education by 1971: 1: 8th grade or less, 2: hs dropout, 3: hs, 4:college dropout, 5: college or more',
'in recreation, how much exercise? in 1971, 0:much exercise,1:moderate exercise,2:little or no exercise',
'dx hay fever in 1971, 1:ever, 0:never',
'dx high blood pressure in 1971, 1:ever, 0:never, 2:missing',
'use high blood pressure medication in 1971, 1:ever, 0:never, ; 2:missing',
'use headache medication in 1971, 1:ever, 0:never',
'dx hepatitis in 1971, 1:ever, 0:never',
'dx heart failure in 1971, 1:ever, 0:never',
'living in a highly taxed state in 1982, high taxed state of residence=1, 0 otherwise',
'height in centimeters in 1971',
'total family income in 1971 11:<$1000, 12: 1000-1999, 13: 2000-2999, 14: 3000-3999, 15: 4000-4999, 16: 5000-5999, 17: 6000-6999, 18: 7000-9999, 19: 10000-14999, 20: 15000-19999, 21: 20000-24999, 22: 25000+',
'use infection medication in 1971, 1:ever, 0:never',
'uselack of pep medication in 1971, 1:ever, 0:never',
'marital status in 1971 1: under 17, 2: married, 3: widowed, 4: never married, 5: divorced, 6: separated, 8: unknown',
'month of death',
'use nerves medication in 1971, 1:ever, 0:never',
'dx nervous breakdown in 1971, 1:ever, 0:never',
'use other pains medication in 1971, 1:ever, 0:never',
'dx peptic ulcer in 1971, 1:ever, 0:never',
'do you eat dirt or clay, starch or other non standard food? in 1971 1:ever, 0:never; 2:missing',
'dx polio in 1971, 1:ever, 0:never',
'total number of pregnancies? in 1971',
'avg tobacco price in state of residence 1971 (us$2008)',
'difference in avg tobacco price in state of residence 1971-1982 (us$2008)',
'avg tobacco price in state of residence 1982 (us$2008)',
'quit smoking between 1st questionnaire and 1982, 1:yes, 0:no',
'0: white 1: black or other in 1971',
'systolic blood pressure in 1982',
'highest grade of regular school ever in 1971',
'unique personal identifier',
'0: male 1: female',
'number of cigarettes smoked per day in 1971',
'increase in number of cigarettes/day between 1971 and 1982',
'years of smoking',
'tobacco tax in state of residence 1971 (us$2008)',
'difference in tobacco tax in state of residence 1971-1982 (us$2008)',
'tobacco tax in state of residence 1971 (us$2008)',
'dx tuberculosis in 1971, 1:ever, 0:never',
'dx malignant tumor/growth in 1971, 1:ever, 0:never',
'use weak heart medication in 1971, 1:ever, 0:never',
'weight in kilograms in 1971',
'weight in kilograms in 1982',
'weight change in kilograms',
'use weight loss medication in 1971, 1:ever, 0:never',
'year of death']
[31]:
codebook
[31]:
Variable name | Description | |
---|---|---|
0 | active | in your usual day, how active are you? in 1971... |
1 | age | age in 1971 |
2 | alcoholfreq | how often do you drink? in 1971 0: almost ev... |
3 | alcoholhowmuch | when you drink, how much do you drink? in 1971 |
4 | alcoholpy | have you had 1 drink past year? in 1971, 1:ev... |
5 | alcoholtype | which do you most frequently drink? in 1971 1... |
6 | allergies | use allergies medication in 1971, 1:ever, 0:n... |
7 | asthma | dx asthma in 1971, 1:ever, 0:never |
8 | bithcontrol | birth control pills past 6 months? in 1971 1:y... |
9 | birthplace | check state code - second page |
10 | boweltrouble | use bowel trouble medication in 1971, 1:ever,... |
11 | bronch | dx chronic bronchitis/emphysema in 1971, 1:ev... |
12 | cholesterol | serum cholesterol (mg/100ml) in 1971 |
13 | chroniccough | dx chronic cough in 1971, 1:ever, 0:never |
14 | colitis | dx colitis in 1971, 1:ever, 0:never |
15 | dadth | day of death |
16 | dbp | diastolic blood pressure in 1982 |
17 | death | death by 1992, 1:yes, 0:no |
18 | diabetes | dx diabetes in 1971, 1:ever, 0:never, 2:missing |
19 | education | amount of education by 1971: 1: 8th grade or l... |
20 | exercise | in recreation, how much exercise? in 1971, 0:m... |
21 | hayfever | dx hay fever in 1971, 1:ever, 0:never |
22 | hbp | dx high blood pressure in 1971, 1:ever, 0:neve... |
23 | hbpmed | use high blood pressure medication in 1971, 1... |
24 | headache | use headache medication in 1971, 1:ever, 0:never |
25 | hepatitis | dx hepatitis in 1971, 1:ever, 0:never |
26 | hf | dx heart failure in 1971, 1:ever, 0:never |
27 | hightax82 | living in a highly taxed state in 1982, high t... |
28 | ht | height in centimeters in 1971 |
29 | income | total family income in 1971 11:<$1000, 12: 10... |
... | ... | ... |
34 | nerves | use nerves medication in 1971, 1:ever, 0:never |
35 | nervousbreak | dx nervous breakdown in 1971, 1:ever, 0:never |
36 | otherpain | use other pains medication in 1971, 1:ever, 0... |
37 | pepticulcer | dx peptic ulcer in 1971, 1:ever, 0:never |
38 | pica | do you eat dirt or clay, starch or other non s... |
39 | polio | dx polio in 1971, 1:ever, 0:never |
40 | pregnancies | total number of pregnancies? in 1971 |
41 | price71 | avg tobacco price in state of residence 1971 (... |
42 | price71_82 | difference in avg tobacco price in state of re... |
43 | price82 | avg tobacco price in state of residence 1982 (... |
44 | qsmk | quit smoking between 1st questionnaire and 198... |
45 | race | 0: white 1: black or other in 1971 |
46 | sbp | systolic blood pressure in 1982 |
47 | school | highest grade of regular school ever in 1971 |
48 | seqn | unique personal identifier |
49 | sex | 0: male 1: female |
50 | smokeintensity | number of cigarettes smoked per day in 1971 |
51 | smkintensity 82_71 | increase in number of cigarettes/day between 1... |
52 | smokeyrs | years of smoking |
53 | tax71 | tobacco tax in state of residence 1971 (us$2008) |
54 | tax71_82 | difference in tobacco tax in state of residenc... |
55 | tax82 | tobacco tax in state of residence 1971 (us$2008) |
56 | tb | dx tuberculosis in 1971, 1:ever, 0:never |
57 | tumor | dx malignant tumor/growth in 1971, 1:ever, 0:... |
58 | weakheart | use weak heart medication in 1971, 1:ever, 0:... |
59 | wt71 | weight in kilograms in 1971 |
60 | wt82 | weight in kilograms in 1982 |
61 | wt82_71 | weight change in kilograms |
62 | wtloss | use weight loss medication in 1971, 1:ever, 0... |
63 | yrdth | year of death |
64 rows × 2 columns
EDA¶
[25]:
dat.columns
[25]:
Index(['seqn', 'qsmk', 'death', 'yrdth', 'modth', 'dadth', 'sbp', 'dbp', 'sex',
'age', 'race', 'income', 'marital', 'school', 'education', 'ht', 'wt71',
'wt82', 'wt82_71', 'birthplace', 'smokeintensity', 'smkintensity82_71',
'smokeyrs', 'asthma', 'bronch', 'tb', 'hf', 'hbp', 'pepticulcer',
'colitis', 'hepatitis', 'chroniccough', 'hayfever', 'diabetes', 'polio',
'tumor', 'nervousbreak', 'alcoholpy', 'alcoholfreq', 'alcoholtype',
'alcoholhowmuch', 'pica', 'headache', 'otherpain', 'weakheart',
'allergies', 'nerves', 'lackpep', 'hbpmed', 'boweltrouble', 'wtloss',
'infection', 'active', 'exercise', 'birthcontrol', 'pregnancies',
'cholesterol', 'hightax82', 'price71', 'price82', 'tax71', 'tax82',
'price71_82', 'tax71_82'],
dtype='object')
[28]:
d = dat[['sex', 'wt82']]
d.head(10)
[28]:
sex | wt82 | |
---|---|---|
0 | 0 | 68.946040 |
1 | 0 | 61.234970 |
2 | 1 | 66.224486 |
3 | 0 | 64.410117 |
4 | 0 | 92.079251 |
5 | 1 | 103.419060 |
6 | 1 | 58.967008 |
7 | 1 | 58.967008 |
8 | 0 | 62.142155 |
9 | 0 | 72.121187 |
[38]:
d.boxplot(column='wt82', by='sex')
[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8650d6ec50>
[39]:
d.groupby('sex').describe()
[39]:
wt82 | ||||||||
---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | |
sex | ||||||||
0 | 762.0 | 80.131080 | 14.567845 | 38.101759 | 70.306817 | 78.925072 | 87.543327 | 136.531303 |
1 | 804.0 | 67.155366 | 15.022770 | 35.380205 | 57.039241 | 63.956524 | 73.481964 | 136.531303 |
[45]:
d_test = pd.DataFrame(data=[1,2,3,10,4],columns=['x'])
d_test
[45]:
x | |
---|---|
0 | 1 |
1 | 2 |
2 | 3 |
3 | 10 |
4 | 4 |
[46]:
d_test.boxplot(column='x')
[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8651b33978>
[ ]: