{ "cells": [ { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "# Part II: Causal Inference with Models\n", "\n", "A quick walkthrough of Part II of the textbook.\n", "\n", "Book website: https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/\n", "\n", "\n", "**Contents**\n", "\n", "```\n", "11. Why model?\n", "12. IP weighting and marginal structural models\n", "13. Standardization and the parametric g-formula\n", "14. G-estimation of structural nested models\n", "15. Outcome regression and propensity scores\n", "16. Instrumental variable estimation\n", "17. Causal survival analysis\n", "18. Variable selection for causal inference\n", "```" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "**Big Picture:**\n", "\n", "Nonparametric models --> parametric models --> structural mean models\n", "\n", "- Models for the marginal mean of a counterfactual outcome are referred to as marginal structural mean models.\n", "- Effect modification and marginal structural models: add new variables to the model above to capture effect modification.\n", "- A structural nested mean model involves a contrast against some baseline potential outcome.\n", "\n", "In Part II of this book we have described two different types of models for causal inference: \n", "\n", "- propensity models and \n", "- structural models. \n", "\n", "Let us now compare them. \n", "\n", "- Propensity models are models for the probability of treatment $A$ given the variables $L$ used to try to achieve conditional exchangeability. 
\n", "- Structural models describe the relation between the treatment $A$ and some component of the distribution (e.g., the mean) of the counterfactual outcome." ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false", "toc-hr-collapsed": true }, "source": [ "## A simple example" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "Collapsed": "false" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "chapter11.ipynb chapter13.ipynb chapter15.ipynb codebook.xls\n", "chapter12.ipynb chapter14.ipynb chapter16.ipynb README.md\n" ] } ], "source": [ "ls causal_inference_python_code/" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "### Dataset overview" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "Collapsed": "false" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ " seqn qsmk death yrdth modth \\\n", "count 1629.000000 1629.000000 1629.000000 318.000000 322.000000 \n", "mean 16552.364641 0.262738 0.195212 87.569182 6.257764 \n", "std 7498.918195 0.440256 0.396485 2.659415 3.615304 \n", "min 233.000000 0.000000 0.000000 83.000000 1.000000 \n", "25% 10607.000000 0.000000 0.000000 85.000000 3.000000 \n", "50% 20333.000000 0.000000 0.000000 88.000000 6.000000 \n", "75% 22719.000000 1.000000 0.000000 90.000000 10.000000 \n", "max 25061.000000 1.000000 1.000000 92.000000 12.000000 \n", "\n", " dadth sbp dbp sex age \\\n", "count 322.000000 1552.000000 1548.000000 1629.000000 1629.000000 \n", "mean 15.872671 128.709407 77.744832 0.509515 43.915285 \n", "std 8.905488 19.051560 10.634864 0.500063 12.170430 \n", "min 1.000000 87.000000 47.000000 0.000000 25.000000 \n", "25% 8.000000 116.000000 70.000000 0.000000 33.000000 \n", "50% 15.500000 126.000000 77.000000 1.000000 44.000000 \n", "75% 24.000000 140.000000 85.000000 1.000000 53.000000 \n", "max 31.000000 229.000000 130.000000 1.000000 74.000000 \n", "\n", " ... birthcontrol pregnancies cholesterol hightax82 \\\n", "count ... 1629.000000 726.00000 1613.000000 1537.000000 \n", "mean ... 1.084715 3.69146 219.973962 0.165908 \n", "std ... 0.947747 2.20560 45.444202 0.372119 \n", "min ... 0.000000 1.00000 78.000000 0.000000 \n", "25% ... 0.000000 2.00000 189.000000 0.000000 \n", "50% ... 1.000000 3.00000 216.000000 0.000000 \n", "75% ... 2.000000 5.00000 245.000000 0.000000 \n", "max ... 
2.000000 15.00000 416.000000 1.000000 \n", "\n", " price71 price82 tax71 tax82 price71_82 \\\n", "count 1537.000000 1537.000000 1537.000000 1537.000000 1537.000000 \n", "mean 2.138750 1.806095 1.058581 0.505983 0.332741 \n", "std 0.229016 0.130641 0.216229 0.111894 0.155045 \n", "min 1.506592 1.451904 0.524902 0.219971 -0.202698 \n", "25% 2.036621 1.739990 0.944946 0.439941 0.200989 \n", "50% 2.167969 1.814941 1.049805 0.505981 0.335999 \n", "75% 2.241699 1.867676 1.154785 0.571899 0.443787 \n", "max 2.692871 2.103027 1.522461 0.747925 0.612061 \n", "\n", " tax71_82 \n", "count 1537.000000 \n", "mean 0.552614 \n", "std 0.150321 \n", "min 0.035995 \n", "25% 0.460999 \n", "50% 0.543945 \n", "75% 0.621948 \n", "max 0.884399 \n", "\n", "[8 rows x 64 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dat = pd.read_excel('NHEFS.xls')\n", "dat.describe()" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "Collapsed": "false" }, "outputs": [ { "data": { "text/plain": [ "['in your usual day, how active are you? in 1971, 0:very active, 1:moderately active, 2:inactive',\n", " 'age in 1971',\n", " 'how often do you drink? in 1971 0: almost every day, 1: 2-3 times/week, 2: 1-4 times/month, 3: < 12 times/year, 4: no alcohol last year, 5: unknown',\n", " 'when you drink, how much do you drink? in 1971',\n", " 'have you had 1 drink past year? in 1971, 1:ever, 0:never; 2:missing',\n", " 'which do you most frequently drink? in 1971 1: beer, 2: wine, 3: liquor, 4: other/unknown',\n", " 'use allergies medication in 1971, 1:ever, 0:never',\n", " 'dx asthma in 1971, 1:ever, 0:never',\n", " 'birth control pills past 6 months? 
in 1971 1:yes, 0:no, 2:missing',\n", " 'check state code - second page',\n", " 'use bowel trouble medication in 1971, 1:ever, 0:never, ; 2:missing',\n", " 'dx chronic bronchitis/emphysema in 1971, 1:ever, 0:never',\n", " 'serum cholesterol (mg/100ml) in 1971',\n", " 'dx chronic cough in 1971, 1:ever, 0:never',\n", " 'dx colitis in 1971, 1:ever, 0:never',\n", " 'day of death',\n", " 'diastolic blood pressure in 1982',\n", " 'death by 1992, 1:yes, 0:no',\n", " 'dx diabetes in 1971, 1:ever, 0:never, 2:missing',\n", " 'amount of education by 1971: 1: 8th grade or less, 2: hs dropout, 3: hs, 4:college dropout, 5: college or more',\n", " 'in recreation, how much exercise? in 1971, 0:much exercise,1:moderate exercise,2:little or no exercise',\n", " 'dx hay fever in 1971, 1:ever, 0:never',\n", " 'dx high blood pressure in 1971, 1:ever, 0:never, 2:missing',\n", " 'use high blood pressure medication in 1971, 1:ever, 0:never, ; 2:missing',\n", " 'use headache medication in 1971, 1:ever, 0:never',\n", " 'dx hepatitis in 1971, 1:ever, 0:never',\n", " 'dx heart failure in 1971, 1:ever, 0:never',\n", " 'living in a highly taxed state in 1982, high taxed state of residence=1, 0 otherwise',\n", " 'height in centimeters in 1971',\n", " 'total family income in 1971 11:<$1000, 12: 1000-1999, 13: 2000-2999, 14: 3000-3999, 15: 4000-4999, 16: 5000-5999, 17: 6000-6999, 18: 7000-9999, 19: 10000-14999, 20: 15000-19999, 21: 20000-24999, 22: 25000+',\n", " 'use infection medication in 1971, 1:ever, 0:never',\n", " 'uselack of pep medication in 1971, 1:ever, 0:never',\n", " 'marital status in 1971 1: under 17, 2: married, 3: widowed, 4: never married, 5: divorced, 6: separated, 8: unknown',\n", " 'month of death',\n", " 'use nerves medication in 1971, 1:ever, 0:never',\n", " 'dx nervous breakdown in 1971, 1:ever, 0:never',\n", " 'use other pains medication in 1971, 1:ever, 0:never',\n", " 'dx peptic ulcer in 1971, 1:ever, 0:never',\n", " 'do you eat dirt or clay, starch or other non standard 
food? in 1971 1:ever, 0:never; 2:missing',\n", " 'dx polio in 1971, 1:ever, 0:never',\n", " 'total number of pregnancies? in 1971',\n", " 'avg tobacco price in state of residence 1971 (us$2008)',\n", " 'difference in avg tobacco price in state of residence 1971-1982 (us$2008)',\n", " 'avg tobacco price in state of residence 1982 (us$2008)',\n", " 'quit smoking between 1st questionnaire and 1982, 1:yes, 0:no',\n", " '0: white 1: black or other in 1971',\n", " 'systolic blood pressure in 1982',\n", " 'highest grade of regular school ever in 1971',\n", " 'unique personal identifier',\n", " '0: male 1: female',\n", " 'number of cigarettes smoked per day in 1971',\n", " 'increase in number of cigarettes/day between 1971 and 1982',\n", " 'years of smoking',\n", " 'tobacco tax in state of residence 1971 (us$2008)',\n", " 'difference in tobacco tax in state of residence 1971-1982 (us$2008)',\n", " 'tobacco tax in state of residence 1971 (us$2008)',\n", " 'dx tuberculosis in 1971, 1:ever, 0:never',\n", " 'dx malignant tumor/growth in 1971, 1:ever, 0:never',\n", " 'use weak heart medication in 1971, 1:ever, 0:never',\n", " 'weight in kilograms in 1971',\n", " 'weight in kilograms in 1982',\n", " 'weight change in kilograms',\n", " 'use weight loss medication in 1971, 1:ever, 0:never',\n", " 'year of death']" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "codebook = pd.read_excel('causal_inference_python_code/codebook.xls')\n", "codebook.Description = codebook.Description.apply(lambda s: s.lower())\n", "codebook.iloc[:, 1].tolist()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": { "Collapsed": "false", "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Variable name Description\n", "0 active in your usual day, how active are you? in 1971...\n", "1 age age in 1971\n", "2 alcoholfreq how often do you drink? in 1971 0: almost ev...\n", "3 alcoholhowmuch when you drink, how much do you drink? in 1971\n", "4 alcoholpy have you had 1 drink past year? in 1971, 1:ev...\n", "5 alcoholtype which do you most frequently drink? in 1971 1...\n", "6 allergies use allergies medication in 1971, 1:ever, 0:n...\n", "7 asthma dx asthma in 1971, 1:ever, 0:never\n", "8 bithcontrol birth control pills past 6 months? in 1971 1:y...\n", "9 birthplace check state code - second page\n", "10 boweltrouble use bowel trouble medication in 1971, 1:ever,...\n", "11 bronch dx chronic bronchitis/emphysema in 1971, 1:ev...\n", "12 cholesterol serum cholesterol (mg/100ml) in 1971\n", "13 chroniccough dx chronic cough in 1971, 1:ever, 0:never\n", "14 colitis dx colitis in 1971, 1:ever, 0:never\n", "15 dadth day of death\n", "16 dbp diastolic blood pressure in 1982\n", "17 death death by 1992, 1:yes, 0:no\n", "18 diabetes dx diabetes in 1971, 1:ever, 0:never, 2:missing\n", "19 education amount of education by 1971: 1: 8th grade or l...\n", "20 exercise in recreation, how much exercise? in 1971, 0:m...\n", "21 hayfever dx hay fever in 1971, 1:ever, 0:never\n", "22 hbp dx high blood pressure in 1971, 1:ever, 0:neve...\n", "23 hbpmed use high blood pressure medication in 1971, 1...\n", "24 headache use headache medication in 1971, 1:ever, 0:never\n", "25 hepatitis dx hepatitis in 1971, 1:ever, 0:never\n", "26 hf dx heart failure in 1971, 1:ever, 0:never\n", "27 hightax82 living in a highly taxed state in 1982, high t...\n", "28 ht height in centimeters in 1971\n", "29 income total family income in 1971 11:<$1000, 12: 10...\n", ".. ... 
...\n", "34 nerves use nerves medication in 1971, 1:ever, 0:never\n", "35 nervousbreak dx nervous breakdown in 1971, 1:ever, 0:never\n", "36 otherpain use other pains medication in 1971, 1:ever, 0...\n", "37 pepticulcer dx peptic ulcer in 1971, 1:ever, 0:never\n", "38 pica do you eat dirt or clay, starch or other non s...\n", "39 polio dx polio in 1971, 1:ever, 0:never\n", "40 pregnancies total number of pregnancies? in 1971\n", "41 price71 avg tobacco price in state of residence 1971 (...\n", "42 price71_82 difference in avg tobacco price in state of re...\n", "43 price82 avg tobacco price in state of residence 1982 (...\n", "44 qsmk quit smoking between 1st questionnaire and 198...\n", "45 race 0: white 1: black or other in 1971\n", "46 sbp systolic blood pressure in 1982\n", "47 school highest grade of regular school ever in 1971\n", "48 seqn unique personal identifier\n", "49 sex 0: male 1: female\n", "50 smokeintensity number of cigarettes smoked per day in 1971\n", "51 smkintensity 82_71 increase in number of cigarettes/day between 1...\n", "52 smokeyrs years of smoking\n", "53 tax71 tobacco tax in state of residence 1971 (us$2008)\n", "54 tax71_82 difference in tobacco tax in state of residenc...\n", "55 tax82 tobacco tax in state of residence 1971 (us$2008)\n", "56 tb dx tuberculosis in 1971, 1:ever, 0:never\n", "57 tumor dx malignant tumor/growth in 1971, 1:ever, 0:...\n", "58 weakheart use weak heart medication in 1971, 1:ever, 0:...\n", "59 wt71 weight in kilograms in 1971\n", "60 wt82 weight in kilograms in 1982\n", "61 wt82_71 weight change in kilograms\n", "62 wtloss use weight loss medication in 1971, 1:ever, 0...\n", "63 yrdth year of death\n", "\n", "[64 rows x 2 columns]" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "codebook" ] }, { "cell_type": "markdown", "metadata": { "Collapsed": "false" }, "source": [ "### EDA" ] }, { "cell_type": "code", "execution_count": 25, "metadata": { "Collapsed": 
"false" }, "outputs": [ { "data": { "text/plain": [ "Index(['seqn', 'qsmk', 'death', 'yrdth', 'modth', 'dadth', 'sbp', 'dbp', 'sex',\n", " 'age', 'race', 'income', 'marital', 'school', 'education', 'ht', 'wt71',\n", " 'wt82', 'wt82_71', 'birthplace', 'smokeintensity', 'smkintensity82_71',\n", " 'smokeyrs', 'asthma', 'bronch', 'tb', 'hf', 'hbp', 'pepticulcer',\n", " 'colitis', 'hepatitis', 'chroniccough', 'hayfever', 'diabetes', 'polio',\n", " 'tumor', 'nervousbreak', 'alcoholpy', 'alcoholfreq', 'alcoholtype',\n", " 'alcoholhowmuch', 'pica', 'headache', 'otherpain', 'weakheart',\n", " 'allergies', 'nerves', 'lackpep', 'hbpmed', 'boweltrouble', 'wtloss',\n", " 'infection', 'active', 'exercise', 'birthcontrol', 'pregnancies',\n", " 'cholesterol', 'hightax82', 'price71', 'price82', 'tax71', 'tax82',\n", " 'price71_82', 'tax71_82'],\n", " dtype='object')" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "dat.columns" ] }, { "cell_type": "code", "execution_count": 28, "metadata": { "Collapsed": "false" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sex wt82\n", "0 0 68.946040\n", "1 0 61.234970\n", "2 1 66.224486\n", "3 0 64.410117\n", "4 0 92.079251\n", "5 1 103.419060\n", "6 1 58.967008\n", "7 1 58.967008\n", "8 0 62.142155\n", "9 0 72.121187" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d = dat[['sex', 'wt82']]\n", "d.head(10)" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "Collapsed": "false" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "d.boxplot(column='wt82', by='sex')" ] }, { "cell_type": "code", "execution_count": 39, "metadata": { "Collapsed": "false" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " wt82 \\\n", " count mean std min 25% 50% 75% \n", "sex \n", "0 762.0 80.131080 14.567845 38.101759 70.306817 78.925072 87.543327 \n", "1 804.0 67.155366 15.022770 35.380205 57.039241 63.956524 73.481964 \n", "\n", " \n", " max \n", "sex \n", "0 136.531303 \n", "1 136.531303 " ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d.groupby('sex').describe()" ] }, { "cell_type": "code", "execution_count": 45, "metadata": { "Collapsed": "false" }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " x\n", "0 1\n", "1 2\n", "2 3\n", "3 10\n", "4 4" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "d_test = pd.DataFrame(data=[1,2,3,10,4],columns=['x'])\n", "d_test" ] }, { "cell_type": "code", "execution_count": 46, "metadata": { "Collapsed": "false" }, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDMuMC4yLCBodHRwOi8vbWF0cGxvdGxpYi5vcmcvOIA7rQAACkJJREFUeJzt3V+Ipfddx/HP152UNNsUW1LH0mpHb2RgvVCOoBVltuuVKeiFYgPWPyzsjcYqgl1ZJDcu3YKIBb0JXTWijmhatDRQW9o5iIjB2bRo0xEETeKf2KSI2A1Ku+vPi4wS4m5m5znnzOx+9/WCYWbPPs/z+y4c3jz7nPPMqTFGALjzfc1xDwDAcgg6QBOCDtCEoAM0IegATQg6QBOCDtCEoAM0IegATawd5WIPPPDA2NjYOMol4Za89NJLOXny5HGPATd05cqVL40x3nLQdkca9I2Njezu7h7lknBL5vN5tra2jnsMuKGqevZWtnPJBaAJQQdoQtABmhB0gCYEHaCJA4NeVb9ZVS9U1edf8dibq+pTVfV3+9/ftNoxYTW2t7dz6tSpnDlzJqdOncr29vZxjwST3crbFn87ya8n+Z1XPHY+yafHGJeq6vz+n9+//PFgdba3t3PhwoVcvnw5169fz4kTJ3L27NkkyUMPPXTM08HhHXiGPsb4syT/9qqHfyDJY/s/P5bkB5c8F6zcxYsXc/ny5Zw+fTpra2s5ffp0Ll++nIsXLx73aDDJ1BuL1scYzyfJGOP5qvq6m21YVeeSnEuS9fX1zOfziUvCcu3t7eX69euZz+e5evVq5vN5rl+/nr29Pc9T7kgrv1N0jPFokkeTZDabDXfjcbvY3NzMiRMnsrW19X93iu7s7GRzc9Ndo9yRpr7L5YtV9dYk2f/+wvJGgqNx4cKFnD17Njs7O7l27Vp2dnZy9uzZXLhw4bhHg0mmnqF/LMmPJ7m0//1PljYRHJH/feHz4Ycfzt7eXjY3N3Px4kUviHLHqjHGa29QtZ1kK8kDSb6Y5JEkf5zkD5N8Y5LnkvzwGOPVL5z+P7PZbPjlXNyO/HIubmdVdWWMMTtouwPP0McYNztdOXPoqQBYGXeKAjQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAEwsFvap+rqqerqrPV9V2Vd27rMEAOJzJQa+qtyX5mSSzMcapJCeSvGdZgwFwOItecllL8vqqWktyX5J/WXwkAKZYm7rjGOOfq+pXkjyX5D+TfHKM8clXb1dV55KcS5L19fXM5/OpS8LKXL161XOTO16NMabtWPWmJB9J8iNJ/j3JHyV5fIzxuzfbZzabjd3d3UnrwSrN5/NsbW0d9xhwQ1V1ZYwxO2i7RS65fF+Sfxhj
vDjG+GqSjyZ55wLHA2ABiwT9uSTfWVX3VVUlOZNkbzljAXBYk4M+xngyyeNJnkryN/vHenRJcwFwSJNfFE2SMcYjSR5Z0iwALMCdogBNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAE4IO0ISgAzQh6ABNCDpAEwsFvaq+tqoer6q/raq9qvquZQ0GwOGsLbj/h5J8YozxQ1X1uiT3LWEmACaYHPSqemOS703yE0kyxvhKkq8sZywADmuRSy7fnOTFJL9VVZ+tqg9X1cklzQXAIS1yyWUtybcneXiM8WRVfSjJ+SS/9MqNqupcknNJsr6+nvl8vsCSsBpXr1713OSOV2OMaTtWfX2SvxxjbOz/+XuSnB9jPHizfWaz2djd3Z20HqzSfD7P1tbWcY8BN1RVV8YYs4O2m3zJZYzxr0n+saq+Zf+hM0m+MPV4ACxm0Xe5PJzk9/bf4fL3SX5y8ZEAmGKhoI8xPpfkwP8GALB67hQFaELQAZoQdIAmBB2gCUEHaELQAZoQdIAmBB2gCUEHaELQAZoQdIAmBB2gCUEHaELQAZoQdIAmBB2gCUEHaELQAZoQdIAmBB2gCUEHaELQAZoQdIAmBB2gCUEHaELQAZoQdIAmBB2gCUEHaELQAZoQdIAmBB2gCUEHaELQAZoQdIAmFg56VZ2oqs9W1ceXMRAA0yzjDP19SfaWcBwAFrBQ0Kvq7UkeTPLh5YwDwFSLnqH/WpJfSPLfS5gFgAWsTd2xqt6d5IUxxpWq2nqN7c4lOZck6+vrmc/nU5fkLvVTn34pL331cPs8+8F3r2aYV3nH+2/9paOT9yS/cebkCqfhbldjjGk7Vn0gyXuTXEtyb5I3JvnoGONHb7bPbDYbu7u7k9bj7rVx/ok8c+nBla4xn8+ztbW10jWO4t9BT1V1ZYwxO2i7yZdcxhi/OMZ4+xhjI8l7knzmtWIOwGp5HzpAE5Ovob/SGGOeZL6MYwEwjTN0gCYEHaAJQQdoQtABmhB0gCYEHaAJQQdoQtABmhB0gCYEHaAJQQdoQtABmhB0gCYEHaAJQQdoQtABmhB0gCaW8olFsEr3b57Ptz52fvULPbbaw9+/mSQ+JJrVEXRue1/eu5RnLq02hPP5PFtbWytdY+P8Eys9PrjkAtCEoAM0IegATQg6QBOCDtCEoAM0IegATQg6QBOCDtCEoAM0IegATQg6QBOCDtCEoAM0IegATQg6QBOTg15V31BVO1W1V1VPV9X7ljkYAIezyCcWXUvy82OMp6rq/iRXqupTY4wvLGk2AA5h8hn6GOP5McZT+z9/OclekrctazAADmcp19CraiPJtyV5chnHA+DwFv6Q6Kp6Q5KPJPnZMcZ/3ODvzyU5lyTr6+uZz+eLLsld6Eg+YPkTq13j5D3x/Gelaowxfeeqe5J8PMmfjjF+9aDtZ7PZ2N3dnbwerMrG+SfyzKUHj3sMuKGqujLGmB203SLvcqkkl5Ps3UrMAVitRa6hf3eS9yZ5V1V9bv/r+5c0FwCHNPka+hjjz5PUEmcBYAHuFAVoQtABmhB0gCYEHaAJQQdoQtABmhB0gCYEHaAJQQdoQtABmhB0gCYEHaAJQQdoQtABmhB0gCYEHaAJQQdoYvInFsHt7OWPvD3kPh88/DqLfMg6LJszdFoaYxzqa2dn59D7iDm3G0EHaELQAZoQdIAmBB2gCUEHaELQAZoQdIAmBB2giTrKmyOq6sUkzx7ZgnDrHkjypeMeAm7iHWOMtxy00ZEGHW5XVbU7xpgd9xywCJdcAJoQdIAmBB1e9uhxDwCLcg0doAln6ABNCDpAE4IO0ISgc1erqu+oqr+uqnur6mRVPV1Vp457LpjCi6Lc9arql5Pcm+T1Sf5pjPGBYx4JJhF07npV
9bokf5Xkv5K8c4xx/ZhHgklccoHkzUnekOT+vHymDnckZ+jc9arqY0n+IMk3JXnrGOOnj3kkmGTtuAeA41RVP5bk2hjj96vqRJK/qKp3jTE+c9yzwWE5QwdowjV0gCYEHaAJQQdoQtABmhB0gCYEHaAJQQdoQtABmvgfCNNO8P2aX9AAAAAASUVORK5CYII=\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "d_test.boxplot(column='x')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "Collapsed": "false" }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.4" }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": true, "toc_window_display": true }, "varInspector": { "cols": { "lenName": 16, "lenType": 16, "lenVar": 40 }, "kernels_config": { "python": { "delete_cmd_postfix": "", "delete_cmd_prefix": "del ", "library": "var_list.py", "varRefreshCmd": "print(var_dic_list())" }, "r": { "delete_cmd_postfix": ") ", "delete_cmd_prefix": "rm(", "library": "var_list.r", "varRefreshCmd": "cat(var_dic_list()) " } }, "types_to_exclude": [ "module", "function", "builtin_function_or_method", "instance", "_Feature" ], "window_display": false } }, "nbformat": 4, "nbformat_minor": 4 }