# ObsDatAna2CatVars

January 8, 2018 | Author: Anonymous | Category: Science, Health Science, Oncology

#### Description

Exploratory data analysis with two qualitative variables. Not in FPP.

#### Exploratory data analysis with two qualitative/categorical variables

- Main tools
  - Contingency tables
  - Conditional, marginal, and joint frequencies

#### Motivating example

- Surviving the Titanic
  - Was there class discrimination in survival of the wreck of the Titanic?
  - “It has been suggested before the Enquiry that the third-class passengers had been unfairly treated, that their access to the boat deck had been impeded; and that when they reached the deck the first and second-class passengers were given precedence in getting places in the boats.” Lord Mersey, 1912

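#### Titanic: Class by survival

|       | 1st Class | 2nd Class | 3rd Class | Crew | Total |
|-------|-----------|-----------|-----------|------|-------|
| Dead  | 122 | 167 | 528 | 696 | 1513 |
| Alive | 203 | 118 | 178 | 212 | 711 |
| Total | 325 | 285 | 706 | 908 | 2224 |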

#### Titanic: Marginal frequencies

- % Dead = 1513/2224 = 0.68
- % Alive = 711/2224 = 0.32
- % in first class = 325/2224 = 0.15
- % in second class = 285/2224 = 0.13
- % in third class = 706/2224 = 0.32
- % crew = 908/2224 = 0.41
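The marginal frequencies can be recomputed from the class-by-survival counts. The sketch below is one possible way to do it in plain Python; the dictionary layout and the helper name `marginal_frequencies` are assumptions made for illustration, not part of the original slides.

```python
# Counts copied from the class-by-survival table (rows: outcome, columns: class).
counts = {
    "Dead":  {"1st": 122, "2nd": 167, "3rd": 528, "Crew": 696},
    "Alive": {"1st": 203, "2nd": 118, "3rd": 178, "Crew": 212},
}

def marginal_frequencies(table):
    """Return (row marginals, column marginals) as proportions of the grand total."""
    grand_total = sum(sum(row.values()) for row in table.values())
    row_marginals = {r: sum(row.values()) / grand_total for r, row in table.items()}
    columns = next(iter(table.values())).keys()
    col_marginals = {
        c: sum(table[r][c] for r in table) / grand_total for c in columns
    }
    return row_marginals, col_marginals

row_marg, col_marg = marginal_frequencies(counts)
for name, p in {**row_marg, **col_marg}.items():
    print(f"{name}: {p:.3f}")
```

Each marginal ignores the other variable entirely: the row marginals describe survival without regard to class, and the column marginals describe class without regard to survival.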

#### Titanic: Conditional frequencies

- % (Alive | 1st) = 203/325 = 0.625
- % (Alive | 2nd) = 118/285 = 0.414
- % (Alive | 3rd) = 178/706 = 0.252
- % (Alive | Crew) = 212/908 = 0.233
- Based on these frequencies, does there appear to be class discrimination?
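To make the difference between conditional and joint frequencies concrete, here is a minimal sketch that computes both from the class-by-survival counts. The variable names are illustrative assumptions; only the counts come from the slides.

```python
# Counts copied from the class-by-survival table.
alive = {"1st": 203, "2nd": 118, "3rd": 178, "Crew": 212}
dead = {"1st": 122, "2nd": 167, "3rd": 528, "Crew": 696}

# Conditional frequency: divide by the class total, not by the grand total.
survival_given_class = {
    cls: alive[cls] / (alive[cls] + dead[cls]) for cls in alive
}

# Joint frequency for comparison: share of everyone on board who was
# both in that class and alive.
grand_total = sum(alive.values()) + sum(dead.values())
joint_alive = {cls: alive[cls] / grand_total for cls in alive}

for cls in alive:
    print(f"{cls}: %(alive | class) = {survival_given_class[cls]:.3f}, "
          f"%(alive and class) = {joint_alive[cls]:.3f}")
```

The conditional frequencies are the ones that answer the discrimination question, because they compare survival within groups of different sizes.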

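#### Titanic: Class by person type

|          | 1st Class | 2nd Class | 3rd Class | Crew | Total |
|----------|-----------|-----------|-----------|------|-------|
| Children | 6 | 24 | 79 | 0 | 109 |
| Women | 144 | 93 | 165 | 23 | 425 |
| Men | 175 | 168 | 462 | 885 | 1690 |
| Total | 325 | 285 | 706 | 908 | 2224 |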

#### Titanic: percentage of men in each class

- % (Man | 1st) = 175/325 = 0.54
- % (Man | 2nd) = 168/285 = 0.59
- % (Man | 3rd) = 462/706 = 0.65
- % (Man | Crew) = 885/908 = 0.97
- Third class and the crew had larger percentages of men than first and second class
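The same conditioning can be applied to every person type at once, which gives the full distribution of children, women, and men within each class. The helper `column_proportions` below is a hypothetical name used only for this sketch; the counts are taken from the class-by-person-type table.

```python
# Counts copied from the class-by-person-type table.
person_type_by_class = {
    "Children": {"1st": 6, "2nd": 24, "3rd": 79, "Crew": 0},
    "Women": {"1st": 144, "2nd": 93, "3rd": 165, "Crew": 23},
    "Men": {"1st": 175, "2nd": 168, "3rd": 462, "Crew": 885},
}

def column_proportions(table):
    """Convert counts to proportions within each column (here: within each class)."""
    columns = next(iter(table.values())).keys()
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {
        r: {c: row[c] / totals[c] for c in columns} for r, row in table.items()
    }

props = column_proportions(person_type_by_class)
for cls in ["1st", "2nd", "3rd", "Crew"]:
    shares = ", ".join(f"{kind} {props[kind][cls]:.2f}" for kind in props)
    print(f"{cls}: {shares}")
```

Within each class the three proportions sum to 1, so a larger share of men necessarily means a smaller combined share of women and children.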

#### Surviving the Titanic

- A reason for class differences in survival:
  - A larger percentage of men died
  - 3rd class consisted mostly of men
  - Hence, a larger percentage of 3rd class passengers died
- Once again, keep in mind possible lurking variables that could be driving the relationship seen between two measured variables

#### Relative risk and odds ratios

- Motivating example
  - Physicians’ health study (1989): randomized experiment with 22071 male physicians at least 40 years old
  - Half the subjects were assigned to take aspirin every other day
  - The other half were assigned to take a placebo, a dummy pill that looked and tasted like aspirin

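#### Physicians’ health study

- Here is the number of people in each cell:

|         | Heart attack | No heart attack | Total |
|---------|--------------|-----------------|-------|
| Aspirin | 104 | 10933 | 11037 |
| Placebo | 189 | 10845 | 11034 |
| Total | 293 | 21778 | 22071 |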

#### Relative risk

|       | y1 | y2 | Total | Risk of y1 |
|-------|----|----|-------|------------|
| x1 | a | b | a+b | a/(a+b) |
| x2 | c | d | c+d | c/(c+d) |
| Total | a+c | b+d | | |

$$
\text{Relative risk} = \frac{a/(a+b)}{c/(c+d)}
$$

#### Relative risk for physicians’ health study

- Relative risk of a heart attack when taking aspirin versus when taking a placebo equals

$$
RR = \frac{104/(104+10933)}{189/(189+10845)} = 0.55
$$

- People who took aspirin were 0.55 times as likely to have a heart attack as people who took the placebo
- Or, people who took the placebo were 1/0.55 = 1.82 times as likely to have a heart attack as people who took aspirin
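The relative risk calculation can be checked with a few lines of Python. This is a minimal sketch: the function name `relative_risk` and its argument order (a, b for level x1, then c, d for level x2) are assumptions chosen to mirror the a, b, c, d layout of the generic table.

```python
def relative_risk(a, b, c, d):
    """Relative risk of y1 for level x1 versus level x2.

    a, b: counts of y1 and y2 at level x1
    c, d: counts of y1 and y2 at level x2
    """
    risk_x1 = a / (a + b)
    risk_x2 = c / (c + d)
    return risk_x1 / risk_x2

# Physicians' health study: aspirin row first, then placebo row.
rr_aspirin_vs_placebo = relative_risk(104, 10933, 189, 10845)
# Swapping the rows gives the reciprocal comparison.
rr_placebo_vs_aspirin = relative_risk(189, 10845, 104, 10933)

print(f"RR (aspirin vs placebo): {rr_aspirin_vs_placebo:.2f}")
print(f"RR (placebo vs aspirin): {rr_placebo_vs_aspirin:.2f}")
```

Which group goes in the numerator is a choice of reference level; the two results are reciprocals of each other and describe the same association.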



#### Odds ratios

|    | y1 | y2 | Odds of y1 |
|----|----|----|------------|
| x1 | a | b | a/b |
| x2 | c | d | c/d |

$$
\text{Odds ratio} = \frac{a/b}{c/d}
$$

#### Odds ratios for physicians’ health study

- Relative risk of a heart attack when taking aspirin versus taking a placebo is

$$
RR = \frac{104/(104+10933)}{189/(189+10845)} = 0.55
$$

- Odds of having a heart attack when taking aspirin over odds of a heart attack when taking a placebo (odds ratio)

$$
OR = \frac{104/10933}{189/10845} = 0.546
$$
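The odds ratio for the physicians' health study can be reproduced the same way. The sketch below assumes the a, b, c, d cell layout of the generic odds ratio table; the function names are illustrative, and the cross-product form is included only as an algebraically equivalent check.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio of y1 for level x1 versus level x2: (a/b) / (c/d)."""
    return (a / b) / (c / d)

def cross_product_ratio(a, b, c, d):
    """Algebraically equal form (a*d) / (b*c)."""
    return (a * d) / (b * c)

# Aspirin row first (heart attack, no heart attack), then placebo row.
or_aspirin = odds_ratio(104, 10933, 189, 10845)

print(f"Odds, aspirin: {104 / 10933:.5f}")
print(f"Odds, placebo: {189 / 10845:.5f}")
print(f"Odds ratio:    {or_aspirin:.3f}")
```

Unlike the relative risk, the odds ratio never uses the row totals, which is why it can still be computed when those totals are fixed by the study design.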

#### Interpreting odds ratios and relative risks

- When the variables X and Y are independent
  - odds ratio = 1
  - relative risk = 1
- When subjects with level x1 are more likely to have y1 than subjects with level x2, then
  - odds ratio > 1
  - relative risk > 1
- When subjects with level x1 are less likely to have y1 than subjects with level x2, then
  - odds ratio < 1
  - relative risk < 1
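A short derivation shows why both measures equal 1 under independence, using the cell counts a, b, c, d from the generic tables. Independence is taken here to mean that the risk of y1 is the same at both levels of X; the symbol p for that common risk is introduced only for this sketch, and 0 < p < 1 is assumed.

$$
\frac{a}{a+b} = \frac{c}{c+d} = p
\;\Longrightarrow\;
\text{Relative risk} = \frac{p}{p} = 1,
\qquad
\frac{a}{b} = \frac{c}{d} = \frac{p}{1-p}
\;\Longrightarrow\;
\text{Odds ratio} = 1
$$

Because the odds p/(1-p) increase whenever the risk p increases, the two measures always fall on the same side of 1.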

#### Which one should be used?

- If the relative risk is available, then it should be used
  - In a cohort study, the relative risk can be calculated directly
  - In a case-control study, the relative risk cannot be calculated directly, so an odds ratio is used instead
    - Case-control studies compare subjects who have a “condition” (cases) to subjects who do not have it but are otherwise similar (controls)
    - In this type of study we know %(exposure | disease). But to compute the RR we need %(disease | exposure).
    - Recall that RR = %(disease | exposure) / %(disease | no exposure)
  - The relative risk is not available in more complex modeling (logistic regression)

#### Odds ratio vs relative risk

- When is the odds ratio a good approximation of the relative risk?
  - When cases are representative of the diseased population
  - When controls are representative of the population without the disease
  - When the disease being studied occurs at low frequency
- In itself, an odds ratio is a useful measure of association
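The low-frequency condition can be seen from a one-line identity. Writing p1 and p2 for the risks of y1 at levels x1 and x2 (symbols introduced only for this sketch), the odds ratio is the relative risk times a correction factor:

$$
\text{OR} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \frac{p_1}{p_2}\cdot\frac{1-p_2}{1-p_1} = \text{RR}\cdot\frac{1-p_2}{1-p_1}
$$

When both risks are small, the factor (1 - p2)/(1 - p1) is close to 1. In the physicians' health study, p1 = 104/11037 ≈ 0.0094 and p2 = 189/11034 ≈ 0.0171, so the factor is about 0.992, which is why the odds ratio (0.546) and the relative risk (0.55) nearly coincide.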

#### Relative risk vs absolute risk

- % smokers who get lung cancer: 8% (conservative guess here)
- Relative risk of lung cancer for smokers: 800%
- Getting lung cancer is not commonplace, even for smokers. But smokers’ chances of getting lung cancer are much, much higher than non-smokers’ chances.
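The two figures can be combined to see what they imply together. This sketch assumes that "800%" means a relative risk of 8 compared with non-smokers; the non-smoker risk it derives is implied by that reading and is not a number given in the slides.

```python
# Figures from the slide; the 8% is described there as a guess.
risk_smokers = 0.08
relative_risk = 8.0  # assumption: "800%" is read as a ratio of 8

# Implied by the two numbers above, not a figure stated in the slides.
risk_non_smokers = risk_smokers / relative_risk
risk_difference = risk_smokers - risk_non_smokers

print(f"Absolute risk, smokers:    {risk_smokers:.0%}")
print(f"Implied risk, non-smokers: {risk_non_smokers:.0%}")
print(f"Risk difference:           {risk_difference:.0%}")
```

A large relative risk and a modest absolute risk are compatible whenever the baseline risk is low, which is the point of the comparison.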

#### Simpson’s paradox

- When a third variable seemingly reverses the association between two other variables
- Hot hand example
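A small numeric sketch can show how such a reversal arises. The counts below are invented purely for illustration (they are not the hot hand data, which the slides do not give); "X", "Y", and the two strata are hypothetical labels.

```python
# Invented counts for illustration only: (successes, trials) for two
# hypothetical treatments X and Y within two hypothetical strata.
strata = {
    "stratum 1": {"X": (8, 10), "Y": (70, 100)},
    "stratum 2": {"X": (40, 100), "Y": (3, 10)},
}

def rate(successes, trials):
    """Success rate as a proportion."""
    return successes / trials

# Within each stratum, X has the higher success rate (ratio above 1).
within = {
    name: rate(*cell["X"]) / rate(*cell["Y"]) for name, cell in strata.items()
}

# Pooling the strata ignores the third variable and flips the comparison.
pooled = {
    t: tuple(sum(cell[t][i] for cell in strata.values()) for i in range(2))
    for t in ("X", "Y")
}
pooled_ratio = rate(*pooled["X"]) / rate(*pooled["Y"])

for name, ratio in within.items():
    print(f"{name}: rate(X) / rate(Y) = {ratio:.2f}")
print(f"pooled:    rate(X) / rate(Y) = {pooled_ratio:.2f}")
```

The reversal happens because X is mostly observed in the stratum where success is rare and Y mostly in the stratum where success is common, so the stratum acts as a lurking variable.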
