Exploratory data analysis with two qualitative variables
Not in FPP

Exploratory data analysis with two qualitative/categorical variables
Main tools:
  Contingency tables
  Conditional, marginal, and joint frequencies

Motivating example: Surviving the Titanic
Was there class discrimination in survival of the wreck of the Titanic?
“It has been suggested before the Enquiry that the third-class passengers had been unfairly treated, that their access to the boat deck had been impeded; and that when they reached the deck the first and second-class passengers were given precedence in getting places in the boats.” Lord Mersey, 1912

Titanic: Class by survival

         1st Class   2nd Class   3rd Class   Crew    Total
Dead           122         167         528    696     1513
Alive          203         118         178    212      711
Total          325         285         706    908     2224

Titanic: Marginal frequencies
% Dead = 1513/2224 = 0.68
% Alive = 711/2224 = 0.32
% in first class = 325/2224 = 0.15
% in second class = 285/2224 = 0.13
% in third class = 706/2224 = 0.32
% crew = 908/2224 = 0.41
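The marginal frequencies can be reproduced from the cell counts in the class-by-survival table. The following is a minimal sketch, not part of the original slides; the variable names (`counts`, `outcome_marginals`, `class_marginals`) are chosen here for illustration only.

```python
# Counts from the class-by-survival table (rows: outcome, columns: class).
classes = ["1st", "2nd", "3rd", "Crew"]
counts = {
    "Dead":  [122, 167, 528, 696],
    "Alive": [203, 118, 178, 212],
}

# Grand total: everyone on board in this table.
total = sum(sum(row) for row in counts.values())

# Marginal frequency of each outcome: row total / grand total.
outcome_marginals = {k: sum(row) / total for k, row in counts.items()}

# Marginal frequency of each class: column total / grand total.
class_marginals = {
    c: sum(row[i] for row in counts.values()) / total
    for i, c in enumerate(classes)
}

for name, p in {**outcome_marginals, **class_marginals}.items():
    print(f"{name}: {p:.3f}")
```

Each marginal frequency uses only a row total or a column total, so it describes one variable at a time and says nothing yet about how class and survival are related.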

Titanic: Conditional frequencies
% (Alive | 1st) = 203/325 = 0.625
% (Alive | 2nd) = 118/285 = 0.414
% (Alive | 3rd) = 178/706 = 0.252
% (Alive | Crew) = 212/908 = 0.233

Based on these frequencies, does there appear to be class discrimination?
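A conditional frequency divides a cell count by the total of the group being conditioned on, here the class. The sketch below recomputes the survival rates per class from the table counts; the dictionary names are illustrative assumptions, not from the slides.

```python
# Counts from the class-by-survival table.
alive = {"1st": 203, "2nd": 118, "3rd": 178, "Crew": 212}
dead = {"1st": 122, "2nd": 167, "3rd": 528, "Crew": 696}

# %(Alive | class) = survivors in the class / everyone in the class.
survival_given_class = {
    c: alive[c] / (alive[c] + dead[c]) for c in alive
}

for c, p in survival_given_class.items():
    print(f"%(Alive | {c}) = {p:.3f}")
```

Comparing these rates across classes is what addresses the discrimination question; the marginal survival rate alone cannot.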

Titanic: Class by person type

           1st Class   2nd Class   3rd Class   Crew    Total
Children           6          24          79      0      109
Women            144          93         165     23      425
Men              175         168         462    885     1690
Total            325         285         706    908     2224

Titanic: percentage of men in each class
% (Man | 1st) = 175/325 = 0.54
% (Man | 2nd) = 168/285 = 0.59
% (Man | 3rd) = 462/706 = 0.65
% (Man | Crew) = 885/908 = 0.97

There are larger percentages of men in third class and crew

Surviving the Titanic
A reason for class differences in survival:
  Larger percentages of men died
  3rd class consisted of mostly men
  Hence, a larger percentage of 3rd class passengers died
Once again, keep in mind possible lurking variables that could be driving the relationship seen between two measured variables

Relative risk and odds ratios
Motivating example
Physicians’ health study (1989): randomized experiment with 22071 male physicians at least 40 years old
  Half the subjects assigned to take aspirin every other day
  Other half assigned to take a placebo, a dummy pill that looked and tasted like aspirin

Physicians’ health study
Number of people in each cell:

           Heart attack   No heart attack   Total
Aspirin             104             10933   11037
Placebo             189             10845   11034
Total               293             21778   22071

Relative risk

        y1      y2      Total
x1      a       b       a+b      Risk of y1 for level x1 = a/(a+b)
x2      c       d       c+d      Risk of y1 for level x2 = c/(c+d)
Total   a+c     b+d

$$\text{Relative risk} = \frac{a/(a+b)}{c/(c+d)}$$

Relative risk for physicians’ health study
Relative risk of a heart attack when taking aspirin versus when taking a placebo equals

$$\text{RR} = \frac{104/(104+10933)}{189/(189+10845)} = 0.55$$

People who took aspirin were 0.55 times as likely to have a heart attack as people who took the placebo.
Or, people who took the placebo were 1/0.55 = 1.82 times as likely to have a heart attack as people who took aspirin.
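The relative risk calculation can be checked with a few lines of code. This is a minimal sketch using the a, b, c, d cell layout from the relative risk table; the function name `relative_risk` is introduced here for illustration.

```python
def relative_risk(a, b, c, d):
    """Risk of y1 at level x1 divided by risk of y1 at level x2.

    a, b: counts of y1 and y2 at level x1
    c, d: counts of y1 and y2 at level x2
    """
    return (a / (a + b)) / (c / (c + d))


# Physicians' health study: x1 = aspirin, x2 = placebo,
# y1 = heart attack, y2 = no heart attack.
rr = relative_risk(104, 10933, 189, 10845)
print(f"RR (aspirin vs placebo) = {rr:.2f}")
print(f"RR (placebo vs aspirin) = {1 / rr:.2f}")
```

Swapping which group goes in the numerator inverts the relative risk, which is why the two statements about aspirin and placebo describe the same data.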

Odds ratios

        y1      y2
x1      a       b       Odds of y1 for level x1 = a/b
x2      c       d       Odds of y1 for level x2 = c/d

$$\text{Odds ratio} = \frac{a/b}{c/d}$$

Odds ratios for physicians’ health study
Relative risk of a heart attack when taking aspirin versus taking a placebo is

$$\text{RR} = \frac{104/(104+10933)}{189/(189+10845)} = 0.55$$

Odds of having a heart attack when taking aspirin over odds of a heart attack when taking a placebo (odds ratio)

$$\text{OR} = \frac{104/10933}{189/10845} = 0.546$$
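The odds ratio for the study can be recomputed the same way. This is a small sketch under the same a, b, c, d cell layout as the odds ratio table; the function name `odds_ratio` is an illustrative choice.

```python
def odds_ratio(a, b, c, d):
    """Odds of y1 at level x1 divided by odds of y1 at level x2.

    a, b: counts of y1 and y2 at level x1
    c, d: counts of y1 and y2 at level x2
    """
    return (a / b) / (c / d)


# Physicians' health study: x1 = aspirin, x2 = placebo,
# y1 = heart attack, y2 = no heart attack.
or_value = odds_ratio(104, 10933, 189, 10845)
print(f"OR (aspirin vs placebo) = {or_value:.3f}")
```

Note that the odds use the count of the other outcome (b or d) as the denominator, whereas the risk uses the row total (a+b or c+d).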

Interpreting odds ratios and relative risks
When the variables X and Y are independent:
  odds ratio = 1
  relative risk = 1
When subjects with level x1 are more likely to have y1 than subjects with level x2:
  odds ratio > 1
  relative risk > 1
When subjects with level x1 are less likely to have y1 than subjects with level x2:
  odds ratio < 1
  relative risk < 1

Which one should be used?
If the relative risk is available, it should be used.
In a cohort study, the relative risk can be calculated directly.
In a case-control study, the relative risk cannot be calculated directly, so an odds ratio is used instead.
  Case-control studies compare subjects who have a “condition” (cases) to similar subjects who do not (controls).
  In this type of study we know %(exposure | disease), but to compute the RR we need %(disease | exposure).
  Recall that RR = %(disease | exposure) / %(disease | no exposure).
The relative risk is also not available in more complex modeling (logistic regression).

Odds ratio vs relative risk
When is the odds ratio a good approximation of the relative risk?
  When cases are representative of the diseased population
  When controls are representative of the population without the disease
  When the disease being studied occurs at low frequency
Of itself, an odds ratio is a useful measure of association
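The low-frequency condition can be illustrated numerically. The sketch below holds a relative risk fixed and varies the baseline risk, then computes the odds ratio implied in each case. The relative risk of 2 and the baseline risks are hypothetical values chosen for illustration, not data from any study.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)


rr = 2.0  # hypothetical relative risk, held fixed across scenarios
or_by_baseline = {}
for baseline in (0.001, 0.01, 0.1, 0.3):  # hypothetical risk without exposure
    exposed = rr * baseline               # risk with exposure implied by rr
    or_by_baseline[baseline] = odds(exposed) / odds(baseline)

for baseline, or_value in or_by_baseline.items():
    print(f"baseline risk {baseline}: RR = {rr}, OR = {or_value:.3f}")
```

When the outcome is rare, 1 - p is close to 1 in both groups, so the odds are close to the risks and the odds ratio stays close to the relative risk. As the baseline risk grows, the odds ratio drifts further from the relative risk. The physicians' health study fits the rare-outcome case: its relative risk (0.55) and odds ratio (0.546) nearly coincide.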

Relative risk vs absolute risk
% smokers who get lung cancer: 8% (conservative guess here)
Relative risk of lung cancer for smokers: 800%
Getting lung cancer is not commonplace, even for smokers.
But smokers’ chances of getting lung cancer are much, much higher than non-smokers’ chances.

Simpson’s paradox
When a third variable seemingly reverses the association between two other variables
Hot hand example
