DATA ANALYSIS WORKBOOK

LAB 3

INTERPRETING STANDARD DEVIATIONS

OVERVIEW

The purpose of this lab is to use properties of the normal distribution to aid in the interpretation of the standard deviation of a variable measured at either the interval or ratio level (or, in the case of an ordinal variable, treated as interval). The mean and the standard deviation are the two basic descriptive statistics that form the foundation of more advanced statistics. How you interpret the standard deviation depends on the metric of the variable.

CONCEPTS: standard deviation, z-scores, normal distribution

Standard Deviation

A measure of dispersion describes the amount of variation in a variable. A measure of central tendency gives us a sense of the typical value in a distribution; a measure of dispersion gives us an idea of the degree to which the cases differ from this typical value. In fact, one could argue that the need for statistics arises because of the variability in the phenomena that social scientists study. Social scientists typically work with two measures of variation: the variance and the standard deviation. The variance is the arithmetic average of the squared mean deviation for each case (see footnote 1). The standard deviation is the square root of the variance. The standard deviation is more informative than the variance because it expresses the variability in the metric used to measure the variable rather than in the square of that metric. For example, if the variable is dollar income, the variance expresses the variability in dollars squared, while the standard deviation expresses it in dollars. The variance is still valuable because it has mathematical properties that make it easier to work with than the standard deviation. For this reason, we work with both the standard deviation and the variance in labs to come. In this lab, and indeed in the labs that focus on the distribution of a single variable, we focus on just the standard deviation.

Footnote 1: To get the mean deviation for a case, you subtract the mean of the variable from the value of the variable for the case. To get the variance, you square the deviations, sum them, and divide by the number of cases (or, in the case of sample estimates of the population variance, the number of cases minus one).
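As an illustration of the calculation described in footnote 1, the following minimal sketch computes the mean deviations, the variance, and the standard deviation for a small made-up set of scores (the values are invented for illustration only and are not lab data):

    # Made-up scores, for illustration only.
    scores = [42, 50, 58, 61, 39]

    n = len(scores)
    mean = sum(scores) / n

    # Mean deviation for each case: the value minus the mean of the variable.
    deviations = [y - mean for y in scores]

    # Variance: the average of the squared deviations.
    variance_pop = sum(d ** 2 for d in deviations) / n
    # Sample estimate of the population variance: divide by n - 1 instead of n.
    variance_sample = sum(d ** 2 for d in deviations) / (n - 1)

    # The standard deviation is the square root of the variance, expressed in
    # the original metric of the variable (e.g., dollars, years).
    sd_pop = variance_pop ** 0.5
    sd_sample = variance_sample ** 0.5

    print(mean, variance_sample, sd_sample)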


Table 1. Calculating z-scores from raw scores and raw scores from z-scores.
(Example values: ȳ = 50, s_y = 10, y = 42)

                           Equation                  Example
(1) Compute z-score        z = (y - ȳ) / s_y         z = (42 - 50) / 10 = -.8
(2) Compute raw score      y = ȳ + (z)(s_y)          y = 50 + (-.8)(10) = 42
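The two equations in Table 1 can be checked with a few lines of code. This is an illustrative sketch using the worked example from the table (mean 50, standard deviation 10, raw score 42); the function names are ours, not part of the lab materials:

    def z_score(y, y_bar, s_y):
        # Equation (1): convert a raw score to a z-score.
        return (y - y_bar) / s_y

    def raw_score(z, y_bar, s_y):
        # Equation (2): convert a z-score back to a raw score.
        return y_bar + z * s_y

    y_bar, s_y = 50, 10            # mean and standard deviation from Table 1
    z = z_score(42, y_bar, s_y)    # -> -0.8
    y = raw_score(z, y_bar, s_y)   # -> 42.0, recovering the original raw score
    print(z, y)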

Z-scores

We interpret the standard deviation as the typical deviation of a case from the mean. To educate your intuition about the standard deviation, we introduce the concept of z-scores and the properties of the normal distribution. A z-score expresses the difference between the value of a variable and the mean of that variable in standard deviation units. As equation (1) in Table 1 shows, you get a z-score from a raw score y by dividing the difference between y and the mean (ȳ) by the standard deviation (s_y). For example, if the mean of a variable is 50 and the standard deviation is 10, the raw score 42 corresponds to a z-score of -.8. (The raw score 42 is .8 of a standard deviation less than the mean.) You can use equation (2) to transform a z-score back into a raw score: multiply the z-score by the standard deviation (s_y) and add the product to the mean.

Transforming an original variable by subtracting the mean "centers" the variable. As a consequence of centering, the mean of the new variable will be zero. Transforming the values of a variable by computing z-scores in effect changes the metric of the original variable, for example from years in the case of age to standard deviation units. A consequence of "standardizing" a variable is that its standard deviation (and variance) will be one. You can think of a set of z-scores as a standardized, centered variable. Table 2 uses the respondent's age from the 1987 NORC GSS to illustrate the relation between raw and standardized (z) scores.

Table 2. The mean and standard deviation of age expressed as raw and standard scores.

Variable   Type        Metric                Mean        Standard deviation
Age        raw score   years                 44.92 yrs   17.71 yrs
Age        z-score     standard deviations   0 s.d.      1 s.d.
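To see the point illustrated in Table 2, you can standardize a variable and confirm that the resulting z-scores have a mean of zero and a standard deviation of one. The sketch below uses a small made-up set of ages, not the GSS data:

    import statistics

    # Made-up ages for illustration; these are not the 1987 GSS values.
    ages = [18, 25, 31, 44, 52, 60, 67, 73]

    mean_age = statistics.mean(ages)
    sd_age = statistics.stdev(ages)        # sample standard deviation (n - 1)

    # Centering subtracts the mean; standardizing also divides by the standard deviation.
    z_ages = [(a - mean_age) / sd_age for a in ages]

    # The z-scores have a mean of 0 and a standard deviation of 1 (up to rounding error).
    print(round(statistics.mean(z_ages), 10))   # 0.0
    print(round(statistics.stdev(z_ages), 10))  # 1.0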

Because a z-score is measured in standard deviation units, we can get some insight into the standard deviation by examining a distribution of z-scores. We can increase our understanding by viewing the distribution of z-scores from the perspective of a standard normal distribution.


Normal Distributions

The normal distribution is symmetric and bell-shaped. The center of the distribution is the mean, median, and mode. It is called "normal" because there is no skew, and the distribution is neither flat nor peaked. Consequently, the distribution of a normal variable can be determined completely by its mean and its standard deviation. A standard normal distribution is a normal distribution of a variable that has been transformed into z-scores. Like all standardized variables, it has a mean of zero and a standard deviation of one. (Note that standardizing a variable does NOT make it normal.)

Table 3 contains the relative frequencies for the standard normal distribution partitioned into intervals, or areas, with z-scores as the boundaries of each interval. There are six intervals: A, B, C, D, E, and F. For example, the first interval, A, has a lower limit z-score value of -3 and an upper limit z-score value of -2 (see footnote 2). Remember that, according to the convention used in this course, we would place a case with the lower limit of -3 in this interval along with all other cases with z-scores from -3.0 up to but not including -2.0. A case with a z-score of -2 would be contained in the next interval, B. In a normal distribution, interval A contains (roughly) 2.1% of the cases, interval B contains 13.6% of the cases, and interval C contains 34.1% of the cases. Note that the upper limit of interval C, 0, corresponds to the mean of the raw scores. Because of the symmetry of the normal distribution, intervals D, E, and F contain 34.1%, 13.6%, and 2.1% of the cases, respectively.

Table 3. The Relative Frequency Distribution of the Standard Normal Distribution Partitioned into Six Intervals of One, Two, and Three Standard Deviations Above and Below the Mean.

Interval                            A       B       C       D       E       F
Interval limits (z-score)  LL      -3      -2      -1       0       1       2
                           UL      -2      -1       0       1       2       3
Proportion                        2.1%   13.6%   34.1%   34.1%   13.6%    2.1%

(Combining intervals: C + D = 68%; B through E = 95%; A through F = 99.75%.)
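The interval proportions in Table 3 can be reproduced from the standard normal cumulative distribution function. The sketch below assumes SciPy is available; the printed percentages match Table 3 after rounding:

    from scipy.stats import norm

    # Interval boundaries in z-score units, matching intervals A through F in Table 3.
    boundaries = [-3, -2, -1, 0, 1, 2, 3]
    labels = ["A", "B", "C", "D", "E", "F"]

    for label, lo, hi in zip(labels, boundaries[:-1], boundaries[1:]):
        # Proportion of a standard normal distribution between the two boundaries.
        p = norm.cdf(hi) - norm.cdf(lo)
        print(f"{label}: {lo} to {hi}  {p:.1%}")   # A: 2.1%, B: 13.6%, C: 34.1%, ...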

Footnote 2: Technically, a normal distribution begins at minus infinity and ends at plus infinity. Consequently, there will be a few cases with z-scores either less than -3 or greater than +3.


The distribution of cases in Table 3 is only approximate because, rather than using +2 and -2 as the interval limits, we should use ±1.96. (You can confirm this fact with a table of z-scores.) Keeping this qualification in mind, we can combine the intervals A to F, as shown at the bottom of Table 3, in order to develop the following statements about the distribution of cases in a normal distribution:

- 68% of the cases fall in an interval one standard deviation below and above the mean.
- 95% of the cases fall in an interval two standard deviations below and above the mean.
- 99.75% (or nearly all) of the cases fall in an interval three standard deviations below and above the mean.

At some point during this discussion you might wonder about the relevance of this information for the study of social science variables. After all, few of the ratio, interval, or ordinal (treated as interval) measures that social scientists study have a normal distribution. There are two answers to this objection. First, when we get to inferential statistics in subsequent labs, we will find that a class of distributions (called "sampling distributions") are normal or approximately normal. Second, and more relevant to this lab, you will see that the three statements above apply to many distributions that are NOT normal. For this reason, we refer to these statements as the "empirical rule." The rule works because a "large" percentage of cases in, say, interval C will be offset by a "small" percentage of cases in interval D, so the percentage of cases in intervals C and D combined comes pretty close to 68%. We call the rule "empirical" because it happens to work for many variables. Remember, however, that it does not always work, and even when it does, it often works only approximately. The final point to keep in mind is that the empirical rule is meant to give you a basis for interpreting a standard deviation. For example, if the mean and standard deviation of age are 45 and 15 years, respectively, you can expect that approximately 68% of the people are between the ages of 30 and 60, 95% will be between the ages of 15 and 75, and nearly everybody will be between the ages of zero and 90 (see footnote 3).
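To see why the empirical rule often works approximately even when a distribution is not normal, you can compute the percentage of cases within one, two, and three standard deviations of the mean for a skewed sample. The sketch below uses randomly generated, positively skewed data purely for illustration; it is not part of the lab data:

    import numpy as np

    rng = np.random.default_rng(0)
    # A positively skewed (non-normal) variable, generated only for illustration.
    x = rng.gamma(shape=2.0, scale=10.0, size=10_000)

    mean, sd = x.mean(), x.std()

    for k in (1, 2, 3):
        within = np.mean((x >= mean - k * sd) & (x <= mean + k * sd))
        print(f"within {k} sd: {within:.1%}")   # compare with 68%, 95%, 99.75%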

DATA ANALYSIS EXAMPLE

Research Question

Summarize the distribution of the variable age (AGE, v04). Use the standard deviation and mean of the variable in your description. How well does the distribution of age (AGE, v04) in the 1987 GSS compare with the characteristics of a normal distribution? (That is, do approximately 68% of the cases fall in an interval one standard deviation below and above the mean? Do 13.6% of the cases fall in the interval between one and two standard deviations below the mean? And so on.)

Footnote 3: In the case of a survey of the adult population, the minimum age will be around 18. The reason interval A is empty and the lower limit of B is below this minimum is that the distribution of age is positively skewed.


Results

First, become familiar with the measurement characteristics of the variable age. Use the blue code book and complete the variable attributes information (1) in Table 4.

Table 4. Example from a Yellow Sheet for Lab 3.

(1) Variable Attributes
    Index and Name:        v4  AGE
    Description:           Respondent's age
    Minimum:               18
    Maximum:               89
    Metric:                years
    Level of Measurement:  ratio

(2) Statistics
    Mean:         44.92
    Std Dev:      17.71
    Minimum:      18
    Maximum:      89
    Valid Cases:  1807
    NA

The descriptive statistics for age and the z-score for age are given in Table 5 below. The mean age of the respondents in the 1987 NORC GSS is 44.92 years, and the standard deviation is 17.71 years. These statistics are based on 1807 cases out of 1819 possible cases. Copy these results into the statistics column (2) of Table 4.

Table 5. Descriptive Statistics for Age from SPSS.

                     N      Minimum   Maximum   Mean    Std. Deviation
AGE                  1807   18        89        44.92   17.705
Valid N (listwise)   1807
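The Descriptives output in Table 5 can also be reproduced with pandas. The file name and column name in the sketch below (gss1987.csv, AGE) are placeholders; substitute whatever your GSS extract actually uses:

    import pandas as pd

    # Hypothetical GSS extract; replace the path and column name with your own.
    df = pd.read_csv("gss1987.csv")
    age = df["AGE"].dropna()   # drop missing cases, as SPSS does for Descriptives

    print(age.count())                               # valid N (1807 in the 1987 GSS)
    print(age.min(), age.max())                      # 18 and 89
    print(round(age.mean(), 2), round(age.std(), 3)) # 44.92 and 17.705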

To find the percentage of cases that fall in the intervals one, two, and three standard deviations below and above the mean, we have to find the age that falls three standard deviations below the mean of 44.92 and use this number as the starting point in constructing a distribution. That is, we have to transform the z-score -3 into a raw score (see footnote 4). We use equation (2) in Table 1 to do this. Plugging the values for the mean and standard deviation into the equation, we find that -8.21 is the value for age that is three standard deviations below the mean. (The fact that it is impossible to observe this value means that the distribution of age is positively skewed. See footnote 3.)

Footnote 4: You may need to use a z-score that is more extreme than -3 (e.g., -4 or -5) in the case of variables with an extreme negative skew. (See footnote 2.)


For example: y = 44.92 + (-3)(17.71) = 44.92 - 53.13 = -8.21

Using -8.21 as the starting point, we divided age into six intervals, each one standard deviation (17.71) wide. The resulting intervals (which we label z04) correspond to the intervals A through F in Table 3. Table 6 contains the frequency distribution for this variable generated by SPSS. We examine the valid percents, since SPSS omits cases with missing values when making these computations.

Table 6. Frequency Distribution of Age Divided into Intervals of 1, 2, and 3 Standard Deviations Above and Below the Mean (z04).

Value Label      Value   Frequency   Percent   Valid Percent   Cum Percent
-8.2 to 9.5       -3           0       0.0          0.0             0.0
 9.5 to 27.2      -2         321      17.6         17.8            17.8
27.2 to 44.9      -1         683      37.5         37.8            55.6
44.9 to 62.6       0         426      23.4         23.6            79.1
62.6 to 80.3       1         325      17.9         18.0            97.1
80.3 to 98.0       2          52       2.9          2.9           100.0
98.0 to            3           0       0.0          0.0           100.0
Missing                       12        .6       Missing
TOTAL                       1819     100.0        100.0
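The recode that produces Table 6 can be sketched with pandas: start three standard deviations below the mean and cut the variable into intervals one standard deviation wide. This assumes an `age` Series like the one in the earlier sketch; the names are placeholders, not the lab's SPSS syntax:

    import pandas as pd

    # Assumes `age` is a pandas Series of valid ages, as in the earlier sketch.
    mean, sd = age.mean(), age.std()

    # Interval boundaries from 3 sd below to 4 sd above the mean
    # (the extra interval on top catches any cases beyond +3 sd).
    edges = [mean + k * sd for k in range(-3, 5)]

    z04 = pd.cut(age, bins=edges, right=False)   # lower limit included, upper excluded
    valid_pct = z04.value_counts(sort=False, normalize=True) * 100

    print(valid_pct.round(1))   # compare with the Valid Percent column of Table 6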

To see whether the empirical rule applies to this distribution, we can compare the valid percents with the percents given by the empirical rule. (In addition to the percentages in Table 3, the yellow sheets also contain this distribution.) As a criterion for deciding whether the observed percentage distribution behaves according to the empirical rule, use five percentage points: if the two sets of percentages for an interval differ by more than five points, conclude that the distribution of the variable does not behave according to the empirical rule. In the case of the age distribution, we observe the following. First, 61.4% (= 37.8 + 23.6) of the cases fall in the interval one standard deviation below and above the mean (27.2 to 62.6). Second, 97.2% (= 17.8 + 61.4 + 18.0) fall in the interval two standard deviations below and above the mean (9.5 to 80.3). Third, all (100%) of the cases fall in the interval three standard deviations below and above the mean (-8.2 to 98.0). In deciding whether the empirical rule applies, we could conclude that the first statement is off by a bit, since 61.4% differs from 68% by more than 5 points. On the other hand, the two other statements fit the data. (Both 97.2% and 100% are less than 5 points away from 95% and 99.75%, respectively.)
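The comparison with the empirical rule can then be made mechanically: sum the valid percents for the relevant intervals and flag any difference greater than five percentage points. The numbers below are copied from Table 6:

    # Valid percents for intervals A through F, taken from Table 6 (z04).
    observed = {"A": 0.0, "B": 17.8, "C": 37.8, "D": 23.6, "E": 18.0, "F": 2.9}

    checks = {
        "within 1 sd": (observed["C"] + observed["D"], 68.0),
        "within 2 sd": (observed["B"] + observed["C"] + observed["D"] + observed["E"], 95.0),
        # Sums to 100.1 rather than 100.0 because of rounding in Table 6.
        "within 3 sd": (sum(observed.values()), 99.75),
    }

    for label, (obs, expected) in checks.items():
        flag = "off by more than 5 points" if abs(obs - expected) > 5 else "fits"
        print(f"{label}: observed {obs:.1f}%, expected {expected}%, {flag}")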


Explanations

The box below provides an example of how you would summarize the results. Be sure to include the descriptive statistics (the mean and standard deviation), the range of scores for the intervals one and two standard deviations above and below the mean, and the percentage of cases in each range. Keep the description simple and use the metric of the variable. In the interpretation part, briefly describe the fit of the empirical rule to the frequency distribution.

For the variable age, a typical deviation from the mean of 45 years is about 18 years (the standard deviation). Sixty-one per cent of the cases are between 27 and 63 years of age. About 97% of the cases are between 18 and 80 years of age, and all of the cases are between the ages of 18 and 89 (see footnote 5). Although 61% is less than the expected 68%, the other two percentages fit the empirical rule.

RESEARCH QUESTIONS

Examine the frequency distribution of the following four variables. Does the empirical rule "fit" the distribution of each of the variables?

1. The prestige of the spouse's occupation. (v28)
2. The total income of the respondent's family. (v32)
3. The number of children the respondent has had. (v19)
4. The number of organizations to which the respondent belongs. (v36)

Step 1 Variable Attributes (Lab Exercise 3.1)

- Use the blue code book to determine the attributes of each of the variables.
- What is the metric and level of measurement of the variables?
- Copy the information from the codebook to the yellow sheet.

Step 2 Statistically Describe the Variables (Lab Exercise 3.1)

- Use SPSS (Statistics/Summarize/Descriptives) to get the statistics (Mean, Standard Deviation, Valid Cases).
- Copy the information from the screen to the yellow sheet.
- Repeat this step for each of the variables by replacing the appropriate variable in the variable box.

Footnote 5: Because the respondents had to be a minimum of 18 years old to participate in the study, the low end of the distribution is truncated. In addition, the upper end is also truncated, since NORC codes everybody aged ninety or over as 89.


Step 3 Distributions of Variables in Standard Deviation Units

- To compare the distribution of a variable with the normal distribution, the coding of the variable must be done in standard deviation units. We have transformed the raw values of each variable into six categories whose limits are 1, 2, and 3 z-score values below and above the mean. These variables are named z28, z32, z19, and z36. (We use the prefix "z" rather than "v" since the limits of the intervals represent z-scores.) You will need the transformed variables to get the observed frequency distributions.
- Use SPSS (Statistics/Summarize/Frequencies) to get a frequency distribution of the transformed (z-score) variables. Copy the information from the screen to the yellow sheet (one sheet per variable).
- Note that the Value of the variable is now in z-score units; one unit equals one standard deviation of that variable. The Value denotes the lower limit (i.e., the left-end value or boundary) of the category. Z-scores are continuous scores.
- Note that the Value Label gives the lower and upper limits of the category in the variable's original metric and treats the variable as if it were continuous. Use the continuous-score lower and upper limits to determine the discrete-score lower and upper limits. Hint: refer to the minimum and maximum values and the metric of the variable.
- Note the Valid Percent of cases in each standard deviation category.
- Repeat this step for each of the variables by replacing the appropriate z-score variable in the variable box.

Step 4 Compare the Observed Distributions with a Normal Distribution

- Compare the observed and expected percentage distributions you noted in Step 3 and determine whether the variable is approximately normally distributed. Note any large differences (greater than 5 percentage points) between the observed and expected percentages.
- From Step 3, determine the discrete raw scores that are plus and minus one, two, and three standard deviations from the mean. What percentage of the cases lie in these intervals?
- For each variable, describe the distribution in narrative form on the back side of the yellow sheet, using the mean and standard deviation. Report the range of scores that are plus and minus one standard deviation of the mean and the percentage of cases in the actual distribution within that range. Likewise, report the range and percentage for scores within two standard deviations of the mean. Refer to the example in the white sheets.
