Social Studies 201

September 19, 2003

Presenting data – chapter 4

1. Introduction

Chapter 4 presents various ways of organizing data. When presenting quantitative data, it is not common to merely list all the values of the variable, as is the case for the list of incomes in the stem and leaf display handout or the spreadsheet of the data set ssae98.sav. Rather, values of the variable are organized into tables, diagrams, graphs, and summary statistics (chapter 5) to make the data more amenable to examination and understanding. There are no necessarily correct ways to organize the values of a variable from a particular data set, but there are a number of incorrect or misleading ways to organize the data. The aim is to present data in a form that properly portrays the characteristics of the sample or population and clearly and accurately illustrates the issues or phenomena being studied.

The material in chapter 4 provides guidelines about how to organize quantitative data. Many of the rules, procedures, and notation outlined in chapter 4 are consistent with methods statisticians use when analyzing data sets.

Today’s notes present the following:

• Rules concerning notation when working with statistics.
• Frequency, proportion, and percentage distributions.
• Class limits and real class limits.

Issues concerning production and definition of data are not discussed in these notes. It will be assumed that the data have been well produced (chapter 2), and the population and variables clearly defined (chapter 3). Here I discuss how the data obtained by the researcher can be organized for statistical analysis.

2. Tally or count – pp. 104-6

If there are not too many cases in a data set, it may be possible to count the number of cases taking on each value of the variable, and this can be a quick and efficient way of organizing or summarizing results from a small data set.

Example. The first ten cases in the ssae98.sav data set gave the following responses to question 13, what should be the priority for using the federal surplus. The responses have been coded into values 1-4, for purposes of entry into the computer data set. The codes are as follows: 1 represents reducing debt, 2 represents reducing taxes, 3 represents spending for infrastructure, and 4 represents spending for social programs.

Identification no.    Response
        1                4
        2                4
        3                1
        4                1
        5                4
        6                4
        7                2
        8                2
        9                1
       10                1

A count or tally shows that 4 respondents favour reducing debt (code 1), 2 respondents favour reducing taxes (code 2), none favour infrastructure spending (code 3), and 4 favour social spending (code 4).
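The same tally can be sketched in a few lines of Python. This is only an illustration, not a procedure from the text: the variable names (responses, labels, tally) are made up here, and the ten responses are typed in by hand rather than read from ssae98.sav.

```python
from collections import Counter

# Responses of the first ten cases to question 13, listed in order of identification no.
responses = [4, 4, 1, 1, 4, 4, 2, 2, 1, 1]

# Meaning of each code, as given in the notes.
labels = {
    1: "reducing debt",
    2: "reducing taxes",
    3: "spending for infrastructure",
    4: "spending for social programs",
}

# Count how many times each code occurs.
tally = Counter(responses)

# Loop over every possible code so that a code nobody chose (3) still shows a count of 0.
for code, label in labels.items():
    print(f"{code}  {label:30s} {tally[code]}")
```

Looping over the list of possible codes, rather than over the responses actually observed, keeps the empty category (code 3) in the summary.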

While a tally or count of this sort may be the most efficient way of organizing data in a small data set (in this example there were only ten cases), this method is not so efficient when organizing data from a data set with more cases.

3. Stem and leaf display – pp. 105-114

This was discussed last day. The stem and leaf method is an efficient way of organizing data into categories or intervals when there are more cases than in the previous example, but not too many cases. I use the stem and leaf display when there are from, say, twenty-five cases up to one hundred or a few more. If there are over 150 cases, it may be too time consuming to use this method.

The stem and leaf display is also most useful when the values of the variable have a greater range than from 0 to 9. In the above example, the stem and leaf display would not be necessary because the possible values are only 1-4. But if the values for a variable exceed 10, the stem and leaf display may be a useful procedure. For example, for ages, hours worked, or grades of respondents, where values range from 0 to 80 or 90, the stem and leaf display is an ideal method of organizing the data.

One advantage a stem and leaf display has over a tally is that all the original values of the data continue to be recorded in the display, so none of the original information is lost. For example, in the ordered stem and leaf display of September 16, all the original incomes of households are listed in order.

A stem and leaf display is thus an efficient way of organizing some data sets – so long as there are not too many cases, it provides a way of maintaining all the original information but organizing all the values in numerical order, from smallest to largest.
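As a rough sketch of how an ordered stem and leaf display is built, the following Python uses the tens digit as the stem and the units digit as the leaf. The ages are hypothetical values invented for illustration (they are not from the income handout or from ssae98.sav), and the function name stem_and_leaf is likewise made up. The sketch assumes non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to its leaves (units digits), in numerical order."""
    display = defaultdict(list)
    for v in sorted(values):          # sorting first gives an ordered display
        stem, leaf = divmod(v, 10)    # e.g. 35 -> stem 3, leaf 5
        display[stem].append(leaf)
    return dict(display)

# Hypothetical ages, invented for illustration only.
ages = [23, 35, 19, 41, 27, 35, 52, 20, 38, 44, 29, 31]

display = stem_and_leaf(ages)

# Print every stem between the smallest and largest, even one with no leaves.
for stem in range(min(display), max(display) + 1):
    leaves = " ".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem} | {leaves}")
```

Every original value can be read back from the display (stem 3 with leaves 1 5 5 8 is 31, 35, 35, 38), which is the sense in which no information is lost.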

3. Computers

A statistical program, such as SPSS, takes all the original values of the data and organizes them into tables and diagrams. In the labs we will become familiar with the various procedures available for organizing data using SPSS.

4. Notation conventions

When learning any new academic discipline or procedure, a large part of the process is to learn the language of the discipline. In the case of statistical analysis, there are some conventions about how to label and organize data. Some of these follow.

• Variables: X, Y, Z – upper-case letters near the end of the alphabet are used to denote variables. For example X might represent variables such as age or grades. In Chapter 6, Z will be used to represent the standardized normal variable.
• Frequency of occurrence: f – a lower-case letter f is used to denote the number of times each value of a variable occurs. In the above example, f takes on values 4, 2, 0, and 4, ie. these are the frequencies of occurrence of the different values of the variable, the number of respondents who take on each value.
• Proportion: p – the lower-case letter p is used to denote the proportion of total cases in the sample that take on a particular value.
• Percentage: P – the upper-case letter P is used to denote the percentage of total cases in the sample that take on a particular value.
• Indexes – pp. 99-100. These are the subscripts on f and the variables X, Y, etc. Subscripts on f, X, and Y are termed indexes and are used to denote the different values of the variable.
• Sample size: n – a lower case letter n is ordinarily the symbol for the sample size, that is, the number of cases in the data set.
• Population size: N – an upper case letter N is ordinarily the symbol for the size of the population from which a sample of size n is drawn. If a census of the population is conducted so that all N members of the population are surveyed, then n=N. But it is more common to sample only a subset of the population, so generally n<N.

5. Distributions

When summarizing data for statistical analysis, it is most common to present data as a frequency distribution (or proportional or percentage distribution). A distribution is a way of presenting the data that provides the list of possible values of the variable (X), along with the relative occurrence of each value or set of values. The occurrence may be in terms of frequencies (f), proportions, or percentages. Examples of each are provided in the following notes.

a. Frequency Distribution – p. 95.

A frequency distribution lists the values of a variable along with the frequency of occurrence of each value.

For the above example of n=10 responses to question 13 of the survey, the frequency distribution table is as follows.

X        f
1        4
2        2
3        0
4        4
Total   10

When presenting a frequency distribution table with X and f, these algebraic symbols should be defined somewhere in, or near, the table, and the table should be labelled. In this case, this might be done as follows.

Frequency distribution table of n=10 respondents’ view of priority for use of federal surplus (X)

X        f
1        4
2        2
3        0
4        4
Total   10

The frequency distribution table is the most common way to present a table of statistical data. These will be used extensively through the semester.

b. Proportional distribution – p. 117

Instead of presenting the data as frequencies of occurrence, the distribution may be presented as a proportional distribution. In such a table, instead of showing frequencies of occurrence, the table lists the proportion of cases taking on each value of the variable.

The proportion (p) is the fraction of cases taking on any particular value. Thus

p = f / n

That is, if there is a sample of size n cases, and f represents the frequency of occurrence of each value, then the proportion of cases taking on that value is p or f/n. The proportional distribution for the above example is as follows.

Proportional distribution table of n=10 respondents’ view of priority for use of federal surplus (X)

X        p
1        0.4
2        0.2
3        0.0
4        0.4
Total    1.0

For the above table, the values of p are obtained as follows:

For X = 1, p = f / n = 4 / 10 = 0.4

For X = 2, p = f / n = 2 / 10 = 0.2

and so on.

Note that the sum of the proportions must equal 1.0 – that is, all the n cases must be included. If you construct a proportional distribution and the sum of the proportions is too far off from 1.0, there may be a calculating error. If it is 0.999 or 1.002, a small difference of this sort may emerge just from rounding off values of the proportions (see pp. 126-7 for some rules on rounding).
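The calculation p = f / n and the check on the total can be sketched in Python as follows. The names used here are illustrative, and the tolerance of 0.005 is an assumption chosen for this sketch, not a rule from the text.

```python
# Frequency distribution from the notes: value of X -> frequency f.
frequencies = {1: 4, 2: 2, 3: 0, 4: 4}

# Sample size n is the sum of the frequencies (10 here).
n = sum(frequencies.values())

# p = f / n for each value of X.
proportions = {x: f / n for x, f in frequencies.items()}

for x, p in proportions.items():
    print(f"{x}      {p:.1f}")

total = sum(proportions.values())
print(f"Total  {total:.1f}")

# A total slightly off 1.0 can come from rounding; a larger gap suggests an error.
TOLERANCE = 0.005  # assumed threshold, for illustration only
if abs(total - 1.0) > TOLERANCE:
    print("Sum of proportions is too far from 1.0; check the calculations.")
```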

c. Percentage distribution – p. 117

Instead of presenting the data as frequencies of occurrence, the distribution may be presented as a percentage distribution. In such a table, instead of showing frequencies of occurrence, the table lists the percentage of cases taking on each value of the variable.

The percentage (P) is the fraction of cases taking on any particular value (p), multiplied by 100 per cent. That is, a percentage is just a proportion per one hundred or per cent. Thus

P = (f / n) x 100%

That is, if there is a sample of size n cases, and f represents the frequency of occurrence of each value of a variable X, then the percentage of cases taking on a particular value of X is P, which is p x 100% or (f / n) x 100%. The percentage distribution for the above example is as follows.

Percentage distribution table of n=10 respondents’ view of priority for use of federal surplus (X)

X        P (%)
1         40
2         20
3          0
4         40
Total    100

The values of P are obtained as follows:

For X = 1, P = (f / n) x 100% = (4 / 10) x 100% = 0.4 x 100% = 40%

For X = 2, P = (f / n) x 100% = (2 / 10) x 100% = 0.2 x 100% = 20%

and so on.

Note that the sum of the percentages must equal 100% – that is, all the n cases must be included and accounted for. If you construct a percentage distribution and the sum of the percentages is too far off from 100%, there may be a calculating error. If it is 99.9% or 100.2%, a small difference of this sort may emerge just from rounding off values of the percentages (see pp. 126-7 for some rules on rounding). In this case, you may wish to adjust one of the percentages slightly so the sum of the column is actually 100% (see notes on p. 133 for an example of how to deal with this problem).
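A short Python sketch of the percentage calculation, including a case where rounding leaves the total at 99.9%. The function name percentage_distribution is made up for this sketch, and the second set of frequencies (three values with one case each) is hypothetical, chosen only to show the rounding effect. The sketch does not attempt the adjustment described on p. 133.

```python
def percentage_distribution(frequencies, decimals=1):
    """Return {value of X: P}, where P = (f / n) x 100 rounded to `decimals` places."""
    n = sum(frequencies.values())
    return {x: round(f * 100 / n, decimals) for x, f in frequencies.items()}

# Survey example from the notes: the percentages sum to exactly 100.
survey = percentage_distribution({1: 4, 2: 2, 3: 0, 4: 4})
print(survey, "sum:", sum(survey.values()))

# Hypothetical frequencies: each percentage rounds to 33.3, so the rounded
# column adds to 99.9 even though nothing has been miscalculated.
thirds = percentage_distribution({1: 1, 2: 1, 3: 1})
print(thirds, "sum:", round(sum(thirds.values()), 1))
```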

6. Class Limits – p. 134

For presentation of a frequency or percentage distribution of a continuous variable, the values of the variable must be grouped into intervals. The same is true for a discrete variable with many possible values – it would be inefficient to list all the values for such a variable. When working with data of this type, the end points of the intervals are the class limits. In some cases, there is a gap between the end points of the intervals and, for graphical presentation of the data, it may be worthwhile to determine what are called the real class limits.

The example of the stem and leaf display illustrates this. In the stem and leaf presentation, the variable X is income in dollars. While income is a continuous variable, the values have been rounded to the nearest dollar. The values could potentially be from 0 to 110,000, meaning there would be 110,000 rows to the table. In this case, I grouped the data into groups of ten thousand dollars. That is, the variable X is income in thousands of dollars, and the intervals used for summarizing these data were:

0-9

10-19

20-29

30-39

40-49

etc.

The values 0, 9, 10, 19, 20, 29, 30, 39, and so on, are the apparent class limits. But there is a gap of one unit between the intervals. This gap emerges not because there is really a gap in the sample or population, but because of my decision to group into the categories noted above. For example, if I had decided to group into categories such as 0-4, 5-9, 10-14, etc. the gaps would be different. In order to correct for these gaps between intervals of grouping data, the real class limits can be calculated.

The real class limits are the values midway between the end points of the adjacent intervals. The first interval ends at 9 and the next begins at 10; the midpoint between these two is 9.5. Similarly, the midpoint between 19 and 20 is 19.5. The resulting real class limits are as follows:

-0.5 – 9.5

9.5 – 19.5

19.5 – 29.5

29.5 – 39.5

39.5 – 49.5

etc.

That is, calculation and use of the real class limits means that the apparent gap between intervals is eliminated. This gap originally emerged because of the way the analyst decided to organize the data, not because of any inherent gap in the data. If real class limits are used when presenting data as a frequency distribution table or histogram (p. 145), there is no gap between the bars. There should be no gap since the variable is continuous, so there should be no values that cannot occur. Figure 1 in the stem and leaf display shows how the frequency distribution table of Table 3 can be presented as a histogram.
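The midpoint rule for real class limits can be sketched in Python. The function name real_class_limits is made up for illustration, and the sketch assumes at least two intervals with the same gap between every pair of adjacent intervals, as in the income example.

```python
def real_class_limits(apparent):
    """Convert apparent class limits, given as (lower, upper) pairs, to real class limits.

    Each real limit lies midway between the end of one interval and the
    start of the next, so every limit moves outward by half the gap.
    """
    gap = apparent[1][0] - apparent[0][1]   # e.g. 10 - 9 = 1
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in apparent]

apparent = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49)]
real = real_class_limits(apparent)

for (lower, upper), (real_lower, real_upper) in zip(apparent, real):
    print(f"{lower}-{upper}  becomes  {real_lower} to {real_upper}")
```

Each interval now ends exactly where the next begins, which is why the bars of a histogram drawn with real class limits touch.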

Real class limits also make sense in two other ways:

• Interval width. The width of the interval is the difference between the upper and lower class limits of the interval. If the apparent class limits of 9, 10, 19, 20, etc. are used, it appears that the interval width is 9 thousand dollars, ie. 19 – 10 = 9, 29 – 20 = 9, and so on. But this is misleading, since the interval from 10 to 19 really represents 10, not 9, units of the variable. By using real class limits, the interval widths in the above example are 10, that is, 19.5 – 9.5 = 10, 29.5 – 19.5 = 10, and so on. This is the proper interval width, since the data have been organized into groups of 10. Note that in order to make the first interval ten units wide, it is necessary to make the first interval start at –0.5, rather than at 0. While this looks a bit odd, this preserves the proper interval width of 10, since 9.5 – (-0.5) = 10.
• Rounding. The data in the stem and leaf display have been rounded to the nearest thousand dollars. A case reported as 9 thousand dollars thus might have any income up to \$9,500. An income such as \$9,521, just above \$9,500, would have been rounded up to 10 thousand dollars, and included in the 10-19 interval. In contrast, an income such as \$9,498 would be rounded down to 9 thousand dollars, and included in the 0-9 interval. Thus the value of \$9,500, or 9.5 thousand dollars, is the proper dividing point between the 0-9 and 10-19 intervals, and this is where the real class limit for these intervals is located. Using the real class limits preserves the dividing point between categories, when conventional rules on rounding are used to group data (see pp. 139-142 for a fuller explanation and other examples).
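The two points above can be checked with a small Python sketch. The function names to_thousands and interval_for are hypothetical, and rounding an exact half upward is an assumption made here; the rounding rules in the text may treat that case differently.

```python
import math

def to_thousands(income_dollars):
    """Round a dollar income to the nearest thousand (exact halves go up, by assumption)."""
    return math.floor(income_dollars / 1000 + 0.5)

def interval_for(thousands, width=10):
    """Return the apparent class limits (lower, upper) of the interval holding a rounded value."""
    lower = (thousands // width) * width
    return lower, lower + width - 1

# The two incomes from the rounding example fall on either side of 9.5 thousand.
for income in (9498, 9521):
    k = to_thousands(income)
    lower, upper = interval_for(k)
    print(f"${income:,} rounds to {k} thousand, interval {lower}-{upper}")

# Interval width of the 10-19 interval, using apparent and then real class limits.
print(19 - 10)       # apparent limits suggest a width of 9
print(19.5 - 9.5)    # real limits give the proper width of 10
```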

When graphing data with large values, eg. 100-199, 200-299, 300-399, etc., the difference between the apparent and real class limits is too small to show up in the graph. In this case, it is probably preferable to just use intervals such as 100-200, 200-300, 300-400, etc. and forget about the real class limits – they are only a 0.5 difference, too small to have much effect on values such as 300, 400, etc.

If the data have been grouped into intervals where there is no gap between the end points of the intervals, then the real and apparent class limits are identical and this whole problem of dealing with the gap is eliminated. For example, in the grouping

0-5

5-10

10-15

15-20

etc.

the values 0, 5, 10, 15, 20, and so on are both the apparent and real class limits. In this case, do not attempt to split the difference between the end points of adjacent intervals – the analyst has already taken care of this problem by running each interval right up against the next interval.

Paul Gingrich

September 23, 2003