Social Studies 201

# January 14, 2004

The notes for January 14, 2004 cover the section on continuous and discrete variables in Chapter 3 and the first part of Chapter 4, presenting data.

A.  Continuous and discrete

Another way of looking at the type of variables is to divide variables into those that have a discrete, or countable, set of possible values, and those that are continuous in nature, with an infinite and uncountable number of possible values.  While this distinction is not as important as the nominal, ordinal, interval, ratio distinction, statistical analysis can differ depending on whether the scale is discrete or continuous.

Definition of discrete scale (p. 72).  A discrete variable is a variable which can take on only a countable number of values.

Many variables have this characteristic in that there are only a few possible values – sex (male or female), political party supported (Liberal, NDP, Saskatchewan, Green), and year of program (first, second, third, fourth).  While there may be more political parties, it will always be possible to count the number of political parties.  In the case of a variable such as religion or ethnicity, there may be a very large number of possible values, but again they could potentially be counted, if a researcher had sufficient time and resources.  Any variable such as this, where there are a number of distinct and countable characteristics, is a discrete variable.

It may be possible for a discrete variable to have an extremely large number of values, even an infinite number, but to be discrete they must be countable.  That is, each possible value must be a distinct and independent entity for the variable to have a discrete set of values.  Any measurement of number of people, for example, the number of students in a class, is a discrete variable – each student is a distinct human being, and the number of students can be counted.  The number of stars in the universe may be infinite, but each star is a distinct and independent entity, distinguishable from other stars.  As a result, the number of stars can be considered a discrete variable.  In most cases where the term “number of” is used, this means that the variable is discrete.

In contrast to discrete variables, if the number of possible values of some variables cannot be counted, then the variable may be continuous.  For example, the number of possible temperatures or heights cannot be counted – these are characteristics that are inherently continuous.

Definition of continuous variable (p. 73).  A continuous variable is a variable that can assume any value along some line interval.

Any characteristic that can be matched up with all the points along a line can be considered continuous.  The number of points on a line cannot be counted; rather, the variable can move continuously along the line, to any point on the line.  In the case of age, while we ordinarily round our age to age as of last birthday, or nearest birthday, anyone’s actual age increases continuously.  Age is measured in time, and time goes on continuously, not in discrete jumps.

Liquids are continuous in nature, in that they are not divided into discrete parts, but flow continuously.  Measures of volume, such as litres or gallons, are thus continuous measures.

It is common for researchers or statistical analysts to round variables that are continuous in nature to a discrete set of measurements.  In the case of height, while children grow in a continuous fashion until they reach their adult height, the height may be reported by rounding to the nearest centimetre or inch.  Just because the variables have been rounded does not mean they are discrete.  The way I often describe these is to say these are continuous variables, but reported as a discrete set of measurements.

Two commonly used social science variables that may be confusing in this respect are attitudes and income.  I consider both of these as continuous in nature, but often these are reported in only a discrete set of values.   Consider the following line:

Strongly disagree                                  Strongly agree

__________________________________

Suppose a researcher asks an individual to mark his or her position on the issue at some point along the line.  This might be a useful way of obtaining data on attitudes or opinions, and since all the possible attitudes or opinions can be matched with points along a continuous line, this demonstrates that attitudes are continuous.

In the case of income, the situation is similar.  Incomes are ordinarily measured in dollars and the dollar value can be anywhere from zero (very poor) to many millions of dollars (very rich).  While the smallest monetary unit in circulation is one cent, there is no reason why any monetary value cannot be calculated in fractions of a cent.  If you examine the business pages of the newspaper, foreign exchange values are often given to several decimal places.  As a result, income or any other variable measured in monetary terms (dollars), can be considered to be continuous in nature.  However, we usually round these values and report incomes to the nearest dollar.

The distinction between continuous and discrete is less important than the scale of measurement.  But again, the way that data is presented and the mathematical and statistical operations that can be used on continuous data differ somewhat from what is possible with discrete data.

Conclusion to Chapter 3

There are many differences in the different types of variables.  For statistical work, the scale of measurement is the most important consideration.  When encountering a variable that you have not used before, one of the first questions to ask is how it was measured, what type of scale it has (nominal, ordinal, interval, ratio) and whether it is discrete or continuous.  Depending on the answers to these, the way the data are presented and the forms of statistical analysis conducted on the data may differ considerably.

In Chapter 4, we examine how data can be presented.  As you proceed through this chapter, first take note of the type of variable, and this will help you determine how the data about the variable is most appropriately presented.

B.  Presenting data – chapter 4

1. Introduction

Chapter 4 presents various ways of organizing data.  When presenting quantitative data, it is not common to merely list all the values of the variable, as is the case for the list of incomes in the stem and leaf display handout or the spreadsheet of the data set ssae98.sav.  Rather, values of the variable are organized into tables, diagrams, graphs, and summary statistics (chapter 5) to make the data more amenable to examination and understanding.  There are no necessarily correct ways to organize the values of a variable from a particular data set, but there are a number of incorrect or misleading ways to organize the data.  The aim is to present data in a form that properly portrays the characteristics of the sample or population and clearly and accurately illustrates the issues or phenomena being studied.

The material in chapter 4 provides guidelines about how to organize quantitative data.  Many of the rules, procedures, and notation outlined in chapter 4 are consistent with methods statisticians use when analyzing data sets.

These notes examine the following:

·        Rules concerning notation when working with statistics.

·        Frequency, proportion, and percentage distributions.

·        Class limits and real class limits.

Issues concerning production and definition of data are not discussed in these notes.  It will be assumed that the data have been well produced (chapter 2), and the population and variables clearly defined (chapter 3).  Here I discuss how the data obtained by the researcher can be organized for statistical analysis.

2. Tally or count – pp. 104-6

If there are not too many cases in a data set, it may be possible to count the number of cases taking on each value of the variable, and this can be a quick and efficient way of organizing or summarizing results from a small data set.

Example.  The first ten cases in the ssae98.sav data set gave the following responses to question 13, what should be the priority for using the federal surplus.  The responses have been coded into values 1-4, for purposes of entry into the computer data set.  The codes are as follows: 1 represents reducing debt, 2 represents reducing taxes, 3 represents spending for infrastructure, and 4 represents spending for social programs.

| Identification no. | Response |
| --- | --- |
| 1 | 4 |
| 2 | 4 |
| 3 | 1 |
| 4 | 1 |
| 5 | 4 |
| 6 | 4 |
| 7 | 2 |
| 8 | 2 |
| 9 | 1 |
| 10 | 1 |

A count or tally shows that 4 respondents favour reducing debt (code 1), 2 respondents favour reducing taxes (code 2), none favour infrastructure spending (code 3), and 4 favour social spending (code 4).

While a tally or count of this sort may be the most efficient way of organizing data in a small data set (in this example there were only ten cases), this method is not so efficient when organizing data from a data set with more cases.
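With a computer, the same count can be produced for a data set of any size.  The following is a minimal Python sketch of the tally for the ten responses above; the variable names (responses, labels, tally) are chosen here for illustration and do not come from the text or from SPSS.

```python
from collections import Counter

# Responses of the first ten cases to question 13, in order of
# identification number, as listed in the example above.
responses = [4, 4, 1, 1, 4, 4, 2, 2, 1, 1]

# Meaning of each code, as given in the example.
labels = {
    1: "reducing debt",
    2: "reducing taxes",
    3: "spending for infrastructure",
    4: "spending for social programs",
}

# Count how many times each code occurs.
tally = Counter(responses)

# Report every code, including any that no respondent chose
# (a Counter returns 0 for a value it has never seen).
for code in sorted(labels):
    print(code, labels[code], tally[code])
```

Looping over the full list of codes, rather than over the responses that happen to occur, keeps code 3 in the summary with a count of zero.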

3.  Stem-and-leaf display – pp. 105-114

See the file on the stem-and-leaf display.  The stem-and-leaf method is an efficient way of organizing data into categories or intervals when there are a larger number of cases than in the previous example, but not too many cases.  I use the stem and leaf display when there are from, say, twenty-five cases to up to one hundred or a few more cases.  If there are over 150 cases, it may be too time consuming to use this method.

The stem and leaf display is also most useful when the values of the variable have a greater range than from 0 to 9.  In the above example, the stem and leaf display would not be necessary because the possible values are only 1-4.  But if the values for a variable exceed 10, the stem and leaf display may be a useful procedure.  For example for ages, hours worked, or grades of respondents, values that range from 0 to 80 or 90, the stem and leaf display is an ideal method of organizing the data.

One advantage a stem and leaf display has over a tally is that all the original values of the data continue to be recorded in the display, so none of the original information is lost.  For example, in the ordered stem and leaf display of September 16, all the original household incomes are listed in order.

A stem and leaf display is thus an efficient way of organizing some data sets – so long as there are not too many cases, it provides a way of maintaining all the original information but organizing all the values in numerical order, from smallest to largest.
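To illustrate the idea, here is a short Python sketch of an ordered stem-and-leaf display.  The ages in it are hypothetical values made up for this illustration, not data from the course handout, and the function name stem_and_leaf is my own.  The sketch assumes non-negative whole numbers, with the tens as the stem and the units digit as the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return an ordered stem-and-leaf display as {stem: [leaves]}.

    Assumes non-negative whole numbers: the stem is the number of tens
    and the leaf is the units digit.
    """
    display = defaultdict(list)
    for value in sorted(values):          # sorting gives ordered leaves
        stem, leaf = divmod(value, 10)
        display[stem].append(leaf)
    # Keep stems that have no leaves, so gaps in the data remain visible.
    stems = range(min(display), max(display) + 1)
    return {stem: display.get(stem, []) for stem in stems}

# Hypothetical ages of twelve respondents (illustration only).
ages = [23, 35, 19, 41, 28, 35, 52, 20, 47, 31, 29, 60]

display = stem_and_leaf(ages)
for stem, leaves in display.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```

Every original value can be read back from the display, for example stem 3 with leaves 1, 5, 5 represents the ages 31, 35, and 35.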

3.  Computers

A statistical program, such as SPSS, takes all the original values of the data and organizes them into tables and diagrams.  In the labs we will become familiar with the various procedures available for organizing data using SPSS.

4.  Notational conventions

When learning any new academic discipline or procedure, a large part of the process is to learn the language of the discipline.  In the case of statistical analysis, there are some conventions about how to label and organize data.  Some of these follow.

·        Variables:  X, Y, Z – upper-case letters near the end of the alphabet are used to denote variables.  For example X might represent variables such as age or grades.  In Chapter 6, Z will be used to represent the standardized normal variable.

·        Frequency of occurrence:  f – a lower-case letter f is used to denote the number of times each value of a variable occurs.  In the above example, f takes on values 4, 2, 0, and 4, ie. these are the frequencies of occurrence of the different values of the variable, the number of respondents who take on each value.

·        Proportion:  p – the lower-case letter p is used to denote the proportion of total cases in the sample that take on a particular value.

·        Percentage:  P – the upper-case letter P is used to denote the percentage of total cases in the sample that take on a particular value.

·        Indexes – pp. 99-100.  These are the subscripts on f and the variables X, Y, etc.  Subscripts on f, X, and Y are termed indexes and are used to denote the different values of the variable.

·        Sample size:  n – a lower case letter n is ordinarily the symbol for the sample size, that is, the number of cases in the data set.

·        Population size: N – an upper case letter N is ordinarily the symbol for the size of the population from which a sample of size n is drawn.  If a census of the population is conducted so that all N members of the population are surveyed, then n=N.  But it is more common to sample only a subset of the population, so generally n<N.

5.  Distributions

When summarizing data for statistical analysis, it is most common to present data as a frequency distribution (or proportional or percentage distribution).  A distribution is a way of presenting the data that provides the list of possible values of the variable (X), along with the relative occurrence of each value or set of values.  The occurrence may be in terms of frequencies (f), proportions, or percentages.  Examples of each are provided in the following notes.

a. Frequency Distribution – p. 95.

A frequency distribution lists the values of the variable along with the frequency of occurrence of each value.

For the above example of n=10 responses to question 13 of the survey, the frequency distribution table is as follows.

| X | f |
| --- | --- |
| 1 | 4 |
| 2 | 2 |
| 3 | 0 |
| 4 | 4 |
| Total | 10 |

When presenting a frequency distribution table with X and f, these algebraic symbols  should be defined somewhere in, or near, the table, and the table should be labelled.   In this case, this might be done as follows.

Frequency distribution table of n=10 respondents’ view of priority for use of federal surplus (X)

| X | f |
| --- | --- |
| 1 | 4 |
| 2 | 2 |
| 3 | 0 |
| 4 | 4 |
| Total | 10 |

The frequency distribution table is the most common way to present a table of statistical data.  These will be used extensively through the semester.

b. Proportional distribution – p. 117

Instead of presenting the data as frequencies of occurrence, the distribution may be presented as a proportional distribution.  In such a table, instead of showing frequencies of occurrence, the table lists the proportion of cases taking on each value of the variable.

The proportion (p) is the fraction of cases taking on any particular value.  Thus

p = f / n

That is, if there is a sample of size n cases, and f represents the frequency of occurrence of each value, then the proportion of cases taking on that value is p or f/n.  The proportional distribution for the above example is as follows.

Proportional distribution table of n=10 respondents’ view of priority for use of federal surplus (X)

| X | p |
| --- | --- |
| 1 | 0.4 |
| 2 | 0.2 |
| 3 | 0.0 |
| 4 | 0.4 |
| Total | 1.0 |

For the above table, the values of p are obtained as follows:

For X = 1, p = f / n = 4 / 10 = 0.4

For X = 2, p = f / n = 2 / 10 = 0.2

and so on.

Note that the sum of the proportions must equal 1.0 – that is, all the n cases must be included.  If you construct a proportional distribution and the sum of the proportions is too far off from 1.0, there may be a calculating error.  If it is 0.999 or 1.002, a small difference of this sort may emerge just from rounding off values of the proportions (see pp. 126-7 for some rules on rounding).
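The calculation p = f / n, and the check that the proportions sum to 1.0, can be sketched in a few lines of Python.  The names frequencies, proportions, and total are chosen here for illustration; the frequencies are those of the example above.

```python
# Frequencies (f) for each value of X, from the example above.
frequencies = {1: 4, 2: 2, 3: 0, 4: 4}

# Sample size n is the sum of the frequencies.
n = sum(frequencies.values())

# p = f / n for each value of X.
proportions = {x: f / n for x, f in frequencies.items()}

for x, p in proportions.items():
    print(x, p)

# The proportions should sum to 1.0, apart from small rounding differences.
total = sum(proportions.values())
if abs(total - 1.0) > 0.005:
    print("Sum of proportions is", total, "- check the calculations")
```

The tolerance of 0.005 is an arbitrary choice for this sketch; it simply allows for differences of the size that rounding can produce.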

c. Percentage distribution – p. 117

Instead of presenting the data as frequencies of occurrence, the distribution may be presented as a percentage distribution.  In such a table, instead of showing frequencies of occurrence, the table lists the percentage of cases taking on each value of the variable.

The percentage (P) is the fraction of cases taking on any particular value (p), multiplied by 100 per cent.  That is, a percentage is just a proportion per one hundred or per cent.  Thus

P = (f / n) x 100%

That is, if there is a sample of size n cases, and f represents the frequency of occurrence of each value of a variable X, then the percentage of cases taking on a particular value of X is P, which is p x 100% or (f / n) x 100%.  The percentage distribution for the above example is as follows.

Percentage distribution table of n=10 respondents’ view of priority for use of federal surplus (X)

| X | P (%) |
| --- | --- |
| 1 | 40 |
| 2 | 20 |
| 3 | 0 |
| 4 | 40 |
| Total | 100 |

The values of P are obtained as follows:

For X = 1, P = (f / n) x 100% = (4 / 10) x 100% = 0.4 x 100% = 40%

For X = 2, P = (f / n) x 100% = (2 / 10) x 100% = 0.2 x 100% = 20%

and so on.

Note that the sum of the percentages must equal 100% – that is, all the n cases must be included and accounted for.  If you construct a percentage distribution and the sum of the percentages is too far off from 100, there may be a calculating error.  If it is 99.9% or 100.2%, a small difference of this sort may emerge just from rounding off values of the percentages (see pp. 126-7 for some rules on rounding).  In this case, you may wish to adjust one of the percentages slightly so the sum of the column is actually 100% (see notes on p. 133 for an example of how to deal with this problem).
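The following Python sketch computes a percentage distribution and shows how rounding alone can move the total slightly away from 100.  The function name percentage_distribution and the second data set (three values with one case each) are hypothetical, introduced only for this illustration.

```python
def percentage_distribution(frequencies, decimals=1):
    """Return P = (f / n) x 100 for each value, rounded to `decimals` places."""
    n = sum(frequencies.values())
    return {x: round(100 * f / n, decimals) for x, f in frequencies.items()}

# Frequencies from the survey example above.
survey = {1: 4, 2: 2, 3: 0, 4: 4}
survey_percentages = percentage_distribution(survey)
print(survey_percentages, sum(survey_percentages.values()))

# Hypothetical case: three values with one case each.  Each percentage
# is 33.333..., which is reported as 33.3 after rounding.
thirds = {"a": 1, "b": 1, "c": 1}
thirds_percentages = percentage_distribution(thirds)
thirds_total = round(sum(thirds_percentages.values()), 1)
print(thirds_percentages, thirds_total)
```

In the second case the rounded percentages total 99.9 rather than 100, even though no calculating error has been made.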

6.  Class Limits – p. 134

For presentation of a frequency or percentage distribution of a continuous variable, the values of the variable must be grouped into intervals.  The same is true for a discrete variable with many possible values – it would be inefficient to list all the values for such a variable.  When working with data of this type, the end points of the intervals are the class limits.  In some cases, there is a gap between the end points of the intervals and, for graphical presentation of the data, it may be worthwhile to determine what are called the real class limits.

The example of the stem and leaf display illustrates this.  In the stem and leaf presentation, the variable X is income in dollars.  While income is a continuous variable, the values have been rounded to the nearest dollar.  The values could potentially be from 0 to 110,000, meaning there would be 110,000 rows to the table.  In this case, I grouped the data into groups of ten thousand dollars.  That is, the variable X is income in thousands of dollars, and the intervals used for summarizing these data were:

0-9

10-19

20-29

30-39

40-49

etc.

The values 0, 9, 10, 19, 20, 29, 30, 39, and so on, are the apparent class limits.  But there is a gap of one unit between the intervals.  This gap emerges not because there is really a gap in the sample or population, but because of my decision to group into the categories noted above.  For example, if I had decided to group into categories such as 0-4, 5-9, 10-14, etc. the gaps would be different.  In order to correct for these gaps between intervals of grouping data, the real class limits can be calculated.

The real class limits are the values midway between the end points of the adjacent intervals.  The first interval ends at 9 and the next begins at 10; the midpoint between these two is 9.5.  Similarly, the midpoint between 19 and 20 is 19.5.  The resulting real class limits are as follows:

-0.5 – 9.5

9.5 – 19.5

19.5 – 29.5

29.5 – 39.5

39.5 – 49.5

etc.

That is, calculation and use of the real class limits means that the apparent gap between intervals is eliminated.  This gap originally emerged because of the way the analyst decided to organize the data, not because of any inherent gap in the data.  If real class limits are used when presenting data as a frequency distribution table or histogram (p. 145), there is no gap between the bars.  There should be no gap since the variable is continuous, so there should be no values that cannot occur.  Figure 1 in the stem and leaf display shows how the frequency distribution table of Table 3 can be presented as a histogram.
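The calculation of real class limits described above can be sketched in Python as follows.  The function name real_class_limits is my own, and the sketch assumes that the gap between adjacent intervals is the same throughout, as it is in this example.

```python
def real_class_limits(intervals):
    """Convert apparent class limits to real class limits.

    `intervals` is a list of (lower, upper) apparent limits in ascending
    order.  Assumes the gap between adjacent intervals is constant.
    """
    # Gap between the end of the first interval and the start of the second.
    gap = intervals[1][0] - intervals[0][1]
    half = gap / 2
    # Move each end point half a gap outward, to the midpoint of the gap.
    return [(lower - half, upper + half) for lower, upper in intervals]

# Apparent class limits for income in thousands of dollars.
apparent = [(0, 9), (10, 19), (20, 29), (30, 39), (40, 49)]

real = real_class_limits(apparent)
widths = [upper - lower for lower, upper in real]

for (lower, upper), width in zip(real, widths):
    print(lower, upper, width)
```

Each interval ends exactly where the next one begins, so there is no gap, and every interval is 10 units wide.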

Real class limits also make sense in two other ways:

·        Interval width.  The width of the interval is the difference between the upper and lower class limits of the interval.  If the apparent class limits of 9, 10, 19, 20, etc. are used, it appears that the interval width is 9 thousand dollars, ie. 19 – 10 = 9, 29 – 20 = 9, and so on.  But this is misleading, since the interval from 10 to 19 really represents 10, not 9, units of the variable.  By using real class limits, the interval widths in the above example are 10, that is, 19.5 – 9.5 = 10, 29.5 – 19.5 = 10, and so on.  This is the proper interval width, since the data have been organized into groups of 10.  Note that in order to make the first interval ten units wide, it is necessary to make the first interval start at –0.5, rather than at 0.  While this looks a bit odd, this preserves the proper interval width of 10, since 9.5 – (-0.5) = 10.

·        Rounding.  The data in the stem and leaf display have been rounded to the nearest thousand dollars.  A case reported as 9 thousand dollars thus might have any income up to \$9,500.  An income such as \$9,521, just above \$9,500, would have been rounded up to 10 thousand dollars, and included in the 10-19 interval.  In contrast, an income such as \$9,498 would be rounded down to 9 thousand dollars, and included in the 0-9 interval.  Thus the value of \$9,500, or 9.5 thousand dollars, is the proper dividing point between the 0-9 and 10-19 intervals, and this is where the real class limit for these intervals is located.  Using the real class limits preserves the dividing point between categories, when conventional rules on rounding are used to group data (see pp. 139-142 for a fuller explanation and other examples).
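The rounding point can be checked with a small Python sketch that places an income in dollars into its interval.  The function name income_interval is hypothetical, and the sketch assumes that a value exactly halfway between two thousands is rounded up; the text's own rounding rules (pp. 126-7) are not reproduced here and may treat that case differently.

```python
import math

def income_interval(income_dollars):
    """Return the apparent class limits (in thousands) for an income in dollars.

    Assumes rounding to the nearest thousand, with exact halves rounded up,
    and intervals of 0-9, 10-19, 20-29, and so on.
    """
    # Round to the nearest thousand (half rounds up).
    thousands = math.floor(income_dollars / 1000 + 0.5)
    # Lower limit of the ten-thousand-dollar interval containing that value.
    lower = (thousands // 10) * 10
    return (lower, lower + 9)

print(income_interval(9521))  # rounds up to 10 thousand
print(income_interval(9498))  # rounds down to 9 thousand
```

The two incomes fall on either side of 9.5 thousand dollars and so land in different intervals, which is why 9.5 is the dividing point between 0-9 and 10-19.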

When graphing data with large values, eg. 100-199, 200-299, 300-399, etc., the difference between the apparent and real class limits is too small to show up in the graph.  In this case, it is probably preferable to just use intervals such as 100-200, 200-300, 300-400, etc. and forget about the real class limits – they are only a 0.5 difference, too small to have much effect on values such as 300, 400, etc.

If the data have been grouped into intervals where there is no gap between the end points of the intervals, then the real and apparent class limits are identical and this whole problem of dealing with the gap is eliminated.  For example, in the grouping

0-5

5-10

10-15

15-20

etc.

then the values 0, 5, 10, 15, 20, and so on are both the apparent and real class limits.  In this case, do not attempt to split the difference between the end points of adjacent intervals – the analyst has already taken care of this problem by running each interval right up against the next interval.

Next day – histograms and densities.

Last edited on January 17, 2004