Social Studies 201

Fall 2003

Wednesday and Friday, September 10-12, 2003

Production of data – Chapter 2

A. Introduction

This chapter discusses issues that are more methodological than statistical, concerning how data are defined, obtained, and used, rather than examining data themselves. Since any work with quantitative data involves obtaining original data or working with the data of others, it is always worthwhile to consider these methodological issues while looking at the numbers themselves. If a researcher or analyst is to use quantitative data well, it is important for him or her to understand how the data are produced, what their limitations are, and how they might best be approached. Manipulating and presenting the numerical data is the work of statistics, but statistics by itself can be misleading or inappropriate, unless set in a proper context. It is this context that is the subject of Chapter 2 of the text.

Chapter 2 examines issues, concepts, and ideas that a data analyst should consider when using quantitative data. If the analyst obtains his or her own data, he or she must produce these data, and this chapter outlines some of the issues that a data producer and analyst should address. If other researchers, or another agency such as Statistics Canada or a survey research firm, have produced the data, anyone using these data for analysis or research should be aware of the definitions, concepts, and methods that these groups have used to obtain the data.

The following notes first discuss some general issues associated with production of data. Following this is a discussion of five major issues in data production – availability, control and use, definition, methods of obtaining data, and potential errors.

B. General Issues

1. Quantitative or qualitative. Statistics can be used to measure readily quantifiable phenomena such as age, height, attitude, or income, or it can count qualitative phenomena such as crimes, university majors, or supporters of various political parties. As a result, statistics may not be as restrictive a method as sometimes thought – it can be applied to any phenomenon where classification is possible, and this includes most physical and social phenomena. At the same time, those primarily involved in dealing with quantitative data should remember that the quantity associated with any phenomenon is only one aspect, with qualitative considerations also being important and providing insight into the social issue in question.

2. Collection or production. The title of the chapter is production of data, rather than the collection of data. Data are not readily available to be collected in some natural form, but are social products. This means that people involved in the production of data must make decisions about what data are to be produced and how they are to be produced. For researchers working with quantitative data, it is important to consider who makes these decisions, along with the purposes, interests, and aims of the data producers, or those who provide funding for the production of data. For example, data production may be funded by advertisers or politicians, with specific interests involved in production of these data, or they may be produced by agencies or individuals aiming to meet a more general interest. As examples of the latter, public health indicators (water quality, indicators of health and mortality) are presumably aimed at protecting or improving the health of the whole population.

Note that there are incorrect or wrong ways to obtain data, but there is no one, single correct way to obtain data. Proper methods of data production depend on the characteristics of the social issue being investigated and the structure of the population being examined. When encountering any single datum or a data set, it is always worthwhile to consider some of these issues of data production.

3. Quality of data. When initially working with data, it may not be apparent whether the data are high quality or not; at the same time, a user of the data should consider the issue of the quality of the data. Among the data issues to be considered are the questions of whether the data adequately and honestly represent the phenomena and issues being addressed. If the data have many errors or are poorly defined, there may be no way to rescue the data so they can be usefully applied to consideration of a social issue. Well-founded social analysis requires high quality data. If the data are poor quality, then they may still be useful, but only if very carefully used and analyzed.

C. Major issues in data production

1. Availability. Section 2.3.1, pp. 13-16.

a. Use data that are already produced whenever possible

If at all possible, for term papers, theses, or research reports, attempt to find data that have already been produced and are accessible to you. While there are reasons why it might be preferable to use data that you produce yourself, such data production may be extremely difficult.

Among the problems associated with data production are the time taken to produce the data and the financial costs associated with obtaining data. For social research, there is also a nuisance factor – you have to interfere in the lives of those from whom data are to be obtained. While this may be a relatively minor consideration, there are ethical issues involved in such intervention, and it may not be worth the bother and effort to ensure that ethical considerations of informed consent and non-interference in the lives of others are met. Obtaining data can also lead to the alteration or destruction of a population – an anthropologist entering a traditional village may introduce new ideas that forever change the social structure of that village.

There are also some positive reasons for using data produced by other researchers. Previous researchers may have produced well-constructed data sets that can be re-examined for further insights. In situations where a researcher decides to produce his or her own data, it is usually worthwhile to produce these data in a way that they can be compared with research results produced by earlier researchers. Comparison of data across different populations with different methods of obtaining the data often produces new conclusions.

Another possibility is that data sets produced earlier may have been poorly constructed and re-examination of these data may reveal shortcomings in earlier research. One example of the latter is Stephen Jay Gould’s revisiting of the research on brain and skull sizes of nineteenth-century researchers (see his book The Mismeasure of Man). The earlier researchers claimed that Western European skull and brain sizes exceeded those of people from Africa and Asia. Gould, by revisiting the same skulls and brains, demonstrates the value-laden and Eurocentric assumptions adopted by these earlier researchers.

b. Types of data sources

Secondary data sources – Statistics Canada, Census of Canada, Government Publications in University of Regina Library, other surveys (General Social Survey, Labour Force Survey), administrative records, business records, historical documents.

Web sites – data may be available on these, but pay attention to their origin, completeness, and quality. Data from sites such as Statistics Canada are generally high quality, but such quality is not assured on all web sites. When examining data from web sites, look for the original source of the data and consider the issues discussed later in these notes – definitions, method of production, and potential errors in the data.

Problems. When using data that have already been produced, there is little alternative but to use the data as they have been presented. While these data are often of high quality, they may not address the exact issue that a researcher or analyst wishes to address, or adopt the perspective of this analyst. For example, in the discipline of sociology, social class is a major concept, but there are few data from established statistical agencies that give information on social class, let alone measurement of social solidarity or class consciousness of these social classes.

c. Uneven production and availability of data. When examining why certain types of data are available and others are not, an analyst must consider why the data were produced and for what purposes they were produced. Those who have the financial and organizational resources to obtain data may do so. Those problems or people that no one has an interest in may have no data available.

For example, there may be few studies of the situation of poor or homeless people – they have no resources to finance studies of their situation. In contrast, there are extensive data sets about financial issues – stock markets, bonds, interest rates, etc., since these are all of interest and concern to those with financial resources to ensure that these data are obtained and made available. Similarly, many newspaper pages are devoted to reports on economic issues such as the behaviour of stocks and bonds, wheat production, and weather. In contrast, information about refugees, the poor, Aboriginal people, environmental and occupational health issues, etc. is often unavailable.

The SARS and mad cow scares demonstrate the importance of producing data about issues related to public health. In the latter case, there may have been inadequate testing of beef, resulting in worries and problems for anyone who consumes beef and for beef producers.

Having full information about issues is generally good – an informed population is essential to participation in a democracy.

d. Examples

We will use data from the survey Student Attitudes and Experiences, Fall 1998 (SSAE98) in the computer labs of this class. The data from this data set may not be all that remarkable, but they would not be available if the survey of undergraduate students in the Fall 1998 semester had not been produced for Social Studies 306.

Labour Force Survey of Statistics Canada. This is a massive survey in terms of scope, coverage, and timeliness. The survey is conducted monthly by Statistics Canada – information about the Canadian labour force in August 2003 was released last Friday, September 5. These data provide detailed and summary descriptions of the Canadian and Saskatchewan labour force, with extensive data about who has jobs and who does not, what is the nature of the jobs, etc. These are high quality data, obtained from a large sample survey, and constitute a major resource for social science research and social policy in Canada. At the same time, anyone using these data is restricted to analysis of the data that are available from this survey. For example, this Survey contains no information on ethnicity, so analysis of issues of discrimination or unequal treatment by ethnic origin would not find much useful information in data from this Survey.

2. Control and Use of Data. Section 2.3.2, pp. 16-18.

a. Data as property - information age - data and power.

· Those who have resources can obtain and own data. Cost of the data is considerable.

· Military, police investigations.

· But who is data about, and what uses are made? Mistakes.

b. Power associated with having data and information about others. Those who have the data may be able to use it to further their own interests at the expense of others. Examples include insider trading on stock markets, and bureaucracies that control data so that individuals have little or no information concerning how others are treated. It may be difficult for an individual to deal equally with others who have access to or control of more information – in this sense, having certain types of data may be an instrument of power.

c. Confidentiality, anonymity, secrecy, privacy.

· Important for individuals to protect these, so that there is no misuse of the data, especially uses which could harm the individual.

· Ethical issues concerning proper use of data.

· Researchers – wish to have lots of data, so interests of individuals and researchers may clash. Issues here are primarily associated with how the data are obtained and the uses to which they are put.

d. Examples

Statistics Canada. Data obtained by the national statistical agency in the Census of Canada and other surveys are treated as confidential, so that no data released to the public can be identified with any particular individual. Data are aggregated, or anything that would reveal the identity of the individual is stripped from the data set. One result of this procedure is that it sometimes limits the usefulness of the data set for research purposes.

Credit cards and agencies, Safeway Club Card. An individual has no control over these, although these agencies have no direct control over the individual either, except where a credit rating may be affected.

3. Definition of Population and Variables. Section 2.4, pp. 19-29.

a. Population or Universe

Population - Set of all people or objects with common (observable) characteristics. (p. 20).

When working with data, it is necessary to delineate the limits or boundaries of the population, i.e. state which individuals or objects are in the population and which are not. In order to do this, a researcher should identify and state the criteria associated with inclusion in or exclusion from the population for any individual or object.

Examples:

· Population of Regina. Necessary to decide on the date of measurement of the population, nature of residence (temporary or permanent – what about students?), plus temporary absences of those usually resident in the city. Having a “permanent” residence in the city may be a means of deciding on these matters.

· Population of poor people. What criteria are being used to define who is poor and who is not – a researcher may define on basis of a specific income. Those below this specific income level are poor and those above are not. But how is this cutoff point to be determined; does it differ by region, family size, and other characteristics?

· SSAE98. Full-time undergrads (9+ hours of credit) in day time classes. Some others included, but Survey is primarily of these.

· Base population for the Labour Force Survey (pp. 44-46). The survey does not include the whole population of Canada. The first exclusion is by age, excluding all people under age 15. A second set of exclusions is institutional. Full-time members of the Canadian Forces and inmates of institutions are excluded from the Labour Force Survey. Statistics Canada argues that people in the Canadian Forces and in institutions (prisons, long-term care homes, etc.) are not available for paid employment in the way other people are. A third set of exclusions is geographic. Probably for reasons of high cost of surveying a dispersed and remote group, the Yukon and the Territories are not included in the Survey. In addition, persons living on Indian Reserves are not included in the Survey. As a result, the base population for the measurement of the labour force excludes approximately 2% of the population aged 15 or over. Given the forty thousand plus status Indians who live on Indian Reserves in Saskatchewan, the percentage of people excluded from the Survey in this province likely exceeds 5%.
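The three sets of exclusions in the Labour Force Survey example can be read as a membership rule for the base population. The following is only a rough sketch of that rule in Python, not Statistics Canada's procedure. The Person record and all of its field names are hypothetical, invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Person:
    # All field names are hypothetical, chosen only for this illustration.
    age: int
    in_canadian_forces: bool = False    # full-time member of the Canadian Forces
    in_institution: bool = False        # e.g. prison or long-term care home
    lives_in_territories: bool = False
    lives_on_reserve: bool = False


def in_lfs_base_population(p: Person) -> bool:
    """Apply the age, institutional, and geographic exclusions in turn."""
    if p.age < 15:                                        # exclusion by age
        return False
    if p.in_canadian_forces or p.in_institution:          # institutional exclusions
        return False
    if p.lives_in_territories or p.lives_on_reserve:      # geographic exclusions
        return False
    return True


people = [
    Person(age=14),
    Person(age=30),
    Person(age=45, in_institution=True),
    Person(age=22, lives_on_reserve=True),
]
included = [in_lfs_base_population(p) for p in people]
print(included)  # [False, True, False, False]
```

Writing the criteria out this way makes the point of this section concrete: whether an individual is in the population depends entirely on criteria the data producer has chosen and stated.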

Universe. An alternative approach to considering population is to refer to the universe of observations (pp. 22-23). This is the set of all measurements that concern the researcher.

Examples:

· An example in terms of physical measurement – for measurement of temperature, the universe could be the set of all temperatures recorded at a weather station over the course of a year.

· For a survey of income inequality, the universe could be the set of all incomes of interest to the researcher. In this case though, each income recorded by the researcher is associated with an individual or family, so it may be preferable to provide details about the boundaries and criteria that have been used to define the set of individuals in the data set.

· The data set we will use in the computer labs appears as a set of numbers or measurements. But these are measurements of particular characteristics of individuals who were University of Regina undergraduates in the Fall, 1998 semester, characteristics such as sex, major, age, and attitude to various opinion questions.

b. Variables.

The variables are the characteristics of the members of the population (pp. 23-29). Examples include incomes of households or families, attitudes of individuals, employment status of individuals. Just as good researchers will clearly define the population, so the variables being examined for members of the population should be clearly defined. In doing this, there are theoretical and operational decisions that a researcher must make when defining variables (pp. 26-29).

Theoretical issues. The variable being investigated by the researcher may be associated with a particular theoretical approach. While there is not much theoretical concern about how to measure age (we do that in years), social science variables such as attitudes, supply of labour, social class, and extent of stress may each be associated with a variety of theoretical approaches.

The example in the text concerns the measurement of social class (pp. 27-8). As any student of sociology will be aware, there are several different and competing theoretical views of what social class means. A conventional stratification approach considers a population to be divided into several classes from upper class to middle class to lower class. A Marxian approach is primarily concerned with the division of the population into the two classes of bourgeoisie and proletariat. Social stratification theorists in the tradition of Max Weber may regard the primary indication of social class to be the common relation of a group of people to a market. Many other variants also exist. Before conducting research on social class, a researcher would have to decide which theoretical approach he or she would take.

Operational issues. These are the practical criteria associated with obtaining information about the variable. That is, once a researcher has adopted a particular theoretical approach, he or she has to operationalize the definitions of the variables so data can be obtained about them.

In the case of social class, a researcher adopting a stratification approach would have to decide on definitions and levels of income, education, or status of people – these data would be obtained from individuals and used to sort people into upper, middle, or lower class. A researcher taking a Marxian approach to social class would not consider income to be so important; for a Marxian the key question is whether an individual owns property and uses it to employ others – these are the bourgeoisie. And the worker, or proletarian, is the individual who has no option other than to sell his or her ability to work to an employer, in return for a wage. A Weberian would have to decide which markets were important in defining classes – Weber himself defined classes such as debtors, creditors, shipowners, bankers, etc. in addition to bourgeoisie and proletariat.

Example of the labour force. (pp. 46-49). The theoretical approach adopted by Statistics Canada is that of conventional economic approaches to the supply of labour in production of goods and services for the market. In theory, the supply of labour is to include all individuals who are available to provide labour to employers, or those who have their own farm or business or are employers themselves. But this theoretical approach to the supply of labour excludes those who work or produce products in the home or provide volunteer labour. That is, unpaid housework and volunteer work are excluded from the definition of who is in the labour force. Only those involved in paid work are included. This is consistent with the concept of the supply of labour in conventional economic approaches.

The operational definition of the labour force involves determining which individuals are available for work in the labour force – that is, all those who were employed or unemployed. Statistics Canada obtains information about each individual surveyed by asking them some specific questions about their work or activity – the wording of these questions constitutes the operationalization of the concepts.

The employed includes all individuals who performed any paid work for an employer, were operators of a farm or business themselves, or were employers. It also includes those who were temporarily absent from work due to illness, vacation, etc.

The unemployed includes those who were without work but had actively looked for work, were on temporary layoff, or had a new job to start within four weeks.

Those not in the labour force include all those who are neither employed nor unemployed – the retired, those attending school and not employed, those on disability, and homemakers.
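To see how an operational definition sorts individuals into categories, the three definitions above can be sketched as a small Python function. This is a simplified illustration under stated assumptions. The function and parameter names are hypothetical, and the actual Labour Force Survey questionnaire is more detailed than these yes/no flags.

```python
def labour_force_status(
    *,
    paid_work=False,                 # performed any paid work for an employer
    runs_farm_or_business=False,     # operator of a farm or business, or an employer
    temporarily_absent=False,        # has a job but absent (illness, vacation, etc.)
    looked_for_work=False,           # without work but actively looked for work
    temporary_layoff=False,
    job_starts_within_4_weeks=False,
):
    """Classify one respondent using the simplified definitions in these notes.

    All parameter names are hypothetical; they stand in for survey questions.
    """
    if paid_work or runs_farm_or_business or temporarily_absent:
        return "employed"
    if looked_for_work or temporary_layoff or job_starts_within_4_weeks:
        return "unemployed"
    # Everyone else: retired, students without jobs, those on disability, homemakers.
    return "not in labour force"


print(labour_force_status(paid_work=True))        # employed
print(labour_force_status(looked_for_work=True))  # unemployed
print(labour_force_status())                      # not in labour force
```

Note that a homemaker or volunteer who does unpaid work answers "no" to every question here and so is counted as not in the labour force. That outcome follows from the theoretical definition adopted, not from any error in measurement.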

Statistics Canada generally obtains high quality data about the labour force. But a researcher who disagrees with either the theoretical definition of the supply of labour or the operationalization of the definitions of employed and unemployed, might consider Statistics Canada’s data on the labour force as flawed. For example, those who argue that people who work in the home should be included would consider the count of the labour force as being too low – these people might argue that those who work but do this work for the family or household should also be included in measures of the size of the labour force.

4. Methods of Obtaining Data. Section 2.5, pp. 29-34.

There are many methods of obtaining data – here only the census and sample are mentioned. Other sources of data include experiments (of the sort psychologists conduct) and data from administrative records (for example, information about students, obtained from the University Registrar).

Census. A census is a complete enumeration of all members of a population (p. 29). The Census of Canada is conducted every five years by Statistics Canada – the aim of the Census is to obtain a complete enumeration of every resident of Canada in years ending in 1 or 6 (1991, 1996, 2001, 2006).

Another example is the record of grades for Social Studies 201. I record the grade for each assignment that each student completes. This record of class grades constitutes a census of this class.

While a census of a population would seem to provide complete information, if a population is large, there are several reasons why a census may be difficult.

First, obtaining information from each member of the population may be very costly and time consuming, and can involve much effort. If a population is large, it is only an agency such as Statistics Canada that would have the resources to conduct a census.

Second, if every member of a population were willing to cooperate with the researchers producing the data, information about the population might be high quality and complete. In practice, there is usually incomplete coverage of the population – some individuals are missed, some individuals do not wish to be found, and some individuals do not wish to provide information to the researcher. In addition, there is a certain nuisance factor to obtaining data – it may take up the time and energy of members of the population being examined, so that not all cooperate with the researcher.

Sample. A sample is any subset of a population.

Given the difficulty of obtaining a census of a population, researchers are more likely to obtain a sample than a census. In doing this, the researcher hopes to find a relatively small group of members of the population who will provide information about themselves. From this group, a researcher hopes to draw conclusions about the whole population.

Representative sample. A researcher would generally like to obtain a representative sample, that is, a subset of the population that represents the population as a whole. For example, in the September 10, 2003 handout, the SSAE98 sample was close to representative in terms of sex, although the sampling error was 2.4 percentage points – under-representing males and over-representing females in the sample. If a researcher can find a sample that represents the population in a number of variables such as age, sex, region of residence, etc. then he or she has reasonable confidence that the sample may represent the characteristics of the population more generally.

Random sample. This is a sample where each member of the population has an equal chance of being selected. While only a small number of members of the population may actually be selected for the sample, every member of the population stands an equal possibility of being chosen in the sample. There is no bias or discrimination against certain individuals or groups when selecting such a sample. Principles of probability apply when selecting random samples and making inferences about the characteristics of the population from the sample.

If a random sample is reasonably large, then the random sample will be close to representative of the population. This is a major reason why researchers wish to select a random sample when sampling. But in practice, especially if the population is large, it may be difficult to select a random sample of the population.
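As an illustration of the idea that a reasonably large random sample tends to be close to representative, here is a minimal Python sketch. The population is invented for the example (10,000 members, 60% female), and the seed is fixed only so the run is repeatable. The function random.Random.sample draws without replacement, giving each member an equal chance of selection.

```python
import random

# Hypothetical population: 6,000 females and 4,000 males (60% female).
population = ["F"] * 6000 + ["M"] * 4000

rng = random.Random(201)  # fixed seed so the example is repeatable

# Simple random sample of 1,000 members, drawn without replacement.
sample = rng.sample(population, k=1000)

population_share_female = population.count("F") / len(population)
sample_share_female = sample.count("F") / len(sample)

print(f"Population: {population_share_female:.1%} female")
print(f"Sample:     {sample_share_female:.1%} female")
```

The sample percentage will usually differ a little from 60%, and that difference is the sampling error discussed later in these notes. Trying a much smaller k (say 20) typically gives a sample share that wanders further from the population share.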

Nonrandom sampling methods. These include all other methods of sampling. If carefully constructed, a sample obtained by nonrandom methods can be close to representative of the population and can yield worthwhile insights into the characteristics of the population. But a poorly constructed sample of this type may not be representative of the population. Some of the more common nonrandom sampling methods are as follows.

· Judgment samples. A researcher may select members of the population on the basis of his or her own judgment. If the researcher is familiar with the population, and takes care to select a cross-section of different types of people in the population, then this can provide a good sample. But if the researcher’s judgment is poor, this may yield a very unrepresentative sample.

· Volunteers. A researcher may advertise for or request volunteers for a study. If a researcher is looking for people with an unusual characteristic (e.g. those with a rare disability) this may be a good method since a random sample will not produce many people with these characteristics. But it is unlikely this will yield a sample that is a cross-section or representative of the whole population.

· Snowball sample. This involves combinations of the above two methods, where some members of the population are initially selected by the researcher and then those selected suggest other names to the researcher. Again, if the researcher is looking for people with unusual characteristics, this can be a good method. While it can yield a larger sample relatively quickly, this method is not likely to produce a sample that is representative of the whole population.

· Quota sample. If the researcher is familiar with a population, the judgment method can be improved by requiring specific numbers of people with certain characteristics. For example, if a researcher knows that 60% of a population is female and 40% male, then if searching for a sample of size 100, the researcher would select 60 females and 40 males. If carefully specified, this method can yield samples that are close to representative of a whole population. The SSAE98 sample was a form of quota sample – I attempted to obtain a sample that was reasonably representative of the undergraduate student body by faculty and year of program.

· Combined methods. Various of the above methods can be combined, along with use of randomness to provide samples that are relatively representative of a whole population.
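The quota method described above can be sketched in a few lines of Python, using the 60 females and 40 males from the example. This is only an illustration. The function name, the list of people, and the order in which they are encountered are all hypothetical. The researcher simply takes people as they are met until each quota is filled, which is why the method is nonrandom.

```python
from collections import Counter


def quota_sample(people, quotas, key):
    """Take people in the order encountered until each group's quota is filled."""
    remaining = dict(quotas)   # copy so the caller's quotas are not modified
    chosen = []
    for person in people:
        group = key(person)
        if remaining.get(group, 0) > 0:
            chosen.append(person)
            remaining[group] -= 1
    return chosen


# Hypothetical list of (id, sex) pairs in the order the researcher meets them.
people = [(i, "M" if i % 3 == 0 else "F") for i in range(300)]

sample = quota_sample(people, quotas={"F": 60, "M": 40}, key=lambda p: p[1])
counts = Counter(sex for _, sex in sample)
print(len(sample), counts["F"], counts["M"])  # 100 60 40
```

The quotas guarantee the sample matches the population on sex, but nothing in the method ensures it is representative on characteristics that were not given a quota.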

5. Errors in Data. Section 2.6, pp. 34-42.

Sampling errors. Sampling error is the error produced as a result of the fact that not every member of the population was sampled. The 2.4 percentage point error in representation of males and females in the SSAE98 sample (September 10, 2003 handout) is an example of sampling error. That is, if SSAE98 had surveyed all undergraduates, the survey would have been exactly representative of undergraduates by sex. But only 714 undergraduates were in the sample, with the result that males were under-represented and females over-represented.
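The arithmetic behind an error of this kind is just the difference between the sample percentage and the population percentage. The short Python sketch below shows the calculation. The counts are invented for illustration and are not the SSAE98 figures, which are in the September 10, 2003 handout, and the function name is hypothetical.

```python
def representation_error(sample_count, sample_size, population_share):
    """Sample share minus population share, in percentage points."""
    return 100 * (sample_count / sample_size - population_share)


# Invented figures, NOT the SSAE98 data: a sample of 500 with 290 females,
# drawn from a population that is 55% female and 45% male.
female_error = representation_error(290, 500, 0.55)
male_error = representation_error(210, 500, 0.45)

print(f"Females: {female_error:+.1f} percentage points")
print(f"Males:   {male_error:+.1f} percentage points")
```

With only two categories, the two errors are equal in size and opposite in sign: whatever share females gain in the sample, males lose.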

Nonsampling errors. These are any errors in the production of data, other than sampling error. Such errors can emerge because some individuals are left out or refuse to cooperate with the researcher. Other members of the population may give incorrect or misleading answers, either inadvertently or deliberately. On the part of the researcher, survey questions may be poorly constructed so the researcher does not obtain accurate information about the population. Errors in processing the data can also occur, especially if there is a large sample with much data – clerical errors in tabulating and transcribing the survey results may occur.

Provide a report of potential errors. A good researcher will provide a methodological report outlining the shortcomings and potential errors in his or her research. However, it may sometimes be difficult to determine the degree of error in any data set. When working with data produced by other researchers or agencies, a user of data should carefully examine the quality of the data and be aware of the potential errors in the data.

6. Labour Force Survey. Section 2.7, pp. 42-60.

This section and the September 10, 2003 handout provide an example of data production of a specific data set. For this class, you need not be familiar with the definitions or data in this section – this section is provided as background, to help you understand how issues 1-5 above are applied by Statistics Canada when obtaining data about the Canadian labour force. If you ever do research on the labour force, it is worthwhile becoming familiar with the methodological and definitional issues involved in the Labour Force Survey. More information about these is provided by Statistics Canada in the publication Guide to the Labour Force Survey. The February revision of this Guide is available from the web site:

http://www.statcan.ca/english/freepub/71-543-GIE/free.htm

Detailed monthly data about the Canadian labour force are available from:

http://www.statcan.ca/english/Pgdb/econoind.htm

7. Conclusion

In addition to the issues discussed here, there are many other practical problems and issues associated with the production of data. While issues associated with data production are more methodological than statistical, when conducting statistical analysis of quantitative data, it is always worthwhile considering where the data come from, what the quality of the data is, and how the data can most appropriately be used. When working with data throughout the semester, I will attempt to mention some of the issues of data production that may be associated with the data sets we use.

The next section of the notes and Chapter 3 deal with different ways of measuring the variables describing members of a population.

Paul Gingrich

Last edited September 12, 2003