Introduction

Sociology 405/805

January 6, 2004


This class begins where Social Studies 201/Statistics 151 ends. We will begin with a review of the methods and procedures of descriptive and inferential statistics, and then proceed to cover a number of statistics and statistical methods that are usually not dealt with in an introductory statistics class. As much as possible, examples from surveys and sociological journals, reports, or books will be used to illustrate the statistical methods and their interpretation.

For each method examined in the class, we will discuss

· The conditions under which the method is most appropriately used or can be used.

· The assumptions built into the method and how violation of the assumptions might affect the results and interpretation of the statistics.

· Calculating the statistics – either with a calculator or on the computer, mostly with SPSS, MINITAB, and perhaps AMOS near the end of the semester.

· Determining statistical significance and the implications of this.

· Interpreting the results in the context of social research.

Since some of the methods use complex formulae that require knowledge of matrix algebra, calculus, etc., we will not be able to study the derivation of the formulae in detail. Rather, we will examine and discuss the assumptions involved in each method and consider how violations of these assumptions might affect the results and their interpretation.

Following this course, you should be able to:

· Read articles, reports, or books that use these statistical procedures and have a reasonable understanding of what the results mean.

· Recognize and understand the methods most appropriate for the data to be used and the questions to be examined.

· Work with the methods discussed and use SPSS and other statistical programs.

· Interpret statistical results in terms of the social research questions and issues.

There are also some limitations to what this course will examine.

· We do not have the time or mathematical ability to develop or derive formulae.

· While each student should have a basic knowledge of each method, there will not be time to deal with all the modifications and intricacies of the methods. For example, we will not spend time on how to deal with the violations of regression assumptions that are dealt with in detail in an econometrics course.

· While we may touch on broader methodological or philosophic issues such as the appropriate place of quantitative research in social research, the limitations and uses of such data, relation to empiricism, positivism, realism, etc., these will be addressed only peripherally.

Summary of Methods

Cross-classifications. We will begin by reviewing chi-square, followed by an examination of measures of association developed to summarize the relationship between variables with no more than nominal or ordinal levels of measurement. Since sociology uses many such variables, cross-classification tables are widely used in sociology and these measures provide a way of summarizing the strength or weakness of association. For this section of the course, we will use Chapters 10 and 11 of the Social Studies 201 text and H. T. Reynolds, Analysis of Nominal Data (to be on reserve at the University Library at HA33 R48). We will calculate some of the measures with the help of a calculator and some with the computer. Interpretation and meaning of the measures will also be discussed.
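As a rough illustration of the kind of calculation involved, the sketch below computes the chi-square statistic for a small cross-classification table, along with Cramér's V as one possible measure of association. The counts are invented, the choice of Cramér's V and the function names are assumptions made for this example only, and Python stands in for the calculator or SPSS used in the course.

```python
import math

# Hypothetical 2 x 3 cross-classification of counts (for example, sex by
# opinion category). The numbers are invented for illustration only.
observed = [
    [30, 20, 10],
    [20, 30, 40],
]


def chi_square(table):
    """Return the chi-square statistic and its degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (count - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df


def cramers_v(table):
    """Cramer's V: chi-square rescaled to lie between 0 and 1."""
    stat, _ = chi_square(table)
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return math.sqrt(stat / (n * (k - 1)))


stat, df = chi_square(observed)
v = cramers_v(observed)
print("chi-square:", round(stat, 2), "df:", df, "Cramer's V:", round(v, 2))
```

With these invented counts, the statistic works out to about 16.67 on 2 degrees of freedom, and V is about 0.33. Whether this is statistically significant would still be judged against the chi-square table.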

Analysis of Variance (ANOVA). There are many types of analysis of variance, and in this class only the relatively straightforward one-way and two-way ANOVA will be discussed. One-way ANOVA is a means of asking whether the means of a (dependent) variable differ when cases are categorized into several groups on the basis of another (independent) variable. For example, whether mean grades of undergraduate students differ between first and second year can be addressed with a sample of first and second year students and a t-test for the difference of means. If we ask whether mean grades differ by year of student, where there are four years for students, then a single t-test is inadequate, since there are four categories of students, first through fourth year. ANOVA can be used to ask whether the mean grades differ by year of student.
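The grades example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the grades are invented, the function name is hypothetical, and the code stops at the F ratio rather than looking up its significance.

```python
# Hypothetical grades (percent) for four students in each year of study.
grades_by_year = {
    1: [62, 68, 70, 64],
    2: [66, 72, 70, 68],
    3: [70, 74, 76, 72],
    4: [74, 78, 80, 76],
}


def mean(values):
    return sum(values) / len(values)


def one_way_anova(groups):
    """Return the F ratio with its between-group and within-group df."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual grades around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


f_ratio, df_between, df_within = one_way_anova(list(grades_by_year.values()))
print("F =", round(f_ratio, 2), "with", df_between, "and", df_within, "df")
```

For these invented grades, the F ratio is 11.0 with 3 and 12 degrees of freedom. It would then be compared with the critical value from an F table at the chosen significance level.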

A two-way ANOVA takes this further and asks whether there are differences in the mean of a dependent variable when categorized on the basis of two independent variables. For example, we might ask whether mean grade differs by year of student and by sex. In this case, the students would be categorized into eight groups and the means across these groups would be examined. In addition to the question of whether means differ by groups, ANOVA can also address the issue of whether there is some interaction between sex and year in terms of affecting grades.
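A minimal sketch of the year-by-sex example follows. It assumes a balanced design (the same number of students in each of the eight groups), uses invented grades, and labels year as factor A and sex as factor B. These names are hypothetical, and unbalanced data would need a different approach.

```python
from itertools import product

# Hypothetical grades: two students in each of the 4 x 2 = 8 groups.
cells = {
    (1, "F"): [66, 70], (1, "M"): [62, 66],
    (2, "F"): [70, 72], (2, "M"): [66, 68],
    (3, "F"): [72, 76], (3, "M"): [70, 74],
    (4, "F"): [76, 80], (4, "M"): [76, 78],
}
years = [1, 2, 3, 4]
sexes = ["F", "M"]


def mean(values):
    return sum(values) / len(values)


def two_way_anova(cells, levels_a, levels_b):
    """Sums of squares, df, and F ratios for a balanced two-way design."""
    a, b = len(levels_a), len(levels_b)
    n = len(cells[(levels_a[0], levels_b[0])])  # observations per cell
    grand = mean([x for values in cells.values() for x in values])
    mean_a = {i: mean([x for j in levels_b for x in cells[(i, j)]])
              for i in levels_a}
    mean_b = {j: mean([x for i in levels_a for x in cells[(i, j)]])
              for j in levels_b}
    mean_cell = {key: mean(values) for key, values in cells.items()}

    ss = {
        "A": b * n * sum((mean_a[i] - grand) ** 2 for i in levels_a),
        "B": a * n * sum((mean_b[j] - grand) ** 2 for j in levels_b),
        # Interaction: what remains of the cell means after removing
        # the separate effects of A and B.
        "AxB": n * sum(
            (mean_cell[(i, j)] - mean_a[i] - mean_b[j] + grand) ** 2
            for i, j in product(levels_a, levels_b)
        ),
        "error": sum((x - mean_cell[key]) ** 2
                     for key, values in cells.items() for x in values),
    }
    df = {"A": a - 1, "B": b - 1, "AxB": (a - 1) * (b - 1),
          "error": a * b * (n - 1)}
    ms_error = ss["error"] / df["error"]
    f = {k: (ss[k] / df[k]) / ms_error for k in ("A", "B", "AxB")}
    return ss, df, f


ss, df, f = two_way_anova(cells, years, sexes)
for source in ("A", "B", "AxB"):
    print(source, round(ss[source], 2), df[source], round(f[source], 2))
```

The three F ratios correspond to the three questions in the text: whether means differ by year, whether they differ by sex, and whether there is an interaction between the two.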

ANOVA is also a more general method that can be used for a variety of statistical purposes. The idea of ANOVA is to take the variability of a variable and ask whether we can statistically determine the sources of that variation. For example, students have different grades and these may be a result of year of study, background preparation, hours of study, type of course, and so on. If we have variables that measure each of these, it may be possible to statistically determine how much of the variability in grades is associated with each of these independent variables. As a result, ANOVA is later used in conjunction with some of the other statistical methods. While we will do some work with ANOVA, if you plan to make extensive use of this method, a course in psychological statistics is advised, since psychology is the discipline that makes greatest use of this method.

Correlation and Regression Models. If two variables have an interval or ratio level of measurement, then it is possible to calculate a measure of association between the two variables which is called the correlation coefficient. The correlation coefficient ranges from -1 (indicating strong negative association – for example, higher labour force participation rates by women may be associated with fewer children born to women) to +1 (indicating strong positive association – for example, more years of education tend to be associated with higher earnings). A correlation close to 0 indicates little or no association. Whether two variables that have a statistically significant correlation between them are causally related is another issue – all that statistics itself can show is whether the association is statistically significant or not.
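To make the definition concrete, here is a small sketch that computes the (Pearson) correlation coefficient directly from its formula. The education and earnings figures are invented and the function name is hypothetical.

```python
import math

# Hypothetical data: years of education and annual earnings
# (thousands of dollars) for six people.
education = [10, 12, 12, 14, 16, 18]
earnings = [28, 30, 35, 40, 46, 55]


def pearson_r(x, y):
    """Correlation coefficient between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(education, earnings)
print("r =", round(r, 3))
```

Because earnings rise fairly steadily with education in this invented sample, r comes out close to +1. Reversing the sign of one variable would give the same value with a negative sign.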

It is possible to extend correlation to more than two variables by calculating multiple and partial correlation coefficients, but these procedures are more difficult to calculate and interpret. Correlation coefficients are used in a variety of other statistical procedures such as path analysis and factor analysis, so it is important to have a good grasp of the correlation coefficient.

Where two or more interval or ratio level variables are involved, with a variable to be explained, regression may be an appropriate method. One or more of the variables are independent or explanatory, and there is a variable that is dependent, to be explained. Regression allows the researcher to determine which variables are statistically significant explanatory variables and what is the nature and extent of their statistical influence on the variable to be explained. For example, if a researcher is attempting to explain different earnings of workers in the labour force (dependent variable), the independent variables sex, years of experience, and years of schooling are likely to exercise a statistically significant influence on the dependent variable. A regression model allows the researcher to obtain estimates of the direction, size, and statistical significance of each of these independent variables on earnings.
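The sketch below shows the simplest case only: a least-squares regression of earnings on a single independent variable, years of schooling. This is a deliberate simplification of the earnings example above, which has several independent variables. The data are invented and the function name is hypothetical.

```python
# Hypothetical data: years of schooling and annual earnings
# (thousands of dollars) for six workers.
schooling = [10, 12, 12, 14, 16, 18]
earnings = [28, 30, 35, 40, 46, 55]


def simple_regression(x, y):
    """Least-squares intercept, slope, and R-squared for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Share of the variation in y accounted for by the fitted line.
    ss_total = sum((b - mean_y) ** 2 for b in y)
    ss_residual = sum((b - (intercept + slope * a)) ** 2
                      for a, b in zip(x, y))
    r_squared = 1 - ss_residual / ss_total
    return intercept, slope, r_squared


intercept, slope, r_squared = simple_regression(schooling, earnings)
print("earnings =", round(intercept, 2), "+", round(slope, 2), "* schooling")
print("R-squared =", round(r_squared, 3))
```

The slope estimates the change in earnings associated with one more year of schooling. With several independent variables, the same idea applies but the calculation requires matrix methods, which is why a statistical program is normally used.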

Regression can include a variety of variables (multivariate regression), more than one dependent variable (simultaneous equations), different functional forms (dummy variables or logarithmic transformations), or corrections for violations of assumptions (multicollinearity, serial correlation, etc.). While we will discuss some of these, if you need to do extensive work with regression, a course in econometrics is advised.

Factor Analysis. A researcher may have a large number of variables measuring a variety of characteristics, none of which is obviously independent or dependent. The researcher may wish to economize by reducing the number of variables or determine the general nature of the connection among the variables. (Note that in SPSS, factor analysis is classified as a method of data reduction.) These connections can be considered by comparing pairs of variables, but this may be an awkward way to draw conclusions. For example, a researcher may have thirty attitude or opinion variables, measuring views on a variety of social issues. It is likely that these attitudes or opinions are connected to each other in various ways, but sorting through these pair by pair would take much time and effort. In such a situation, factor analysis may be useful in assisting the researcher in sorting through the variables.

Factor analysis sorts a set of variables into a number of factors or dimensions. In the case of opinions and attitudes on social issues, it might be possible to identify a left/right dimension (or several such dimensions), a feminist dimension, a racial/ethnic dimension, a rural/urban dimension, etc. Note though that factor analysis does not name the dimensions – placing names on the factors is an interpretation introduced by the researcher. This method is controversial, given the assumptions on which it is founded. Some researchers reject factor analysis as an appropriate method but, in my view, factor analysis can be useful in helping the researcher economize on the number of variables and understand some of the underlying structures that may exist among variables. In this course, we will carry out some factor analysis, but there will not be time to explore all aspects of this.

Structural Equation Models. If there is time near the end of the semester, we will work with some structural equation models. This is a more complex type of model, used where the researcher wishes to examine the relationship among several unobserved variables. If all the variables can be observed, correlation and regression may be sufficient. But in the social sciences, it is difficult to observe all variables, and there may be several observed variables that the researcher has developed to measure the unobserved. There are then really two problems – first, to construct the unobserved variables and second, to examine the relationship among them. In such a situation, it may be possible to develop a structural equation model to estimate these relationships and test hypotheses concerning them.

Variables in sociological theory are often unobserved – for example, class consciousness, alienation, social solidarity, social status, etc. Theorists may be comfortable talking about such variables, but none of these can be directly observed. There may be variables that a researcher can construct which indirectly measure these, but it is likely that there will be a variety of such constructed variables, each measuring slightly different aspects of the unobserved. Suppose, for example, that a researcher hypothesizes that class consciousness is increased by alienation and is affected negatively by social status. The researcher may have a variety of measurable variables, such as responses to questions concerning how individuals are connected to the workplace, income, education, etc. A researcher could use regression or factor analysis to construct factors such as alienation, social status, and class consciousness. But this might be limiting in terms of the assumptions that need to be adopted. In this case, a structural equation model might be used to estimate the underlying, unobservable variables and estimate the nature of the relationships among these.

In this semester, we will not have time to develop the structural equation models very much, but there may be time to examine a few examples of these models. For this statistical method, the AMOS program will be used.

Materials for the Class.

Chapters 10 and 11 of Introductory Statistics for the Social Sciences contain a discussion of the chi-square distribution and some measures of association. I will make these available as handouts or on the web site. In Analysis of Nominal Data, H. T. Reynolds provides a more detailed discussion of the measures of association that can be used for cross-classification tables – I will put this book on reserve.

For the section on analysis of variance (ANOVA), we will use Gudmund R. Iversen and Helmut Norpoth, Analysis of Variance.

An introduction to regression is contained in Chapter 11 of Introductory Statistics for the Social Sciences, with a more extensive analysis in Michael S. Lewis-Beck, Applied Regression: An Introduction. The latter also contains some discussion of multivariate regression and a discussion of assumptions and violation of assumptions.

In Introduction to Factor Analysis, Jae-On Kim and Charles W. Mueller provide a good introduction to factor analysis, with some understandable examples and interpretation of how this method can be useful in social analysis.

Other topics that you would like to discuss?

Assignments. There will be six problem sets (with two weeks to work on each) and a final project assignment. This last assignment will ask you to use a variety of the methods discussed in the course to analyze a particular data set. If you have a data set of your own that you wish to use for the final project, please discuss it with the instructor.

Final project to be due around April 15.

Last edited January 10, 2004