Joseph Nathan Cohen

Associate Professor of Sociology, Queens College in the City University of New York

6530 Kissena Boulevard, Queens, New York, 11367 

An Introduction to Data Reduction

Introduction

Data reduction is the task of consolidating a large set of metrics into a smaller set. It is a data-wrangling and analytical operation that is useful in at least two situations:

  1. To simplify the information being conveyed in our data, so that people can comprehend and discuss it
  2. As a possible method for dealing with collinearity problems in regression analysis

Often, our data will have multiple metrics that attempt to capture the theoretical concepts with which an analysis grapples. Such situations are quite common in psychology, where a surveyor might use tens of questions to measure one respondent trait. For example, the Minnesota Multiphasic Personality Inventory (a well-known psychopathology test) asks 567 questions to quantify people’s psyches along ten trait scales: hypochondriasis, depression, hysteria, psychopathic deviate, masculinity/femininity, paranoia, psychasthenia, schizophrenia, hypomania, and social introversion. To make sense of this many variables, we have to boil them down to a smaller number: we need to convert the 567 item responses into 10 composite metrics. The human mind can only process so much complexity, and such a reduction allows people to make sense of an analysis involving many questions.

Of course, one way to reduce the number of variables in a set is to throw variables away. Alternatively, an analyst can reduce the number of variables by blending them: combining multiple variables into composite measures, or “indexes”. An index is a numerical score that summarizes the behavior of multiple metrics.

Some well-known indexes:

  • The Dow Jones Industrial Average is a metric that represents the behavior of the whole stock market by averaging the price performance of 30 stocks.
  • The United Nations’ Human Development Index (HDI) is a metric that measures socio-economic development based on countries’ life expectancies, mean/expected years of schooling, and gross national income per capita.
  • The Economist publishes a Quality-of-Life index that ranks places to live by a range of material wellbeing, health, political, and social metrics.
  • ESPN’s MLB Relative Power Index rates the strength of Major League Baseball teams based on a team’s winning percentage, opponents’ winning percentage, and opponents’ opponents’ average winning percentage.

Illustration

The methods that we will learn this week can be used to develop and validate efforts to consolidate and reduce variables in a data set. Figure 1 (below) depicts a hypothetical indexation operation.

In this figure, we would describe “Academic Promise” as an underlying concept or latent variable, which is a variable that we cannot observe directly. It is latent – hiding beneath the surface. In contrast, GPA, SAT, and IQ scores are observed variables – we see these scores, because they are in our database.

The simplest way to index these scores is to (1) standardize each metric (to put them on the same scale) and (2) take the average of the three standardized scores. However, doing so assumes that each of the three metrics is truly related to a common underlying concept (academic promise) and that all are equally important in measuring it. Moreover, this simple method assumes that you know how to measure “academic promise” in the first place – what if you have no idea? Factor analysis gives you a suite of tools to solve these kinds of problems.
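
To make the two-step recipe concrete, here is a minimal Python sketch using pandas. The data frame, column names, and scores are all hypothetical, and the equal weighting of the three standardized scores reflects the simplifying assumption described above.

    import pandas as pd

    def simple_index(df, columns):
        """Standardize each column (z-scores), then average them row by row."""
        z = (df[columns] - df[columns].mean()) / df[columns].std()
        return z.mean(axis=1)

    # Hypothetical student records
    students = pd.DataFrame({
        "gpa": [3.9, 2.7, 3.2, 3.6],
        "sat": [1480, 1050, 1200, 1390],
        "iq":  [128, 97, 105, 118],
    })
    students["academic_promise"] = simple_index(students, ["gpa", "sat", "iq"])
    print(students.sort_values("academic_promise", ascending=False))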

We assume that our observed variables are related because of their association with a common underlying latent variable. In other words, we expect correlations between GPAs, SATs, and IQs because we presume that they are all outgrowths of one difficult-to-observe quality: academic promise.
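
A small simulation can illustrate this assumption. In the sketch below, a hidden “academic promise” score is invented and used to generate three noisy indicators; the indicators end up correlated with one another even though the latent variable itself never appears in the resulting data set. The coefficients and sample size are arbitrary.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(42)
    n = 500

    promise = rng.normal(0, 1, n)  # latent variable (never observed directly)
    observed = pd.DataFrame({
        "gpa": 0.8 * promise + rng.normal(0, 0.6, n),  # each observed score is a
        "sat": 0.7 * promise + rng.normal(0, 0.7, n),  # noisy reflection of the
        "iq":  0.6 * promise + rng.normal(0, 0.8, n),  # same underlying trait
    })

    # The observed scores correlate because they share a common latent cause.
    print(observed.corr().round(2))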

Methods

There are many commonly-used operations to consolidate metrics. We will focus on four methods that can be used in active, supervised data analysis:

  • Simple Indexation, combining variables by averaging standardized variables
  • Cronbach’s Alpha, a quick-and-dirty (and commonly-used) method for justifying some combination of variables as related to an underlying latent variable (a brief sketch of its calculation appears after this list).
  • Exploratory Factor Analysis, which we use to search for common factors or latent variables in a larger number of variables.
  • Confirmatory Factor Analysis, a more sophisticated method for testing the relationships among metrics tied to a common underlying concept.
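
To make Cronbach’s alpha concrete, the sketch below computes it directly from its textbook formula: alpha = (k / (k - 1)) * (1 - sum of item variances / variance of the total score). The item names and scores are invented; in practice you would typically rely on a statistics package rather than hand-rolled code.

    import pandas as pd

    def cronbach_alpha(items):
        """Cronbach's alpha for items (columns) measured on the same respondents (rows)."""
        k = items.shape[1]
        item_variances = items.var(axis=0, ddof=1).sum()
        total_variance = items.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances / total_variance)

    # Hypothetical responses to three survey items from five respondents
    items = pd.DataFrame({
        "item1": [4, 3, 5, 2, 4],
        "item2": [5, 3, 4, 2, 5],
        "item3": [4, 2, 5, 3, 4],
    })
    print(round(cronbach_alpha(items), 2))  # about 0.89: the items hang together well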

In addition, there are many passive, more algorithmically-implemented methods for reducing data. We will reserve these for a later lesson. It is better to begin with a firsthand engagement with the operations that data-mining algorithms perform on your behalf.