Key Considerations in Data Analysis

  • Identify the purpose of the analysis or project
  • Understand the sample(s) under study
  • Understand the instruments being used to collect data
  • Be cognizant of data layouts and formats
  • Establish a unique identifier if matching or merging is necessary
  • Plan your work and work your plan!
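
As a rough illustration of why a unique identifier matters when merging, the sketch below joins two small record sets on a shared key and refuses to proceed if the key is duplicated in either file. The field name `student_id`, the helper `merge_on_id`, and the sample records are hypothetical and not taken from any particular project.

```python
def merge_on_id(left, right, key="student_id"):
    """Inner-join two lists of dict records on a key that must be unique."""
    # Quality-control step: a merge key that repeats makes matches ambiguous.
    for name, rows in (("left", left), ("right", right)):
        ids = [row[key] for row in rows]
        if len(ids) != len(set(ids)):
            raise ValueError(f"{name} file has duplicate {key} values")

    right_by_id = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_id.get(row[key])
        if match is not None:  # keep only records present in both files
            merged.append({**row, **match})
    return merged


# Hypothetical sample data
scores = [{"student_id": "A01", "score": 512}, {"student_id": "A02", "score": 478}]
demographics = [{"student_id": "A02", "grade": 8}, {"student_id": "A03", "grade": 7}]

merged = merge_on_id(scores, demographics)
print(merged)
```

Only `A02` appears in both files, so it is the only record that survives this inner join; whether unmatched records should be kept instead is a decision for the analysis plan.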

Components of a Data Analysis Plan

  • Statement of research questions
  • Methods used to answer research questions
  • Timeline
  • Budget
  • File restructuring procedures (syntax creation, adding new variables as needed)
  • Algorithms for scoring, equating, etc.
  • Data cleaning procedures (e.g. removing outliers)
  • Quality control procedures at every step in the project
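
One possible way to script a cleaning rule such as outlier removal is sketched below. It assumes the common 1.5 × IQR rule, which is only one of several conventions and is not prescribed by the plan above; the function name and sample values are hypothetical. The final check is a small example of quality control at a single step.

```python
import statistics


def remove_outliers_iqr(values, k=1.5):
    """Split values into (kept, removed) using the k * IQR fence rule."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    removed = [v for v in values if not (low <= v <= high)]
    return kept, removed


# Hypothetical raw scores with one implausible value
raw = [52, 55, 57, 58, 60, 61, 63, 64, 66, 250]
kept, removed = remove_outliers_iqr(raw)

# Quality-control check: every record is accounted for after cleaning.
assert len(kept) + len(removed) == len(raw)
print("kept:", kept)
print("removed:", removed)
```

Returning the removed values rather than silently dropping them makes it easier to document what the cleaning step did.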

Examples of Analyses

  • Frequency Distributions and Cross-Tabulations
  • Descriptive Statistics (Means, Std. Deviations, Correlations)
  • T-tests and Analysis of Variance (ANOVA)
  • Regression
  • Principal Components/Factor Analysis (Data Reduction)
  • Cluster and Discriminant Analyses (Segmentation)
  • Latent Class Analysis (Classification)
  • Hierarchical Linear Modeling (HLM)
  • Differential Item Functioning (DIF)
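
The first two items in the list above can be sketched with nothing more than the Python standard library. The records and variable names here are invented for illustration; a statistical package would normally produce these tables directly.

```python
from collections import Counter
import statistics

# Hypothetical records: one dict per respondent
records = [
    {"gender": "F", "passed": "yes", "score": 82},
    {"gender": "F", "passed": "no", "score": 58},
    {"gender": "M", "passed": "yes", "score": 75},
    {"gender": "M", "passed": "yes", "score": 91},
    {"gender": "F", "passed": "yes", "score": 88},
]

# Frequency distribution of a single variable
freq = Counter(r["passed"] for r in records)

# Cross-tabulation: counts for each (gender, passed) combination
crosstab = Counter((r["gender"], r["passed"]) for r in records)

# Descriptive statistics for a numeric variable
scores = [r["score"] for r in records]
mean_score = statistics.mean(scores)
sd_score = statistics.stdev(scores)  # sample standard deviation (n - 1)

print(dict(freq))
print(dict(crosstab))
print(mean_score, round(sd_score, 2))
```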

The More Advanced the Analysis, the Greater the Amount of Preparation

  • Most analyses can be executed straight from a working data file
  • Some analyses may require transformations of the raw data, subsets, or specific input data to comply with statistical software

Data Types & Representation

Variables may require special coding depending on how the data are represented:

  • Numeric
  • String
  • Date & time
  • Monetary

ASCII Text Files

  • Usually rectangular in structure
  • One record per observation
  • Each data variable in same position on each record
  • Each record may have multiple instances of data
      • Arrays
      • Repeating blocks (sets of variables)
  • File may have multiple records per observation
      • Number of records per observation can be variable
  • Most government data files come in this format at a minimum
  • Every software package can handle this file type
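
Because each variable sits in the same position on every record, a fixed-width file can be read by slicing each line at known column positions. The sketch below assumes a made-up layout (`id` in columns 1-4, `age` in 5-6, `score` in 7-9); a real file's layout would come from its codebook.

```python
# Hypothetical layout: variable name -> (start, end), 0-based, end exclusive
LAYOUT = {"id": (0, 4), "age": (4, 6), "score": (6, 9)}


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width record into named fields."""
    fields = {name: line[start:end].strip() for name, (start, end) in layout.items()}
    # Convert the numeric columns; the ID stays a string to preserve its format.
    fields["age"] = int(fields["age"])
    fields["score"] = int(fields["score"])
    return fields


# Two invented records, one per observation
raw_lines = ["A00109512", "A00210498"]
rows = [parse_record(line) for line in raw_lines]
print(rows)
```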

CSV Files (Comma-Separated Values)

  • Individual data elements separated by commas
  • Usually rectangular structure
  • One record (line) per observation
  • Fixed number of elements on each record
  • Problems arise if data elements (e.g. text strings) contain the delimiter or blank spaces
  • Missing data must be represented by nulls
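
The delimiter and missing-data issues above can be seen in a small round trip with Python's `csv` module. This is only a sketch with invented data: the writer quotes the field that contains a comma, and the missing score is written as an empty field, which the reading code then maps back to `None`.

```python
import csv
import io

# Hypothetical records: one school name contains a comma, one score is missing
rows = [
    {"id": "A001", "school": "Lincoln, East Campus", "score": 512},
    {"id": "A002", "school": "Roosevelt", "score": None},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "school", "score"])
writer.writeheader()
writer.writerows(rows)  # None is written as an empty (null) field
text = buffer.getvalue()
print(text)

# Reading back: the quoted comma stays inside one field
read_back = []
for record in csv.DictReader(io.StringIO(text)):
    # Treat an empty score as missing rather than as zero
    record["score"] = int(record["score"]) if record["score"] else None
    read_back.append(record)
print(read_back)
```

Without the quoting, the first record would appear to have four elements instead of three and would no longer line up with the header.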

System Files (SAS®, SPSS®)

  • Data stored in binary (machine) format
  • Issues of portability across platforms
  • Structured as rectangular tables
  • SAS files can be indexed for direct access
  • Self-contained documentation
      • Data variable labels & formats
      • Data value labels
  • Most analysis packages provide facility for reading (but not writing!) system files from other packages (SPSS more than SAS)
  • Using default data formats can yield system files that are much larger than source ASCII files
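
To see why default formats can inflate file size, compare storing a one-digit item response as a single ASCII character with storing it as an 8-byte floating-point number, a common default width for numeric variables. The sketch below uses Python's `struct` module purely to count bytes; it does not reproduce any package's actual file format, and the responses are invented.

```python
import struct

# Hypothetical one-digit item responses for a single record
responses = [1, 0, 1, 1, 0, 2, 1, 0]

# ASCII text: one byte per single-digit response
ascii_bytes = len("".join(str(r) for r in responses).encode("ascii"))

# Binary doubles: eight bytes per response, regardless of the value stored
binary_bytes = len(struct.pack(f"<{len(responses)}d", *(float(r) for r in responses)))

print(ascii_bytes, binary_bytes)  # 8 64
```

Choosing a narrower storage format for such variables, where the software allows it, is one way to keep system files closer to the size of the source data.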

Overview of Software

  • SAS and SPSS are the most commonly used and tend to focus on the “classic” statistical routines:
      • Descriptive statistics and non-parametric (“distribution-free”) tests
      • ANOVA / Regression
      • Factor analysis
  • However, many psychometric procedures (e.g. IRT) and newer statistical models are not as well supported by these programs
  • Very specialized programs are used
      • Designed to do a specific task or validate a theory
  • Specialized programs may have issues
      • Interface not very user-friendly
      • Additional data types or files required
      • Expense

What is Excel?

  • Data are organized by worksheets, rows and columns
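  • Worksheet limits are 256 columns and 65,536 rows in the legacy .xls format (Excel 2003 and earlier); Excel 2007 and later allow 16,384 columns and 1,048,576 rows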
  • Cells contain data or formulas with relative or absolute references to other cells
  • Direct manipulation of data and flexibility to move data “around” (e.g. sorting, replacing, merging)
  • Opens many file types
  • Quite useful in prepping files for use in SPSS, SAS or other programs
  • Conditional formatting
  • Also features macro capabilities, replicating user actions, allowing simple automation of regular tasks

Data Presentation Options in Excel

  • Tables and graphs can be exported to a wide variety of software packages
  • Can tweak and perfect example graph or table and then replicate by replacing only the data being used
  • Main advantage is ability to combine data from multiple sources – not just what is found in the data file
  • “Two-for-One” deal – table creation usually puts data into a format that leads to easy graph creation
  • User has control over virtually all aspects of a graph – size, colors, fonts, titles, legends, labels, etc.
  • Can combine graphs with tables and use cell layout to produce more complex presentations
  • Final graphs can be of “publication” quality

What is SAS?

  • A general purpose statistical package with a basic programming capability utilizing scores of statistical and mathematical functions in numerous “modules”
  • Can readily access data from a wide variety of sources, perform data management, and present findings in a variety of report and graph formats
  • Provides powerful tools for both specialized and enterprise-wide analytical needs

SAS – Strengths

  • Versatile data input and output formats
  • SAS provides both SQL and DATA steps to manipulate data:
  • SQL provides a way of carrying out relational algebra on tables and views
  • SAS data sets can be indexed for direct access or processed sequentially, without reading all records into memory, which is sometimes much more efficient
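
The relational idea behind the SQL bullet can be illustrated outside SAS. The sketch below uses Python's built-in `sqlite3` module, not SAS PROC SQL, to join two small tables and aggregate by group; the table names, column names, and values are hypothetical. Iterating over the cursor also shows rows being consumed one at a time rather than loaded all at once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute("CREATE TABLE students (student_id TEXT PRIMARY KEY, school TEXT)")
conn.execute("CREATE TABLE scores (student_id TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?)",
    [("A01", "North"), ("A02", "North"), ("A03", "South")],
)
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("A01", 500), ("A02", 520), ("A03", 480)],
)

# Join the two tables on the shared identifier, then summarize by school
query = """
    SELECT s.school, COUNT(*) AS n, AVG(sc.score) AS mean_score
    FROM students AS s
    JOIN scores AS sc ON sc.student_id = s.student_id
    GROUP BY s.school
    ORDER BY s.school
"""
results = []
for row in conn.execute(query):  # rows are fetched sequentially from the cursor
    results.append(row)
conn.close()

print(results)
```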

SAS – Weaknesses

  • Steep learning curve; volume of functions, options and documentation can be overwhelming for the novice
  • Inconsistent syntax across different procedures or modules
  • Not a good choice for applications that interact with external systems such as hardware devices or software programs because of its inconvenient interface
  • Difficult interaction with other programming languages
  • Expensive

What is SPSS?

  • A commercially produced statistical software package that is widely used in the fields of Education and Psychology
  • Program functionality is broken into over a dozen different modules which are sold individually
  • Most commonly used are Base, Regression Models, and Advanced Models
  • Other modules can be installed to run more complex analyses
  • SPSS data files include both the data and also variable information (variable and value labels, formats and missing values)

SPSS – Strengths

  • Easily opens data from other programs such as Excel and SAS
  • Variable view screen allows for quick overview of file contents and allows for easy modifications of names, formats, labels, and variable order
  • Having all data information in a single file allows sharing files on a project to be very easy
  • Point-and-click menus do not require memorizing syntax for majority of procedures
  • Many procedures can be expanded beyond the menu options in syntax
  • Split-file command allows all output to be replicated for various groups through a single command
  • Journal file tracks all commands used over the life of the program, making it a good resource for recovering code that was accidentally deleted
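
The idea behind a split-file command, running the same summary once per group, can be sketched in a few lines of Python. This is an analogy rather than SPSS syntax; the function name, grouping variable, and data are hypothetical.

```python
from collections import defaultdict
import statistics


def split_file_summary(records, group_key, value_key):
    """Repeat the same summary (n and mean) separately for each group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {
        group: {"n": len(values), "mean": statistics.mean(values)}
        for group, values in sorted(groups.items())
    }


# Hypothetical data
records = [
    {"region": "East", "score": 70},
    {"region": "West", "score": 80},
    {"region": "East", "score": 90},
]
summary = split_file_summary(records, "region", "score")
print(summary)
```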

SPSS – Weaknesses

  • Ease of doing data manipulation can sometimes lead to mistakes as the program does not preclude inappropriate modifications to the data
  • Matching feature requires exact match
  • Duplicate records generate warnings but can be marked in file
  • Error logs are hard to interpret at times
  • Incompleteness of menus means some options are only available via syntax
  • Although most output is saved as pivot tables, which allow great flexibility in modifying tables, output tables and graphs are generally not as polished as those from Excel and are harder to manipulate

LISREL

  • Ideal for discrete data types
      • Test data, Likert scale item data
  • Data can be imported in various types
      • ASCII, Access, Excel, SAS, SPSS, etc.
      • Variable names have length restrictions
  • Data files then stored as system files for later use
  • Basic statistics (e.g. means and correlations) are generated in an underlying program called PRELIS
  • LISREL itself is used to confirm the structural validity of a measurement model for any assessment
  • Requires syntax and input matrices

EQS

  • Ideal for continuous data types (test subscales)
  • Data can be imported in various types – ASCII, Access, Excel, SAS, SPSS, etc. but variable names have length restrictions
  • Data files then stored as system files for later use
  • EQS itself is used to confirm the structural validity of a measurement model for any assessment
  • Some model syntax can be built through the menus

HLM

  • Hierarchical Linear Modeling (HLM) is becoming a more popular type of analysis, particularly in cohort trend modeling
  • Also allows you to look at variance component estimates and regression models given a nested sample of respondents
      • Example: students within countries within global regions, measured on personality variables
  • More tedious to set up analysis with fewer available file types
  • Also requires more upfront work as multiple data files are needed
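
For readers unfamiliar with variance components, a simplified two-level version of the nested example (students within countries, omitting the global-region level) can be written as follows. This is the standard unconditional random-intercept model, offered as a sketch rather than as the specification used by any particular program; the symbols are introduced here for illustration.

```latex
\begin{align*}
% Level 1: outcome for student i in country j
Y_{ij} &= \beta_{0j} + r_{ij}, & r_{ij} &\sim N(0, \sigma^2) \\
% Level 2: country means vary around the grand mean
\beta_{0j} &= \gamma_{00} + u_{0j}, & u_{0j} &\sim N(0, \tau_{00}) \\
% Intraclass correlation: share of variance lying between countries
\rho &= \frac{\tau_{00}}{\tau_{00} + \sigma^2}
\end{align*}
```

Here $\sigma^2$ is the within-country variance component and $\tau_{00}$ is the between-country component. The separate equations for each level also suggest why multiple input data files, one per level, are typically needed.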

What Program Should I Use?

  • Microsoft Excel is the most basic and accessible spreadsheet program available today
  • It is best suited to general data exploration, histograms, scatter plots, etc.
  • Appearance of tables can be customized to meet APA standards
  • Allows for easy transition to other programs to complete analyses and write reports
  • However, its heritage is not as a statistical analysis program
  • Certain statistical programs are designed for specific analytic tasks
  • Balance the results against what will be presented
  • Choose wisely in the interests of efficiency and accuracy of results
  • Some output is good for looking at the data through basic exploration and to generate basic tables, but not to present the data

Summary

  • Be very clear about the analysis objectives
  • Be very familiar with all aspects of what defines your data
  • Develop and stay true to your data analysis plans and research questions
  • Be cognizant of which statistical software programs can best answer your research questions and present your results
  • Be thorough in your analyses, express openness to additional investigations, yet be mindful of limitations given the data and the programs you are using