Key Considerations in Data Analysis
- Identify the purpose of the analysis or project
- Understand the sample(s) under study
- Understand the instruments being used to collect data
- Be cognizant of data layouts and formats
- Establish a unique identifier if matching or merging is necessary
- Plan your work and work your plan!
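The point about unique identifiers can be sketched in a few lines of Python. The field names (`student_id`, `score`, `school`) and the helpers `find_duplicate_ids` and `merge_on_id` are hypothetical, invented for illustration. The idea is to confirm that the key is unique in each file before matching, and to report what failed to match.

```python
from collections import Counter

def find_duplicate_ids(records, key):
    """Return the sorted key values that appear more than once."""
    counts = Counter(rec[key] for rec in records)
    return sorted(k for k, n in counts.items() if n > 1)

def merge_on_id(left, right, key):
    """One-to-one merge of two lists of dicts on a unique identifier.

    Raises ValueError if the key is duplicated in either file.
    Returns (merged, unmatched_left_ids, unmatched_right_ids).
    """
    for name, recs in (("left", left), ("right", right)):
        dups = find_duplicate_ids(recs, key)
        if dups:
            raise ValueError(f"duplicate {key} values in {name} file: {dups}")
    right_by_id = {rec[key]: rec for rec in right}
    left_ids = {rec[key] for rec in left}
    merged = [{**rec, **right_by_id[rec[key]]}
              for rec in left if rec[key] in right_by_id]
    unmatched_left = sorted(left_ids - set(right_by_id))
    unmatched_right = sorted(set(right_by_id) - left_ids)
    return merged, unmatched_left, unmatched_right

# Hypothetical example files
scores = [{"student_id": 101, "score": 88},
          {"student_id": 102, "score": 75},
          {"student_id": 103, "score": 91}]
schools = [{"student_id": 101, "school": "A"},
           {"student_id": 103, "school": "B"},
           {"student_id": 104, "school": "C"}]

merged, only_in_scores, only_in_schools = merge_on_id(scores, schools, "student_id")
```

Returning the unmatched identifiers alongside the merged records makes it easy to log how many cases were lost at the merge step.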
Components of a Data Analysis Plan
- Statement of research questions
- Methods used to answer research questions
- Timeline
- Budget
- File restructuring procedures (syntax creation, adding new variables as needed)
- Algorithms for scoring, equating, etc.
- Data cleaning procedures (e.g. removing outliers)
- Quality control procedures at every step in the project
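As one possible illustration of a cleaning step with a built-in quality check, the sketch below flags outliers with the common 1.5 × IQR rule. The rule, the multiplier `k`, the function names, and the sample scores are all assumptions made for this example; the plan itself should state whichever rule the project actually adopts.

```python
from statistics import quantiles

def iqr_fences(values, k=1.5):
    """Lower and upper fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def remove_outliers(values, k=1.5):
    """Split values into (kept, removed) using the IQR fences."""
    low, high = iqr_fences(values, k)
    kept = [v for v in values if low <= v <= high]
    removed = [v for v in values if not (low <= v <= high)]
    # Quality control: every record is accounted for exactly once.
    assert len(kept) + len(removed) == len(values)
    return kept, removed

# Hypothetical raw scores with one implausible value
raw_scores = [12, 14, 15, 15, 16, 17, 18, 19, 95]
kept, removed = remove_outliers(raw_scores)
```

Keeping the removed values, rather than silently dropping them, leaves an audit trail for the quality control review.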
Examples of Analyses
- Frequency Distributions and Cross-Tabulations
- Descriptive Statistics (Means, Std. Deviations, Correlations)
- T-tests and Analysis of Variance (ANOVA)
- Regression
- Principal Components/Factor Analysis (Data Reduction)
- Cluster and Discriminant Analyses (Segmentation)
- Latent Class Analysis (Classification)
- Hierarchical Linear Modeling (HLM)
- Differential Item Functioning (DIF)
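The first two items in the list above (frequency distributions, cross-tabulations, and descriptive statistics) need nothing beyond a standard library. The sketch below uses a small made-up data set; the variable names and values are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical records: (gender, outcome, score)
records = [
    ("F", "pass", 82), ("M", "fail", 55), ("F", "pass", 90),
    ("M", "pass", 71), ("F", "fail", 58), ("M", "pass", 76),
]

# Frequency distribution of a single variable
outcome_freq = Counter(outcome for _, outcome, _ in records)

# Cross-tabulation: counts for each (gender, outcome) cell
crosstab = Counter((gender, outcome) for gender, outcome, _ in records)

# Descriptive statistics for the score variable
scores = [score for _, _, score in records]
score_mean = mean(scores)
score_sd = stdev(scores)  # sample standard deviation (n - 1 denominator)

for (gender, outcome), count in sorted(crosstab.items()):
    print(f"{gender} {outcome:<5} {count}")
print(f"mean = {score_mean:.2f}, sd = {score_sd:.2f}")
```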
The More Advanced the Analysis, the Greater the Amount of Preparation
- Most analyses can be executed straight from a working data file
- Some analyses may require transformations of the raw data, subsets, or specific input data to comply with statistical software
Data Types & Representation
Variables may require special coding for different data representation
- Numeric
- String
- Date & time
- Monetary
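A brief sketch of what "special coding" can mean in practice: the same raw text has to be converted differently depending on the intended data type. The field names, the date format `%m/%d/%Y`, and the helper functions are assumptions chosen for illustration.

```python
from datetime import date, datetime
from decimal import Decimal

def parse_money(text):
    """Strip the currency symbol and thousands separators; keep exact cents."""
    return Decimal(text.replace("$", "").replace(",", ""))

def parse_date(text, fmt="%m/%d/%Y"):
    """Convert a date string in the given format to a date object."""
    return datetime.strptime(text, fmt).date()

# Hypothetical raw record as read from a text file (everything is a string)
raw = {"id": "0042", "age": "17", "test_date": "03/15/2004", "fee": "$1,234.50"}

typed = {
    "id": raw["id"],                            # string: leading zeros preserved
    "age": int(raw["age"]),                     # numeric
    "test_date": parse_date(raw["test_date"]),  # date
    "fee": parse_money(raw["fee"]),             # monetary, exact decimal
}
```

Keeping the identifier as a string avoids losing leading zeros, and `Decimal` avoids the rounding surprises of binary floating point for currency.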
ASCII Text Files
- Usually rectangular in structure
- One record per observation
- Each data variable in same position on each record
- Each record may have multiple instances of data
- Arrays
- Repeating blocks (sets of variables)
- File may have multiple records per observation
- Number of records per observation can be variable
- Most government data files come in this format at a minimum
- Every software package can handle this file type
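A minimal sketch of reading a fixed-position ASCII file with a repeating block. The layout (a 4-character ID, a 1-character gender code, then three 2-character item scores) is hypothetical; a real file's layout comes from its codebook.

```python
# Column positions are zero-based, end-exclusive: (start, end)
LAYOUT = {"id": (0, 4), "gender": (4, 5)}
ITEM_START, ITEM_WIDTH, N_ITEMS = 5, 2, 3   # repeating block of item scores

def parse_record(line):
    """Slice one fixed-width record into named fields plus an item array."""
    rec = {name: line[a:b].strip() for name, (a, b) in LAYOUT.items()}
    rec["items"] = [
        int(line[ITEM_START + i * ITEM_WIDTH: ITEM_START + (i + 1) * ITEM_WIDTH])
        for i in range(N_ITEMS)
    ]
    return rec

# Two hypothetical records, one per observation
raw = "0001F101208\n0002M 91107\n"
records = [parse_record(line) for line in raw.splitlines()]
```

Because every variable sits in the same position on every record, no delimiter is needed, but the layout must be known in advance.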
CSV Files (Comma-Separated Values)
- Individual data elements separated by commas
- Usually rectangular structure
- One record (line) per observation
- Fixed number of elements on each record
- Problems arise if data elements (e.g. text strings) contain the delimiter or blank spaces
- Missing data must be represented by nulls
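The two CSV pitfalls noted above, embedded delimiters and missing values, can be illustrated with Python's standard `csv` module. The sample data and the choice to map empty fields to `None` are assumptions for this sketch.

```python
import csv
import io

# The name field contains a comma, so it must be quoted;
# the second record has a missing (null) score.
raw = 'id,name,score\n1,"Smith, Jane",88\n2,Lee,\n'

def read_rows(text):
    """Read CSV text into dicts, mapping empty fields to None."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({k: (v if v != "" else None) for k, v in row.items()})
    return rows

rows = read_rows(raw)

# Writing: the csv module adds quotes where a field contains the delimiter
# and writes None as an empty (null) field.
buffer = io.StringIO()
csv.writer(buffer).writerow([3, "Doe, John", None])
written = buffer.getvalue()
```

Splitting each line on commas by hand would break the first record into four fields, which is why a quoting-aware reader is the safer choice.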
System Files (SAS®, SPSS®)
- Data stored in binary (machine) format
- Issues of portability across platforms
- Structured as rectangular tables
- SAS files can be indexed for direct access
- Self-contained documentation
- Data variable labels & formats
- Data value labels
- Most analysis packages provide facility for reading (but not writing!) system files from other packages (SPSS more than SAS)
- Using default data formats can yield system files that are much larger than source ASCII files
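The size difference mentioned in the last bullet can be illustrated with a rough sketch. It assumes, for illustration only, that a default numeric format stores every value as an 8-byte floating-point number, while the source ASCII file needs a single character per one-digit response. Actual system-file sizes depend on the package, its compression options, and the declared variable lengths.

```python
import struct

# 1000 hypothetical one-digit item responses (values 1-5)
responses = [i % 5 + 1 for i in range(1000)]

# ASCII representation: one character per response
as_text = "".join(str(r) for r in responses).encode("ascii")

# Binary representation: each response stored as an 8-byte double
as_doubles = struct.pack(f"<{len(responses)}d", *(float(r) for r in responses))

print(len(as_text), "bytes as ASCII;", len(as_doubles), "bytes as 8-byte doubles")
```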
Overview of Software
- SAS and SPSS are most commonly used and tend to focus on the “classic” statistical routines:
- Descriptive statistics and non-parametric (“distribution-free”) tests
- ANOVA / Regression
- Factor analysis
- However, many psychometric procedures (e.g. IRT) and newer statistical models are not as well supported by these programs
- Very specialized programs are used
- Designed to do a specific task or validate a theory
- Specialized programs may have issues
- Interface not very user-friendly
- Additional data types or files required
- Expense
What is Excel?
- Data are organized by worksheets, rows and columns
- Worksheet limits are 256 columns and 65,536 rows (Excel 97-2003 .xls format; newer .xlsx worksheets allow 16,384 columns and 1,048,576 rows)
- Cells contain data or formulas with relative or absolute references to other cells
- Direct manipulation of data and flexibility to move data “around” (e.g. sorting, replacing, merging)
- Opens many file types
- Quite useful in prepping files for use in SPSS, SAS or other programs
- Conditional formatting
- Also features macro capabilities, replicating user actions, allowing simple automation of regular tasks
Data Presentation Options in Excel
- Tables and graphs can be exported to a wide variety of software packages
- Can tweak and perfect an example graph or table and then replicate it by replacing only the underlying data
- Main advantage is ability to combine data from multiple sources – not just what is found in the data file
- “Two-for-One” deal – table creation usually puts data into a format that leads to easy graph creation
- User has control over virtually all aspects of a graph – size, colors, fonts, titles, legends, labels, etc.
- Can combine graphs with tables and use cell layout to produce more complex presentations
- Final graphs can be of “publication” quality
What is SAS?
- A general purpose statistical package with a basic programming capability utilizing scores of statistical and mathematical functions in numerous “modules”
- Can readily access data from a wide variety of sources, perform data management, and present findings in a variety of report and graph formats
- Provides powerful tools for both specialized and enterprise-wide analytical needs
SAS – Strengths
- Versatile data input and output formats
- SAS provides both SQL and DATA steps to manipulate data:
- SQL provides a way of carrying out relational algebra on tables and views
- SAS data sets can be indexed for direct access or processed sequentially, without reading all records into memory, which is sometimes much more efficient
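The two ideas above, relational operations on tables and sequential processing without loading every record into memory, are not specific to SAS. The sketch below uses Python's built-in `sqlite3` module as a stand-in; it is not SAS code, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (student_id INTEGER PRIMARY KEY, school TEXT);
CREATE TABLE scores (student_id INTEGER, score REAL);
CREATE INDEX idx_scores_id ON scores (student_id);  -- index for direct access
INSERT INTO students VALUES (1, 'A'), (2, 'A'), (3, 'B');
INSERT INTO scores VALUES (1, 80), (2, 90), (3, 70);
""")

# Relational step: join the two tables, then aggregate by school
by_school = conn.execute("""
    SELECT s.school, COUNT(*), AVG(x.score)
    FROM students AS s
    JOIN scores AS x ON x.student_id = s.student_id
    GROUP BY s.school
    ORDER BY s.school
""").fetchall()

def running_mean(cursor):
    """Sequential step: consume one row at a time, keeping only totals."""
    n = 0
    total = 0.0
    for (score,) in cursor:
        n += 1
        total += score
    return total / n

overall_mean = running_mean(conn.execute("SELECT score FROM scores"))
conn.close()
```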
SAS – Weaknesses
- Steep learning curve; volume of functions, options and documentation can be overwhelming for the novice
- Inconsistent syntax across different procedures or modules
- Not a good choice for applications that interact with external systems such as hardware devices or software programs because of its inconvenient interface
- Difficult interaction with other programming languages
- Expensive
What is SPSS?
- A commercially produced statistical software package that is widely used in the fields of Education and Psychology
- Program functionality is broken into over a dozen different modules which are sold individually
- Most commonly used are Base, Regression Models, and Advanced Models
- Other modules can be installed to run more complex analyses
- SPSS data files include both the data and also variable information (variable and value labels, formats and missing values)
SPSS – Strengths
- Easily opens data from other programs such as Excel and SAS
- Variable view screen allows for quick overview of file contents and allows for easy modifications of names, formats, labels, and variable order
- Having all data information in a single file makes sharing files on a project very easy
- Point-and-click menus do not require memorizing syntax for majority of procedures
- Many procedures can be expanded beyond the menu options in syntax
- Split-file command allows all output to be replicated for various groups through a single command
- Journal file tracks all commands used over the life of the program, making it a good resource for recovering code that was accidentally deleted
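The split-file idea, running the same procedure once per group, can be sketched outside SPSS as follows. This is a Python analogue under made-up data, not SPSS syntax; the function names `split_file` and `describe` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def split_file(records, group_key):
    """Partition records by the value of group_key, sorted by group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[group_key]].append(rec)
    return dict(sorted(groups.items()))

def describe(records, var):
    """The 'procedure' to be replicated for each group."""
    values = [rec[var] for rec in records]
    return {"n": len(values), "mean": mean(values)}

# Hypothetical data file
data = [
    {"school": "A", "score": 80}, {"school": "B", "score": 70},
    {"school": "A", "score": 90}, {"school": "B", "score": 60},
    {"school": "A", "score": 85},
]

by_school = {group: describe(recs, "score")
             for group, recs in split_file(data, "school").items()}
```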
SPSS – Weaknesses
- Ease of doing data manipulation can sometimes lead to mistakes as the program does not preclude inappropriate modifications to the data
- Matching feature requires exact match
- Duplicate records generate warnings but can be marked in file
- Error logs are hard to interpret at times
- Incompleteness of menus means some options are only available via syntax
- Although the majority of output is saved as pivot tables, which allow great flexibility in modifying tables, output tables and graphs are generally not done as well as in Excel and are harder to manipulate
LISREL
- Ideal for discrete data types
- Test data, Likert scale item data
- Data can be imported in various types
- ASCII, Access, Excel, SAS, SPSS, etc.
- Variable names have length restrictions
- Data files then stored as system files for later use
- Basic statistics (e.g. means and correlations) are generated in an underlying program called PRELIS
- LISREL itself is used to confirm the structural validity of a measurement model for any assessment
- Requires syntax and input matrices
EQS
- Ideal for continuous data types (test subscales)
- Data can be imported in various types – ASCII, Access, Excel, SAS, SPSS, etc. but variable names have length restrictions
- Data files then stored as system files for later use
- EQS itself is used to confirm the structural validity of a measurement model for any assessment
- Some model syntax can be built through the menus
HLM
- Hierarchical Linear Modeling (HLM) is becoming a more popular type of analysis, particularly in cohort trend modeling
- Also allows you to look at variance component estimates and regression models given a nested sample of respondents
- Students within countries within global regions on personality variables
- More tedious to set up analysis with fewer available file types
- Also requires more upfront work as multiple data files are needed
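The upfront work of preparing multiple data files can be sketched as splitting one flat file into a lower-level file and a higher-level file linked by a shared ID. The example below is a simplified two-level version (students within countries, with region kept as a country-level variable); the field names, the helper `split_levels`, and the sorting convention are assumptions for illustration, so the exact file requirements should be checked against the HLM software's documentation.

```python
def split_levels(rows, group_id, unit_id, level2_vars):
    """Split flat records into level-1 and level-2 files linked by group_id."""
    # Level 1: one row per respondent, without the group-level variables
    level1 = sorted(
        ({k: v for k, v in r.items() if k not in level2_vars} for r in rows),
        key=lambda r: (r[group_id], r[unit_id]),
    )
    # Level 2: one row per group; group-level values must be consistent
    level2 = {}
    for r in rows:
        unit = {group_id: r[group_id], **{v: r[v] for v in level2_vars}}
        if level2.setdefault(r[group_id], unit) != unit:
            raise ValueError(f"inconsistent level-2 data for {r[group_id]}")
    return level1, [level2[g] for g in sorted(level2)]

# Hypothetical flat file: one row per student
flat = [
    {"student": 1, "country": "FR", "region": "Europe", "score": 3.2},
    {"student": 2, "country": "JP", "region": "Asia", "score": 2.9},
    {"student": 3, "country": "FR", "region": "Europe", "score": 3.8},
    {"student": 4, "country": "JP", "region": "Asia", "score": 3.1},
]

level1, level2 = split_levels(flat, "country", "student", ["region"])
```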
What Program Should I Use?
- Microsoft Excel is the most basic and accessible spreadsheet program available today
- It is best suited for general data exploration, histograms, scatter plots, etc.
- Appearance of tables can be customized to meet APA standards
- Allows for easy transition to other programs to complete analyses and write reports
- However, its heritage is not as a statistical analysis program
- Certain statistical programs are designed for specific analytic tasks
- Balance the results against what will be presented
- Choose wisely in the interests of efficiency and accuracy of results
- Some output is good for basic data exploration and for generating basic tables, but not for presenting the data
Summary
- Be very clear about the analysis objectives
- Be very familiar with all aspects of what defines your data
- Develop and stay true to your data analysis plans and research questions
- Be cognizant of which statistical software programs can best answer your research questions and present your results
- Be thorough in your analyses, express openness to additional investigations, yet be mindful of limitations given the data and the programs you are using