Analyzing Results of Large-scale Assessments to Ensure Universal Design

NCEO Technical Report 41

Published by the National Center on Educational Outcomes

Prepared by:

Christopher J. Johnstone • Sandra J. Thompson • Ross E. Moen • Sara Bolt • Kentaro Kato

July 2005


Any or all portions of this document may be reproduced and distributed without prior permission, provided the source is cited as:

Johnstone, C.J., Thompson, S.J., Moen, R.E., Bolt, S., & Kato, K. (2005). Analyzing results of large-scale assessments to ensure universal design (Technical Report 41). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved [today's date], from the World Wide Web: http://education.umn.edu/NCEO/OnlinePubs/Technical41.htm


Executive Summary

Universal design of assessment has been an important step forward in making tests more accessible to students with disabilities. An issue affecting the universal design approach is the need to review individual items, potentially hundreds of them. Ideally, there would be a statistical procedure available that would first identify items that are potential sources of problems for students with disabilities. This paper illustrates one method of determining whether items are functioning differentially for students with disabilities in comparison to their non-disabled counterparts. Using various statistical analysis techniques, a large statewide mathematics data set was investigated for items that may have design issues. Multiple methods were used as a means for compensating for the lack of statistical power that is often present when analyzing data for populations with small group sizes. Results indicated that items can be flagged for further review based on differential functioning across disability groups or types of analysis.


Introduction

Research on the design of assessments has demonstrated that when test items are designed to be accessible from the beginning, positive results can occur (Grise, Beattie, & Algozzine, 1982; Johnstone, 2003). Assessments created with access for the greatest number of students (universally designed assessments) are recommended as an important way to improve assessment participation and performance for all students.         

Universally designed assessments are one outcome of the universal design approach to all of education called for in the 2004 reauthorization of the Individuals with Disabilities Education Act (IDEA) (Public Law No: 108-446). IDEA 2004 cites the Assistive Technology Act of 2004 by defining universal design as “a concept or philosophy for designing and delivering products and services that are usable by people with the widest possible range of functional capabilities, which include products and services that are directly accessible (without requiring assistive technologies) and products and services that are interoperable with assistive technologies.” The application of universal design principles to educational assessments is based on arguments of accessibility and equity. Thompson, Johnstone, and Thurlow (2002) specifically suggested that assessments are universally designed if they are accessible to a wide variety of students, have items that are clearly related to intended assessment constructs, are minimally biased, can be presented with accommodations for students with disabilities, have clear instructions and procedures, are comprehensible to a wide audience, and are legible.

Careful scrutiny of assessment items during design stages through field testing may improve the accessibility of tests overall. This report focuses on techniques that can be used for analyzing the results of field test items with representative samples of student subgroups. The objective of this report is to provide states and test designers methods of analysis for discovering which test items are valid for all students and which items may have issues related to universal design. These methods are demonstrated through an analysis of one state’s assessment results.

Three strategies are currently recommended by the Universal Design Project at the National Center on Educational Outcomes (NCEO) for determining whether assessments are universally designed. The first strategy is called the “think aloud method” (Johnstone, Miller, & Thompson, in press), also known as cognitive labs or protocol analysis. This qualitative method asks individual students from target groups and comparison groups to complete items while verbalizing all of their thoughts. This method has been found to be an effective way of uncovering design features that cause confusion and misunderstanding for students. The think aloud method, however, is limited in scope because it requires items to be pre-selected and can only be effectively conducted on a relatively small sample of students (Johnstone et al., in press).

The second strategy is the use of a set of considerations for item development and review. Currently, many states have some form of expert review, often called “content,” “sensitivity,” or “bias” review panels. These panels review hundreds to thousands of potential test items in order to find and change or eliminate bias toward particular populations. These reviews, however, are often unstructured and costly because items can be quickly rejected. In an effort to increase the effectiveness of these processes, NCEO solicited input and validation from experts in a variety of fields, resulting in Considerations for the Development and Review of Universally Designed Assessments found in Thompson, Johnstone, Anderson, and Miller (2005). These considerations function as thinking points for item reviewers.

The third strategy for ensuring universal design of assessments is to conduct large-scale statistical analyses on test item results. This paper describes four statistical techniques that are currently used in field practice by researchers and provides an example of the results of these techniques in a state. The statistical techniques described were used to flag items that may have been biased against certain groups.

Typically, test companies examine items using one of the methods described in this paper—differential item functioning (DIF). Although useful for large groups, DIF becomes less accurate as the size of the group gets smaller. Our purpose in this research was to use a variety of methods of analysis, partially because of small group size, in order to detect problematic items.

None of the methods of analysis described in this report yielded indisputable information largely because each technique loses validity for low incidence populations (such as students with specific types of disabilities). When used in combination with other available strategies (think alouds, considerations), however, patterns may emerge that help to detect problematic items. Use of the statistical techniques described in this report cannot resolve item-related issues entirely, but can provide a realistic starting point to aid states and test companies in searching out items that cause problems unrelated to the content tested.


Method

Overview

This paper describes a variety of techniques for using assessment data to determine whether items are universally designed. Methods of analysis are described, followed by a secondary analysis of all possible “flagged” items. The secondary analysis acts as a way to create a manageable list of items with potential issues.

Analysis of statewide field test data is most efficient when specific groups of interest are identified a priori. Information can be elicited from databases on broad-based groups such as free and reduced lunch status, gender, or ethnicity. Smaller sub-groups may be examined within larger groups. For example, students with disabilities can be examined by disability category (learning disabilities, visual impairment, hearing impairment, emotional disturbance, etc.), accommodation group (read aloud, extended time, large print, etc.), or a combination of the disability category and accommodation group (e.g., students with learning disabilities who receive read aloud accommodations). The advantage of targeting subgroups is that very specific information can be derived. The disadvantage of such actions is that sub-group size is often very small.

 

Sample

This study made use of extant data from a large, Midwestern state’s assessment database. The sample consisted of students with disabilities who participated in the large-scale mathematics assessment at grades 4 and 8. The data selected are from the year 2000.

Subgroups selected for analysis represented a variety of disability categories, including learning disability, speech disorder, mental retardation, emotional/behavioral disorder, other health impaired, hearing impaired, language impaired, partial sight, blind, physical impairment, autism, traumatic brain injury, deaf/blind, and multiple disabilities. The number of students in each group ranged from five (deaf/blind) to 5,464 (learning disability) in 4th grade and from four (deaf/blind) to 5,498 (learning disability) in 8th grade, demonstrating the disparity in population size for various disabilities.

Data were also examined based on accommodation provided. Students of all disability types predominately received five types of accommodations, including oral test accommodations (referred to as “read aloud”), extended time, Braille, large print, and signed assistance. Sample sizes for accommodations also demonstrated disparities. For example, only 13 grade 4 students used large print tests while 4,427 students received read aloud accommodations. In grade 8 the pattern was similar, with 12 students receiving large print accommodations and 2,876 receiving read aloud accommodations. Sample sizes for all groups and accommodation conditions are provided in Table 1.

 

Instrumentation

Data for this analysis are from a Midwestern state’s 2000 large-scale mathematics assessment at grades 4 and 8. The entire test was a norm-referenced multiple choice test. Students took the test under various conditions of accommodation, as well as under standard conditions. All data (accommodated and standard) were included in the data set and all were entered in terms of correct or incorrect student responses for each item. 

Table 1. Sample Size by Grade Level, Disability, and Accommodations Use

 Disability

4th Grade

8th Grade

Learning Disability

5464

5498

Speech

804

90

Mental Retardation

445

724

Emotional/Behavioral Disorder

573

731

Other Health Impaired

332

252

Hearing Impaired

67

76

Language Impaired

539

192

Partial-See

25

17

Blind

10

5

Physical Impairment

33

27

Autism

20

36

Traumatic Brain Injury

17

16

Deaf/Blind

5

4

Multiple Disabilities

55

64

Accommodation

 

 

Students Receiving Read-aloud

4427

2876

Students Receiving Extended Time

2668

1329

Students Receiving Braille

25

20

Students Receiving Large Print

13

12

Students Receiving Signed Assistance

49

27

 

Approaches to Item Analysis

There are multiple methods for examining data to detect design issues. This section explores four methods, moving from simple methods based on classical test theory to those with increasing complexity based on more contemporary item response theories. Rationale, explanations, and data from the state’s mathematics assessment in 4th and 8th grades are reported for the following: Item Ranking, Item Total Correlation, Differential Item Functioning (DIF) using Contingency Tables, and DIF using Item Response Theory (IRT) approaches. All analyses are conducted within a grade level (i.e., no cross-grade analyses were conducted).

 

Analysis Approach 1: Item Ranking

Item ranking is a procedure that requires a comparison of item ranks from different groups to determine whether certain items are more challenging (and potentially biased) for particular students. Item ranking assumes that every item has a particular degree of difficulty and is usually expressed with a P (probability) statistic. For example, if 60% of students in a population answered an item correctly, its P-value (Total) would be (.60). P-values can also be calculated for groups (De Ayala & Kelley, 1997). Items can then be ranked from most to least difficult for the total population and for particular groups.

Current standards-based assessment requirements differ from results typically expected of item ranks. Accountability requirements in all states expect all students to achieve at comparable rates, therefore P-values for particular items should be similar across groups (i.e., items that are difficult for one group should be difficult for another group; items that are easy for one group should be easy for another group). When item ranks vary across groups, it may be an indication that certain items are particularly difficult for one group of students. Such difficulty may be a result of the item placement in the test (e.g., slower students may not attempt items at the end of the test; De Ayala & Kelley, 1997), a possible reflection that certain students have not had the opportunity to learn content (Abedi, Herman, Courtney, Leon, & Kao, 2004), or an indication of item bias (Popham & Lindheim, 1980).

There are three steps to determine differences in item ranks.

1.   Determine the P-value for each item for a target group (students with disabilities) and a reference group (students without identified disabilities).   

2.  Rank items from lowest to highest in terms of P-value (items ranked will range from the most difficult to the easiest for a group).

3.   Examine groups for discrepancies in item ranks.

The importance of discrepancies in rank order depends on both the number of items and student results. If a discrepancy exists between groups on a particular item, the item may warrant further investigation. Decision rules as to whether an item is problematic can vary. In this study it was determined that a five “rank” difference was sufficient for flagging an item for further analysis.

We chose five “ranks” as a rule to flag items by first considering the sampling distribution under the null hypothesis. The sampling distribution depends on several quantities such as the number of items (which determines the range of rank differences), item difficulties (because ranks are based on item difficulties), and sample sizes (smaller sample sizes lead to larger variability of rank differences). The null hypothesis is that the rank of the item is the same for both groups (or more strictly, item difficulties are the same for both groups).

The sampling distribution of rank difference does not follow a standard distribution such as normal or t, so it is difficult to obtain in an analytical form. Thus, it was estimated by statistical simulation. Sample sizes of focal groups vary across disability categories, so simulation was performed for each of the disability and accommodation categories. Data were generated in each simulation to estimate type I error rates. The same set of simulations was repeated for the 4th and 8th grades.

Medians of type I error rates for the 4th and 8th grades are .120 and .090 (see Figure 1), respectively. So, for about half of the comparisons, we tested at significance level 10% or less by using the critical value of rank difference 5. It should be noted that the type I error rate can be very high when the sample size of the focal group is very small. It can be as high as .70. Therefore, other methods of analysis are warranted to ensure defensible conclusions about items.

Figure 1. Type 1 Error for Item Rank Statistics

 

Analysis Approach 2: Item Total Correlation

A second method for flagging items for possible universal design issues is Item Total Correlation (ITC). ITC analysis examines how items correlate to other items on the same test. Therefore, ITC analysis is a within group investigation. ITC determines how well an item’s P-value (see analysis approach 1) correlates with other test items in the test for a particular group. Crocker and Algina (1986) suggested that ITC should be at least .20 for any particular item. In other words, a particular item’s P-value should positively correlate at a rate of .20 with the combined p-values of other items on the test. Such a rate would ensure a reasonable level of reliability. If an item does not correlate with the rest of the test, it may be problematic. A second set of tests will determine whether there are statistically significant differences between ITCs for target and comparison groups.

Stark, Chernyshenko, Chuah, Lee, and Wadlington (2001) noted that testing ITC and flagging items with low ITC is typically part of the test development process. Therefore, checking ITC for target groups is a valuable exercise because such data are sometimes overlooked during field testing of items. 

The first step in calculating ITC is to compute a Pearson point-biserial item-total correlation (rpb) for each item separately, for both the target and reference group. Items that have an initial ITC of less than .20 may be problematic (Crocker & Algina, 1986). A second step in determining potentially problematic items is to determine whether ITCs are statistically different between groups. Statistical tests must then be conducted, item-by-item, for target and reference groups.

Before beginning target group comparisons, it is important to note that group sizes may be very small, causing range restrictions in variance of ITC. Therefore, to more accurately compare groups, a correction suggested by Hunter and Schmidt (1990) can be used:

p = po / a.

In this formula, p is the corrected correlation, po is the computed correlation, and a is derived by calculating the following formula:

Here, u is the ratio of the standard deviation of the target group for an item to that of the reference group for that item (SDtarget/SDreference).

In this study, ITCs  and simple t-tests were used to determine whether there was a statistically significant difference between target and reference groups for a particular item. If a statistically significant difference is found on a particular item, the magnitude of that difference can be tested using Cohen’s (1988) index. The magnitude (q) is computed using the formula: 

According to Cohen (1988), differences of q = .10 are considered small, differences of q =.30 are considered medium, and differences of q = .50 are considered large.

To ensure that findings are statistically reasonable, confidence intervals are constructed. For each item with a significant and large difference, a 95% confidence interval was constructed around the ITC for each group using the following formula:


(* = multiply)

In this formula, zp represents the standardized correlation coefficient using Fisher’s z-transformation, which was determined using an r- to z-table. N, in this formula, represents the target group size. Following the construction of each confidence interval, z-values were transferred back to r-values using the r- to z-table.         

Cohen’s indices of q that are large (.50 or higher) and confidence intervals that are low (.20 or lower) may indicate a problematic item. These values were used in this study to flag an item for further analysis.

 

Analysis Approach 3:  Differential Item Functioning: Contingency Table Methods

Analysis of Differential Item Functioning (DIF) seeks to determine whether a particular item is substantially more difficult for one group than another after the overall differences in knowledge of the subject tested are taken into account. Analyses are predicated on the notion that items should be of similar difficulty level for students of equal achievement levels across target and reference groups. DIF occurs when one item is substantially more difficult for a particular group after students were matched for achievement levels. For example, if equally achieving students with disabilities and non-disabled students were compared by item, and the students with disabilities scored substantially lower on an item than their non-disabled peers, that item would have DIF and possibly universal design shortcomings. DIF does not mean that an item is more difficult for one group than for another, but implies that items function differentially according to student characteristics that are not related to achievement (e.g., disability, gender, and ethnic background; The College Board, 2004).

DIF statistics are calculated by computing the proportion of students who answer an item correctly, within a given overall test score range, in target and reference groups. Statistically significant findings may point to an item’s problematic nature. Two types of analyses can be used for DIF statistics: Mantel-Haenszel Chi Squares and SPD-X calculations.

Mantel-Haenszel Chi-Square statistic (MHc2) and log odds ratio (bMH) for target and comparison groups are calculated using any statistical analysis program (e.g., SIBTEST, SAS). Items for which the chi-square statistic is significant, and D (which represents -2.35 bMH) is significantly different from 0, may have universal design issues. One weakness of MHc2 analyses is that some target groups in this research have few, if any, students in particular achievement categories (sample sizes of less than 50 are unreliable for MHc2 statistics). To minimize error, samples were divided into 5 ability levels based on total score. Small sample sizes, however, were still present. Therefore a second test, used in combination with MHc2, was calculated to determine whether DIF was present for target groups.

This second test of DIF analysis was calculated using the following equation:

In this equation, p represents the proportion of students answering the item correctly at raw score level j for corresponding target (T) and reference groups. Once Dp is calculated, SPD-X is then calculated for each target group using the following equation:

In this formula, j is the raw score level, S is the highest raw score value, and nTj is the target group size at the specified raw score level (one of the features of DIF analysis is that subjects are sorted by raw score achievement as well as qualitative group). SPD-X calculations can be conducted using DIF software.

Large positive results for SPD-X indicate that an item is differentially difficult for a target group and may, therefore, have universal design issues. There is no significance test for SPD-X calculations, so decision rules are largely based on local considerations. Universal Design Project researchers examined the statewide database and determined that SPD-X calculations of .05 indicated that several items had potential universal design shortcomings.

Analysis Approach 4:  Differential Item Functioning: Item Response Theory Approaches

Differential Item Functioning can also be determined using Item Response Theory approaches. IRT analyses focus on individual items and presents statistics based on the “latent traits” of individual test takers (Baker, 2001). In theory, all test takers have latent traits in terms of ability. IRT applications require these traits to be quantified. Rather than calculating DIF using results from individual items with assumed equality of achievement, IRT applications allow researchers to investigate item difficulty based on the individual traits of test takers. In other words, items are weighed for difficulty, but balanced by the latent traits of individuals.

Typically, when items are investigated using IRT approaches, two parameters are measured: item difficulty and item discrimination. Items that are considered difficult have a low P-value for all test takers in a group. Discrimination parameters describe the item’s ability to discriminate students who are high and low achievers. In theory, high achieving students should have a high probability of answering an item correctly. Conversely, low achieving students should have a low P-value. Difficulty and differentiation can be plotted on a graph that represents student latent traits on a scale (x-axis) and P-value for the item (y-axis). Items that are appropriately difficult and discriminating are represented by an S-curve. Figure 2 demonstrates an S-curved item that has a low P-value for low achievers, medium P-value for medium achievers, and high P-value for high achievers.

Figure 2. IRT "S" Curve

Source: Baker, 2001

DIF can also be calculated using IRT approaches by using a freely available, downloadable program called IRTLRDIF (see http://www.unc.edu/~dthissen/dl.html). This program calculates three different item parameters: (a) discrimination,  (b) difficulty, and (c) guessing (Birnbaum, 1968). For each item (and each group) IRTLRDIF provides estimates for (a), (b), and (c), and a latent trait distribution (S-curve) with 100 equidistant points. The probability that a person with a given latent trait will answer the item correctly (Pq), or the value of the latent trait curve, is represented (and solved) with the formula below:

If calculating for 2 groups, 100 points on each S-curve will produce 200 “Pq’s” for each item. To evaluate the magnitude of differences in parameters (differential item functioning for groups, controlling for latent traits and guessing) the following formula was calculated:

In this formula, dGr(q) is the density of the proficiency distribution of the target group at the latent trait value (Holland & Wainer, 1993). Items with significant DIF and a large difference in item parameters (DIF ³ .05) may have universal design issues.

A second set of calculations can be computed using SIBTEST software. To determine DIF, SIBTEST can be run for reference and focal (target) group files. Any items that have a beta of greater or equal to .05 and a P-value smaller than .05 may be problematic.

SIBTEST is an efficient test, but care must be taken before performing calculations. Any cases that have missing information (items omitted) must be excluded from SIBTEST calculations. Because of the missing data constraints encountered during SIBTEST calculations, a column labeled “SIBTEST N” is also reported in the Results section. This column represents the N calculated after cases with missing data were eliminated.


Results

As expected, results varied from analysis to analysis, providing the need for further investigation. Therefore, results are reported in two categories—by analysis and by group. It is noteworthy that all items had universal design issues for some group (i.e., there were six analyses in four broad categories conducted on 19 groups). Among the 114 analysis cells (6 x 19) for each grade level, every item was found in at least one cell, as is demonstrated in the preliminary results found in Table 2 (grade 4) and Table 3 (grade 8). In these tables, items are reflected by letters, and the order of the letters does not reflect the actual order of the items. This protects the security of the original assessment. Finding that all items have been flagged for potential design problems is very helpful. This is why a cross categorical secondary analysis is needed.

Table 2. Items with Universal Design Issues: 4th Grade

Group

N

Item rank

Item-total correlation

M-H

SPD-X

IRT-LRDIF

SIBTEST

SIB(N)

Learning Disability

5464

None

None

A,C,D,G,K,L,
M,N,O,P,Q,S,
T,U,W,X,Y,Z,
BB,DD,EE,FF 

O,Y

A,B,C,E,G, H,I,J,
K,L,M,N,O,P,Q,R,
S,T,U.V,W,X,Y,Z,
AA,BB,CC,DD,EE

X

4752

Speech

804

None

None

M,V,X

None

B,M,V,X

X

735

Mental Retardation

445

A,H,K,L,M,
S,AA,DD 

A,B,C,D,
O,CC,FF

1,2,3,4,5,6,7,8,9,
10,11,12,13,14,
15,16,17,18,19,
20,21,22,
A,E,F,G,H,I,J,
K,L,M,N,O,P,Q,
R,S,T,U,V,W,X,Y,Z,
AA,BB,DD,EE

K,S,
W,Y,Z

A,B,C,D,E,F,
G,J,K,L,M,N,O,
R,S,T,V,W,Y,Z,
BB,CC,EE

A,W

301

Emotional/
Behavioral Disorder

573

A,O

CC

A,J,K,L,M,N,O,
S,W,Y,Z,
AA,FF

None

B,C,F,G,J,
K,L,M,O,U,
V,W,Y,AA,
CC,DD

O,AA

458

Other Health Impaired

332

A,F,H,O

FF

C,G,H,K,M,
N,O,P,S,U,W,
Y,FF

G,P,Y

A,C,G,N,P,R,
U,V,W,Y

G,P,Y

287

Hearing Impaired

67

A,H,K,T,DD

B,R

T

G,H,T,X, DD,EE

N

None

60

Language Impaired

539

A,O,AA

None

A,G,H,K,L,M,
N,O,S,T,U,
W,Y,Z,AA,
DD

A,Y

A,C,D,E,H,K,
S,T,W,Y,CC

A,T,AA

470

Partial-See

25

A,G,H,M,
N,Y,Z,EE

None

Z,EE

B,D,G,I,M,N,
W,X,Y,Z,EE

N,O,U,Z

D

22

Blind

10

J,K,L,T,V,
X,Y,Z,EE

D,E,G,I,M,
O,U,W

K

J,K,L,T,V,X,Y,
AA,EE

None

None

 

Physical Impairment

33

G,N,T,S, DD

None

C,N,W,Z

B,C,D,N,S,

W,Z, CC,EE

C,N,V,W,Z

C,CC

30

Autism

20

G,K,M,N,T,Z,AA

None

G,I,K,M,N,
W,Z,AA

D,I,K,M,N,Q,Z,
AA,CC

A,B,F,H,J,

N,V,W,Z,

AA

R,X,AA

18

Traumatic Brain Injury

17

A,G,M,Q,R,S

None

S

A,E,F,G,M,
P,Q,R,S,FF

C,P,Q,DD

S

15

Deaf/Blind

5

H,L,M,O,S,W,Y,

AA

G,V,AA

AA

B,H,L,M,O,
S,W,X,Y

G,AA

None

 

Multiple Disabilities

55

A,O,S,Y,DD

C

C,M,N,O,S,
Y,Z,EE

B,M,N,O,
S,Y,Z,EE

C,E,N,S,V,
W,Y,CC

None

50

Students Receiving Read-aloud