The Differential Impact of Accommodations in Statewide Assessment: Research Summary


Published by the National Center on Educational Outcomes
April 2003

Julia Shaftel
Evelyn Belton-Kocher
Douglas R. Glasnapp
John P. Poggio

Center for Educational Testing and Evaluation
University of Kansas

This report was supported in part by funds provided by the National Center on Educational Outcomes and the Kansas State Department of Education. Opinions expressed herein are those of the authors and do not necessarily reflect those of the sponsoring agency.

Any or all portions of this document may be reproduced and distributed without prior permission, provided the source is cited as:

Shaftel, J., Belton-Kocher, E., Glasnapp, D. R., & Poggio, J. P. (2003). The differential impact of accommodations in statewide assessment: Research summary. Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved [today's date], from the World Wide Web:

Summary of Findings

To meet the challenge of including special populations in statewide assessments, Kansas has undertaken a series of test development efforts aimed at students with disabilities and those who are learning the English language. Data from Kansas statewide assessments during the past several years reflect the participation of these groups in statewide assessments using both modified and unmodified forms of the same test items. This situation provides a fortuitous opportunity to evaluate the performance of sizeable groups of students under known testing formats and situations.

In particular, the question of when the adaptations of language simplification and calculator use result in comparable scores is addressed for general education students and for special populations. The desired goal is for an accommodation to benefit only those students who need it, having no effect on students without relevant special needs. A well-understood example is large print for students with visual impairments. The most appropriate accommodations allow students with special needs to display actual academic achievement rather than permitting the assessment to become a measure of disability or English proficiency. For this reason, validity studies should demonstrate that with a particular test accommodation the scores of the general population of students for whom the test is intended do not change while the special needs population shows improvement. Improved scores demonstrate the removal of irrelevant impediments to performance for the special population, allowing students to demonstrate true achievement.

The three large studies condensed in this executive summary were undertaken to discover the effects of item characteristics, including specifically engineered item modifications, on student performance. The first study is a unique analysis of both linguistic characteristics and mathematics features in a large pool of mathematics test items at three grade levels. With the item as the level of analysis, the impact of these features on item difficulty for a general sample of students and for special populations was investigated. The second study comprised a three-part investigation of the comparability of student performance on mathematics items that were modified to reduce language complexity while retaining the identical mathematics content. Three separate analyses, addressing general education students and English language learners at three grade levels and students with disabilities at 4th grade, were conducted. Finally, the last study addressed calculator use as an additional modification to language simplification with elementary students. Together these studies make a significant contribution to the current literature on test accommodations and modifications.

Study #1

This study evaluated two related issues in multiple choice large-scale mathematics assessments used for accountability purposes. The first issue is how linguistic features affect the performance of different student populations on mathematics items. In order to isolate the impact of the linguistic features one also has to consider the role of math difficulty. Therefore this study included an analysis of the role of mathematical features and complexity in the performance of different populations. The population for this study consisted of a sample of students in general education, including eligible English language learners and students with disabilities, who responded to the state's general mathematics assessments. The impact of the specific features of the test items was evaluated with respect to these groups and to minority students.

The mathematics items used in this study were all items in the Kansas Mathematics Assessments given at grades 4, 7, and 10, with four parallel forms of the assessment at each grade, or about 200 items per grade level. Each item was rated on two domains, mathematics and linguistic characteristics.

To serve as the dependent variable, mean scores (item difficulties) were computed for each item for the following groups: total sample, students with disabilities, ELL students, and ethnic minority students. Using the items as the unit of analysis, linear regression analyses were conducted at each grade level to examine the relationships between item difficulty levels serving as the dependent variable and ratings of item mathematics and linguistic characteristics serving as independent predictor variables.
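The item-level regression design described above can be sketched as follows. All feature names, ratings, weights, and sample sizes below are invented for illustration; they are not the rating scales or data from the Kansas study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ratings for 200 items on three features (illustrative
# stand-ins for the study's math and linguistic characteristic ratings).
n_items = 200
features = rng.integers(0, 4, size=(n_items, 3)).astype(float)

# Simulated item difficulty (proportion correct) for one student group,
# generated so that higher feature ratings make items harder.
true_weights = np.array([-0.05, -0.08, -0.02])
p_correct = np.clip(0.75 + features @ true_weights
                    + rng.normal(0, 0.05, n_items), 0.05, 0.95)

# Ordinary least squares with the item as the unit of analysis:
# item difficulty regressed on the feature ratings.
X = np.column_stack([np.ones(n_items), features])   # intercept + predictors
coef, *_ = np.linalg.lstsq(X, p_correct, rcond=None)

# Proportion of variance in item difficulty explained (R^2).
resid = p_correct - X @ coef
r2 = 1 - resid.var() / p_correct.var()
print(coef, r2)
```

Negative coefficients indicate features that make items harder; running separate regressions per subgroup (total sample, ELL, disabilities, minority) is what allows the differential comparisons reported below.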

The results indicated that few mathematics characteristics had unique effects, and those that did varied with the grade level at which they appeared. For example, 10th graders found items with fractions difficult. In terms of linguistic characteristics, slang words had a unique effect on 4th grade achievement while items with comparatives were more difficult at 7th grade. Only the linguistic feature of specific math vocabulary words affected nearly all groups at each grade. Math complexity was not related to item difficulty except at 4th grade. The unique effects of individual math and linguistic characteristics were greater at grade 4 than in the higher grades.

There were some differences in subgroup responses to item characteristics. At 4th grade minority students had more difficulty with increased pronoun use and students with disabilities had more trouble with items containing whole numbers. At 10th grade, English language learners had more difficulty with comparatives (e.g., greater than, less than), with greater preposition use, and with problems containing exponents.

The combined effect of all features on performance was statistically significant only at 4th and 7th grades, with over 30% of variance in performance explained. At 10th grade, however, these features did not predict a statistically significant amount of variance in group performance.

In summary, English language learners, students with disabilities and ethnic minorities did not show distinct profiles of impact by item characteristics. This study did not identify item features that differentially impact special populations consistently across grade levels.

Study #2

This investigation was designed to evaluate the effects of simplified wording on student performance on mathematics items. Unnecessary language load and complexity of presentation may disadvantage ELL students and interfere with their ability to demonstrate their true knowledge and skill. To mitigate the effects of difficult language in mathematics test items, which are not designed to assess language comprehension per se, one test form at each grade level of the Kansas mathematics assessments was constructed during the spring and summer of 2000 using modifications designed to address the needs of English language learners (ELL). Validity studies of differences in performance on original and modified items were conducted with general education students and ELL students in 4th, 7th and 10th grades and with students with disabilities in 4th grade.

I. General Education Students

Assessment forms with matching original and modified items in counterbalanced order were administered to students at one grade level above the grade intended for assessment. The approximate number of students responding to each of the four test forms was 490 at grade 5, 300 at grade 8 and 100 at grade 11.

There was no statistically significant difference in performance between any of the randomly assigned groups of students taking matched versions of the original and plain English test items. The translation of test items into plain English neither advantaged nor disadvantaged the performance of students who were primarily English proficient. Although there was greater variability in the reliability coefficients for grade 11 students, in the main these coefficients also support the conclusion that items in the two versions function the same within their respective item sets, thus confirming the lack of a differential effect.

II. English Language Learners

The data from the first study of general education students support the equivalency of the two versions based on the performance of students who are English language proficient. However, these are not the students for whom the plain English versions of the test were intended. Rather, students identified as ELL were the intended user group. To address the same question of equivalency for this group, data were configured using results from the 2000 and 2001 spring testing in mathematics for ELL students.

Data were evaluated on all students with ELL status who had responded to a set of original items in spring 2000 and the matched but simplified plain English items in spring 2001. The number of students for whom data were available was sufficiently large in each analysis. The sample sizes ranged from a low of 77 for ELL students who took the original grade 10 test form in spring 2000 to a high of 540 for ELL students who received the grade 4 plain English test form in spring 2001. Using the common item anchor block design, three separate but related analyses were conducted.

Analysis 1. In the first analysis, a traditional analysis of covariance (ANCOVA) approach was used. In this design, any difference detected in the two groups on the common anchor block of items (covariate) was used to adjust for differences on the dependent variable (scores from the comparison sets of matched items in their original or plain English form).
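The covariate-adjustment logic of this ANCOVA design can be sketched with invented data. The cohort sizes, score scales, and the true relationship below are illustrative assumptions, not the Kansas results.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data for two ELL cohorts (original vs. plain English form):
# anchor = score on the common anchor block of items (the covariate);
# comp   = score on the matched comparison item set (dependent variable).
n = 150
anchor_a = rng.normal(10, 3, n); comp_a = 0.8 * anchor_a + rng.normal(2, 2, n)
anchor_b = rng.normal(10, 3, n); comp_b = 0.8 * anchor_b + rng.normal(2, 2, n)

def centered(x): return x - x.mean()

# Pooled within-group slope of the dependent variable on the covariate.
slope = ((centered(anchor_a) @ centered(comp_a)
          + centered(anchor_b) @ centered(comp_b))
         / (centered(anchor_a) @ centered(anchor_a)
            + centered(anchor_b) @ centered(anchor_b)))

# Adjusted means: each group's DV mean corrected to the grand covariate
# mean, removing any chance difference on the anchor block.
grand = np.concatenate([anchor_a, anchor_b]).mean()
adj_a = comp_a.mean() - slope * (anchor_a.mean() - grand)
adj_b = comp_b.mean() - slope * (anchor_b.mean() - grand)
print(adj_a, adj_b, adj_a - adj_b)
```

The quantity of interest is the difference in adjusted means: if the two forms function equivalently, it should be near zero, as the study found.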

The ELL 2000 (original test form) and ELL 2001 (plain English test form) group means were highly comparable on the common anchor block of items (greatest difference is .13 units), on the comparison non-anchor set of items (greatest difference is 1.20 units) and on the adjusted means (greatest difference is 1.22 units). For this difference (1.22 units), and in contrast to expectations, ELL students taking the original items performed better than the ELL students taking the plain English versions of the test at grade 7. At the other two grade levels, ELL students taking the plain English versions had slightly higher adjusted mean scores with the difference of .49 units between mean scores at grade 4 and .91 units between mean scores at grade 10. All three of these differences represent extremely small statistical effects.

Analysis 2. As a second set of analyses, the first set was replicated using the same covariate (common item anchor block scores), but the dependent variable was changed. To explore the generalizability of the results to the actual scores reported, rather than use only the non-common test items in forming the dependent variable scores, all items were used (common and non-common) to obtain a total score, and then this total score was transformed to an equated percent correct score based on the equating formula for that test form established during the spring 2000 baseline testing year.
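The report does not reproduce the spring 2000 equating formula itself, so the transformation can only be illustrated generically. The sketch below uses a standard linear (mean/sigma) equating function with invented score distributions and test length.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical total raw score distributions on a 40-item test:
# a baseline form (spring 2000) and a new form to be placed on its scale.
n_items_total = 40
base = rng.normal(26, 5, 500)   # baseline-form total scores
new = rng.normal(24, 6, 500)    # new-form total scores

def equate_linear(x, new_scores, base_scores):
    """Map score x from the new form onto the baseline scale by
    matching the means and standard deviations of the two forms."""
    z = (x - new_scores.mean()) / new_scores.std()
    return base_scores.mean() + z * base_scores.std()

# A raw total of 30 on the new form, equated and expressed as an
# equated percent correct score.
raw = 30
equated = equate_linear(raw, new, base)
equated_pct = 100 * equated / n_items_total
print(round(equated, 2), round(equated_pct, 1))
```

Using equated percent correct as the dependent variable, as in Analysis 2, checks whether the equivalence result generalizes to the scores actually reported to schools.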

The results for the equated percent correct scores mirror those reported in the first analyses: The adjusted means for students taking the plain English version were slightly higher at grades 4 and 10 but lower at grade 7. The magnitudes of the differences are extremely small at each grade level and demonstrate no practical differential effect.

Analysis 3. The third set of analyses was based on item response theory (IRT) procedures and more directly addressed the construct and item functioning equivalency of the two versions of the tests at any one grade level. In these analyses, the common item anchor block design was used to put the student ability and item parameters for both versions of the tests on the same scale.

The grade 4 mathematics IRT mean ability and item discrimination and difficulty estimates are almost identical for the two ELL groups. On the grade 7 mathematics form, there are no differences in mean ability or item discrimination estimates but the mean item difficulty estimates do differ somewhat. In this case, the items in the plain English version were more difficult for students than were the items when administered in their original form. For the grade 10 mathematics forms, the mean item discrimination and difficulty estimates are very close but there is a slight difference in the ability estimates, with the plain English version of the test resulting in a slightly higher mean ability estimate (.003 vs. .303).

In summary, the evidence from these three analyses of ELL students confirms the equivalency of the two forms of mathematics test items at every grade level for this population, as was found for general education students. There is no evidence to suggest that performance is impacted differentially for ELL students; hence, these results do not provide support for simplified language as an aid to performance for non-native English speakers.

III. Students with Disabilities

The third portion of the study of plain English items focused on students with disabilities using a model similar to the item response theory analysis for ELL students described above. Due to the introduction of tests designed for special populations in the spring of 2001, students with disabilities were exposed to original English and simplified English items with identical mathematics content over two successive years. Students with disabilities who took the original forms of mathematics test items during the spring of 2000 were compared with students with disabilities who took a new and modified test form during spring 2001.

The entire group of 4th grade students with disabilities who took the modified test numbered 570 during spring 2001. The comparison groups of students with disabilities, all of whom were exposed to the items in their original form, were identified from a random sample of 8000 4th grade students who took all four original test forms in 2000. Mathematics ability and item difficulty parameters were estimated for all students and items using a one-parameter IRT estimation procedure. The study design provided IRT mathematics ability score estimates that were used as a covariate for these analyses.
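A one-parameter (Rasch) model of the kind used here can be estimated in several ways; the sketch below uses a simple joint maximum likelihood scheme on simulated data. The sample sizes, true parameters, and estimation routine are illustrative assumptions, not the study's operational procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated 0/1 response matrix: 300 students by 20 items generated
# from a one-parameter (Rasch) model.
n_students, n_items = 300, 20
theta_true = rng.normal(0, 1, n_students)   # true abilities
b_true = rng.normal(0, 1, n_items)          # true item difficulties
p = 1 / (1 + np.exp(-(theta_true[:, None] - b_true[None, :])))
resp = (rng.random((n_students, n_items)) < p).astype(float)

# Joint maximum likelihood: alternate gradient steps on person ability
# and item difficulty estimates until they stabilize.
theta = np.zeros(n_students)
b = np.zeros(n_items)
for _ in range(500):
    pred = 1 / (1 + np.exp(-(theta[:, None] - b[None, :])))
    grad = resp - pred                # residuals drive the updates
    theta += 0.5 * grad.mean(axis=1)
    b -= 0.5 * grad.mean(axis=0)
    b -= b.mean()                     # anchor the scale origin at zero

print(np.corrcoef(theta, theta_true)[0, 1],
      np.corrcoef(b, b_true)[0, 1])
```

The recovered ability estimates are the kind of score the study used as a covariate, placing students who took different forms on a common scale.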

The analyses found a statistically significant difference in favor of the modified test group on 9 of the 13 plain English items but on none of the unchanged items. These results indicate that plain English as a modification frequently provided a benefit to students with disabilities.

The finding that only the modified test group, made up solely of lower performing students with disabilities, was helped by the simplified English and modified format of the items on the modified test is the first consistent difference in effect across the three parts of this investigation. Meaningful differences were not found for general education students or English language learners. This result is important within the context of the original validity study, in which the plain English modifications were not found to have any sizeable or consistent impact upon groups of general education students taking general assessments.

Study #3

Study #3 was planned as a comparison of the performance of students with disabilities and general education students in 4th grade on identical mathematics test items with and without calculators. Data for Part I of this study were obtained from a large-scale administration of fourth grade mathematics items modified to be suitable for use by students with disabilities. Part II describes the results of a supplemental study of general education students on 16 items with assigned calculator conditions.

I. Students with Disabilities

Mathematics ability and item difficulty parameters were estimated for all students and items using the procedure described in Part III of Study #2. Mathematics ability was used as a covariate for both of the analyses of students with disabilities. The first comparison of students with disabilities involved the performance on the common block of items for students who took the modified test and students with disabilities who took the general test form at the same time.

A second analysis was conducted to compare the performance of students with disabilities on the modified test with students with disabilities who were exposed to similar items on the previous year's general mathematics assessment, as described in Part III of Study #2. This analysis comprised the seven items from the overlapping administrations that were omitted from Study #2 because the items involved computation and were thus amenable to the additional accommodation of calculator use.

Of the seven items that were analyzed, two items had statistically significant higher adjusted mean scores for students who took them as part of the modified assessment and thus had calculators available to them. These were the two most difficult of the seven modified calculator-friendly items. This result contrasts with the results found in Study #2 for items in which simplified English was the major modification. On those items, nine of 13 advantaged the modified test group. Even though the calculator-friendly items also had plain English modifications, only two of the seven advantaged the modified test group. Calculator availability may provide a modest benefit to lower-performing students with disabilities, or the benefit may be entirely explained by the simplified language of the items. Evidently, plain English did not provide the same benefit on these calculator items as it did on the non-calculator items discussed in Study #2.

II. General Education Students

In order to assess a calculator accommodation without interference from other test format and presentation changes, intact classes of students were solicited to participate in a final study. A test booklet containing 16 mathematics items drawn from the Kansas 4th grade mathematics assessment was prepared to represent a variety of indicators and problem types. Of the 16 items, eight were intended to benefit from calculator use and eight did not require calculator use.

Thirteen schools from 13 districts volunteered to allow one intact class of 6th graders to participate. The volunteer classes were then randomly assigned either to the calculator or no-calculator conditions, with seven classes assigned calculators and six classes assigned the no-calculator condition. Completed answer sheets were returned from all 13 classes for a total sample of 244 students.

Calculator status had no statistically significant effect on performance for the overall test, though the mean test score for the calculator users was slightly lower than for the non-users. There was no interaction of score with gender or ethnicity. Ethnic status had a statistically significant main effect on test performance with students reporting minority ethnicity obtaining lower scores.

The test was separated into subtests of calculator items and non-calculator items, which were moderately correlated (r = .522). The mean score for the calculator items was 6.3 while the mean non-calculator item score was 5.6. Subtest score mean differences were not statistically significant by calculator group. The similar performance of the two calculator-access groups on the non-calculator items suggests an overall equivalence for the students in this study. The calculator item subtest scores were compared by group using the non-calculator item subtest score as a covariate, again without statistically significant effect.

Finally, individual items were compared across the two groups in order to explore any differences in item content that might be related to calculator usage. Only for one item did the calculator group perform better, and this item was a straightforward computation item involving adding a column of three-digit numbers. The other three items that showed statistically significant differences between groups comprised a subtraction problem that could be quickly estimated, a division problem that should have been an easily recognized math fact, and a problem that required the addition of minutes in intervals of 10 and conversion into hours. For these three problems, having a calculator actually produced lower performance.

In summary, calculator use had no statistically significant effect on the overall performance of typical 6th grade students in this study, regardless of gender or ethnicity. When there was an effect for general education students, that effect was either positive or negative depending on item type. In certain instances having a calculator at hand seemed to inhibit the use of a superior method for that problem type, such as recognizing a math fact or estimating. Students with disabilities were modestly helped by calculator availability at best. Calculator use may have provided a benefit on a small number of purely computation items. These items were presented in tandem with other accommodations so the effects of having a calculator are impossible to completely disentangle from other effects.

In determining whether calculator use is a permissible accommodation, the overall lack of effect on test scores for general education students supports the equivalence of calculator use with non-use for typical students. This finding suggests that allowing students who need calculators to use them does not result in altering the construct being measured and that test scores should be comparable regardless of calculator availability.


© 2006 by the Regents of the University of Minnesota.
The University of Minnesota is an equal opportunity educator and employer.
