The Differential Impact of
Accommodations in Statewide
Assessment: Research Summary
Kansas
Published by the
National Center on Educational Outcomes
April 2003
Julia Shaftel
Evelyn Belton-Kocher
Douglas R. Glasnapp
John P. Poggio
Center for Educational Testing and Evaluation
University of Kansas
This report was supported in part by funds provided by the National Center on
Educational Outcomes and the Kansas State Department of Education. Opinions
expressed herein are those of the authors and do not necessarily reflect those
of the sponsoring agency.
Any or all portions of this document
may be reproduced and distributed without prior permission, provided the source
is cited as:
Shaftel, J., Belton-Kocher, E.,
Glasnapp, D. R., & Poggio, J. P. (2003). The differential impact of
accommodations in statewide assessment: Research summary.
Minneapolis, MN: University of Minnesota, National Center on Educational
Outcomes. Retrieved [today's date], from the World Wide Web: http://education.umn.edu/NCEO/TopicAreas/Accommodations/Kansas.htm
Summary of Findings
To meet the challenge of including special populations in
statewide assessments, Kansas
has undertaken a series of test development efforts aimed at students with
disabilities and those who are learning the English language. Data from
Kansas statewide assessments during the past several
years reflect the participation of these groups in statewide assessments using
both modified and unmodified forms of the same test items. This situation
provides a fortuitous opportunity to evaluate the performance of sizeable groups
of students under known testing formats and situations.
In particular, the question of when the adaptations of language
simplification and calculator use result in comparable scores is addressed for
general education students and for special populations. The desired goal is for
an accommodation to benefit only those students who need it, having no effect on
students without relevant special needs. A well-understood example is large
print for students with visual impairments. The most appropriate accommodations
allow students with special needs to display actual academic achievement rather
than permitting the assessment to become a measure of disability or English
proficiency. For this reason, validity studies should demonstrate that with a
particular test accommodation the scores of the general population of students
for whom the test is intended do not change while the special needs population
shows improvement. Improved scores demonstrate the removal of irrelevant
impediments to performance for the special population, allowing students to
demonstrate true achievement.
The three large studies condensed in this executive summary were
undertaken to discover the effects of item characteristics, including
specifically engineered item modifications, on student performance. The first
study is a unique analysis of both linguistic characteristics and mathematics
features in a large pool of mathematics test items at three grade levels. With
the item as the level of analysis, the impact of these features on item
difficulty for a general sample of students and for special populations was
investigated. The second study comprised a three-part investigation of the
comparability of student performance on mathematics items that were modified to
reduce language complexity while retaining the identical mathematics content.
Three separate analyses, addressing general education students and English
language learners at three grade levels and students with disabilities at 4th
grade, were conducted. Finally, the last study addressed calculator use as an
additional modification to language simplification with elementary students.
Together these studies make a significant contribution to the current literature
on test accommodations and modifications.
Study #1
This study evaluated two related issues in multiple choice
large-scale mathematics assessments used for accountability purposes. The first
issue is how linguistic features affect the performance of different student
populations on mathematics items. In order to isolate the impact of the
linguistic features one also has to consider the role of math difficulty.
Therefore this study included an analysis of the role of mathematical features
and complexity in the performance of different populations. The population for
this study consisted of a sample of students in general education, including
eligible English language learners and students with disabilities, who responded
to the state's general mathematics assessments. The impact of the specific
features of the test items was evaluated with respect to these groups and to
minority students.
The mathematics items used in this study were all items in the Kansas Mathematics Assessments
given at grades 4, 7, and 10, with four parallel forms of the assessment at each
grade, or about 200 items per grade level. Each item was rated on two domains,
mathematics and linguistic characteristics.
To serve as the dependent variable, mean scores (item
difficulties) were computed for each item for the following groups: total
sample, students with disabilities, ELL students, and ethnic minority students.
Using the items as the unit of analysis, linear regression analyses were
conducted at each grade level to examine the relationships between item
difficulty levels serving as the dependent variable and ratings of item
mathematics and linguistic characteristics serving as independent predictor
variables.
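The item-level regression described above can be sketched as follows. This is a minimal illustration, not the authors' analysis: the function name fit_ols, the two predictor ratings, and all data values are hypothetical, and the report does not say which software or which exact set of predictors was used.

```python
def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X: list of rows of item feature ratings (no intercept column).
    y: list of item difficulties (mean score per item).
    Returns (coefficients with the intercept first, R squared).
    """
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented normal-equation matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda idx: abs(A[idx][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for idx in range(k):
            if idx != col:
                f = A[idx][col]
                A[idx] = [a - f * b for a, b in zip(A[idx], A[col])]
    beta = [A[i][k] for i in range(k)]
    y_mean = sum(y) / len(y)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


# Hypothetical ratings for five items: [math vocabulary count, comparatives count].
ratings = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
# Hypothetical item difficulties (proportion correct) for one student group.
difficulty = [0.90, 0.80, 0.85, 0.75, 0.65]
coefficients, r_squared = fit_ols(ratings, difficulty)
```

Here each row is one test item, so the sample size is the number of items rather than the number of students. R squared corresponds to the proportion of variance in item difficulty explained by the rated features.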
The results indicated that few mathematics characteristics had
unique effects and these depended on the grade level at which they were included
in assessment. For example, 10th graders found items with fractions
difficult. In terms of linguistic characteristics, slang words had a unique
effect on 4th grade achievement while items with comparatives were
more difficult at 7th grade. Only the linguistic feature of specific
math vocabulary words affected nearly all groups at each grade. Math complexity
was not related to item difficulty except at 4th grade. The unique
effects of individual math and linguistic characteristics were greater at grade
4 than in the higher grades.
There were some differences in subgroup responses to item
characteristics. At 4th grade minority students had more difficulty
with increased pronoun use and students with disabilities had more trouble with
items containing whole numbers. At 10th grade, English language
learners had more difficulty with comparatives (e.g., greater than, less than),
with greater preposition use, and with problems containing exponents.
The combined effect of all features on performance was
statistically significant only at 4th and 7th grades, with
over 30% of variance in performance explained. At 10th grade,
however, these features did not predict a statistically significant amount of
variance in group performance.
In summary, English language learners, students with
disabilities and ethnic minorities did not show distinct profiles of impact by
item characteristics. This study did not identify item features that
differentially impact special populations consistently across grade levels.
Study #2
This investigation was designed to evaluate the effects of
simplified wording on student performance on mathematics items. Unnecessary
language load and complexity of presentation may disadvantage ELL students and
interfere with their ability to demonstrate their true knowledge and skill. In
order to mitigate the effects of difficult language in mathematics test items,
which are not designed to assess language comprehension per se, one test form
at each grade level of the Kansas mathematics assessments was constructed during
the spring and summer of 2000 using modifications designed to address the needs
of English language learners (ELL). Validity
studies of differences in performance on original and modified items were
conducted with general education students and ELL students in 4th, 7th
and 10th grades and with students with disabilities in 4th
grade.
I. General Education Students
Assessment forms with matching original and modified items in
counterbalanced order were administered to students at one grade level above the
grade intended for assessment. The approximate number of students responding to
each of the four test forms was 490 at grade 5, 300 at grade 8 and 100 at grade
11.
There was no statistically significant difference in performance
between any of the randomly assigned groups of students taking matched versions
of the original and plain English test items. The translation of test items into
plain English neither advantaged nor disadvantaged the performance of students
who were primarily English proficient. Although there was greater variability in
the reliability coefficients for grade 11 students, in the main these
coefficients also support the conclusion that items in the two versions function
the same within their respective item sets, thus confirming the lack of a
differential effect.
II. English Language Learners
The data from the first study of general education students
support the equivalency of the two versions based on the performance of students
who are English language proficient. However, these are not the students for
whom the plain English versions of the test were intended. Rather, students
identified as ELL were the intended user group. To address the same question of
equivalency for this group, data were configured using results from the 2000 and
2001 spring testing in mathematics for ELL students.
Data were evaluated on all students with ELL status who had
responded to a set of original items in spring 2000 and the matched but
simplified plain English items in spring 2001. The number of students for whom
data were available was sufficiently large in each analysis. The sample sizes
ranged from a low of 77 for ELL students who took the original grade 10 test
form in spring 2000 to a high of 540 for ELL students who received the grade 4
plain English test form in spring 2001. Using the common item anchor block
design, three separate but related analyses were conducted.
Analysis 1. In the first analysis, a traditional analysis
of covariance (ANCOVA) approach was used. In this design, any difference
detected in the two groups on the common anchor block of items (covariate) was
used to adjust for differences on the dependent variable (scores from the
comparison sets of matched items in their original or plain English form).
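The covariate adjustment in this design can be sketched as below. This is a minimal illustration under the usual ANCOVA assumption of a common within-group slope; the function name adjusted_means and the scores are hypothetical and are not taken from the report.

```python
def adjusted_means(groups):
    """Covariate-adjusted outcome means for a one-covariate ANCOVA.

    groups: dict mapping group name -> list of (anchor_score, comparison_score).
    Each group's outcome mean is shifted by the pooled within-group slope times
    the distance of its covariate mean from the grand covariate mean.
    """
    all_x = [x for pairs in groups.values() for x, _ in pairs]
    grand_x = sum(all_x) / len(all_x)
    sxy = 0.0
    sxx = 0.0
    means = {}
    for name, pairs in groups.items():
        mx = sum(x for x, _ in pairs) / len(pairs)
        my = sum(y for _, y in pairs) / len(pairs)
        means[name] = (mx, my)
        sxy += sum((x - mx) * (y - my) for x, y in pairs)
        sxx += sum((x - mx) ** 2 for x, _ in pairs)
    slope = sxy / sxx  # pooled within-group regression slope
    return {name: my - slope * (mx - grand_x)
            for name, (mx, my) in means.items()}


# Hypothetical (anchor block score, comparison item score) pairs.
scores = {
    "original_2000": [(1, 2), (2, 3), (3, 4)],
    "plain_english_2001": [(2, 4), (3, 5), (4, 6)],
}
adjusted = adjusted_means(scores)
```

A group that scored higher on the anchor block has its comparison mean adjusted downward, and vice versa, so the remaining difference between adjusted means reflects the item version rather than a difference in the ability of the two cohorts.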
The ELL 2000 (original test form) and ELL
2001 (plain English test form) group means were highly comparable on the common
anchor block of items (greatest difference is .13 units), on the comparison
non-anchor set of items (greatest difference is 1.20 units) and on the adjusted
means (greatest difference is 1.22 units). The 1.22-unit difference occurred at
grade 7, where, contrary to expectations, ELL students taking the original items
performed better than ELL students taking the plain English version of the test.
At the other two grade levels, ELL students taking the plain English
versions had slightly higher adjusted mean scores with the difference of .49
units between mean scores at grade 4 and .91 units between mean scores at grade
10. All three of these differences represent extremely small statistical
effects.
Analysis 2. As a second set of
analyses, the first set was replicated using the same covariate (common item
anchor block scores), but the dependent variable was changed. To explore the
generalizability of the results to the actual scores reported, rather than use
only the non-common test items in forming the dependent variable scores, all
items were used (common and non-common) to obtain a total score, and then this
total score was transformed to an equated percent correct score based on the
equating formula for that test form established during the spring 2000 baseline
testing year.
The results for the equated percent correct
scores mirror those reported in the first analyses: The adjusted means for
students taking the plain English version were slightly higher at grades 4 and
10 but lower at grade 7. The magnitudes of the differences are extremely small
at each grade level and demonstrate no practical differential effect.
Analysis 3. The third set of analyses was based on item
response theory (IRT) procedures and more directly addressed the construct and
item functioning equivalency of the two versions of the tests at any one grade
level. In these analyses, the common item anchor block design was used to put
the student ability and item parameters for both versions of the tests on the
same scale.
The grade 4 mathematics IRT mean ability and item discrimination
and difficulty estimates are almost identical for the two ELL groups. On the
grade 7 mathematics form, there are no differences in mean ability or item
discrimination estimates but the mean item difficulty estimates do differ
somewhat. In this case, the items in the plain English version were more
difficult for students than were the items when administered in their original
form. For the grade 10 mathematics forms, the mean item discrimination and
difficulty estimates are very close but there is a slight difference in the
ability estimates, with the plain English version of the test resulting in a
slightly higher mean ability estimate (.003 vs. .303).
In summary, the evidence from these three
analyses of ELL students confirms the equivalency of the two forms of
mathematics test items at every grade level for this population, as was found
for general education students. There is no evidence to suggest that performance
is impacted differentially for ELL students; hence, these results do not provide
support for simplified language as an aid to performance for non-native English
speakers.
III. Students with Disabilities
The third portion of the study of plain English items focused on
students with disabilities using a model similar to the item response theory
analysis for ELL students described above. Due to the introduction of tests
designed for special populations in the spring of 2001, students with
disabilities were exposed to original English and simplified English items with
identical mathematics content over two successive years. Students with
disabilities who took the original forms of mathematics test items during the
spring of 2000 were compared with students with disabilities who took a new and
modified test form during spring 2001.
The entire group of 4th grade
students with disabilities who took the modified test numbered 570 during spring
2001. The comparison groups of students with disabilities, all of whom were
exposed to the items in their original form, were identified from a random
sample of 8000 4th grade students who took all four original test
forms in 2000. Mathematics ability and item difficulty parameters were estimated
for all students and items using a one-parameter IRT estimation procedure. The
study design provided IRT mathematics ability score estimates that were used as
a covariate for these analyses.
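For readers unfamiliar with one-parameter IRT, the response function of the standard one-parameter logistic (Rasch) model is sketched below. This assumes the standard form of that model; the report does not name the estimation software or state whether a scaling constant was applied, and the function name rasch_probability is hypothetical.

```python
import math


def rasch_probability(theta, b):
    """Probability of a correct response under the one-parameter logistic model.

    theta: student ability estimate.
    b: item difficulty estimate, on the same scale as theta.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# A student whose ability equals the item difficulty has an even chance of success.
p_equal = rasch_probability(0.0, 0.0)
# The same student facing an easier item (lower b) is more likely to answer correctly.
p_easier = rasch_probability(0.0, -1.0)
```

Because ability and difficulty share one scale, ability estimates from this kind of model can serve as a covariate when item performance is compared across groups that took different test forms.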
The analyses found a statistically significant difference in favor of the
modified test group on 9 of the 13 plain English items but on none of the
unchanged items. These results
indicate that plain English as a modification frequently provided a benefit to
students with disabilities.
The finding that only the modified test
group, made up solely of lower performing students with disabilities, was helped
by the simplified English and modified format of the items on the modified test
is the first consistent difference in effect across the three parts of this
investigation. Meaningful differences were not found for general education
students or English language learners. This result is important within the
context of the original validity study, in which the plain English modifications
were not found to have any sizeable or consistent impact upon groups of general
education students taking general assessments.
Study #3
Study #3 was planned as a comparison
of the performance of students with disabilities and general education students
in 4th grade on identical mathematics test items with and without
calculators. Data for Part I of this study were obtained from a large-scale
administration of fourth grade mathematics items modified to be suitable for use
by students with disabilities. Part II describes the results of a supplemental
study of general education students on 16 items with assigned calculator
conditions.
I. Students with Disabilities
Mathematics ability and item
difficulty parameters were estimated for all students and items using the
procedure described in Part III of Study #2. Mathematics ability was used as a
covariate for both of the analyses of students with disabilities. The first
comparison of students with disabilities involved the performance on the common
block of items for students who took the modified test and students with
disabilities who took the general test form at the same time.
A second analysis was conducted to
compare the performance of students with disabilities on the modified test with
students with disabilities who were exposed to similar items on the previous
year's general mathematics assessment, as described in Part III of Study #2.
This analysis comprised the seven items from the overlapping administrations
that were omitted from Study #2 because the items involved computation and were
thus amenable to the additional accommodation of calculator use.
Of the seven items that were analyzed,
two items had statistically significant higher adjusted mean scores for students
who took them as part of the modified assessment and thus had calculators
available to them. These were the two most difficult of the seven modified
calculator-friendly items. This result contrasts with the results found in Study
#2 for items in which simplified English was the major modification. On those
items, nine of 13 advantaged the modified test group. Even though the
calculator-friendly items also had plain English modifications, only two of the
seven advantaged the modified test group. Calculator availability may provide a
modest benefit to lower-performing students with disabilities or the benefit may
be entirely explained by the simplified language of the items. Evidently, plain
English was not as helpful on these calculator items as it was
on the non-calculator items discussed in Study #2.
II. General Education Students
In order to assess a calculator accommodation without interference from other test format
and presentation changes, intact classes of students were solicited to
participate in a final study. A test booklet containing 16 mathematics items
drawn from the Kansas 4th grade
mathematics assessment was prepared to represent a variety of indicators and
problem types. Of the 16 items, eight were intended to benefit from calculator
use and eight did not require calculator use.
Thirteen schools from 13 districts each volunteered one intact class of 6th
graders to participate. The volunteer classes were then randomly assigned either to
the calculator or no-calculator conditions, with seven classes assigned
calculators and six classes assigned the no-calculator condition. Completed
answer sheets were returned from all 13 classes for a total sample of 244
students.
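Random assignment of intact classes to conditions, as described above, can be sketched as follows. The function name assign_classes, the class labels, and the seed are hypothetical; the report does not describe the actual randomization procedure.

```python
import random


def assign_classes(class_ids, n_calculator, seed=None):
    """Randomly split intact classes into calculator and no-calculator conditions."""
    rng = random.Random(seed)   # a seeded generator makes the split reproducible
    shuffled = list(class_ids)
    rng.shuffle(shuffled)
    return {
        "calculator": sorted(shuffled[:n_calculator]),
        "no_calculator": sorted(shuffled[n_calculator:]),
    }


# Thirteen hypothetical class labels, seven of which receive calculators.
classes = [f"class_{i:02d}" for i in range(1, 14)]
conditions = assign_classes(classes, n_calculator=7, seed=2003)
```

Because whole classes rather than individual students are assigned, students within a class share a condition. That is why comparing the groups on the non-calculator items is a useful check of their overall equivalence.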
Calculator status had no statistically significant effect on performance for the overall
test, though the mean test score for the calculator users was slightly lower
than for the non-users. There was no interaction of score with gender or
ethnicity. Ethnic status had a statistically significant main effect on test
performance with students reporting minority ethnicity obtaining lower scores.
The test was separated into subtests of calculator items and non-calculator items,
which were moderately correlated (r = .522). The mean score for the calculator
items was 6.3 while the mean
non-calculator item score was 5.6. Subtest score mean differences were not
statistically significant by calculator group. The similar performance of the
two calculator-access groups on the non-calculator items suggests an overall
equivalence for the students in this study. The calculator item subtest scores
were compared by group using the non-calculator item subtest score as a
covariate, again without statistically significant effect.
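The subtest correlation reported above is presumably a Pearson product-moment coefficient, which can be computed as in the sketch below. The function name pearson_r and the scores are hypothetical and are not the study data.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical subtest scores (0 to 8) for six students.
calculator_items = [7, 6, 8, 5, 6, 4]
non_calculator_items = [6, 5, 7, 6, 4, 5]
r = pearson_r(calculator_items, non_calculator_items)
```

A moderate positive correlation between the two subtests is what makes the non-calculator subtest score usable as a covariate when the calculator item scores are compared by group.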
Finally, individual items were compared across the two groups in order to explore any
differences in item content that might be related to calculator usage. Only for
one item did the calculator group perform better, and this item was a
straightforward computation item involving adding a column of three-digit
numbers. The other three items that showed statistically significant differences
between groups comprised a subtraction problem that could be quickly estimated,
a division problem that should have been an easily recognized math fact, and a
problem that required the addition of minutes in intervals of 10 and conversion
into hours. For these three problems, having a calculator actually produced
lower performance.
In summary, calculator use had no statistically significant effect on the overall
performance of typical 6th grade students in this study, regardless of gender or
ethnicity. When there was
an effect for general education students, that effect was either positive or
negative depending on item type. In certain instances having a calculator at
hand seemed to inhibit the use of a superior method for that problem type, such
as recognizing a math fact or estimating. Students with disabilities were, at best,
modestly helped by calculator availability. Calculator use may have
provided a benefit on a small number of purely computation items. These items
were presented in tandem with other accommodations so the effects of having a
calculator are impossible to completely disentangle from other effects.
In determining whether calculator use is a permissible accommodation, the overall
lack of effect on test scores for general education students supports the
equivalence of calculator use with non-use for typical students. This finding
suggests that allowing students who need calculators to use them does not result
in altering the construct being measured and that test scores should be
comparable regardless of calculator availability.