NCEO Policy Directions
Published by the National Center on Educational Outcomes
Out-of-Level Testing: Pros and Cons
Any or all portions of this document may be reproduced and distributed without prior permission, provided the source is cited as:
Thurlow, M., Elliott, J., & Ysseldyke, J. (1999). Out-of-level testing: Pros and cons (Policy Directions No. 9). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved [today's date], from the World Wide Web: http://education.umn.edu/NCEO/OnlinePubs/Policy9.htm
Whether called "out-of-level," "off-grade-level," "functioning-level," or "instructional-level" testing, the practice of assessing students using a lower-level version of a test is controversial. The controversy pits unintended instructional consequences against "accurately" measuring performance and avoiding student frustration. The controversy also reflects beliefs about the appropriateness of delivering instruction at a student's perceived functional level rather than adapting on-grade-level instruction to the specific needs of the student. The out-of-level testing controversy is particularly pertinent to students with disabilities, who typically are functioning at lower performance levels than their peers, and who, as a result of changes in federal education laws, must participate in state and district assessments.
To explore the controversy of out-of-level testing, and assist educators and policymakers in making appropriate decisions about its use in large-scale assessments, we describe its meaning and its history, then discuss arguments for and against its use. We conclude with several important considerations and questions to ask before implementing an out-of-level testing policy or administering an out-of-level test to a student.
What is Out-of-Level Testing?
Out-of-level testing is a term used to mean that a student who is in one grade is assessed using a level of a test that was developed for students in another grade. Lower-level testing is almost universally what is meant when terms like "out-of-level," "off-grade level," and "instructional-level" are used.
The use of out-of-level testing in large-scale assessment programs has increased during the past 10 years. Generally it is presented in policy as an accommodation or modification for students with disabilities. (See Table 1 for trends in the use of out-of-level testing.) State policies often warn that scores from out-of-level testing are to be interpreted with caution; usually, out-of-level tests can be used only for students with disabilities.
Table 1. Trends in the Use of Out-of-Level Testing
a From Thurlow, Ysseldyke, & Silverstein (1993)
History of Out-of-Level Testing
Out-of-level testing first emerged in norm-referenced testing. Norm-referenced tests (NRTs) were developed with forms for different grade levels. Originally, it was intended that a child would be given the form that corresponded to that child's grade level.
Following procedures used in individualized testing, it was sometimes decided to use the same procedures for group testing: to select for a student the form that corresponded to that student's functional skill level. The decision about which form to use was based on other information about the student, such as assessed reading level, teacher judgment of instructional level, and so on.
These approaches reflect some of the same ideas as those that are used in individualized intelligence and achievement testing. They are also the basis for computer-adapted testing, in which performance on selected test items leads to a branch of items that start with those the student can answer correctly, regardless of the difficulty levels of the items. In fact, out-of-level testing has been called the "poor man's version of computer-adapted testing." Although out-of-level testing grew out of norm-referenced testing, it soon was being applied to criterion-referenced tests (CRTs).
Perceived Pros and Cons of Out-of-Level Testing
There are both pros and cons associated with out-of-level testing. They reflect different perspectives on large-scale testing and the connection between instruction and tests developed to assess the results of instruction.
Arguments For Out-of-Level Testing
Individuals who argue the pro side of out-of-level testing generally cite three types of benefits: (1) avoiding undue frustration for the student, (2) improving the accuracy of measurement, and (3) better matching the student's current educational goals and instructional level. It is suggested that it is unfair for students who are not performing at grade level to be subjected to grade-level tests.
Avoiding student frustration and emotional trauma is a common argument for out-of-level testing. Being tested at grade level when not performing at that level is considered too emotionally traumatizing, and the trauma of the testing experience is thought to increase exponentially as the difference grows between the student's grade and the grade at which the student is functioning. Those in favor of out-of-level testing also argue that it is the most humane approach for students not performing well in school. Students are not forced to dwell on their errors, but rather are provided with test items to which they can respond in a reasonable manner.
Improved accuracy of measurement is also given as a reason for out-of-level testing. Psychometric support for out-of-level testing cites the over-statement of actual performance that occurs when there are many chance-level scores for students assessed at their grade level (Doscher & Bruno, 1981; Wick, 1983). This means that the performance of students looks better than it actually is when grade-level assessments are used.
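The chance-score argument is simple arithmetic: on a multiple-choice test, even a student who knows none of the material earns points by guessing. A minimal sketch (the test length and number of answer choices are hypothetical, not figures from the studies cited above):

```python
# How chance-level guessing can inflate apparent performance on a
# multiple-choice test: a student guessing at random on every item
# still scores 1/k of the items correct on average, where k is the
# number of answer choices per item.

def expected_chance_score(n_items: int, n_options: int) -> float:
    """Expected raw score for a student answering every item at random."""
    return n_items / n_options

# Hypothetical example: a 40-item test with 4 answer choices per item.
n_items, n_options = 40, 4
chance = expected_chance_score(n_items, n_options)
print(chance)                      # 10.0 items correct from guessing alone
print(f"{chance / n_items:.0%}")   # 25% apparent "performance"
```

A raw score of 10 on this test is therefore indistinguishable from knowing nothing, which is why many near-chance scores make group performance look better than it is.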
Better measurement occurs when the content of the test matches the student's instructional level. It is generally recognized that the focus of out-of-level testing may not be the same as the grade-level goals. Still, out-of-level tests are said to accurately measure the student's intermediate goals on the pathway to the grade-level standards.
Arguments Against Out-of-Level Testing
Individuals who argue against the use of out-of-level testing generally focus on the purpose of assessments and concerns about expectations and instruction for students. In addition, there are specific responses to some of the arguments made by those supporting the use of out-of-level testing (see Table 2).
Assessments must be consistent with the purpose for which they are being used. Although out-of-level testing may be appropriate for making instructional decisions (e.g., knowing what skills the student has now so that plans can be made about what to teach next), it is viewed as inappropriate for accountability assessments. State and district assessments almost always are used for accountability purposes: to describe what students know and can do in relation to a set of standards, and to evaluate how schools and programs are progressing in providing students with desired knowledge and skills. Testing at a lower grade level does not reflect the student's performance at the standard being assessed for the majority of students.
Out-of-level testing reflects low expectations for students and negatively affects their instruction. Too often, expectations for students who have not performed well in the past are below what they should be, creating a never-ending cycle of low expectations resulting in lower performance, which in turn results in even lower expectations. There are many instances of teachers being surprised by how well students performed when they were tested at grade level. There are related concerns about what happens in instruction when out-of-level approaches are used. It may be assumed that what the student is being tested on is all that the student needs to learn, with the resulting instruction focusing on lower-level standards than those toward which the student should be striving.
Table 2. Arguments For and Against the Use of Out-of-Level Testing
Assumptions of Out-of-Level Testing
There are five assumptions that test developers say should be met before out-of-level testing is considered an appropriate adaptation of testing. There are also objections to the appropriateness of each of these assumptions (see Table 3).
Table 3. Assumptions for Out-of-Level Testing
Considerations in Using Out-of-Level Testing
Three considerations derived from research are important when thinking about the use of out-of-level testing either for a system or for individual students:
Performance on grade-level assessments is likely to be spuriously higher than on out-of-level assessments.
Instructional issues need to be addressed before students are placed in out-of-level tests.
When decision makers are considering the use of out-of-level testing for a particular student, their thoughts should immediately turn to the appropriateness of instruction, and its link to the assessment. Thinking only about the assessment allows one to ignore the critical element: the student's instruction. Decision makers must be able to justify the decision and at the same time be able to defend the instruction that should provide the basis for on-level test performance.
Unintended consequences of out-of-level testing include never reaching grade level, or passing a high-stakes test without attaining the grade-level skills it is intended to certify.
Questions to Ask Before Using Out-of-Level Tests
Because there are both pros and cons to the use of out-of-level testing, it is extremely important to decide carefully whether out-of-level testing is appropriate for an individual student, given the purpose of the assessment. Likewise, it is important for testing programs to consider the potential consequences of providing out-of-level testing as an option that may be selected for individual students. Several questions can help guide these decisions.
What is the purpose of the assessment?
Was the test designed to have different levels that are appropriately connected?
Are the unintended consequences of out-of-level testing appropriate?
While there are times when out-of-level testing may be appropriate, there are many times when it is not. Careful consideration of the assumptions underlying out-of-level testing, the purpose of the assessment and its characteristics, and the potential consequences of using out-of-level testing is advised for any program or individual decision-making team contemplating the use of out-of-level testing.
References
Doscher, M., & Bruno, J. E. (1981). Simulation of inner-city standardized testing behavior: Implications for instructional evaluation. American Educational Research Journal, 18(4), 475-489.
Roeber, E., Bond, L., & Connealy, S. (1998). Annual survey of state student assessment programs fall 1997 (Vol. II). Washington, DC: Council of Chief State School Officers.
Thurlow, M. L., Ysseldyke, J. E., & Silverstein, B. (1993). Testing accommodations for students with disabilities: A review of the literature (Synthesis Report 4). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Thurlow, M. L., Scott, D. L., & Ysseldyke, J. E. (1995a). A compilation of states' guidelines for accommodations in assessments for students with disabilities (Synthesis Report 18). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Thurlow, M. L., Scott, D. L., & Ysseldyke, J. E. (1995b). A compilation of states' guidelines for including students with disabilities in assessments (Synthesis Report 17). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.
Wick, J. W. (1983). Reducing proportion of chance scores in inner-city standardized testing results: Impact on average scores. American Educational Research Journal, 20(3), 461-463.
1 Research indicating that performance on assessments is linked to curricula includes that of Shriner and Salvia (1988), who demonstrated that performance on mathematics tests varied as a function of the curriculum that was the basis for the student's instruction, and Bielinski and Davison (1998), who showed that differences in the construction of mathematics items can account for differences in the performance of males and females on the SAT.
Shriner, J., & Salvia, J. (1988). Content validity of two tests with two math curricula over three years: Another instance of chronic noncorrespondence. Exceptional Children, 55, 240-248.
Bielinski, J., & Davison, M. L. (1998). Gender differences by item difficulty interactions in multiple-choice mathematics items. American Educational Research Journal, 35(3), 455-476.
2 Item Response Theory is one approach to constructing tests. It is based on the characteristics of individual test items. To create common scale scores, different levels of the test are administered to the same students. For example, a state with tests in grades 3 and 5 might link the two tests by administering both to a sample of grade 4 students. Raw scores on the two tests are linked to form a common scale score. In this way, for example, it is determined that a raw score of 35 on the grade 3 level test is approximately equivalent to a raw score of 20 on the grade 5 level test. Both of these are translated to a scale score of, say, 350.
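The linking step described above can be illustrated without a full IRT model. Below is a minimal sketch using simple linear (mean/sigma) linking instead: each form's raw scores from a common linking sample are mapped onto one scale by a linear transformation. The linking sample, raw scores, and scale parameters are hypothetical, chosen so the numbers echo the grade 3 / grade 5 example in the footnote.

```python
# Sketch of linking two test levels to a common scale score.
# Assumption: a single linking sample (e.g., grade 4 students) took
# both forms, so each form's raw-score distribution can be mapped to
# the same scale via a linear (mean/sigma) transformation.

from statistics import mean, stdev

def linear_link(raw_scores, scale_mean=350.0, scale_sd=50.0):
    """Return a function mapping raw scores on one form to scale scores."""
    m, s = mean(raw_scores), stdev(raw_scores)
    return lambda raw: scale_mean + scale_sd * (raw - m) / s

# Hypothetical raw scores of the same linking sample on both forms:
grade3_form = [30, 33, 35, 37, 40]   # raw scores on the grade 3 level test
grade5_form = [15, 18, 20, 22, 25]   # same students on the grade 5 level test

to_scale_g3 = linear_link(grade3_form)
to_scale_g5 = linear_link(grade5_form)

# Comparable raw scores on the two forms land at the same scale score,
# as in the footnote's example (raw 35 on grade 3 ~ raw 20 on grade 5):
print(round(to_scale_g3(35)))  # 350
print(round(to_scale_g5(20)))  # 350
```

Operational programs use IRT-based equating rather than this two-parameter linear link, but the core idea is the same: a shared linking sample lets raw scores from different levels be expressed on one scale.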
Appreciation is extended to Mark Davison (Professor, University of Minnesota), for the hours he spent informing us about the psychometric justification for out-of-level testing, and to Scott Trimble (Assessment Director, Kentucky Department of Education), who provided extended comments on drafts and brought the realism of state assessments to the issues addressed here.
This report was prepared by M. Thurlow, J. Elliott, and J. Ysseldyke, with input from many individuals.