Olson, B., Mead, R., & Payne, D. (2002).
A report of a standard setting method for alternate assessments for students
with significant disabilities (Synthesis Report 47). Minneapolis, MN:
University of Minnesota, National Center on Educational Outcomes. Retrieved
[today's date], from the World Wide Web: http://education.umn.edu/NCEO/OnlinePubs/Synthesis47.html
Executive Summary
As required by Federal law (IDEA and Title
I), state assessment systems must be designed to include all students in one of
three ways: participation in the general assessment without accommodations,
participation in the general assessment with accommodations, or participation
in an alternate assessment. As part of the
design process, states must ensure that (1) assessments are aligned to content
standards, and (2) performance standards have been set to determine the
proficiency levels assigned to specific scores. This report is focused on one
specific approach toward standard setting, a body of work approach, used to
determine performance level cut scores for an alternate assessment developed for
students with the most significant disabilities. This approach was applied in a
state that chose to develop and implement a standards-based alternate portfolio
assessment. The rationale and design of the alternate portfolio assessment are
presented first, followed by a detailed description of the standard setting
process.
The authors identify time and resource
constraints, and areas of potential contamination and bias in this approach.
They also discuss the importance of range-finding and pin-pointing phases for
the body of work approach. These constraints must be considered and addressed in
any standard setting plan. However, this report of one model of a standard
setting process for an alternate assessment for students with significant
disabilities demonstrates that with careful planning, standards can be set for
alternate portfolio assessments just as they can for any other assessment.
Overview
States are developing comprehensive
assessment and accountability systems encompassing high academic standards,
professional development, student assessment, and accountability for all
students. These systems are intended to improve student learning and
classroom instruction, support public accountability and program evaluation,
and assist policymakers in decision making. As required by Federal law (IDEA and
Title I), state assessment systems must be designed to include all students in
one of the following three ways:
• participation in the general
assessments without accommodations
• participation in the general
assessments with accommodations
• participation in the alternate
assessment.
As part of the design process, states must
ensure that (1) assessments are aligned to content standards, and (2)
performance standards have been set to determine the proficiency levels assigned
to specific scores. This report is focused on one specific approach toward
standard setting, a body of work approach, used to determine performance level
cut scores for an alternate assessment developed for students with the most
significant disabilities. This approach was applied in a state that chose to
develop and implement a standards-based alternate portfolio assessment. The
rationale and design of the alternate portfolio assessment are presented first,
followed by a detailed description of the standard setting process.
Including Students with Significant Disabilities
Student portfolios are a purposeful and
systematic collection of student work that is evaluated and measured against
predetermined scoring criteria. For the population of students with significant
disabilities, use of portfolios or a body of evidence of progress toward state
content standards requires a thoughtful application of existing assessment
development procedures that are analogous to what occurs for general assessments,
from the beginning to the end of the assessment development process (Quenemoen,
Rigney, & Thurlow, 2002).
The alternate portfolio approach was designed
so that data could be collected on the educational progress and accomplishments
of students with the most complex disabilities. As with other assessments, the
results were designed to be used during the school improvement planning process
to help schools focus on access to the general curriculum as reflected in state
content standards, and the need to increase proficient student performance
around those standards (Kleinert & Kearns, 2001; Thompson, Quenemoen, Thurlow, &
Ysseldyke, 2001).
In this approach, students with significant
disabilities worked toward the same content standards as defined for all
students, using alternate student learning expectations to measure their
progress. Because the level of performance for these students differs from the
general education population, performance level definitions were created for
these students. The performance level definitions for students with significant
disabilities vary across the states, but generally describe best professional
understanding of outcomes for this population. Draft performance level
descriptors were developed by stakeholder groups during initial implementation,
and were then refined in conjunction with scoring and standard setting
processes.
Portfolio Configuration
One challenge with any large-scale portfolio
assessment is the need to provide standard criteria for developing and assessing
a diverse body of evidence of student performance. This standardization
typically begins with a pre-defined structure for the portfolios, including the
number and type of entries. The portfolios for this state’s alternate assessment
were composed of a specific number of entries divided across content strands in
two subjects (literacy and mathematics). The entries were not limited to paper
and pencil tasks; in fact, educators were encouraged to include such reporting
options as audio and videotapes, photographs, checklists, interviews, surveys,
rating scales, and existing student records.
Portfolio Scoring: Developing Scoring Guides and Training and Qualifying Materials
Providing a standardized means of scoring
such a diverse body of evidence presents a challenge. Although the portfolio
entries were different for each student, all portfolio entries were assessed
using the same criteria. These criteria were specified in a scoring rubric
developed by a group of stakeholders. The stakeholders provided knowledge of
what is considered best practice in teaching and learning for these students,
and identified scoring criteria to encourage these best practices.
The rubric developed for this assessment
followed a focused, holistic, domain scoring model. Under this model, each
portfolio entry was scored individually in three different domains –
performance, appropriateness, and level of assistance. Additionally, a fourth
domain (settings) was scored once per subject area (literacy and mathematics)
using a more analytic scoring approach. These scores were later combined to
provide single raw scores for each subject area.
Scoring the Portfolios
After the rubric was created and portfolios
were developed, a diverse collection of sample portfolios representing the
broad range of student performance at each grade level was brought before
rangefinding committees composed of local educators with both general and
special education backgrounds. The committee members scored the samples,
finalized the rubrics, and helped develop scoring rules. These "pre-scored"
samples were used to create training and qualifying materials for scoring.
Professionally accepted standards of scoring
were used. The portfolios were scored by a group of regular and special
education teachers. The teachers were trained by performance assessment
professionals to ensure that the evaluation of student work produced dependable
scores. As is typical with large-scale performance assessment scoring, the
training began by acquainting scorers with the scoring criteria specified in the
scoring rubric and demonstrated by sample entries that were pre-scored by the
rangefinding committees. Training and qualifying sets of student responses were
used to ensure that the readers were consistently scoring with accuracy before
scoring actual student entries. Readers unable to achieve a pre-set level of
agreement on the qualifying sets were not permitted to score any "live" student
portfolios.
Each portfolio was independently scored by
two readers. Non-adjacent scores were resolved by independent third readings.
Inter-reader reliability (agreement rates) and score point distribution were
monitored daily and cumulatively for each reader and for the scoring group as a
whole.
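The double-scoring rule above can be illustrated with a minimal Python sketch. It assumes that "non-adjacent" means two scores differing by more than one rubric point, which is the usual convention but is not spelled out in this report; the function names and sample scores are hypothetical.

```python
def needs_third_reading(score_a: int, score_b: int) -> bool:
    """Assumption: scores more than one point apart count as non-adjacent."""
    return abs(score_a - score_b) > 1


def agreement_rates(pairs):
    """Return (exact, adjacent, non_adjacent) proportions for (reader1, reader2) pairs."""
    n = len(pairs)
    exact = sum(1 for a, b in pairs if a == b)
    adjacent = sum(1 for a, b in pairs if abs(a - b) == 1)
    non_adjacent = n - exact - adjacent
    return exact / n, adjacent / n, non_adjacent / n


# Hypothetical double-scored entries: (first reader, second reader)
pairs = [(3, 3), (2, 3), (4, 2), (1, 1)]
rates = agreement_rates(pairs)
to_resolve = [p for p in pairs if needs_third_reading(*p)]
```

Running the same tallies per reader and for the whole group, daily and cumulatively, would give the kind of monitoring described above.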
Standard Setting Procedures
Once scoring was completed and the scores
were tallied according to the conventions defined by stakeholders, policymakers,
and assessment experts, standard setting was performed. Just as with general
assessments, approaches to standard setting for alternate assessments are
evolving (Roeber, 2002). The description that follows is based on a specific
approach used by one state that made a decision to require review of actual
student work on the assessment in the standard setting process for literacy and
mathematics. Several issues arose in the development of analogous standard
setting processes for the alternate assessment.
As the processes were developed, components
of several standard setting processes were considered and adapted. Table 1 is
adapted from Roeber (2002), and shows the three approaches that were applied in
various ways, both by design and by practice, as the process evolved. The primary
methodology, however, was the "body of work" approach.
Table 1. Standard-setting Techniques that Might be Applied to Alternate Assessments (selected methods from Roeber, 2002)
Technique | Description |
Contrasting Groups | Teachers separate students into groups based on their observations of the students in the classroom; the scores of the students are then calculated to determine where scores will be categorized in the future. |
Bookmarking or Item Mapping | Standard-setters mark the spot in a specially constructed test booklet (arranged in order of item difficulty) where a desired percentage of minimally proficient (or advanced) students would pass the item; or, standard-setters mark where the difference in performance of the proficient and advanced student on an exercise is a desired minimum percentage of students. |
Body of Work | Reviewers examine all of the data for a student and use this information to place the student in one of the overall performance levels. Standard setters are given a set of papers that demonstrate the complete range of possible scores from low to high. |
Selection of Committees
Members of each committee (literacy and
mathematics), including special educators, were chosen by the state department of
education. So that groups were not idiosyncratic, members were chosen who were
diverse with regard to race, gender, geographic area, and level and depth of
experience. Eleven members served on the committees, five on literacy and six on
mathematics, a blend of special educators and content area teachers.
Selection of Student Work
A team of Data Recognition Corporation (DRC)
handscoring personnel selected the student work that was presented to the
standard setting committees. This team had previously selected the student work
brought before the rangefinding committees. This team also created the training
and qualifying materials used to train the alternate portfolio scorers. When
selecting student work for standard setting, the selection team had two goals in
mind:
1. To select samples of student work that
exemplified the full range of student performance captured by the
portfolios.
2. To select samples of student work that
were representative, yet succinct enough to accommodate the committees’
timeframes.
Rather than presenting entire portfolios to
the committee, a subset of portfolio entries was chosen to represent each
student’s overall level of performance. Portfolios were represented by a sample
of one entry per content strand. This sampling of student work allowed for a
greater number of portfolios to be represented.
In addition to this representative sampling,
each committee was provided with an example of a portfolio with no pieces
removed. These "complete" portfolios (referred to as "exemplars") served three
primary functions:
• to illustrate the construct of completed
portfolios,
• to introduce the variety of types of
entries contained in portfolios, and
• to exemplify student work during
discussions and training.
These exemplar portfolios had individual
entries that were widely varied in terms of assignment and format of response
(e.g., videotape, audio cassette, written documentation) and had an overall raw
score point that fell toward the middle of the range.
DRC estimated that the committees’ timeframes
allowed for a review of fifteen portfolios per grade/subject area. The selection
process for these portfolios and their representative entries follows:
1. Within each grade/subject, an overall
score point range was determined based on the raw score points given to the
portfolios during handscoring.
2. Each score point range was subdivided
into 15 groups. Each group was to be represented by one student’s work.
3. Potential representatives of student
portfolios from each group were selected for review on the basis of how closely
the score on a subset of their entries matched their entire portfolio score,
as measured by the smallest z-score deviate difference.
This was done to select those students whose performance on the subset of
work most closely resembled their performance on the entire portfolio.
4. The selection team used this
rank-ordered index to locate portfolios containing a high level of
consistency of student performance within the entire portfolio and within
each content strand.
5. The entry in each content strand that
best represented the student’s overall performance within the respective
strand was selected.
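Step 3 of the selection process can be sketched in Python as follows. The report does not give the exact formula for the z-score deviate difference, so this sketch assumes one plausible reading: standardize the full-portfolio scores and the subset scores separately across the pool, then rank portfolios by the absolute difference between the two z-scores. The function names and sample data are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize values against their own mean and population standard deviation."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        raise ValueError("all values are identical; z-scores are undefined")
    return [(v - m) / s for v in values]


def rank_by_z_difference(portfolios):
    """portfolios: list of (portfolio_id, full_score, subset_score).

    Returns the ids ordered from smallest to largest |z_full - z_subset|,
    so the portfolios whose subset best mirrors the whole come first.
    """
    ids = [p[0] for p in portfolios]
    z_full = z_scores([p[1] for p in portfolios])
    z_sub = z_scores([p[2] for p in portfolios])
    diffs = {i: abs(zf - zs) for i, zf, zs in zip(ids, z_full, z_sub)}
    return sorted(ids, key=lambda i: diffs[i])


# Hypothetical pool: (id, full portfolio raw score, raw score on the sampled entries)
pool = [("A", 10, 5), ("B", 20, 10), ("C", 30, 12), ("D", 40, 20)]
ranked = rank_by_z_difference(pool)
```

In practice the ranking would be applied within each of the 15 score groups, with the selection team then reviewing the top candidates by hand as described in steps 4 and 5.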
One problem arose as a geographic bias crept
into the system. There were a few schools with large numbers of portfolios that
were ranked highly on the standard-deviation index described in step three of
the selection process. As a result, these schools became overrepresented in the
selection process. This problem was addressed by setting limits on the number of
portfolios from any given school that could be included in any given
subject/grade area. Once a set of samples was gathered, it was reviewed for
geographic bias, and samples from overrepresented schools were systematically
removed, followed by a re-sampling.
A second problem evolved due to the high
number of non-scoreable entries included in portfolios, especially in those
found at the lower end of the overall score point range. In order to best
represent the overall effect of multiple non-scoreable entries in a portfolio,
non-scoreable entries were selected to represent some content strands. This
forced a holistic view of a portfolio’s overall strengths and weaknesses.
Due to the holistic judgment required for
this task, a quality control measure was put into place. To ensure that the
selected pieces were accurately reflecting the range of student performance, all
of the samples were independently reviewed and approved by two people. To start
this review, all the prepared samples were put in rank order by the portfolio’s
overall raw score. The ordered samples were then progressively reviewed to
ensure that the samples did indeed exemplify an increasing level of student
performance from the bottom to the top of the overall range.
This quality control measure was first
performed by a DRC handscoring person who was not involved in gathering the
respective set of samples. Afterward, the same review was independently
performed by a content-specific specialist from the National Center on
Educational Outcomes (NCEO) who had considerable experience working with the
population of students being assessed and with various state assessment systems.
Samples that appeared problematic to either reviewer were replaced and the same
quality control measures were reapplied to ensure that the replacement was
suitable.
As a final check, DRC psychometric staff
reviewed the overall distribution of student scores and suggested areas for
improvement. In some cases, additional student work was "pulled and reviewed" to
more fully represent the range of possible scores.
Procedure
The alternate assessment standard setting was
achieved by committees that met for three days. The process that was followed
during these three days is described here. The standard setting process began
with a group meeting of both committees. State policymakers welcomed the
committee members and provided them with background information including a
brief history of the portfolio assessment system and an explanation of the role
and the importance of standard setting to the assessment process.
Then the committee was provided an overview
of the alternate assessment approach by staff from NCEO. Following this
discussion, DRC group leaders gave an overview of the standard setting process,
including agendas, timelines, and goals.
Once all the background information was
shared with the committee members, the committees were split into two separate
rooms. Each committee was then led through a description of the performance
levels, which, at that point, were considered drafts that were open to revision
by the committees. Before the review of student work, all committees reviewed
the rubrics that were used to score the portfolios.
Each group was charged with setting the
standards for grades 4, 6, and 8 in its respective content area. The literacy
committee was also responsible for setting the standards for grade 11. Each
committee started with grade 8, then moved to grade 6, and on to grade 4. The
literacy committee finished with grade 11.
Before setting standards at each grade level,
each committee was provided with an "exemplar" portfolio containing all of a
student’s work for the respective grade and content area. The selection of this
exemplar is described in the previous section on "Selection of Student Work."
This portfolio was used to lead the group through a discussion of the contents
and composition of the portfolios and to provide examples of the variety of
formats of student responses that were included in the portfolios.
Following this group discussion, each
committee member was given a photocopy of 15 samples of student work for review,
ordered according to score. The assignment for each member was to categorize
each sample according to the performance level descriptors. Committee members
were not told that the samples were ordered. They were instructed to
independently rank each sample.
Once all members of a committee had
categorized all 15 samples, the results of their decisions were compiled and
presented. The presentations focused on impact data and on areas that were
problematic in terms of group agreement. When looking at "problematic" areas,
the committees discussed any areas of ambiguity that they were sensing. Much of
the discussion focused on the performance level descriptors and how these
descriptors were manifested in the form of actual student work. Given that
descriptors were a first draft, changes were made to the descriptors as group
discussions warranted.
After this group discussion, the committee
members re-reviewed the 15 samples of student work, focusing on the samples that
had the greatest level of discrepancy of group consensus. Once this re-review
was completed, the resulting impact data were recalculated and presented. This
process continued until all committee members were satisfied that they had
placed each sample into its proper performance level and that the impact data
accurately reflected the committee’s perception of the state as a whole. It was
not necessary that the group reach a consensus, only that all members were
satisfied with their own rankings before moving to a new grade.
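Impact data in this setting can be read as the percentage of all portfolios that would fall into each performance level under a given set of cut scores. The following Python sketch shows that tally under stated assumptions: the cut scores are the minimum raw scores for the four upper levels, a score equal to a cut falls in the higher level, and the cut values and scores are invented for illustration. The level names are the five classifications used in this report.

```python
from bisect import bisect_right
from collections import Counter

LEVELS = [
    "Not Evident",
    "Emergent",
    "Supported Independent",
    "Functional Independent",
    "Independent",
]


def impact_data(raw_scores, cuts):
    """Percent of portfolios in each level.

    cuts: ascending minimum raw scores for the four upper levels (assumed layout).
    """
    counts = Counter(bisect_right(cuts, s) for s in raw_scores)
    n = len(raw_scores)
    return {LEVELS[i]: 100 * counts.get(i, 0) / n for i in range(len(LEVELS))}


# Hypothetical cut scores and statewide raw scores
cuts = [10, 20, 30, 40]
scores = [5, 10, 15, 25, 25, 35, 45, 45, 45, 50]
impact = impact_data(scores, cuts)
```

Recomputing this table after each round of re-review is what allowed the committees to see how changes in their classifications shifted the statewide distribution.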
Once all grade levels were reviewed, the
committees were brought back together. The impact data from each grade/subject
area were presented. Aberrations in impact data across grades and subjects were
noted and discussed. In some cases, committee members decided that these
variations between grades and subjects were an accurate reflection of variations
of the student performance evidenced in the portfolios. In other cases, the
committee members felt that these variations accurately reflected variations in
classroom instruction or student development across grades and subjects.
Occasionally, committee members indicated
that the variations compelled a need for further review of particular samples of
student work. When committee members indicated further review was needed, they
returned to the smaller groups for further review of work samples and reconvened
to discuss the resulting changes in impact data. This process continued until
all committee members were satisfied that the impact data accurately reflected
their perception of the state as a whole.
Calculation of Summary Statistics
For each grade and content area, the
panelists independently reviewed each of 15 portfolios selected to represent the
range of student responses. The work samples that were chosen represented
consistent performance on the part of the student so that it should be possible
to clearly classify the student’s work. After reviewing the work, the panelists
sorted the portfolios into five classifications (as state performance
descriptors required). To facilitate the analysis, the classifications were
coded numerically, as follows:
0 = Not Evident
1 = Emergent
2 = Supported Independent
3 = Functional Independent
4 = Independent
The cut scores that were implied by the
panelists’ ratings were computed by two different methods. Method one, which is
similar to Contrasting Groups, was based on the mean portfolio score for
each category (Livingston & Zieky, 1982). For this purpose, every
panelist-portfolio combination was treated as a separate observation. Whenever a
panelist placed a portfolio in a category, the portfolio score for that
portfolio was added to the sum for that category. The category mean was
calculated as the category sum divided by the number of scores that were
included. The cut score between adjacent categories was the average of the two
category means.
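Method one can be written out as a short Python sketch. It follows the description above directly: every panelist-portfolio pair is one observation, category means are computed from the portfolio scores, and each cut is the average of two adjacent category means. The function name and sample ratings are hypothetical, and skipping borders where a category received no ratings is an assumption of this sketch.

```python
from collections import defaultdict


def method_one_cuts(ratings, n_categories=5):
    """ratings: iterable of (portfolio_score, category), one per panelist-portfolio pair.

    Returns {(lower, upper): cut_score} for each pair of adjacent categories
    that both received at least one rating.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for score, category in ratings:
        sums[category] += score
        counts[category] += 1
    means = {c: sums[c] / counts[c] for c in counts}

    cuts = {}
    for c in range(n_categories - 1):
        if c in means and c + 1 in means:
            # The cut is the midpoint of the two adjacent category means.
            cuts[(c, c + 1)] = (means[c] + means[c + 1]) / 2
    return cuts


# Hypothetical ratings: (portfolio raw score, coded classification 0-4)
ratings = [(10, 1), (12, 1), (14, 1), (20, 2), (22, 2), (30, 3)]
cuts = method_one_cuts(ratings)
```

Because every rating contributes to a category mean, a single stray rating far from the border shifts the mean and therefore the cut.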
Method two used the mean classification for
each portfolio. If, for example, all six panelists placed a portfolio in the
Emergent category, the mean classification would be 1; if three panelists
rated it Emergent and three rated it Supported Independent, the
mean classification would be 1.5. The cut score between two adjacent categories
was computed by interpolating between the portfolio scores that best defined the
border between the categories, where "best" meant the border that minimized
classification errors.
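The report does not give the exact algorithm for method two, so the following Python sketch is only one plausible reading. It assumes that a portfolio belongs to the upper category when its mean classification is at least halfway between the two codes, that candidate cuts are midpoints between neighboring portfolio scores (a simple form of interpolation), and that a classification error is a portfolio that falls on the wrong side of the candidate cut. All names and data are hypothetical.

```python
def method_two_cut(portfolios, lower):
    """portfolios: list of (portfolio_score, [panelist classifications]).

    Returns (cut, errors) for the border between `lower` and `lower + 1`,
    choosing the candidate cut with the fewest misclassified portfolios.
    """
    threshold = lower + 0.5  # assumed dividing line on the mean classification
    points = sorted((score, sum(cats) / len(cats)) for score, cats in portfolios)

    best_cut, best_errors = None, None
    for i in range(len(points) - 1):
        # Candidate cut: midpoint between two neighboring portfolio scores.
        cut = (points[i][0] + points[i + 1][0]) / 2
        errors = sum(1 for s, m in points if (s < cut) != (m < threshold))
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut, best_errors


# Hypothetical data: raw score and three panelists' coded classifications
sample = [
    (10, [1, 1, 1]),
    (15, [1, 1, 2]),
    (20, [1, 2, 2]),
    (25, [2, 2, 2]),
    (30, [2, 2, 3]),
]
cut, errors = method_two_cut(sample, lower=1)
```

Only portfolios near the border change the error count from one candidate cut to the next, so ratings elsewhere on the scale do not move the chosen cut.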
Overall, method two seemed to perform better
in this application because it was more robust to outliers. The decisions were
based on a relatively small number of portfolios and there was little time to
reconcile the panelists’ decisions or to refine the location of the cut scores.
Method one was heavily influenced by deviant ratings, either an individual
observation that was inconsistent with those near it, a portfolio that was
consistently rated different from its score, or a panelist that systematically
differed from the others. Method two was concentrated on the transitions from
one category to the next and so was not influenced by ratings in other areas.
Preliminary Standards and Impacts
The summary analyses, along with impact data
for the entire set of portfolios, were given to the panelists. The panelists
reviewed the standard definitions and in some cases made minor revisions to
better reflect what was expected of the students. They then discussed the
portfolios where there were significant inconsistencies in the ratings.
The panelists independently reviewed all
their own ratings and made any changes they thought were justified by a
refinement in their understanding of what was expected of the students. The
panelists were asked to rank each portfolio as they thought it should be and not
to reflect a group consensus. However, no attempt was made to limit discussions
among the panelists.
The process was repeated as many times as the
panelists thought it might be productive. This happened more often early in the
week, perhaps because of fatigue later in the process, but also because the
panelists became more familiar with the process and the expectations for the
students.
Rates of Agreement
Each panel comprised five or six raters.
Fifteen portfolios were used with each grade level and content area, which
provided a total of either 75 or 90 ratings per grade and content area. Table 2
gives the percentage of the ratings in which the panelist’s classification
matched the classification implied by the total portfolio score and the
preliminary standards.
Table 2. Percentage of Ratings Given the Same Classification by Panelists
IEP Grade | Literacy | Math |
4 | 71 | 97 |
6 | 80 | 87 |
8 | 76 | 79 |
11 | 82 | |
The classification agreement rates ranged
from 71% to 97%. With a single study, it is not possible to identify the source
of the variation. It may be due to inherent variability of the alternate
portfolio assessment, to the idiosyncrasies of the portfolios, or to the panelists
chosen for the study.
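One way to read the agreement rates in Table 2 is as the percentage of panelist ratings that match the classification implied by the portfolio's total score and the preliminary cut scores. The Python sketch below computes that percentage under this reading; the cut scores, the rule that a score equal to a cut falls in the higher category, and the sample ratings are all assumptions made for illustration.

```python
from bisect import bisect_right


def classification_agreement(ratings, cuts):
    """ratings: list of (portfolio_score, panelist_classification).

    cuts: ascending minimum raw scores for categories 1-4 (assumed layout).
    Returns the percentage of ratings equal to the score-based classification.
    """
    matches = sum(
        1 for score, category in ratings if bisect_right(cuts, score) == category
    )
    return 100 * matches / len(ratings)


# Hypothetical preliminary cuts and panelist ratings
cuts = [10, 20, 30, 40]
ratings = [(5, 0), (12, 1), (18, 2), (25, 2), (45, 4)]
rate = classification_agreement(ratings, cuts)
```

With five or six panelists and 15 portfolios, the denominator would be the 75 or 90 ratings mentioned above.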
Literacy and Mathematics Consistency Across Grades
Literacy and mathematics both showed
considerable variation across grades. For example, in literacy, just over 5% of
grade 4, 6, and 8 portfolios were classified as Not Evident while nearly
four times as many (21.7%) were classified Not Evident for grade 11. Over
55% of grade six portfolios were placed in the Supported Independent category, but
for grade 11, about one third of the portfolios were in this category. The
percentage in Independent ranged from 3% to nearly 10%.
Like literacy, the mathematics results varied
across grades. Eighteen percent of grade 8 portfolios were considered
Independent but only 2.3% of grade 6 portfolios. The committee was somewhat
concerned about the lack of continuity, but thought that the different resources
and objectives at the different grades partially explained the results. It
should also be noted that there were very few portfolios submitted that reached
the Independent level. Students capable of working at the Independent
level may have been included in the regular assessment.
Conversion to Scale Score Metric
For the regular state assessment, raw score
points are converted to Rasch logit scores. A linear transformation
is then used to convert the logits to the final reporting metric. This
transformation was chosen so that the regular assessment cut score for
Proficient was converted to a scale score of 200 and the cut score for
Advanced was converted to a scale score of 250. No attempt was made to
constrain the scale score for Basic; this typically took a value in the
neighborhood of 160 scale score points. The resulting scale has the familiar
logistic curve, which tends to stretch out the extreme scores and compress the
central scores without reordering them.
For the alternate portfolio assessment, the
same general approach is followed, but some adaptations are necessary. Because
of the limited number of cases and the non-standard nature of the assessment,
Rasch scaling is not possible. Instead, a simple logistic transformation
is used (Logistic Score = ln {r / (L – r)}, where r is the raw score and L is the
maximum number of points).
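The logistic transformation can be written as a one-line Python function. The formula is taken from the text; the function name and the error raised for raw scores of zero or the maximum are assumptions, since the logarithm is undefined at those values and the report does not say how such scores were handled.

```python
import math


def logistic_score(r: float, max_points: float) -> float:
    """Logistic Score = ln(r / (L - r)), with r the raw score and L the maximum points."""
    if not 0 < r < max_points:
        # Assumption: extreme scores are rejected here; the report is silent on them.
        raise ValueError("raw score must lie strictly between 0 and max_points")
    return math.log(r / (max_points - r))


# A raw score at the midpoint of the range maps to a logistic score of zero.
midpoint = logistic_score(50, 100)
upper = logistic_score(75, 100)  # ln(3)
```

Scores above the midpoint map to positive values and scores below it to negative values, with the extremes stretched furthest.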
For the alternate assessment, there were no
prior constraints on the reporting metric. The choice that was made was to set
the scale score standards to multiples of 50. This was done with separate
linear transformations of the logistic scores in each performance category. The
minimum scale scores required for the Emergent,
Supported, Functional, and Independent performance
categories are 100, 150, 200, and 250, respectively.
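The category-by-category linear transformation can be sketched in Python as a piecewise linear map that sends each raw cut score to its scale score standard (100, 150, 200, 250). The raw cut values below are invented, and the treatment of the lowest and highest categories, which simply extend the nearest interior line, is an assumption of this sketch; the report does not state how those two categories were anchored.

```python
import math
from bisect import bisect_right

SCALE_CUTS = [100, 150, 200, 250]  # Emergent, Supported, Functional, Independent


def logit(r, max_points):
    """Logistic score ln(r / (L - r)) from the report's formula."""
    return math.log(r / (max_points - r))


def scale_score(r, raw_cuts, max_points):
    """Map raw score r to the reporting scale, one linear piece per category.

    raw_cuts: ascending raw cut scores for the four upper categories.
    """
    x = logit(r, max_points)
    knots = [logit(c, max_points) for c in raw_cuts]
    # Pick the segment; below the first or above the last cut, reuse the
    # nearest interior segment (assumption, not stated in the report).
    k = min(max(bisect_right(knots, x) - 1, 0), len(knots) - 2)
    slope = (SCALE_CUTS[k + 1] - SCALE_CUTS[k]) / (knots[k + 1] - knots[k])
    return SCALE_CUTS[k] + slope * (x - knots[k])


# Hypothetical raw cuts on a 100-point portfolio
raw_cuts = [20, 40, 60, 80]
at_supported_cut = scale_score(40, raw_cuts, 100)
```

Because each piece has its own slope, a one-point scale score difference can correspond to different amounts of raw-score change in different categories.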
A major concern with this approach is that
the scale score metric, unlike those strictly derived with the Rasch measurement
model, will not have all the properties of an interval scale. The scale
score unit will not necessarily have the same meaning at all points on
the scale. Progress is then best measured through percentages in each
performance category rather than differences in scale scores. The scale scores
will always be ordered (a portfolio with more raw score points will always
receive a higher scale score than a portfolio with fewer points), but they will
not necessarily be equally spaced.
Evaluation
Participant Feedback Forms
All committee members were asked to complete
a standard setting feedback form at the completion of the last day of the
standard setting. Participants were asked to respond to a series of questions
about the session components by indicating a high, medium, or low opinion about
each component. Space was provided after each question for individual comments.
Six of the 11 committee members responded. When asked about the clarity of materials,
all respondents rated the materials high. When asked about the clarity and
quality of the instructions and presentations, all respondents rated this area
high. When asked to rate the overall process, all respondents but one rated
it high. One respondent indicated that there was too much down time waiting for
numbers. Respondents were then asked about their satisfaction with the standards
that were set in each of the four categories. All the responses concerning
literacy and mathematics standards were high.
Limitations of the Current Study
The expressed satisfaction of the panelists
with the outcomes, the overall positive responses to the evaluation survey, and
the statistical properties of the recommendations all indicate an excellent
first step toward setting standards. Nonetheless, some limitations on these
results should be considered when planning the next steps. Most of these
qualifications were unavoidable given the limited time and resources available
to establish these preliminary standards.
Time
There were two committees (literacy and
mathematics) with five or six members each meeting simultaneously. The groups
defined five classifications (Not Evident, Emergent, Supported
Independent, Functional Independent, and Independent). Both
groups dealt with grades 4, 6, and 8; Literacy did grade 11 as well. This was
accomplished in three very long days.
Contamination
After working through the standards setting
process three or four times in a period of three days, the judges did not seem
to follow the same mental process at the end of the week as they did at the
beginning. The portfolios were presented in order of total points that had been
awarded, although this ordering was not explained to the panelists and the
scores were not disclosed. By the end of the week the panelists were clearly
aware of the ordering and largely treated the process as a Bookmarking procedure
(Lewis, Mitzel, & Green, 1996). They looked for the point at which the portfolios
changed from one classification to the next rather than classifying each
independently. This did operate to improve the consistency of the raters, at
least superficially. Unless the process is conducted as a Bookmarking procedure
and the order is explained to all panelists, the portfolios probably should be presented in
random order.
Distribution of Portfolios
For each grade, there were approximately 250
portfolios from which to choose. Although this is the entire population of
portfolios for the State in 2001, coverage over the range of possible scores was
thin in many areas.
Sampling of Student Work
The portfolios used in the standard setting
were not complete portfolios. Rather, the work included in the reduced
portfolios was intended to be representative of the entire portfolio.
This was done to reduce the amount of material the panelists needed to review.
No sampling is perfect, however carefully it is done, and this sampling
presented one problem in particular to the judges.
One important aspect in the standards
definition was the number of settings in which the student could function. The
judges were uncomfortable with the reduced portfolios because they believed it
was difficult for the work to show multiple settings. They were advised that if
the full portfolios showed multiple settings, every effort had been made to
capture that in the reduced version. This did not fully relieve their anxiety.
Scoring Experience of Panelists
Many of the standard setting
panelists had also been involved in the initial scoring of the portfolios.
This meant they were very familiar with the portfolios and the project in
general. It also made it difficult for them to separate the standard setting
process from the scoring process.
In scoring, the charge was to assign points
to specific attributes of each sample of the student’s work according to
well-prescribed rubrics. In setting standards, the task was to form a holistic
impression of the student who created the entire body of work, although the work
might not be entirely consistent. It was tempting for the judges to resort to
scoring the work when they were uncertain about what to do with it.
Lack of Pin-Pointing Portfolios
The standard application of the body of work
method involves two distinct phases: range-finding and pin-pointing
(Kingston et al., 2001). In the range-finding phase, the work samples that are
presented cover the entire range of possible work relatively sparsely. Based on
the results of this phase, the pin-pointing phase uses only work samples in the
vicinity of the proposed cut scores.
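The selection step for the pin-pointing phase can be sketched as a simple filter on total score. This is only an illustration under stated assumptions: the scores, the provisional cut of 25, the window of 3 points, and the function name pinpointing_sample are hypothetical, and Kingston et al. (2001) should be consulted for how the window is actually chosen.

```python
def pinpointing_sample(scores, provisional_cut, half_width):
    """Return ids of portfolios scoring within half_width of the provisional cut,
    ordered by total score."""
    nearby = (pid for pid, score in scores.items()
              if abs(score - provisional_cut) <= half_width)
    return sorted(nearby, key=lambda pid: scores[pid])


# Hypothetical total scores for seven portfolios.
scores = {"P01": 12, "P02": 18, "P03": 23, "P04": 25,
          "P05": 27, "P06": 34, "P07": 41}

# Assumed provisional cut from range-finding, with an assumed 3-point window.
near_cut = pinpointing_sample(scores, provisional_cut=25, half_width=3)
# near_cut is ["P03", "P04", "P05"]
```

With sparse coverage of the score range, a narrow window may return very few portfolios, which is the practical obstacle described in the next paragraph.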
The current project included only the
range-finding phase, with one or more rounds to refine the panelists'
understanding of the process and of the standard definitions. The pin-pointing
phase was not possible because of the limited number of portfolios available and
because of the limited time available for the process.
This created at least two problems for the
judges. First, they sometimes thought that they were trying to determine a cut
score based on whether a single portfolio was classified up or down. Second, the
gap between two adjacent portfolios was sometimes quite large, so knowing
that the transition fell between the two did not define the actual cut score
well. A pin-pointing phase should help to resolve both of these issues.
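The second problem can be made concrete with a small calculation. The sketch below is an illustration only: the report does not state how cut scores were computed from the classifications, so the midpoint rule, the scores, and the function name cut_interval are assumptions. It also assumes that classifications rise consistently with total score; if they do not, the lower bound can exceed the upper bound.

```python
def cut_interval(ratings, lower_level, upper_level):
    """ratings is a list of (total_score, level) pairs, one per portfolio.

    Returns the highest score classified at lower_level, the lowest score
    classified at upper_level, and the midpoint between them.
    """
    low = max(score for score, level in ratings if level == lower_level)
    high = min(score for score, level in ratings if level == upper_level)
    return low, high, (low + high) / 2


# Hypothetical panel classifications for four portfolios.
ratings = [(10, "Emergent"), (14, "Emergent"),
           (22, "Supported Independent"), (25, "Supported Independent")]

low, high, midpoint = cut_interval(ratings, "Emergent", "Supported Independent")
gap = high - low   # width of the range in which the cut score could fall
# low, high, midpoint are 14, 22, 18.0 and gap is 8
```

In this example any cut score from 15 to 22 is equally consistent with the classifications, and moving a single portfolio up or down would shift the midpoint noticeably.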
Conclusion
Alternate assessments for students with
significant disabilities present a number of unique challenges. Yet states must
develop and implement a standardized system of gathering student evidence of
achievement on the content standards, and then score and report results for a
highly diverse population. Setting standards for this population presents
additional challenges. The model described above demonstrates there are time and
resource constraints, and areas of potential contamination and bias. It also
shows the importance of range-finding and pin-pointing phases for the body of
work approach. These constraints must be considered and addressed in any
standard setting plan. However, this report of one model of a standard setting
process for an alternate assessment for students with significant disabilities
demonstrates that with careful planning, standards can be set for alternate
portfolio assessments just as they can for any other assessment.
References
Kingston, N., Kahl, S. R., Sweeney, K., &
Bay, L. (2001). Setting performance standards using the body of work method.
In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and
perspectives. Mahwah, NJ: Lawrence Erlbaum.
Kleinert, H., & Kearns, J. (2001).
Alternate assessment: Measuring outcomes and supports for students with
disabilities. Baltimore, MD: Brookes Publishing.
Lewis, D. M., Mitzel, H. C., & Green, D. R.
(1996, June). Standard setting: A bookmark approach. Paper presented at
the Council of Chief State School Officers Large-Scale Assessment Conference,
Colorado Springs, CO.
Livingston, S. A., & Zieky, M. J. (1982).
Passing scores: A manual for setting standards of performance on educational and
occupational tests. Princeton, NJ: Educational Testing Service.
Quenemoen, R., Rigney, S., & Thurlow, M.
(2002). Use of alternate assessment results in reporting and accountability
systems: Conditions for use based on research and practice (Synthesis
Report 43). Minneapolis, MN: University of Minnesota, National Center on
Educational Outcomes. Retrieved from the World Wide Web: http://education.umn.edu/NCEO/OnlinePubs/Synthesis43.html
Roeber, E. (2002). Setting standards on
alternate assessments (Synthesis Report 42). Minneapolis, MN: University of
Minnesota, National Center on Educational Outcomes. Retrieved from the World
Wide Web: http://education.umn.edu/NCEO/OnlinePubs/Synthesis42.html
Thompson, S.J., Quenemoen, R., Thurlow, M.L.,
& Ysseldyke, J.E. (2001). Alternate assessments for students with
disabilities. Thousand Oaks, CA: Corwin Press.