Performance Assessments of Critical Thinking: The
Reflective Judgment Approach
by Richard Christen and James Angermeyer, Independent District 196,
Rosemount, Minn., and
Mark L. Davison, College of Education, University of Minnesota,
and
Karen Anderson, College of Education, University of Minnesota
Abstract
We developed a performance assessment of critical thinking based on a
developmental theory of how students reason about complex problems that
may not have a single right or wrong answer. The assessment was designed
not as a single test, but as a format that can be adapted to various
subject areas and scored with a common rubric.
In pilot research with high school social studies students during 1992-93,
rater agreement was quite respectable for scores based on a single task,
correlations with a multiple-choice measure were low, scores increased by
grade, and teachers were generally satisfied with the measure. During
1993-94, District 196 is extending the assessment on a trial basis to high
school science courses and, with the addition of graphic organizers, to a
middle school.
Should Columbus be remembered as a great person whose landing in America
transformed the course of history? Or was he a villain who inaugurated five
centuries of suffering for America's native populations? Is the Endangered
Species Act a valuable instrument in the fight to preserve threatened wildlife
and plant life? Or is it primarily a source of job loss and economic ruin for
parts of the United States?
Educators generally consider the ability to think critically about issues such as these to
be one of the most important educational outcomes. But what exactly is critical
thinking? Can so complex a concept be reliably and meaningfully assessed in the
classroom? Although descriptions of thinking skills are varied and often
overlap, the selection of a particular assessment methodology should be based as
closely as possible on how thinking is defined. In addition, if critical
thinking assessment is to drive instruction, both the definition and assessment
of critical thinking should mirror activities and tasks that classroom teachers
value and understand. While a multiple-choice measure like the Cornell Critical
Thinking Test (Ennis and Millman, 1985) may have a solid theoretical basis and
sound technical characteristics, it does not lend itself to the design of
instructional strategies, and the task of taking such a test does not reflect
the kinds of skills that classroom teachers expect in strong critical thinkers.
Stephen Norris and Robert Ennis define critical thinking as "reasonable and
reflective thinking that is focused upon deciding what to do or believe" (1989,
1). Defined this way, critical thinking is something students can be expected
to use to reach a conclusion based on available information. According to Norris and
Ennis, critical thinking is demonstrated by the ability to: (1) judge the
soundness of information, (2) assess conclusions, and (3) make good inferences.
The Norris and Ennis definition helps clarify the assessment task. Rather
than being the ability to solve "puzzles" (Churchman, 1971), critical thinking
becomes the ability to make decisions or to reach conclusions about problems and
issues that may not always have a single correct answer. Furthermore, one can
easily envision assessment activities that would allow students to perform these
kinds of tasks in a way that reinforces good classroom practice. A written essay,
for example, in which the student is asked to sift through conflicting points of
view and cogently defend a position would require many of the skills embodied in
the Norris and Ennis definition while also exercising the reading, writing, and
problem-solving skills that teachers value.
Reflective judgment, a concept of critical thinking encompassed by the Norris and Ennis
definition, is the basis for a developmental theory proposed by Karen Kitchener
and Patricia King (King and Kitchener, 1993; Davison, King, and Kitchener,
1990). According to this theory, individuals proceed through a specified
sequence of cognitive growth in which they show characteristic changes along
four dimensions. By responding to ill-structured problems, i.e., problems with
no clear right or wrong answers, students demonstrate their maturing
capabilities along these dimensions: (1) Use of Evidence; (2) Use of Authority;
(3) Plausibility of the Argument; and (4) Evaluation. Assessments based on the
Reflective Judgment model can be scaled to show changes over time, providing
valuable information from the standpoint of both testing and instruction.
Using a Reflective Judgment model, Independent School District 196 (Rosemount-Apple
Valley-Eagan) has designed and tested high school-level critical thinking
assessments in an attempt to answer these questions:
- Can a reliable assessment and scoring system be designed for reflective judgment?
- How do scores on this new measure compare to established critical thinking measures?
- Do the average scores on the reflective judgment measure increase as students move from ninth grade to twelfth grade?
- How do teachers feel about the utility of the reflective judgment measure?
Method
A committee of high school teachers designed tasks around four
ill-structured problems, which roughly corresponded with general themes
discussed at the four levels of the school district's social studies
curriculum. The issues were election campaign financing, in the ninth-grade
civics/U.S. government course; the quincentennial of Columbus' first voyage
to America, in the tenth-grade U.S. history course; the collapse of the
Roman Empire, in grade eleven world history; and the effects of working
parents on children, in the twelfth-grade sociology/political science
course. Figure 1 shows the Columbus controversy as it was posed to
tenth-grade students.
Figure 1
Columbus and the Quincentennial: How Should
They Be Viewed?
The year 1992 is the quincentenary or 500th anniversary of
Columbus' first voyage to America. This event has sparked a
controversy over the proper place of Columbus in history and the
appropriateness of "celebrating" his voyages. While many view
Columbus as one of the greatest individuals of modern times,
others call for a recognition of the pain and suffering Columbus
brought to Native Americans.
Task
The authors of the following articles offer a variety of
viewpoints on the significance of Columbus and the
appropriateness of celebrating the quincentennial of his first
voyage. After reading these arguments, write an essay expressing
your thinking about Columbus and the celebration of his
quincentennial.
Be sure to explain your answer using the
authorities and evidence cited in the following
articles and any other authorities and evidence that
you think are related.
Classrooms presumed to represent the general population of the district
were selected from each of the three high schools, with approximately 75
students participating in each grade. Students were given two days of class
time to read the material and to write their essays. They received no
specific preparation prior to the assessment and were not to take the
material home at the end of the first day. Later, the students also
completed the Cornell Critical Thinking Test.
Scoring
The rubric used to score the essays underwent several revisions. The
initial model used an analytical scoring approach in which the four
dimensions of the Reflective Judgment model were described at three
different stages of development. After field testing, the four dimensions
were combined into the holistic scoring rubric shown in Figure 2. A
five-point scale was defined with descriptions at scale points one, three,
and five. This holistic scoring approach has been used successfully with
student writing samples (Northwest Evaluation Association, 1992).
To train the readers, six sample papers representing various levels of
the rubric were selected by the project directors. Readers were given a
chance to read and rate the papers against the rubric criteria. In the
discussion that followed, readers asked questions about the meaning of
various phrases in the scoring rubric and made recommendations to change
some of the wording. Discussions continued until the group was comfortable
with the criteria and the rating scale. This process has been used
successfully by the project directors to train readers for other student
performance-based assessments (Independent District 196, 1992).
In the study itself, each reader scored a packet of approximately 25
papers, with essays from all grades mixed together. Although a reader
would usually be able to tell a student's grade from the paper's content,
mixing the grades was expected to encourage readers to apply the
same standards to all papers. Each paper was scored by two readers. Whenever
the two scores were within one point, they were considered in agreement.
Papers with scores that differed by more than one point were read by a third
reader who determined a final set of scores.
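The two-reader rule described above can be expressed as a short routine. This is a minimal sketch, not the project's actual procedure: the function names `classify_pair`, `needs_third_reader`, and `route_papers` and the sample paper IDs are hypothetical. The code only flags papers for a third reading, since the text does not say how the third reader arrived at the final scores.

```python
from typing import List, Tuple

def classify_pair(score_a: int, score_b: int) -> str:
    """Label two holistic ratings of the same paper (hypothetical helper)."""
    gap = abs(score_a - score_b)
    if gap == 0:
        return "complete"  # identical ratings
    if gap == 1:
        return "close"     # within one point: still counted as agreement
    return "disagree"      # more than one point apart

def needs_third_reader(score_a: int, score_b: int) -> bool:
    """Scores differing by more than one point send the paper to a third reader."""
    return classify_pair(score_a, score_b) == "disagree"

def route_papers(papers: List[Tuple[str, int, int]]) -> List[str]:
    """Return the IDs of papers that must be read a third time."""
    return [pid for pid, a, b in papers if needs_third_reader(a, b)]

# Hypothetical ratings: (paper ID, first reader, second reader)
sample = [("p01", 3, 3), ("p02", 2, 3), ("p03", 1, 4)]
print(route_papers(sample))  # ['p03']
```

Treating a one-point difference as agreement keeps adjudication focused on the few papers where readers genuinely read the rubric differently.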
Figure 2
Reflective Judgment Scoring Criteria
Level 5
- The student draws extensively on evidence presented in the
articles to support the conclusion. The conclusion makes coherent
use of the evidence.
- Significant recognition of authority; the student may have made
some attempt to consider the authority's credibility or point of
view.
- The student recognizes both sides of the issue and is able to
weigh the pros and cons of both sides; recognizes the strengths and
limitations of each position in taking a stand.
Level 4
Level 3
- The student has made a limited effort to use evidence from
the articles to support the argument; the evidence may not
support the conclusions or may be used somewhat incoherently.
- Some identification or recognition of authority, but little
development.
- The student recognizes that another side of the issue
exists, but finds support for only his or her side; may tend to
build up his or her argument by tearing down the other side.
- May be a "laundry list", citing much evidence both pro and
con, but student makes no attempt to take a side and make a
decision.
Level 2
Level 1
- The student does not cite evidence from the articles.
- No identification or recognition of authority, i.e.,
those responsible for the arguments used.
- Student sees only one side of the argument; no
evaluation attempted.
- No evidence that the student used the articles; could
have written the essay by only skimming the articles.
Level 0
Results
In this study, we examined agreement among raters, the
relationship between the performance-based Reflective Judgment
Performance Assessment and the multiple-choice Cornell Critical
Thinking Test, grade trends on both the Reflective Judgment
Performance Assessment and the Cornell Critical Thinking Test,
and teachers' opinions of the Reflective Judgment Performance
Assessment. Reported results were statistically significant (p <
.05) except where noted.
Interrater Agreement
Each pair of scores was categorized as being in complete agreement
when raters gave identical ratings, and as being in close agreement
when there was a difference of one point. Out of 302 papers rated,
raters completely agreed on 149 (49%), closely agreed on 136 more
(45%), and disagreed on only 17 (6%). The interrater reliability
correlation between rater pairs was .75 for ninth grade, .74 for
tenth grade, .59 for eleventh, and .58 for twelfth. To see if some
raters might be systematically more lenient or harsh than others,
we compared the mean rating of each rater with that of the raters
with whom they were paired. The largest mean difference between a
rater and that rater's partners was about half a point.
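The agreement percentages can be rechecked from the reported counts, and the leniency comparison amounts to a difference of means for each rater. The sketch below assumes each double-scored paper is recorded as a tuple `(rater_a, score_a, rater_b, score_b)`; the function `leniency_gaps` and the three-rater sample are hypothetical illustrations, not the study's data or code.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple

# Counts reported in the text for the 302 double-scored papers.
counts = {"complete": 149, "close": 136, "disagree": 17}
total = sum(counts.values())
percent = {k: round(100 * v / total) for k, v in counts.items()}
print(total, percent)  # 302 {'complete': 49, 'close': 45, 'disagree': 6}

def leniency_gaps(pairs: Iterable[Tuple[str, int, str, int]]) -> Dict[str, float]:
    """For each rater: mean of own scores minus mean of partners' scores on the same papers.

    Each tuple is (rater_a, score_a, rater_b, score_b) for one paper.
    A positive gap suggests a more lenient rater, a negative gap a harsher one.
    """
    own = defaultdict(list)
    partner = defaultdict(list)
    for rater_a, score_a, rater_b, score_b in pairs:
        own[rater_a].append(score_a)
        partner[rater_a].append(score_b)
        own[rater_b].append(score_b)
        partner[rater_b].append(score_a)
    return {r: mean(own[r]) - mean(partner[r]) for r in own}

# Hypothetical pairings among three raters (not study data).
pairs = [("A", 3, "B", 2), ("A", 4, "B", 3), ("B", 2, "C", 2)]
print(leniency_gaps(pairs))
```

Comparing each rater only with partners on the same papers keeps differences in paper quality from being mistaken for rater leniency.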
Relations to the Cornell Critical Thinking Test
The correlations between the Cornell Critical Thinking Test and the
Reflective Judgment Performance Assessment were low. Within each
grade, the correlations were .25 for ninth grade, .29 for tenth,
.30 for eleventh, and .18 for twelfth, none of which was
significant. Because the proportion of shared variance is the square
of the correlation, even the largest of these (.30 squared is .09)
means the measures share less than 10 percent of their variance, so
they cannot be said to measure the same construct.
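The shared-variance figure is simply the squared correlation. A quick check using the four within-grade correlations reported above:

```python
# Within-grade correlations between the two measures, as reported above.
correlations = {9: 0.25, 10: 0.29, 11: 0.30, 12: 0.18}

# The proportion of variance two measures share is r squared.
shared = {grade: r ** 2 for grade, r in correlations.items()}

for grade, r2 in shared.items():
    print(f"grade {grade}: r = {correlations[grade]:.2f}, r^2 = {r2:.3f}")
```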
Grade Trends
Figures 3a and 3b show the changes in mean
Reflective Judgment Performance Assessment and Cornell Critical
Thinking Test scores across grades. Figure 4 shows how the
proportion of students scoring at each level of the scoring
rubric changed from ninth to twelfth grade.

Mean scores on the Reflective Judgment Performance Assessment
increased across grades. Expressed on the five-point scale of
the scoring rubric, the means for grades 9, 10, 11, and 12
respectively were 1.94, 2.22, 2.75, and 2.61. The decline from
eleventh to twelfth grade was not statistically significant.
Based on the mean scores and the Level 3 scoring criteria in
Figure 2, it appears the teachers judged the average eleventh-
and twelfth-grade performance as limited in use of evidence, use
of authority, and recognition of alternative points of view,
with little attempt to take a side or make a decision.

On the Cornell Critical Thinking Test, means
also generally increased across grades, except for a
nonsignificant drop from eleventh to twelfth grade. At grades 9,
10, 11, and 12, the mean Cornell Critical Thinking Test scores
were 43.96, 46.53, 49.69, and 48.60.
Although the correlations between the Reflective
Judgment Performance Assessment and the Cornell Critical Thinking Test were low
within grades, the trends in mean scores across grades were very
similar. The low correlations mean that within a grade, the
students who performed best on the Reflective Judgment
Performance Assessment were not necessarily the same
students who performed best on the Cornell Critical Thinking
Test.
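The claim that the two grade trends are very similar can be made concrete by comparing grade-to-grade changes in the reported means. This sketch compares only the direction of each change and does not test significance; as noted above, neither eleventh-to-twelfth-grade decline was significant. The helper names `steps` and `directions` are illustrative.

```python
from typing import List

# Mean scores for grades 9-12, as reported above.
reflective_judgment = [1.94, 2.22, 2.75, 2.61]  # five-point rubric
cornell = [43.96, 46.53, 49.69, 48.60]          # Cornell Critical Thinking Test

def steps(means: List[float]) -> List[float]:
    """Change in the mean from each grade to the next, rounded to two decimals."""
    return [round(later - earlier, 2) for earlier, later in zip(means, means[1:])]

def directions(means: List[float]) -> List[str]:
    """Whether each grade-to-grade change is an increase or a decrease."""
    return ["up" if d > 0 else "down" for d in steps(means)]

print(steps(reflective_judgment))  # [0.28, 0.53, -0.14]
print(steps(cornell))              # [2.57, 3.16, -1.09]
print(directions(reflective_judgment) == directions(cornell))  # True
```

Matching group means say nothing about which individual students score highest within a grade, which is how the trends can agree while the within-grade correlations stay low.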

Teachers' Opinions
While not always satisfied with the particular issues around which the
assessments were constructed, the thirteen social studies
teachers generally believed the essay format allowed students to
display their critical thinking; the scoring rubric was clear,
appropriate and easy to use; and the assessment model would be
appropriate for other disciplines.
Conclusions and Future Plans
The low correlation between the Cornell Critical Thinking
Test and the Reflective Judgment Performance Assessment suggests
that they measure different forms of critical thinking. Teachers
strongly supported the performance-based format, however, and
were convinced that the assessment measured important aspects of
critical thinking that should be taught and measured.
Accordingly, the field test of the Reflective Judgment
Performance Assessment will be continued into a second year in
Independent District 196, with several adjustments to both the
measurement itself and to the field test model.
Although the credibility or bias of authorities is an
important scoring criterion, few students actually considered it
during the original field test. Two minor changes address this
shortcoming: (1) a brief statement reminding students to take
into account the source of evidence and the point of view of the
author will be added to the instructions, and (2) more detailed
biographical information on the authors will be provided.
To gather additional data, the second-year field test will be
administered in science as well as in social studies classes. A
newly designed testing model will also be used. A sample of
science and social studies students from all four grade levels
and each district high school will perform the same task. Because
each grade responded to a different issue in the first year, a common
task should help identify more clearly the reasons mean scores
improved from ninth to twelfth grade.
In the original field test, interrater agreement and
reliability were very respectable for an assessment based on a
single performance. To test an alternative scoring format during
the second year, a number of assessments will be scored by one
teacher provided with the rubric but no training as well as by
two trained teachers. A sample of students also will complete
two different assessments in order to provide data on the
consistency of performance across tasks.
Based on the success of the assessments at the high school,
the study will be expanded to the middle schools during the
second year. A sample of middle school students from grades six,
seven, and eight will complete an assessment examining the
causes of dinosaur extinction. The task format and rubric will
be identical to those used in the high school, but some students
will be provided with graphic organizers in order to test their
impact on performance. As this extension to the middle school
illustrates, the format is flexible enough for adaptation to
various grades and subject matters. Because of this flexibility,
the Reflective Judgment model has the potential to serve not
only as a basis for critical thinking assessment, but also as an
impetus for curriculum improvement organized around tasks and
activities valued by teachers in various disciplines.
References
Churchman, C.W. 1971. The design of inquiring systems:
Basic concepts of systems and organizations. New York:
Basic.
Davison, M.L., King, P.M., Kitchener, K.S. 1990. Developing
reflective thinking and writing. In R. Beach and S. Hynds, eds.,
Developing discourse practices in adolescence and adulthood.
New York: Ablex.
Ennis, R.H., Millman, J. 1985. Cornell critical thinking
test - level X, 3d ed. Pacific Grove, Calif.: Midwest
Publications.
Independent District 196. 1992. Performance indicators of
achievement: Interim report no. 2 - writing assessment results.
Unpublished report, Rosemount, Minn.
King, P.M., Kitchener, K.S. 1993. The development of
reflective judgment in adolescence and adulthood. San
Francisco: Jossey-Bass.
Norris, S.P., Ennis, R.H. 1989. Evaluating critical
thinking. Pacific Grove, Calif.: Midwest Publications.
Paul, R., Binker, A.J.A., Martin, D., Vetrano, C., Kreklau,
H. 1989. Critical thinking handbook: A guide for remodeling
lesson plans in language arts, social studies, and science.
Rohnert Park, Calif.: Center for Critical Thinking and Moral
Critique.