Considerations for the Development and Review of Universally Designed Assessments


NCEO Technical Report 42

Published by the National Center on Educational Outcomes

Prepared by:

Sandra J. Thompson • Christopher J. Johnstone • Michael E. Anderson • Nicole A. Miller

November 2005


Any or all portions of this document may be reproduced and distributed without prior permission, provided the source is cited as:

Thompson, S.J., Johnstone, C.J., Anderson, M. E., & Miller, N. A. (2005). Considerations for the development and review of universally designed assessments (Technical Report 42). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved [today's date], from the World Wide Web: http://education.umn.edu/NCEO/OnlinePubs/Technical42.htm


Acknowledgements

NCEO extends its sincere appreciation to the individuals who shared their expertise, thoughts, feedback, and suggestions to help further develop and refine the considerations for universally designed assessments:

 

Karen Barton, CTB McGraw Hill

Sheryl Burgstahler, DO-IT Center, University of Washington

Margo Gottlieb, Illinois Research Center

Tom Haladyna, Arizona State University

Tracey Hall, CAST, Inc.

Barbara Henderson, American Printing House for the Blind

Scott Marion, National Center for the Improvement of Educational Assessment 

Ken Olsen, Mid South Regional Resource Center

Marge Petit, National Center for the Improvement of Educational Assessment

Charles Stansfield, Second Language Testing, Inc.

Gerald Tindal, University of Oregon

Carol Traxler, Gallaudet University

Tim Vansickle, Minnesota Department of Education

 


Executive Summary

Universal design is an approach to educational assessment based on principles of accessibility for a wide variety of end users. Thompson, Johnstone, and Thurlow described seven elements of universally designed assessments in their 2002 report entitled Universal Design Applied to Large Scale Assessments. Elements of universal design include inclusive test population; precisely defined constructs; accessible, non-biased items; tests that are amenable to accommodations; simple, clear and intuitive procedures; maximum readability and comprehensibility; and maximum legibility. Since the 2002 report, Universal Design Project staff  have examined research from a variety of fields in an effort to specify how elements of universally designed assessments can be put into practice.

This report describes the development of a “considerations of universally designed assessments” form based on Thompson et al.’s original elements. Considerations are specific questions for test designers to take into account while designing assessments. This report provides the original list of considerations from Thompson et al., then describes a validation process, whereby assessment and content area experts participated in a Delphi study. The Delphi study illuminated expert consensus on some considerations and disagreement on others. All expert commentary is captured in the text of this paper and in Appendix C (in tabular form), and a revised list of considerations is found in Appendix D.

Based on the comprehensive work represented in this report, several recommendations are presented for the use of the considerations of universal design at all stages of test development:

  1. Incorporate elements of universal design in the early stages of test development. 

  2. Include disability, technology, and language acquisition experts in item reviews. 

  3. Provide professional development for item developers and reviewers on use of the considerations for universal design.

  4. Present the items being reviewed in the format in which they will appear on the test.

  5. Include standards being tested with the items being reviewed. 

  6. Try out items with students.

  7. Field test items in accommodated formats.

  8. Review computer-based items on computers.


Introduction

The term universal design has been applied to a variety of educational approaches over the past several years. For instance, universal design for learning was first described by the Council for Exceptional Children (CEC) in a Research Connections article (CEC, 1999). Likewise, Thompson, Johnstone, and Thurlow (2002) of the National Center on Educational Outcomes (NCEO) described universal design approaches to large-scale assessment. In their initial paper on universal design of assessments, Thompson et al. outlined seven elements of universally designed assessments (inclusive assessment population; precisely defined constructs; accessible, non-biased items; amenable to accommodations; simple, clear and intuitive procedures; maximum readability and comprehensibility; and maximum legibility). Although the elements of universal design provide guidance to states and assessment companies about design issues, there is still a need for specific information about what should be considered during test development in order to make tests accessible to a wide range of students.

This report summarizes the process of developing and refining a list of considerations for the universal design of statewide assessments for all students, including students with disabilities and English language learners. The staff of the Universal Design Project at NCEO, working closely with experts in the fields of assessment, disability, content areas (reading and math), and language acquisition, completed this version of the considerations in the summer of 2004. This revision was one of three that followed the compilation of an initial set of considerations identified from a literature review of multiple content areas (see Thompson et al., 2002). The first version incorporated stakeholder input from the Council of Chief State School Officers (CCSSO) conference on large-scale assessment in 2003. Following CCSSO feedback, a second version (a Delphi review; see the description later in the text) was developed by NCEO in partnership with the Minnesota Department of Education, with a primary focus on students with limited English proficiency. This report describes the third validation study, conducted by the Universal Design Project at NCEO, which produced the third version of the considerations for use by test developers and item reviewers. The report also discusses the process used to validate the considerations, the issues that arise when using them, and recommendations for use.

 

Purpose of the Study

The purpose of this report is to describe the process of developing and refining a set of considerations for item developers and item review teams to take into account in the universal design of inclusive, standardized, statewide assessments.  Although the goal of this process was to find design strategies that maximize the accessibility of tests and test items, a larger goal was to create an instrument to guide careful consideration of the elements of test design in order to discover issues in items that may be problematic. 

 

What is Universal Design?

More than 20 years ago, Ron Mace, an architect who was a wheelchair user, began to actively promote a concept he termed “universal design.” Mace was adamant that his field did not need more special purpose designs that serve primarily to meet compliance codes and may also stigmatize people.  Instead, he promoted design that works for most people, from the child who cannot turn a doorknob to the elderly woman who cannot climb stairs to get to a door (Mace, 1998).

The term universal design is found in the newly reauthorized Individuals with Disabilities Education Act of 2004 (Public Law No: 108-446).  Specifically, IDEA of 2004 states that: 

The State educational agency (or, in the case of a districtwide assessment, the local educational agency) shall, to the extent feasible, use universal design principles in developing and administering any assessments under this paragraph. (Section 612(a)(16)(E))

Universal design is specifically defined in the U.S. Assistive Technology Act of 2004 (Public Law No. 108-364-ATA 2004) as follows:

[A] concept or philosophy for designing and delivering products and services that are usable by people with the widest possible range of functional capabilities, which include products and services that are directly accessible (without requiring assistive technologies) and products and services that are interoperable with assistive technologies.

Assessments that are universally designed are designed from the beginning, and continually refined, to allow participation of the widest possible range of students, resulting in more valid inferences about performance. These assessments are based on the premise that each child in school is a part of the population to be tested, and that test results should not be influenced by disability, gender, race, or English language ability. Universally designed assessments are not intended to eliminate individualization, but they may reduce the need for accommodations and various alternative assessments by eliminating access barriers associated with the tests themselves. 

The elements of universal design, according to Thompson et al., are:

1.  Inclusive assessment population
2.  Precisely defined constructs
3.  Accessible, non-biased items
4.  Amenable to accommodations
5.  Simple, clear and intuitive procedures
6.  Maximum readability and comprehensibility
7.  Maximum legibility

From these elements, Universal Design Project staff constructed considerations for universally designed assessments. The considerations are a list of specific questions that help test designers locate potential design issues in items. The considerations are listed in Table 1.

Table 1: Considerations for Universally Designed Assessment Items

Does the item…

Measure what it intends to measure

•   Reflect the intended content standards (reviewers have information about the content being measured)

•   Minimize skills required beyond those being measured

Respect the diversity of the assessment population

•   Accessible to test takers (consider gender, age, ethnicity, socio-economic level)

•   Avoid content that might unfairly advantage or disadvantage any student subgroup

Have clear format for text

•   Standard typeface

•   Twelve (12) point minimum for all print, including captions, footnotes, and graphs (type size appropriate for age group)

•   Wide spacing between letters, words, and lines

•   High contrast between color of text and background

•   Sufficient blank space (leading) between lines of text

•   Staggered right margins (no right justification)

Have clear pictures and graphics (when essential to item)

•   Pictures are needed to respond to item

•   Pictures with clearly defined features

•   Dark lines (minimum use of gray scale and shading)

•   Sufficient contrast between colors

•   Color is not relied on to convey important information or distinctions

•   Pictures and graphs are labeled

Have concise and readable text

•   Commonly used words

•   Vocabulary appropriate for grade level

•   Minimum use of unnecessary words

•   Idioms avoided unless idiomatic speech is being measured

•   Technical terms and abbreviations avoided (or defined) if not related to the content being measured

•   Sentence complexity is appropriate for grade level

•   Question to be answered is clearly identifiable

Allow changes to its format without changing its meaning or difficulty (including visual or memory load)

•   Allows for the use of braille or other tactile format

•   Allows for signing to a student

•   Allows for the use of oral presentation to a student

•   Allows for the use of assistive technology

•   Allows for translation into another language

Does the test…

Have an overall appearance that is clean and organized

•   All images, pictures, and text provide information necessary to respond to the item

•   Information is organized in a manner consistent with an academic English framework with a left-right, top-bottom flow

In addition to the other considerations, a computer-based test should have these considerations:

Layout and design

•   Sufficient contrast between background and text and graphics for easy readability

•   Color is not relied on to convey important information or distinctions

•   Font size and color scheme can be easily modified (through browser settings, style sheets, or on-screen options)

•   Stimulus and response options are viewable on one screen when possible

•   Page layout is consistent throughout the test

•   Computer interfaces follow Section 508 guidelines

Navigation

•   Navigation is clear and intuitive; it makes sense and is easy to figure out

•   Navigation and response selection is possible by mouse click or keyboard

•   Option to return to items and return to place in test after breaks

Screen reader considerations

•   Item is intelligible when read by a text/screen reader

•   Links make sense when read out of visual context (“go to the next question” rather than “click here”)

•   Non-text elements have a text equivalent or description

•   Tables are only used to contain data, and make sense when read by screen reader

Test specific options

•   Access to other functions is restricted (e.g., e-mail, Internet, instant messaging)

•   Pop up translations and definitions of key words/phrases are available if appropriate to the test

•   Students are able to record their responses and read them back (and have them read back using text-to-speech) as an alternative to a human scribe, but only if student has experiences with this mode of expression and chooses it for the test

Computer capabilities

•   Adjustable volume

•   Speech recognition available (to convert user’s speech to text)

•   Test is compatible with current screen reader software

•   Computer-based option to mask items or text (e.g., split screen)

•   Computer software for test delivery is designed to be amenable to assistive technology

 


Delphi Review

We conducted a Delphi review to determine the usefulness of existing considerations for universally designed assessments. The intent of the Delphi review was to invite experts in the fields of assessment, special education, academic content, and language acquisition to give input on the considerations and modify them accordingly (Adler & Ziglio, 1996).  The Delphi method is a structured process of using a series of questionnaires to gather the combined input from a group of persons with expertise related to a specific area or population.  The method has been used in the social science and public health fields since the mid-1970s (Adler & Ziglio, 1996).  Delphi studies allow participants to give their own informed opinion on an issue.  The input is then compiled and returned to the participants who can respond to further questions, respond to the input from the other participants, and revise their own comments if desired.  All iterations of Delphi are anonymous. 

This Delphi study took place entirely by e-mail.  Participants were unaware of who was invited to participate in the study, who elected to participate, and the individuals who provided feedback (anonymity was maintained throughout the study).  All suggestions and comments were given equal weight.

 

Participants

Universal Design Project research staff identified a group of experts to review the considerations for universally designed assessments. To ensure that important areas of expertise were represented, staff created a chart of the areas needed and recommended participants based on their expertise in one or more of those areas (see Table 2). These individuals were then invited to participate in the Delphi review before the first Delphi questionnaire was sent out. The resulting group of Delphi participants represented experts in the fields of assessment, assistive technology, computer-based testing, reading, math, second language acquisition and testing, disability consultation, and special education.

Table 2: Expertise and Participants

Area of expertise: Participant

Vision: Barbara Henderson

Computer-based testing, learning disabilities: Gerald Tindal

Item analysis: Karen Barton

Second language acquisition and testing: Margo Gottlieb

Second language acquisition, testing, and translation: Charles Stansfield

Physical disabilities: Sheryl Burgstahler

Hearing: Carol Traxler

Science: Scott Marion

Psychometrics: Tom Haladyna

Assistive technology: Tracey Hall

Math: Marge Petit

Special education assessment: Ken Olsen

State Assessment Director: Tim Vansickle

 

Delphi Process

The first Delphi survey (Delphi Form 1—see Appendix A) was developed to obtain specific feedback on the considerations draft presented by NCEO. Expert participants were provided ample opportunity to comment on the considerations or add to the list. The participants were asked first to rate the importance of each individual consideration on a five point Likert scale. They then were asked to comment on any of the considerations about which they felt strongly positive or negative. They could also pose questions on the form.  Finally, they were asked to add any additional considerations and rate the importance of their additions.  The participants were instructed to try to think about the considerations in terms of their usefulness for test developers and item reviewers.

In July 2004, the first Delphi survey (Delphi Form 1) was e-mailed to the participants. Each participant was given seven days to review the considerations and e-mail comments back to NCEO. Comments and ratings were returned by 13 of 14 participants. These were compiled at NCEO and a second survey was developed (Delphi Form 2; see Appendix B).

The second survey (Delphi Form 2) included a list of anonymous individual ratings and the mean of all ratings assigned to each consideration. All comments made by the participants on the first form were included in the second form. Participants were asked to comment on results from the initial survey, were probed on specific issues by NCEO researchers, and were asked to comment on the 15 considerations suggested by participants (the majority relating to computer-based testing). The second survey was e-mailed at the beginning of August 2004, and participants were again given seven days to return their comments via e-mail. The comments were compiled by the staff at NCEO in mid-August 2004 (see Appendix C).
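The per-consideration summary described here (a range and a mean of the individual ratings) can be sketched in a few lines. This is an illustrative sketch only: the ratings below are hypothetical and are not data from the study, and the names `ratings` and `summarize` are invented for the example.

```python
from statistics import mean

# Hypothetical ratings (NOT data from the report): one list per consideration,
# holding one importance rating per responding participant.
ratings = {
    "Standard typeface": [3, 4, 4, 5, 5],
    "Question to be answered is clearly identifiable": [5, 5, 5, 5, 5],
}


def summarize(scores):
    """Return ((lowest, highest), mean rounded to two decimals) for one consideration."""
    return (min(scores), max(scores)), round(mean(scores), 2)


summary = {name: summarize(scores) for name, scores in ratings.items()}

for name, ((low, high), avg) in summary.items():
    print(f"{name}: range {low} to {high}, mean {avg:.2f}")
```

A wide range with a moderate mean signals disagreement among reviewers, while a narrow range signals consensus, which is how the ratings are read in the discussion of results.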

 

Response Rates

The original list of considerations (Delphi Form 1) was sent out via e-mail to 14 experts for review.  Thirteen of 14 (93%) experts returned Delphi Form 1.  The second survey (Delphi Form 2) was again sent out to the original 14 participants.  The same thirteen participants returned the second survey (one participant did not participate in either survey).  The feedback on both surveys was extensive.
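As a quick check of the arithmetic behind the reported response rate, a minimal sketch (the variable names are illustrative):

```python
# Counts taken from the text: 14 experts were invited and 13 returned each form.
invited = 14
returned = 13

response_rate = returned / invited
print(f"{returned} of {invited} = {response_rate:.0%}")  # rounds to 93%
```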

 

Results

Using the feedback from both Delphi surveys, Universal Design Project staff revised the considerations for universally designed assessments (see Table 3). The considerations that had originally been sent to reviewers were rated as somewhat important to extremely important (from 2.67 to 5), with an average of very important (i.e., 4.3) to consider in designing and reviewing assessments. One consideration was deleted based on expert feedback, while others were added or revised. The primary addition was the expansion of the considerations for computer-based testing. In addition, several discussion points were added to the note sections of the considerations. All changes to the considerations are shown in Table 3, with additions and deletions marked.

Table 3: Summary of Consideration Ratings and Changes

Each consideration is followed by the range and mean of the participants' importance ratings.

Does the item…

Measure what it intends to measure

•   Reflect the intended content standards (reviewers have information about the content being measured). Range 5–5; mean 5.00.

•   Minimize knowledge and skills required beyond what is intended for measurement [revised from “Minimize skills required beyond those being measured”]. Range 3–5; mean 4.33.

Respect the diversity of the assessment population

•   Sensitive to test taker characteristics and experiences (consider age, gender, ethnicity, socio-economic level, region, disability, and language) [revised from “Accessible to test takers (consider age, gender, ethnicity, and socio-economic level)”]. Range 4–5; mean 4.75.

•   Avoid content that might unfairly advantage or disadvantage any student subgroup. Range 4–5; mean 4.64.

Have clear format for text

•   Standard typeface. Range 3–5; mean 4.00.

•   Twelve (12) point minimum size for all print, including captions, footnotes, and graphs (type size appropriate for age group) [“size” added]. Range 3–5; mean 4.09.

•   Wide spacing between letters, words, and lines. Range 2–5; mean 3.09.

•   High contrast between color of text and background. Range 3–5; mean 4.09.

•   Sufficient blank space (leading) between lines of text. Range 2–5; mean 2.82.

•   Staggered right margins (no right justification). Range 2–5; mean 3.36.

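Have clear visuals (when essential to item) [revised from “Have clear pictures and graphics (when essential to item)”]

•   Visuals are needed to answer the question [revised from “Pictures are needed to respond to item”]. Range 3–5; mean 4.56.

•   Visuals with clearly defined features (minimum use of gray scale and shading) [revised from “Pictures with clearly defined features”]. Range 4–5; mean 4.45.

•   Dark lines (minimum use of gray scale and shading). Range 3–5; mean 3.82.

•   Sufficient contrast between colors. Range 1–5; mean 3.64.

•   Color alone is not relied on to convey important information or distinctions [“alone” added]. Range 2–5; mean 3.91.

•   Visuals are labeled [revised from “Pictures and graphs are labeled”]. Range 3–5; mean 3.91.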

Have concise and readable text

•   Commonly used words (except vocabulary being tested) [parenthetical added]. Range 1–5; mean 4.18.

•   Vocabulary appropriate for grade level. Range 4–5; mean 4.83.

•   Minimum use of unnecessary words. Range 1–5; mean 4.17.

•   Idioms avoided unless idiomatic speech is being measured. Range 3–5; mean 4.67.

•   Technical terms and abbreviations avoided (or defined) if not related to the content being measured. Range 4–5; mean 4.73.

•   Sentence complexity is appropriate for grade level. Range 1–5; mean 4.45.

•   Question to be answered is clearly identifiable. Range 5–5; mean 5.00.

Allow changes to its format without changing its meaning or difficulty (including visual or memory load)

•   Allows for the use of braille or other tactile format. Range 3–5; mean 4.67.

•   Allows for signing to a student. Range 3–5; mean 4.55.

•   Allows for the use of oral presentation to a student. Range 3–5; mean 4.36.

•   Allows for the use of assistive technology. Range 3–5; mean 4.45.

•   Allows for translation into another language. Range 1–5; mean 3.64.

Does the test…

Have an overall appearance that is clean and organized

•   All visuals (e.g., images, pictures) and text provide information necessary to respond to the item [revised from “All images, pictures, and text”]. Range 3–5; mean 4.50.

•   Information is organized in a manner consistent with an academic English framework with a left-right, top-bottom flow. Range 4–5; mean 4.33.

•   Booklets/materials can be easily handled with limited motor coordination [added]. Range 0–5; mean 4.00.

•   Response formats are easily matched to question [added; “correlated” changed to “matched”]. Range 0–5; mean 3.43.

•   Place for student to take notes (on the screen for CBT) or extra white space with paper-pencil [added]. Range 0–5; mean 3.82.

In addition to the other considerations, a computer-based test should have these considerations:

Layout and design

•   Sufficient contrast between background and text and graphics for easy readability. Range 4–5; mean 4.67.

•   Color alone is not relied on to convey important information or distinctions [“alone” added]. Range 2–5; mean 3.92.

•   Font size and color scheme can be easily modified (through browser settings, style sheets or on-screen options). Range 2–5; mean 4.08.

•   Stimulus and response options are viewable on one screen when possible. Range 3–5; mean 4.67.

•   Page layout is consistent throughout the test. Range 4–5; mean 4.75.

•   Computer interfaces follow Section 508 guidelines (www.section508.gov) [Web address added]. Range 0–5; mean 3.56.

Navigation

•   Students have received adequate training on use of test delivery system [added]. Range 0–5; mean 4.46.

•   Navigation is clear and intuitive; it makes sense and is easy to figure out. Range 4–5; mean 4.92.

•   Navigation and response selection is possible by mouse click or keyboard. Range 3–5; mean 4.67.

•   Option to return to items and return to place in test after breaks. Range 3–5; mean 4.60.

Screen reader considerations

•   Item is intelligible when read by a text/screen reader. Range 3–5; mean 4.58.

•   Links make sense when read out of visual context (“go to the next question” rather than “click here”). Range 4–5; mean 4.67.

•   Non-text elements have a text equivalent or description. Range 3–5; mean 4.30.

•   Tables are only used to contain data, and make sense when read by screen reader. Range 3–5; mean 4.36.

Test specific options

•   Access to other functions is restricted (e.g., e-mail, Internet, instant messaging). Range 3–5; mean 4.55.

•   Pop up translations and definitions of key words/phrases are available if appropriate to the test. Range 3–5; mean 4.08.

•   Students writing online can get feedback on length of writing on-demand in cases where there is a restriction on number of words [added]. Range 0–5; mean 2.67.

•   Students are able to record their responses and read them back (or have them read back using text-to-speech), but only if student has experiences with this mode of expression and chooses it for the test as an alternative to human scribe [reworded; “as an alternative to a human scribe” moved to the end]. Range 0–5; mean 3.69.

•   Students are allowed to create persistent marks to the extent that they are already allowed to paper-based booklets (e.g., marking items for review, eliminating multiple choice items, etc.) [added]. Range 0–5; mean 4.17.

Computer capabilities

•   Adjustable volume. Range 3–5; mean 4.50.

•   Speech recognition available (to convert user’s speech to text). Range 1–5; mean 3.67.

•   Test is compatible with current screen reader software. Range 3–5; mean 4.25.

•   Computer-based option to mask items or text (e.g., split screen). Range 0–4; mean 3.00.

•   Computer software for test delivery is designed to be amenable to assistive technology. Range 0–5; mean 3.91.

 

 Notes that were added to the considerations address some of the anticipated issues that might arise when using the considerations.  While we tried to keep the list of considerations brief and user-friendly, it was clear from participant comments that more explanation about the intent and issues surrounding the considerations needed to be presented close to the considerations in note form. The notes are not meant to be used as definitive judgment of the “good” or “bad” quality of an item or design feature.  Instead, the notes are intended to add clarity to the considerations, help elucidate important issues, and help generate discussion.

 

Discussions About Selected Considerations

In addition to providing greater clarity to several of the considerations, many of the respondents in the Delphi review pointed out that using some of the considerations depended on the content being tested.  Extensive discussion focused on issues of construct vs. content validity and the minimization of construct-irrelevant variance. There was also extensive discussion on the validity and practicality of the translation of assessments to languages other than English. In this section of the report, we present a detailed review of these discussions. Considerations about which few comments were made and no clarification was deemed necessary are not discussed. Responses to all considerations, however, can be found in Appendix C. 

Consideration: “Reflects the intended content standards (reviewers have information about the content being measured).”
Following a discussion by Universal Design Project staff, Delphi participants were asked to comment on whether the first consideration should remain “Reflects the intended content standards (reviewers have information about the content being measured)” or whether it should be reworded “Reflects the intended construct (reviewers have information about the construct being measured).”  Although opinions leaned toward changing the wording (Yes = 6, No = 3, Combination wording = 1, Did not state position but provided information to consider when making the decision = 2, Don’t know = 1), only two of the participants in favor of using the term “construct” provided reasoning.  One suggested that construct “would fit better with the professional terminology,” while the other stated that “content is topical, constructs are conceptual.  This difference in meaning is huge. Furthermore, construct is a term used in APA standards and is deeper than content.” 

The participants who wished the consideration to remain the same provided critical information about what to think about before a decision could be made. Specifically, one participant suggested that we consider our audience: “Construct is a formal term that theorists use. Content standards [are] what practitioners understand.”  Another participant suggested we consider what the terms imply: “…construct is a sort of overarching concept (i.e., reading) whereas content standards are…narrower (e.g., reproduces capital letters)…If the test is supposed to be a standards-based achievement test, then it must address standards.  If not, then the item need only address the construct.”

Ultimately, Universal Design Project staff decided to retain the term “content.” This term appears to be consistent with the link of items to standards, and avoids the apparent confusion surrounding the term  “construct.” It should be noted, however, that the term “construct” may still be useful, especially if item developers (who are familiar with the concept of constructs) are using these considerations.

Consideration: “Minimize knowledge and skills required beyond what is intended for measurement.”
The second consideration under review was altered slightly based on participant input.  Initially, this consideration stated, “Minimize skills required beyond those being measured.”  This was changed to “Minimizes knowledge and skills required beyond what is intended for measurement” following several suggested alternate phrases.  In addition to suggestions on phrasing, Delphi participants expressed concern that item writers or reviewers might interpret this consideration in such a way as to “…separate skills too much…[and thus run the risk that] we’ll wind up with tests that measure isolated, basic skills.”  Still others expressed the belief that this consideration has direct relevance for the measurement of “higher level thinking.”  Yet, as another reviewer questioned, “how…the other skills (are) defined and targeted” would be important in guiding item writers and reviewers.  One participant summed up the issue by saying that it “…depends on how discrete the standards are; minimal skills can be embedded in more complex contextualized items. Ultimately, it depends on what you are measuring.”          

Consideration: “Sensitive to test taker characteristics and experiences (consider age, gender, ethnicity, socio-economic level, region, disability, and language).”
The third consideration was changed from “Accessible to test takers (consider age, gender, ethnicity, and socio-economic level)” to “Sensitive to test taker characteristics and experiences (consider gender, age, ethnicity, socio-economic level, region, disability, and language).” When asked about including the term “bias” in this consideration, participants were somewhat divided. While some indicated that bias should be included to “reference systematic variance that interferes with making a valid inference,” others clarified that “bias and accessibility are separate issues from a review standpoint, though obviously related.” Keeping participants’ suggestions and reasoning in mind, it was decided that the term “bias” would be included in the note portion of the consideration and that the demographic variables would be expanded from four to seven, reflecting the need for greater sensitivity to the experiences of very diverse populations.

Consideration: “Standard typeface.”
When considering the clarity of the format for text in assessments, most participants agreed that a standard typeface was important. There was, however, confusion about the meaning of “standard.” Some Delphi participants had interpreted this consideration as implying that a single standard font existed, as illustrated in the following comment: “There is no standard typeface, thus the myriad fonts used in various publisher’s files, even within the same text or textbook.” To reduce confusion over the meaning of the term, it was determined that additional clarification was needed. Consequently, the following was added to the note section: “Use clear, common, familiar, and consistent fonts,” followed by examples of fonts.

Consideration: “Twelve (12) point minimum size for all print, including captions, footnotes, and graphs (type size appropriate for age group).”
When considering which font size to select, several Delphi participants noted the importance of considering the font style. Because a 12-point font can vary in size depending upon the font style, this issue was added to the note section.  As suggested, one consideration (width of spacing between letters) was combined with font.  One participant stated, “Wide spacing is not necessarily best; proper font selection is more important.” Consequently, this consideration was added to the note section of the consideration addressing font.

Consideration: “High contrast between color of text and background.”
When considering the use of color in text or background, participants suggested going beyond the issue of contrast to consider print density.  Specifically, one participant stated, “[E]ven with sufficient color contrast, color blind users may not be able to distinguish text and background. [I] suggest you further recommend high print density contrast. This would also avoid isoluminance effects for non-visually-impaired students.”  (“Isoluminance” is the point at which two colors have an equivalent luminous intensity, or brightness.)  Based on these comments, information on print density and isoluminance was added to the note section for the consideration addressing format for text.
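The participant’s point about isoluminance can be illustrated numerically. The following is a minimal sketch, assuming the relative-luminance and contrast-ratio formulas published in the W3C Web Content Accessibility Guidelines (WCAG 2.x); those formulas are not part of the original considerations, and the function names and color values are hypothetical choices made only for illustration.

```python
def _linearize(channel_8bit):
    """Convert one 8-bit sRGB channel (0-255) to linear light, per WCAG 2.x."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) color: 0.0 is black, 1.0 is white."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(rgb_a, rgb_b):
    """Luminance contrast ratio between two colors, from 1.0 up to 21.0."""
    lum_a, lum_b = relative_luminance(rgb_a), relative_luminance(rgb_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


# Illustrative colors (assumed values, not taken from the report).
BLACK, WHITE = (0, 0, 0), (255, 255, 255)
RED, GREEN = (255, 0, 0), (0, 128, 0)

if __name__ == "__main__":
    print("black on white:", round(contrast_ratio(BLACK, WHITE), 2))
    print("red on green:  ", round(contrast_ratio(RED, GREEN), 2))
```

Black text on a white background yields the maximum ratio of 21:1. Two colors that differ clearly in hue, such as the red and green above, can still have a luminance contrast close to 1:1, which is the near-isoluminant situation the participant warned about.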

Consideration: “Visuals are needed to answer the question.”
The use of visuals prompted considerable discussion, ranging from limiting visuals, to using visuals only to provide redundant information, to the benefits and drawbacks of visuals in relation to specific disabilities.  In relation to the content of visuals, for example, it was suggested, “Pictures, line art, etc. should be related to the item [and] should enhance understanding, [but] not [be] required for understanding, with the exception of data tables like on math and science tests.” Additionally, another Delphi participant stated, “often there are pictures used that are not redundant with the text but that are relevant to the item and to the construct.” Consequently, it was suggested that the wording of this consideration take this idea into account.  Rather than dramatically change the wording of this consideration, qualifying information was added to the note portion below the consideration, addressing the idea that clear and well-designed graphics or pictures should add value for students who need a visual cue.

Consideration: “Commonly used words (except vocabulary being tested).”
When considering the vocabulary used in assessments, both for directions and specific items, many Delphi participants commented on the need for greater clarity surrounding the specification that the text be comprised of “commonly used words.”  Several participants suggested that the term “age-appropriate” was preferable, while another suggested adding “concise and readable.”  Ultimately, the greatest concern with this particular consideration was that there be some acknowledgement that the words selected should be common, “with the exception of subject specific terminology…”  In other words, the “item should consist of commonly understood words or vocabulary…” except when knowledge of specific vocabulary is being tested.  One participant also suggested that the vocabulary be “…consistent with each specific grade level,” with another suggesting “at or below grade level [when] reading is not the primary construct tested.” As a result of this feedback, clarification was added both to the wording of the consideration (it was changed from “Commonly used words” to “Commonly used words (except vocabulary being tested)”) and to the note section following the consideration.
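As a rough illustration of how the revised consideration might be applied during item review, the sketch below flags words in an item that are neither on a list of commonly used words nor part of the vocabulary being tested. The function name, the tiny word list, and the sample item are all hypothetical; an actual review would rely on a grade-appropriate word list and on reviewer judgment rather than on an automated check alone.

```python
import re


def flag_uncommon_words(item_text, common_words, tested_vocabulary=()):
    """Return words in item_text that are neither common nor being tested.

    Words are compared case-insensitively and reported once each,
    in the order in which they first appear in the item.
    """
    common = {w.lower() for w in common_words}
    tested = {w.lower() for w in tested_vocabulary}
    flagged = []
    for word in re.findall(r"[A-Za-z']+", item_text.lower()):
        if word not in common and word not in tested and word not in flagged:
            flagged.append(word)
    return flagged


# Hypothetical word list and item, for illustration only.
COMMON = {"what", "is", "the", "of", "a", "with", "side", "length", "three"}
ITEM = "What is the perimeter of a quadrilateral with side length three?"

if __name__ == "__main__":
    # "perimeter" is the vocabulary being tested, so it is not flagged.
    print(flag_uncommon_words(ITEM, COMMON, tested_vocabulary={"perimeter"}))
```

A flagged word is only a prompt for discussion: reviewers would still decide whether the word is subject-specific terminology that belongs in the item or an unnecessary barrier.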

Consideration: “Allows for translation into another language.”
Perhaps the most controversial consideration of all was “Allows for translation into another language.”  One and one-half pages of initial comments, questions, and suggestions were followed by an additional one and one-half pages of responses, comments, questions, and suggestions.  The response of one participant summarized a number of the issues that participants grappled with when determining the appropriateness of this consideration:

“This is a questionable and highly controversial issue, particularly when one realizes that such a standard is impossible to meet.  About 72% of our LEP students are Spanish speakers, but the other 28% represent many diverse languages.  How do we accommodate and what is the theoretical rationale and what is the technology for doing this?  Is it possible?  Is it beneficial?” 

In reference to the impracticality of translating tests into the less commonly represented language groups, some participants questioned the fairness of accommodating some students (e.g., Spanish speakers) and denying others.  Another stated “What harm is done by helping the 72% of LEP students who speak Spanish? We provide accommodations to others where possible, but some would propose that a translated test is harmful.  Poppycock!”

Participants also expressed some disagreement about the quality of translations and the skill of translators.  A primary problem with translation, however, was clear: “The limitation is money.  Translation must be cost effective like everything else in education.  You can’t provide translated tests for very small numbers. The Lau decision (Lau v. Nichols, 1974) and other civil rights decisions make it clear that numbers dictate expectations of school systems.”  Given the cost, customized dictionaries were suggested as a possible alternative to fully translated tests.

Besides the practicality/impracticality of translating tests, one area of considerable debate surrounded the validity of the inferences that can be made from scores derived from translated tests.  Some participants expressed the belief that translated tests reduced the validity of scores (“Data analysis has shown these to be less than valid measures of student performance.”), or that certain translations would result in less valid scores (“Some critical and relevant word/concepts [do] not translate into every language.”).  Others, however, made the argument that there are few instances where concepts do not translate:

“Minnesota translates to Hmong and Somali.  Only in these languages are there relevant words/concepts that do not translate easily into English.  The other languages of state assessment (Spanish, Russian, Chinese, Korean, Haitian Creole) almost never pose a problem for translating words or concepts. Professional translators will tell you they can translate almost any word or idea, and if they encounter one they can’t, they will tell you that too.” 

Another participant added, “Translation is no more a threat to validity than a change in option order or a change in font.  Such changes might generate a miniscule change in item difficulty, but they don’t affect validity... [Translation] is the exact same test stated in a different language.”  Yet others brought up the issue of validity in reference to a specific construct being measured.  For example, two participants stated that translating English language arts (ELA) tests would invalidate the inferences that could be made from the scores.  In light of NCLB legislation, a participant brought up a final important point of consideration:  “A translated test is always much less of a threat to validity and score comparability than an alternate assessment,” suggesting that a translated test is preferable to alternate assessment measures for English language learners.  

Two reviewers suggested that this consideration be eliminated given the controversy, at least until more research was available.  Ultimately, Universal Design Project research staff decided to retain this consideration, acknowledging the issues item writers and reviewers face as they incorporate this consideration into the test construction/revision process.  This information was included in the note section following the consideration.  


Summary of Revisions

At the completion of the study, the Universal Design Project staff revised the original considerations based on Delphi responses (Appendix D).  The most extensive revisions were made to the content and wording of the considerations.  Some of the most significant changes to the considerations that resulted from the Delphi process are described here:

1.     Wording of several of the considerations was revised using feedback from the Delphi review participants.  For example, “Minimize skills required beyond those being measured” was changed to “Minimize knowledge and skills required beyond what is intended for measurement” and “Accessible to test takers (consider age, gender, ethnicity, and socio-economic level)” was changed and expanded to “Sensitive to test taker characteristics and experiences (consider gender, age, ethnicity, socio-economic level, region, disability, and language).”

2.     Computer-based testing considerations were expanded.  Much of the useful feedback for this section came from reviewers who are familiar with the development of computer-based tests.  With these revisions, the section of considerations for computer-based testing was clarified and redundancies with other considerations were eliminated. 

3.     Notes were added to the considerations.  These notes discuss some of the anticipated issues that might arise when using the considerations.  While we tried to keep the list of considerations brief and user-friendly, it was clear that more explanation about the intent and issues surrounding the considerations needed to be presented on the same page. The notes are intended to add clarity to the considerations and help elucidate important issues. Notes also provide evidence of the complexity of some of the considerations and illustrate that considerations are not static rules, but general principles that aid in flagging potentially problematic items.

4.     One font-dependent consideration (“Wide spacing between letters, words, and lines”) was eliminated.  Instead it was included in the note section for “Have a clear format for text.” 

5.     Relevant research citations were added to the considerations so that people wanting to investigate a certain issue in more depth would have the resource citations at hand (see Appendix E).

6.     We created a review checklist of the considerations for item reviewers and developers (see Appendix F). This form is intended to be used by item reviewers and developers who have received training on the considerations.  It consists of a list of the considerations, without the supporting text.  Using this form, item reviewers and developers can go through items and flag for further discussion areas of concern or alteration. For item reviewers, there is an additional form on which comments may be recorded explaining why some aspect of an item was flagged (Appendix G).


Issues Related to Universal Design

One of the most important outcomes of this review process was the identification of issues that surround the development of universally designed assessments.  These issues highlight the complexities of a process without easy answers. The issues discussed in this section are not meant to be an exhaustive list of the challenges related to the universal design of assessments, but instead provide some guidance about the challenges that might be encountered when using the considerations.

1.     Universal design is not a cure all.  Just because a test is universally designed, or has used the elements of universal design to guide its development, does not mean that a test is accessible to all students.  The considerations recommended in this report are just that, considerations.  They are meant to be used to guide test developers and reviewers in creating tests that are accessible to the greatest number of students possible.  However, some changes to a test that might make it more accessible to one group of students, might actually make it less accessible to another group.  For example, eliminating or altering an illustration accompanying an authentic reading text may clarify an item by removing a distraction for some students.  On the other hand, eliminating it may remove or change some useful context for the passage.  Issues of accessibility need to be carefully considered and discussed openly so that informed decisions can be made without hindering the construct being tested.  Universal design can be a useful tool for developing better assessments, but it is not a tool that can magically make all tests accessible to all students. 

2.     Universal design does not replace accommodations.  While universal design may remove some barriers for students with disabilities and English language learners, it in no way eliminates the need for testing accommodations.  Some students may still need accommodations such as large print or assistive technology.  A goal of universally designed assessments is to anticipate common accommodations and design tests that allow accommodations to be more easily integrated into the format of the test.

3.     Universal design does not replace good instruction.  The goal of universal design is to think about the full range of students taking an assessment so that they all can demonstrate what they have learned.  A student who has not had an opportunity to learn the material tested will not be helped by a universally designed test.

4.     Universal design does not lower standards.  Some may perceive a universally designed assessment to be a “watered-down” or “easier” assessment.  It is important to make clear the purpose of universal design is to make sure that the content being tested is more universally accessible to all of the students taking the test and thus a better measure of student learning.

5.     Technology use is challenging.  The quality of technology available across schools is an important issue when creating a computer-based assessment.  It is difficult to anticipate what accessibility issues will arise when a test is administered on a variety of different systems with a variety of assistive technologies.  Trying to anticipate these issues is important, however, when reviewing items.


Recommendations

These considerations can be used to make assessments more universally accessible to the entire population of test takers.  Here are some specific recommendations for the use of the considerations of universal design at all stages of test development.

1.     Incorporate elements of universal design in the early stages of test development.  Universally designed assessments present an opportunity to bring more people to the table in the early stages of test development including experts in disability, language acquisition, and technology.  These experts are able to give more structured input at different stages of the test development process if they understand universal design and have these considerations for item development and review at hand. It is more cost effective to consider universal design in the early stages of item development, rather than at the end when items have already been developed and field-tested. 

2.     Include disability, technology, and language acquisition experts in item reviews.  Every effort should be made to involve experts in item review who can judge whether items meet all of the considerations.   

3.     Provide professional development for item developers and reviewers on use of the considerations for universal design.  Explanation and discussion of each consideration will ensure use by item developers and reviewers.

4.     Present the items being reviewed in the format in which they will appear on the test.  When item reviewers examine items to be included in an assessment, it is important to format items as closely as possible to how they will appear on the test.  Since many of the considerations have to do with format, it is not useful to look at items that are not in the font, size, or format in which they will appear in the actual test booklet. 

5.     Include standards being tested with the items being reviewed.  Above all other considerations, the first consideration—does the item measure what it intends to measure—is of primary importance in constructing universally designed assessments.  Consequently, item review teams using the considerations of universal design to guide their work must have the standard (grade level expectations) that each item is intended to test at hand. It is only by knowing what an item is intended to test that reviewers can judge whether an element of the item might interfere with student access.  Each item needs to be presented with the corresponding standard being tested in that item.

6.     Try out items with students.  Some of the elements of an item that distract or confuse students are not easily recognizable by adults or native English speakers.  For this reason, trying items out with students by conducting think-aloud studies can provide valuable information about whether an item is testing the content intended (Thompson, Johnstone, & Miller, in press). 

7.     Field test items in accommodated formats.  In order to ensure that the content an item is intended to measure is not being changed when an accommodated format of a test is being used, include students using accommodated test formats in field tests.  While this can add expense to the field test, there are ways of doing such studies that can progressively build a database.  For example, a field test could focus on the use of certain accommodated formats one year and others the next, building up a database for the various forms of the test.  Again, qualitative data from student interviews in this area can provide important information that can be used to improve items.

8.     Review computer-based items on computers. To judge whether computer-based items are universally designed, item reviewers need to use the technology that will be used to deliver the test. Using a paper print-out of an assessment does not allow a review team to meaningfully consider the format of the test.


Conclusion

We hope that the process detailed in this report has not only produced a better set of considerations of universally designed assessments for all students, but has also clarified some of the opportunities and challenges that universally designed assessments present.  While using universal design does not guarantee the accessibility of any test to all students, using the considerations to openly discuss issues of test design throughout the test development process can make any assessment more inclusive.  Making the process of test development more transparent, informed, and focused on the needs of the entire population of students will help ensure that the assessment results are more meaningful for the widest range of students.


References

Abedi, J., Hofstetter, C., Baker, E., & Lord, C. (2001). NAEP math performance and test accommodations: interactions with student language background. Los Angeles, CA: National Center for Research on Evaluation, Standards, and Student Testing.

Adler, M., & Ziglio, E. (Eds.). (1996). Gazing into the Oracle: the Delphi method and its application to social policy and public health. London: Jessica Kingsley Publishers.

Anderson, R.C., Hiebert, E.H., Scott, J.A., & Wilkinson, A.G. (1985). Becoming a nation of readers. Urbana, IL: University of Illinois, Center for the Study of Reading, National Institute of Education, National Academy of Education.

Arditi, A. (1999). Making text legible. New York: Lighthouse.

Assistive Technology Act of 2004 (Brief Title: ATA 2004). (P.L.108-364).

Bridgeman, B., Harvey, A., & Braswell, J. (1995). Effects of calculator use on scores on a test of mathematical reasoning. Journal of Educational Measurement, 32, 323–340.

Brown, P.J. (1999). Findings of the 1999 plain language field test. Newark, DE: University of Delaware, Delaware Education Research and Development Center.

Calhoun, M.B., Fuchs, L., & Hamlett, C. (2000). Effects of computer-based test accommodations on mathematics performance assessments for secondary students with learning disabilities. Learning Disability Quarterly, 23, 271–282.

Carter, R., Dey, B., & Meggs, P. (1985). Typographic design: Form and communication. New York: Van Nostrand Reinhold.

Cole, C., Tindal, G., & Glasgow, A. (2000). Final report: Inclusive comprehensive assessment system research, Delaware large scale assessment program. Eugene, OR: Educational Research Associates.

Council for Exceptional Children (1999).  Universal design:  Research connections. Retrieved September 3, 2004, from the World Wide Web: http://ericec.org/osep/recon5/rc5sec1.html

Fuchs, L., Fuchs, D., Eaton, S., Hamlett, C., Binkley, E., & Crouch, R. (2000). Using objective data sources to enhance teacher judgments about test accommodations. Exceptional Children, 67 (1), 67–92.

Gaster, L., & Clark, C. (1995). A guide to providing alternate formats. West Columbia, SC: Center for Rehabilitation Technology Services.  (ERIC Document No. ED 405689)

Gregory, M., & Poulton, E.C. (1970). Even versus uneven right-hand margins and the rate of comprehension in reading. Ergonomics, 13 (4), 427–434.

Grise, P., Beattie, S., & Algozzine, B. (1982). Assessment of minimum competency in fifth grade learning disabled students: Test modifications make a difference. Journal of Educational Research, 76, 35–40.

Haladyna, T.M., Downing, S.M., & Rodriguez, M.C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15(3), 309–334.

Hanson, M.R. (1997). Accessibility in large-scale testing: Identifying barriers to performance. Delaware: Delaware Education Research and Development Center.

Hanson, M.R., Hayes, J.R., Schriver, K., LeMahieu, P.G., & Brown, P.J. (1998). A plain language approach to the revision of test items. Paper presented at the annual meeting of the American Educational Research Association, San Diego, CA, April 16, 1998.

Harker, J.K., & Feldt, L.S. (1993). A comparison of achievement test performance of nondisabled students under silent reading plus listening modes of administration. Applied Measurement in Education, 6, 307–320.

Hartley, J. (1985). Designing instructional text (2nd Edition). London: Kogan Page.

Heines. (1984). An examination of the literature on criterion-referenced and computer-assisted testing. ERIC Document Number 116633.

Hoener, A., Salend, S., & Kay, S.I. (1997). Creating readable handouts, worksheets, overheads, tests, review materials, study guides, and homework assessments through effective typographic design. Teaching Exceptional Children, 29 (3), 32–35.

Individuals with Disabilities Educational Improvement Act (Brief Title: IDEA 2004). (P.L. 108-446).

Johnstone, C.J., Miller, N.A., & Thompson, S.J. (in press). Using the think aloud method (cognitive labs) to evaluate test design. Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

Kopriva, R. (2000). Ensuring accuracy in testing for English language learners. Washington DC: Council of Chief State School Officers.

Koretz, D. (1997). The assessment of students with disabilities in Kentucky (CSE Technical Report No. 431). Los Angeles, CA: Center for Research on Standards and Student Testing.

Lau v. Nichols, 414 U.S. 563, 94 S.Ct. 786 (1974).

MacArthur, C.A., & Graham, S. (1987). Learning disabled students’ composing under three methods of text production: handwriting, word processing, and dictation. Journal of Special Education, 21 (3), 22–42.

Mace, R. (1998). A perspective on universal design. An edited excerpt of a presentation at Designing for the 21st Century: An International Conference on Universal Design. Retrieved January, 2002, from the World Wide Web: www.adaptenv.org/examples/ronmaceplenary98.asp?f=4.

Menlove, M., & Hammond, M. (1998). Meeting the demands of ADA, IDEA, and other disability legislation in the design, development, and delivery of instruction. Journal of Technology and Teacher Education, 6 (1), 75–85.

Muncer, S.J., Gorman, B.S., Gorman, S., & Bibel, D. (1986). Right is wrong: An examination of the effect of right justification on reading. British Journal of Educational Technology, 1 (17), 5–10.

National Research Council. (1999). High stakes: testing for tracking, promotion, and graduation.  In J. Heubert & R. Hauser (Eds.), Committee on Appropriate Test Use. Washington, DC: National Academy Press.

Osborne, H. (2001). In other words…communication across a life span…universal design in print and web-based communication. On Call (January). Retrieved January, 2002, from the World Wide Web: www.healthliteracy.com/oncalljan2001.html.

Popham, W.J. (2001). The truth about testing: An educator’s call to action. Alexandria, VA: Association for Supervision and Curriculum Development.

Popham, W.J., & Lindheim, E. (1980). The practical side of criterion-referenced test development. NCME Measurement in Education, 10 (4), 1–8.

Rakow, S.J. & Gee, T.C. (1987). Test science, not reading. Science Teacher, 54 (2), 28–31.

Schiffman, C.B. (1995). Visually translating materials for ethnic populations. Virginia: ERIC Document Number ED 391485.

Schriver, K.A. (1997). Dynamics in document design. New York: John Wiley & Sons, Inc.

Sharrocks-Taylor, D., & Hargreaves, M. (1999). Making it clear: A review of language issues in testing with special reference to the National Curriculum Mathematics Tests at Key Stage 2. Educational Research, 41 (2), 123–136.

Silver, A.A. (1994). Biology of specific (developmental) learning disabilities. In N.J. Ellsworth, C.N. Hedley, & A.N. Barratta (Eds.), Literacy: A redefinition. New Jersey: Erlbaum Associates.

Smith, J.M., & McCombs, M.E. (1971). Research in brief: The graphics of prose. Visible Language, 5 (4), 365–369.

Szabo, M., & Kanuka, H. (1998). Effects of violating screen design principles of balance, unity, and focus on recall learning, study time, and completion rates. Journal of Educational Multimedia and Hypermedia, 8 (1), 23–42.

Thompson, D.R. (1991).  Reading print media:  The effects of justification and column rule on memory.  Paper presented at the Southwest Symposium, Southwest Education Council for Journalism and Mass Communication, Corpus Christi, TX.  (ERIC Document Number 337 749) 

Thompson, S. J., Johnstone, C. J., & Thurlow, M. L. (2002). Universal design applied to large-scale assessments (Synthesis Report 44). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

Thompson, S.J., Johnstone, C.J., & Miller, N.A. (in press). Universally designed assessments from the end user’s perspective: Using a think aloud method (Policy Directions). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

Thompson, S., & Thurlow, M. (2002). Universally designed assessments: Better tests for everyone! (Policy Directions 14). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes.

Tindal, G., Heath, B., Hollenbeck, K., Almond, P., & Harniss, M. (1998). Accommodating students with disabilities on large-scale tests: An empirical study. Exceptional Children, 64 (4), 439–450.

Tinker, M.A. (1963). Legibility of print. Ames, IA: Iowa State University Press.

Trotter, A. (2001). Testing computerized exams. Education Week, 20 (37), 30–35.

West, T.G. (1997). In the mind’s eye: Visual thinkers, gifted people with dyslexia and other learning difficulties, computer images, and the ironies of creativity. Amherst, NY: Prometheus Books.

Worden, E. (1991). Ergonomics and literacy: More in common than you think. Indiana. (ERIC Document Number 329 901)

Zachrisson, G. (1965). Studies in the legibility of printed text. Stockholm: Almqvist and Wiksell.


Appendix A

Delphi Review of Test Item Considerations (Form 1)

Rating scale for importance:

5=Extremely important to consider; 4=Very important to consider; 3=Important to consider; 2=Somewhat important to consider; 1=Not important to consider.

Scales adapted from Ziglio (1996).

Considerations when reviewing any test item:

Subject Responses

Mean

Please insert
your comments here…

 Does the item…

 

 

 

Measure what it intends to measure

 

 

 

•   Reflects the intended content standards (reviewers have information about the content being measured)

 

 

•   Minimize skills required beyond those being measured

 

 

Respect the diversity of the assessment population

 

 

 

•   Accessible to test takers (consider age, gender, ethnicity, and socio-economic level)

 

 

•   Avoids content that might unfairly advantage or disadvantage any student subgroup

 

 

Have a clear format for text

 

 

 

•   Standard typeface

 

 

 

•   Type size appropriate for age group (12 point minimum for all print, including captions, footnotes, and graphs)

 

 

 

•   Wide spacing between letters, words, and lines

 

 

 

 •   High contrast between color of text and background

 

 

 

 •   Sufficient leading (blank space) between lines of text

 

 

 

 •   Staggered right margins (no right justification)

 

 

 

 Have clear pictures and graphics (when essential to item)

 

 

 

 •   Pictures are needed to respond to item

 

 

 

 •   Pictures with clearly defined features

 

 

 

 •   Dark lines (minimum use of gray scale and shading)

 

 

 

 •   Sufficient contrast between colors

 

 

 

 •   Color is not relied on to convey important information or distinctions

 

 

 

 •   Pictures and graphs are labeled

 

 

 

 Have concise and readable text

 

 

 

 •   Commonly used words

 

 

 

 •   Vocabulary appropriate for grade level

 

 

 

 •   Minimum use of unnecessary words

 

 

 

 •   Idioms avoided unless idiomatic speech is being measured

 

 

 

 •   Technical terms and abbreviations avoided (or defined) if not related to the content being measured

 

 

 

 •   Sentence complexity is appropriate for grade level

 

 

 

 •   Question to be answered is clearly identifiable

 

 

 

Allow changes to its format without changing its meaning or difficulty (including visual or memory load)

 

 

 

 •   Allows for the use of braille or other tactile format

 

 

 

 •   Allows for signing to a student

 

 

 

 •   Allows for the use of oral presentation to a student

 

 

 

 •   Allows for the use of assistive technology

 

 

 

 •   Allows for translation into another language

 

 

 

Does the test...

 

 

 

Have an overall appearance that is clean and organized

 

 

 

 •   All images, pictures, and text provide information necessary to respond to the item

 

 

 

 •   Information is organized in a manner that is consistent with an academic English framework with a left-right, top-bottom flow

 

 

 

In addition to the other considerations, a computer-based test should have these considerations:

Layout and design

 

 

 

 •   Sufficient contrast between background and text and graphics for easy readability

 

 

 

 •   Color is not relied on to convey important information or distinctions

 

 

 

 •   Font size and color scheme can be easily modified (through browser settings, style sheets, or on-screen options)

 

 

 

 •   Stimulus and response options are viewable on one screen when possible

 

 

 

 •   Page layout is consistent throughout the test

 

 

 

Navigation

 

 

 

 •   Navigation is clear and intuitive; it makes sense and is easy to figure out

 

 

 

 •   Navigation and response selection is possible by mouse click or keyboard

 

 

 

 •   Option to return to items and return to place in test after breaks

 

 

 

Screen reader considerations

 

 

 

 •   Item is intelligible when read by a text/screen reader

 

 

 

 •   Links make sense when read out of visual context (“go to the next question” rather than “click here”)

 

 

 

 •   Non-text elements have a text equivalent or description

 

 

 

 •   Tables are only used to contain data and make sense when read by screen reader

 

 

 

Test specific options

 

 

 

 •   Access to other functions is restricted (e.g., e-mail, Internet, instant messaging)

 

 

 

 •   Pop up translations and definitions of key words/phrases are available if appropriate to the test

 

 

 

Computer capabilities

 

 

 

 •   Adjustable volume

 

 

 

 •   Speech recognition available (to convert user’s speech to text)

 

 

 

 •   Test is compatible with current screen reader software

 

 

 

Items on this form are based on information presented in Thompson, Johnstone, & Thurlow (2002, Universal Design Applied to Large Scale Assessments, Synthesis Report 44); Thompson & Thurlow (2002, Universally Designed Assessments: Better Tests for Everyone!, Policy Directions 14); and Kopriva (2002, Ensuring Accuracy in Testing for English Language Learners, CCSSO SCASS-LEP Consortium), as well as from NCEO staff brainstorming, input received from participants in the Universal Design Pre-conference Clinic at the CCSSO Large Scale Assessment and Accountability Conference in San Antonio, Texas, June 2003, and input from a joint project/Delphi review with the Minnesota, Nevada, and South Carolina Departments of Education.
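The form above asks for "sufficient contrast between background and text and graphics" but does not say how a reviewer might quantify it. One possible way, offered here only as an illustrative sketch, is the luminance contrast ratio defined in the W3C Web Content Accessibility Guidelines (WCAG). WCAG is not cited in this report, so its use here is an assumption, as are the function names and the 4.5:1 threshold (WCAG's level AA figure for normal-size text, not a value recommended by the panel).

```python
# Illustrative sketch only: WCAG-style luminance contrast between two sRGB colors.
# The report does not prescribe this formula; it is one common way to quantify
# "sufficient contrast" for computer-based tests.

def _linearize(channel_8bit):
    """Convert one 0-255 sRGB channel to a linear-light value in [0, 1]."""
    c = channel_8bit / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) color, each channel 0-255."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(foreground, background):
    """Contrast ratio from 1.0 (identical colors) up to 21.0 (black on white)."""
    l1 = relative_luminance(foreground)
    l2 = relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)

def has_sufficient_contrast(foreground, background, minimum=4.5):
    """Hypothetical helper: compare against a reviewer-chosen minimum ratio."""
    return contrast_ratio(foreground, background) >= minimum

if __name__ == "__main__":
    print(has_sufficient_contrast((0, 0, 0), (255, 255, 255)))    # black text on white
    print(has_sufficient_contrast((255, 0, 0), (0, 255, 0)))      # red text on green
```

A check like this addresses luminance contrast only. It does not replace the separate consideration that color should not be relied on to convey important information.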


Appendix B

Delphi Review of Test Item Considerations (Form 2)

Rating scale for importance:

5=Extremely important to consider; 4=Very important to consider; 3=Important to consider; 2=Somewhat important to consider; 1=Not important to consider.

Scales adapted from Ziglio (1996). 
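In the form below, each Subject Responses entry is a string of digits, one rating per panelist on the 1 to 5 scale above, and the Mean column is the average of those ratings rounded to two decimal places. The following minimal sketch shows how a reader could recompute a mean to spot-check a row. The function name is hypothetical and is not part of the original study materials.

```python
# Illustrative sketch: recompute the Mean column from a Subject Responses string.
# Assumes each character is one panelist's rating on the 1-5 importance scale.

def mean_rating(responses):
    """Return the mean of a digit string such as '334444555555', rounded to 2 places."""
    if not responses:
        raise ValueError("no ratings supplied")
    ratings = [int(ch) for ch in responses]
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    return round(sum(ratings) / len(ratings), 2)

if __name__ == "__main__":
    # Rows taken from the form: 12 panelists and 11 panelists respectively.
    print(mean_rating("334444555555"))
    print(mean_rating("22233333445"))
```

Because the number of digits varies from row to row (not every panelist rated every consideration), the divisor is taken from the length of the string rather than fixed at the panel size.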

Considerations when reviewing any test item:

Subject Responses

Mean

Please insert your comments here…

Does the item…

 

 

 

Measure what it intends to measure

 

 

1-Ultimately this (is the) most important. Some of this can be accomplished within the concept of universal design…there may be content, knowledge, skills, and/or abilities that may not lend themselves well to UD in some instances.

•   Reflects the intended content standards (reviewers have information about the content being measured)

55555555555

 

5

1-Agree

Should this be changed to read: “Reflects the intended construct (reviewers have information about the construct being measured)”?

N/A

N/A

 

•   Minimize skills required beyond those being measured

 

334444555555

4.33

1-While I think you could argue for a “5” for the 2nd statement here, I worry that if we try to separate skills too much, we’ll wind up with tests that measure isolated, basic skills.

2- Both cognitive demand and specific content should be included in the review.  Content to content match is insufficient (see Conserving Math Construct (CMC) template).

3-I would rather it said “Minimizes skills required beyond those explicitly in the standard.”

4-Very good observation.  Relevant to higher level thinking.

5-Multiple response options should be available, perhaps as a pretest assessment.

6-Important but cannot override ability to measure all content areas to be assessed (e.g., draw a graph of the results).

7-Crucial for test validity for deaf test takers.

8-This may be difficult with disabled students—but try.

9-Perhaps should be reworded as “Minimize skills required beyond those intended for measurement.” Assure measurement of the intended construct. Assumptions about level of achievement possible/abilities (language, sensory, motor, background knowledge, etc.) can interfere with the ability to accurately measure the intended construct.

Based on these comments, would you change the wording of this consideration? If so, what should it say?

N/A

N/A

 

Respect the diversity of the assessment population

 

 

 

•   Accessible to test takers (consider age, gender, ethnicity, and socio-economic level)

444555555555

4.75

 

Should we add issues such as: Regional differences?  Students with disabilities? Language minority students?  Others?

Note: This section relates to what bias review committees typically address. Does the word “bias” need to be included?

N/A

N/A

 

•   Avoids content that might unfairly advantage or disadvantage any student subgroup

 

44445555555

4.64

 

Have a clear format for text

 

 

 

•   Standard typeface

33334445555

4

1-What is meant by “standard?” If you mean the same throughout a form, then 1.  If you mean an acceptable typeface for clarity, then 4.

2-“Standard typeface” does not communicate in a world where everyone has access to hundreds of fonts—specify fonts and criteria re: serifs.

3-To do otherwise would introduce construct irrelevant variance (CIV).

4-Tests may be different between but should be the same within.

5-This should be “selection of typeface,” wherein the best proven typefaces are used.

6-General comment for all of these points: no one set of values will work for all students. All we can hope for is a reasonable compromise unless we embed flexibility, as per universal design principles. Perhaps explicitly state here that this consideration addresses print materials exclusively.

7-Serif recommended for print; sans-serif recommended for computer displays. For large-print booklets, consider specific fonts (see APH guidelines: http://sun1.aph.org/edresearch/lpguide.htm)

Instead of the word “standard” should we use the word “common,” “familiar,” or “clear?”  Please comment.

Would you change the wording entirely for the above consideration?  If so, how?

N/A

N/A

 

•   Twelve (12) point minimum for all print, including captions, footnotes, and graphs (type size appropriate for age group)

33444444555

 

4.09

 

Additional comments on print size?

N/A

N/A

 

•   Wide spacing between letters, words, and lines

22233333445

3.09

1-Dependent on age.

2-To do otherwise would introduce construct-irrelevant variance (CIV).

3-Wide spacing is not necessarily best; proper font selection is more important.

4-Need to define precisely. Also consider that wide spacing may have deleterious effect on non-visually impaired students.

Should we retain the above consideration?

N/A

N/A

 

•   High contrast between color of text and background

33344445555

4.09

1-Also is one needed on color intensity?

2-And selection of paper/ink color.

3-Even with sufficient color contrast, color blind users may not be able to distinguish text and background. Suggest you further recommend high print density contrast. This would also avoid isoluminance effects for non-visually-impaired students.

Additional comments on color and background?

 

 

 

•   Sufficient leading (blank space) between lines of text

23334444455

2.82

1-After having just had a conversation on this, I’ll up my rating to 3.

2-“Standard typographic leading (blank spaces) between lines of text.”

•   Staggered right margins (no right justification)

22333334455

3.36

 

Have clear pictures and graphics (when essential to item)

 

 

 

 

 

•   Pictures are needed to respond to item

335555555

4.56

1-Does this mean how important is it for there to be picture-based items?  In some speaking tests, pictorial prompts are often the best way of getting students to produce language (without giving them directions and thus affecting their responses).

2-Deaf examinees tend to focus on pictures when provided, more than on text.  A picture should be related to the correct response, not be a distraction to “trick” certain students (more than others).

3-Pictures, line art, etc. should be related to the item and should enhance understanding, not be required for understanding, with the exception of data tables like those in math and science tests.

4-Perhaps reword as “only construct-relevant visuals.”

Should we change consideration to read: “When pictures are used, they should be redundant with the text as much as possible.”?

N/A

N/A

 

•   Pictures with clearly defined features

44444455555

4.45

1-As we noted previously, this seems like catering to the lowest common denominator. Might there be instances where grayscale and shading are most appropriate for providing construct-relevant information to the student? If so, then alternate representations would need to be considered for visually impaired students.

•   Dark lines (minimum use of gray scale and shading)

33334444455

3.82

1-“Pictures with clearly defined features” covers this point. This now seems redundant and could be removed.

•   Sufficient contrast between colors

13333444555

3.64

1-Again, color may be a different and additional issue for computer based assessments.

2-“Color alone is not relied on to convey important information or distinctions.”

•   Color is not relied on to convey important information or distinctions

23334445555

3.91

1-This is perhaps too vague to be useful as a guideline.

•   Pictures and graphs are labeled

33334444555

3.91

 

Have concise and readable text

 

 

 

•   Commonly used words

12445555555

4.18

1-Unless it is a vocabulary test.

2-Commonly used words are not necessarily easy…they can carry multiple meanings.

3-What does this mean in a “content” assessment? 

4-Again phrasing is inconsistent.  This is not appropriate for vocabulary/language assessment.

5-If the construct is about unique words, then common words might make sense.

6-Important, but highly subjective.

7-That is, use common meaning, not low-frequency meaning for common words.

8-This can be collapsed into the second item: age level vocabulary.

9-“Commonly used words” seems dependent on the construct being measured, e.g., vocabulary instruction. Can this point be combined with the next one?

Should we re-word the above consideration?  If so, what should it say?

N/A

N/A

 

•   Vocabulary appropriate for grade level

445555555555

4.83

1-I think both this and the next bullet could easily be combined with the first.

2-Depends on what you are measuring…

•   Minimum use of unnecessary words

124445555555

4.17

 

•   Idioms avoided unless idiomatic speech is being measured

344555555555

4.67

 

•   Technical terms and abbreviations avoided (or defined) if not related to the constructs being measured

44455555555

4.73

 

•   Sentence complexity is appropriate for grade level

14455555555

4.45

 

•   Question to be answered is clearly identifiable

 

555555555555

5

1-Refer to the work and writing of Jamal Abedi at UCLA regarding language simplification as it affects testing of LEP students.  Very critical work and he has many concrete suggestions about item writing and presentation that are consistent with our AERA, APA, NCME Standards for Educational and Psychological Testing.

Allow changes to its format without changing its meaning or difficulty (including visual or memory load)

 

 

 

 

1-The change in meaning is going to be speculative until the research is done, and maybe not even then.

•   Allows for the use of braille or other tactile format

34455555555

4.67

1-Not all items can be brailled or made tactile in a meaningful way. Would this mean that such items cannot be used for any students? What’s wrong with having equivalent items that can be so modified?

•   Allows for signing to a student

33455555555

4.55

 

•   Allows for the use of oral presentation to a student

33345555555

4.36

 

•   Allows for the use of assistive technology

33445555555

4.45

 

•   Allows for translation into another language

 

12333355555

3.64

1-Validity may be compromised as bias is interjected.

2-Some critical and relevant words/concepts do not translate into every language.  I think we might have to narrow to a few core languages.

3-This is a questionable and highly controversial issue, particularly when one realizes that such a standard is impossible to meet.  About 72% of our LEP students are Spanish speakers, but the other 28% represent many diverse languages.  How do we accommodate and what is the theoretical rationale and what is the technology for doing this?  Is it possible?  Is it beneficial?

4-Most likely invalidates the test scores for high stakes decision, should probably assess in an alternate fashion that more validly assesses students’ ability, skill, knowledge, or learning.

5-A sticky wicket…

6-What about ELA tests? Also, what if the student prefers seeing the question in both their native language and the language of instruction? Should we deny them this opportunity?

7-There is the continual challenge of literal interpretation in translation, and limitations in expertise to translate to all needed languages. So, do students who speak Spanish receive interpreted tests, and those speaking a less well known language do not?

8-This issue will be one of the most difficult to overcome on a political and emotional level. Data analysis has shown these to be less than valid measures of student performance. Additionally, it has been suggested that it takes approximately 5 years for a student to become proficient in English.

Additional comments on this consideration?

N/A

N/A

 

Does the test...

 

 

 

Have an overall appearance that is clean and organized?

 

 

 

•   All images, pictures, and text provide information necessary to respond to the item

344445555555

 

4.5

1-Redundant with “pictures are needed to respond to item” and “minimum use of unnecessary words.”

2-I still rate this as a 5!

•   Information is organized in a manner that is consistent with an academic English framework with a left-right, top-bottom flow

444444445555

4.33

 

In addition to the other considerations, a computer-based test should have these considerations:

Layout and design

 

 

 

•   Sufficient contrast between background and text and graphics for easy readability

444455555555

4.67

1-“Sufficient luminance contrast between…”

•   Color is not relied on to convey important information or distinctions

233334455555

3.92

1-“Color alone is not relied…”

•   Font size and color scheme can be easily modified (through browser settings, style sheets, or on-screen options)

233344555555

4.08

 

•   Stimulus and response options are viewable on one screen when possible

344555555555

4.67

 

•   Page layout is consistent throughout the test

444555555555

4.75

 

Navigation

 

 

 

•   Navigation is clear and intuitive; it makes sense and is easy to figure out

455555555555

 

4.92

 

 

•   Navigation and response selection is possible by mouse click or keyboard

344555555555

4.67

1-“Navigation and response selection is possible by mouse (or equivalent) or keyboard (or equivalent).”

•   Option to return to items and return to place in test after breaks

3445555555

4.6

1-As before, returning to items will only work in non-adaptive tests.

Screen reader considerations

 

 

 

•   Item is intelligible when read by a text/screen reader

344455555555

 

4.58

 

•   Links make sense when read out of visual context (“go to the next question” rather than “click here”)

444455555555

4.67

 

•   Non-text elements have a text equivalent or description

3334555555

4.3

 

•   Tables are only used to contain data, and make sense when read by screen reader

34444455555

 

4.36

 

 

Test specific options

 

 

 

•   Access to other functions is restricted (e.g., e-mail, Internet, instant messaging)

33455555555

4.55

1-“Access to functions other than assistive technologies and supports is restricted (e.g., e-mail, Internet, instant messaging).”

 

•   Pop up translations and definitions of key words/phrases are available if appropriate to the test

333444445555

4.08

 

Computer capabilities

 

 

 

•   Adjustable volume

334455555555

4.5

1-“Adjustable volume and rate of voice.”

•   Speech recognition available (to convert user’s speech to text)

112334555555

3.67

 

•   Test is compatible with current screen reader software

334444455555

4.25

 

Below are new considerations suggested by panelists on the first survey. Please rank each consideration and comment if you wish (numeric rankings found below were provided by participants in the last survey).

 

 

 

Computer interfaces follow Section 508 guidelines (www.section508.gov)

 

 

 

Students have received adequate training on use of test delivery system

 

 

 

Students writing online can get feedback on length of writing on-demand in cases where there is a restriction on number of words

 

 

 

Students are able to record their responses and read them back (or have them read back using text-to-speech) as an alternative to a human scribe, but only if the student has experience with this mode of expression and chooses it for the test

 

 

 

Students are allowed to create persistent marks to the extent that they are already allowed to on paper-based booklets (e.g., marking items for review; eliminating multiple choice items, etc.)

 

 

 

An alternate version of the computer interface is provided that is amenable to use with screen readers (e.g., JAWS, Window-Eyes)

 

 

 

When “modifying” items for use under a UD framework, there is convincing evidence that the item still measures the same or similar intended construct

4

 

 

 

 

Obviously, since I wrote these, I think they are all important.  I didn’t rate them as “5” because I think they are very hard to do, but I also think we need to be working in this direction.

A test constructed under a UD framework allows for the measurement of the same depth of knowledge levels as the “original” test

4

 

 

 

 

 

A UD test is aligned to the standards to the same extent as the “original” test

4

 

 

Test items are piloted, field tested, and normed on all subgroups for which the measure is designed

5

 

 

Booklets/materials can be easily handled with limited motor coordination

 

 

 

Response formats are easily correlated to question

 

 

 

Computer-based option to mask items or text (e.g., split screen)

4

 

 

Place for student to take notes (on the screen for CBT) or extra white space with paper-pencil

4

 

 

Computer software for test delivery is designed to be amenable to all assistive technology

5

 

 

Items on this form are based on information presented in Thompson, Johnstone, & Thurlow (2002, Universal Design Applied to Large Scale Assessments, Synthesis Report 44); Thompson & Thurlow (2002, Universally Designed Assessments: Better Tests for Everyone!, Policy Directions 14); and Kopriva (2002, Ensuring Accuracy in Testing for English Language Learners, CCSSO SCASS-LEP Consortium), as well as from NCEO staff brainstorming, input received from participants in the Universal Design Pre-conference Clinic at the CCSSO Large Scale Assessment and Accountability Conference in San Antonio, Texas, June 2003, and input from a joint project/Delphi review with the Minnesota, Nevada, and South Carolina Departments of Education.
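Two of the screen reader considerations in the form above (links that make sense out of visual context, and text equivalents for non-text elements) lend themselves to a rough automated pre-check when a computer-based test is delivered as HTML. The sketch below uses only Python's standard `html.parser` module. The class name, the `audit` function, and the list of vague link phrases are hypothetical choices made for illustration, and a check like this would supplement, not replace, review with actual screen reader software.

```python
# Illustrative sketch: flag vague link text and images lacking a text equivalent.
# Assumes test items are delivered as HTML; all names here are hypothetical.
from html.parser import HTMLParser

# Example phrases that carry no meaning when read out of visual context.
VAGUE_LINK_TEXT = {"click here", "here", "more", "link"}

class ScreenReaderAudit(HTMLParser):
    def __init__(self):
        super().__init__()
        self.problems = []
        self._in_link = False
        self._link_text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img":
            # An empty or missing alt attribute is flagged for a human to confirm.
            if not (attributes.get("alt") or "").strip():
                source = attributes.get("src") or "?"
                self.problems.append("image without text equivalent: " + source)
        elif tag == "a":
            self._in_link = True
            self._link_text = []

    def handle_data(self, data):
        if self._in_link:
            self._link_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            # Collapse whitespace and ignore case before comparing.
            text = " ".join("".join(self._link_text).split()).lower()
            if not text or text in VAGUE_LINK_TEXT:
                self.problems.append("link text unclear out of context: " + repr(text))
            self._in_link = False

def audit(html):
    """Return a list of human-readable problem descriptions for one HTML item."""
    parser = ScreenReaderAudit()
    parser.feed(html)
    parser.close()
    return parser.problems

if __name__ == "__main__":
    item = ('<p><a href="q2.html">click here</a> '
            '<a href="q3.html">Go to the next question</a> '
            '<img src="graph.png"></p>')
    for problem in audit(item):
        print(problem)
```

The output is a list for a human reviewer to act on. Whether a flagged image is decorative or construct-relevant remains a judgment for the item review committee.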


Appendix C

Original Considerations Plus All Expert Commentary

Delphi Review of Test Item Considerations

Rating scale for importance:

5=Extremely important to consider; 4=Very important to consider; 3=Important to consider; 2=Somewhat important to consider; 1=Not important to consider.

Scales adapted from Ziglio (1996).

Considerations when reviewing any test item:

Subject Responses

Mean

Please insert your comments here…

Does the item…

 

 

 

Measure what it intends to measure

 

 

 

1-Ultimately this (is the) most important. Some of this can be accomplished within the concept of universal design…there may be content, knowledge, skills, and/or abilities that may not lend themselves well to UD in some instances.

•   Reflects the intended content standards (reviewers have information about the content being measured)

555555555555

 

5

1-Agree

Should this be changed to read: “Reflects the intended construct (reviewers have information about the construct being measured)”?

N/A

N/A

1-Yes.

2-Either way.

3-How about combining the 2 ideas? Reflects the intended construct that is aligned with representative (language proficiency or academic content) standards?

4-Construct is a formal term that theorists use. Content standards is what practitioners understand.

5-Yes.

6-Yes, Definitely.

7-No—but you might add something about the cognitive demand

8-This revision would alter the meaning. I think the construct is a sort of overarching concept (i.e., reading) whereas content standards are quite narrower (i.e., reproduces capital letters). It depends what you mean as to which one you should use.  If the test is supposed to be a standards-based achievement test, then it must address standards.  If no, then the item need only address the construct.

9-Probably content is the correct phrase…construct is typically Reading or Math, etc. The content is what reviewers typically know and evaluate.

10-No. I like better as it was.

11-Yes. That would fit better with the professional terminology.

12-Yes.

13-Yes, I believe it should be changed.  Content is topical, constructs are conceptual. This difference in meaning is huge. Furthermore, constructs is a term used in APA standards and is deeper than content.

•   Minimize skills required beyond those being measured

 

334444555555

4.33

1-While I think you could argue for a “5” for the 2nd statement here, I worry that if we try to separate skills too much, we’ll wind up with tests that measure isolated, basic skills.

2-Both cognitive demand and specific content should be included in the review.  Content to content match is insufficient (see Conserving Math Construct [CMC] template).

3-I would rather it said “minimizes skills required beyond those explicitly in the standard.”

4-Very good observation.  Relevant to higher level thinking.

5-Multiple response options should be available, perhaps as a pretest assessment.

6-Important but cannot override ability to measure all content areas to be assessed (e.g., draw a graph of the results).

7-Crucial for test validity for deaf test takers.

8-This may be difficult with disabled students—but try.

9-Perhaps should be reworded as “Minimize skills required beyond those intended for measurement.” Assure measurement of the intended construct. Assumptions about level of achievement possible/abilities (language, sensory, motor, background knowledge, etc.) can interfere with the ability to accurately measure the intended construct.

R1-How might the other skills be defined and targeted?  It seems if the first part is clear and clearly defin