Testing and Evaluation

Evaluation (ee-val-yoo-AY-shun) is the process of examining a problem or condition so that it can be understood and diagnosed. Testing is one of the ways to evaluate possible behavioral and mental health problems. Tests also can be used to measure normal abilities, including intelligence, personality, certain brain functions, learning capabilities, and school progress.

Neal's Story

What Are Standardized Tests?

A standardized test is a test that is given under the same conditions to everyone who takes it. The questions on the test, the instructions, the time allowed for taking the test, and the rules for scoring it are the same every time the test is given and for every person who takes it. For example, students in classrooms across the country may take the same standardized test to measure school progress. The same test booklets, answer sheets, and instructions are used at each school.

Standardized tests make it possible to compare the scores of a large group of people. For example, the math scores of all sixth-grade students in the United States can be compared by using a standardized test. It would not be possible to make such comparisons with the tests teachers make separately for their own classes because those tests most likely would differ in ease or difficulty, or might include different material. It would not be possible to make fair comparisons among students in different classes by using nonstandardized tests.

What Does a Score on a Standardized Test Mean?

The results of a standardized academic (schoolwork) test can show how well a student scored in certain subjects, such as reading comprehension (com-pree-HEN-shun) or solving mathematical problems, compared with other students in the same grade throughout the country. Scores are usually given as percentiles in this type of test. For example, a student may score in the 86th percentile in reading comprehension, which means that the student can read and understand the readings as well as or better than 86 percent of all the students in the same grade who were tested.
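The percentile described above can be worked out with simple arithmetic. The sketch below is only an illustration: the function name `percentile_rank` and the ten scores are made up, and it uses the "as well as or better than" definition given in this section. Test publishers may use slightly different formulas.

```python
def percentile_rank(score, all_scores):
    """Return the percent of scores in all_scores that are at or below score."""
    if not all_scores:
        raise ValueError("all_scores must contain at least one score")
    at_or_below = sum(1 for s in all_scores if s <= score)
    return 100.0 * at_or_below / len(all_scores)


# Made-up raw reading scores for ten students in the same grade (an assumption).
same_grade_scores = [12, 15, 18, 20, 21, 23, 25, 27, 30, 34]

# The student who scored 27 did as well as or better than 8 of the 10 students.
print(percentile_rank(27, same_grade_scores))  # 80.0
```

With a real test, the comparison group would be the much larger group of same-grade students who were tested, not ten students.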

What Do Standardized Tests Measure?

There are many types of standardized tests, and different tests measure different factors. There are standardized tests that can measure students’ academic progress, intelligence, memory, and behavioral capabilities. Some standardized tests are given to a whole group of people at once, whereas others are given individually. Group tests, particularly tests that measure school progress, are generally given in a classroom. Scores show how well a student is doing in academic subjects compared with all other students in the same grade. A typical standardized test to measure academic progress consists of a test booklet with multiple-choice questions and a separate answer sheet on which the student fills in a circle to mark the chosen answer. Some standardized tests with true/false, multiple-choice, or short-answer questions can be taken on a computer. Computers, however, are generally not used for tests that require students to write short essays.

Group standardized tests can measure academic progress at every level. Colleges and universities often use these tests to decide whether to accept an individual as a student. For example, colleges and universities often require applicants to take one of two standardized tests, either the SAT or the ACT; graduate schools may require the Graduate Record Examinations (GRE); law schools commonly require the Law School Admission Test (LSAT); and medical schools usually require the Medical College Admission Test (MCAT). Scores from these tests allow a college or university to compare the abilities of students who are applying and decide which students to accept. These tests measure how much a student has learned in school and how well he or she can solve problems, as well as other learned skills or natural aptitudes that may predict that a given applicant will be a good student. Tests are just one measure of someone's capabilities, and they are generally just one of several factors used in evaluating an applicant for a college or university. Many colleges maintain, in fact, that a prospective student's high school grade point average (GPA) is a better predictor of success in the freshman year of college than the SAT score.

What Are Psychological Tests?

Some tests are given only by psychologists (sy-KAH-lo-jists), and they are called psychological (SIGH-ko-LAH-ji-kal) tests. Among the most common psychological tests are those that measure intelligence. Intelligence tests are examples of standardized psychological tests. Some other psychological tests are not standardized, but they can still provide important information about a person's personality, feelings, ideas, and concerns and can help evaluate and diagnose problems they may have. Most psychological tests are given individually and involve a face-to-face meeting with the psychologist during testing.

One commonly used psychological test to measure intelligence (IQ) in childhood is the Wechsler Intelligence Scale for Children (ages 6–16). There is also the Wechsler Adult Intelligence Scale, which can be given to anyone over 16 years of age. Intelligence tests can also help evaluate a person for possible learning disabilities, attention problems, and intellectual disability. These tests can accurately measure a person's intelligence under most circumstances; however, some factors may prevent individuals from scoring their best, such as not feeling well or being extremely nervous about taking the test. The psychologist takes these possibilities into account and decides whether the test score recorded on that day should be considered an accurate reading of the person's true capabilities.

Paula's and Kim's Stories: Testing and Classroom Placement

Paula's best friend, Kim, took the same tests, as well as some others, with Dr. James, but for a different reason. Kim had been having trouble with her schoolwork and was finding it hard to remember what she read. In Kim's case, the tests helped Dr. James diagnose a learning disability. The tests showed that although Kim was quite intelligent, her learning disability was preventing her from doing her best work. Kim started to go to a learning support class and knew it was helping when she got a B+ on her reading test.

What Are Personality Tests?

Certain psychological tests assess personality rather than intelligence. Some personality tests are standardized, whereas others are not. An example of a standardized personality test is the Myers-Briggs Type Indicator (MBTI), which can measure a person's usual personality style. Although this test is designed for adults, it can be used for teens, and there are variations designed for younger children. Another standardized personality test is the Minnesota Multiphasic Personality Inventory (MMPI), which helps identify personality disorders; the adolescent version (MMPI-A) is designed for teens, and other versions are used with adults.

Projective tests are a different type of test that also gives information about someone's personality. Projective tests are not standardized, but psychologists follow certain guidelines for scoring and interpreting them. Projective tests usually include pictures that could have many possible meanings. People are asked to describe what they see in the picture or to tell a story about it. Examples are the Thematic Apperception Test (TAT) for older teens and adults and the Children's Apperception Test (CAT) for younger children. The Rorschach test is a projective test in which individuals are shown a series of inkblot designs on cards and asked what they see in each inkblot. These tests are called projective tests because people project their own imagination, ideas, and personality onto the inkblots or pictures.

What Are Neuropsychological Tests?

Other Tests

Adaptive behavior tests can measure people's capabilities to care for themselves and carry out other types of behavior important for daily living such as counting money, shopping, and taking public transportation. They also can assess various job skills. Adaptive behavior tests are often used to evaluate the strengths, capabilities, and needs of individuals who have a developmental disability.

Vocational tests can be used to assess people's interests, skills, and aptitudes for particular jobs. There are also many kinds of tests that allow people to choose words or phrases that best describe themselves. Such self-report tests include checklists about behavior, feelings, or problems. These checklists can help identify important issues and start a discussion with a mental health professional who may be evaluating individuals’ needs and how best to help them. For example, a self-report measure to examine possible attention deficit hyperactivity disorder might include symptoms of hyperactivity, impulsivity, and poor concentration. Scores are compared with how others of the same gender and age group rate themselves, which gives an indication of how significant the symptom pattern might be.
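One simple way to compare a checklist score with a gender and age group is to express it as a distance from that group's average. The sketch below is hypothetical: the group labels, the checklist totals, and the function name `standard_score` are all invented for illustration, and published checklists define their own scoring rules.

```python
from statistics import mean, pstdev


def standard_score(raw_score, norm_scores):
    """Return how many standard deviations raw_score lies above the norm group's mean."""
    spread = pstdev(norm_scores)
    if spread == 0:
        raise ValueError("norm group scores must vary")
    return (raw_score - mean(norm_scores)) / spread


# Invented checklist totals for two comparison groups, keyed by (gender, age band).
norm_groups = {
    ("girls", "12-14"): [2, 4, 4, 4, 5, 5, 7, 9],
    ("boys", "12-14"): [3, 5, 5, 5, 6, 6, 8, 10],
}

# The same raw total of 9 stands out more in one group than in the other.
for group, scores in norm_groups.items():
    print(group, round(standard_score(9, scores), 2))
```

A larger value means the person reported more symptoms than is typical for that group. It is a starting point for discussion, not a diagnosis.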

Evaluation Interviews

Tests are not the only means of finding out about a person. In fact, the most commonly used method of evaluation by psychologists and other mental health professionals is the interview. Interviewing, which consists of questions and answers and in-depth discussion, is an important and effective way to evaluate a person's emotional and behavioral condition. Mental health professionals are trained to use interviews to understand the many aspects of someone's situation and to begin to diagnose possible problems.

How Can Evaluation and Testing Help?

What Are the Limitations of Evaluation and Testing?

The limitations of evaluation and testing begin with the purpose for which each method was designed. A method that is quite effective for one analysis may be highly inaccurate in evaluating a related concern. For example, the Stanford-Binet intelligence test is accurate for identifying how a person's intellectual ability compares with the average. However, the Stanford-Binet cannot identify the presence of a learning disability in someone who scores below average. The validity of a test is compromised if the test is used for a purpose other than the one for which it was designed.

Most tests are also not designed to be administered repeatedly within a short period of time. If a child is sick on the day of the first administration, a second administration can be scheduled shortly after the first. However, if someone takes the test over and over trying to improve his or her score, the results would not be accepted as valid. The results could be skewed higher because the student became more familiar with the content. By contrast, the results could be skewed lower if the student became bored with repeated administrations.

The results of evaluation and testing can also be limited by unfounded assumptions about the constructs being tested. For example, the traditional view of intelligence has been that it is a fixed capability. If someone were to use the results of an intelligence test to promote this misconception, it would greatly limit the usefulness of the test results. The practical application of the construct is, of course, a better measure of the construct than any test. The best validation of any intelligence test is its correlation with a student's adaptation to school. Adaptation to school is influenced by many factors, and changing any of these influences can help the student adapt to school more effectively. In that way, the practical application of the student's intelligence would be improved.

How Well Do These Techniques Perform in Regard to Outcomes?

Evaluation and testing instruments are constantly updated and revised as their common uses and expectations change. When problems are identified and an instrument is not producing the expected outcomes, it is rejected. Many research projects rely on these instruments and techniques and regularly report how they can be modified or under what conditions they should be used. Despite this constant research and evaluation of the effectiveness of tests and techniques, there are still some general concerns.

In the 2010s, the Myers-Briggs remained a popular personality test, even though it was designed by a mother-daughter team during World War II and first published in 1956. The purpose of the Myers-Briggs is different from that of many other tests because the results are meant to be used by the person taking the test. The person who takes the Myers-Briggs is assigned to one of 16 personality types, each with a description of preferences. Recognizing one's preferences can help in developing relationships with others and in exploring possible careers. The value of the Myers-Briggs is debated because its concept of personality differs from the traditional view. The Myers-Briggs test produces mutually exclusive types, whereas the traditional view describes personality along a continuum of various traits. The Myers-Briggs approach sees personalities as qualitatively different, whereas the traditional approach sees personalities as quantitatively different.

In general, projective techniques are accepted as being as accurate for their intended purpose as standardized tests are. However, reports on the validity of projective techniques conflict. One perceived problem with using projective techniques with children is that children may fake their responses, thus hiding their true selves. Also, the situation and the examiner's manner may influence the child's responses. Moreover, the lack of norms for these types of tests leads clinicians to rely on highly individualized interpretations.

Despite these weaknesses, the projective techniques remained popular with clinicians in the 2010s. One reason is that the techniques work well as an “ice breaker” in therapy. Using these techniques helps to build rapport between the clinician and the children being evaluated. Another reason is that many clinicians use these techniques as part of a structured interview. The clinicians who use these techniques are looking for broad information while recognizing the low precision of the measures. The results then are treated like clues that can be pursued later in therapy. No serious decision or immediate action would be based solely on the results of any one of these tests.

How Are Test Results Evaluated?

What Kinds of Biases May Influence These Methods?

Standardized tests are developed through a process called norming, in which statistics for a specified population are regularly updated. These tests are expected to be free from bias for the population for which they have been normed. However, the degree to which the normed population is described varies among the many tests available. For instance, if the population is identified as American students in a specified grade or range of grades, there should be some recognition of the diversity of ethnic groups among modern American students. Because there was no serious recognition of gender differences before the 1980s, there may be some gender bias if the test has not been normed since the early 1980s.

There are other risks of bias in standardized tests, even the ones that have been normed properly. Most tests rely on questions that present an example or small story, often with a character's gender or ethnicity identified in some way. If these examples or stories promote gender or ethnic stereotypes, certain test-takers may feel that the test was not meant for someone like them. Such stereotypes could be promoted if all the female characters are depicted in traditional feminine roles or if every minority character is depicted as being in some kind of trouble.

Another source of bias is called item bias, in which the wording of the test question leads students of the same minority background to choose the same specific wrong answer over the right answer that they would have chosen had the wording been different. An example of item bias would be a question on a standardized test that required the test-taker to quickly recognize whether the wife of a duke is a “duchess” or a “dutchess.” Otherwise knowledgeable students from Dutchess County, New York, may not recognize the misspelling in the time allowed. They would have to ignore the spelling that is common to them in a way that students in other parts of the country would not. Therefore, the question would be testing something different for the Dutchess County students than for the other students taking the same standardized test. Similarly, the test question could use a term that may have a different meaning for members of a minority group. Usually the test publisher will catch examples of item bias after each administration of the test. There are some basic analytical methods to determine whether all (or most) of the high-scoring test-takers in a specific minority are choosing the same wrong answer on a particular question.
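The kind of check described above can be sketched in a few lines. Everything here is hypothetical: the response records, the group labels, the cutoff of 85, and the function name `common_wrong_answer` are invented to illustrate the idea. It is a simple tally, not a formal statistical procedure such as those test publishers may use.

```python
from collections import Counter


def common_wrong_answer(responses, correct, group, min_total):
    """Among high scorers in `group`, return the most common wrong choice on one
    question and the share of those high scorers who picked it."""
    high = [r for r in responses if r["group"] == group and r["total"] >= min_total]
    wrong = Counter(r["answer"] for r in high if r["answer"] != correct)
    if not wrong:
        return None, 0.0
    choice, count = wrong.most_common(1)[0]
    return choice, count / len(high)


# Invented records: "total" is the whole-test score, "answer" is the choice on
# the one question being checked. Choice "A" stands for the correct spelling.
responses = [
    {"group": "local", "total": 92, "answer": "B"},
    {"group": "local", "total": 88, "answer": "B"},
    {"group": "local", "total": 95, "answer": "B"},
    {"group": "local", "total": 90, "answer": "A"},
    {"group": "local", "total": 55, "answer": "C"},
    {"group": "other", "total": 91, "answer": "A"},
    {"group": "other", "total": 89, "answer": "A"},
    {"group": "other", "total": 93, "answer": "B"},
    {"group": "other", "total": 87, "answer": "A"},
]

for group in ("local", "other"):
    print(group, common_wrong_answer(responses, "A", group, min_total=85))
```

In this made-up data, three of the four high scorers in the "local" group pick the same wrong answer, compared with one of four in the "other" group. A gap like that would flag the question for a closer look at its wording.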

Projective tests are not free from cultural or gender bias; in addition, they have a set of biases not seen in standardized tests. Because the interpretation of the test rests on the judgment of the test administrator, there are openings for various subjective biases. The first bias is researcher bias, which occurs when test interpreters believe that they know what the outcome will be. Test interpreters may make small recording errors that lead the results to support their pre-established judgment. Moreover, test interpreters may pay more attention to those results that coincide with the pre-established judgment and ignore those results that go against it. If test interpreters are the same individuals as the test administrators, they may ask leading questions or give subtle hints about the “correct” response.

Another source of bias in the administration of projective tests is social desirability on the part of the individuals being tested. Social desirability is a factor that influences test responses when individuals who are being tested respond with what they think are the socially acceptable responses and not what they truly feel or think. Individuals may also think that the testing is leading to a predetermined end and may respond in a way that they believe will bring about that end. If individuals being tested perceive that the test administrator or interpreter has already decided what the results are, they may detect cues from the administrator and respond accordingly.
