"Testing and Teacher Accountability" - presented by Jim Carpenter

Testing and Teacher Accountability

by Dr. Jim Carpenter
Prairie UU Society
August 22, 2010

Introduction

Jim Carpenter in 1964 began his third career, working as an elementary school teacher. By 1969 he earned a Ph.D. in educational administration from Northwestern University. Because of his concomitant work in measurement and statistics he was recruited by the Chicago Public Schools to be Director of Research and Evaluation. From 1977 until his retirement in 1991 he served as principal in Chicago elementary and middle schools, always in the inner city. From 1996 to 2001 Jim and his wife Margaret were employed as Co-coordinators of the Alternative Teacher Certification Program at the University of Wisconsin-Parkside. During part of this time UW-Parkside also enabled Dr. Carpenter to form a research and evaluation consortium for school districts in southeast Wisconsin.

Prologue

Because I write more concisely than I speak, I have written my presentation today and that includes this prologue.

On the same day that it was suggested to me that I relate my presentation to some or all of the seven UU principles, my wife, Margaret, had mentioned to me that someone who works for her is not courageous. My response was that I thought he is too ambitious to be courageous. It reminded me of a situation when I was Director of Research for the Chicago Public Schools and I was told that if I insisted on reporting that some programs (costing about $7 million a year) did not result in improved test scores, I would never be made Assistant Superintendent of Research. The prediction was correct and I eventually decided to become a school principal. As I said at the time, why be the head of a Department of Research and Evaluation that did not do research and evaluation? Every once in a while we may have our belief in principles tested.

At the conclusion of my presentation there should be time for comments and questions. If elucidation is needed along the way, please feel free to ask.

Testing and Teacher Accountability
or
Should Test Scores Be Used To Determine Teacher Accountability

At one time I was the principal of a kindergarten through fifth grade inner city school in Chicago. Chicago schools were big: we had thirty students in each room and there were five rooms at each grade level. Of the approximately 150 students who completed kindergarten each year, we would identify 15 to 20 who needed extra help during the coming year to get ready for first grade. Every year this class was taught by Rachel Shapiro, who, I came to think, was a Jewish saint. Observing her teaching, which required so much saintly patience, made me squirm. If you had simply tested Rae’s class at the end of the year with a standardized test and based her pay on whether the students had achieved the second grade level, you would have turned this saint into a martyr.

In what follows I would like to review some of the problems and possibilities of educational testing, indulge in offering some observations and opinions, and, in some cases, illustrate with some personal experiences.

When I started evaluating educational programs and practices in 1970, standardized, or norm-referenced, tests were widely, and frequently exclusively, used. These tests were created by publishers who would give each grade-level test to a national sample of students at that grade level. The tests usually consisted of multiple-choice items with four choices each. The scores were divided into ten-month school years. Thus a 1.0 was nationally typical of beginning first graders, 1.5 was the typical score of those who had completed the first half of first grade, and so on through ninth grade and above in high school. (By 1970 academics had begun to accept that inner city students typically gained only seven instead of ten months per year, and that the socioeconomic status of students explained most of this variation. At that point the general public and probably most educators resisted these explanations.)

The standardized test metric was very convenient and offered a way to objectively and fairly evaluate a teacher’s performance. When I was a middle school principal I was able to use it. If one takes a student’s most recent test score and divides it by the number of years that the student has attended school, the result, expressed in months, is the average gain per year during his or her school career. If a teacher can enable a student to exceed this average gain, that is commendable. As an example, if a class has an average historical gain of seven months per year, and teacher A achieves a nine-month gain, that is commendable even though the class did not gain a year in a year.

On the other hand, if the historical gain is 1.2 years and teacher B achieves a one year gain, I would submit that teacher A has done a better job than teacher B, even though teacher B’s class gained a month more.
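The comparison of teacher A and teacher B can be sketched as a short calculation. This is only an illustration of the arithmetic described above, not a scoring procedure taken from the talk: the function names are invented here, and the example score of 4.2 after six years of school is a hypothetical value chosen to give a seven-month historical gain.

```python
MONTHS_PER_SCHOOL_YEAR = 10  # grade-equivalent scores use ten-month school years


def historical_gain_months(grade_equivalent: float, years_in_school: int) -> float:
    """Average months gained per year: most recent score divided by years attended."""
    score_in_months = round(grade_equivalent * MONTHS_PER_SCHOOL_YEAR)
    return score_in_months / years_in_school


def excess_gain_months(gain_this_year: float, historical_gain: float) -> float:
    """Months gained beyond (positive) or short of (negative) the historical average."""
    return gain_this_year - historical_gain


# Hypothetical class: a score of 4.2 after six years averages 7 months per year.
class_a_history = historical_gain_months(4.2, 6)

# Teacher A: nine-month gain against a seven-month history.
teacher_a = excess_gain_months(9, class_a_history)
# Teacher B: one-year (ten-month) gain against a 1.2-year (twelve-month) history.
teacher_b = excess_gain_months(10, 12)

print(teacher_a, teacher_b)
```

On this measure teacher A comes out two months ahead of the class’s history and teacher B two months behind it, even though teacher B’s class gained a month more in absolute terms.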

Unfortunately, the standardized, or norm-referenced, tests had some problems, some of which I would like to mention.

The sample testing to establish norms was neither consistent enough nor extensive enough.

When I first went to work in the Chicago Schools, the annual testing was conducted by “Pupil Services” instead of by a Research and Evaluation Department. One committee was set up to select a primary test (they chose the “Metropolitan”) and another committee was set up to select the fourth through eighth grade test (they chose the “Iowa Test of Basic Skills”). I warned the superintendent that we should select one test for all grades and that the “Iowa” was “harder” than the “Metropolitan”. He chose to honor the committee selections, and for years the Chicago newspapers reported the inadequacies of Chicago fourth grade teachers.

Also, test publishers extrapolated beyond their samples. A fifth grade test was at most given to fourth, fifth, and sixth grades, but scores were extrapolated far beyond these three grade levels. If a ninth grade math score on a fifth grade test meant anything, at best it was the publisher’s guess about how a ninth grader would score on the fifth grade test; it certainly did not mean the test taker had mastered algebra.

The tests are only accurate if they are given at the appropriate level.

Suppose you were to give an eighth grade test to a student who could only read at a fourth grade level, and suppose you taught that student to select one choice, any choice, for each item. Since there are four choices for each item, by chance the student would get 25% of the items correct. This would probably yield a sixth grade score, even though the student could only read at a fourth grade level. In order to get an accurate measurement you would need to test students at “reading” level rather than at “grade” level. (My staff and I had done some reading level testing and knew something about the results, but by this time the superintendent had hired one of my Northwestern University professors for the job for which I had been recruited. The professor assured the superintendent that if we tested at reading level the scores would go up. When citywide testing at reading level showed that eighth graders were actually reading at a level two years below their previously reported level, it was one of the final nails in the coffin for the superintendent, who was soon gone. I decided to become a school principal, but in subsequent years it was sometimes hard to tell whether reported citywide reading improvement was real or the result of fudging the rules used to determine “reading” level.)
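The 25% figure follows directly from the number of choices per item. A minimal sketch of that arithmetic is below; the 40-item test length is an assumption made only for the example, and the conversion from a raw score to a grade-level score is not modeled, since it depended on each publisher’s norms.

```python
from fractions import Fraction


def expected_chance_correct(num_items: int, choices_per_item: int = 4) -> Fraction:
    """Expected number of items answered correctly by pure guessing."""
    return Fraction(num_items, choices_per_item)


# Hypothetical 40-item test with four choices per item.
num_items = 40
expected = expected_chance_correct(num_items)
print(expected, expected / num_items)
```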

Cheating on the standardized tests was sometimes prevalent, though subsequent precautions improved practices.

The easiest way was to simply erase students’ wrong answers and darken in the correct circle. At one time my sub district superintendent sent me to visit two of my fellow principals at their schools to learn how to be a better principal. Each of my friends tried to teach me how to cheat on the tests. I became convinced that one other principal and I were the only ones in the 25-school sub district who actively worked to ensure that there was no cheating on the tests. When I suggested to the sub district superintendent a relatively easy way to check up on each school, he was not interested. I don’t think he cheated, but he was not particularly interested in having his sub district have even lower test scores.

In the 1970s a different test concept began to be promoted. This was criterion-referenced testing. Rather than achievement tests being norm-referenced, a set of criteria was to be established to determine whether students had mastered agreed-upon objectives. These criteria were usually established by a group of appointed experts. (Before this time it was sometimes observed that a kind of national curriculum was established by the materials presented by textbook publishers.) The determination of the objectives and criteria also became a way to establish curriculum. In the 1990s the adoption of the Wisconsin Knowledge and Concepts Exam resulted in increased instructional attention to solving word problems in math and to improving writing skills in language arts. (Some teachers find word problems difficult, and correcting and grading students’ essays is more difficult than assigning a page in a grammar workbook.)

At the present time each of the fifty states has its own set of criterion-referenced achievement tests. Because each state established its own tests, there is no practical and reliable way to compare results between states. The test results are usually given in terms of the percentage of students who have mastered the criteria at each grade level.

Whereas with the norm-referenced tests, by definition, half of the students were above average and half below average, with the criterion-referenced tests all students can theoretically master the criteria, leading to something that has the appearance of the Lake Wobegon all-students-above-average phenomenon. Particularly with the No Child Left Behind Act requirement that each school achieve this phenomenon in the near future, there has been considerable incentive to make the tests as easy as possible.

Presently the main basis for comparison between states is a program conducted by the government-funded National Assessment of Educational Progress, which does use consistent achievement tests with a sample of students in all fifty states, thereby giving some basis for comparisons between states and over time. Comparisons are also made using the end-of-high-school achievement tests, the ACT and SAT.

As I mentioned previously, using the metric embodied in the old norm-referenced tests, I could compare past and present performance. With the current tests this is not possible, although efforts are being made to make the evaluation of teachers fairer by taking into account some co-variables, such as those related to social class. I would think there would be difficulties finding acceptance for this.

Is it possible to establish tests with the kind of metric I just mentioned? Perhaps. There are currently actions being taken to establish a national curriculum in the teaching of mathematics and, not as far along, in the teaching of language arts. These could result in the kind of tests that combine the best features of both the norm-referenced and criterion-referenced tests. In what follows I would like to address the idea of testing and teachers’ accountability, but without this perfection of the tests, I seriously doubt that holding teachers accountable through test scores can be done fairly.

I would like to mention some pertinent, and some impertinent, observations and opinions:

The whole notion of teacher accountability is aimed primarily at inner city and impoverished schools. (Suburban type schools have the luxury of just engaging in good teaching because, incidentally, most of their students will do pretty well on the tests.)
Without good classroom management skills, inner city teachers will not accomplish much. (Many parochial schools get away with paying their teachers about half as much as public school teachers because the parochial school teachers do not have the discipline problems to deal with.)
Good, interesting teaching helps to minimize discipline problems.
With good classroom management, 5% of the teacher’s time is spent on discipline; without good classroom management 95% of the teacher’s time can be spent on discipline.
Facilitating the removal of ineffective teachers would do much more to improve teaching than would any merit pay incentive. (Most teachers didn’t start teaching to get rich, anyway.)
In most schools many faculty and staff members know who are the good teachers and who are the poor teachers. Shared governance needs to be utilized.
Research shows that charter schools can be as good as public schools, but in many instances they are not as good. In my small sample of public schools and charter schools in Milwaukee in which I had practice teachers to supervise, the public schools were better organized, better staffed, and, in my estimate, probably more effective. With charter schools, beware of comparisons if the students are self-selected, and beware if the faculty and staff are required to put in time and effort that are probably unsustainable over a period of time.
Finally, while any evaluator of a teacher will factor in student achievement, anyone who only knows test scores should never be allowed to make the final judgment.