Blueprints Programs = Positive Youth Development

Reading Recovery

Blueprints Program Rating: Promising

A one-to-one tutoring intervention to reduce the number of first-grade students who have extreme difficulty learning to read and write and to reduce the cost of these learners to educational systems.

  • Dr. Marie M. Clay, Deceased
  • Academic Performance

    Program Type

    • Academic Services
    • Mentoring - Tutoring
    • School - Individual Strategies
    • Skills Training

    Program Setting

    • School

    Continuum of Intervention

    • Indicated Prevention (Early Symptoms of Problem)

      Population Demographics

      The program targets all first-grade students who score in the lowest 20% on reading skills, as determined by a Diagnostic Survey and the classroom teachers’ judgment. The program is effective among schools in low, average, and high SES settings and has been implemented in various countries (e.g., US, Australia, England). The program has also been used successfully with Spanish speakers, in a Spanish version called Descubriendo la Lectura.

      Age

      • Late Childhood (5-11) - K/Elementary

      Gender

      • Male and Female

      Race/Ethnicity

      • All Race/Ethnicity

      Race/Ethnicity Specific Findings

      • Hispanic or Latino

      Race/Ethnicity/Gender Details

      The program targets at-risk first-graders of any race/ethnicity and both genders. A Spanish version of the program has been tested with Spanish-speaking students.

      • Individual
      • School
      Risk Factors
      • School: Poor academic performance*
      Protective Factors
      • Individual: Problem solving skills
      • School: Instructional Practice

      *Risk/Protective Factor was significantly impacted by the program.

      See also: Reading Recovery Logic Model (PDF)

      The program is an intensive one-to-one tutoring intervention program for the poorest readers (lowest 20%) in first-grade classrooms. During daily 30-minute lessons, teachers who are specifically trained in Reading Recovery techniques individually tutor up to eight faltering readers to help them develop the kinds of strategies that good readers use. For the first 10 days, the teacher does not teach, but rather explores reading and writing with the child to determine specific needs. During the following days, Reading Recovery lessons revolve around reading small story books (the teacher chooses from 500 books organized into 20 reading levels), manipulating letters and words, and composing and writing a story. Specific skills taught include problem-solving strategies based on self-monitoring, cross-checking, predicting, and confirming, as well as the use of multiple sources of information while reading and writing. Children typically leave the program within 12 to 20 weeks (60 sessions), as soon as they have reached about the average level of text reading for their class.

      Reading Recovery originated in New Zealand, and has been a nationwide program in that country since 1979. It has been successfully adapted and tested for four years in Ohio, and is now being disseminated to many other locations throughout the United States, Canada, and Australia. Reading Recovery in the U.S. is a collaboration between universities and school districts, involving a one-year academic course for teachers. By the early 1990s, Reading Recovery was operating in 48 states.

      The program is an intensive one-to-one intervention tutoring program targeting the lowest-achieving 20% of first-grade readers. During daily 30-minute lessons, teachers who are specifically trained in Reading Recovery techniques individually tutor faltering readers to help them develop the kinds of strategies that good readers use. Lessons are tailored to a student's individual strengths and needs based on teacher observations. Teachers must be able to make highly skilled decisions at each moment during the lesson. Once fully trained, Reading Recovery teachers provide lessons to approximately eight first grade students. These students are served during half of the teacher's work day. During the other half of the day the Reading Recovery teacher performs additional duties that vary by individual, such as classroom instruction, small-group work, or instructional coaching.

      Reading Recovery focuses on phonemic awareness, phonics, vocabulary, fluency, and comprehension. The program starts with an assessment of the child’s strengths and weaknesses (letter identification, word test, concepts about print, writing, dictation test, text reading). For the first 10 days, the teacher does not teach, but rather explores reading and writing with the child. During the following days, Reading Recovery lessons revolve around reading small story books (the teacher chooses from 500 books organized into 20 reading levels) and composing and writing a story. Specific skills taught include problem-solving strategies based on self-monitoring, cross-checking, predicting, and confirming, as well as the use of multiple sources of information while reading and writing. Once students are equipped with these strategies for independent processing, struggling readers can achieve at average levels and maintain proficiency in the regular classroom without special intervention.

      Lessons are discontinued when students demonstrate the ability to consistently read at the average level for their grade, between weeks 12 and 20 of the program. Those who make progress but do not reach average classroom performance after 20 weeks are referred for further evaluation and a plan for future action.

      Teacher training includes a one-year university-based training program through a network of partner universities and ongoing professional development by a Reading Recovery teacher leader.

      Theory of learning: The program is based on a theory of learning which assumes that people learn by constructing meaning through social interactions. Learners engage in social activities that support their learning, and they gradually take over the process, becoming independent literacy learners.

      Theory of instruction: Any theory of learning implies a theory of instruction. Adults help children to solve problems and in the process provide conditions that help the children find the patterns and regularities they will use to solve problems alone at future times. The complexity of whole tasks is maintained, yet each is tailored for the child to participate easily. The involvement of the more expert adult provides demonstrations that communicate information about the way people go about the task.

      Reading Recovery provides opportunities for ongoing conversation while the student is engaged in authentic reading and writing tasks. The conversation between teacher and child operates to stimulate, encourage, challenge, and support reading work. This is based on the theoretical assumption that higher mental functions appear first on the social level between people (intercognitive), and later on the individual level, inside the child (intracognitive). This growth occurs in the zone of proximal development, that phase in the development of a cognitive skill where a child has only partially mastered the skill. By employing the skill with the assistance of an adult, the child internalizes it.

      • Cognitive Behavioral
      • Skill Oriented
      • Social Learning

      (May et al., 2013, 2014):
      These two randomized controlled trials estimated short-term program impacts on student achievement after the 12-20 week program. Both studies were part of a national scale-up implementation of Reading Recovery. Of the schools implementing the program, 209 and 348 were randomly selected to participate in the two trials, respectively, of which 158 and 267 implemented the random condition assignments. At each school, the eight first-grade students with the lowest reading achievement were matched into pairs according to pretest scores and English language learner status, and within each pair, one student was randomly assigned to treatment and the other to control. The studies administered pretests prior to randomization and posttests at the conclusion of the intervention (midway through the school year). Of the 1,253 and 2,067 randomly assigned students, 866 (69%) and 1,430 (69%) students in 147 and 233 schools had Reading Recovery data, outcome data, and a match with complete data.
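The matched-pair assignment described above can be illustrated with a minimal sketch. It is not the procedure used by May et al.; the function name, the record fields (`id`, `pretest`, `ell`), and the rule of pairing neighbours on sorted pretest scores are all assumptions made for illustration.

```python
import random


def assign_matched_pairs(students, seed=None):
    """Pair students with similar pretest scores within the same ELL status,
    then randomly assign one member of each pair to treatment.

    `students` is a list of dicts with hypothetical keys: id, pretest, ell.
    A student left without a partner (odd group size) is not assigned.
    """
    rng = random.Random(seed)
    assignments = {}
    for ell_status in (False, True):
        # Sort within the ELL group so neighbours have similar pretest scores.
        group = sorted(
            (s for s in students if s["ell"] == ell_status),
            key=lambda s: s["pretest"],
        )
        for i in range(0, len(group) - 1, 2):
            pair = [group[i], group[i + 1]]
            rng.shuffle(pair)  # coin flip deciding who receives treatment
            assignments[pair[0]["id"]] = "treatment"
            assignments[pair[1]["id"]] = "control"
    return assignments


# Eight hypothetical lowest-scoring students at one school.
school = [
    {"id": i, "pretest": score, "ell": ell}
    for i, (score, ell) in enumerate(
        [(3, False), (4, False), (6, False), (7, False),
         (2, True), (3, True), (5, True), (6, True)]
    )
]
result = assign_matched_pairs(school, seed=1)
```

Because assignment happens inside each pair, treatment and control groups are balanced on pretest score and ELL status by construction, which is the point of matching before randomizing.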

      Other studies:
      Randomization: All other studies used quasi-experimental designs to study program effects (e.g., Pinnell et al., 1988; Curry, Griffith, & Williams, 1995; Burroughs-Lange & Douetil, 2007) or a mix of quasi-experimental and randomized control trial designs (e.g., Pinnell et al., 1994; Baenen et al., 1997; Hurry & Sylva, 2007). In most of these cases schools that were already implementing Reading Recovery were selected and were matched with schools of similar characteristics that formed the control group.
      Conditions: Usually, the Reading Recovery treatment was compared to a control group without access to the program. However, some studies have investigated variations of Reading Recovery by modifying the instruction framework (one-to-one instruction vs. small groups) or the teacher training model (varying underlying training philosophy and training length) (Pinnell et al., 1994). Other studies have compared Reading Recovery to other types of interventions (e.g., Phonological Training, Hurry & Sylva, 2007).
      Sample size: Usually samples were of medium size, including between 300 and 500 students (intervention and control group combined), located in 20 to 40 schools in up to 10 school districts.
      Assessment: Some studies evaluated only posttest effects (e.g. Escamilla, 1994; Burroughs-Lange & Douetil, 2007), while the majority investigated the sustainability of the program effects between one and three years after the intervention was completed.
      Locations: The program has been implemented in locations with diverse SES characteristics in various countries (e.g., US, Australia, England).

      Primary findings:
      All studies demonstrated the effectiveness of Reading Recovery in improving reading and writing skills of low-achieving first-graders compared to a control group. However, the results are mixed regarding long-term effects. While some studies (Pinnell, DeFord, & Lyons, 1988; Pinnell et al., 1994; Curry, Griffith, & Williams, 1995) found sustained improvement in children’s reading abilities a year or longer after posttest, other studies (Center et al., 1995; Baenen et al., 1997; Hurry & Sylva, 2007) reported declining effects that became non-significant at the final assessment (1 to 3 years).

      Secondary findings:
      Compared to alternative interventions (e.g., Reading Success, Phonological training), Reading Recovery consistently produces the strongest improvements in children’s reading and writing skills (Pinnell et al., 1994; Hurry & Sylva, 2007). This suggests that the unique program features of Reading Recovery, such as the length and philosophy of teacher training and the one-to-one tutoring approach, are important for children's learning success (Pinnell et al., 1994).

      In addition, Reading Recovery has been shown to positively affect learning attitudes (Burroughs-Lange & Douetil, 2007), and to be effective when used with Spanish-speaking students (Curry, Griffith, & Williams, 1995; Escamilla, 1994).

      Primary findings:

      • Consistent evidence confirmed that Reading Recovery improves reading and writing skills of low-achieving first-graders compared to a control group measured at posttest.
      • Mixed results regarding long-term effects of Reading Recovery.

      Secondary findings:

      • Compared to alternative interventions (e.g., Reading Success, Phonological training), Reading Recovery consistently produced the strongest improvements in children’s reading and writing skills.
      • Reading Recovery was able to produce a better attitude towards learning among low-achieving students.
      • Reading Recovery was effective among Spanish-speaking students in its Spanish version, Descubriendo la Lectura.

      No mediating effects were measured.

      Reading Recovery generally showed medium to strong effects on reading and writing skills of low-achieving first-grade students. A few examples follow:

      • Medium sized effects were observed if improvements in reading skills were measured by dictation assessment (d=.65) and the Gates-MacGinitie test (d=.51), and strong effects for the reading level assessment (d=1.50) (Pinnell et al., 1994).
      • Out of 7 significance tests on reading and writing skills, Burroughs-Lange & Douetil (2007) observed 6 strong effects (Cohen’s d>.8) and 1 medium effect (d=.76).
      • On an overall reading/spelling composite measure, a medium effect (d=.77) was observed comparing Reading Recovery students to a control group within schools, while a strong effect (d=.88) was observed if the same comparison was made between schools (Hurry & Sylva, 2007).
      • For 6 out of 8 measures, very strong effects (d=1.48-3.05) were observed at posttest. The effect size diminished slightly at the 3-month follow-up (d=.76-1.55), and no significant effect was observed at the 12-month follow-up (Center et al., 1995).
      • Five out of six measures of literacy (the 6th is not provided) displayed large effect sizes ranging from d=.90 to d=2.02 (Schwartz, 2005).
      • For three reading outcomes (composite reading, reading words subscale, and reading comprehension subscale) from a standardized achievement test, effect sizes were small-medium (May et al., 2013, 2014).

      No effect sizes were reported by a number of studies (e.g., Pinnell, DeFord, & Lyons 1988; Curry, Griffith, & Williams, 1995; Baenen et al., 1997; Escamilla, 1994).
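For readers unfamiliar with the effect sizes quoted above, the following sketch shows how Cohen's d is commonly computed from group means and standard deviations using a pooled standard deviation. The cut-offs (0.2, 0.5, 0.8) are Cohen's conventional benchmarks, which match how this page labels values such as d=.76 (medium) and d=.88 (strong). The input numbers in the example are hypothetical and do not come from any of the studies.

```python
import math


def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)


def label_effect(d):
    """Label an effect size with Cohen's conventional benchmarks."""
    size = abs(d)
    if size >= 0.8:
        return "strong"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


# Hypothetical example: treatment mean 10, control mean 7, both SDs 4, n = 50 each.
d = cohens_d(10, 7, 4, 4, 50, 50)
print(f"d = {d:.2f} ({label_effect(d)})")
```

Individual studies may have used a different standardizer (for example, the control-group standard deviation), so values computed this way are not guaranteed to reproduce the published figures.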

      The results can be generalized to young children (first-grade) who have reading problems of both genders and various ethnic and socioeconomic backgrounds. The program's effectiveness has been confirmed for schools serving low, average, and high SES neighborhoods, in various countries (e.g. England, US, Australia), and when taught in different languages (English and Spanish).

      The studies investigating Reading Recovery show some substantial limitations. Recurring issues include the following:

      • Attrition is frequently not reported and no study employed statistical tests to investigate differential attrition.
      • Descriptions of the sample characteristics are limited and frequently completely absent (e.g., Pinnell, DeFord, & Lyons, 1988; Escamilla, 1994; Center et al., 1995; Baenen et al., 1997).
      • A number of studies did not follow the intent-to-treat principle (e.g., Pinnell et al., 1994; Curry, Griffith, & Williams, 1995; Center et al., 1995; Baenen et al., 1997; Schwartz, 2005).
      • A number of studies did not investigate long-term effects (> 1 year) (e.g., Pinnell et al., 1994; Escamilla, 1994; Burroughs-Lange & Douetil, 2007; Schwartz, 2005).
      • Many studies did not use a full randomization procedure to assign intervention and control groups (e.g., Escamilla, 1994; Curry, Griffith, & Williams, 1995; Center et al., 1995; Burroughs-Lange & Douetil, 2007; Hurry & Sylva, 2007; Schwartz, 2005).
      • Most studies did not control the fidelity of program implementation (e.g., Escamilla, 1994; Curry, Griffith, & Williams 1995; Baenen et al., 1997; Burroughs-Lange & Douetil, 2007; Schwartz, 2005).
      • A number of studies poorly report their methodology (Curry, Griffith, & Williams, 1995; Baenen et al., 1997).
      • In many studies the evaluation of program effectiveness is conducted by investigators (often teachers) who are not blind to the group assignment (e.g., Pinnell, DeFord, & Lyons, 1988; Escamilla, 1994; Pinnell et al., 1994; Center et al., 1995; Baenen et al., 1997; Schwartz, 2005).

      Of the studies reviewed, only studies 10 and 12 (May et al., 2013, 2014) meet Blueprints criteria. Major limitations in baseline equivalence, differential attrition, and controls for baseline outcomes disqualify the others. May et al. (2013, 2014) did not conduct a complete differential attrition analysis and it is not clear if the study followed intent-to-treat, but these limitations likely do not seriously bias the results.

      There are a number of studies examining the effectiveness of Reading Recovery, but most are not high quality. What Works Clearinghouse identified 202 studies, of which three met evidence standards: Pinnell et al. (1988), Pinnell et al. (1994), and Schwartz (2005).

      • Blueprints: Promising
      • What Works Clearinghouse: Meets Standards Without Reservations - Positive Effect

      If you would like to contact a site currently implementing this program, please contact:

      Jady Johnson, Executive Director
      Reading Recovery Council of North America
      500 West Wilson Bridge Road, Suite 250
      Worthington, Ohio 43085-5218
      Phone: (614) 310-7323
      Fax: (614) 310-7345
      jjohnson@readingrecovery.org
      www.readingrecovery.org

      Baenen, N., Bernhole, A., Dulaney, C., & Banks, K. (1997). Reading Recovery: Long-term progress after three cohorts. Journal of Education for Students Placed at Risk, 2(2), 161-181.

      Burroughs-Lange, S., & Douetil, J. (2007). Literacy progress of young children from poor urban settings: A Reading Recovery comparison study. Literacy Teaching and Learning, 12(1), 19-46.

      Center, Y., Wheldall, K., Freeman, L., Outhred, L., & McNaught, M. (1995). An evaluation of Reading Recovery. Reading Research Quarterly, 30(2), 240-263.

      Curry, J., Griffith, J., & Williams, H. (1995). Reading Recovery in AISD. Austin Independent School District: Department of Audit and Evaluation.

      D’Agostino, J. V. & Murphy, J. A. (2004). A meta-analysis of Reading Recovery in United States schools. Educational Evaluation and Policy Analysis, 26(1), 23-38.

      Escamilla, K. (1994). Descubriendo la Lectura: An early intervention literacy program in Spanish. Literacy Teaching and Learning, 1(1), 58-70.

      Hurry, J., & Sylva, K. (2007). Long-term outcomes of early reading intervention. Journal of Research in Reading, 30(3), 227-248.

      May, H., Gray, A., Gillespie, J. N., Sirinides, P., Sam, C., Goldsworthy, H., ... Tognatta, N. (2013). Evaluation of the i3 Scale-up of Reading Recovery: Year one report, 2011-12. Philadelphia, PA: Consortium for Policy Research in Education.

      May, H., Goldsworthy, H., Armijo, M., Gray, A., Sirinides, P., Blalock, T. J., ... Sam, C. (2014). Evaluation of the i3 Scale-up of Reading Recovery: Year Two Report, 2012-13. Philadelphia, PA: Consortium for Policy Research in Education.

      Pinnell, G. S., DeFord, D. E., & Lyons, C. A. (1988). Reading Recovery: Early intervention for at-risk first graders (Educational Research Service Monograph). Arlington, VA: Educational Research Service.

      Pinnell, G. S., Lyons, C. A., DeFord, D. E., Bryk, A. S., & Seltzer, M. (1994). Comparing instructional models for the literacy education of high-risk first graders. Reading Research Quarterly, 29(1), 8-39.

      Schwartz, R. M. (2005). Literacy learning of at-risk first-grade students in the Reading Recovery early intervention. Journal of Educational Psychology, 97(2), 257-267.

      Reading Recovery Council of North America
      500 West Wilson Bridge Road, Suite 250
      Worthington, Ohio 43085-5218
      Phone: 614-310-READ (7323)
      Main Fax: 614-310-7345
      Conference Dept. Fax: 614-310-7342
      http://www.readingrecovery.org/

      Study 10

      May, H., Gray, A., Gillespie, J. N., Sirinides, P., Sam, C., Goldsworthy, H., ... Tognatta, N. (2013). Evaluation of the i3 Scale-up of Reading Recovery: Year one report, 2011-12. Philadelphia, PA: Consortium for Policy Research in Education.

      Study 12

      May, H., Goldsworthy, H., Armijo, M., Gray, A., Sirinides, P., Blalock, T. J., ... Sam, C. (2014). Evaluation of the i3 Scale-up of Reading Recovery: Year Two Report, 2012-13. Philadelphia, PA: Consortium for Policy Research in Education.

      Pinnell, G. S., DeFord, D. E., & Lyons, C. A. (1988). Reading Recovery: Early intervention for at-risk first graders (Educational Research Service Monograph). Arlington, VA: Educational Research Service.

      Evaluation Methodology

      Design:
      This study was a randomized control trial, but the write-up targets primarily practitioners and therefore does not contain many technical details.

      Recruitment:
      The program was implemented in 12 schools in Columbus, OH in the year 1985-1986 (the criteria for selection were not discussed). Thirty-two trained teachers were involved in the project. The lowest 20% (determined by diagnostic survey and teachers' judgment) of children in the classrooms taught by Reading Recovery teachers were selected for the program. The lowest 20% of children in other classrooms in the same schools were also identified; half of these children were randomly assigned to receive Reading Recovery and half were randomly assigned to receive an alternative compensatory intervention. The alternative intervention was implemented in small groups (2-4 students); it is otherwise unclear how this program was structured.

      Sample size/Attrition:
      The study was conducted with 187 children (136 Reading Recovery intervention (RR) and 51 alternative intervention (AI)). Additionally, at each time measurement point, a random sample of students (excluding Reading Recovery and alternative intervention students) was drawn to provide a grade-level average (102 regular first-grade students, 68 regular second-grade students, 67 regular third-grade students). At the time of the first assessment (May 1986, end of first grade), 98% of the intervention group (3 children had moved from the district) and 100% of the alternative intervention students were tested. Attrition increased for the two follow-ups during May 1987 (completion rates: RR=85%; AI=84%) and May 1988 (completion rates: RR=77%; AI=82%).
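The attrition figures above reduce to simple arithmetic, sketched here. The completion rates are taken from the paragraph; the function names are illustrative, and the "gap" is only a descriptive difference in percentage points, not a statistical test of differential attrition (the study reports none).

```python
def completion_pct(tested, enrolled):
    """Percentage of enrolled children who were tested."""
    return 100.0 * tested / enrolled


def attrition_pct(completion):
    """Attrition is the complement of the completion rate."""
    return 100.0 - completion


def differential_attrition(completion_a, completion_b):
    """Absolute gap in attrition between two groups, in percentage points."""
    return abs(attrition_pct(completion_a) - attrition_pct(completion_b))


# Posttest: 3 of 136 Reading Recovery children had moved, so 133 were tested.
print(f"May 1986 RR completion: {completion_pct(133, 136):.0f}%")

# Follow-up completion rates as reported (RR = Reading Recovery, AI = alternative).
follow_ups = {
    "May 1987": {"RR": 85.0, "AI": 84.0},
    "May 1988": {"RR": 77.0, "AI": 82.0},
}
for wave, rates in follow_ups.items():
    gap = differential_attrition(rates["RR"], rates["AI"])
    print(
        f"{wave}: RR attrition {attrition_pct(rates['RR']):.0f}%, "
        f"AI attrition {attrition_pct(rates['AI']):.0f}%, gap {gap:.0f} points"
    )
```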

      Assessment:
      The Reading Recovery intervention was implemented throughout the school year 1985/1986. A pretest was conducted during Fall 1985 while assessments of the program success were conducted in May 1986 (end of first grade), in May 1987 (end of second grade) and May 1988 (end of third grade).

      Sample characteristics:
      No description of sample characteristics is given.

      Measures:
      Children were assessed on eight dependent measures:

      • Text reading skills
      • Letter identification skills
      • Word test
      • Concepts about print
      • Writing vocabulary
      • Dictation test
      • Two subtests of the Comprehensive Tests of Basic Skills (Reading Vocabulary and Reading Comprehension)
      • Writing sample

      Even though not explicitly mentioned, it appears that teachers who delivered the intervention also did the assessments.

      Analysis:
      No statistical tests were conducted. Only means are compared to evaluate the effectiveness of the program. In addition, normal curve equivalent (NCE) gain scores from baseline were computed for Reading Recovery and comparison groups.
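As background on the NCE gain scores mentioned above, a normal curve equivalent is conventionally defined as 50 + 21.06 × z, where z is the standard normal deviate corresponding to a percentile rank. The sketch below applies that standard definition; the study does not describe its own computation, so the functions and the example percentiles are assumptions for illustration only.

```python
from statistics import NormalDist

# Scale factor chosen so that NCE 1, 50, and 99 coincide with
# percentile ranks 1, 50, and 99.
NCE_SCALE = 21.06


def percentile_to_nce(percentile_rank):
    """Convert a percentile rank (strictly between 0 and 100) to an NCE score."""
    z = NormalDist().inv_cdf(percentile_rank / 100.0)
    return 50.0 + NCE_SCALE * z


def nce_gain(pre_percentile, post_percentile):
    """Gain in NCE units between a pretest and a posttest percentile rank."""
    return percentile_to_nce(post_percentile) - percentile_to_nce(pre_percentile)


# Hypothetical child moving from the 20th to the 30th percentile.
print(f"NCE gain: {nce_gain(20, 30):.1f}")
```

Unlike percentile ranks, NCE scores form an equal-interval scale, which is why gains are usually expressed in NCE units rather than as percentile differences.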

      Intention-to-treat: The study complied with the intent-to-treat principle.

      Outcomes

      Implementation fidelity: Teachers received training in Reading Recovery by the program developer Marie Clay. The program was pilot tested at Columbus Public Schools during 1984-1985.

      Baseline Equivalence/Differential attrition: No analysis of baseline equivalence or differential attrition was performed.

      Posttest: In May 1986, Reading Recovery children as a total group (73% had successfully discontinued) scored higher than children in the alternative intervention on all measures. For example, on the text reading test Reading Recovery children scored 9.95 after intervention while alternative intervention children scored only 6.96. Moreover, the scores of the total Reading Recovery children were similar to those of the reference group of regular first-grade students. Specifically, the Reading Recovery group scored slightly higher on letter identification (51.92 vs. 51.78), concepts about print (16.40 vs. 16.00), writing sample (2.94 vs. 2.92), and dictation (31.20 vs. 30.24), and slightly lower on writing vocabulary (34.68 vs. 38.12), text reading (9.95 vs. 11.13), and word test (13.62 vs. 13.91), compared to the reference group.

      When students were given the Comprehensive Tests of Basic Skills, the Reading Recovery children as a group (both discontinued and not discontinued children) gained ground relative to the level of skills expected of them in the fall and again in May. For example, on the measure for reading comprehension Reading Recovery children gained 7.0 points (NCE Gain Score) while children in the comparison group showed a reduction in the gain score (-4.5), comparing pretest and posttest results.

      The Reading Recovery program was extended and implemented across the state of Ohio (1985-86: 110 children served by 28 teacher leaders at 22 schools; 1986-87: 1,130 students served by 235 teachers in 167 school districts; 1987-88: 2,648 children served by 416 teachers in 228 school districts). Although long-term effects were not consistently measured and no control group (alternative intervention) was used, this extended study confirms the effectiveness of Reading Recovery. The data show that high percentages of the Reading Recovery children, ranging from 68.5% to 94.8%, achieved scores on reading and writing skills that were similar to those of the reference group of first-graders without reading disorder.

      Long-term effects: The group of Reading Recovery children maintained the advantage that they had achieved at posttest (at the end of first grade) over children who had received the alternative intervention for up to 2 years after being released from the program. For example, at the end of the third grade, the mean text reading level score for successfully discontinued Reading Recovery children was 23.99 which slightly surpassed the score of the randomly sampled comparison group (23.50), while the mean score for students in the alternative intervention group was 16.71.

      Limitations

      • Random assignment was not made for all intervention students.
      • No clear description of the alternative intervention is provided.
      • No evidence is provided for the validity of the measures used.
      • No description of sample characteristics is given.
      • Attrition was larger than 10% for all groups at the one-year and two-year follow-up assessments.
      • Observers were not blind to the condition.

      Pinnell, G. S., Lyons, C. A., DeFord, D. E., Bryk, A. S., & Seltzer, M. (1994). Comparing instructional models for the literacy education of high-risk first graders. Reading Research Quarterly, 29(1), 8-39.

      This study tests whether the success of Reading Recovery can be attributed to certain components within the instruction framework or the teacher training model. Compared to the original study (Pinnell, DeFord, & Lyons, 1988), this study uses a more sophisticated analytical approach (multilevel models). Besides these small differences, this study closely replicates the methodological setup of the original study.

      Evaluation Methodology

      Design:
      Sample size/Attrition:
      A total of 403 first-grade students (age 6 years) representing two rural, two suburban, and six urban school districts were identified to participate. In each of the 10 districts, 4 schools were each assigned a different treatment, resulting in a sample of 40 schools. Seven schools were dropped from the analysis: six because the random assignment or the test administration was not accomplished correctly, and one in which pretest data were "lost in the mail". Thus the overall experimental sample was reduced to 324 students (80%) in 33 sites (82%). In addition, some individual students were lost due to mobility after the experimental treatment, absenteeism, and failure to obtain a valid administration of a particular test. No information is provided on the extent of this student-level attrition.

      Study type/Randomization/Intervention:
      The design employed for the study was a mix of quasi-experimental and randomized control trial with a split-plots design replicated over a series of blocks (districts). The quasi-experimental part comes from selecting one school in each of the 10 districts that already had Reading Recovery (RR). This school was designated as the RR treatment site for the district. Three additional schools were also identified in each district and randomly assigned to one of the three alternative treatments: 1) Reading Success (RS), which utilized the Reading Recovery lesson framework and procedures in individual daily lessons for children, but teachers were trained in an alternative teacher education model (Theoretical Orientation to Reading Profile developed by DeFord [1985]); 2) Direct Instruction Skill Plan (DISP), which used a one-on-one treatment but varied in the activities and instructional emphasis; 3) Reading and Writing Group (RWG), which involved trained RR teachers applying their knowledge to work with groups of children.

      Each school established a pool of 10 of the lowest-scoring students. Four students from within each pool were randomly assigned to the treatment at that school. The remaining students in the pool constituted a randomized control group. The control group was taught in small groups by teachers who had not received any special training; these teachers were instructed to help students build basic reading skills without specific directions of how to accomplish this goal. With an intervention and control group in each of the four schools, the design included eight groups.
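The within-school assignment described above (4 of the 10 lowest-scoring students drawn at random for treatment, the rest serving as controls) can be sketched as follows. The function name and student identifiers are hypothetical, and the study does not report the actual randomization mechanism.

```python
import random


def split_pool(pool, n_treated=4, seed=None):
    """Randomly draw `n_treated` students for the treatment group;
    the remaining students in the pool form the control group."""
    rng = random.Random(seed)
    treated = rng.sample(pool, n_treated)
    control = [student for student in pool if student not in treated]
    return treated, control


# Hypothetical pool of the 10 lowest-scoring students at one school.
pool = [f"student_{i}" for i in range(1, 11)]
treated, control = split_pool(pool, seed=42)
print("treatment:", treated)
print("control:  ", control)
```

Repeating this draw in each of the four schools per district yields the eight groups described above: one treatment and one control group per school.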

      Assessment:
      For all treatment and comparison groups, pretest data, consisting of the Mason Early Reading Test, Dictation Task 1, and text reading level assessment, were collected in October of Year I (1989). At the conclusion of the tutorial programs in February, the full battery of student measures (Dictation Task 2, text reading level, Woodcock Reading Mastery, and Gates-MacGinitie) was collected in order to determine the immediate impact of the four alternative treatments. As a first follow-up at the end of the academic year in May, the Gates-MacGinitie was readministered in order to assess end-of-the-year progress. Finally, sustained impact (if any) of the four treatments was determined on Dictation Task 3 and text reading tasks 8 months after posttest in October of Year II (1990).

      Sample characteristics:
      The 403 students constituted 238 males and 165 females. Seventy-two children were in school districts whose policies forbade racial identification; the rest of the sample consisted of 244 whites, 86 blacks, and 1 Asian. One hundred thirty-one were in school districts whose policies prohibited the communication of information about free or reduced-price lunch. Of the remaining 272 subjects, 166 (60.8%) were receiving free lunch and 11 (4%) were receiving reduced-price lunch.

      Measures:
      Validity of measurements:
      Measurement validity was established by correlating the test results with scores on a test of word reading with 100 children at age 6. In addition, test-retest reliability was estimated or reported based on published findings that used the same measure. Also Cronbach’s alpha reliability was reported. However, it is not clear who collected the data (presumably the teachers who delivered the program).

      Primary outcomes:
      Students’ reading and writing skills were evaluated using a number of different tests.

      • Dictation tests: three dictation tests were administered (test-retest reliability coefficients .73-.89; Cronbach’s alpha=.96).
      • Text reading level: Clay’s running record technique was utilized (alpha = .83; item separation reliability = .98).
      • Mason early reading test: combines spelling skills test, recognition of high-frequency words, decoding make-believe words, and a reading task (no reliability test was performed for this measure).
      • Revised version of Woodcock reading mastery test: constitutes a comprehensive battery of tests measuring aspects of reading ability such as visual/auditory learning, letter identification, word attack, word identification, word comprehension, and passage comprehension (internal consistency reliability coefficient =.99).
      • Gates-MacGinitie reading test: uses vocabulary and comprehension exercises (reliability coefficients for vocabulary .90-.95; for comprehension .88-.94).

      Analysis:
      The analysis used multilevel models (HLM) with a two-level structure, student-level and school-level. The statistical analysis controlled for baseline scores on two relevant pretest measures (Dictation task, Mason test).
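
      As an illustration only, a two-level model of this kind can be written as follows. The source states just that students were nested in schools and that the two pretests were covariates; the random-intercept form, the treatment indicator T, and all symbol names here are assumptions, not the authors' published specification.

```latex
% Level 1 (student i in school j); T_{ij} = 1 for the intervention group, 0 for control
Y_{ij} = \beta_{0j} + \beta_{1}\,T_{ij} + \beta_{2}\,\mathrm{Dictation}_{ij}
       + \beta_{3}\,\mathrm{Mason}_{ij} + r_{ij}, \qquad r_{ij} \sim N(0, \sigma^{2})

% Level 2 (school j): school-specific intercept around a grand mean
\beta_{0j} = \gamma_{00} + u_{0j}, \qquad u_{0j} \sim N(0, \tau_{00})
```

      Under this sketch, the coefficient on T corresponds to the b values reported in the outcomes below: the adjusted difference between intervention and control students after accounting for pretest scores and school membership.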

      Intention-to-treat: The study may violate the intent-to-treat principle. The authors dropped schools from the analysis after they were assigned to one of the four groups, based on the assumption that the random within-school assignment to intervention and control group was not conducted properly, or due to the unavailability of valid pretests (p. 20).

      Outcomes

      Implementation fidelity: All teachers received extensive training in Reading Recovery procedures or their respective instruction technique necessary for their intervention (e.g. direct instruction skills plan).

      Baseline Equivalence: The study conducted a test for baseline equivalence between the intervention and control group for each school. In general, most pairs were well matched with four exceptions for which gross initial differences were observed. The authors assumed that randomization was not implemented at these schools and thus dropped these cases from the subsequent analysis.

      Differential attrition: No information on attrition was reported, and no analysis of differential attrition was performed.

      Posttest: The multilevel models showed that, compared to the control group in the same school, both the Reading Recovery (RR) and Reading Success (RS) interventions significantly improved reading and writing skills as measured by the dictation assessment (RR b=4.99, p<.01, d=.65; RS b=3.45, p<.05, d=.45) and the text reading level assessment (RR b=5.84, p<.001, d=1.50; RS b=1.75, p<.05, d=.45). However, only the Reading Recovery intervention showed a significant improvement on reading skills on the Woodcock Reading Mastery test (b=.32, p<.05, d=.49) and the Gates-MacGinitie test (b=5.19, p<.05, d=5.1). The reading and writing group intervention showed marginally significant improvements on the text reading level assessment (b=1.60, p<.1, d=.41). No significant results were observed for the Direct Instruction Skills Plan intervention. In summary, Reading Recovery showed significant improvements on all four tests.

      Long-term effects: Long-term effects were evaluated by two tests. As mentioned above, Reading Recovery showed significant improvements on the Gates-MacGinitie test measured in February 1990 as posttest. The same test was administered 3 months later, in May 1990, but no statistically significant results were observed at this point. The second test to evaluate long-term effects was a dictation assessment. Significantly higher scores on the dictation assessment were achieved in February for both the Reading Recovery intervention and the Reading Success intervention. However, 8 months later a sustained effect was detected only for the Reading Recovery intervention; none of the other three interventions differed significantly from the control group. After 8 months, Reading Recovery showed a sustained significant effect on text reading level (b=5.12, p<.01, d=.75) and a marginally significant effect on the dictation test (b=4.98, p<.1, d=.35).

      Limitations:

      • No information on attrition was provided and no analysis of differential attrition was performed.
      • The intent-to-treat principle was not followed since the authors intentionally dropped schools from their analysis for which the randomization procedure had allegedly not been implemented.
      • The follow-up period was short (8 months) and only two of the four tests were employed to measure long-term effects.
      • Observers were not blind to conditions: It is not clearly stated who conducted the tests (most likely the teachers that implemented the program also performed the tests).
      • No information was presented on selection of school districts and their representativeness.
      • Reading Recovery schools had already adopted the program and were self-selected rather than randomly assigned.

      Burroughs-Lange, S., & Douetil, J. (2007). Literacy progress of young children from poor urban settings: A Reading Recovery comparison study. Literacy Teaching and Learning, 12 (1), 19-46.

      This study differs from Pinnell, DeFord, and Lyons in that the researchers took no part in the work in schools, nor manipulated any features of the school provision to children. The study identified and selected already occurring circumstances and, after matching on important characteristics known to affect learning outcomes, compared children’s literacy progress. In addition, the setting differed in that schools were chosen that serve disadvantaged urban areas in London. This study does not investigate long-term effects of the program.

      Evaluation Methodology

      Design:
      Sample size/Attrition:
      The intervention took place across one school year (2005-2006) in 42 schools serving low-income urban areas in London. The sample chosen from 21 Reading Recovery schools contained 605 children, of whom 145 were characterized as low-achievers, while 588 children formed the collective sample of students in the 21 comparison schools, of whom 147 were identified as low-achievers.

      Study type/Randomization/Intervention:
      The study used a quasi-experimental design. The study compared the literacy attainments in schools where some children received Reading Recovery interventions with attainments in schools where children received alternative interventions. The sample was matched on characteristics at three levels, boroughs (London’s administrative divisions), schools, and children in classrooms. Five London boroughs had Reading Recovery provision in some of their schools (group 1). Five other London boroughs were selected to form the comparison group because they were similar in achievement levels in standardized national tests (group 2). Twenty-one elementary schools were chosen who had an established Reading Recovery program, while the 21 elementary schools forming the control group were “nominated by the borough education officers as of most concern for high numbers of children with poor performance in literacy” (p. 24). "In each of the 42 schools, the eight children considered lowest in literacy formed one sample for comparison, and children in their entire classroom in Year 1 formed the other sample for this evaluation" (p. 24-25).

      Assessment:
      Children in Year 1 classrooms and the lowest-achieving eight children within those classrooms were assessed in each of the 42 schools in September 2005 and again in July 2006.

      Sample characteristics:
      The London boroughs selected for the Reading Recovery and comparison samples are among the lowest achieving in England. In both groups of boroughs, about 8% of 11-year-old children were achieving below the competency of a seven- to eight-year-old. In the 21 Reading Recovery schools, 40% of students received free school meals, compared with 44% in the comparison schools. In the Reading Recovery schools, 49% of the students spoke English as their second language, compared with 48% in the comparison schools.

      Measures:
      Validity of measurements:
      All measures have been used by prior studies. However, no additional validity tests are reported. The Observation Survey and the BAS test were administered by trained research assistants.

      Primary outcomes:

      • Word recognition and phonic skills measure (WRAPS) (classrooms and low-achievers)
      • An Observation Survey of Early Literacy Achievement (low-achievers) measured the following:
        • Concepts About Print
        • Letter Identification
        • Writing Vocabulary
        • Hearing and Recording Sounds in Words
        • Text Reading
        • Book-level
      • Standard Reading Recovery diagnostic (low-achievers)
      • BAS Test to identify word reading age in months (low-achievers)
      • Change in attitudes to learning and self-confidence (CAPSD) – based on teachers’ evaluation (low-achievers)

      Analysis:
      To assess program effects, the intervention and control groups were compared using ANOVA. In the case where significant baseline differences between treatment and control groups emerged, baseline scores were used as controls.

      Intention-to-treat: The study complied with the intent-to-treat principle. For example, if children had recently left the class, research assistants were sent to their new school to administer the tests.

      Outcomes

      Baseline Equivalence: No statistical differences were observed for characteristics at the school level (e.g., free school meals, percentage of children with English as second language) or at the individual level (gender, age) at baseline. Among the outcome measures, a significant difference on book-level was observed comparing the low-achiever groups in the Reading Recovery group to the comparison group. The authors controlled for this difference in the subsequent analysis.

      Differential attrition: All children who had started in the studied classrooms but who had left schools or were absent when the final assessment took place were examined. Their scores were similarly distributed across groups; therefore, the authors concluded that attrition did not bias the analysis.

      Posttest (July 2006):
      Classroom comparison: The WRAPS test indicated that Reading Recovery schools showed stronger (p<.05) progress on both available measures for word reading and phonic skills, compared to the comparison schools.

      Low-achiever comparison: Comparing low-achievers who received Reading Recovery to low-achievers in comparison schools revealed significant (p<.05) differences on all measures of reading and writing skills (book level, concepts about print, letter identification, sounds in words, written vocabulary, BAS age, WRAPS age) with mostly strong effect sizes. For example, in text reading on a gradient of difficulty, children who received Reading Recovery were on average more than 14 book levels higher on the posttest compared to pretest assessment, while comparison group children on average made only 4 book-level gains from an equivalent baseline score. Children who received Reading Recovery were at age appropriate levels across all assessment measures at the end of the evaluation year. Comparison children were not. In addition, a subjective evaluation by classroom teachers indicates that children who had received Reading Recovery compared to the control group showed a significantly (p<.05) better attitude towards learning as measured by oral communication, work habits, social interaction with adults and peers, and self-confidence. No gender effect in the impact of Reading Recovery was observed. Boys and girls attained similar age-appropriate reading levels at the end of the program.

      Limitations

      • Intervention schools had already chosen to use the program, and therefore were self-selected.
      • No randomization was used: The choice of the matching schools seems to bias the estimates since schools were intentionally chosen that are characterized by a large number of low-achieving students.
      • The statistical analysis was conducted at a different level (individuals) than the matching procedure (schools) – the authors failed to use appropriate statistical models to account for this multi-level structure.
      • The study did not evaluate long-term effects of the program.
      • The study did not monitor the fidelity of program implementation.

      Curry, J., Griffith, J., & Williams, H. (1995). Reading Recovery in AISD. Austin Independent School District: Department of Audit and Evaluation.

      The study was conducted in the Austin, Texas, Independent School District (AISD). Unlike the original study (Pinnell, DeFord, & Lyons, 1988), this study compared low-achieving first graders in schools where Reading Recovery was available with those in schools where it was unavailable. No randomization to intervention or control group was employed. This study used only one measure/test to assess the effectiveness of the program. The program was also evaluated among Spanish-speaking students who were instructed with the Descubriendo la Lectura version of the program.

      Evaluation Methodology

      Design:
      Sample size/Attrition:
      A total of 268 Chapter 1 and Chapter 2 first-grade students at 20 schools were eligible to receive Reading Recovery. Only those students whose pretest score (MRT) was at or below the 30th percentile (N=154) comprised the group of students that were used to evaluate the Reading Recovery program in AISD. Reading Recovery students were compared to a control group that was composed of Chapter 1-eligible students who attended other Chapter 1 schools that did not offer Reading Recovery (N=285). In addition, a group of 23 students (that was excluded from the 154 English Reading Recovery students) received the Spanish version (Descubriendo la Lectura) of Reading Recovery. The study reports that out of the 154 program students, 9% withdrew (entered special education or withdrew for other reasons). However, the study failed to investigate whether the group of attritors differed systematically from the group of program completers.

      Study type/Randomization/Intervention:
      This study used a quasi-experimental design. No intentional assignment of treatment and control groups was conducted; rather, the study compared schools in which Reading Recovery was already established to schools in which Reading Recovery was not available to low-achieving students. This approach fails to control for structural differences that led schools to adopt or not adopt the program.

      Assessment:
      Different tests were used at the two assessments. The program’s effectiveness was evaluated by comparing normal curve equivalent (NCE) scores for the pretest and posttest, administered at the beginning and end of the 1993/94 school year.

      Sample characteristics:
      The Reading Recovery group differed on numerous socio-demographic characteristics from the control group. In the Reading Recovery group 59% and in the control group 49% were male. In the Reading Recovery group the majority were Hispanics (59%) while in the control group the majority were African American (61%). Special Education was received by 13% of the students in the Reading Recovery intervention group while this percentage was as low as 4% in the control group. However, both groups showed a high percentage of children coming from low income families (93% and 92%).

      Measures:
      The study relied on two measures that were administered at different assessment points. The measures have been widely used and can be assumed to be valid.

      • Metropolitan Readiness Test (MRT) (fall, pretest)
      • Iowa Test of Basic Skills (ITBS) (spring, posttest)

      The 50th percentile is the average score for both the MRT and ITBS. For Spanish Reading Recovery students, the MRT and La Prueba were used as pre- and posttest respectively.

      Analysis:
      The authors transformed percentile scores to normal curve equivalents (NCEs) for the pre- and posttest comparison. The NCE relates a student’s percentile rank to the normal curve. The national mean NCE is 50 with a gain of 2.0 NCE points considered to be the average expected gain for a school year. No baseline controls were included.
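
      The percentile-to-NCE transformation described above can be sketched as follows. This is a minimal illustration using the conventional NCE definition (mean 50, scaling constant 21.06, chosen so that NCEs of 1, 50, and 99 coincide with the same percentile ranks); the function names are hypothetical, and the source does not report the exact procedure or software the authors used.

```python
from statistics import NormalDist

# Conventional NCE scaling constant (assumption: the study used the standard definition).
NCE_SCALE = 21.06


def percentile_to_nce(percentile_rank: float) -> float:
    """Convert a percentile rank (strictly between 0 and 100) to an NCE score."""
    if not 0 < percentile_rank < 100:
        raise ValueError("percentile rank must lie strictly between 0 and 100")
    # z is the standard normal quantile corresponding to the percentile rank
    z = NormalDist().inv_cdf(percentile_rank / 100)
    return 50 + NCE_SCALE * z


def nce_to_percentile(nce: float) -> float:
    """Convert an NCE score back to a percentile rank."""
    z = (nce - 50) / NCE_SCALE
    return 100 * NormalDist().cdf(z)
```

      Unlike percentile ranks, NCEs form an equal-interval scale, which is why group means and pre/post gains are computed on NCEs rather than on percentiles directly.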

      Intention-to-treat: The study may not have complied with the intent-to-treat principle. Investigators “decided that only the students with a valid pre- and posttest would be studied” (p. 4).

      Outcomes

      Baseline Equivalence: Although substantial differences in socio-demographic characteristics between the intervention and control group exist, the study did not control for these differences.

      Differential attrition: No test for differential attrition was performed. In fact, the study only evaluated differences for students for whom complete pre- and posttest data were available.

      Posttest: The results show that Reading Recovery improved reading skills of low-achieving first-graders. However, Reading Recovery was only effective for students who had been successfully discontinued over the school year. Successfully discontinued students scored higher on the posttest (mean NCE=39.2) than students in the control group (mean NCE=36.7). The grade equivalence for the successfully discontinued Reading Recovery students was 1.6, which is the expected gain for AISD Chapter 1 students. Reading Recovery was not effective for students who were not discontinued at the end of the school year (mean NCE=24.1) or had received fewer than 60 lessons of instruction (mean NCE=24.2).

      Spanish-speaking students who were instructed with Descubriendo la Lectura made the greatest gains of all students. The discontinued Spanish students scored a median percentile of 60.5 as a group on the La Prueba end of year test which places them above the national average of 50%. This shows that the program’s effectiveness is not dependent on the cultural context or language used. In addition, the study found that Reading Recovery is unable to improve reading skills among higher-achieving students.

      Long-term effects: A rank-order form was used to observe how grade-2 students, who were Reading Recovery students in 1992/93, ranked in reading in the year following Reading Recovery instruction. Those students who successfully discontinued Reading Recovery on average placed in the 53rd percentile in their second grade classes. Thus, the program can be assumed to produce lasting improvements in reading skills.

      Limitations:

      • No randomization was employed; the sample was self-selected.
      • Different measures for pretest (MRT) and posttest (ITBS) were used; comparing percentiles of these different measures to evaluate effectiveness of the program is not ideal.
      • Implementation fidelity was not monitored or evaluated.
      • No analysis of differential attrition or baseline equivalence was conducted.
      • Poor methodology and poor reporting.
      • The study did not follow the intent-to-treat principle.
      • Wrong level of analysis: analysis was done at the individual level while group assignment was conducted at the school level.

      Hurry, J., & Sylva, K. (2007). Long-term outcomes of early reading intervention. Journal of Research in Reading 30 (3), 227-248.

      This study was fielded in England and explored the long-term effectiveness of Reading Recovery and a specific phonological training. The study employed a mixed design, combining quasi-experimental with randomized control trial. The strength of this study is its assessment of long-term effects, up to 3 years after posttest.

      Evaluation Methodology

      Design:
      Study type/Sample size/Attrition:
      This study used a mixture of a randomized control trial design and a quasi-experimental design. At the start of the study in 1992, all 24 English schools which had chosen to provide Reading Recovery were initially included in the evaluation. During the intervention year, two schools abandoned Reading Recovery (reason not stated) and were dropped by the researchers from the study. For each Reading Recovery school, the primary schools adviser identified two schools with similar characteristics, which were then randomly assigned to an alternative intervention (phonological training, N=23) or the control group (N=18), resulting in a final sample of 63 schools (QED). In each of these 63 schools, the six poorest Year 2 readers (age 6 years), approximately the bottom 20% of readers, were selected on the basis of their performance on a diagnostic survey. In the 22 Reading Recovery schools, the 4 poorest scorers among selected children were offered intervention, the remainder being assigned to a within-school control condition (QED). In each of the 23 phonological training schools, the pre-identified six poorest readers were randomly assigned to phonological training (n=4) or to a within-school control condition (n=2) (RCT). In the remaining 18 control schools, the pre-identified six poorest readers formed the control group.

      Intervention:
      The Reading Recovery intervention, which includes reading of graded texts, word-level phonics work and writing, was delivered in standard form by trained teachers (employed by the particular school). Children received on average 21 weeks of intervention, with an average of 77 sessions. Eighty-nine percent of the children made sufficient progress to be discontinued. The phonological training intervention involved sound awareness training and word building with plastic letters and was delivered by trained teachers that belonged to the research team (not affiliated with the particular school). Each child was given forty 10-minute individual sessions, spread over 7 months. The control group received the standard provision available in their school. As weak readers, they often received extra, specialized help with reading, on average 21 minutes weekly.

      Assessment:
      Children were pre-tested on a battery of reading tests in September/October 1992, before the start of intervention (pretest). Short-term gains were assessed in June/July 1993 after the interventions were completed (posttest). Medium-term gains were assessed 1 year later, in May/July 1994. Long-term effects were assessed 3 years later in September/December 1996, when children were in Year 6 (final year of primary school).

      Sample characteristics:
      Boys were overrepresented at 61% of the sample (class average = 52% boys). About 42% of the sample was receiving free school meals (class average 32%); 16% spoke English as a second language (class average 17%). The groups were well matched on these demographic factors with no significant differences.

      Measures:
      Validity of measurements:
      All measures have been used, tested, and evaluated in prior studies. In addition, the researchers who administered the tests were blind to the group assignment of the children.

      Primary outcomes:
      Children were assessed on standardized reading tests, sensitive to the skills addressed by both interventions. A different battery of tests was applied at each measuring point:

      Pretest and posttest:

      • British Ability Scale (BAS) Word Reading test
      • Neale Analysis of Reading test
      • Book Level
      • Clay’s (1985) Diagnostic Survey
      • Oddities test (alpha=.83)

      An overall measure of reading and spelling was calculated by summing z-scores for the Diagnostic Survey, Book Level, BAS Word Reading and the Neale Analysis of Reading, and transforming again into a z-score.

      1-year follow-up:

      • BAS Word Reading test
      • Neale Analysis of Reading test
      • Oddities test
      • BAS Spelling test
      • Graded Non-word Reading test

      An overall measure of reading and spelling was calculated by summing the z-scores for BAS Word Reading, the Neale and BAS Spelling and transforming again into a z-score.

      3-year follow-up:

      • NFER-Nelson Group Reading Test
      • Parallel Spelling Test

      An overall measure of reading and spelling was calculated by summing the z-scores for reading and spelling and transforming again into a z-score.
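
      The composite measures described above (sum the per-test z-scores, then standardize the sum again) can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the choice of the population standard deviation is an assumption, since the source does not say which variant the authors used.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of scores to mean 0 and standard deviation 1."""
    m = mean(values)
    s = pstdev(values)  # assumption: population SD; the source does not specify
    if s == 0:
        raise ValueError("cannot standardize scores with zero variance")
    return [(v - m) / s for v in values]


def composite_z(*tests):
    """Sum each child's z-scores across tests, then re-standardize the sums.

    Each argument is a list of raw scores for one test, with children
    listed in the same order in every list.
    """
    if len({len(t) for t in tests}) != 1:
        raise ValueError("every test must have one score per child")
    standardized = [z_scores(t) for t in tests]
    # zip(*...) regroups the per-test lists into one tuple of z-scores per child
    sums = [sum(child) for child in zip(*standardized)]
    return z_scores(sums)
```

      Standardizing each test before summing keeps a test with a wide raw-score range from dominating the composite; re-standardizing the sum puts the overall measure back on the same scale as its components.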

      Analysis:
      The study used regression analyses to estimate differential treatment effects on reading/spelling outcomes, controlling for baseline scores. The study did not use multilevel models but justified this choice by stating that preliminary analyses found “between-school variation to be very small” (p. 237). All children receiving Reading Recovery were included in the analyses, irrespective of their discontinued status.

      Intention-to-treat: It is hard to determine definitively whether the study complied with the intent-to-treat principle. No attrition is reported, but the study appears to use the information of all children assigned to treatment and control groups.

      Outcomes

      Implementation fidelity: Program fidelity was monitored by the senior research officer who observed each member of the team during the program implementation stage. The researchers recorded the content of every lesson, for every child, at this stage.

      Baseline Equivalence: At pretest significant differences in the overall reading/spelling scores were observed comparing the intervention groups to the control group with the intervention groups doing worse than the control group (p. 234). To address this issue, the study controlled for baseline reading ability in the statistical models.

      Differential attrition: Attrition is not discussed in the text and no test for differential attrition was reported.

      Posttest:
      Reading Recovery: At posttest, Reading Recovery children had made substantially more progress than both their within- and between-school controls on all measures of reading and spelling and on the overall measure, except for the Oddities Test (which is a test specifically measuring phonological abilities). For example, Reading Recovery children scored significantly higher on the BAS Word Reading test (b=1.2; p<.001; d=.81) compared to the within-school control group. For the within-school comparison, 5 out of 6 tests were significant, while for the between-school comparison all tests were significant.
      Phonological Training: For the phonological training, the effects were more mixed and generally weaker. For the within-school comparison, only 1 out of 6 tests reached significance, while for the between-school comparison, 2 out of 6 tests were significant. For example, children in the intervention group scored significantly higher (b=.3; p<.01; d=.30) on the Diagnostic Survey than did children in the control schools. However, no significant results were observed for the overall (reading/spelling) measure.

      Long-term effects:
      1-year follow-up
      Reading Recovery: One year after children had graduated from Reading Recovery, they were still significantly ahead of their between-school controls on all measures (except for the Oddities test). Out of 6 tests 5 were significant. However, the effect size had decreased substantially for all measures (e.g., BAS Word Reading test: d=.84 vs. d=.41). In addition, the within-school comparison did not produce any significant differences between intervention and control group.
      Phonological training: The between-school comparison revealed that children who had received Phonological Training one year previously had now made significantly more progress overall, in reading and spelling, as well as phonological skills. All 6 tests were significant. However, there were no significant differences between the Phonological Training children and their within-school controls on any test, including the Oddities test, which directly assesses the phonological intervention focus.

      3-year follow-up
      Reading Recovery: Out of 3 significance tests (reading, spelling, overall), none was significant for either the within- or between-school comparisons.
      Phonological training: In the between-school comparison, positive effects were sustained with a significant (but weak) effect of phonological training on spelling and the overall measure for reading/spelling. However, no significant within-school difference was observed.

      Limitations

      • No information on attrition was reported and no analysis of differential attrition was performed.
      • At the school-level no true random assignment was conducted and at the classroom-level only the phonological intervention but not Reading Recovery was randomly assigned.
      • A small within-schools control group was used (N=2).

      Center, Y., Wheldall, K., Freeman, L., Outhred, L., & McNaught, M. (1995). An evaluation of Reading Recovery. Reading Research Quarterly, 30 (2), 240-263.

      This study differs from the original study (Pinnell, DeFord, & Lyons, 1988) in that it used a randomized control approach and was fielded in New South Wales, Australia. The study assessed long-term program effects only up to 12 months after posttest. A further problem is the method of replacing discontinued intervention students with control group students, which diminished the size of the control group across assessment points.

      Evaluation Methodology

      Design:
      Sample selection and size:
      In January 1991, the 10 schools in the New South Wales (NSW, Australia) metropolitan area, which routinely offered Reading Recovery (RR) to low-achieving first-grade students, agreed to participate in the evaluation. The NSW Department of School Education selected five additional schools (where RR was not in operation), matched as closely as possible to the experimental schools in terms of educational region, socioeconomic level, and size. As only four schools could be obtained from the two educational regions by this method, a fifth school, located in a different region but matched for size and socioeconomic level, was also included as a comparison school. In the 10 RR schools, teachers identified the 20 children at greatest risk of reading failure and used the Clay Diagnostic Survey to select the 12 lowest achieving students for participation in the study.

      Study type/Randomization/Intervention:
      The study used a randomized controlled trial set-up. Eight children in each of the 10 schools were randomly assigned to two groups, the experimental (n=40) and the control (n=40), while 8 children in each comparison school formed the third group (n=40). The remaining 4 children in each school from the initial pool of 12 children were randomly assigned to a holding group. These children progressively replaced the experimental group children in RR upon the latter's discontinuation. However, these children are not included in the analysis.
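
      The within-school assignment described above can be sketched as follows. This is a minimal illustration, not the authors' procedure: the source does not say how the randomization was carried out, so the single shuffle into three equal groups, the function name, and the pupil identifiers are all assumptions.

```python
import random


def assign_within_school(pupil_ids, rng, group_size=4):
    """Randomly split one school's 12 lowest-scoring pupils into three groups.

    Returns a dict with the experimental, control, and holding groups,
    each of size `group_size`.
    """
    pupils = list(pupil_ids)
    if len(pupils) != 3 * group_size:
        raise ValueError("expected exactly three groups' worth of pupils")
    rng.shuffle(pupils)  # random order; slices below are then random groups
    return {
        "experimental": pupils[:group_size],
        "control": pupils[group_size:2 * group_size],
        "holding": pupils[2 * group_size:],
    }


# Example with hypothetical pupil identifiers 0..11 and a fixed seed for reproducibility
groups = assign_within_school(range(12), random.Random(1991))
```

      Repeating this for each of the 10 Reading Recovery schools yields the 40 experimental and 40 control children reported above, plus 40 holding-group children who are not analyzed.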

      Children in the control group were able to take advantage of any support in reading typically available at each school until they entered the program.

      Attrition:
      There was substantial attrition of students in all groups resulting from students changing schools, illness factors, being withdrawn from the program prior to assessment, or ceasing to be controls by entering the experimental group. For the experimental group retention rates were 78%, 70%, 58%, 58% for pretest, posttest, first follow-up, and second follow-up, respectively. Retention rates for the control group (same order) were 98%, 85%, 78%, 40% and for the comparison group 98%, 90%, 88%, 80%.
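
      The retention rates above can be compared across groups with a few lines of code. This sketch only restates the reported percentages and computes the gap, in percentage points, between two groups at each assessment; the variable and function names are illustrative, and the source reports no formal test of differential attrition for this study.

```python
# Retention rates (%) as reported, in assessment order
WAVES = ["pretest", "posttest", "first follow-up", "second follow-up"]
RETENTION = {
    "experimental": [78, 70, 58, 58],
    "control": [98, 85, 78, 40],
    "comparison": [98, 90, 88, 80],
}


def retention_gap(group_a, group_b):
    """Percentage-point difference in retention (group_a minus group_b) per wave."""
    return [a - b for a, b in zip(RETENTION[group_a], RETENTION[group_b])]


def attrition_rates(group):
    """Attrition (%) per wave, i.e. 100 minus the retention rate."""
    return [100 - r for r in RETENTION[group]]


for wave, gap in zip(WAVES, retention_gap("experimental", "control")):
    print(f"{wave}: experimental minus control = {gap} points")
```

      The sign of the gap flips at the second follow-up, when control retention falls to 40%, consistent with control children leaving that group as they entered the program.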

      Assessment:
      Children’s reading and writing skills were assessed at pretest (March 1991), posttest (June/July 1991), and at 3-month (October/November 1991) and 12-month (June 1992) follow-ups.

      Sample characteristics:
      Children were about 6 years of age at pretest. No additional information regarding gender, race, SES, etc. is provided.

      Measures:
      Children were tested by trained research assistants and not by the teachers. However, it is not clear if the research assistants were blind to the treatment conditions.

      Primary outcomes:
      The Burt Word Reading Tests and the Clay Diagnostic Survey, including the following tests, were administered:

      • Book level
      • Letter Identification
      • Concepts about Print
      • Word Tests
      • Writing Vocabulary
      • Dictation

      A second set (Set 2 tests) of tests comprising the following six standardized and criterion-referenced tests, was also administered:

      • Neale Analysis of Reading Ability-Revised
      • Passage Reading Test
      • Waddington Diagnostic Spelling Test
      • Phonemic Awareness Test (Test-retest coefficient = .91)
      • Syntactic Awareness (cloze) Test
      • Word Attack Skills Test (Test-retest coefficient = .93)

      Analysis:
      A multivariate analysis of variance (MANOVA) over repeated measures was employed. Significant multivariate results (F-statistic; alpha = .05) were followed up by univariate pairwise multiple comparisons (alpha = .01). The study controlled implicitly for baseline scores, since it tested a group-by-time interaction.

      Intention-to-treat: The study may not have followed the intent-to-treat principle. Three students were lost from the Reading Recovery group between pretest and posttest because one student changed schools, one was ill, and one was withdrawn due to poor progress.

      Outcomes

      Implementation fidelity: To check implementation fidelity, each Reading Recovery teacher was systematically observed for one session with each of the 4 students in April 1991. To investigate whether teachers had altered their general theoretical approach to teaching over the implementation phase of Reading Recovery, the Theoretical Orientation to Reading Profile was administered (a test for spillover effects). The teachers did not change their teaching style, so a spillover effect is unlikely to have biased the results.

      Baseline Equivalence: Multiple comparisons indicated that there were no significant differences between the experimental and control group on any literacy measure at the pretest stage.

      Differential attrition: No analysis of differential attrition was performed by the authors.

      Posttest: Overall, Reading Recovery was effective in impacting children’s reading skills at posttest as revealed by a significant group-by-time interaction (F=4.44; p<.001) in the MANOVA model. The discontinued Reading Recovery students outperformed control students and made significantly greater gains on Burt and Clay book level tests (p<.001) and on all the Set 2 tests (p<.001) apart from two (cloze test and Phonemic Awareness Test). Thus, 6 out of 8 tests were significant.

      Long-term effects:
      3-month follow-up
      The MANOVA revealed a significant group-by-time interaction (F=4.44; p<.001) across outcome variables. Multiple comparisons showed that the experimental group was continuing to maintain its superiority on Burt and Clay book level tests and most of the Set 2 tests (p<.001). However, two tests of metalinguistic skills, the Cloze test and the Word Attack Skills Test failed to reach significance. Thus, out of 8 tests 6 were significant. However, the effect sizes indicate that compared to posttest, there was a diminution in effect size for all literacy tests.

      12-month follow-up
      A MANOVA performed on the Reading Recovery group and the control group revealed no overall significant group effect (F= 0.262, p = .0268). The univariate results indicated that only one out of eight outcome measures was marginally significant at the .01 level with the Reading Recovery group having a higher book-level score than the control group. However, the authors point out that this lack of significance might be an artifact of the small numbers remaining in the control group (40% of students). Those remaining in the control group were “probably the more skilled readers” (p.253).

      Limitations

      • Selection bias: The study selected schools that had already implemented Reading Recovery, which might in general be more proactive and progressive in their approach to helping low-achieving students.
      • The diminishing size of the control group across evaluation points, due to control children replacing intervention children in the program, poses problems for robust statistical tests.
      • The study does not account for clustering at the school level.
      • The study may not have followed the intent-to-treat principle.
      • No analysis of differential attrition was performed.
      • Poor reporting on sample characteristics.
      • It is not clear if the research assistants were blind to the treatment conditions.

      Escamilla, K. (1994). Descubriendo la Lectura: An early intervention literacy program in Spanish. Literacy, Teaching, and Learning, 1 (1), 58-70.

      This study's main goal was to test whether the Spanish version of Reading Recovery (Descubriendo la Lectura) had the same beneficial impact on low-achieving Spanish-speaking first-graders as was shown in the original study (Pinnell, DeFord, & Lyons, 1988). However, no long-term program effects were investigated. The assessment tools employed in this study were limited to the Observation Survey and the Aprenda Reading Achievement Test.

      Evaluation Methodology

      Design:
      Sample size/Attrition:
      Subjects eligible for study participation were all first grade, Spanish-speaking students (N=180) from six elementary schools in a large urban Southern Arizona school district. All eligible students received their initial literacy instruction in Spanish. In October 1991, all 180 students were given the Spanish version of the Reading Recovery Observation Survey. Based on these data, students who were in the bottom 20% were identified. Four out of the six schools had the Descubriendo la lectura (DLL) program. In these 4 schools, 50 students were identified as low-achievers of which 23 students were selected to receive the program. In the 2 schools that did not offer a Descubriendo la lectura program, children were selected from among the lowest 20% to form a control group (N=23). From the six schools in the study, all students not identified as program students or control group (N=134), were assigned to the comparison group. No attrition is mentioned in the article.

      Study type/Randomization/Intervention:
      This study used a quasi-experimental design. No randomization procedure was used to assign students to any of the three groups. Students in the program group received the Spanish version of Reading Recovery (Descubriendo la lectura, DLL). DLL has been pilot tested and closely follows the English version of Reading Recovery. Trained teachers provide at-risk children with daily 30-minute tutoring sessions in which the child reads skill-appropriate books and writes short texts. Lessons are designed to actively involve children in their own learning. Children are guided to think and solve problems while reading. Teachers provide support, but the children do the work and solve problems.

      Assessment:
      In October 1991, all 180 students were pretested. The posttest was administered to all 180 students at the end of the school year in May 1992. No assessment of long-term effects was conducted.

      Sample characteristics:
      All subjects were dominant Spanish speakers with only limited English proficiency. No additional information is provided on sex, race, or SES characteristics of the sample.

      Measures:
      Validity of measurements:
      A number of published studies found the Spanish construction of the Observation Survey to be valid and reliable. However, a problem might be that the teachers who administered the intervention also did the testing.

      Primary outcomes:
      The Spanish version of the Reading Recovery Observation Survey was used as the main tool to assess progress in reading skills. The Observation Survey consists of the following tests:

      • Letter identification
      • Word test
      • Concepts about print
      • Writing vocabulary
      • Dictation
      • Text reading

      In addition, two versions of the Aprenda Reading Achievement Test were administered at pretest (Nivel Preprimario – Subtests 2, 3, 4, and total reading) and posttest (Nivel Primer Nivel Primario - Subtests 2, 3, and total reading).

      Analysis:
      Statistical methods/baseline control:
      Mean pre- and posttest scores were compared across groups using t-tests (this simple test does not allow for the inclusion of baseline scores). Because different forms of the Aprenda Achievement Test were used at pre- and posttest, students' raw scores were standardized. In the analysis, the program group comprises all students who completed at least 60 lessons, including both successfully discontinued and not-discontinued students.
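
      The write-up does not say how the raw scores were standardized. Purely as an illustration, the Python sketch below shows one common approach: z-scoring each test form against its own mean and standard deviation, so that scores from different forms share a scale. The score lists are made-up values rather than study data, and `standardize` is a hypothetical helper.

```python
from statistics import mean, pstdev


def standardize(raw_scores):
    """Return z-scores: (score - mean) / standard deviation of the same form."""
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)
    if sigma == 0:
        raise ValueError("scores have no variance; z-scores are undefined")
    return [(x - mu) / sigma for x in raw_scores]


# Hypothetical raw scores on two different test forms (not study data).
pretest_form = [10, 12, 14, 16, 18]
posttest_form = [30, 35, 40, 45, 50]

pre_z = standardize(pretest_form)
post_z = standardize(posttest_form)
```

      After this transformation each form has mean 0 and standard deviation 1, so a student's relative standing can be compared across forms even though the raw scales differ.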

      Intention-to-treat: The study appears to follow the intent-to-treat principle.

      Outcomes

      Baseline Equivalence: There were significant differences at baseline (October 1991) between the program group and the control group on a number of test scores. The authors made no effort to control for these differences in their statistical comparison of posttest results.

      Differential attrition: Attrition is not mentioned in the study.

      Posttest:
      Program vs. comparison: At the end of the intervention (May 1992) the program group had not only caught up to the comparison group (average students), but had surpassed them on many measures. At posttest, program students outperformed comparison students on four out of six observation tasks (differences were not significant for text reading and dictation).

      Program vs. control: Posttest results also indicated that there were statistically significant differences between the program group and control group on all six observation tasks, with the program group significantly outperforming the control group (p<.05) on all measures. For example, on the written vocabulary test, program children scored almost twice as high (48.5 vs. 25.7; p<.001) compared to control group children.

      The improvement is also reflected in standardized gain scores for the Aprenda Spanish Achievement Test. Relating pretest to posttest results showed that the program group went from the 28th percentile to the 41st percentile while the control group went from the 26th to the 28th percentile. The 50th percentile can be considered an indicator of the national average and thus, the program group was approaching this national average. At the individual level, 91% of the program students achieved end-of-year scores on all six observation tasks that either equaled or exceeded the average.

      Control vs. comparison: Comparing the control group to the comparison group shows that control group children also made gains in reading/writing skills over the academic year. However, the control group did not catch up to the comparison group while the program group did.

      Long-term effects: No long-term effects were investigated by this study.

      Limitations

      • No long-term effects were assessed.
      • Poor description of sample characteristics.
      • Even though substantial differences in baseline scores were observed the study did not account for these differences in the statistical analysis.
      • Attrition is not mentioned in the study and a test of differential attrition was not performed.
      • Poor statistical methodology: The study did not account for clustering at the school level; it only compares means at pre- and post-test without taking baseline values into account.
      • Observers were not blind to condition: Teachers who administered the intervention also did the testing.
      • No random group assignment was performed.
      • Implementation fidelity was not monitored.

      Baenen, N., Bernholc, A., Dulaney, C., & Banks, K. (1997). Reading Recovery: Long-term progress after three cohorts. Journal of Education for Students Placed at Risk, 2 (2), 161-181.

      This study was implemented in the Wake County Public School System (WCPSS) in Raleigh, North Carolina. It investigated the impact of Reading Recovery for three cohorts of first-grade students (1990-91, 1991-92, and 1992-93). Like the original study (Pinnell, DeFord, & Lyons, 1988), this study examined long-term effects of Reading Recovery, though using an indirect measure (need for additional services after program completion). Of all the studies reviewed, this one has the most limitations and reports its methodology and results poorly. The study used a randomized controlled trial setup, but only for one of the three cohorts studied.

      Evaluation Methodology

      Design:
      Recruitment:
      The study appears to have used a quasi-experimental design, but provides no information on the selection process. The study was fielded in the Wake County Public School System (WCPSS) in Raleigh, North Carolina. It provides no clear description of sample size or attrition. However, from the reported figures in different tables it can be inferred that the study was conducted with students from 30 schools in which Reading Recovery was established.

      Sample size/Attrition:
      The study investigated the success of Reading Recovery across three cohorts, 1990-91, 1991-92, 1992-93. For each cohort a different study setup was used (note that the sample sizes for the various groups reported below come from Table 5, p. 170). For 1990-91 the Reading Recovery group (N=72) was compared to a control group (N=75). In the schools that had an established Reading Recovery program, half of the students were randomly assigned to the intervention and half to a control group. For the 1991-92 cohort, no randomized control group was assigned; instead, the Reading Recovery children (N=135) were compared to a “comparison group” (N=86), which comprised the lowest readers in schools that did not offer Reading Recovery (no information about the selection or number of the comparison schools is provided). For the third cohort, 1992-93, neither a control group nor a comparison group was used, and thus only results for the Reading Recovery intervention group (N=244) were reported. As for attrition, it appears that more students received Reading Recovery than were used in the statistical analysis, which might suggest attrition at best or arbitrary selection at worst. If we use the “total students served” figures (Table 3, p.166), the following retention rates can be calculated for the Reading Recovery intervention group: 86% (1990-91 cohort); 92% (1991-92 cohort); 98% (1992-93 cohort).

      Study type/Intervention:
      Based on the information provided above, the study might be considered a randomized controlled trial (at least for the 1990-91 cohort). The usual Reading Recovery intervention was implemented, based on daily 30-minute one-on-one tutoring sessions by trained teachers. Children were discontinued if they performed within the average range for their first-grade peers. A full program is generally considered 60 lessons, although sometimes the number of lessons will vary depending on students’ progress.

      Assessment:
      Reading and writing skills of children were assessed using a pretest (beginning of academic year) and a posttest (end of academic year). Selective measures at the end of the second and third academic year were used to investigate long-term program effects.

      Sample characteristics:
      Characteristics of the sample were not reported.

      Measures:
      Validity of measurements:
      The study used the validated Clay Observation Survey. It is not clear who conducted the testing (potentially not blind to conditions).

      Primary outcomes:
      The Clay Observation Survey was used to evaluate program progress. The Clay Observation Survey uses the following measures:

      • Letter identification
      • Word test
      • Concepts about print
      • Written vocabulary
      • Dictation test
      • Text reading level

      In addition, the North Carolina EOG test in reading was used to investigate long-term effects.

      Also, student need for special education, Chapter 1 service, and grade retention were measured to evaluate long-term effects of Reading Recovery.

      Analysis:
      Statistical methods/baseline control:
      The study used basic statistics such as chi-square tests or Fisher’s exact test to compare results for the program vs. control group. Baseline controls were not used.

      Intention-to-treat: Due to poor reporting of the methodology, it is not possible to judge the study’s adherence to the intent-to-treat principle.

      Outcomes

      Baseline Equivalence: Baseline scores for the program and control groups were similar in 1990-91 but differed between the treatment and comparison group for the 1991-92 cohort. No effort was made to control for these differences.

      Differential attrition: Attrition was not reported by the authors and no statistical analysis of differential attrition was performed.

      Posttest: Only for the 1990-91 cohort was a comparison between intervention and control group possible. For this cohort, Reading Recovery students showed greater mean short-term gains than control students on three of the six measures of the Clay Observation Survey (writing vocabulary, dictation, text reading). This improvement is also reflected in the observation that a higher percentage of Reading Recovery students scored in the first-grade average band than the control group on the same three measures (80% vs. 45% for writing vocabulary, 61% vs. 35% for dictation, and 49% vs. 15% for text reading).

      Similar results were obtained for the 1991-92 cohort. Reading Recovery students showed higher scores on all measures at posttest compared to the comparison group (recall that the comparison group comprised low-achievers from schools without access to Reading Recovery).

      Long-term effects: The positive effects of Reading Recovery appear to become lost over time. To measure long-term effects the study did not use the Clay Observation Survey but rather relied on measures for the need of additional services (e.g. special education) after receipt of the full Reading Recovery intervention. A small program benefit was observed one year after posttest at which point Reading Recovery children were less likely to need Chapter 1 service compared to the control group. However, after two years, Reading Recovery students were as likely to be retained in grade, placed in special education, or to receive Chapter 1 services, as the control group children.

      In addition, the North Carolina EOG Reading test was used to evaluate long-term effects of Reading Recovery. No statistically significant difference was observed between the program and control group two years after intervention.

      Limitations

      • The characteristics of the control group are not described.
      • The selection criteria for the comparison group and associated schools are not described.
      • The study does not adequately report sample sizes, sample characteristics, and attrition.
      • The methodology is poorly reported.
      • No use of baseline controls.
      • Implementation fidelity was not monitored.
      • No randomized control group was available for the 1991-92 and 1992-93 cohorts.
      • No comparison or control group was available for the 1992-93 cohort.
      • No effort was made to adjust for baseline non-equivalence between intervention and comparison group for the 1991-92 cohort.
      • It is not possible to judge adherence to the intent-to-treat principle due to poor reporting of the methodology.
      • No control for clustering at the school level.
      • Individuals who did the testing might not have been blind to students’ group assignment.

      Schwartz, R. M. (2005). Literacy learning of at-risk first-grade students in the Reading Recovery early intervention. Journal of Educational Psychology, 97 (2), 257-267.

      Evaluation Methodology

      Design:
      This randomized controlled trial of the Reading Recovery early intervention ran for the length of one academic year. The analysis sample comprised n=148 first-grade students from schools across 14 states in the U.S. Pupils fell into one of four groups -- two groups of at-risk children randomized to receive the program in either the first or second half of the school year (where those receiving RR in the second round served as a control) and two non-randomized comparison groups of high- and low-average students, respectively, selected by the intervention teachers to provide additional points of comparison and neither of which received RR.

      The evaluation sought to identify whether first-round intervention students made greater gains in reading development over those second-round intervention students who were yet to receive the intervention (thus acting as a control group).

      Forty-seven Reading Recovery teachers selected all children involved in the study, with n=2 students per teacher selected for the two intervention groups (randomized to treatment or control respectively, where treatment students receive the program in the first part of the academic year and control students receive it in the second part of the academic year) and n=2 students selected for the two non-randomized comparison groups (comprising 2 additional students from the same classroom, considered to be 'high-average' and 'low-average'). All students selected by a particular teacher were from the same classroom.

      The selection procedure for the students in each of the randomized intervention and non-randomized comparison groups began initially with the normal selection procedure for Reading Recovery (reference given, pg. 261). After this, the RR teacher identified the lowest 20% to 30% of their students for assessment on six tasks from Clay's Observation Survey (to assess reading and writing ability). The lowest three students were allocated to the program (they were not part of randomization). The fourth and fifth lowest children in the class were selected for randomization to receive the program in either the first half of the year (program, or 'first round') or the second half of the year (control, or 'second round'). The rationale for this approach is that each RR teacher has four half-hour slots, so the three lowest-performing students get the first three slots, and the fourth (last) slot is decided by random allocation.

      Two additional students from the same classroom were identified to participate in each of the three assessments. These students were selected on the basis of the classroom teacher's ranking and available assessment information as a high-average and low-average reader. The high-average child was from the middle of the teacher's rankings after the students expected to receive RR service were removed. The low-average child was the lowest student in the class who was not expected to receive RR service.

      Measurements were taken at three separate time points: pre-intervention, mid-year (at the end of round one of intervention, also referred to as 'transition') and at post-second-round intervention (i.e. at the end of the first grade academic year - usually 2 weeks before the end of the school year). Teachers who had missing data for mid-year testing were excluded from the analysis, so only data from 37 of the original 47 teachers were included in the analysis. It is not stated but this presumably means that the total potential sample was n=188 (47 teachers x 4 pupils) whereas the analysis sample was n=148 (37 teachers x 4 pupils) - a loss of 21%.
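
      The inferred sample sizes and the 21% loss can be checked with a short calculation. The Python sketch below assumes, as this write-up does, that every teacher contributed exactly four pupils; the function names are illustrative.

```python
def expected_sample(teachers, pupils_per_teacher=4):
    """Total pupils if every teacher contributes the same number of pupils."""
    return teachers * pupils_per_teacher


def percent_loss(initial, retained):
    """Share of the initial sample lost, as a percentage."""
    return 100 * (initial - retained) / initial


potential = expected_sample(47)  # all teachers who selected children
analysed = expected_sample(37)   # teachers with complete mid-year data
loss = percent_loss(potential, analysed)
print(f"{potential} -> {analysed}: {loss:.0f}% lost")
```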

      Midyear ('transition point') measurements were taken either when the student was judged to have met the criteria to terminate the intervention (average level of literacy performance for his/her class, plus also demonstrating a particular set of strategies known to increase the chances of continued progression), or at the end of the 20th week of intervention (if adequate progression had not taken place). Generally, students ended their program participation after between 12 and 20 weeks of intervention sessions.

      The Reading Recovery teachers administered most of the measures themselves apart from the Observation Survey (used to decide upon discontinuation), which the RR program specified must be carried out by another trained teacher.

      The intervention was provided alongside standard classroom literacy instruction and any other additional literacy support provided by the school. The authors do not specify the content of the instruction/support that control participants had or any additional support that the intervention (round one) students may have had.

      The authors do not state explicitly whether there were equal numbers of participants in each of the four groups (Ns differ for each of the data tables).

      Sample:
      The analysis sample totaled n=148 first-graders from schools across 14 different US states. The sample was 53% male and 47% female, with lunch subsidy data (only available for n=107) indicating that 43% received free school lunches, 8% received reduced-price lunches, and 49% received no lunch subsidy. The racial and ethnic breakdown of the sample was 46% White, 40% African American, 12% Hispanic-Latino, and 2% Asian. No demographics were provided for the teachers involved in the study.

      Measures:
      The evaluation sought to measure whether RR improved a variety of reading and writing knowledge and skills related to literacy learning. A number of measures were used to capture different aspects of literacy development.

      Six measures taken from An Observation Survey of Early Literacy Achievement were used to assess reading and writing knowledge. These six measures were: (1) the text level task (book reading), (2) letter identification task, (3) concepts about print task, (4) Ohio Word Test, (5) writing vocabulary task, and (6) hearing and recording sounds in words task. Reliability statistics ranged from .62 - .98 (r and alpha) and intercorrelations for the tasks ranged from .554 to .894. All tasks had updated norms. Validity and discrimination data are provided in Clay (2002). All six measures were completed by teachers and were carried out at the beginning of the year (pretest), at the transition from first- to second-round of intervention service, and at the end of the school year (two weeks before the end).

      Teachers submitted a data summary for each child at each test period. They did not submit item information on each task, so reliability estimates for the research sample could not be calculated.

      The following additional measures of literacy were also used: The Phoneme Segmentation Test; The Deletion Task (10-item version of the Roser [1975] task); The Slosson Oral Reading Test-Revised; and The Degrees of Reading Power Test. However, no data was captured at pretest for these measures; it was only available at time points 2 and 3 (midyear and posttest). This means that no true pretest-posttest change scores were captured for intervention (first-round) group vs control (second-round) group for these measures. The results are therefore not reported here.

      Analysis:
      For each of the Observation Survey measures, a 4 (group) x 3 (test period) repeated measures ANOVA was conducted to examine intervention effectiveness. A significant Group x Test Period interaction for the Observation Survey variables was followed by a simple effects analysis among groups at each test period. The key test compared the randomized treatment and control group at the transition period. Effect sizes (Cohen's d) were calculated only for significant simple comparisons between the two randomized groups at the transition period. These were calculated as the mean difference between groups divided by the pooled standard deviation.
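
      The effect size calculation described above can be sketched in Python as follows. The write-up states only that the mean difference was divided by the pooled standard deviation; the specific pooling formula used here (sample variances weighted by degrees of freedom) is the conventional one and is an assumption, as are the function name `cohens_d` and the made-up score lists.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Mean difference divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)
    ) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Hypothetical mid-year scores for two small groups (not study data).
round_one = [12, 14, 16, 18, 20]
round_two = [8, 10, 12, 14, 16]
d = cohens_d(round_one, round_two)
```

      A positive value indicates that the first group's mean is higher; swapping the arguments flips the sign.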

      Outcomes

      Implementation Fidelity: Implementation fidelity was not discussed by the authors.

      Baseline Equivalence and Differential Attrition: The simple comparisons between groups demonstrated baseline equivalence on all outcome variables between the round-one (intervention) group and round-two (control) group at pretest. However, the study failed to test for baseline differences by sociodemographic characteristics, despite some large differences. For example, the treatment RR group consisted of 61% males and 38% whites, while the control RR group consisted of 41% males and 47% whites.

      Attrition rates were not provided by group, nor were tests done on differences in attrition by baseline characteristics.

      Posttest: The analysis for each of the Observation Survey measures resulted in a significant Group x Test Period interaction. Simple comparisons were therefore carried out and displayed significance for each of the variables, in favor of the intervention (round-one) group: Text Level, F(3, 129) = 22.77, p< .005; Letter ID, F(3, 129) = 7.54, p< .005; Ohio Word Test, F(3, 129) = 16.59, p< .005; Concepts About Print, F(3, 129) = 8.70, p< .005, Writing Vocabulary, F(3,129) = 6.67, p< .005; and Hearing and Recording Sounds in Words (HRSW), F(3, 129) = 10.29, p< .005.

      Effect sizes (d) were also provided for most of the variables, as follows: Text Level = 2.02; Ohio Word Test = 1.38; Concepts About Print = 1.10; Writing Vocabulary = .90; and HRSW = 1.06.

      Overall results for the two groups show that 65% completed early, with 16% reported as "incomplete". Interestingly, all of the "incomplete" program students came from the round-two group, with all but one of the "early completers" coming from the round-one group. This raises the question whether or not the groups were truly equivalent, or if there existed some fundamental difference not captured at the pretest stage. This could account for the unusually large effect sizes.

      Another point of note is that it is unclear whether the authors used comparisons of change scores in their initial calculations, as it looks as though a straight comparison between scores at midyear (one time point) has been used. This would mean that pretest scores were not factored in at all (although equivalence was demonstrated at pretest).

      No results relating to comparisons with the two non-randomized groups are displayed here, due to the fact that they were not equivalent to the randomized groups at pretest.

      Long-Term: No data was captured to assess any possible long-term effects of the program.

      No dose-response or mediation analyses were conducted.

      Limitations

      • Small group sizes.
      • Randomization of only two participants to two groups each time.
      • The teacher both delivered the intervention and administered the assessments.
      • Results for recommended completers and non-completers suggest groups were not equivalent.
      • Covariates such as race/ethnicity, gender, SES, age, and pretest data were not controlled for, in addition to the fact that clustering at the school level was not taken into account.
      • Unclear whether change scores were used in the analyses.
      • No information on differential attrition.
      • Missing data was not accounted for in the analysis, as the datasets of teachers missing any posttest data were excluded from any analyses.
      • Neither the content of standard lessons (i.e. for control) nor any additional services that participants may have received are described.
      • Equivalence across demographics for each group not demonstrated despite some apparently large differences by gender and race.
      • Implementation fidelity was not measured or discussed.
      • No assessment of long-term impact.
      • Possible selection bias as schools were already implementing the Reading Recovery program.
      • Group Ns change at each time point.
      • Pre-randomization selection procedure "varied across sites" - possible selection bias.
      • Self-selection of RR teachers (who volunteered to take part) - they may have been more motivated about or had greater belief in the intervention than non-volunteers.

      May, H., Gray, A., Gillespie, J. N., Sirinides, P., Sam, C., Goldsworthy, H., ... Tognatta, N. (2013). Evaluation of the i3 Scale-up of Reading Recovery: Year one report, 2011-12. Philadelphia, PA: Consortium for Policy Research in Education.

      Evaluation Methodology

      Design: This study used a randomized controlled trial to estimate short-term program impacts on student achievement after the 12-20 week program was implemented in 2011-2012. Although this trial is one part of the study’s long-term evaluation, only findings from the posttests are available so far. Of the 628 schools involved in the larger evaluation, 209 schools were randomly selected to participate in the trial, of which 158 implemented the random condition assignments. The study did not report any details on the recruitment, characteristics, or locations of the schools. The study noted that few of the noncompliant schools deliberately decided not to participate, as many had legitimate reasons beyond the school’s control. At each school, the eight first-grade students with the lowest reading achievement were matched into pairs according to pretest scores and English language learner status, and within each pair, one student was randomly assigned to treatment and the other to control.

      In the 158 participating schools, 1,253 students were randomly assigned. The study administered pretests prior to randomization and posttests at the conclusion of the intervention (midway through the school year). Of the 1,253 randomly assigned students, 866 (69%) students in 147 schools had Reading Recovery data, outcome data, and a match with complete data. The study reported that missing data primarily resulted from student mobility or other factors that led to the inability or failure to administer the posttest assessments.

      Sample Characteristics: The study analyzed 866 students identified as having low reading achievement. The majority of the sample (61%) was male and most students were not English language learners (81-83%). Whites comprised the largest percentage of the group (56-57%), followed by Hispanics (20-22%), blacks (18-19%), and students of other races (3-5%).

      Measures: All outcome measures were taken from the Iowa Tests of Basic Skills, a well-regarded, group-administered, norm- and criterion-referenced, standardized assessment. Reliability coefficients for the test ranged from middle .80s to low .90s. The study provided references for additional details on the Iowa Test. The study used the following measures:

      • Composite reading
      • Reading words subscale
      • Reading comprehension subscale

      The study used the following pretest measure:

      • Reading performance, from the Text Reading Level subscale in the Observation Survey of Early Literacy Achievement. This one-to-one, teacher-administered, and standardized instrument has been validated by others and has shown moderate to high test-retest and internal consistency reliability.

      Analysis: To determine program effects, three-level hierarchical linear models nested students within matched pairs and matched pairs within schools. Models controlled for pretest reading performance (but not the exact outcome measure) and allowed random school intercepts and random treatment effects across schools. Effect sizes were determined with Cohen’s D and Glass’ D; the former was calculated with the standard deviation for national norms, and the latter with the standard deviation of the outcome for the control group.
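
      The two effect sizes described above differ only in the standard deviation used as the denominator. The following is a minimal sketch of that arithmetic, not the study's code: the function name and all numeric values are hypothetical and chosen only for illustration. Note that the conventional Cohen's d divides by a pooled sample standard deviation, so the norm-based version here follows the study's description rather than the textbook definition.

```python
def standardized_effect(raw_impact: float, reference_sd: float) -> float:
    """Divide a raw impact estimate (in test-score units) by a reference SD."""
    if reference_sd <= 0:
        raise ValueError("reference_sd must be positive")
    return raw_impact / reference_sd


# Hypothetical illustration values, NOT taken from the study.
raw_impact = 4.0         # treatment-minus-control difference in score points
national_norm_sd = 10.0  # SD of the outcome in the national norming sample
control_group_sd = 8.0   # SD of the outcome among control-group students

# Norm-based effect size (what the study labels Cohen's D).
cohens_d = standardized_effect(raw_impact, national_norm_sd)
# Control-group-based effect size (what the study labels Glass' D).
glass_d = standardized_effect(raw_impact, control_group_sd)

print(f"norm-based: {cohens_d:.2f}, control-based: {glass_d:.2f}")
```

      Because low-achieving students are a restricted slice of the national distribution, the control-group standard deviation will often be smaller than the national one, which would make the control-based effect size the larger of the two for the same raw impact.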

      All student pairs that had complete data were included, but the study did not attempt to follow students with missing test scores.

      Outcomes

      Implementation Fidelity: The study concluded that the Reading Recovery model is being implemented with high fidelity since teachers, teacher leaders, and site coordinators met 95%, 87%, and 88% of standards, respectively. However, the study noted that there was less fidelity to the requirements of formally documenting each lesson. Further details on fidelity to program standards and guidelines are available in Chapter 4. Chapter 7 provides information on school-level implementation.

      Baseline Equivalence: The groups did not differ significantly on pretest reading performance, gender, English Language Learner status, or race, but the tests compared the analysis sample rather than the randomized sample.

      Differential Attrition: The study dropped both subjects in a matched pair if one subject was missing data. Analyses for those students included and excluded from the analytic sample indicated no significant differences in pretest reading performance, gender, race, or English language learner status. Baseline comparisons across condition for the analysis sample also indicated no differential attrition.
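
      The pair-level exclusion rule can be illustrated with a short sketch. This is an assumption-laden illustration rather than the study's procedure: the record layout (`pair_id`, `posttest`) and the function name are hypothetical.

```python
from collections import defaultdict


def complete_pairs(students):
    """Keep only students whose matched pair has posttest data for both members."""
    by_pair = defaultdict(list)
    for student in students:
        by_pair[student["pair_id"]].append(student)

    kept = []
    for members in by_pair.values():
        # Drop the whole pair if either member is absent or lacks a posttest.
        if len(members) == 2 and all(m["posttest"] is not None for m in members):
            kept.extend(members)
    return kept


# Hypothetical records: pair B loses its treated student's posttest.
students = [
    {"id": 1, "pair_id": "A", "treated": True, "posttest": 52.0},
    {"id": 2, "pair_id": "A", "treated": False, "posttest": 47.0},
    {"id": 3, "pair_id": "B", "treated": True, "posttest": None},
    {"id": 4, "pair_id": "B", "treated": False, "posttest": 45.0},
]

analytic_sample = complete_pairs(students)
print([s["id"] for s in analytic_sample])
```

      Under this rule a student with complete data (id 4 above) is still excluded when the matched partner is missing, which is why the analytic sample can shrink faster than the raw posttest response rate.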

      Posttest: Treatment students scored significantly higher on all three reading outcomes (composite reading, reading words subscale, reading comprehension subscale). Cohen’s D effect sizes ranged from .44 to .47.

      Moderation: Results of analysis restricted to students in rural schools or to English language learners were similar to the overall results. Despite smaller sample sizes, the program had significant effects on composite reading for these subgroups.

      Limitations

      • The study may not have followed intent-to-treat, since it did not attempt to follow students with missing test scores.
      • Models controlled for pretest reading performance, a slightly different measure than the reading posttest.
      • Tests for differential attrition showed few differences across variables but were not complete in comparisons within and across conditions.

      D’Agostino, J. V. & Murphy, J. A. (2004). A meta-analysis of Reading Recovery in United States schools. Educational Evaluation and Policy Analysis, 26 (1), 23-38.

      Evaluation Methodology

      Design: This study conducted a meta-analysis of 36 U.S. studies of Reading Recovery. The studies were obtained through comprehensive searches of ERIC, PsycInfo, and Dissertation Abstracts databases and through the footnote and references lists of identified manuscripts. A total of 109 studies were collected for potential inclusion. Of these, 36 met the following eligibility criteria: (1) had evidence of treatment fidelity (students only received program instruction), (2) reported sample sizes in treatment and comparison groups, (3) had pretest or posttest scores, (4) did not duplicate data, (5) was conducted in U.S. schools, (6) had data to compute effect sizes, and (7) specified a reading skill outcome measure. An additional set of analyses used 11 studies that met the eligibility criteria and also reported pretest and posttest scores for treatment and comparison groups.

      The authors did not indicate how many studies used a randomized controlled design or how studies determined condition status, but noted that a small fraction of the 11 studies used randomly assigned groups. Treatment students were categorized into discontinued (students who improved enough to leave the program), not-discontinued (students who never improved enough to leave the program), and all program students. Comparison students were classified either as “similar needy” (at or below the 20th percentile, like the intervention group) or as “regular” (above the 20th percentile).

      For the 36 studies, data were collected in years between 1984 and 1996. Sample sizes ranged from 9 to 1334 students. The 36 studies were conducted in various U.S. locations. Several studies were located in Ohio, a few were in Texas, and the rest were in other locations such as Oregon or Michigan.

      Sample Characteristics: No sample characteristics were provided.

      Measures: All studies used a reading skill outcome measure. Many used the following measures from the Observation Survey of Early Literacy Achievement created by the developer of Reading Recovery:

      • writing vocabulary
      • hearing and recording sounds in words
      • text reading level
      • letter identification
      • word tests
      • print concepts

      The study also used the following outcome measure:

      • standardized tests such as the California Test of Basic Skills

      Analysis: For each outcome and treatment group type, the study computed separate average weighted effect sizes at each test time (pretest, posttest, and 2nd grade follow-up), although not all groups had enough cases for each test time or outcome. For the 36 studies, many of which did not have pretest standard deviations, the study calculated effect sizes with population comparison-group means and pooled standard deviations estimated with various methods. Analysis for the group of 11 studies computed effect sizes using the conventional standardized mean difference formula.

      The study computed Z statistics for each effect size distribution to test the null hypothesis that each point estimate equaled zero. For the analysis of the 36 studies, the study did not control for pretest levels or conduct significance tests for changes in effect sizes from pretest to posttest. For the analysis of the 11 studies with more complete data, weighted meta-regression analysis predicted mean posttest scores controlling for mean pretest scores and condition status and using group degrees of freedom as weights.
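
      As a rough illustration of an average weighted effect size with a Z test against zero, the sketch below uses inverse-variance weights. That weighting scheme is an assumption: the summary above does not say which weights the meta-analysis used for this step, and the input values are hypothetical.

```python
import math


def weighted_mean_effect(effect_sizes, variances):
    """Inverse-variance weighted mean effect size, its standard error, and Z."""
    if len(effect_sizes) != len(variances) or not effect_sizes:
        raise ValueError("need equally long, non-empty inputs")
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    mean_es = sum(w * es for w, es in zip(weights, effect_sizes)) / total_weight
    standard_error = math.sqrt(1.0 / total_weight)
    z_statistic = mean_es / standard_error  # tests H0: mean effect size = 0
    return mean_es, standard_error, z_statistic


# Hypothetical study-level effect sizes and sampling variances.
effect_sizes = [0.30, 0.50, 0.40]
variances = [0.04, 0.02, 0.04]

mean_es, se, z = weighted_mean_effect(effect_sizes, variances)
print(f"mean ES = {mean_es:.3f}, SE = {se:.3f}, Z = {z:.2f}")
```

      With these inputs the more precise middle study receives twice the weight of the others, pulling the average toward its value.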

      It is unknown if the analysis was conducted at the proper level, since the study did not report how condition statuses were determined. It is also unknown if the studies followed intent-to-treat, since there was no information on attrition. However, the analysis also examined results for discontinued students only, that is, the subset who successfully left the program, excluding those doing poorly enough to continue. These results likely violate the intent-to-treat principle.

      Outcomes

      Implementation Fidelity: All studies included in the meta-analysis showed evidence of treatment fidelity. The authors reported that “teachers must receive rather rigorous preparation to become [Reading Recovery] instructors, and the overall quality control in program delivery is relatively high” (D’Agostino & Murphy, 2004: 29). No other details were given.

      Baseline Equivalence: Treatment and comparison groups were not equivalent across the studies. For the 36 studies, the study reported that “across all outcomes and groups pretest effect sizes were negative, indicating that [Reading Recovery] students scored lower than comparison-group students initially” (D’Agostino & Murphy, 2004: 30). For the group of 11 studies, treatment students scored higher on standardized achievement tests and on letter identification than other low-achieving students.

      Differential Attrition: The study did not provide any information on attrition.

      Posttest: Results generally indicated improvements for the intervention group.

      Using all 36 studies, results indicated stronger findings comparing intervention students to other low-achieving rather than to regular students. Compared to similarly low-achieving students, the treatment group had significantly higher posttest scores on all seven reading skill outcome measures despite also having significantly lower pretest scores on all seven outcomes. Compared to regular students, the treatment group showed significantly higher posttest scores for three outcomes (writing vocabulary, hearing and recording sounds in words, and text reading level). The treatment group had significantly lower scores for the other four outcomes (standardized achievement tests, letter identification, word test, and concepts about print), but these scores were closer to the comparison group than pretest scores, though no significance test was conducted.

      Additional analysis using the 11 studies with pretest data for both treatment and control groups showed that all seven reading skill posttest outcomes were significantly higher among treatment compared to other low-achieving students. Weighted regressions controlling for pretest scores indicated that intervention students had higher scores for six of seven outcomes (writing vocabulary, hearing and recording sounds in words, text reading level, letter identification, word test, and concepts about print) compared to similar low-achieving students. There was no significant treatment effect for standardized achievement test scores.

      Moderation: Results showed generally higher pretest and posttest scores among students who were discontinued compared to those who were not discontinued, although there were no significance tests comparing scores of these groups. However, the discontinued students would appear to be a selective group of the most successful program participants.

      1-year follow-up: In second grade, the treatment group scored significantly higher on standardized achievement tests than similarly low-achieving students. The treatment group had significantly lower scores on this outcome than regular students, but the scores were more similar at posttest than pretest, although no significance test confirmed this trend.

      Limitations

      • Most studies included did not use a randomized controlled design. No details were given on comparison groups.
      • No information on response rates, attrition, differential attrition, or intent-to-treat (but the analysis of only discontinued students likely violates intent-to-treat).
      • The strongest effects were observed for the six outcomes from Observation Survey Measures that were most closely tied to program content.
      • It is unknown if the analysis was conducted at the proper level, since the study did not report how condition statuses were determined.
      • Fewer studies had pretest scores than had posttest scores.
      • Many significant differences in pretest scores.
      • No sample characteristics were provided.

      May, H., Goldsworthy, H., Armijo, M., Gray, A., Sirinides, P., Blalock, T. J., ... Sam, C. (2014). Evaluation of the i3 Scale-up of Reading Recovery: Year Two Report, 2012-13. Philadelphia, PA: Consortium for Policy Research in Education.

      This study examined a different number of schools and a new cohort of first-grade students than May et al. (2013) in Study 10, but the sample came from the same national scale-up implementation of Reading Recovery.

      Evaluation Methodology

      Design:

      Recruitment: Prior to the start of the 2012-2013 school year, 348 schools participating in the scale-up were randomly selected for this randomized controlled trial. At each selected school, low-performing students were identified using the Observation Survey of Early Literacy Achievement. The eight students with the lowest scores were included in the study. However, only 267 schools actually carried out the selection and assignment process; the other 81 were dropped from the study. The 267 schools selected a total of 2,092 students to participate in the study. The report (p. 43) notes that many IEP children were excluded despite low reading performance because they were seen as already receiving one-on-one reading support.

      Assignment: Of the 2,092 participating students, 1,048 were randomly assigned to the intervention group and 1,044 to the control group. They were first matched into pairs within each school according to pretest scores and English Language Learner status. One student in the pair was randomly assigned to the Reading Recovery treatment group for the first half of the school year in addition to regular classroom literacy instruction. The other student was assigned to the control group, which received regular classroom literacy instruction. The control student was eligible to receive the treatment after the program in the second half of the school year. The study noted (p. 18) that the vast majority of control group students received substantial support in addition to regular classroom instruction.

      Attrition: Assessment occurred at the end of the 12- to 20-week intervention period (midyear posttest). Of the 2,092 students, a total of 1,893 had available pretest data (90.5%). At posttest, 1,697 (81.1%) had data. A total of 1,430 students with data at both points were able to be matched into pairs of treatment and control (715 matched pairs in 233 schools). This sample represents 68.4% of the students in schools that carried out the random assignment. The missing data at the student level primarily resulted from student mobility or other factors that prohibited administration of the posttest measures to both treatment and control students in a pair.
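
      The retention percentages in this paragraph can be checked directly from the reported counts. The sketch below uses only figures stated above; the function and variable names are illustrative.

```python
def retention_rate(retained: int, randomized: int) -> float:
    """Percentage of randomized students retained, rounded to one decimal."""
    return round(100 * retained / randomized, 1)


randomized = 2092        # students selected and randomly assigned
with_pretest = 1893      # students with available pretest data
with_posttest = 1697     # students with posttest data
analytic = 715 * 2       # 715 complete matched pairs

rates = {
    "pretest": retention_rate(with_pretest, randomized),
    "posttest": retention_rate(with_posttest, randomized),
    "analytic": retention_rate(analytic, randomized),
}
print(rates)
```

      The gap between the posttest rate and the analytic-sample rate reflects students who had their own data but lost their matched partner.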

      Sample Characteristics: Students in the sample were 58-60% male, 55% white, 21% Hispanic, and 16% Black. About 21% were English-language learners.

      Measures: The pretest measure, the Observation Survey of Early Literacy Achievement, is a one-to-one, teacher-administered, standardized assessment. It has six sub-scales: Letter Identification, Concepts about Print, Ohio Word Test, Writing Vocabulary, Hearing and Recording Sounds in Words, and Text Reading Level. The Text Reading Level subtest was used to block students during the random assignment process, and later as a pretest covariate in the statistical models of impacts (but not as an outcome).

      The Iowa Test of Basic Skills served as the outcome measure. The measure is a standardized, group-administered assessment of cognitive readiness for the academic aspects of the curriculum and growth in fundamental areas of school achievement.

      Analysis: The analysis used a three-level hierarchical linear model with students nested within matched pairs, and matched pairs nested within schools. Models controlled for pretest performance with a covariate for the Observation Survey text reading level scores, and included random effects for blocks (matched pairs), a random effect for overall school performance (random school intercepts), and a random effect for the impact of Reading Recovery (random treatment effects across schools).
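
      One plausible way to write a model matching this description is given below. The notation is an assumption introduced here for illustration and is not taken from the report: student i is nested in matched pair j within school k, T is the treatment indicator, and X is the pretest text reading level.

```latex
Y_{ijk} = \beta_0 + \beta_1 T_{ijk} + \beta_2 X_{ijk}
          + r_{jk} + u_{0k} + u_{1k} T_{ijk} + e_{ijk}
% \beta_1 : average impact of Reading Recovery
% r_{jk}  : random effect for matched pair (block) j in school k
% u_{0k}  : random school intercept
% u_{1k}  : random deviation of school k's treatment effect from \beta_1
% e_{ijk} : student-level residual
```

      In this form, the variance of the school-level treatment term indicates how much the program's impact varies from school to school.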

      Standardized effect sizes were calculated with Glass’ D, which represents a standardized effect relative to the distribution of outcomes for control group participants. In addition, the study reported a population-based Cohen’s D standardized effect size, which was calculated by dividing the raw impact estimate by the standard deviation of Iowa Test of Basic Skills for the national norming sample.

      The study analyzed all student pairs with complete data, but it did not attempt to follow students with missing test scores. If based on student mobility and school absence, missing data are unlikely to be related to the condition.

      Outcomes

      Implementation fidelity: Overall, 85% of the indicators used to assess implementation fidelity showed adequate implementation, and all four categories of Implementation Fidelity Activities represented in the Implementation Fidelity Logic Model (Figure 1) were implemented with fidelity. However, some inconsistencies were found in the selection of students for participation in the program, particularly among students receiving special education services.

      Baseline equivalence: The baseline balance tests examined three demographic variables and one reading variable for the final analytic sample of 1,430 students in 233 schools rather than the full randomized sample. No significant differences were found between treatment and control groups on gender, ELL status, race, or text reading level.

      Differential attrition: The study dropped both subjects in a matched pair if one subject was missing data. Analyses of differences in student characteristics for those students included and excluded from the analytic sample indicated no significant differences in pretest text reading levels (p = .63), gender (p = .55), race (p = .94), or ELL status (p = .68). Baseline comparisons across condition for the analysis sample also indicated no differential attrition.

      Posttest: The intervention students showed significantly better posttest scores than the control students on total reading (Glass’ D = .42), the reading words subscale (Glass’ D = .40), and the reading comprehension subscale (Glass’ D = .36). Separate tests for populations of special interest found significant intervention effects for rural schools and for ELL students.

      Limitations

      • The study may not have followed intent-to-treat, since it did not attempt to follow students with missing test scores.
      • Models controlled for pretest reading performance, a slightly different measure than the reading posttest.