
Success for All

Blueprints Program Rating: Promising

A schoolwide reform initiative in which specific instructional processes, curriculum enhancements, and improved support resources for families and staff come together to ensure that every student acquires adequate basic language skills in pre-K through 2nd grade and that they build on these basic skills throughout the rest of elementary school.

  • Academic Performance
  • Preschool Communication/Language Development

    Program Type

    • Academic Services
    • Mentoring - Tutoring
    • School - Environmental Strategies
    • School - Individual Strategies
    • Teacher Training

    Program Setting

    • School

    Continuum of Intervention

    • Universal Prevention (Entire Population)


      Population Demographics

      Elementary school children, K through 5.

      Age

      • Late Childhood (5-11) - K/Elementary

      Gender

      • Male and Female

      Race/Ethnicity

      • All Race/Ethnicity

      Race/Ethnicity/Gender Details

      Studies included diverse samples. The strongest study consisted of 56% African American and 10% Hispanic students.

      • School
      • Family

      Risk Factors

      • Family: Neglectful parenting
      • School: Poor academic performance, Repeated a grade

      Protective Factors

      • Family: Parental involvement in education
      • School: Instructional Practice

      See also: Success for All Logic Model (PDF)

      Success for All (SFA) is primarily a literacy program, but it is also a schoolwide reform initiative in which specific instructional processes, curriculum enhancements, and improved support resources for families and staff come together to ensure that every student acquires adequate basic language skills in pre-K through 2nd grade and builds on these basic skills throughout the rest of elementary school. As such, the need for remediation and grade retention should decline drastically. The program has two major components: (a) a student-level intervention, which includes instruction based on the SFA philosophy and curriculum; and (b) a school-level intervention, which involves establishing a schoolwide "solutions" team (i.e., a team that addresses classroom management issues, seeks to increase parents’ participation, mobilizes integrated services to help families, and identifies particular problems such as homelessness), hiring a full-time program facilitator, and undertaking training and ongoing professional development for staff. Because the reform is comprehensive, requires significant ongoing professional development across multiple years, and depends on faculty support and buy-in from the outset, a vote of at least 80% of teachers in favor of program adoption is required.

      Success for All (SFA) is more than just an elementary school literacy program. It is a schoolwide reform initiative in which specific instructional processes, curriculum enhancements, and improved support resources come together to ensure that every student acquires adequate basic language skills in pre-K through 2nd grade and that they build on these basic skills throughout the rest of elementary school. As such, the need for remediation and grade retention should drastically decline. The SFA program has two primary levels of intervention: (a) student-level interventions and (b) school-level interventions. Note that even student-level instruction is implemented school-wide.

      Student-level interventions

      • Instructional processes: Instruction centers on cooperative learning that teaches metacognitive strategies. The cycle of instruction includes direct instruction, guided peer practice, assessment, and feedback to students on their progress. Students are placed in skill-level reading groups, which may cross grade levels.
      • Curriculum: The curriculum is research-based reading, writing, and language arts in all grades. The kindergarten curriculum is a full-day program in which children learn language and literacy, math, science, and social studies through sixteen 2-week thematic units. The reading component in K-1 contains systematic phonemic awareness and phonics programs. Key to this curriculum is the use of mnemonic picture cards and embedded video clips that support phonics and vocabulary development. In grades 2-6, students use novels and basal readers but not workbooks. The curriculum emphasizes cooperative learning and partner reading activities; comprehension strategies, such as summarization and clarification, built around narrative and expository texts; writing; and direct instruction. Students are required to read books of their own choice for 20 minutes at home each evening.
      • Tutors: In grades 1-3, specially trained certified teachers and paraprofessionals work one-on-one with any students who are failing to keep up with classmates in reading. Tutoring takes place 20 minutes per day during times other than reading periods.
      • Quarterly assessment and regroupings: Students in grades 1-6 are assessed every quarter to determine whether they are making adequate progress in reading. Assessment information is also used to suggest alternate teaching strategies, changes in reading group placement, or provision of tutoring services.
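The quarterly assessment-and-regrouping cycle above can be sketched as a simple placement rule. This is a hypothetical illustration only; the cutoff and group thresholds below are invented for the example and are not SFA's actual placement criteria.

```python
# Hypothetical sketch of the quarterly regrouping logic described above.
# The cutoff and thresholds are illustrative, not SFA's actual criteria.
def regroup(students, tutoring_cutoff, group_bounds):
    """Assign each student to a skill-level reading group by quarterly
    score, and flag students below the cutoff for one-on-one tutoring.

    students: (name, score) pairs
    group_bounds: ascending score thresholds separating the groups
    """
    plan = []
    for name, score in students:
        group = sum(score >= b for b in group_bounds)  # thresholds passed
        plan.append((name, group, score < tutoring_cutoff))
    return plan
```

Each quarter, the same rule would be re-run on fresh assessment scores, so a student's group placement and tutoring flag can change as the year progresses.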

      School-level interventions

      • Solutions team: This team works in each school to help support staff and families in ensuring success of the children. For example, the Team addresses classroom management issues, seeks to increase parents’ participation, organizes and integrates services to help families, and identifies particular problems such as homelessness. The team is composed of school staff, parent liaisons, social workers, counselors, and/or assistant principals.
      • Facilitators: An on-site SFA program facilitator (a) works with teachers and staff to implement the reading program; (b) manages the quarterly assessments; (c) assists the solutions team; (d) ensures adequate communication between staff members; and (e) makes certain each child is making adequate progress.
      • Training and professional development: The staff receives three days of intensive training at the beginning of the first year of implementation. During the first year, SFA program staff typically provides 16 more days of on-site support. After the first year, approximately 15 days of additional training by SFA program staff are provided each year.

      The theoretical rationale for Success for All (SFA) exists on two levels -- theories of the importance of individual early literacy and theories of whole-school reform.

      The SFA program has a core and fundamental focus on early student literacy. SFA’s “defining characteristic” is the specific sequencing of literacy instruction across the grades. The K-1 curriculum emphasizes the development of language skills and launches students into reading phonetically regular storybooks. The theory is supported by empirical evidence that suggests phonemic awareness is the best single predictor of future reading ability.

      Some external school reform models have been criticized because their prescriptive designs may suppress teacher creativity and also require an inordinate amount of teacher prep time. However, if the reform model is clearly defined, developed with a mind toward greater fidelity, and has strong professional development and training components, these problems may be mitigated. Success for All has addressed each of these issues and is expected to have earlier and more sustained effects than models without such components.

      • Skill Oriented

      The main study (Borman et al., 2007) was a clustered randomized trial of the effect of the Success for All (SFA) literacy program on early literacy outcomes. The sample included 41 high-poverty elementary schools (grades K-5) across 11 states that were randomly assigned either to receive SFA or to serve as control schools. The final sample size was over 15,000 students in 35 schools. All students in both groups took a baseline assessment at the beginning of the year. The treatment schools implemented SFA in K-2nd grade, and their literacy outcomes at the end of each year were compared with literacy outcomes from the corresponding cohort in the control group. The study collected data across three years (i.e., the final year of data collection was when the kindergarten cohort completed 2nd grade).

      Another large study (Quint et al., 2013, 2014, 2015) used a randomized-controlled trial to estimate program impacts on kindergartners’ reading after the first, second, and third years of a multi-year evaluation project. The study recruited five school districts in four states for a total sample of 37 schools. The schools were randomly assigned to a condition, with 19 intervention schools and 18 control schools. Pretests were given in the fall of 2011 and kindergarten posttests were administered in the spring of 2012 while first grade posttests were administered in spring of 2013 and second grade posttests were administered in spring of 2014. The analysis sample for the 2013 study included 2,568 kindergartners who were present in the study schools in the fall and spring of the school year and who had valid spring test scores. In the 2014 study, outcomes were assessed among 2,251 students who remained enrolled in a school of the same type (treatment or control) and completed assessments in spring. At the end of the third year, the number of remaining students with data for all time points varied by test from 1,625 to 1,635.

      The majority of other SFA studies used a quasi-experimental design in which SFA schools were "matched" with other elementary schools in the school district based on percent free/reduced price lunch, race, and historical performance on standardized tests. The outcomes were often three subscales of the Woodcock Reading Mastery Test (Word Attack, Word Identification, and Passage Comprehension). Mean scores for SFA schools were compared to mean scores for comparison schools to determine SFA efficacy.

      Of the ten studies evaluated, Borman et al. (2007) corrected most of the serious design flaws found in the others and therefore should be considered the most accurate representation of the Success for All (SFA) program. Quint et al. (2013) also used a high-quality design and found some significant effects on a phonics measure at 1- and 2-year follow-ups (Quint et al., 2014; Quint et al., 2015).

      Borman et al. (2007) was a randomized controlled trial that found small-to-moderate effect sizes (ranging from .21 to .36) for SFA students after three years of treatment (kindergarten through grade 2). The effects were similar for students who had been enrolled continuously and for students who enrolled after kindergarten. The researchers also found that the effect sizes tended to grow each year for both samples. All three components of the primary assessment tool (Word Attack, Word Identification, and Passage Comprehension) showed significant effects, although Word Attack tended to have higher effect sizes.

      Quint et al. (2013) found that, after adjusting for multiple hypothesis testing, intervention kindergartners scored marginally significantly higher on the word attack test (p<.10) but not the letter-word test. Without the adjustment, the impact of the program on word attack scores was significant at the .05 level (effect size=.18). At the end of their first-grade year (Quint et al., 2014), intervention-school students continued to outperform the control group on word attack (p<.001, effect size=.35) and marginally on letter-word identification (p=.08, effect size=.09), though harmful effects were observed for those receiving special education. At the end of their second-grade year, intervention-school students continued to show significantly higher scores on the word attack subtest (p=.022).

      Four quasi-experimental studies controlled for pre-test scores and reported significance levels. Madden et al. (1993) found average effect sizes of .51, .60, and .57 for grades 1, 2, and 3, respectively. A long-term follow-up of these youth in the 8th grade found a reading effect size of .29 and a math effect size of .11. Munoz and Dossett (2004) found a significant but extremely small average effect size of .11 on the reading component of the Comprehensive Test of Basic Skills. Jones et al. (1997) conducted a quasi-experimental study of a single school in Charleston, SC. The study found generally positive and significant effects on literacy achievement in the first two years of the program, but these effects disappeared in the third year. An English quasi-experimental study (Tracey et al., 2014) found small effects on word identification (d=.20) and word attack (d=.25) at the end of students’ 2nd grade year, but no effects on higher-level reading outcomes such as passage comprehension or accuracy.

      Nunnery et al. (1997) addressed whether partial SFA implementations were as effective as full SFA implementations. They found that high-implementation, predominantly African American schools were the only schools that substantially exceeded comparison schools when controlling for pretest scores (effect sizes range from .14 to .49 in different literacy assessments). When controlling for pre-tests, no other significant differences were found between SFA schools, whether fully or partially implemented, and control schools.

      Slavin and Madden (1998) reported on evaluations of bilingual programs in different parts of the country. Two of those studies controlled for pretest scores. First, an SFA school in Philadelphia performed significantly higher on one subtest (Word Attack, a test of phonetic understanding) of the literacy battery than the comparison school, but not on any of the other literacy subtests. Similarly, relatively highly impoverished SFA schools in Arizona performed better in Word Attack than comparison schools. Among less impoverished schools, there were no significant differences between SFA and comparison schools. These findings suggest that when the bilingual version of SFA works, it tends to work best on phonetically based literacy outcomes.

      Finally, Chambers et al. (2005) looked specifically at the use of embedded video/multimedia in SFA programs. Multimedia SFA programs had higher scores than non-multimedia SFA programs in Word Attack scores, but not on the other assessments.

      The study with the strongest design (Borman et al., 2007) found the following:

      • Compared to the control schools, Success for All (SFA) schools exhibited significantly higher literacy scores after three years of the schoolwide implementation. Effect sizes (Cohen's d) were .33, .22, and .21 for different literacy domains.
      • SFA appeared to work equally well for students who were exposed to the treatment for the full three years and for those who enrolled after the program was implemented.
      • Program effect sizes either remained stable or grew as the SFA students were exposed to the purposively sequenced SFA program.

      Other findings include:

      • In a randomized controlled trial, program effects approached significance (p<.10) for word attack scores among Kindergarteners (Quint et al., 2013), attained significance for word attack among first and second graders (Quint et al., 2014; Quint et al., 2015), and trended toward significance (p= .08) for letter-word identification among first graders.
      • Other studies that suffered from design problems, but controlled for pretests, found very small (Cohen’s d= .11) to moderate (Cohen’s d = .6) effect sizes of the program on literacy achievement (Madden et al., 1993; Munoz and Dossett, 2004; Jones et al., 1997; Tracey et al., 2014).
      • One study on long-term effects (Borman and Hewes, 2002) found small to moderate effects through 8th grade on reading achievement (ES=.29), years of special education (ES=-.18), and never being retained in elementary school (ES=.27).
      • The evidence on whether full implementation produces significantly higher test scores than partial implementation was inconclusive (Nunnery et al., 1997).
      • Studies on bilingual versions of SFA indicate that the program can be just as effective as English-dominant SFA programs.
      • Using Multimedia as part of SFA shows promise.

      Munoz and Dossett (2004) sought to identify changes in student, teacher, and parent perceptions of school climate, educational quality, and teacher job satisfaction that could be attributed to SFA. Survey answers from three urban Kentucky SFA schools were compared with answers from comparison schools over a three-year period. Teachers from SFA schools increased their ratings of school climate, educational quality, and job satisfaction more quickly than comparison-school teachers. Students' ratings of school climate remained steady over the period in both SFA and comparison schools, while SFA students increased their ratings of educational quality and comparison-school students' ratings remained steady.

      In the main study, effect sizes were weak to moderate. The Cohen's d for the longitudinal sample compared to the control sample was .33 for Word Attack, .22 for Word Identification, and .21 for Passage Comprehension. The combined sample showed slightly higher effect sizes: .36 for Word Attack, .24 for Word Identification, and .21 for Passage Comprehension. The authors provide context for interpreting these effect sizes, maintaining that they represent about one-half to three-fourths of the literacy achievement gap between black and white children.

      Effect sizes in other studies that control for pre-test scores include:

      • Cohen’s d of .51, .60, and .57 for grades 1, 2, and 3, respectively, in the Baltimore study (Madden et al., 1993).
      • Cohen’s d of .11 in the urban Kentucky study (Munoz and Dossett, 2004).
      • Cohen's d of .29 for the 8th-grade long-term follow-up in the Baltimore sample (Borman and Hewes, 2002).
      • Cohen’s d of .20 for word identification and .25 for word attack at the end of 2nd grade in an English study (Tracey et al., 2014).
      • Cohen’s d of .18 for kindergartners’ word attack scores (Quint et al., 2013), .35 for first-grade word attack, and .09 for first-grade letter-word identification (Quint et al., 2014). For word attack scores at the end of 2nd grade, the effect size was .15 standard deviations (Quint et al., 2015).
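For readers unfamiliar with the metric, the Cohen's d values reported throughout this summary are standardized mean differences. A minimal sketch (not the evaluators' code; the input values are illustrative stand-ins, not data from the studies):

```python
# Illustrative computation of Cohen's d, the effect-size metric reported
# above. All input values are hypothetical stand-ins.
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)

# When scores are standardized to mean 0 and SD 1 (as in Borman et al.,
# 2007), d reduces to the adjusted group difference in SD units.
d = cohens_d(mean_t=0.33, mean_c=0.0, sd_t=1.0, sd_c=1.0, n_t=1085, n_c=1023)
```

On this convention, a d of .33 means the treatment group scored about a third of a standard deviation above the control group.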

      The main study is generalizable to typical Success for All elementary schools -- i.e., high-poverty schools with the majority of students (more than 70%) eligible for free lunch. The other studies are limited by the geographic and demographic characteristics of their samples.

      The main study (Borman et al., 2007) has only a few limitations:

      • No long-term follow-up of the kindergarten cohort is planned to see whether the effects last into 3rd through 5th grades.
      • Word Attack, the phonetics-based assessment, has moderate effect sizes, while Word Identification and Passage Comprehension have only small effect sizes. This leaves open the question whether the phonetic foundation will ultimately translate into moderate effect sizes in more literacy domains.
      • Non-equivalency between the treatment and comparison group with respect to race/ethnicity may be a problem.

      Other studies:

      • Schools self-select into SFA by a vote of 80%. Therefore, SFA schools may be different from control schools and these differences may contribute to differences in outcomes.
      • Schools were not randomly assigned in any of the studies except for the main study (Borman et al., 2007), the Quint et al. (2013, 2014, 2015) studies, and the multimedia study (Chambers et al., 2005).
      • Many of the studies were inconsistent in reporting significance levels.
      • Many of the studies used students as the unit of analysis when the more appropriate unit would have been schools.
      • Cell sizes were below 30 in many of the studies.

      • Blueprints: Promising
      • Coalition for Evidence-Based Policy: Top Tier
      • Crime Solutions: Effective
      • OJJDP Model Programs: Effective
      • What Works Clearinghouse: Meets Standards Without Reservations - Positive Effect

      Borman, G., & Hewes, G. (2002). The long-term effects and cost-effectiveness of Success for All. Educational Evaluation and Policy Analysis, 24(4), 243-266.

      Borman, G., Slavin, R., Cheung, A., Chamberlain, A., Madden, N., & Chambers, B. (2005). Success for All: First-year results from the national randomized field trial. Educational Evaluation and Policy Analysis, 27(1), 1-22.

      Borman, G., Slavin, R., Cheung, A., Chamberlain, A., Madden, N., & Chambers, B. (2007). Final reading outcomes of the national randomized field trial of Success for All. American Educational Research Journal, 44(3), 701-731.

      Chambers, B., Cheung, A., Madden, N., Slavin, R., & Gifford, R. (2005). Achievement effects of embedded multimedia in a Success for All reading program. Technical report. Center for Research and Reform in Education, Johns Hopkins University.

      Correnti, R. (2009). Examining CSR program effects on student achievement: Causal explanation through examination of implementation rates and student mobility. Paper presented at the annual meetings of the Society for Research on Educational Effectiveness. Crystal City, VA.

      Jones, E., Gottfredson, G., & Gottfredson, D. (1997). Success for some: An evaluation of a Success for All program. Evaluation Review, 21(6), 643-670.

      Livingston, M. & Flaherty, J. (1997). Effects of Success for All on reading achievement in California schools. San Francisco, CA: Wested.

      Madden, N., Slavin, R., Karweit, N., Dolan, L., & Wasik, B. (1993). Success for All: Longitudinal effects of a restructuring program for inner-city elementary schools. American Educational Research Journal, 30(1), 123-148.

      Munoz, M. A., & Dossett, D. H. (2004). Educating students placed at risk: Evaluating the impact of Success for All in urban settings. Journal of Education for Students Placed at Risk, 9(3), 261-277.

      Nunnery, J., Slavin, R., Madden, N., Ross, S., Smith, L. J., Hunter, P., et al. (1997). Effects of full and partial implementation of Success for All on student reading achievement in English and Spanish. Paper presented at the meeting of the American Educational Research Association, Chicago, IL.

      Quint, J. C., Balu, R., DeLaurentis, M., Rappaport, S., Smith, T.J., & Zhu, P. (2013). The Success For All model of school reform: Early findings from the Investing in Innovation (i3) scale-up. New York: MDRC.

      Quint, J. C., Balu, R., DeLaurentis, M., Rappaport, S., Smith, T.J., & Zhu, P. (2014). The Success For All model of school reform: Interim findings from the Investing in Innovation (i3) scale-up. New York: MDRC.

      Quint, J. C., Zhu, P., Balu, R., Rappaport, S., & DeLaurentis, M. (2015). Scaling up the Success for All model of school reform. New York: MDRC.

      Slavin, R. E., & Madden, N. A. (1998). Success for All/Exito para todos: Effects on the reading achievement of students acquiring English. Report No. 19. Baltimore, MD: Center for Research on the Education of Students Placed at Risk.

      Success for All Foundation
      200 W. Towsontown Blvd.
      Baltimore, MD 21204
      800-548-4998, ext.2372
      sfainfo@successforall.org
      www.successforall.org

      Study 1

      Borman, G., Slavin, R., Cheung, A., Chamberlain, A., Madden, N., & Chambers, B. (2007). Final reading outcomes of the national randomized field trial of Success for All. American Educational Research Journal, 44(3), 701-731.


      This study sought to answer three key questions that dealt both with Success for All's direct literacy outcomes and its overall effectiveness in successfully establishing whole school reform:

      • What are the effects of the SFA program on early-elementary (i.e., through grade 3) literacy outcomes?
      • Within SFA schools, are the effects larger for youth who were enrolled for the entire three years (the "longitudinal sample") than for the sample that also included youth who enrolled after program implementation (the "combined sample")? If so, then SFA is primarily effective only as a literacy program. If not, then SFA is also effective in producing schoolwide reform that impacts all students.
      • Consistent with program emphasis on the sequencing of literacy instruction, do Year 3 program effects spread into all tested literacy domains, not just phonetic-based domains?

      Evaluation Methodology

      Design: This clustered randomized trial included 41 elementary schools (grades K-5) across 11 states.

      School recruitment took place in two phases. In Phase 1, all schools were offered a discount to purchase the SFA program. Ordinarily, schools would have to spend $75,000 the first year, $35,000 the second year, and $25,000 the third year. During the spring and summer of 2001, a one-time payment of $30,000 was offered to all schools in exchange for participating in the study. Only six schools were attracted by this incentive. Three schools were randomly assigned to SFA (Group 1) and three were allowed to spend the $30,000 on any innovation other than SFA (Group 2). The sample was not sufficient, so the following year (spring and summer of 2002), schools were offered SFA at no cost and 35 schools responded. Thus, the initial sample size was 41 schools.

      The Phase 2 recruited schools were randomly assigned to one of the two groups. Group 1 schools provided SFA to kindergarten and grades 1-2 and their outcomes were compared to corresponding students from Group 2 who received a different intervention (Phase 1 schools) or their normal reading instruction (Phase 2 schools). Group 2 schools from Phase 1 recruitment did not receive any SFA treatment, and Group 2 schools from Phase 2 recruitment received the SFA treatment only for 3rd – 5th grade students (note, however, that the effects of SFA on 3rd - 5th grade students were not studied because these students were not exposed to the program during the key foundational instruction period in K-2nd grade). Therefore, most of the schools had both a treatment and a control group within each school.

      This method of having both treatment and control groups within each school had advantages and disadvantages. The primary advantage was that this design allowed for fewer schools to participate in the study and still provide valid counterfactuals.

      One disadvantage was that contamination (i.e., instruction in the treatment grades might influence instruction in the control grades and vice versa) was a distinct possibility. However, during observations to check for treatment fidelity, researchers did not notice any significant contamination of this kind.

      A second disadvantage of this design was that having both a treatment and a control in the same school could possibly reduce the measured effects of whole school reform because both treatment and control students and their families could have taken advantage of the school-wide reform-based services (e.g., family meetings). However, during observations to check for treatment fidelity few, if any, control students were observed benefiting directly from school-level SFA services such as parental support.

      A third disadvantage of this study is that during the third year of this 3-year study, the majority of baseline 1st grade students had moved to 3rd grade. Because the Group 2 teachers used SFA with their 3rd grade students, there was no control group to compare with the treatment group. Thus, the analysis is restricted to baseline kindergartners who progressed through 2nd grade over the 3-year study.

      Of the initial 41 participating schools, five closed due to insufficient enrollment and one withdrew from the study because of “local political problems.” Of the remaining 35 schools, 18 were in Group 1 (the “treatment” group, SFA in grades K-2), and 17 were in Group 2 (the “control” group, SFA in grades 3-5 or no SFA at all). The final sample included 1,085 students in the 18 treatment schools and 1,023 students in the 17 control schools.

      Children in the kindergarten cohort were followed into any grade as long as they remained in the same school. They were also followed into special education.

      Sample: The sample was concentrated in the urban Midwest (e.g., Chicago, Indianapolis) and in rural areas and small towns in the South. Approximately 72% of the students participated in the federal free lunch program, which is similar to the 80% participation rate for SFA participants nationally. The sample was 56% African American and 10% Hispanic, somewhat different from the SFA national figures of 40% and 35%, respectively. Overall, the researchers contend that the school sample was "reasonably well matched" with the SFA population.

      The total enrollment in the SFA schools was 7,923 students (mean per school = 440) and total enrollment in the control schools was 7,400 students (mean per school = 435).

      Measures: The measures used in this study were standard language arts assessments used in education research. The pre-test for the kindergarten cohort was the Peabody Picture Vocabulary Test. The Woodcock Reading Mastery Tests-Revised (WMTR) were used as the annual post-tests and the quarterly assessments. During Year 1 (kindergarten) and Year 2 (1st grade), four subtests of the WMTR were administered: Letter Identification, Word Identification, Word Attack (decoding non-words), and Passage Comprehension. In Year 3 (2nd grade), Letter Identification was dropped because it is typically not taught in 2nd grade. The WMTR is nationally normed and has internal reliability coefficients for the Word Identification, Word Attack, and Passage Comprehension subtests of .97, .87, and .92, respectively. Scores for the Peabody Picture Vocabulary Test pre-test and the WMTR post-tests were standardized to a mean of 0 and a standard deviation of 1.

      The students were individually tested by trained testers who were unaware of whether the student was assigned to SFA or the control group. The testers were primarily graduate students who had undergone a 2-day training session, completed a written test, and participated in a practice session with children not in the study.

      Analysis: All analyses were run using two different samples. The complete sample included all students, regardless of when they enrolled. The longitudinal sample included only those students who attended the sampled school for the entire three years. A multi-level framework was used with students nested within schools. Hierarchical linear models, which allowed for student- and school-level variability, estimated school-level effects of post-test achievement, with a sample size of 35. All tests were run as two-tailed tests, with alpha=.05 and power at least .80, and degrees of freedom = 32 (35 schools - 3). Total student sample size was 15,323. Pre-test and post-test scores were standardized so that effects show group differences in standard units.
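The scoring steps in this analysis can be sketched in outline. This is a pure-Python illustration of the standardization and school-level aggregation only; the actual hierarchical linear models were fit with specialized multilevel software.

```python
# Illustrative sketch of the scoring steps described above (not the
# authors' code): z-standardize test scores so group differences read in
# standard-deviation units, then aggregate to school means, since schools
# (n = 35) were the unit of analysis in the hierarchical models.
from collections import defaultdict
from statistics import mean, pstdev

def standardize(scores):
    """Rescale a list of scores to mean 0, standard deviation 1."""
    m, s = mean(scores), pstdev(scores)
    return [(x - m) / s for x in scores]

def school_means(records):
    """records: (school_id, z_score) pairs -> mean z-score per school."""
    by_school = defaultdict(list)
    for school, z in records:
        by_school[school].append(z)
    return {school: mean(zs) for school, zs in by_school.items()}
```

With scores on this standardized scale, a treatment-control difference in school means is directly interpretable as an effect size in standard-deviation units.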

      Outcomes

      Implementation fidelity: In addition to the extensive training and ongoing professional development provided by the SFA staff, trainers from SFA made quarterly implementation visits to each school to assess the extent to which SFA program components were in place. The trainers also identified other potential obstacles including staff turnover and student attendance. The trainers did find some implementation variability. Some schools immediately embraced and implemented the program while others struggled, even after the first year. Classroom instruction was "of reasonable quality" at almost all schools, but the tutoring and "solutions team" were rarely adequately implemented. Finally, most schools had a part-time rather than the recommended full-time facilitator.

      Baseline equivalence: The authors report that the treatment and control schools were “reasonably well matched” with respect to demographics. Tests for statistically significant demographic differences between treatment and control schools were non-significant. However, when testing for significant differences, the researchers combined "percent African American" and "percent Hispanic" into "percent minority." There was no statistical difference between the SFA and control schools on "percent minority," but the African American and Hispanic proportions appear quite different to the naked eye: the SFA sample was 49% African American and 13% Hispanic, while the control sample was 65% African American and 7% Hispanic. This difference may be due to the attrition of five schools, because the original sample of 41 schools showed no statistical differences in demographics between the SFA and control schools.

      The SFA schools were not significantly different than the control schools with respect to school-level pretest scores.

      Differential attrition: The study lost five schools to attrition (four closed due to insufficient enrollment and one refused to participate due to "local political problems"). Some of the study students had missing post-test data, but had, in fact, been consistently enrolled for three years at a study school. For these students, researchers imputed post-test data. However, for students who had missing post-test data but were not enrolled consistently over the three years, the researchers used listwise deletion. The listwise deletion did not cause differential attrition rates by program condition.

      No statistically different pre-test scores were found between treatment students who were dropped and control students who were dropped (internal validity satisfied). The researchers also compared attriters with those who were retained in the study. Attriters were more likely than non-attriters to be mobile (i.e., move into a school after the program had started) and had lower average pre-test scores. On one hand, since previous research has suggested that SFA is more effective for lower-achieving students, and this study dropped a disproportionate number of lower-achieving students, the estimated effects might be biased downward. On the other hand, movers and attriters may be less compliant, and their loss may exaggerate program effects.

      Posttest and Follow-Up: The primary outcome was the WMTR test (Word Attack, Word Identification, and Passage Comprehension) at the end of 2nd grade (Year 3). No analysis was completed using the 3rd through 5th grade SFA students (from the "control" schools) because data from them would not be representative of the effects of the SFA program and its emphasis on sequenced, foundational instruction in the early elementary years.

      To answer the question of whether SFA positively impacted early-elementary literacy outcomes, the researchers ran the model on the sample of those who participated in all three years (the "longitudinal" sample). The school-level effect size of SFA (Cohen's d) from the multi-level model was .33 units (p<.01) for Word Attack scores, .22 units (p<.05) for Word Identification scores, and .21 units (p<.05) for Passage Comprehension scores. Thus, in all three literacy domains of the WMTR, the SFA schools scored significantly higher than control schools by the end of 2nd grade (Year 3).
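The school-level effect sizes reported here are standardized mean differences (Cohen's d). A minimal sketch of the computation, using hypothetical school mean scores rather than the study's data:

```python
from statistics import mean, variance

def cohens_d(treatment_means, control_means):
    """Standardized difference between two sets of school mean scores,
    using the pooled standard deviation (illustrative sketch)."""
    n_t, n_c = len(treatment_means), len(control_means)
    pooled_var = ((n_t - 1) * variance(treatment_means)
                  + (n_c - 1) * variance(control_means)) / (n_t + n_c - 2)
    return (mean(treatment_means) - mean(control_means)) / pooled_var ** 0.5

# Hypothetical standardized school means for SFA and control schools
d = cohens_d([0.1, 0.3, 0.5], [-0.1, 0.0, 0.2])
```

An effect of .33, for example, means SFA schools averaged about a third of a standard deviation above control schools on that subtest.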

      To answer the question of whether the effects of SFA were larger for the longitudinal sample vs. the combined sample (includes students who enrolled after program implementation and were therefore not exposed to the program for the full three years), the researchers ran the model on both samples and compared the results. The school-level effect size (Cohen's d) of SFA was .36 units (p<.01) for Word Attack, .24 units (p<.05) for Word Identification, and .21 units (p<.05) for Passage Comprehension. Surprisingly, the effects for the longitudinal sample were not larger than the effects for the combined sample. The researchers concluded that the school-wide reform component is comprehensive enough to impact all SFA children, regardless of the number of years they were exposed to the SFA program.

      To address whether the sequencing and length of the program had a broad effect on all literacy domains by the end of 2nd grade, the researchers looked at effect sizes by year. For the combined sample, Word Identification effect sizes (Cohen's d) increased from .09 units in kindergarten to .19 units in 1st grade and then to .24 units in 2nd grade. Word Attack effect sizes held roughly steady from kindergarten to 1st grade and then rose in 2nd grade (.32, .29, and .36 units, respectively). Passage Comprehension effect sizes grew from -.10 units in kindergarten to .12 units in 1st grade to .21 units in 2nd grade. This pattern was similar for the longitudinal sample. Thus, the researchers concluded that improving early literacy can be achieved by first building a strong phonemic foundation in kindergarten and 1st grade.

      Correnti, R. (2009). Examining CSR program effects on student achievement: Causal explanation through examination of implementation rates and student mobility. Paper presented at the annual meetings of the Society for Research on Educational Effectiveness. Crystal City, VA.

      This study was not an evaluation of the Success for All program per se; rather, it was an analysis of the three most widely adopted comprehensive school reform (CSR) initiatives in the U.S. - Success for All, America's Choice, and the Accelerated Schools Project. These three programs have different philosophies and designs. The goal of the analysis was to attribute improvements in early literacy achievement to these comprehensive programs and to better identify the causal mechanisms at work through the programs. This paper was presented at a conference and not published in a peer-reviewed journal. Therefore, very few of the technical methodological details and none of the actual effect sizes were reported.

      Evaluation Methodology

      Design: This research used secondary data from the Study of Instructional Improvement (SII). The longitudinal SII contains data collected from 2000-01 through the 2003-04 academic years. The sample included 115 elementary schools (90 treatment schools, roughly evenly spread across the three programs, and 25 control schools). The schools were selected based on geographic region (to control for costs), length of time schools had been affiliated with the programs, and measures of socio-economic disadvantage. The comparison schools were chosen from the same geographic regions and were selected based on similar socio-economic disadvantage measures. Schools in the highest quartile of community disadvantage were over-represented in the sample. The 115 schools provided a student sample size of 7,692.

      Analysis: Treatment schools were matched with control schools using propensity scores based on school background characteristics (the author did not indicate the specific characteristics used). Literacy achievement indicators for two cohorts of children, K-2 and grades 3-5, were compiled, and reading outcomes for treatment schools were compared with reading outcomes for their propensity score matched comparison schools. The analysis was executed twice - once with all students and once with only students who were stable in their schools during the treatment period.
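The propensity-score matching step can be illustrated with a greedy nearest-neighbor sketch. The school labels and scores below are hypothetical, and the greedy 1:1 rule is a common simplification, not necessarily the author's exact procedure:

```python
def nearest_neighbor_match(treated, controls):
    """Greedy 1:1 matching of treated schools to the control school with the
    closest propensity score; each control is used at most once.
    Inputs map school id -> propensity score (assumed pre-estimated, e.g.,
    from a logistic regression on school background characteristics)."""
    available = dict(controls)
    matches = {}
    for school, score in sorted(treated.items(), key=lambda kv: kv[1]):
        if not available:
            break
        best = min(available, key=lambda c: abs(available[c] - score))
        matches[school] = best
        del available[best]
    return matches

# Hypothetical propensity scores
pairs = nearest_neighbor_match({"T1": 0.8, "T2": 0.3},
                               {"C1": 0.75, "C2": 0.35, "C3": 0.5})
```

Outcomes for each treated school are then compared against its matched comparison school, which is the design this study used.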

      Outcomes

      Two of the three CSR programs demonstrated a positive treatment effect on student literacy outcomes (Success for All and America's Choice), while the third program (Accelerated Schools Project) showed no significant impact. According to the author, Success for All successfully produced a pattern of "skill-based" reading instruction. Success for All was primarily effective in the early grades (K-2). Importantly, the author noted that the treatment effect was more pronounced for students who were stable in SFA schools. This result implies a dosage response effect and the author argues that this is evidence that Success for All has a causal effect on student achievement.

      Limitations: This paper did not demonstrate baseline equivalence or non-differential attrition, and it reported no sample characteristics or design details.

      Madden, N., Slavin, R., Karweit, N., Dolan, L., & Wasik, B. (1993). Success for All: Longitudinal effects of a restructuring program for inner-city elementary schools. American Educational Research Journal, 30(1), 123-148.

      This study evaluates the Success for All elementary literacy program using longitudinal data from five Baltimore schools from the late 1980s through the early 1990s.

      Evaluation Methodology

      Design: This study was quasi-experimental in that the five Success for All schools were matched with five other Baltimore schools that were similar in terms of percentage of students receiving free lunch, historical achievement level, and "other factors" that were not identified by the authors. Once the comparison schools were selected, the students were themselves matched based on previous scores on standardized tests.

      Reading proficiency data were collected in the 1990-91 academic year from students in all 10 schools who had been stable attenders since program implementation in 1987-88. Therefore, all 3rd graders in this study had been exposed to the program for at least 3 years.

      Attrition: No schools left the study during the three years of data collection. In terms of student-level attrition, the study only used data from youth who were enrolled consistently at each school. The lack of effort to follow up or study those not consistently enrolled in the study schools may violate the intent-to-treat principle.

      Sample Characteristics: The five SFA schools had a total baseline enrollment of 2,598. The authors do not provide enrollment counts for the control schools. Of the five SFA schools, all had between 97-100% African American enrollment and between 83-98% free lunch eligible. No other data were provided for the five control schools.

      Two of the schools were considered "high resource" in that they hired the suggested number of tutors (6 in one school, 9 in the other); offered full-day kindergarten; hired at least two staff members to be on the family support staff (now known as the solutions team), and hired full-time facilitators. The other three schools were considered "low resource" and did not achieve the full level of implementation. These schools hired only 2-3 tutors each, did not hire any additional staff members to be on the family support staff, and had only half-time program facilitators.

      Measures: Assessments of reading proficiency were individually administered to students by trained students from local colleges who were unaware of the study hypotheses or the school's treatment status. Retention and attendance data were obtained from school records.

      The study used two reading proficiency measures: the Letter-Word Identification (letter and word recognition) and Word Attack (phonetic synthesis) subtests of the Woodcock Language Proficiency Battery, and the Durrell Analysis of Reading Difficulty, which assesses oral reading and comprehension.

      The study also collected data on retention and attendance, yet these data were only available from the Success for All schools. The researchers do not address why they could not obtain retention and attendance data from the control schools.

      Analyses: The reading proficiency analyses were conducted using MANOVAs with standardized pretest scores as covariates and raw scores on the three reading outcomes as dependent variables. The MANOVAs produced Wilks's lambda statistics, which were used to test for significance. Following the multivariate analysis, ANCOVAs were computed for each dependent measure separately. All reading proficiency analyses were done by grade to test program effectiveness as children progress through the successive program components. Significance levels were evaluated at p-values of .10 and below.

      The retention and attendance rates for each treatment school were computed for each year and compared over time.

      Outcomes

      Baseline Equivalence: The five Success for All schools were matched with five other Baltimore schools that were similar in terms of percentage of students receiving free lunch, historical achievement level, and "other factors" that are not identified by the authors. The researchers did not present baseline equivalence data at the student level or pre-test baseline equivalence data.

      Differential Attrition: No analyses of differential attrition were presented.

      Posttest: Compared to their matched control schools, each SFA school had significantly higher average reading proficiency scores on most outcomes. The average effect size was .51 for Grade 1, .60 for Grade 2 and .57 for Grade 3. The consistency of the effect sizes across grades does not reflect the true difference in average scores between the Success for All schools and the control schools because the standard deviation of scores increased over time. The raw difference in scores between the schools averaged approximately 3 months of grade-equivalency in grade 1, 5.5 months of grade-equivalency in grade 2, and 8 months of grade-equivalency in grade 3.

      The researchers also ran the reading proficiency analysis using a sample of students who were in the bottom 25% in terms of reading achievement. The effect sizes were even stronger, but insignificant and unreliable because of extremely small samples (between 9 and 16 students).

      Retention rates, defined as the percent of students required to repeat a grade, fell from an average of 8.4% before program implementation to an average of .8% in 1990-91. Note, however, that SFA is fundamentally opposed to retention: rather than holding back students performing below grade level, the program recommends advancing them while continuing its special services to bring them up to speed.

      Absentee rates, defined as the percent of students absent, fell from an average of 11.7% to an average of 9.0%. The authors do not report whether this drop is statistically significant for each school or overall.

      Long-term followup

      Borman, G., & Hewes, G. (2002). The long-term effects and cost-effectiveness of Success for All. Educational Evaluation and Policy Analysis, 24(4), 243-266.

      This study focused on the long-term effects of the original Success for All program, which was implemented for first-graders in five elementary schools in Baltimore in 1988, 1989, and 1990. The student outcomes assessed in 1998-99 included 8th grade achievement in reading and math along with years of special education, instances of grade retention, and age at grade 8. The pre- and posttest data for reading and math achievement were drawn from the Comprehensive Test of Basic Skills (CTBS) in grade 1 and grade 8. The remaining data were drawn at grade 8 from school district records.

      Of the original sample of about 2,500 students, 1,310 students remained in the sample for achievement, and 1,730 remained in the sample for the other outcomes. The bulk of the attrition was due to three factors: (a) students remained in Baltimore schools but had missing data on one or more measures (50%); (b) students left the Baltimore school district (25%); and (c) students had not yet reached grade 8 (12%).

      The rates of attrition among SFA students and control students were statistically equivalent and the reasons for attrition were similar. A further attrition analysis revealed that the SFA attriters and control attriters were statistically equivalent on all background characteristics except for pretest reading score. However, the magnitude of the difference was "essentially" the same as the magnitude between the SFA non-attriters and the control non-attriters in pretest reading score. Thus, internal validity remains intact. With respect to attriters vs. non-attriters, all background characteristics were equivalent except that non-attriters had higher math and reading pretest scores than attriters in both the SFA and control samples. However, to the extent that the SFA program had stronger effects on the lowest achievers, the outcomes may have underestimated the program effect.

      ANOVA and logistic regression analysis produced results for achievement outcomes (reading and math CTBS/4 scores) and transcript outcomes (years of special ed in elementary school, years of special ed in middle school, ever retained in elementary school, ever retained in middle school, and age at 8th grade).

      The analysis for achievement included controls for pretests. For the full sample, SFA produced a statistically significant effect on reading achievement (E.S. = .29, equivalent to a 6 month advantage) and math achievement (E.S. = .11, equivalent to a 3 month advantage). For the sample of low-achievers, SFA produced a statistically significant effect on reading achievement (E.S. = .34), but not on math achievement.

      The analysis for the other outcomes produced some significant results, but the results do not reflect whether students were, in fact, improving academic performance to a point beyond special ed or retention thresholds. Rather, if the school is following the suggestion of SFA, it will, by definition, have fewer special ed placements and fewer retentions than otherwise. These significant outcomes have relevance in that cost savings may accrue because of fewer special ed placements and retained students and the savings could be reallocated to SFA.

      Limitations

      The Madden et al. (1993) study has a few limitations:

      • Only students who were stable in their enrollment were studied and no analysis of differential attrition was provided.
      • Due to lack of randomization, the study results may be due to school differences rather than the program.
      • The authors do not provide enough data on the control schools to ensure baseline equivalence.
      • Control school retention and attendance data was not available to compare with the treatment schools.
      • With only five matched schools, it is difficult to ensure that all relevant school characteristics are the same.
      • The results on retention are not relevant because not retaining students is a component of the SFA program.
      • Schools self-selected into the program.
      • Matching occurred at both school and individual level, but the analysis was done only at the individual level.
      • The long-term followup had attrition rates approaching 50%.

      Nunnery, J., Slavin, R., Madden, N., Ross, S., Smith, L. J., Hunter, P., et al. (1997). Effects of full and partial implementation of Success for All on student reading achievement in English and Spanish. Paper presented at the meeting of the American Educational Research Association, Chicago, IL.

      The Success for All (SFA) program is a relatively expensive literacy and school-reform initiative. In this study, the researchers assess whether SFA is still effective if only a subset of the components are implemented. If a partially-implemented SFA program is just as effective as a fully-implemented program, then perhaps the costs of SFA could be scaled down and more schools could afford the program using just their Title I funds.

      Evaluation Methodology

      Design: In this quasi-experimental design, SFA was offered to the highest poverty elementary schools in the Houston Independent School District. The schools were offered SFA with the reading component only, the reading component plus tutoring, or the full SFA program (reading, tutoring, support team and facilitator). Fifty schools volunteered. From the pool of elementary schools that did not volunteer, 23 were chosen to make up the "matched comparison" schools. The authors did not indicate how precisely the matching was made or why 23 schools were chosen.

      Of the 50 SFA schools, 19 schools used the Spanish-bilingual version of SFA alongside English SFA, and one school used the Spanish-bilingual version exclusively. None of the SFA programs were fully implemented by mid-fall 1995, and the Spanish-bilingual programs were especially late in implementation.

      In the English-dominant study, the cohorts were defined as follows: Cohort 1 began first grade in 1995 and Cohort 2 began first grade in 1996. Only Cohort 1 students were given a pretest (n=4,256). In 1996, posttests were given to ten Cohort 1 students from each school (by then in 2nd grade, n=595) and ten Cohort 2 students from each school (by then in 1st grade, n=682). Cohort 1 students with missing pretest data were dropped using listwise deletion. The researchers reported that 46 SFA schools and 18 comparison schools had complete data.

      The pre-test (Spanish Language Assessment Scale) was given in 1994-95 to Spanish-dominant students who were entering first grade (n=1,682), but because the Spanish-bilingual program was not completely implemented until late in the 1995-96 school year, there were no pretest data for Spanish-bilingual students. The final sample included 278 Spanish-dominant first grade students in 20 SFA and 10 comparison schools. The authors did not indicate how many of the 278 were SFA students and how many were comparison students. Also, because the Spanish-bilingual version of the program took so long to implement, the researchers did not draw a Cohort 1. This may violate the intent-to-treat principle by excluding data that might have been negative because the program was difficult to implement.

      Attrition: For Cohort 1, the analysis was performed on all students who had both pretest and posttest data. For Cohort 2, or the Spanish-bilingual students, the researchers did not mention attrition. Also, two schools dropped out at some point, but the authors do not address it.

      Sample characteristics: Only general characteristics of the schools were provided. The schools had an average of about 78% eligible for free lunch, between 47% and 57% Hispanic, and mobility rates between 30% and 53%.

      Measures

      Reading measures: The English-dominant reading pretest was the Language Assessment Scales - Oral (LAS). A battery of four reading posttests included the Word Attack, Word Identification, and Passage Comprehension subtests of the Woodcock Reading Mastery Tests and the Durrell Oral Reading Test.

      School characteristics measures: Six measures were drawn from each school: average pretest LAS score, percentage of students eligible for free or reduced-price lunch, student mobility rate, percentage of teachers with advanced degrees, average years of experience of teachers in the school, and teacher attendance rate. Factor analysis was used to generate two aggregate measures - student background characteristics and teacher experience measures of each school. The student background characteristic composite variable was converted into a dummy variable (low/high) at the median.

      Implementation measures: An implementation questionnaire was administered to principals or facilitators in all SFA programs. A 100% response rate was obtained after three mail and two telephone followups. The questionnaire collected program data such as number and type of tutors, facilitator status (none, part-time, full-time), and whether the school implemented a support team. An overall support score was computed by summing the standardized scores for the various measures. Schools were grouped into three implementation categories - low, medium, and high. In general, programs identified as high implementers had more certified tutors, were more likely to have full-time facilitators, and had higher percentages of Hispanic students and lower percentages of African American students. Among Spanish-dominant programs, only two implementation categories were used to "retain adequate power and balance in the design".
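The overall support score - summing standardized scores across implementation measures - can be sketched as follows. The measure names and values are hypothetical:

```python
from statistics import mean, pstdev

def composite_scores(components):
    """Sum each school's standardized (z) scores across implementation
    measures to form an overall support score. `components` maps a
    measure name to a dict of school -> value (hypothetical data)."""
    totals = {}
    for measure in components.values():
        mu, sigma = mean(measure.values()), pstdev(measure.values())
        for school, value in measure.items():
            totals[school] = totals.get(school, 0.0) + (value - mu) / sigma
    return totals

# Hypothetical measures: number of certified tutors and facilitator FTE
scores = composite_scores({"certified_tutors": {"A": 2, "B": 4, "C": 6},
                           "facilitator_fte": {"A": 0.0, "B": 0.5, "C": 1.0}})
```

Schools would then be binned by their composite score into the low, medium, and high implementation categories used in the analysis.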

      Analyses: Multivariate analyses of variance (MANOVA) were performed to test for overall treatment differences. The reading outcomes were the dependent variables, while implementation level, ethnicity of student body (majority Hispanic or majority African American), and the student background aggregate variable were the independent variables. For Cohort 1, the pretest score was also used as a covariate, while Cohort 2 did not have pre-test scores available. Follow-up univariate analysis was conducted when the multivariate hypothesis tests suggested significant treatment effects. When univariate effects were significant, ANOVA was conducted on residual scores for each student. The authors write, "For Cohort 1, effect size estimates were computed as the difference between mean standardized residual scores of a given SFA implementation level and the comparison mean. For Cohort 2, effect size estimates were computed as the standardized difference between posttest means."

      Outcomes

      Differential attrition: Differential attrition was not assessed.

      Baseline equivalency: SFA and comparison schools had similar percentages of students eligible for free lunch (about 78%). SFA schools had lower percentages of Hispanic students than comparison schools (47% vs. 57%) and higher average mobility rates (53% vs. 30%). Comparison schools had slightly higher average pretest scores than SFA schools. The authors did not analyze how these baseline differences may impact the results, and they did not provide any student-level baseline equivalency information.

      Fidelity: Fidelity was explicitly measured via the "implementation" variable, which took the values low/medium/high in the English-dominant SFA programs and low/high in the Spanish-dominant programs. This variable was derived from survey data on the number and type of tutors, facilitator status (none, part-time, full-time), and whether the school implemented a support team.

      Posttests: In the English-dominant program for Cohort 1, the authors did not present the main effect of implementation level on outcomes. Rather, the results presented represent interactions between implementation and racial status. The analysis indicated that high-implementation, predominantly African American schools were the only schools that substantially exceeded control students when controlling for pretest scores (ES=.49 in Oral Reading, ES=.18 in Passage Comprehension, ES=.14 in Word Attack, and ES=.22 in Word Identification).

      The analysis of Cohort 2 did not include controls for pretest, so the results should be interpreted with caution. SFA implementation had main effects on Oral Reading (p<.001), Passage Comprehension (p<.001), and Word Identification. Also, a significant multivariate interaction occurred between implementation level and socioeconomic strata (p=.04). High-implementation effect sizes for schools with low student background characteristics were .33 for Oral Reading, .34 for Passage Comprehension, .73 for Word Attack, and .55 for Word Identification. Again, without controlling for pretest scores, the results cannot be clearly interpreted.

      Limitations

      This study has significant limitations and the results should be interpreted with caution:

      • The results were not comprehensive, which suggests that some null or negative results may have been excluded.
      • Basic results of the effect of implementation on outcomes (without interactions) were not presented.
      • The lack of a Cohort 1 for the Spanish program due to late implementation may violate the intent-to-treat principle.
      • No pre-test data was available for Cohort 2 or the Spanish program, which suggests that results could be attributed to pre-existing differences in school achievement, especially given the study's lack of clarity around how the comparison schools were selected.
      • The schools self-selected into SFA and the comparison schools explicitly did not select SFA, which suggests the strong possibility of selection bias.
      • Although matched at the school level, the analysis was done at the individual level.

      Munoz, M. A., & Dossett, D. H. (2004). Educating students placed at risk: Evaluating the impact of Success for All in urban settings. Journal of Education for Students Placed at Risk, 9(3), 261-277.

      This quasi-experimental study focused on identifying the effects of SFA in a Kentucky school district three years after implementation. Outcomes include reading achievement, but also school-wide reform measures, such as attendance, suspensions, and perceptions of students, teachers, and parents on school climate, educational quality, and job satisfaction (teachers only).

      Evaluation Methodology

      Design: This quasi-experimental design used data from three SFA schools and three matched comparison schools in an urban Kentucky school district. The three SFA schools had been participating in SFA for three years, from 1999-2000 to 2001-2002, and the analysis looked at changes from Grade 1 to Grade 3. Student achievement, attendance, and suspension data were taken from school records; schoolwide reform measures were taken from surveys of students, teachers, and parents. Thus, this study sought to examine the effects of SFA not only as an early literacy program, but as a whole-school reform initiative.

      Matching took place on two levels - school and student. The treatment and control schools were matched on the following characteristics: percent free/reduced price lunch, race, percent with disabilities, percent from single-parent households, gender, and historical test scores. It is unclear whether the historical test scores are from the Stanford Diagnostic Reading Test, as stated in the text of the article, or the Comprehensive Test of Basic Skills (CTBS), as reported in the table. After school matching, the treatment and control students were matched on free/reduced price lunch status, race, single-parent household status, and gender. The matching procedure was checked using Chi-squared analysis and no significant differences were found between groups on these matching characteristics.

      The baseline sample size was 1,074 (593 treatment students and 481 control students). The final analytical sample, however, excluded students who transferred out of their baseline schools or did not have assessment data through the entirety of the study. The final N used for analysis was not reported.

      Attrition: Only students who were enrolled continuously in their schools from fall 1998 through the 2001-02 school year were included in this analysis. From this group, only students with complete demographic and testing data were included in this analysis. The authors did not provide an analysis of the potential systematic effect of this attrition on the results. The lack of effort to follow up or study those not consistently enrolled in the study schools may violate the intent-to-treat principle.

      Sample characteristics: The sample was entirely urban, about 55% female and 57% minority. About 85% of the sample received a free/reduced price lunch and slightly over 70% lived in single-parent homes.

      Measures: The measures in this study included: (a) student test scores on the CTBS Reading component, normal curve equivalent (NCE), taken from computer files; (b) school-level records on attendance (mean daily attendance rate) and on behavior-based suspensions (per year); and (c) survey responses from parents, teachers, and students on Likert-scale type questions, including perceptions of school climate, educational quality, and job satisfaction (for teachers only).

      The perception surveys were given each year. The student surveys were administered to students in schools. The teacher surveys were to be completed by teachers in private, with assurances of confidentiality. The parent surveys were taken home by students and returned to school.

      The combined response rate for all years of the survey was 69% for teachers, 68% for students, and 42% for parents. A total of 115 teachers, 667 students, and 867 parents completed the instruments. The authors did not report response rates by treatment status or justify why the response rate for students was so low, given that the surveys were administered in school.

      Analysis: Student-level data were analyzed using ANCOVA methods, with SFA treatment status as the between-subjects factor and pretest scores as the covariate. Effect sizes reflect standardized differences between SFA and comparison students. The mean Likert scores for each survey item were averaged by school and overall and were reported separately for each year.
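      The ANCOVA adjustment described above amounts to the classic adjusted-means computation: the raw posttest difference minus the pooled within-group pretest slope times the pretest difference. A minimal sketch, with invented scores rather than the study's data:

```python
# Sketch of an ANCOVA-adjusted treatment effect: posttest difference
# between groups, adjusted for pretest scores via the pooled
# within-group slope. All scores below are hypothetical.

def ancova_adjusted_difference(pre_t, post_t, pre_c, post_c):
    """Adjusted treatment-control posttest difference, pretest as covariate."""
    def mean(v):
        return sum(v) / len(v)

    def within(pre, post):
        mx, my = mean(pre), mean(post)
        sxy = sum((x - mx) * (y - my) for x, y in zip(pre, post))
        sxx = sum((x - mx) ** 2 for x in pre)
        return sxy, sxx

    sxy_t, sxx_t = within(pre_t, post_t)
    sxy_c, sxx_c = within(pre_c, post_c)
    slope = (sxy_t + sxy_c) / (sxx_t + sxx_c)  # pooled within-group slope
    raw_diff = mean(post_t) - mean(post_c)
    return raw_diff - slope * (mean(pre_t) - mean(pre_c))

# Hypothetical NCE-style scores for four students per group:
pre_t, post_t = [40, 45, 50, 55], [46, 50, 56, 60]
pre_c, post_c = [42, 47, 52, 57], [44, 49, 54, 59]
print(ancova_adjusted_difference(pre_t, post_t, pre_c, post_c))
```

Note how the adjustment can exceed the raw difference when the treatment group starts lower on the pretest.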

      Outcomes

      Baseline Equivalence: While equivalence was examined for both schools and students, only student equivalence was tested for significance. The authors report the factors used to match schools but, given the small number of schools, did not indicate whether there were statistically significant differences in these or other factors between the treatment and control schools. Importantly, the authors do not report whether significant differences existed on pretest scores, even though they control for pretest scores in the ANCOVA. Chi-squared tests indicated that the baseline characteristics of the students themselves did not differ significantly by treatment status.

      Differential Attrition: The authors did not present any differential attrition analysis.

      Achievement on Standardized Reading Tests: The researchers calculated the improvement in the mean CTBS NCE scores from 1998-99 through 2001-02. The SFA treatment schools averaged a gain of 4.4 points, compared to the control schools' improvement of only 2.3 points. The authors do not report whether this is a significant difference. However, using the student level sample (n=295), ANCOVA tests revealed that, adjusting by pretest scores, the effect of the program was statistically significant, but with a very small effect size (ES=.11).

      Attendance: The average attendance rate at SFA schools rose 1.2 points, from 93.5% to 94.7%. The average attendance rate at the control schools rose 0.7 points, from 94.4% to 95.1%. The researchers do not report whether there is a statistically significant difference in improvement between the control and the SFA schools.

      Out-of-School Suspensions: Among the SFA schools, the mean number of annual suspensions decreased by 23 suspensions (from 49 in 1998-99 to 26 in 2001-02). Among control schools, the mean number of annual suspensions decreased by 11 suspensions (from 22 in 1998-99 to 11 in 2001-02). SFA schools experienced a decrease of 47% in suspensions, while the control schools experienced a decrease of 50%. As with the previous outcomes, the authors do not report whether this is a statistically significant difference.

      Mediating Effects

      Perceptions of school climate, educational quality, and teacher job satisfaction: Compared to teachers from control schools, teachers from SFA schools showed larger increases in ratings of school climate from 1998-99 to 2000-01 (SFA teacher ratings increased from 4.1 to 4.3, compared to no change (4.0) for control school teachers). Educational quality ratings also grew more for SFA teachers than for control teachers (SFA teachers' ratings of educational quality grew from 3.9 to 4.3, compared to no change (3.9) for control school teachers). Job satisfaction ratings increased by .4 points (4.1 to 4.5) for SFA teachers and by .1 points (4.4 to 4.5) for teachers from comparison schools.

      Students from SFA schools rated school climate as 4.1 in 1998-99 and 4.2 in 2000-01, while students from control schools remained steady in their rating of school climate (4.3). Students from SFA schools rated educational quality as 4.3 in 1998-99 and 4.5 in 2000-01, while students from control schools rated educational quality as 4.5 in 1998-99 and 4.5 in 2000-01.

      Parents from SFA schools had higher increases in ratings of school climate than parents from control schools (4.0 to 4.4 for SFA parents and 4.2 to 4.4 for control parents). Parents from SFA schools and parents from control schools had identical ratings of educational quality, 4.1 in 1998-99 and 4.4 in 2001-02.

      Limitations

      This study contains the following significant limitations:

      • The authors do not explain the low survey response rates for teachers, who presumably were encouraged to take the survey, or for students, who presumably were required to take it.
      • The authors report no systematic analysis of non-response bias in the survey results, especially among parents and students.
      • For the first three research questions, the authors do not report significance levels.
      • The authors rely on fidelity of implementation to justify different outcomes by school, but do not measure fidelity in the study.
      • The SFA schools self-selected into the program, which may introduce selection bias.
      • Although matched at the school level, the analysis was done at the individual level.

      Jones, E., Gottfredson, G., & Gottfredson, D. (1997). Success for some: An evaluation of a Success for All program. Evaluation Review, 21(6), 643-670.

      This evaluation examined a Success for All program implemented in Charleston, South Carolina, in the late 1980s. The analysis suggests that fidelity in program implementation is crucial and that the designs of previous SFA evaluations may have compromised the validity of the generally positive results that SFA has enjoyed thus far. This study is unique in that it includes additional outcomes that are not endorsed by the SFA developers but are used by school districts to judge school achievement.

      Evaluation Methodology

      Design: This quasi-experimental study evaluated a single Success for All (SFA) program in Charleston, SC. The SFA school was matched with a comparison school based on "demographics" and "history of performance on district standardized tests." The SFA program was implemented in 1989-90, with pre-test data collected in fall 1989 for kindergarten and first-grade students (Cohort 2 and Cohort 1, respectively) and in fall 1990 for kindergarten students (Cohort 3). Cohorts 1 and 2 were re-tested in the 1990-91 and 1991-92 school years (one and two years from baseline). Cohort 3 was tested again in 1991-92 (two years from baseline).

      The base sample sizes for Cohorts 1, 2, and 3 were 172 (113 SFA and 59 control), 157 (109 SFA and 48 control) and 169 (117 SFA and 52 control), respectively. The authors did not report why the SFA sample was almost twice the size of the control sample. Only students who were consistently enrolled in the same school through the course of the study were included in the analysis.

      Attrition: Only students who had attended the schools consistently for the length of the study were eligible for final analysis. The number of actual students used in the final analysis excluded students with missing data, regardless of whether the data were missing due to attrition, absence, or some other reason. The sample sizes used in the calculation of each outcome varied according to how many students happened to take the assessment on the day it was offered.

      Sample characteristics: Each study school was approximately 50% male and at least 99% African American.

      Measures: This study was somewhat unique in that it used the typical SFA measures of literacy achievement, but also used measures that are more typically required by school districts to assess school achievement.

      Pretest

      • Cognitive Skills Assessment Battery

      SFA outcome measures

      • Woodcock Reading Mastery Tests-Revised
      • Durrell Test of Reading Difficulty

      District outcome measures

      • Merrill Language Screening Test
      • Test of Language Development
      • Basic Skills Assessment Program
      • Stanford Achievement Test
      • Teacher achievement ratings
      • Teacher behavior ratings

      The SFA outcome measures were not collected in the third year of the study because, according to the authors, the developers had “lost interest” in the evaluation.

      Analyses: Analyses were run for each cohort and for each year separately. Means were adjusted for pretest scores and calculated for the treatment and comparison schools using ANCOVA. The standardized regression coefficients were calculated from multiple regression models in which the test score was the dependent variable, and pre-test score and treatment status were the independent variables.
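      The standardized regression coefficients described above can be illustrated for the treatment predictor: with the outcome, pretest, and treatment indicator all standardized, the beta for treatment follows directly from the pairwise correlations. The correlation values below are hypothetical, not taken from the study.

```python
# Sketch of the standardized regression coefficient (beta) for treatment
# status in a two-predictor model (outcome ~ treatment + pretest), with
# all variables standardized. Correlation inputs are invented.

def standardized_beta(r_yt, r_yp, r_tp):
    """Beta for treatment from pairwise correlations.
    r_yt: corr(outcome, treatment); r_yp: corr(outcome, pretest);
    r_tp: corr(treatment, pretest)."""
    return (r_yt - r_yp * r_tp) / (1 - r_tp ** 2)

# Hypothetical values: treatment weakly related to outcome (.15),
# pretest strongly related (.60), slight treatment-pretest imbalance (.05).
print(round(standardized_beta(0.15, 0.60, 0.05), 3))
```

When treatment and pretest are uncorrelated (r_tp = 0), the beta reduces to the simple outcome-treatment correlation.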

      Outcomes

      Baseline Equivalence: The comparison school was chosen based on its similarity to the treatment school in demographics (gender and race/ethnicity) and history of performance on district standardized tests. The student sample was roughly evenly split by gender (although Cohort 1 from the control school was 64% male). All of the study schools were almost exclusively African American. The authors did not report on significance of baseline equivalence.

      Fidelity: This implementation of SFA was severely compromised. One of the requirements of SFA is that faculty agree to the new program with an 80% majority in a secret ballot. The SFA school in this study was required to participate by the school district. Also, Hurricane Hugo had occurred just before the program was implemented, which caused a good deal of disruption in implementation. The researchers also noted that the SFA facilitator had a somewhat hostile relationship with some teaching staff and that the components of the program (e.g., assessing progress every eight weeks and making reading group adjustments) were not evenly implemented.

      Differential Attrition: Neither of the two schools dropped out of the study. The analysis was conducted only on students who were enrolled continuously at their schools and were non-absent on the day of the assessments. No analysis of the effects of student mobility or absence on the outcomes was reported.

      Posttest: The outcomes that follow are based on multiple regression betas and include SFA developer outcomes (the Woodcock and Durrell assessments) and school district outcomes. The results indicate that the program appeared to successfully influence achievement in kindergarten, but that the effects did not continue into 1st and 2nd grade.

      For Cohort 1 (1st grade in Year 1), none of the developer literacy outcomes or school district outcomes were significant in Years 1 or 2. In fact, the SFA program appeared to have a negative effect on math achievement in Year 1 (beta = -.28, p<.01). Year 3 SFA developer outcome data were not collected, and Year 3 school district outcomes were generally nonsignificant.

      For Cohort 2 (kindergarten in Year 1), with only a few exceptions, the developer literacy outcomes and the school district outcomes were generally significant and positive for the SFA program in Year 1. However, with the exception of scores on the Woodcock Word Attack assessment, all the positive effects of SFA disappeared by the end of Year 2 (1st grade). Year 3 SFA developer outcome data were not collected, and none of the school district outcomes were significantly positive.

      For Cohort 3 (kindergarten in Year 2), the developer literacy outcomes were strongly positively significant in Year 2, and the district outcomes were generally significant and positive as well. However, the effects of SFA on the school district measures disappeared in Year 3 (no SFA developer outcome data were collected in Year 3).

      The authors conclude that buy-in, cooperation, and implementation are crucial in allowing SFA to function properly and produce positive results. A school culture that approves an SFA implementation may be very different from one that would not vote to approve it. Thus, the authors strongly believe that comparison schools should also have voted in SFA, which would reduce the likelihood that differences in outcomes are due to factors other than SFA.

      Limitations

      This quasi-experimental study had the following limitations:

      • The program produced evidence of iatrogenic effects on math achievement.
      • Possibly because of the lack of staff approval of SFA or because of Hurricane Hugo, fidelity was extremely weak, so it is difficult to determine whether the results (or lack thereof) are indicative of how a well-implemented SFA program might perform.
      • As with all other SFA evaluations that use this design, the matching of schools on demographics and history of performance may not be strong enough to allow researchers to conclude that differences in outcomes are due to SFA.
      • Although matched at the school level, the analysis was done at the individual level.

      Slavin, R. E., & Madden, N. A. (1998). Success for All/exito para todos: Effects on the reading achievement of students acquiring English. Report No. 19, Baltimore, MD: Center for Research on the Education of Students Placed at Risk.

      Success for All (SFA) can be used with native English speakers but, with a few adaptations, can also be used with children who are learning English as a second language (ESL). The adaptation focuses on integrating the work of ESL teachers and reading teachers. SFA itself has also been translated into Spanish and is used in Spanish bilingual programs to help children read in both Spanish and English. This study summarizes evaluation results for these modified SFA programs.

      FRANCIS SCOTT KEY ELEMENTARY (ESL) - PHILADELPHIA

      Evaluation Methodology

      Design: SFA was implemented in the Fall of 1988. For this study, Francis Scott Key was matched to a “similar Philadelphia elementary school.” The exact factors that were used for the match were not specified.

      Fourth and fifth grade students from both schools were individually assessed for reading in Spring 1995, seven years after SFA was implemented. All of the students were assessed, whether or not they had been at the school from kindergarten.

      Attrition: The study was conducted on all students in 4th and 5th grade in spring 1995.

      Sample Characteristics: During the study, 622 children were enrolled in Francis Scott Key, with 365 enrolled in K-3 and 257 enrolled in grades 4 and 5. The school was 62% Asian-American (mostly of Cambodian descent), 21% White, and 15% African American. Almost none of the Asian Americans spoke English when they entered kindergarten. Ninety-six percent of students in the school qualified for free lunch.

      Measures: As a posttest, the 4th and 5th grade students were individually administered three scales from the Woodcock Language Proficiency Battery – Word Identification, Word Attack, and Passage Comprehension.

      Analyses: ANOVA was conducted for each reading outcome separately (with no baseline controls, given the cross sectional sample). Effect sizes were calculated as the difference in the means divided by the comparison school’s standard deviation.
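      The effect-size computation stated above (treatment-comparison mean difference divided by the comparison school's standard deviation, commonly called Glass's delta) can be sketched as follows, with invented scores rather than the study's data:

```python
# Sketch of Glass's delta: mean difference divided by the comparison
# group's standard deviation. Scores below are hypothetical.

def glass_delta(treatment, comparison):
    """Effect size: (mean_t - mean_c) / SD of the comparison group (n-1)."""
    def mean(v):
        return sum(v) / len(v)
    mc = mean(comparison)
    var_c = sum((x - mc) ** 2 for x in comparison) / (len(comparison) - 1)
    return (mean(treatment) - mc) / var_c ** 0.5

# Hypothetical raw reading scores:
sfa = [52, 58, 61, 65, 64]
control = [48, 50, 55, 57, 60]
print(round(glass_delta(sfa, control), 2))
```

Dividing by the comparison group's SD (rather than a pooled SD, as in Cohen's d) keeps the treatment from influencing the scaling unit.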

      Outcomes

      Baseline Equivalence: The comparison school's enrollment in spring 1995 was 1,128 students, while Francis Scott Key's enrollment was about half that - 622 students. The treatment school was 62% Asian-American, 21% White, and 15% African American. The comparison school had a much different racial distribution - 65% African American, 33% Asian American, and 0% White. Given the differences in school size, however, the number of Asian Americans at each school was similar (62% of 622 and 33% of 1,128). The actual sample sizes used in the final analysis were very low (52 in the SFA school, 64 in the control school).

      Fidelity: No measures of fidelity were reported.

      Differential Attrition: Differential attrition was not addressed. The authors do not describe how they arrived at sample sizes of 52 for Francis Scott Key and 64 for the control school.

      Posttest: This analysis did not incorporate controls for pretests, so the results should be interpreted with caution.

      Among Asian-Americans, Francis Scott Key 4th grade students had significantly higher scores on all three components of the Woodcock Reading Scales, compared to the control school (effect sizes were 1.54, 1.49, and .62). Among Non-Asians, only Word Identification was significantly higher for Francis Scott Key 4th grade students, compared to the control school students (effect size of .64).

      Among Asian-American 5th grade students, Francis Scott Key students had significantly higher scores on all three components of the Woodcock Reading Scales, compared to the control school students (effect sizes of 1.40, 1.33, and .75). Among Non-Asians, Francis Scott Key 5th grade students had significantly higher scores on two components, compared to the control school (effect sizes of .66 for Word Identification and .92 for Word Attack).

      Limitations

      • The researchers did not control for pre-test scores or any other school-based factors that could have caused differences in outcomes.
      • The comparison school had a very different racial distribution than the treatment school.
      • The researchers did not indicate how the final n's were derived.
      • The schools self-selected into the SFA program, which introduces the potential for selection bias into the results.

      FAIRHILL ELEMENTARY (BILINGUAL) – PHILADELPHIA

      Evaluation Methodology

      Design: A bilingual version of Success for All (Exito Para Todos) was first implemented at Philadelphia’s Fairhill Elementary school in 1992. A control school was selected that used a “Sheltered English” type of instruction. Sheltered English is a model in which the instruction is considered to be bilingual, but in fact emphasizes native language instruction.

      This study reported findings as of 1996 for 3rd graders who had been in Fairhill and the control school from kindergarten to 3rd grade.

      Attrition: No mention was made of attrition, although the analytic sample excluded students who lacked pretest data because they had transferred into the study schools between kindergarten and the end of grade 3. It is unclear how the final sample was derived; the final analytical n was 21 for Fairhill and 29 for the comparison school.

      Sample Characteristics: The researchers did not present characteristics of the student sample; rather, they presented school characteristics. Enrollment was 694 in the SFA school and 706 in the comparison school. Each school was about 77% Hispanic (mostly Puerto Rican) and 23% African American, and about 96% of students were eligible for free lunch. Only about 20% of each school's students were enrolled in bilingual programs.

      Measures: The students were pre-tested at the beginning of grade 1 with the Spanish Peabody Picture Vocabulary Test (PPVT). In grade 3, students were individually assessed with the Spanish and English versions of the Woodcock Language Proficiency Battery (Word Identification, Word Attack, and Passage Comprehension).

      Analyses: The study used ANCOVA for each outcome measure, with the pretest Spanish PPVT as the covariate.

      Outcomes

      Baseline Equivalence: Fairhill Elementary and the comparison school were very similar in terms of enrollment, percent Hispanic, percent African American, percent in bilingual programs, percent free lunch, and past performance on standardized exams. No significance tests were reported. The authors did not report on baseline equivalence of the actual analytical sample (n=21 for Fairhill and n=29 for the comparison school).

      Fidelity: The authors did not report on the fidelity of the implementation at Fairhill.

      Differential Attrition: Differential attrition was not addressed. Again, the study does not make clear who exactly is subject to the data analysis. The final analytical n is 21 for Fairhill and 29 for the comparison school. The authors do not explain how the sample was derived from an enrollment of roughly 700 for each school, with about 20% in bilingual programs.

      Posttest: Fairhill students performed significantly better than the comparison students in all three Spanish Woodcock scales, with effect sizes over 2. With respect to the English Woodcock scales, only Word Attack was significantly higher for the Fairhill students compared to the control students.

      Limitations

      • The sample sizes are somewhat small.
      • The researchers do not provide enough information regarding how the final sample was derived.
      • Fairhill self-selected into the SFA program, which introduces the potential for selection bias into the results.

      ARIZONA (ESL)

      Evaluation Methodology

      Design: This evaluation compared Spanish-dominant first graders in two Success for All schools to those in three locally developed Title I schoolwide projects and one Reading Recovery school. All six schools came from the same Arizona school district.

      The schools were assigned to one of two strata, based on percent free lunch and percent Hispanic. Stratum 1, defined as more impoverished, was characterized as having at least 81% of students eligible for free lunch and 50% Hispanic. The Stratum 1 schools included one SFA school and two schools with locally developed Title I projects. Stratum 2, defined as relatively less impoverished, was characterized as having 53% of students eligible for free lunch and 27% Hispanic. The Stratum 2 schools included one SFA school, one school using a locally developed Title I project, and the Reading Recovery school.

      Attrition: Attrition was not addressed.

      Sample Characteristics: Other than the data provided on how the schools were assigned to strata, no sample characteristics were provided. The sample sizes were in the 20s for each of the six schools.

      Measures: Kindergarten children were pretested with the Peabody Picture Vocabulary Test (PPVT). The same children were post-tested in grade 1 with the three scales from the Woodcock Reading Assessment (Word Identification, Word Attack, and Passage Comprehension) and the Durrell Oral Reading Test.

      Analyses: ANCOVA with PPVT as the covariate was used to assess differences in outcomes by school.

      Outcomes

      Baseline Equivalence: No data were given on baseline equivalence other than the general characteristics of the schools, by stratum.

      Fidelity: Fidelity of implementation was not assessed.

      Differential Attrition: The researchers did not address differential attrition, nor did they provide any information on how the final analytical sample was determined.

      Outcomes: Within the more impoverished schools (Stratum 1), SFA outcomes were not significantly different from those of the other schools. Within the less impoverished schools (Stratum 2), SFA outcomes were significantly better than those of the other schools for Word Attack, but not for the other reading outcomes.

      Limitations

      • The sample sizes are somewhat small (in the 20s).
      • The researchers do not provide enough information regarding how the final sample was derived.
      • Results are mixed - SFA was not associated with significant literacy improvements in the more impoverished schools, but was associated with improvements in Word Attack (only one of the four literacy domains assessed) among the less impoverished schools.
      • The schools self-selected into SFA, which introduces the possibility of selection bias into the results.
      • Matching was done at the school level, yet the analysis was at the student level.

      Livingston, M., & Flaherty, J. (1997). Effects of Success for All on reading achievement in California schools. San Francisco, CA: WestEd.

      This evaluation was conducted by WestEd, an organization that formally partnered with the Success for All (SFA) developers to provide technical assistance and training to Success for All schools in the western region of the U.S. In this evaluation, WestEd assessed the effectiveness of SFA on reading achievement for English language learners (ELLs) as well as native English speakers.

      Evaluation Methodology

      Design: This quasi-experimental design compared reading outcomes for three cohorts of students from three SFA schools to three cohorts from three matched comparison schools. Pretesting took place in kindergarten in fall 1992 (1992 cohort), fall 1993 (1993 cohort) and fall 1994 (1994 cohort). Posttests were given in the spring of 1993, 1994, and 1995. Thus, the 1992 cohort had three years of data, the 1993 cohort had two years of data, and the 1994 cohort had one year of data.

      The authors did not indicate how the study schools were selected. Comparison schools from the same cities as the treatment schools were chosen based on "student demographics and other selected factors."

      The treatment and control schools were Fremont Elementary and Taft Elementary from Riverside, CA; Orville Wright Elementary and Garrison/Kelly Elementary from Modesto, CA; and El Vista Elementary and Tuolumne Elementary also from Modesto, CA. The students were pretested in kindergarten, and the baseline sample sizes were 118 for Fremont, 142 for Taft, 72 for Orville Wright, 135 for Tuolumne, 90 for El Vista and 90 for Garrison/Kelly.

      In the treatment schools, the SFA program was modified to be more appropriate to ELL students.

      Sample characteristics: The authors did not provide sample characteristics at the student level. Rather, the characteristics of the schools were presented as of Spring 1992. All six schools had reading scores below the 60th percentile and all had at least 50% minority enrollment.

      Measures: All kindergarten students were pretested with the Peabody Picture Vocabulary Test. The assessors were current and former classroom teachers who had received training on proper administration of the test. The posttests were three scales from the Woodcock Language Proficiency Battery (Word Identification, Word Attack, and Passage Comprehension).

      Analysis: To prepare for the analysis, the students were divided into four analytical groups, defined as follows:

      • English Speakers: Dominant language in kindergarten was English, the pretest language was English, the instruction was in English, and the posttest was in English.
      • Spanish Bilingual: Dominant language in kindergarten was Spanish, the pretest was in Spanish, the instruction was in Spanish, and the posttest was in Spanish.
      • Spanish ESL: Dominant language in kindergarten was Spanish, the pretest was in Spanish, the instruction was in sheltered English, and the posttest was in English.
      • Other ESL: Dominant language in kindergarten was not English or Spanish, the pretest was in English, the instruction was in English, and the posttest was in English.

      ANCOVA analyses were conducted within each analytical group and cohort, with the PPVT pretest score as the covariate. Effect sizes were calculated.

      Outcomes

      Baseline Equivalence: The treatment schools were somewhat equivalent to their matched schools on the following characteristics: historical reading scores, percent AFDC, percent free lunch, percent minority, percent ELL, and percent Spanish speaking. The authors did not indicate whether the differences between treatment and comparison schools on these factors were statistically significant. Baseline equivalency at the student level was assessed with the PPVT pretest scores, and there were no differences between treatment and control students within each analysis group and cohort.

      Differential Attrition: The authors did not address differential attrition. They also did not address student mobility in and out of the control and treatment schools.

      Posttests: Importantly, the researchers did not do tests of statistical significance for any of the results.

      For English speakers, the SFA program showed moderate positive effect sizes for the 1992 Cohort (effect sizes = .41, .42, and .23 for grades 1, 2, and 3, respectively) and for the 1993 Cohort (effect sizes = .87 and .34 for grades 1 and 2, respectively), and a weak positive effect for the 1994 Cohort (effect size = .27). In general, the effect size decreased over time within cohort.

      For the Spanish Bilingual group, the SFA program showed extremely strong effects early, but the effects declined over time. Specifically, the effect sizes for the 1992 Cohort were 1.36, .19, and .09 for grades 1, 2, and 3, respectively. The effect sizes for the 1993 Cohort were 1.32 and .72 for grades 1 and 2, respectively. The effect size for the 1994 Cohort was 1.4 for grade 1.

      For the Spanish ESL group, the SFA program effects were similar to those for the Spanish Bilingual group. Specifically, the effect sizes for the 1992 Cohort were .97, .45, and .03 for grades 1, 2, and 3, respectively. The effect sizes for the 1993 Cohort were .72 and .43 for grades 1 and 2, respectively. The effect size for the 1994 Cohort was 1.41 for grade 1. However, the sample sizes for the SFA students in this group were extremely low - n=7 for the 1992 Cohort, n=4 for the 1993 Cohort, and n=4 for the 1994 Cohort.

      For the Other ESL group, the SFA program effects were small to moderate. The effect sizes for the 1992 Cohort were .24, .25, and .05. The effect sizes for the 1993 Cohort were .96 and .49 for grades 1 and 2, respectively. The effect sizes for the 1994 Cohort were essentially zero. Again, the general trend was decreasing effect sizes over time.

      To address the general trend toward lower effect sizes over time within cohort, the authors provided grade equivalencies for each cohort and analytical group. For the Spanish Bilingual and Other ESL groups (the Spanish ESL sample sizes are too low to be trusted), the grade equivalency differentials between treatment and control at grade 3 appear to be quite similar. Without tests of statistical significance, the case for non-decreasing effects is difficult to make.

      Limitations

      The limitations of this study include:

      • No tests of statistical significance were presented.
      • The sample sizes for the Spanish ESL students were so small that the results are extremely difficult to interpret.
      • The effectiveness of SFA declined over time and may have been nonsignificant by grade 3.
      • The authors did not report an analysis of differential attrition.

      Chambers, B., Cheung, A., Madden, N., Slavin, R. E., & Gifford, R. (2005). Achievement effects of embedded multimedia in a Success for All Reading Program. Technical Report. Center for Research and Reform in Education, Johns Hopkins University.

      Evaluation Methodology

      Design: This study used a cluster randomized trial design to identify the effects of using embedded multimedia in SFA programs. Staff from ten SFA elementary schools in an inner city Hartford, CT school district agreed to implement the embedded multimedia component. Five of the ten schools were randomly chosen to implement the multimedia component of SFA, and the other five served as the control group for the first year, using SFA without multimedia. After the first year, the control group was given the embedded multimedia component.

      The components of the embedded media treatment included:

      • Animal Alphabet: Animations that teach and reinforce sound/symbol relationships.
      • The Sound and the Furry: Videos in which SFA puppets model the word blending process, phonemic awareness, spelling, fluency, reading strategies, and cooperative routines.
      • Word Plays: Live action videos of skits dramatizing important vocabulary concepts from the Success for All beginning reading texts.
      • Between the Lions: Clips from the award-winning PBS program in which puppets and animations teach phonemic awareness, sound/symbol correspondence, and sound blending.

      The subjects were SFA first grade students who were pretested in early October 2003 and posttested in early May 2004.

      Sample characteristics: The SFA embedded media schools and the SFA control schools were very similar. Each group of schools had an enrollment of just over 200 1st grade students; more than 95% of the students from each group of schools received free lunch; and about 30% of the students from each group were classified as limited English proficient (LEP). The racial/ethnic distribution was also very similar, with both groups of schools enrolling about two-thirds Hispanic students and one-third African American students. The authors did not provide characteristics of the actual sample of first grade students.

      Attrition: Of the 450 first graders enrolled in all ten schools in the fall of 2003, 394 completed pre- and posttests (n=189 in treatment schools, n=205 in control schools).

      Measures: The pretests were the Peabody Picture Vocabulary Test (PPVT) and the Word Identification subtests from the Woodcock Reading Mastery Test. Each testing session took approximately 30 minutes per child.

      The posttests were the Dynamic Indicators of Basic Early Literacy Skills (DIBELS) and three scales from the Woodcock Reading Mastery Test: Word Identification, Word Attack, and Passage Comprehension. Testing sessions were about 42 minutes per child.

      Analyses: The data were analyzed using Hierarchical Linear Modeling with students nested within schools. The dependent variables were the DIBELS score and the three subscales of the Woodcock Reading Mastery Test. The independent variable was treatment condition and the PPVT and Word ID pretest were used as covariates. The analysis was conducted on the entire sample and on a sub-sample of Hispanics.
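The nested analysis described above can be sketched in a few lines. This is an assumption-laden illustration (synthetic data, hypothetical variable names, and statsmodels' MixedLM rather than the authors' actual HLM software): a two-level model with a random intercept for school, treatment condition as the predictor of interest, and a pretest score as a covariate.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_schools, n_per = 10, 40

# School-level treatment assignment (5 treatment, 5 control), as in the study design.
school = np.repeat(np.arange(n_schools), n_per)
condition = np.repeat(np.tile([0, 1], n_schools // 2), n_per)

# Simulated pretest covariate, random school intercepts, and outcome
# with a built-in treatment effect of 3 points.
ppvt = rng.normal(100, 15, n_schools * n_per)
school_effect = rng.normal(0, 2, n_schools)[school]
dibels = 50 + 3 * condition + 0.2 * ppvt + school_effect \
         + rng.normal(0, 5, n_schools * n_per)

df = pd.DataFrame({"dibels": dibels, "condition": condition,
                   "ppvt": ppvt, "school": school})

# Students nested within schools: random intercept per school.
model = smf.mixedlm("dibels ~ condition + ppvt", df, groups=df["school"]).fit()
print(model.params["condition"])  # estimated treatment effect on the simulated data
```

With only ten clusters, as here, school-level standard errors are imprecise, which is one reason the small number of schools appears among this study's limitations.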

      Outcomes

      Baseline Equivalence: The authors did not provide demographic baseline equivalency data on the first grade students. At the school level, no significant differences existed between the embedded media SFA schools and the SFA control schools on mean PPVT and mean Word Identification scores. At the individual level, however, Word Identification scores for students from the control schools were higher (p<.01) than those for students from the embedded media SFA schools.

      Fidelity: The researchers did not measure or report on fidelity.

      Differential Attrition: The authors did not present an analysis of the 56 students who did not complete both pre- and posttests.

      Posttest: Only one of the four outcome measures showed significant effects for the embedded media SFA program. Specifically, embedded multimedia SFA schools scored significantly higher than the control SFA schools on the Word Attack subtest (p<.05 and individual ES=.47), but did not score significantly better on Word Identification, Passage Comprehension, or the DIBELS assessment. This pattern of outcomes held for the Hispanic subset as well.

      The authors expected Word Attack to be the measure most likely to show effects because three of the four multimedia segments dealt primarily with letter sounds and sound blending, which are key components of Word Attack. The fourth, Word Plays, focused on vocabulary. The other measures, especially Passage Comprehension and DIBELS, are more logically related to reading of connected text, which was emphasized equally in both groups.

      Limitations

      The limitations of this study include:

      • The school-level sample size was small (n=10).
      • The study does not address how multimedia may impact SFA students on important literacy outcomes other than phonetics.
      • No long-term follow-up.
      • No differential attrition analysis.

      Quint, J. C., Balu, R., DeLaurentis, M., Rappaport, S., Smith, T.J., & Zhu, P. (2013). The Success For All model of school reform: Early findings from the Investing in Innovation (i3) scale-up. New York: MDRC.

      Quint, J. C., Balu, R., DeLaurentis, M., Rappaport, S., Smith, T.J., & Zhu, P. (2014). The Success For All model of school reform: Interim findings from the Investing in Innovation (i3) scale-up. New York: MDRC.

      Quint, J. C., Zhu, P., Balu, R., Rappaport, S., & DeLaurentis, M. (2015). Scaling up the Success for All model of school reform. New York: MDRC.

      Evaluation Methodology

      Design: This study used a randomized controlled trial to estimate program impacts on K-2 reading over three years of a multi-year evaluation project. The study recruited five school districts in four states for a total sample of 37 schools and examined the effects of the intervention from the 2011-2012 school year through the 2013-2014 school year. Each of the schools had to be willing to participate and meet the following eligibility criteria: it had to serve students from kindergarten through fifth grade; at least 40% of students had to be eligible for the free and reduced-price lunch program; it had to identify a school staff member to serve as program facilitator; and at least 75% of teachers had to vote to adopt the program. The 37 schools were randomly assigned to a condition, resulting in 19 intervention schools and 18 control schools.

      The study followed the 2,956 kindergarten students enrolled in the 37 schools in the fall of the 2011-2012 school year that were not enrolled in separate special education classes. Pretests were given in the fall and first-year posttests were administered in the spring. The analysis sample included 2,568 kindergarten students who were present in the study schools in the fall and spring of the school year and who had valid spring test scores. An additional sample used in supplemental models included any kindergarten student with a valid spring test score, regardless of whether the student was enrolled in the study school in the fall (N=2,897).

      Follow-up data from the spring of students' first grade year were collected in 2013. A total of 2,251 students (though N was as low as 2,147 for one measure) who remained enrolled in a school of the same type (treatment or control) and completed assessments in spring constituted the analytic sample.

      At the 3-year follow-up in 2014, up to 1,635 students (55%) had scores on the outcome measures.

      Sensitivity analyses were also performed among all students completing measures in first and second grade regardless of whether students attended a program school in previous years (N ranged from 2,802 to 2,962 across measures).

      Of those enrolled in a study school at baseline, 10.4% of program students and 9.8% of control students transferred to a non-study school. Some students transferred from one study school to another, and these students’ treatment statuses were determined by the status of the fall school. Of the students in the program group at baseline, 0.9% transferred to a control group school; of those in control schools at baseline, 0.6% changed to a program group school. Of the total treatment sample, 63% were in the treatment group for all 3 years.

      Sample Characteristics: Study schools were located in the West, South, and Northeast regions of the country, with most located in large or midsize cities. The average school enrollment was 547 students. Across the sample, the kindergarten students averaged 5.5 years old and were evenly divided across gender. Most students were Hispanic (64-65%), followed by black (20%), white (13-14%), other race/ethnicity (1-2%), and Asian (1-2%). Over 88% of the sample came from families in poverty. Between 18 and 25% of the students were English language learners and a small percentage (8%) were in special education.

      Measures: At posttest, two measures came from the “Basic Reading” achievement cluster of the Woodcock-Johnson III Tests of Achievement, developed and validated by others. Students who were instructed primarily in Spanish were given Spanish and English versions of these assessments. The study used raw scores for these measures, since the standard scores would rely on the test’s norming sample that was reported to be out of date (p. 126).

      • Letter-word identification test. This assessment measures a student’s letter and word identification skills and tests reading decoding.
      • Word attack test. This test measures a student’s ability to apply phonic/decoding skills to unfamiliar words.

      At the first and second grade follow-ups, two additional measures from the Woodcock-Johnson reading cluster assessed more advanced reading skills:

      • Test of Word Reading Efficiency. Assesses efficiency of sight word recognition and phonemic decoding in children.
      • Passage Comprehension. Students orally supply the missing word removed from a sentence or brief paragraph.

      The study also administered the letter-word test at baseline. Additionally, the following measure was collected at baseline:

      • Vocabulary test score, using the Peabody Picture Vocabulary Test, developed and validated by others.

      Analysis: The study conducted two-level hierarchical models that nested students within schools and treated the five districts as fixed effects. Models included school- and student-level covariates. It appears that student-level and school-average baseline outcome scores for the letter-word test and the vocabulary test were controlled (Appendix F, p. 123-4 in Quint et al., 2013). Baseline scores for word attack do not appear to have been included as covariates, but they may not have been developmentally appropriate at pretest. To adjust for multiple tests, the year-1 analysis applied the Benjamini-Hochberg procedure, while the year-2 and year-3 analyses noted results after adjustment in appendices and footnotes rather than in all analyses. Additional analyses were performed for the full sample of students assessed in spring of first grade, regardless of Kindergarten program exposure.
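The Benjamini-Hochberg procedure mentioned above is a step-up method that controls the false discovery rate across multiple outcome tests: the sorted p-values p(1) <= ... <= p(m) are compared against thresholds (i/m)q, and all hypotheses up to the largest i satisfying p(i) <= (i/m)q are rejected. A minimal sketch (not the authors' code; the p-values below are hypothetical):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank i with p_(i) <= (i/m) * q.
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            max_k = rank
    # Reject every hypothesis at or below that rank.
    reject = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= max_k:
            reject[idx] = True
    return reject

# Two hypothetical outcome tests: only the smaller p-value survives adjustment.
print(benjamini_hochberg([0.012, 0.21], q=0.05))  # [True, False]
```

This illustrates why a result can be significant in unadjusted analyses yet only marginally significant after adjustment, as seen in the kindergarten word attack findings below.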

      Moderation analysis applied the same multilevel models to the following subgroups: Blacks, Whites, Hispanics, males, females, special education, not special education, English language learners, non-English language learners, poverty status, and not poverty status. Additionally, models determined whether program effects varied across initial achievement levels through terms interacting condition status with baseline vocabulary test score and baseline letter-word test score.

      Intent-to-Treat: The study used all subjects with outcome data. Students with missing outcomes, due primarily to moving to non-study schools and secondarily to missing the assessments, were dropped from analysis. Students new to the study schools, and not present for the full program, were included in separate analyses. Students missing covariates (but not outcomes) were included with covariates indicating missing values.

      Outcomes

      Implementation Fidelity: Although teachers voiced some concerns, “by the end of the first year, all but one of the study schools were deemed to have met SFAF’s standards for adequate first-year implementation, although there was also considerable room for improving the breadth and depth of that implementation” (p. ES-4, 2013). Further information on implementation fidelity is reported in Chapter 4 of the 2013 report and Chapter 3 of the 2015 report.

      By the end of the second year (Quint et al., 2014, pp. 5-13), “program group schools improved their implementation of SFA… [putting] in place new practices that they had not previously implemented, and they increased the proportion of classrooms within a school where SFA-prescribed practices were in evidence” (p. 5). However, due to stricter standards for implementation as schools progress with the program, only “16 of the 19 program schools were judged to meet SFAF’s standards for adequate implementation fidelity” (p. 8), and qualitative assessments from teachers implementing the program indicated that they “reported feeling much more at ease with the SFA initiative in the second year than in the first year, although they continued to express some concerns about the program” (p. 11). Perhaps most notably, intervention group teachers were significantly less likely than controls to believe that their reading program helps adequately prepare students to do well on state achievement tests. During the 3rd year (Quint et al., 2015, p. 27), 17 of 19 schools achieved adequate implementation fidelity.

      Baseline Equivalence: Program and control schools did not differ on free and reduced-price lunch eligibility, race/ethnicity, sex, school enrollment, number of full-time teachers, or percentage of students at or above reading proficiency level. Program and control students did not differ on age, poverty status, race/ethnicity, sex, special education status, or vocabulary test score. Marginally significant differences (p<.10) across condition status were noted for English language learner status and letter-word identification test score. See Quint et al. (2013, p. 14).

      Differential Attrition: All studies tested for different rates of attrition by condition, and two studies examined differential attrition by testing for baseline equivalence in the analytic sample, after excluding dropouts.

      At the end of year 1 (Quint et al., 2013), there were no statistically significant differences across conditions for students transferring schools, including changes to another study school, to a non-study school, or to either a study or non-study school in spring of students’ Kindergarten year. In addition, there was no significant relationship between condition status and the proportion of in-movers (students enrolled in a study school in the spring, but not the fall).

      At the end of year 2 (Quint et al., 2014), tests for differential attrition among those retained in the spring of students’ first grade year revealed no significant differences in response rates by condition, but one marginally significant difference (p= .058) on teacher surveys measuring implementation. Baseline sociodemographic or outcome measures were not tested for differential attrition.

      At the end of year 3 (Quint et al., 2015, Table 2.5), the study reported no significant differences in attrition across conditions. Further, tests for baseline equivalence of the analysis sample (Table 2.4), which excluded those lost to attrition, revealed no significant differences across conditions. Appendix B indicates some differential attrition. Specifically, Table B.3 shows that out-movers differed significantly on several measures from those retained for the analysis sample, and Table B.4 shows that out-movers in the intervention group differed significantly on several measures from the control group out-movers. However, based on Table 2.4, the differential attrition was not strong enough to compromise the randomization.

      Kindergarten Posttest: Adjusting for multiple hypothesis tests, the intervention group scored marginally significantly higher on the word attack test (p<.10), but not on the letter-word test. Without the adjustment, the impact of the program on intervention group word attack scores was significant (effect size=.18). Results using a sample that also included students who were not enrolled in the study school in the fall showed the same pattern, with word attack scores significantly improved among the treatment group (effect size=.18).

      Moderation Analysis: Positive and significant program effects for the word attack test were observed for males, black students, students in poverty, non-English language learners, and students not in special education. Hispanic and female students showed marginally significant improvements on word attack, while whites, students in special education, English language learners, and students not in poverty did not differ. No significant differences on letter-word test for any subgroup were reported. Among students who primarily received reading instruction in Spanish, analysis revealed no significant differences across conditions on four measures (English and Spanish letter-word and word attack tests).

      Additional models found that program effects did not vary by initial achievement. For both outcome measures, terms interacting condition status with baseline vocabulary test score and baseline letter-word test score were not significant when included separately or together.

      First Grade Follow-up: By spring of the students’ first grade year, the treatment group had made significant small-to-moderate improvements in word attack (effect size= .35) and marginally significant improvements in word identification (p= .08, effect size= .09) compared to controls. No treatment effects were observed for higher-level reading functions such as reading efficiency or passage comprehension.

      A supplementary analysis examining whether program effects persisted among a sample of all students who completed measures in spring (including those who did not attend program schools in Kindergarten) indicated that the treatment was still positively associated with improvements in word attack, but not word identification, relative to controls.

      Moderation Analysis: Positive, significant impacts of the program were observed for letter-word identification among Hispanic and female students. Similarly, Black, Hispanic, female, male, and non-English language learner students receiving the intervention improved word attack, relative to like controls. Treatment group Whites also improved passage comprehension; however, special education students performed significantly worse on 3 of 4 measures (letter-word identification, word attack, and passage comprehension) than their control group counterparts, an iatrogenic effect.

      Second Grade Follow-up: The study reported significant improvement in the treatment group for the Woodcock-Johnson Word Attack subtest of phonics decoding skills (p=.022, d = .15), but not for the other three reading tests. The program also had no impact on school-level measures of special education or grade retention rates.

      For a subset of the sample that had Woodcock-Johnson letter identification scores below the median score of the primary sample, the intervention had some additional marginal effects. Among “lower performing” students, the treatment group had better scores on the Woodcock-Johnson Letter-Word Identification (p=.074) and Word Attack (p=.014) tests and on the Test of Word Reading Efficiency (p=.099) at the second grade follow-up. There were no moderation effects for the Peabody Picture Vocabulary test. The study reported that results for socio-demographic groups were consistent with earlier results.

      Limitations:

      • Iatrogenic effects were observed for special education students on 3 of 4 outcomes

      Tracey, L., Chambers, B., Slavin, R. E., Hanley, P., & Cheung, A. (2014). Success for All in England: Results from the third year of a national evaluation. SAGE Open, 4, 1-10.

      Though the basic structure of the intervention is the same as that used in the U.S.-based studies, the authors state that the program was “substantially adapted to the language, culture, and standards of England, Scotland, and Wales” (p. 3). This means that instructional elements were emphasized while some of the family services aspects of the program were underutilized.

      Design

      The study evaluated the effects of the Success for All program using a quasi-experimental design. Twenty schools from a range of regional contexts throughout England that were already using the program were recruited in spring of 2008 to participate in the evaluation. Once these treatment schools consented to participate, researchers recruited 20 control schools whose academic and student demographic characteristics matched those of the treatment schools. The matches used prior test scores, the percentage of free-lunch-eligible students, and the percentage of students with English as an additional language for the full school rather than for the kindergarten subjects. The study did not present the number of students assigned to each group.

      Baseline measures were collected from students attending the 40 participating schools in fall of their reception or kindergarten year (September 2008). Measures were also collected at the end of kindergarten (spring 2009) and at the end of grades 1 and 2, though only the grade 2 (posttest) results were presented. At posttest, 36 schools (90%), 18 in both treatment and control conditions, were retained. The number of students in the posttest analysis varied by outcome. Tables 2 and 3 show that the number of students in the control group ranged from 381 to 471, and the number of students in the intervention group ranged from 356 to 415.

      Sample

      Little information was given describing the kindergarten student sample, though aggregate measures suggest that about 40% of pupils were eligible for free school meals, about 35% were English language learners, 23% had special educational needs that were met by the school, and 13% had special educational needs that were met by outside specialists.

      Measures

      The study’s outcome measures were collected at posttest, in the spring of students’ 2nd grade year. As with the other studies, measures primarily come from the Woodcock-Johnson Tests of Achievement, which were normed in the U.S. Testers were blind to condition.

      • Letter-word identification test. This assessment measures a student’s letter and word identification skills and tests reading decoding. Reported internal consistency was .97.
      • Word attack test. This assesses a student’s ability to apply phonic/decoding skills to unfamiliar words. Reported internal consistency was .87 for the measure.

      Additional measures of higher-order reading accuracy, reading rate, and comprehension came from the York Assessment of Reading Comprehension. Reliability for the three constructs was .87, .95, and .62 among the posttest sample.

      Baseline reading ability was assessed using a more developmentally appropriate measure, the British Picture Vocabulary Scale, Second Edition, an English adaptation of the Peabody Picture Vocabulary Scale. Cronbach’s alpha for the measure using a national sample of English children was .93.

      Analysis

      The program’s impact on reading outcomes at posttest was estimated using multilevel regression models, with students nested within schools. Analyses adjusted for baseline picture vocabulary scores at the school level, but not for demographic characteristics that differed between treatment groups.

      The study used all schools that were willing to continue to provide data and all students who were present on testing days. No effort was made to follow students who moved out of the study schools or into another study school.

      Outcomes

      Implementation Fidelity: Schools were rated by program personnel on 19 items related to teacher and student behaviors. Each item was rated on a scale of 0 to 3, with 3 indicating the highest fidelity. Of the 18 intervention schools, 10 received ratings of 3, 7 were rated 2, and only 1 was rated 1; nearly all schools thus had medium or high implementation ratings.

      Baseline Equivalence: Despite the matching strategy used to identify control sites, treatment schools had significantly more students eligible for free lunch and a significantly greater proportion of students learning English as a second language. Schools did not differ significantly on baseline reading measures.

      Differential Attrition: Groups in the analytic sample used for the posttest results differed significantly on one baseline measure: the percentage of English language learners.

      Posttest: Analysis revealed that program schools significantly improved 2 of 5 literacy outcomes relative to controls: Word identification and word attack. Standardized effect sizes for both outcomes suggest that the treatment led to small improvements in basic reading skills (d= .20 and .25, respectively), and while results favored the treatment group on higher-level reading skills, differences were non-significant for those measures.

      Limitations

      • Used a matched QED design that may be biased by self-selection of schools into the intervention group.
      • Lack of information on attrition of baseline student sample.
      • Intent to treat unclear given the lack of information on students (though all available schools were used).
      • Groups differed at baseline on two measures.
      • Some evidence of differential attrition.

      Video

      http://video.msnbc.msn.com/msnbc/49150438/#49150438