Using scientific evidence to improve educational decisions

February 15th, 2012

I often find myself playing the role of Grumpy Old Man in conversations about the selection of intervention programs and other teaching practices. A statement along the lines of “but there’s no evidence that it works” is often preceded by much face rubbing and hair pulling on my part. The response I hear most often is “but we see it working” which precipitates more face rubbing and hair pulling from yours truly. Perhaps the biggest barrier in these conversations is that teachers and scientists often have different definitions of what is meant by “evidence”. This blog attempts to explain what scientists mean by the term evidence.

Evidence-based practice

The term evidence-based practice began in medicine. It seeks to maximise the accuracy of clinical decisions by basing them on evidence gathered using the scientific method. No one wants to see a doctor who prescribes a treatment just because they believe it works or because they heard a colleague give a presentation on it. There should be a burden of proof (and theoretically there is, although this burden doesn’t exempt doctors from making mistakes) on a doctor to make treatment choices that have been shown to work significantly better than no treatment or, if there are alternative treatments available, to choose the one that works most effectively with the fewest side effects. These statements are axiomatic but they aren’t often applied to education, an area at least as important as health.

What constitutes scientific evidence?

A very brief description of the scientific method in relation to treatments is as follows:

  1. Select 2 equivalent groups. If they don’t share the same characteristics and the same level of skills before the intervention you can’t be certain whether any differences observed after the treatment were due to the treatment itself or to the pre-existing differences between the groups.
  2. Gather pre-treatment data using well-validated and reliable instruments that clearly measure the outcome in question. For example, a comparison of a group versus a one-on-one reading program should be evaluated with tests of reading ability, not of motor skill.
  3. Randomly allocate students to the groups. If students, parents or teachers actively select the groups it biases the results. For example, a recent study compared neurofeedback therapy for ADHD to a non-treatment group. The results showed that parent-report measures of ADHD symptoms improved in the neurofeedback group relative to the untreated group. However, because parents actively decided to enrol their child in the neurofeedback group or actively decided not to, all these data show is that if parents believe that neurofeedback is going to work they will report that it does work!
  4. Implement the treatment while making sure that the treatment is run the same for everyone and that additional teaching or therapies are not going on at the same time.
  5. Administer post-tests to determine outcome.
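
For readers who like to see things concretely, random allocation (step 3) can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name allocate, the student labels and the seed are all made up for the example, and a real trial would usually do more (for instance, balancing the groups on pre-test scores).

```python
import random


def allocate(students, n_groups=2, seed=None):
    """Randomly split students into n_groups groups of near-equal size.

    Hypothetical helper for illustration only. Nobody chooses a group:
    the shuffle decides, so beliefs about the treatment cannot influence
    who ends up where.
    """
    rng = random.Random(seed)      # a seed makes the allocation repeatable
    shuffled = list(students)      # copy so the original list is untouched
    rng.shuffle(shuffled)
    # Deal the shuffled students out like cards, one group at a time.
    return [shuffled[i::n_groups] for i in range(n_groups)]


# Made-up student labels, purely for the example.
students = [f"student_{i:02d}" for i in range(1, 21)]
treatment, control = allocate(students, n_groups=2, seed=2012)

print("Treatment group:", sorted(treatment))
print("Control group:  ", sorted(control))
```

The important design choice is that the shuffle, not a parent or teacher, determines group membership. Recording the seed simply lets someone else check the allocation later.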

Additional points of note:

  1. Children’s development is not static; they are in a constant state of personal improvement. Observations that students seem to improve over the course of a program/intervention are therefore mostly meaningless. All students will improve over time. The only way to tell if a teaching method works is to compare a child (or preferably a group of children) to an equivalent group who receive a different teaching method.
  2. Placebo effects are large in children. They will often improve just because of the extra attention paid to them. Therefore, observations, or even more empirical data, collected on a treated group in the absence of a group receiving an equal amount of attention are prone to error.
  3. Regression to the mean is a statistical artefact whereby extreme scores (e.g., low scores on a reading test) are likely to move closer to the mean (average) on repeat testing. Hence, proponents of a brief intervention, say of 2 weeks, may claim that their treatment resulted in the student improving from a score of 70 to 80 and that the change was significant when in fact the reason for the improvement may simply be regression to the mean.
  4. Beware research using tests that measure what is taught in the treatment. See Merzenich et al. (1996) for an example.
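
Regression to the mean (point 3) is easier to believe once you have seen it happen with no treatment at all. The simulation below is a rough sketch built on assumptions of my own, not data from any study: each simulated child has a stable ability (mean 100, standard deviation 12), every test adds random measurement noise (standard deviation 9), and children are selected for "intervention" if their first score falls below a cut-off of 80. Nothing is done to them between the two tests.

```python
import random
from statistics import mean


def simulate_retest(n_children=10_000, cutoff=80, seed=1):
    """Retest low scorers with NO intervention and return their mean scores.

    All numbers here are illustrative assumptions, not real test norms.
    """
    rng = random.Random(seed)
    first_scores, retest_scores = [], []
    for _ in range(n_children):
        ability = rng.gauss(100, 12)          # stable underlying skill
        score1 = ability + rng.gauss(0, 9)    # first test = skill + noise
        score2 = ability + rng.gauss(0, 9)    # retest = same skill, new noise
        if score1 < cutoff:                   # select only the low scorers
            first_scores.append(score1)
            retest_scores.append(score2)
    return mean(first_scores), mean(retest_scores), len(first_scores)


first_mean, retest_mean, n_selected = simulate_retest()
print(f"Children selected for low first scores: {n_selected}")
print(f"Mean first score:  {first_mean:.1f}")
print(f"Mean retest score: {retest_mean:.1f}")
print(f"Apparent 'gain' with no treatment: {retest_mean - first_mean:.1f}")
```

The selected group scores higher on retest even though nothing was taught. Children who score very low on one occasion were, on average, also a little unlucky that day, and the bad luck does not repeat. An equivalent untreated comparison group is what lets you subtract this effect out.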

An example of good data

Hatcher, Hulme, and Ellis (1994) compared four groups of children, all of whom had comparable reading difficulties. The children were randomly allocated to four groups, each of which received a different treatment. One got phonological awareness training, one phonological awareness plus reading and another just did reading. The last group received no treatment. Although the 3 treated groups received different treatments, they received the same amount of instruction in terms of time, attention and contact with teachers. Because this study controlled for all other variables, the authors were able to claim that the stronger reading growth in the phonological awareness plus reading group was due to that treatment being superior. These claims could not have been made if the students were not randomly assigned to groups, if they differed on some other variable before treatment, or if some other variable such as the time spent on intervention differed between the groups.

An example of poor data

I recently suggested to a teacher that their suggestion that a child with reading problems see a behavioural optometrist was not evidence-based. See here for a review of the evidence for visual therapy in reading difficulties. The person claimed that they had a student in a previous year who improved in their reading ability while doing vision therapy and that they therefore believed that vision therapy worked. Unfortunately, the belief simply cannot be supported by scientific principles. First, the improvements could have been a placebo effect. In fact, it is safe to assume that placebo effects represent at least part of all positive teaching outcomes for all students. Paying attention to them helps them improve. It is probably equally likely that the child would have improved if they recited the alphabet while standing on a wobble board each day. Second, children improve in almost all skills over the course of a year as a natural course of events. In this case, there is no way of being certain that the improvement wouldn’t have occurred anyway. Finally, teachers obviously want students to improve and history is full of examples where even eminent scientists have deluded themselves into believing something because they were keen for it to be true (e.g., see the case of cold fusion). In summary, beware of using observations of single cases like this as the basis for educational decisions.

Types of evidence

Carter and Wheldall (2008) have proposed 5 levels of evidence that can be used to guide interpretation of educational research.

Level 1 

Level 1 programs or practices meet two criteria. First, they are consistent with existing scientific evidence in terms of current theory and recommended practice. Second, there is evidence of efficacy from a number of independent randomised controlled trials. Carter and Wheldall (2008) refer to Level 1 as the ‘gold standard’ and suggest that programs and practices meeting these criteria may be recommended with confidence.

The Hatcher et al. (1994) study described above represents an example of the gold standard Level 1 evidence.

Level 2

Like programs or practices that meet Level 1 criteria, Level 2 programs or practices are consistent with existing scientific evidence in terms of current theory and recommended practice. They also have empirical evidence supporting their efficacy but the design of the studies may not quite meet the gold standard of a randomised controlled trial necessary for Level 1 rating. These programs represent the silver standard and can be recommended with reasonable confidence.

An example of a Level 2 program is my own Understanding Words reading intervention program. The data we have on Understanding Words are summarised briefly below.

  1. A clinic study showed that a group of students made significant improvements in response to two terms of Understanding Words teaching. The gains were strong and similar to the average growth seen in randomised trials reported in the literature. However, because the study didn’t have a control group we can’t guarantee that the changes were the result of the intervention rather than some other variable.
  2. A controlled study showed that a group of Grade 1 students with reading difficulties made significantly greater gains than a control group of average readers. In other words, the poor readers ‘closed the gap’ on the good readers as a result of the intervention. This study goes close to meeting the gold standard criteria except that the students were not randomly allocated to groups and the research was not independent of the program developer.
  3. We also have four studies using well-controlled case series designs. These studies have shown that introduction of the Understanding Words treatment prompts increased reading growth in individual students compared to baseline periods in which no treatment or an alternative treatment was being provided.

Together, these studies fall short of the gold standard of Level 1 evidence but the program fits into the Level 2 stratum based on its theoretical soundness and treatment-outcome data.

Level 3

Level 3 programs and practices make theoretical sense. These programs could be said to be based on evidence because there is often empirical data showing the effectiveness of the type of teaching contained in the program. However, there have been no scientific studies documenting the effectiveness of the program or practice itself. These programs might be used in the absence of an alternative that has stronger evidence. However, they should be used with caution. An example may be the ELF reading program. Arguably, ELF has a reasonably sound theoretical basis; however, there is no evidence beyond observations that the program works. Teachers and clinicians who want to be evidence-based practitioners would be cautious about selecting the program when there are other programs with stronger evidence bases.

Level 4

Level 4 programs are Not Recommended. They provide little or no empirical evidence for efficacy. They often rely on testimonials and observational ‘data’ to support their claims. Examples include fish oil as a treatment for ADHD and behavioural optometry as a treatment for reading difficulties.

Level 5

Level 5 programs and practices represent those for which there is evidence that the program is unsafe or results in negative outcomes. These programs and practices should be avoided at all costs.


Teachers can do a lot of good by becoming evidence-based teachers. At present, most of teachers’ professional reading involves practically-oriented periodicals or books rather than research-based and peer-reviewed journals (Rudland & Kemp, 2004). It has also been reported that regular and special education teachers value the opinion of colleagues, workshops and in-service activities (which may present opinions with no evidence base) as more trustworthy than professional journals (Landrum, Cook, Tankersley, & Fitzgerald, 2002). Further, Boardman, Arguelles, Vaughn, Hughes, and Klinger (2005) reported that when making decisions about classroom implementation of practices, special education teachers did not consider it important that they be research-based. I suspect that this is understandable as practical strategies may seem to be more applicable for teachers. However, it would be nice if these things began to change as they have in medicine. Teachers could become more critical about the claims made by proponents of educational practices and more critical of their own teaching methods. They could begin by asking themselves what the evidence is to support the use of programs and practices. They could actively seek out that evidence from peer-reviewed sources rather than relying on books, the Internet or the opinions of colleagues presenting PD. They could ask themselves: “Would I be happy if my GP used a Level 3, 4 or 5 treatment on me just because s/he believed that it worked?” If not, they could ask another question: “Should I therefore be cautious in selecting educational programs that have limited evidence?”

Final note

The last thing I intended was for this blog to be interpreted as teacher-bashing. I could write a similar blog about some of my psychologist, occupational therapist, or paediatrician colleagues. Nor am I immune to human foibles and biases. However, the fact is that we should all strive to do better for students who have learning difficulties and indeed all children. To do that we need to recognise the dangers of our belief system and of relying on the opinions of peers. We all need to strive to base decisions on science, not on philosophy or pseudoscience. We would also do better to recognise the limits of our knowledge. To paraphrase Donald Rumsfeld, the best teachers (and other clinicians) know what they don’t know.