Rapid Learning: Methods for Testing and Evaluating Change in Social Programs

SCOTT CODY: Welcome. Thanks, everyone. Thanks, OPRE, for having us. MaryCatherine and I
are going to try and do a joint presentation
here that hopefully will look like we’ve been
rehearsing this together for several weeks. There’s a chance it might
look like we just met for the first time yesterday. One of those two things
is actually true. Also, we’re both
from Boston, so it might look like we were up
late watching the Red Sox. I know that is true
for at least one of us. I don’t know if it’s
true for both of us. So those of you who have
heard me talk before know that I feel like we
are embarking on an era where we can use program
data and rapid testing to help programs target
services, become more effective, and improve rapidly. But the more I do
this work, the more I see people looking for
the “one size fits all” approach to program improvement. And there is no
“one size fits all” approach to program improvement,
just like there’s no one recipe for making dinner. What you make for
dinner is going to depend on what you
want to get out of it. Are you trying to cut carbs? Are you trying to
eat more greens? Who else is eating? What do they want
to get out of it? What do you have in the fridge? How much time do you
have to make dinner? How comfortable are you
doing something complicated? Those are the factors that
are going to determine what you make for dinner. And similarly, with program
improvement methodologies, what you decide to do depends
on contextual factors. I think that’s one
of the main things we want you to take away. Over the next two
days, you’re going to hear about a large number of
different program improvement methodologies. They’re all valuable. What we want to
talk about is, how do you figure out which
approach to program improvement is right for a given
set of contextual factors? So I’m going to start by
addressing the jargon that was already referenced. There will be a lot of
jargon over the next two days that you will hear. Those of you who may be
relatively new to the program improvement field, you might
become overwhelmed or confused by the jargon. PDSA, rapid cycle evaluation,
rapid cycle testing, small tests of
change, PDCA, CQI– there’s a lot of terms that
will get thrown around. And they’re confusing. In some cases, we have two
terms or multiple terms that can mean the same thing. In other cases, we have
a single term that’s used in different contexts. I often talk about
rapid cycle evaluation. And sometimes what
I’m talking about is different than
what other people mean when they’re talking about
rapid cycle evaluation. So what I’d ask at least
for the next hour– and hopefully, maybe,
for the next two days– if you can kind of
let go of, maybe, preconceived notions
you have associated with some of these terms,
some of the jargon, maybe any emotional attachment
you have to some of the jargon. And just keep in mind that
these different methods, they share a similar backbone. And they’re all kind of
headed in the same direction.

MARYCATHERINE ARBOUR:
One way that I like to think about
getting beyond the jargon is to recognize and
think about these in the historical context. So rapid cycle evaluation
arose and gained traction in the 1980s and 90s
initially using data to drive improvement largely
through reporting data externally and focusing
on accountability. And that occurred in
multiple disciplines and was called different
things at that time. And in more recent years,
we’ve moved towards use of data to inform improvements
in practice that focuses on shorter time
frames and engagement of frontline service providers
in looking at and using that data to make
decisions about practice. So I like to think about–
the spirit of these methods is to get beyond reporting data
externally for accountability purposes towards using it to
drive improvements in practice. And the different
methods that you’ll be hearing about over
the next two days fall along varying
degrees of rapidness and also use different
amounts of engagement of frontline providers in
looking at and using that data. But if we keep that spirit in
mind and the history in mind, I think it helps us get
beyond some of the jargon.

SCOTT CODY: So there are
important differences among some of the program
improvement methodologies that we’ll be talking about. But in general, we
think they share this common set of principles,
this common backbone. They all start with, either
implicitly or explicitly, a set of objectives. What is it you’re
trying to achieve? What are you trying to change? They identify a
strategic element, something to test
to see if you can help achieve that objective. Importantly, they don’t
assume that that change is going to work. All of these methods go
into it acknowledging that there’s a chance that
what we’re testing actually might not work. And that’s actually a
pretty important component. They hypothesize how this
change might actually lead to the improvements
that are targeted, possibly using a logic model. They determine the
appropriate way to measure whether those
improvements are happening– again, possibly, along the
way, using that logic model. Then you run the test. You analyze the results, make
a decision based on that, and then hopefully repeat
on your path to improvement. So I’ve already
kind of said this, but I think it’s important
not to sort of blindly apply a program improvement
rapid-learning method to every situation
or any situation. You run the risk of either
answering the wrong question or maybe answering
the right question but doing it sub-optimally,
inefficiently, or in a way that’s
just not useful. So we think that the best way
to figure out the right method is to first really understand
the context and the problem that you’re trying to resolve,
and use that to determine which method to use. So how do you do that? We’ve compiled this
list of questions. It’s probably not a
comprehensive list. But we think these are some of
the most important questions to answer in order
to figure out, what’s the best method for
rapid-learning improvement? So let me walk through these. The first is– what am
I trying to understand? This is pretty obvious. But it’s obviously a
really important question. And the more specific,
the more explicit you can be in saying what it
is you’re trying to understand, that will not only help figure
out what’s the best approach but that’s going to
help you communicate the approach to the folks who
are on the ground actually incorporating this change into
their day-to-day activities, the frontline staff, as
well as the people who are paying for your rapid testing. So being explicit
about what it is, and as clear as possible what it
is you’re trying to understand is really the first, most
important thing to do. Similarly, what outcomes
am I trying to change? Again, being as clear
here as possible. Next, who’s going
to use the results? And how are they
going to use them? And I’ll say, as someone who
comes at this really from more of a traditional social policy
research, randomized control trial background, I think we
as researchers historically have dismissed this
issue of ultimately who’s going to use the results? And how are they
going to use them? I think as a field, we’re
getting better at it. And I know that there are
people here these next two days who are going to talk
about work that they’ve done that really does
incorporate the end user. But in general, I think we as a
field have room for improvement here in really
understanding the end user. We spend a lot of time focused
on minimizing bias and ensuring internal validity. And personally, I think that
those are important things to get right. But I will say that an unbiased,
internally valid finding that is not useful is
a useless finding. So we need to keep that in mind. What are the
organization’s priorities? Where does this fit in? Again, that can influence
the nature of the design and how you go about
the improvement effort. How confident do I need
to be in the results? I like to think of this as,
what’s the risk of being wrong? What’s the consequence of
a false positive, right? If I test a change, and
I see an improvement, and I conclude that that
change caused the improvement, what’s the consequence
if I’m wrong? In some cases, it might
not really matter, right? I’m trying to improve
program completion rates. I make a change. Completion rates go up. Does it matter if that change
caused the completion rate to go up? Maybe not, right? In some cases, maybe not. In other cases, the change
that I’m implementing– I’m testing it. If it works, I’m going to
roll it out system-wide. And that could be expensive. And I want to be
confident before I roll this expensive thing out
system-wide that it’s actually working. So in that case, the
risk of a false positive is I’m going to spend money on
something that’s ineffective. Or what I’m testing
will affect who gets what services,
which beneficiaries get what services. And that’s going to
affect people’s lives. And I want to make sure that
I’m making a decision that doesn’t lead people into
ineffective services or down a path that’s
going to harm them. So, “what’s the consequence
of a false positive?” is how I think about that. How hard is it to
implement the innovation? Again, that can
definitely affect how you set up the design. How much will-building and
engagement of program delivery staff is needed? And again, this is
something that I’ll admit from the
research perspective, historically maybe we have
dismissed this too much. I think there are
definitely folks who are getting much better at this. But if the intervention
that we’re testing, if frontline staff– who are
actually incorporating it into their activities– need to tweak it in order
to get it to work for them or if they’re
going to reject it, if there’s going to
be some kind of organ rejection of this change
that you’re implementing, you’re not going to
have a good test, right? I’m pretty confident
you’re going to have an experiment that
doesn’t show an impact. So really assessing–
how much do we need to engage staff in
the implementation of what we’re testing? What data are available? How soon do I need
to know the results? What time is needed
between when I make a change to be able
to observe, actually, an impact on the outcomes
that I’m interested in? And often, those two
things are in conflict. I usually need to know the
results sooner before I can actually observe the impact. So in those cases, is
there a near-term proxy that I can use as a measure
of potential impact? So those are the
questions we think– and we’ll dig into these
a little bit more– but that we think are
important in order to begin to select your
program improvement approach.

MARYCATHERINE ARBOUR:
So I just want to illustrate one
of these questions and how I tend to
think through this. For the first one,
the question about, what am I trying to understand? There are situations where
what I’m trying to understand is how to apply
established evidence. So I don’t need to prove that
handwashing reduces infection. I don’t need to prove
that reading to children and providing ways to
introduce new vocabulary that is not a part of their daily
life helps develop language. I do need to prove or test how
I get people in the front lines to apply that evidence across
a diversity of contexts. Sometimes, the question is how
to adapt established evidence to a new context. So this is an example
that’s fascinating. It’s from the MacArthur 100
Million and Change Initiative. And there is a team from the
International Rescue Committee, Sesame Street and NYU Steinhardt
who are adapting Sesame Street interventions to Syrian refugee
families across the Middle East. So there’s 30 years
of research that show that Sesame
Street, the program, has influenced childhood
outcomes for three to five-year-old kids
in the United States. We think the underlying
theory should apply similarly to three to five-year-old
Syrian immigrants. But how to apply Sesame
Street in those contexts is the question they’re
setting out to answer. And they’re going
to test and adapt a home- and center-based
approach for 101.5 million children and
programming across the Middle East for 9 million families. Another interesting example
that I and others in the room are involved with is the Home
Visiting Applied Research Collaborative that’s
looking to see, how do we adapt evidence-based
home visiting model components,
active ingredients, to better improve
outcomes for all families? And then the last is, how
do I discover new evidence? If there’s something we
really don’t know how to do, that should play into
which designs we choose. So here’s an example. I’ve worked on a
project in Chile. This was funded by a
Chilean foundation, Fundación Oportunidad. And Harvard worked with the
University Diego Portales to work on this project,
Un Buen Comienzo. In 2006, this foundation
set out to improve the opportunities available
to low-income children. Chile is a stable and
prosperous democracy with some of the highest levels
of inequity in the world. So the country as a whole has
established early childhood education as a priority. And four and five-year-olds
have universal access to preschool in Chile. The quality of that is a
concern to Chileans and others. So this foundation
said, what we’re trying to understand
is the impact of a professional
development intervention for preschool teachers. The outcomes are
classroom quality and child-level language
and literacy outcomes. They wanted the results to
be used to influence policy and that policymakers
would use these results. And the organization’s
priorities were really completely aligned
with this intervention. They want to be a
national leader improving the quality of public early
education and children’s opportunities. In 2006, they wanted
to be completely confident in the results. So they put a premium
on experimental design. At this time, the
other questions really weren’t priority
questions for this foundation– how hard is it to
implement the innovation, and what data were available? They were willing to invest
to overcome implementation barriers and to create
data collection systems. How soon do I need
to know the results? They were very patient. They understood
that it was going to take probably two
years for a change in professional development
to affect children’s language development. And there was no near-term
proxy for that impact. And so they were
willing to collect it over a duration of time. They also knew that
the premium they placed on experimental design, which
required cluster randomization, necessitated a
series of cohorts. So these were
two-year cohorts that were in enrolled sequentially. The results of this study
were available in 2011. And we found that
the intervention had a large positive
impact on classroom quality and null impact on average
child-level outcomes. There was a positive impact
on several language outcomes for a subset of children
who attended the most. And that brings us to 2011. Many institutions
would have walked away from this intervention
at this time. There aren’t a lot
of funding mechanisms that are set up to
say, we’re going to invest for five
years in an intervention and then we expect
to do more, right? But this foundation was
faced with a decision. They could either start
over with something new or they could
continue with this. And what they decided
was that the experience of the previous six years showed
they had already adapted– we had already adapted– the best existing
evidence, as best anyone could, through a
participatory design. And this intervention
was implemented well. The mixed positive
effects suggested that parts of this theory
of change were solid but that it needed
to be improved. So in 2011, the priority
questions were different. What we were trying
to understand was how to change
the intervention, the same intervention, to
maintain or improve impact on quality and achieve impacts
on child-level language outcomes. The outcomes were the same — classroom quality and children’s
language and literacy. And the use of the
results was going to be slightly different to
redesign this intervention. And it was going to be
used by the implementation team, program recipients,
and program leaders. The organization’s
priorities were still to be a national leader. But they felt they
needed to develop a more effective intervention
to take on that role. And they changed the time frame. They wanted to know these
results in one to two years. They still didn’t place
priority on how hard it was to implement the intervention. And they reduced the priority
placed on how confident we needed to be in these results. So the message of this
is really to demonstrate that as we answer
these questions, we can and should anticipate
that the priorities may change for an intervention depending
on the developmental stage of the intervention itself,
the evolution of the evidence, and we’ll talk a little
more as we go on, the evolution in
the workforce who’s delivering the interventions. SCOTT CODY: When
we first started talking about doing
this presentation, the concept was I
would do play-by-play and she would do
color commentary. I think I just feel compelled to
do that because I keep wanting to talk about the Red Sox.

MARYCATHERINE ARBOUR: [LAUGHS]

SCOTT CODY: Sorry. I know, someone from
Boston who keeps talking about their sports fans. We’re endeared
everywhere, I know. OK. So the foundational
questions are really important
to help figure out what’s the best approach
in a given context. After you’ve answered
those questions, one of the next
things to consider is, what comparison
are you going to make? How are you going to
measure whether there’s a change in the outcome? And there are a number of
options for making comparisons. And I think the important
thing I want to emphasize here is that the different
comparisons are really answering different questions. So you need to pick
the comparison that’s going to be most
beneficial to your context. You can make a comparison
to historical trends for your program. After we make this change, does
the trend in whatever increase? You can compare relative
to performance targets. Are we better able to achieve
the performance targets that HHS has set for our program
after we make this change? You can look at individual
client histories. So for a given beneficiary,
or for a set of beneficiaries, once we implement a
change in their services, do they have a
change in outcomes? You can look at
national benchmarks. Are we more like our peers? Or are we more like other states
after we implement this change? You can look at other
participants in the program. So do participants
with the change have different outcomes
than participants without the change? You can use forecasting. After we implement
the change, do people do better than we would
predict they would do? You can compare relative
to the program model. After we make the
change, are we better able to implement
this with fidelity than before the change? And of course, you can
look at non-participants. So, are program participants
with the change faring better than people not in
the program at all? Each of these
comparisons, like I said, is answering a
different question. In addition to selecting
the comparison, you also need to select what’s
the methodology for making that comparison. It can range from, really,
just a naive comparison to statistical control
charts, propensity score matching, randomized
control trials, a Bayesian adaptive
randomized controlled trial. And that’s a plug for last
year’s OPRE conference, whose materials are online. And so there’s a
couple of points I want to make about this. I could spend a whole hour
just talking about this slide. And I’m not. I’m going to spend
maybe a minute. But there’s a couple of
points I want to make. One is that as you
go down this list, you have increasing
levels of confidence that what you’ve tested– if you observe a
difference in outcomes, a change in outcomes– you have increasing confidence
that what you’re testing is causing that
change in outcomes. So as you go down this list,
the level of confidence increases, but as does the
methodological complexity. So the further you go down
this list, the more complex the methods are,
the more likely it is you’re going to need to
bring in someone to help do the analysis. And that’s important to
consider in this context. Because sometimes you actually
want the program staff to be doing the comparisons. So you want the frontline staff
involved in doing the analysis. And the more complicated
you make the analysis, the harder that’s going to be. So that’s another
factor to consider. I think a lot of
people would assume that I would have a third
column here with time and there would be a
similar relationship that the further you
go down this list, the longer it’s going to take. That’s not necessarily the case. I’m saying not
necessarily because I’m expecting to get questions
on this so I’m hedging. But I can do a randomized
controlled trial in one day under the right
circumstances, right? If I know what I’m testing
can actually have an impact and I can observe that
impact within a day, if I can get that
data that I need to observe the
impact within a day, and if I have a
sufficient flow of people through what I’m testing in a
day, I can do an RCT in a day. I think Google’s doing that
kind of thing all the time. So the factors that drive how
long these different methods take to implement– they’re not
necessarily the method. They include, do I have
access to the data? How long is it going to take
me to observe the impact? And do I have sufficient
sample to identify a meaningful impact?

MARYCATHERINE ARBOUR:
One of the factors that will influence where
you make those choices is the data that are available
and the time needed to observe the impact
whether or not there is a near-term proxy for it. So use of a rapid-cycle
evaluation approach will be informed by this. The example from
the Chilean study of children’s early language
and learning skills– there was no near-term
proxy, right? So that was a challenge. And the process measures
were instructional time and instructional
quality, which have a couple of different measures. And I want to illustrate
that what the program wanted to do in the moment in time
was to allow preschool teachers to look at their own practices
and make their own changes. And therefore, they
invented a near-term proxy. And I’m going to show
you an example of this. So we were using the
model for improvement that uses three questions to
guide program staff to make changes and PDSA cycles. The first question of what
we’re trying to accomplish, the teachers
themselves answered– we’re trying to improve our
children’s language skills, but specifically vocabulary. “How will we know a change
is an improvement?” is the question about, what
measure are we going to use? And the measure
that the study uses is the Woodcock-Muñoz,
which is standardized and internationally validated. So it’s a direct assessment. But the teachers
said they’re going to measure the
number of children who could use a new
vocabulary word with help and the number who could
use it without help. And that was their
homemade measure. The change they wanted
to make that would result in an improvement was
introducing one new vocabulary word per day using rotating
instructional strategies. Some of them were part of
the professional development. Some of them came from
their own experience. And the PDSA cycle invites
them to ask their own question much like developing your
own hypothesis before you do an experiment. So if we introduce one new
vocabulary word each day using rotating
instructional strategies, will the number of
children who can use these words without
help increase in practice? They planned it. On the next slide, you’ll
see the “do”– they did this. And they studied it. So they collected data daily. And they did reflection
on this weekly. And the chart on
the left-hand side shows the number of children
who can use the new vocabulary word. The blue line is with help. And the red line
is without help. And the x-axis are days. So they did this
every day, right? And you can see, roughly,
that the blue line goes down and the red line goes up. And the teachers
use this to see, were children using new
vocabulary words without help? In the “study,” they concluded
that these strategies were working, but also
that there were words that were harder than others. And in their “act,”
the last step, they adapted this to say–
for especially hard vocabulary words where we introduced
them and a lot of the kids need help using
them, we’re going to introduce that
word again later using a different instructional
strategy, doing reinforcement. So we used a lot of
different measures for the program improvement
portion of this. The top two are
teacher-reported measures that are collected daily. So the upper-left-hand
side is the measures we just talked about. The upper-right-hand side
is teacher-reported time spent on language
instruction every day. And we have a lot of
conversation within our own program about how valid
these measures are– they’re not
standardized measures, they are self-report– and how much it matters. So for us, the choice was
made that these are really made to empower the
teachers to reflect on their practice
in different ways and to make changes in
day-to-day practice. The upper-right-hand corner,
at the beginning of 2014, which is the lower
line, the average number of self-reported minutes
spent on instructional time for language per day was five. And at the end of
the year, you can see that it was close to 60. I wouldn’t bet my life that
it’s really five, and not eight, and not four. But I’m convinced that
five is different from 60 even through
teacher self-report. The lower two measures– the lower-left is a
formative evaluation. This was a part of the
teacher’s practice that existed. And we used it, and started
graphing it, and looking at it for the project. And the lower-right-hand
corner is a CLASS assessment. And this is a validated,
standardized measure based on full-day videotaping. That sort of measure we can
only get three times a year. If we think about
the comparisons that Scott outlined,
these are all comparisons to self over time, right? And so it’s not the
same causal analysis that the foundation
still always prioritizes. So on the next
slide, what I’ll show is the effect sizes from a
quasi-experimental design using the Woodcock-Muñoz. So this is a validated,
international direct assessment of children’s language. The comparison is
between children in schools that got this
teacher professional development intervention to children
in regular preschools. And what you see here is
that from year to year, we’ve seen an increase in the
impact of this intervention on three of the children’s
language outcomes with effect sizes
between 0.2 and 0.3. The other point about this
is that you can only do it– we could only do it–
every two or three years. So we’re convinced that
the increases in impact of our intervention are
driven by the behavior changes based on the measures
that were taken on a daily basis and
three times a year by the teachers themselves.

SCOTT CODY: So we wanted to
walk through an example– a real-ish example, this
is a synthetic example– of a situation where you
have a general context but with slight variations
in some of those framing questions. It would lead you to two
different approaches– specifically, one
approach using PDSA cycles in a breakthrough series
collaborative versus one approach using an RCT. So we’ve come up
with this example. I’m going to set the
stage for the example. And then we’re going to talk
through two different scenarios of those background questions
with some minor differences and see how we end up with
two different approaches. So the situation is
we’re an agency that runs programs to help
individuals who have substance abuse, addiction problems
when they come out of detox to help them avoid relapse. We know that people
coming out of detox have a high risk of
relapsing once they get back into the community. We also know that there are
these evidence-based programs that, if they
participate in them, can reduce the risk of relapse. But the programs only
work if people show up and they persist
through the program. So we’re an agency that
runs one of these programs. And we have a high dropout rate. And we want to test
a package of services to reduce that high
dropout rate of people who enroll in the program. And that test is going to
include a couple of things. One– a risk assessment
assessing the risk of dropout. So of people who are coming
into the substance abuse program, what’s the risk that
they’re going to drop out? We’re going to give
them a risk score. And then for those people with
a high risk of dropping out, they’re going to get some
combination of services that will likely
include text reminders. It’ll include assigning
them a mentor. It’ll include incorporating
their family members in the therapy. And it’ll include, potentially,
the counselor making home visits with the person,
all intended to keep them in the program and to persist. So let’s talk through
the two scenarios.

MARYCATHERINE ARBOUR: All right.
what I’m trying to understand is if this effective
framework, which has been tested
in other programs with a very different
population, can be effective and can be adapted to
reduce dropout in my program with my clients. The outcome that I care about
is substance abuse relapse. The proxy is reducing
dropout in the program. And the results are going
to be used by the program staff and the program director. This is a high priority
for the organization. And in this scenario, the risk
of a false positive is low. So if I test a bunch
of these things and I believe that this
combination of things reduces dropout, and
actually, it wasn’t exactly that combination of things,
it’s not as important to me. How hard is it to
implement the innovation? It’s moderately complex. And how much will-building and
engagement of the program staff is needed? I think that I need
moderate will-building of the staff in this case. I’ve got staff who are very
committed to what they do. And this is true in substance
abuse prevention programs. Many of them have come through
substance abuse themselves. And they believe
their experience. And they think that what’s
worked for them is going to work for everyone else. And so I need a
model that’s going to allow them to test and try
things, and let them succeed and fail, and increase the
buy-in as they experience this in their own practice. I’m not going to
be able to come in and say “do these five things”
and have people do that with conviction unless
I have something that builds will along the way. And that’s one of the reasons
to choose BTS in this case. The data available are
participant history, the risk predictor,
the daily attendance and program compliance. I expect these results
in nine months. And the near-term proxy
is six-week dropout rates.

SCOTT CODY: So my situation
is a little bit different. We have this package
of services that includes a risk score, and
text messaging, and mentors, and including family,
and home visits. That has been tested
in other communities. And it’s shown to
reduce dropout rates. My staff are frustrated by the
dropout rates that we have. And they’re eager to
try something new. And they’re familiar
with this model. And they want to try it. But our agency doesn’t
have a lot of resources. And these are
resource-intensive components. In particular, the home visiting
takes a lot of counselor time. And I know that if I’m going
to roll this out agency-wide, I’m either going to need to find
resources to add counselors, or I’m not going to be able
to serve as many people. So what I want to do is test– can I streamline these services? Can I reduce the
components here in a way that lets me stretch my
resources while keeping the same impact
on dropout rates, so reduction in reduction. So the situation’s a
little bit different. My question is,
can the components of an effective framework be
streamlined to reduce dropout? Again, ultimately, I’m
interested in reducing substance abuse relapse. It’s really the program director
who’s using the results here to make a decision about
what’s being implemented. It’s not necessarily the
staff using the results to figure out how they’re going
to integrate this and innovate on this with their services. This is a high priority. What’s the consequence of
the risk of a false positive? Here, it’s moderate. If I’m wrong, I run the
risk of either rolling out a program that actually doesn’t
have an impact on program dropout, or I run the
risk of rolling out a program that’s more
expensive than it needs to be. So I want to be a little
more confident in what’s having an impact here. In terms of will-building,
like I said, it’s not as important here. The staff are on board. They want to use data. They want to use the prediction. And we think we can get them
to adopt this with fidelity. All other aspects are the same
between the two scenarios. MARYCATHERINE ARBOUR:
So how does this roll out in a timeline? In a breakthrough
series approach, there would be a
preparation period. We’ve given it a month to design
the theory and the measures. We’d recruit a team of experts
who know the evidence base and a team that includes
frontline providers who can help us define that
theory with short-term process aims and longer-term outcomes. In this case, six weeks is
the longer-term outcome. In month one, we’d
recruit the programs. And as we recruit
programs, we are going to coach
them to form teams that include different
participants across the levels of the program,
ideally including clients, program
recipients, potentially graduates of the program. And we will prepare them to work
together using these methods. In month two, we would
run a learning session. This is where the theory
of change would be taught. So they’d be taught by the
experts in that committee, including the frontline mentors. And they’d be taught how to
run PDSA cycles, specifically, how am I going to ask, “if we
test x, will it result in y? In what period of time? With how many clients
per week, per day? How quickly can I do this?” But training people to
use the interventions– those text messages,
mentoring, et cetera– and also to use these
methods as a team happens at the first learning session. In the months in between
month two and five, teams go off and practice,
and test, and learn, collecting the daily attendance
and the dropout rates, and looking at that
data as they go. There’s monthly supports
where everyone comes together, typically on virtual
calls or webinars to review data across sites,
and to have teams that are having a lot of learning,
either successes or failures, present their experiences. At learning session two, you
can dig deeper into the theory, refine any pieces of
the intervention that need refinement, and really
have the team start teaching one another what
adaptations are working. And then they go back into
their sites and practice from months six, seven. And in eight, they
come back together to look at the data
across the entire group, and comparing and
achieving those results. In month nine, typically we
bring back together that group of expert faculty. And we may now have new
frontline provider experts that choose to join
to refine the theory, refine the measures,
and slim down those adaptations of
the four interventions they tested to get something
that you think is tested and available for future use. SCOTT CODY: So we can also do
the RCT in the same timeframe. We would start, again,
with basically designing the approach and
figuring out how we’re going to measure the
outcome of program completion, figuring out how we’re
going to monitor fidelity to the different program
models that we’re testing. We would recruit sites and
counselors into the study and train them on
two different sets. So in one set, we’re going
to train the counselors in using all of the components
of the intervention. And then the other
set, we’re going to train them to do everything
but the home visiting, because that’s the thing
that we want to test. And once we get
them trained, we’ll start randomly
assigning new enrollees. People are coming out of detox
and enrolling in the programs. Well assess risk and assign
those high-risk people into one of the
two treatment arms. And we’ll let that run as
people cycle into the program. We’ll let the two
treatment arms play out and monitor program completion
rates for the two treatment arms. We’ll also do work
to monitor fidelity to make sure that
counselors are actually incorporating these services
according to the model. And then at the
end of the study, we’ll look at the results
and make some decision based on them. MARYCATHERINE ARBOUR: So
to contrast these two, the approach of breakthrough
series on the left is going to monitor
more measures, actually– the daily attendance,
retention to five visits, time between visits,
program completion. The sites will
annotate their own data with the interventions
and the adaptations they make to those
interventions. The outcomes are going
to be daily return rate to the program and the
program completion rate. And at the end, the
questions we’ll answer– and depending on the way the
measurement system is defined, you may use a statistical
process control chart which fits on the
moderate level of complexity and the moderate level of
confidence of Scott’s table. Or you may use run charts. And the question will
be, did the dropout rate shift significantly after
we developed, refined, and implemented a new process? By how much? And which adaptations were
associated with that shift? So this is associations
between adaptations and a shift that will have differing
degrees of statistical rigor depending on how you set the
charts up at the beginning. SCOTT CODY: And on
the RCT side, we are going to monitor
daily attendance. We will monitor
program fidelity. The outcome of interest is
the program completion rate. That’s what we’re going to use
to make an assessment here. The question we’re answering
is, is one combination of these services more
effective at reducing dropout than the other? And if so, by how much? And in particular, what’s
the cost effectiveness of each of these combinations? And in thinking about the
pros and cons of, at least, the RCT approach,
the benefit here is that we have confidence in
any sort of causal inference that we’re making. But we also get a more
precise measure of the impact. And that can be beneficial
in a cost-benefit analysis. The cons of the
RCT are that this is a fairly locked down
approach, at least the way that I’ve presented it. We are going to roll out
these two different models. And we don’t want staff
to innovate with them along the way. We want to make sure they’re
doing the model with fidelity so that we know what it
is we’re actually testing. And so that’s one of the
cons of this approach. MARYCATHERINE ARBOUR: One
last thing before we wrap up– I think another contrast
between these approaches is what happens to
your program staff throughout these experiences. So for some social
service interventions– and I’ll use home
visiting as an example, the home visiting CoIIN, Mary
Mackrain, is in the audience here. HRSA has done an
intentional push to build the capacity
of frontline providers to do this kind of
rapid-cycle testing. And so coming out of a
breakthrough series, what we have seen in home
visiting and what we may see in this experience
is the program staff may be more prepared and
more ready to incorporate other innovations in the future. So the transformation of
the frontline providers and, potentially,
clients or patients as partners in the
work and as innovators, and as potential
adapters of improvement, is transformed through a
breakthrough series experience in a way that it typically
isn’t in a different RCT design. SCOTT CODY: Yeah. OK. We have a couple
of final takeaways. And then we’ll open
it up for questions. The main takeaway here
is let the context determine the method
for rapid improvement. This includes focusing
on decision-maker needs, assessing staff readiness for
adopting the intervention, and assessing the
level of confidence that you need in the findings. The second takeaway– factors
that drive the timeline, they are context-specific. So it’s not necessarily the case
that the RCT is the longest. How long does it take
to observe an impact? Are there proxies available
if that time is too long? What data are available? Can you get the data readily? And what’s the flow? What’s the sample
size of participants through the program? I think the biggest takeaway,
from my perspective, is that this type of improvement
work, this iterating, learning, and improvement
work is great work. So I’m very happy OPRE is
having this conference. Because I think we all
need to keep doing this. Do you have other
takeaways to add? MARYCATHERINE ARBOUR: The only
other comment I would make is that I think the Chilean
example is illustrative of the duration
of investment that needs to be made to really
adapt and tweak interventions to achieve the kinds of outcomes
that social programs aim to make. And it’s really rare to see RFAs
that commit to that up front or provide the mechanisms
that say, you know, we have something we need
to learn before we can RCT. Or we’re going to
RCT and then we’re going to have more to
learn, whatever that is. So for the 60% of the
government agencies and funding agencies that are
in the room, I think I’ve seen some new RFAs
that take this longer view. But the experience,
and the learning, and the impact we saw on
children’s outcomes in 2015 would not have happened
without starting in 2006 and continuing that work with
a variety of methods over time. SCOTT CODY: Great. So we’re happy to
take questions on– MODERATOR: –stand
up for questions. I just wanted to remind
folks to please state your name and affiliations. Thank you. AUDIENCE: Hi, I’m
Steve Bell from Westat. In these very rich
examples, I learned a lot. I want to zero in
on the “r” word– rapid. Besides being willing to
assume that an intervention with a lower dropout rate
reduces long-run recidivism more, what aspects of
the research approach made it faster? SCOTT CODY: So I think
it is focusing on– we’re testing not the full
intervention, but a change to the intervention. And then, yeah, focusing on– in the RCT example– program completion
rates as, ultimately, a proxy for reduction in
recidivism, as you said, that also reduces
the time frame. That answers your question? MARYCATHERINE ARBOUR:
Yeah, I agree with that. And I also think that
the rapid-cycle learning that you see in the
breakthrough series examples can refer to the
learning that happens on a daily basis because of
the cycles of questioning and answering that
you’re doing with data, or a weekly basis in
those reflections. So for example, between
a first learning session and second
learning session, typically what we’re
looking for are specific sites or
participants that already have shifts in their data, right? So in an RCT, you wouldn’t see
a statistically significant difference until you get
to the end, which is why you designed it to be the end. But in the breakthrough
series, if you’re willing to accept that these
proxies that are measured more frequently can show
statistically-informed shifts on run charts, then you
can learn in rapid cycles before you get to
the full sample size. At the second learning
session, typically we’re looking for superstar
sites that have something to teach other people. And you can see
successful adaptations to how I talk to parents about
developmental surveillance and screening by using this
half sheet spread like wildfire across programs, and make the
program staff more comfortable in having those conversations. So rather than
relying specifically on “the model is this,” if we
can get people to fidelity, we’ll get to what we want. And the rapid
cycles, to me, speak to that shorter-term testing
and learning by the staff. SCOTT CODY: Yeah. And I think it’s useful
to think about what we’re talking about–
“rapid” relative to what? “Rapid” relative to how
programs have kind of developed and improved over
time historically? And this is going to be a
complete over-generalization, but you think of a model where
there’s a university that develops a program in the lab. They roll it out in
their local community. They have evidence
that it works. That model wants to get
adopted in other communities. You end up with this
rolling out over years. A multi-site evaluation
of that model happens that takes
multiple years to complete. It’s a number of years before
you start to have evidence of, is this working or do
we need to improve? Where what we’re talking about
is, sort of along the way, you’re testing and
improving based on short and near-term outcomes. AUDIENCE: I agree with you that
this kind of learning cycle happens all the time. And these are in
learning organizations, and organizational learning,
and individual learning. And that’s so
important in the cycle. And I was thinking about
rapid-cycle evaluation. And, well, it’s really
cycle evaluation. Well, it’s really
evaluation cycle that your presentation– and I
think that the intention here emphasizes that it’s
this continual learning. And so my proposition
to you and to all of us is reduce emphasis
on your rapidity. Because sometimes
it’s really important. But I wonder if it
sets up expectations among all stakeholders
that just aren’t realistic in some cases
depending on the data, depending on what
the question is. And maybe increase the presence
of something like a logic model, or a theory of change. And you talked a
little bit about it. But in addition to thinking
about the methodology and confidence or risk
of a false positive, can you talk a little about how
you think about development, adaptation of the theory
of change, or the logic model, or either one
working into the system? SCOTT CODY: So on
your first point, I personally have no
problem getting rid of the word “rapid.” I think you make a
great point that it does set up expectations. I’ve had a number of people
who approach me looking for rapid-cycle
evaluation which they think is a traditional
full-program evaluation, just quicker. And that’s not what I’m
talking about when I talk about rapid-cycle evaluation. So I have no problem
stepping away from that word. And I think that it’s a
really good point that would be useful to discuss
throughout the next two days of this conference. The second question was, how
do we incorporate a logic model into the framework here? Let’s see, did you
have thoughts on that? MARYCATHERINE ARBOUR: Yeah. To comment on that,
I mean, that was one of the biggest differences
between the first phase of the Chilean work
and the second. We always had a
theory of change. We always had measures
associated with our moderators and mediators, as we
referred to them in the RCT. We talked about it once a year. And in the second phase, which
was the breakthrough series, the first piece of work
in that expert meeting was to adapt that
theory of change to a key driver diagram, which
is a different format that’s a little more accessible. And that was in front
of our frontline teams at every opportunity. At every learning session, in
between the learning sessions, the process measures
and the outcomes hung directly on what the
drivers of improvement were. So I think that that’s a
place where a lot of progress can be made and I appreciate
you highlighting it. SCOTT CODY: Yeah. I mean, rapid testing
is not a substitute for traditional
program evaluation. And I think the Chilean example
really is a great example to demonstrate that. And I think that having a logic
model, a theory of change, can be really valuable in
figuring out in the improvement framework, if I don’t
have a lot of time to wait to observe
an impact, what are my potential near-term proxies? AUDIENCE: Lauren Supplee
from Child Trends. I wanted to bring out one
other piece of the conversation that’s sort of
underlying right now, which is what outcomes can
be used in these models and frameworks? I know in the
precision home visiting HARC study that MaryCatherine
and others and I in the room– we’re talking a lot about
the importance of mediators and building
stronger connections between our mediators
and long-term outcomes so that we can more look
at proximal mediators in these kinds of
models, to look at change rather than having to
wait for that long-term. And it seems to me that,
from the discussion already, these methods
can be used clearly for implementation factors. Because they’re often closer. And then having to look at these
sort of near-term mediators– but can you talk a little
bit more about the outcomes that you’ve seen used in these
and how those things play out? MARYCATHERINE ARBOUR: Sure. I mean, there are
some outcomes– and I’ll use examples in health
care because I know them. But there are efforts
to improve outcomes in things like door-to-balloon
time with heart attacks. So if you get your cardiac
arteries dilated with a balloon within 45 minutes of when
your heart attack started, you will do much better, right? So in a high volume
cardiac center, you can test an
intervention and watch your door-to-balloon times. And that is a shorter-term
proxy for cardiac mortality and cardiac function
after a year. But the link, as you referred
to in the home visiting stuff, the link between
that and your outcome is tight enough
that you don’t even have to include it in
your introduction anymore. You don’t have to
explain or justify that. An example from
the Chilean project is we used full-day classroom
recordings to score the CLASS three times a year. And the CLASS has three
domains of classroom quality. If you score above
a certain threshold, you can see impacts on
children’s outcomes. So that link is fairly strong. In 2014 and 2015, we started
training teachers in the CLASS and training our teacher
coaches to score specific pieces of the CLASS based on 20-minute
classroom observations. So the coaching model always
had two classroom visits in a month. It always had a planning
session, an observation, and a reflection after it. We overlaid non-validated CLASS
scores, direct observation of a specific
portion of the day, to help teachers focus
on specific pieces of their practice. And we monitored those twice
a month over time and showed increases in their CLASS scores
with these non-validated scores that did correlate
with three-time-a-year and end-of-year CLASS scores. So that’s an example
where the outcome itself could be used in
a less appropriate, less rigorous, less validated way to
drive improvements in practice. But it was the outcome itself. And what’s fascinating is we
have people on our evaluation team who are like, that’s
nonsense, that can’t be done. And then we presented it to
the developers of the CLASS. And they were like,
that’s really interesting. So it depends on where you fall
along the spectrum of rigor. But I think we should be
open to using measures that are the outcomes themselves
if it’s available even if the way it’s collected
and the way it’s gathered wouldn’t stand up
to self-correlation, to inter-rater reliability
and to all of the rest. SCOTT CODY: Yeah. A different example– I know an employment training
program for ex-offenders that’s gone through– MDRC did a great randomized,
controlled trial evaluation of the program. And it didn’t have
the big impact on long-term earnings,
long-term employment that the program wants
to have on ex-offenders. And the program includes an
intensive job skills training component and then a
transitional jobs component. And the program administrators,
they didn’t look at the results and say, oh, that can’t be,
we are having an impact, we know it. They looked at the results
and they said, well, we have this attrition problem
from after the intensive skills into transitional employment. People are dropping out
either right at that seam, or shortly after they enter
transitional employment. And that if we’re going
to have an impact, we need to increase persistence. So they’re not saying
that that’s ultimately the measure of success. They’re recognizing that the
next time someone comes along and evaluates us, if we want
to show a bigger impact, we can’t have an impact if we
don’t address the attrition. And so the attrition ends up
being the outcome that they’ve, basically through
their logic model, identified as “this is
where we’re potentially falling down.” AUDIENCE: I’m Sai Ma from CMMI. The rapid-cycle
evaluation is actually part of our group’s name. And we’re constantly under
pressure of delivering something quick and fast. The hesitation, however– I really like the term
“near-term proxy.” However, the hesitation
is, what if the direction or the magnitude is not
consistent with the impact analysis at the end of the
term of the evaluation? So I wonder if you can
share some insight on how to select the proxy measures. And how do you frame it
and convey the message when the proxy and the
long-term impact analysis are not consistent? SCOTT CODY: Actually,
if I could make sure I understand the question. So the scenario that
you’re talking about– I’ve identified this
near-term proxy. I’m running a test to see if
I can improve that measure. And it’s not
showing improvement? AUDIENCE: No. I’m thinking more in the
scenario where your proxy shows something like improvement. And you keep telling
your model team, oh, the intervention
seems to be working based on the proxy measures. Two years down the road,
you do that impact analysis of outcome measures. We deal with health
outcomes, which take years to show a result.
Then you don’t find anything. How do you convey
and explain that? Because that happens
a lot of the time. SCOTT CODY: Yeah. I mean, in the example I
was just talking about, the employment and training
example for ex-offenders, let’s say they
implement a change and it increases persistence. They get evaluated again. And there’s still no impact
on long-term earnings. I think that’s the scenario
that you’re talking about. Well, then there’s a question of
looking at the theory of change for this program. Where else might
we be falling down? Are there things we
can test to improve? Or there is a tough
conversation to have of, maybe this model
is not effective. And so I think that, again, the
value of a traditional program evaluation– it’s still valuable because you
need that objective assessment of, is this program effective? Is it worth it? But you’re going to
maximize your chances of showing a positive impact
with the program evaluation if you test improvements along
with your theory of change along the way. MARYCATHERINE ARBOUR: Yeah. I think that this question calls
back to what Lauren was trying to highlight about, what
are these near-term proxies and how tight is
that relationship? And so my suggestion was,
let’s use the real outcomes even if the measure is less
reliable and standardized. The other thing is to rely
on your theory of change. So in the Chilean example,
we had very big impacts on the CLASS, really big
impacts on the CLASS. People anticipated we
would have big impacts on children’s outcomes. And at the end of
the study, what we were able to
then learn from it was that the
impacts on the CLASS were statistically significant,
the effect sizes were big. But the instructional
support was still scoring below the target,
which was 3.25 out of 7. The instructional
time effect size was very big and
statistically significant. And when you looked
at the data, it was the difference between
five minutes per day and twelve minutes per day
spent on language activities. And when we did the moderation
analysis for attendance, the positive impact
on children’s outcomes happened among the quintile of
children who had the greatest likelihood of attending. So it was causal. But anyway, who
attended the most. And those kids still
missed 16% of school days. The average chronic absenteeism
rate in our classrooms was 66%. So that doesn’t mean what the
teachers are learning and doing doesn’t work. It means that kids aren’t
there enough to get it. So I think that the
question that you raise– what happens when your
near-term proxies show something and your
outcomes don’t?– to me, it just makes
it interesting. How did that happen? Is the relationship between the
proxy and the outcome false? Or are there other
things that you can see that were in your
theory of change that played a role that you can
address in the next iteration? AUDIENCE: I’m Aleta Meyer
from the Office of Planning, Research and Evaluation. And I’m super excited to
hear this conversation. It reminds me of a paper I was
forced to read for my doctoral defense on program development
evaluation, which was a — Gottfredson made a modification
to action research. And one of the things that
you’re alluding to here that I want to emphasize is the
aspect of the worldview of the people
implementing the program and the disconnects
that can exist between a theory of change
that a social scientist or an anthropologist
might develop and people on the ground. And I think that that
issue of will-building is really important. And it brings to mind something
somebody said to me a couple days ago who had gone through
training to be a foster parent. And she’s an assistant
professor of psychology. And the other people in
the room who are also learning about being
a foster parent were really most
frustrated about not being able to use
corporal punishment. And for human services, to
think about the ways in which our efforts to change behavior
in really important ways, we really need to get into
the worldview of the people doing the work and
take that seriously. And I think it’s exciting. And I just was wondering
if you had examples of that as well, or– I don’t know. I just wanted to
exclamation point that. SCOTT CODY: –your domain. MARYCATHERINE ARBOUR: Yeah. I mean, one of my passions, one
of the things I like the most, is coaching and
working with teams that bring together people
across a diversity of training backgrounds and
life experiences. So clients, patients,
paraprofessionals, and professionals, and
CMOs all at the same table, and the power of
personally relating across these societal
boundaries and telling stories across them, and
discovering commonalities, and discovering real differences
like the one you described. So I love the CQI approaches
and the team-based learning that facilitates and requires that. It’s challenging. I’m sure we’ll hear about
some of the challenges over the next two days. I think it’s tough. Because even in our first
phase of the RCT in Chile, we tried to do that. We had a strong
stakeholder engagement. We did what would be called
“community-based participatory research methods” in
building the intervention. I just think that
there’s a limit to, truly, how open
we are as researchers when we’re doing something that
requires what we do not change. So I do think that we can do
fidelity of implementation better in ways that
are more respectful and do that within an RCT. But I think if we’ve got a
transformation, a culture change, and a workforce
renovation issue, as we do in health
care with burnout, as we do in lots of
social services– there are some of these
methods that are actually useful to that end
in and of itself as well as potentially improving
outcomes of the recipients. MODERATOR: I’m going to take one
final question before we break. AUDIENCE: Hi, Pamela
Velez-Vega, Millennium Challenge Corporation. Excellent presentation. Thank you. I was fascinated when you
mentioned that in the teacher intervention, the combination
of what the teachers were doing might not be that important. But it was the possibility
of them trying, and becoming empowered, and understanding
what is working for them, and making adaptations. And that just made me think of
a very dynamic cycle, right? Feeling a lot more mindful
about what you’re doing, reflecting as a teacher. I’m imagining reflecting
on what you’re doing, and being able to catch
yourself in the moment. So if you could expand on
what were some variables that you were using? If you were using in
terms of the teacher being more empowered, or changes
in behavior-specific or attitudes of
these teachers as they were going
through the program. MARYCATHERINE ARBOUR: Yes. Let me back up just a second. Because I want to make
sure I didn’t misspeak. What the teachers
did and the content of the professional
development itself– the teaching strategies, the
ways they were trained to use open-ended questions, or a
word wall, or the strategies– were really important. I have worked on
breakthrough series where the evidence base
and the intervention itself wasn’t completely
vetted or strong, or the theory of change
wasn’t completely clear. And it’s really hard. So that’s important. I think the breakthrough
series that worked in Chile benefited hugely from
having the testing through the randomized
control trial. The home visiting CoIIN
has worked really well because the evidence-based
interventions that went into it– they exist. We have home-visiting models. OK. So let’s not discount that. Because if you try to
do this work without it, it’s really challenging. That being said, there
really were transformations in the teachers. And we didn’t intentionally
set up a qualitative study to examine that transformation. But we did, in the development
in the series of these, develop mentors and
later interview them. And we have beautiful
videos of them talking about their experiences. And I’m happy to share with
you some with subtitles. But one of the
things that I love is, as one of our strongest
teacher mentors said, I used to hate PDSA cycles. She said, you know, the
data is complicated. Writing down what the kids
said or didn’t say takes time. I used to hate these. But that’s where the
learning happens. And so really engaging them
as learners, and as people who came into this
profession with a passion and that are still there
for the same reasons but looking at the data
in ways and helping them learn and grow, really
transformed them from recipients of a training they were meant
to execute with fidelity, to protagonists in innovation. And that transformed
their own practice, but also led to many of
them seeing themselves as mentors for their peers for
the first time in their lives. So that was exciting. And I’m happy to share
those videos with you if it’s something you’d
be interested in seeing. SCOTT CODY: The
last thing I’ll say, because we’re about to get
ushered off the stage here, is program improvement is hard. And if this wasn’t hard, we
wouldn’t be here today having this conversation. It’s something that does
require all levels to be involved and engaged in the
program improvement effort. AUDIENCE: [APPLAUSE] [MUSIC PLAYING]
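
Illustration of the run-chart question from the breakthrough series discussion ("did the dropout rate shift after we implemented a new process?"). The talk does not say which run-chart rule the speakers apply. This minimal sketch assumes one common convention: six or more consecutive points on the same side of the baseline median count as a shift. The function name, the weekly dropout figures, and the eight-week baseline are all hypothetical.

```python
from statistics import median


def find_shifts(values, baseline_n, run_length=6):
    """Flag runs of `run_length` or more consecutive points on one side
    of the baseline median (a common run-chart shift rule, assumed here).

    Returns (baseline_median, [(start_index, end_index, side), ...]).
    """
    center = median(values[:baseline_n])
    shifts = []
    start, last, side, count = None, None, 0, 0
    for i, v in enumerate(values):
        s = (v > center) - (v < center)  # +1 above, -1 below, 0 on the median
        if s == 0:
            continue  # points on the median neither extend nor break a run
        if s == side:
            count += 1
            last = i
        else:
            if count >= run_length:
                shifts.append((start, last, "above" if side > 0 else "below"))
            start, last, side, count = i, i, s, 1
    if count >= run_length:
        shifts.append((start, last, "above" if side > 0 else "below"))
    return center, shifts


# Hypothetical six-week dropout rates (%) by weekly enrollment cohort:
# eight baseline weeks, then seven weeks after teams begin testing changes.
weekly_dropout = [30, 34, 28, 33, 31, 35, 29, 32, 27, 26, 25, 28, 24, 26, 23]
center, shifts = find_shifts(weekly_dropout, baseline_n=8)
print(center, shifts)
```

A flagged run only says the series moved relative to its baseline. As noted in the talk, linking that movement to a specific adaptation depends on sites annotating their own data with the changes they made.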

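Illustration of the two-arm RCT described in the talk: high-risk enrollees are randomly assigned either to the full package or to everything but home visiting, and program completion rates are compared at the end. The talk does not specify the randomization scheme or the statistical test. This minimal sketch assumes simple 50/50 randomization and a two-proportion z-test. The arm labels, function names, seed, and counts are hypothetical.

```python
import math
import random

ARMS = ("full_package", "no_home_visits")  # hypothetical labels for the two arms


def assign_arm(rng):
    """Simple randomization: each high-risk enrollee has an equal chance of either arm."""
    return rng.choice(ARMS)


def compare_completion(completed_a, n_a, completed_b, n_b):
    """Two-proportion z-test on program completion rates.

    Returns (difference in rates, z statistic, two-sided p-value).
    """
    p_a = completed_a / n_a
    p_b = completed_b / n_b
    pooled = (completed_a + completed_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return p_a - p_b, 0.0, 1.0  # everyone or no one completed in both arms
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return p_a - p_b, z, p_value


# Assign ten hypothetical enrollees as they come out of detox.
rng = random.Random(2024)
enrollees = [f"client_{i}" for i in range(10)]
assignments = {person: assign_arm(rng) for person in enrollees}

# Hypothetical end-of-study counts: 70 of 100 completed in one arm, 50 of 100 in the other.
diff, z, p = compare_completion(70, 100, 50, 100)
print(f"difference={diff:.2f}, z={z:.2f}, p={p:.4f}")
```

The estimated difference in completion rates is the "by how much" figure. Dividing each arm's added cost by that difference would give a rough input to the cost-effectiveness comparison mentioned in the talk.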