Can you emulate clinical trials with observational data?

While randomized controlled trials (RCTs) are the gold standard of experimental research, they have limitations. In some cases an RCT is not feasible (e.g., the sample size may be too small, as with rare diseases), not ethical (e.g., if the previous standard of care is not effective), or not practical (e.g., answers are needed quickly or there is not sufficient funding for an RCT). A paper by Hernán et al. (2022) provides guidance on how to emulate a clinical trial using real-world observational data.

What are the key challenges of using observational data?

There are two primary challenges with using observational data. First, individuals are not randomized into treatment and control groups. For instance, individuals who visit the hospital have much worse health outcomes than individuals who do not. This does not mean that hospitals cause people to get sick; rather, there is selection bias, because people who visit the hospital tend to be sick already.
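
To make the hospital example concrete, here is a minimal sketch using made-up counts (the numbers, the "sick"/"healthy" labels, and the function names are all assumptions for illustration, not data from the paper). In this toy population, visiting the hospital has no effect on outcomes within either health group, yet the naive comparison makes hospital visitors look much worse off.

```python
# Hypothetical counts, invented for illustration only:
# (baseline health, visited hospital) -> (number of people, number with a bad outcome)
counts = {
    ("sick", True): (900, 270),
    ("sick", False): (100, 30),
    ("healthy", True): (100, 2),
    ("healthy", False): (900, 18),
}


def risk(cells):
    """Share of people with a bad outcome across a list of (people, events) cells."""
    people = sum(n for n, _ in cells)
    events = sum(e for _, e in cells)
    return events / people


def naive_risk_difference(counts):
    """Compare hospital visitors with non-visitors, ignoring baseline health."""
    visited = [cell for (_, hospital), cell in counts.items() if hospital]
    not_visited = [cell for (_, hospital), cell in counts.items() if not hospital]
    return risk(visited) - risk(not_visited)


def stratified_risk_differences(counts):
    """Make the same comparison separately within each baseline health group."""
    groups = sorted({health for health, _ in counts})
    return {
        g: risk([counts[(g, True)]]) - risk([counts[(g, False)]])
        for g in groups
    }


print(naive_risk_difference(counts))
print(stratified_risk_differences(counts))
```

With these invented counts, the naive risk difference is about 0.22, while the difference within each health group is zero. The entire naive gap comes from who goes to the hospital, not from what the hospital does.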

Second, the follow-up time over which outcomes are measured may be less clear in observational data. In an RCT, the index date (i.e., the start of the follow-up period) typically occurs at the date of randomization. For observational data, the index date may be less clear.

How can we solve the problems of selection bias and index date selection?

Hernán and co-authors argue that the best way to conceive of a valid observational study design is to attempt to emulate a hypothetical RCT that would answer the research question of interest.

How does one emulate a target trial with observational data?

The paper states that this is a two-step process.

The first step is articulating the causal question in the form of the protocol of a hypothetical randomized trial that would provide the answer. The protocol must specify certain key elements that define the causal estimands (eligibility criteria, treatment strategies, treatment assignment, the start and end of follow-up, outcomes, causal contrasts) and the data analysis plan. The randomized trial described in the protocol becomes the target study for the causal inference of interest.
The second step is explicitly emulating the components of that protocol using the observational data: finding eligible individuals, assigning them to a treatment strategy compatible with their data, following them up from assignment (time zero) until outcome or end of follow-up, and conducting the same analysis as the corresponding target trial, except that there is adjustment for baseline confounders in an attempt to emulate random treatment assignment.
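
The second step can be sketched in a few lines of code. This is only an illustration under stated assumptions: the field names (age, prior_treatment, treated_at_time_zero, severe_at_baseline, days_to_event), the eligibility criteria, the 365-day follow-up, and the single binary confounder are all hypothetical, and standardization over one baseline variable is just one simple way to adjust for confounding, not necessarily the method used in the paper. Censoring and loss to follow-up are ignored.

```python
from collections import defaultdict

MAX_FOLLOW_UP_DAYS = 365  # assumed end of follow-up in the hypothetical protocol


def is_eligible(person):
    """Assumed eligibility at time zero: adult with no prior use of the treatment."""
    return person["age"] >= 18 and not person["prior_treatment"]


def had_event_during_follow_up(person):
    """Follow each person from time zero until the outcome or the end of follow-up."""
    days = person["days_to_event"]  # None if no event was recorded
    return days is not None and days <= MAX_FOLLOW_UP_DAYS


def emulate_trial(people):
    """Keep eligible people and record strategy, baseline confounder, and outcome."""
    return [
        {
            "treated": p["treated_at_time_zero"],
            "severe": p["severe_at_baseline"],
            "event": had_event_during_follow_up(p),
        }
        for p in people
        if is_eligible(p)
    ]


def standardized_risk_difference(rows):
    """Risk difference (treated minus untreated), standardized over the confounder."""
    cells = defaultdict(lambda: [0, 0])  # (severe, treated) -> [people, events]
    for r in rows:
        cell = cells[(r["severe"], r["treated"])]
        cell[0] += 1
        cell[1] += int(r["event"])

    total = len(rows)
    difference = 0.0
    for severe in {r["severe"] for r in rows}:
        n_treated, e_treated = cells[(severe, True)]
        n_untreated, e_untreated = cells[(severe, False)]
        if n_treated == 0 or n_untreated == 0:
            raise ValueError("each stratum needs both treated and untreated people")
        # Weight each stratum by its share of the whole eligible cohort.
        weight = (n_treated + n_untreated) / total
        difference += weight * (e_treated / n_treated - e_untreated / n_untreated)
    return difference
```

The adjustment only accounts for confounders that are actually measured, which is why the protocol's list of baseline covariates matters so much in practice.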

What are key sources of bias to be concerned about?

Most people focus on the issue of selection bias: the treatment and control groups may differ in unobservable ways that cannot be controlled for. However, the treatment group itself may be a source of bias. For instance, in an RCT we generally observe when patients initiate treatment. In real-world data, unless we impose a clean period with no prior treatment, individuals may have been using the treatment for a short or a long time, and outcomes for new users and long-term users may differ. By attempting to emulate a target trial, often by focusing on individuals who initiate treatment, you can eliminate the bias that comes from mixing new users with long-term users.
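
As a rough sketch of how a clean period might be applied, the function below treats a person as a new user only if their first observed prescription fill comes after a full treatment-free window of observed enrollment. The function name, the inputs, and the 365-day default are assumptions for illustration; the appropriate window depends on the treatment and the data source.

```python
from datetime import date, timedelta


def find_time_zero(enrollment_start, fill_dates, washout_days=365):
    """Return the first fill date if it follows a full clean period, else None.

    enrollment_start: first day the person is observable in the data.
    fill_dates: dates on which the treatment was dispensed (may be empty).
    """
    if not fill_dates:
        return None  # never treated during the observed period
    first_fill = min(fill_dates)
    observed_before_fill = first_fill - enrollment_start
    if observed_before_fill < timedelta(days=washout_days):
        # Too little history to rule out earlier use: possibly a long-term user.
        return None
    return first_fill


# Usage with made-up dates:
new_user = find_time_zero(date(2020, 1, 1), [date(2021, 6, 1), date(2021, 7, 1)])
unclear = find_time_zero(date(2020, 1, 1), [date(2020, 3, 1)])
```

Here the first person has more than a year of treatment-free history before their first fill, so that fill date can serve as time zero. The second person's first fill comes too soon after enrollment to confirm new use, so they would be excluded.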

What are the limits of target trial emulation?

Hernán and co-authors name a few:

Selection bias may still remain. “Explicit target trial emulation alone cannot eliminate the bias that arises from lack of randomization—confounding from noncomparable treatment groups—even if the observational analysis correctly emulates all other components of the target trial.” Also, treatment assignment is not blinded (i.e., placebo-controlled studies are generally not feasible or available).

Missing data. “Some sources of routinely collected data (eg, administrative claims databases) may have reasonably detailed data on treatments and outcomes but insufficient data on clinical factors that require adjustment.” For instance, claims data often have rich information on patient comorbidity but limited information on disease severity.

Limited to treatments used in practice. Use of observational data is problematic if a treatment has not yet been approved for use or is rarely used in real-world settings (e.g., treatment for rare disease), as the sample sizes in these cases will be small (or even zero).

You can read the full article here.