PSI Wonderful Wednesday created a simulated data set based on a phase III clinical trial in psoriasis.
The simulated outcome variable is Pain, which was collected on a visual analogue scale (range: 0-100). Greater values mean worse pain.
A dichotomized version of pain is also included in the data set: Pain reduction from baseline of at least 30.
Covariates include age, gender, and BMI.
In this hypothetical study, the main interest lies in the comparison of an active treatment arm and a placebo arm. Data were collected at baseline and at ten follow-up time points, but the Pain endpoint has some level of missing data.
Looking at the distribution of missing data shows increasing missingness over time.
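A minimal Python sketch of how the share of missing values per visit could be tabulated. The data here is a made-up toy example (four patients, five visits), not the PSI data set, and the names `pain_by_patient` and `missing_proportion_by_visit` are hypothetical.

```python
# Toy wide-format data (invented for illustration): one list of pain scores
# per patient, one entry per visit, None marks a missing value.
pain_by_patient = {
    "P1": [60, 55, 50, None, None],
    "P2": [70, 65, None, 58, 52],
    "P3": [40, 38, 35, 30, None],
    "P4": [80, 75, 70, 66, 61],
}


def missing_proportion_by_visit(scores):
    """Return the share of patients with a missing value at each visit."""
    rows = list(scores.values())
    n_visits = len(rows[0])
    return [
        sum(row[v] is None for row in rows) / len(rows)
        for v in range(n_visits)
    ]


proportions = missing_proportion_by_visit(pain_by_patient)
for visit, share in enumerate(proportions, start=1):
    print(f"Visit {visit}: {share:.0%} missing")
```

Plotting these proportions against visit number gives the "missingness over time" view.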
But are there within-subject patterns that might be important?
An UpSet plot shows that the most common within-individual missingness pattern is monotone missingness from visit 6 onwards (monotone meaning that once a patient misses a visit, all later visits are missing too).
Maybe this needs further investigation...
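The counts behind an UpSet-style plot are just frequencies of each distinct set of missing visits. Here is a rough sketch of that tabulation on invented toy data; `toy_scores`, `missing_pattern` and `is_monotone` are hypothetical names, and nothing here reproduces the actual plot.

```python
from collections import Counter

# Toy data (invented): five patients, five visits, None = missing.
toy_scores = {
    "P1": [60, 55, 50, None, None],
    "P2": [70, 65, None, 58, 52],
    "P3": [40, 38, 35, None, None],
    "P4": [80, 75, 70, 66, 61],
    "P5": [52, 50, 47, None, None],
}


def missing_pattern(row):
    """Return the visit numbers (1-based) at which this patient is missing."""
    return tuple(v for v, x in enumerate(row, start=1) if x is None)


def is_monotone(row):
    """True if no observed value follows a missing one.

    Complete rows count as (trivially) monotone.
    """
    seen_missing = False
    for x in row:
        if x is None:
            seen_missing = True
        elif seen_missing:
            return False
    return True


# Frequency of each missingness pattern, most common first.
patterns = Counter(missing_pattern(row) for row in toy_scores.values())
for pattern, n in patterns.most_common():
    label = "complete" if not pattern else f"missing visits {pattern}"
    print(label, n)
```

In this toy example the most common pattern is dropout from visit 4 onwards, and patient P2 is the only one with a non-monotone (intermittent) gap.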
It looks like missingness increases over time,
but this time series is hella hard to read, yo!
By ordering patients by their missingness pattern, we can see that monotone missingness increases throughout the study, with around a third of patients having monotone missingness covering at least the last two time points - not great if your primary endpoint is Visit 10!
This was achieved by:
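One way to produce this kind of ordering (a sketch on invented toy data, not necessarily the code behind the plot) is to sort patients by how many visits are missing at the end of their record. The names `toy`, `trailing_missing`, `ordered` and `share_two_plus` are hypothetical.

```python
# Toy data (invented): four patients, five visits, None = missing.
toy = {
    "P1": [60, 55, 50, None, None],
    "P2": [70, 65, None, 58, 52],
    "P3": [40, 38, 35, 30, None],
    "P4": [80, 75, 70, 66, 61],
}


def trailing_missing(row):
    """Number of consecutive missing values at the end of a patient's row."""
    count = 0
    for x in reversed(row):
        if x is not None:
            break
        count += 1
    return count


# Patients who dropped out earliest (longest trailing gap) come first.
ordered = sorted(toy, key=lambda pid: trailing_missing(toy[pid]), reverse=True)

# Share of patients missing at least their last two visits.
share_two_plus = sum(trailing_missing(r) >= 2 for r in toy.values()) / len(toy)

print(ordered)
print(f"{share_two_plus:.0%} are missing at least the last two visits")
```

Drawing the patients in this order makes the dropout "staircase" visible, which is much easier to read than an unordered grid.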
We often assume the data are missing at random (MAR).
This just means that the chance of a value being missing depends only on the data we do have, not on the unobserved values themselves.
If we plot the demographic data split into two groups:
1) those with data (BLUE), and
2) those with missing data (GREY),
we can see whether some demographics are more closely related to missingness than others.
*NOTE: the time points included here are 6 to 10.
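A small sketch of the grouping behind such a comparison, using invented records rather than the real data. Each hypothetical record holds an age and the pain scores at five late visits; `patients` and `split_by_missingness` are made-up names.

```python
from statistics import mean

# Invented records: (age, pain scores at five late visits); None = missing.
patients = [
    (34, [50, 48, 45, 44, 40]),
    (71, [62, 60, None, None, None]),
    (45, [55, 50, 49, 47, 46]),
    (68, [70, None, None, None, None]),
    (52, [58, 57, 55, None, None]),
    (29, [43, 41, 40, 38, 35]),
]


def split_by_missingness(records):
    """Split ages into (fully observed, at least one missing value)."""
    complete = [age for age, scores in records if None not in scores]
    incomplete = [age for age, scores in records if None in scores]
    return complete, incomplete


ages_complete, ages_incomplete = split_by_missingness(patients)
print("mean age, complete data:", mean(ages_complete))
print("mean age, missing data: ", mean(ages_incomplete))
```

The same split can be applied to any other covariate, such as gender or BMI, before plotting the two groups side by side.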
It looks like the older participants are more likely to have missing data.
But that's OK! Kind of...
By knowing that missingness is related to age, we can use this variable to help us predict what these patients might have scored had their data not been missing!
We can use multiple imputation to "fill in" the gaps in a data set based on the other variables.
When we know which variables are most related to the missing data, we can use them to inform our imputation.
It is "multiple" because we create many imputed data sets, analyse each one, and then combine the results.
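To make the "many data sets, then combine" idea concrete, here is a deliberately simplified Python sketch on invented numbers. It imputes missing pain scores from age with a straight-line regression plus random noise, repeats that 20 times, and combines the results with Rubin's rules. Assumptions: the data, the single predictor, and all function names are hypothetical, and a proper implementation would also draw the regression parameters afresh for each imputation and would typically use more covariates.

```python
import random
from statistics import mean, variance


def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b, residual_sd)."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sd = (sum(r ** 2 for r in residuals) / (len(x) - 2)) ** 0.5
    return a, b, sd


def impute_once(ages, pain, rng):
    """Fill each missing pain value with a prediction from age plus noise."""
    observed = [(a, p) for a, p in zip(ages, pain) if p is not None]
    a0, b, sd = fit_line([a for a, _ in observed], [p for _, p in observed])
    return [
        p if p is not None else a0 + b * age + rng.gauss(0, sd)
        for age, p in zip(ages, pain)
    ]


def pool(estimates, variances):
    """Rubin's rules: pooled estimate and its total variance."""
    m = len(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # variance between imputed estimates
    return mean(estimates), within + (1 + 1 / m) * between


# Invented data: age and final-visit pain, None = missing.
ages = [34, 71, 45, 68, 52, 29, 60, 41]
pain = [40, None, 46, None, 50, 35, None, 44]

rng = random.Random(1)  # fixed seed so the sketch is repeatable
completed = [impute_once(ages, pain, rng) for _ in range(20)]

# Analyse each completed data set (here: the mean pain and its variance).
estimates = [mean(d) for d in completed]
variances = [variance(d) / len(d) for d in completed]

pooled_mean, total_var = pool(estimates, variances)
print(f"pooled mean pain: {pooled_mean:.1f} (SE {total_var ** 0.5:.1f})")
```

The between-imputation term is what carries the extra uncertainty from not knowing the missing values, which is why a single filled-in data set would look overconfident. Note that this sketch does not constrain imputed values to the 0-100 scale.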
The graph based on the imputed data shows no real difference from the one based on the observed data.
This is good, really, as it gives us some confidence that the age-related MAR missingness was not biasing the results.