Regression to the mean code

This code is based upon the analysis in Barnett et al published in the International Journal of Epidemiology. Last updated: October 2007

Code to calculate the expected size of the regression to the mean effect in SAS and R, and an example Analysis of Covariance (ANCOVA) using proc glmmod in SAS, lm in R, and glm in Stata, as well as a brief description of the assumptions of ANCOVA, and a few good references.

SAS code to calculate the expected RTM effect, Equations (1) and (2)
data work.rtmeffect;
* Change these parameters depending on your data;
sigma=15; * total std;
mu=60; * population mean;
cut=40; * cut-off;
* Loops to run through rho and m scenrarios;
do rho=0.0 to 1.00 by 0.1; * within-subject correlation;
sigma2_w=(1-rho)*(sigma**2); * within-subject variance;
sigma2_b=rho*(sigma**2); * between-subject variance;
do m=1 to 1; * Number of baseline measurements;
zg=(cut-mu)/sigma; * z;
zl=(mu-cut)/sigma; * z;
x1g=pdf('Normal',zg); * phi - probability density;
x2g=1-cdf('Normal',zg); * Phi - cumulative distribution function;
x1l=pdf('Normal',zl); * phi;
x2l=1-cdf('Normal',zl); * Phi;
czl=x1l/x2l; * C(z) in paper;
czg=x1g/x2g; * C(z) in paper;
Rl=(sigma2_w/m)/sqrt(sigma2_b+(sigma2_w/m))*czl; * RTM effect, Equations (1) m=1 & (2) m>1;
Rg=(sigma2_w/m)/sqrt(sigma2_b+(sigma2_w/m))*czg; * RTM effect;
output;
end;
end;
run;
proc sort data=work.rtmeffect;
by rho m;
run;
title "The expected RTM effect for a range of baseline samples sizes and rhos";
proc print data=work.rtmeffect label noobs;
var m rho sigma sigma2_b sigma2_w Rl Rg;
label sigma="std(Y)" sigma2_b="between-subject variance" sigma2_w="within-subject variance" rho="within-subject correlation" Rl="RTM effect (<cut-off)"
Rg="RTM effect (>cut-off)" m="# baseline measurements";
format Rl Rg 10.2 rho 15.3;
run;

SAS code to perform an ANCOVA, using the Nambour skin cancer prevention trial as an example
* Read in the Nambour data (n=96);
* Note!: treatment codes are scrambled from those in paper and will not give the same results;
data work.nambour;
input betac_b betac_f group @@;
datalines;
0.04 1.29 0 0.44 1.36 1 2.7 1.5 0 0.34 2.62 0
0.65 0.03 0 0.7 5.6 1 1 0.43 1 2.25 4.91 0
0.22 0.2 0 0.03 0.16 0 0.49 0.42 1 0.63 0.48 0
0.34 0.24 1 1.67 3.9 1 0.23 0.22 1 0.7 0.75 0
0.16 0.32 1 0.8 0.31 1 0.22 1.6 0 0.16 0.42 1
1.08 0.72 1 0.49 0.64 1 0.73 4.8 0 1.1 0.63 1
0.86 0.87 1 1.24 0.32 1 0.4 0.3 0 1.1 4.45 0
0.36 0.32 0 0.47 2.24 0 0.92 2.72 0 0.41 0.43 0
0.13 0.13 1 0.44 1.69 1 0.12 0.1 1 0.37 0.13 1
0.44 0.26 1 0.91 1.46 0 0.67 0.45 0 1.01 0.47 1
0.91 0.36 1 0.56 1.46 0 0.82 0.62 0 0.96 0.79 1
0.18 0.61 0 0.65 0.34 0 1.8 1.2 0 1.5 1 1
0.11 0.55 0 2.41 2.4 0 1.36 2.73 1 0.57 1.62 0
0.34 0.43 1 0.47 0.1 1 0.48 6.55 0 0.78 2.1 0
0.48 0.42 1 0.51 0.65 0 0.27 0.15 1 1.13 0.94 0
0.24 0.51 0 0.6 5.5 1 0.55 0.47 0 0.72 2.63 0
1 1.77 1 0.6 3.41 1 0.24 0.17 1 0.27 0.26 0
0.33 1.82 1 0.33 0.31 1 0.49 0.58 0 0.46 0.91 1
1.17 2.04 0 0.31 2.4 1 0.17 0.32 1 0.19 1.65 1
0.32 0.51 1 0.56 0.54 1 1.31 1.13 0 0.11 1.34 1
0.42 0.78 1 1.29 3.31 1 0.22 0.36 1 0.51 0.55 0
0.17 0.16 1 0.64 0.54 1 1.02 0.02 0 0.28 2.26 1
0.22 0.16 0 0.26 1.62 0 1.04 0.98 1 0.13 1.01 1
0.14 0.17 1 1.2 4.5 0 0.08 0.13 0 1.24 11 1
;
run;
* Log-transform the betacarotene data;
data work.lognambour;
set work.nambour;
dummy=1;
l_betac_b=log(betac_b);
l_betac_f=log(betac_f);
label l_betac_b='Log-transformed baseline betacarotene'
l_betac_f='Log-transformed follow-up betacarotene'
group='Treatment group';
run;
* Calculate the baseline mean;
proc univariate data=work.lognambour noprint;
by dummy;
var l_betac_b;
output out=work.bstats mean=meanb;
run;
* Difference the baseline mean from every baseline observation;
data work.meandiff;
merge work.lognambour work.bstats;
by dummy;
adiff=l_betac_b-meanb; * Baseline-mean;
run;
* ANCOVA;
proc genmod data=work.meandiff;
class group;
model l_betac_f=adiff group; * Follow-up=(baseline-mean);
estimate 'Treatment (drug-plac)' group 1 -1;
run; quit;

R code to calculate the expected RTM effect, Equations (1) and (2)
# Change these parameters depending on your data;
sigma<-15; # total std;
mu<-60; # population mean;
cut<-40; # cut-off;
# Loops to run through rho and m scenrarios;
sigma2_w=vector(length=11,mode="numeric")
sigma2_b=vector(length=11,mode="numeric")
Rl=vector(length=11,mode="numeric")
Rg=vector(length=11,mode="numeric")
rho=vector(length=11,mode="numeric")
for (rhox in 0:10){
rho[rhox+1]<-rhox/10
sigma2_w[rhox+1]<-(1-rho[rhox+1])*(sigma^2); # within-subject variance;
sigma2_b[rhox+1]<-rho[rhox+1]*(sigma^2); # between-subject variance;
for (m in 1:1){ # Number of baseline measurements;
zg<-(cut-mu)/sigma; # z;
zl<-(mu-cut)/sigma; # z;
x1g<-dnorm(x=zg); # phi - probability density;
x2g<-1-pnorm(q=zg); # Phi - CDF
x1l<-dnorm(x=zl); # phi;
x2l<-1-pnorm(q=zl); # Phi;
czl<-x1l/x2l; # C(z) in paper;
czg<-x1g/x2g; # C(z) in paper;
Rl[rhox+1]<-(sigma2_w[rhox+1]/m)/sqrt(sigma2_b[rhox+1]+(sigma2_w[rhox+1]/m))*czl; # RTM effect, Equations (1) m=1 & (2) m>1;
Rg[rhox+1]<-(sigma2_w[rhox+1]/m)/sqrt(sigma2_b[rhox+1]+(sigma2_w[rhox+1]/m))*czg; # RTM effect;
}
}
output<-cbind(sigma2_b,sigma2_w,rho,Rl,Rg)
print("The expected RTM effect for a range of baseline samples sizes and rhos")
print(output)
print("sigma2_b=between-subject variance, sigma2_w=within-subject variance")
print("rho=within-subject correlation, Rl=RTM effect (<cut-off), Rg=RTM effect (>cut-off)");

R code to perform an ANCOVA, using the Nambour skin cancer prevention trial as an example
# read in the data
nambour<-read.table("c:/temp/nambour.csv",sep=",",header=T)
# log-transform the data
nambour$l_betac_b<-log(nambour$betac_b)
nambour$l_betac_f<-log(nambour$betac_f)
# Calculate the baseline mean
meanb<-mean(nambour$l_betac_b)
# Difference the baseline mean from every baseline observation
nambour$adiff<-nambour$l_betac_b-meanb
#ANCOVA using lm
model<-lm(formula=l_betac_f~adiff+group,data=nambour)
summary(model)

Stata code to perform an ANCOVA, using the Nambour skin cancer prevention trial as an example
* read in the data
insheet using "C:\temp\nambour.csv", comma
* log-transform the data
gen l_betac_b=log(betac_b)
gen l_betac_f=log(betac_f)
* Calculate the baseline mean
egen float meanb = mean(l_betac_b)
* Difference the baseline mean from every baseline observation
gen adiff=l_betac_b-meanb
* ANCOVA using glm
glm l_betac_f adiff group

Assumptions of ANCOVA

The standard assumptions are that the measurements are Normally distributed and that the variance of the outcome variable is constant. If this is not the case the data should be transformed (where this is sensible). ANCOVA further assumes that the measurement errors in different subjects are independent and that the treatment effect and baseline measurements are independent.

As with any regression analysis it is important to check the assumptions. Simple scatter-plots of the estimated model residuals against time, subject, and the predicted value may indicate the need for a non-linear model or a non-constant variance.

Three references on ANCOVA are:

Barnett AG, van der Pols JC, Dobson AJ. Regression to the mean: what it is and how to deal with it International Journal of Epidemiology Feb 2005;34(1):215-20;

The American Statistician

Applied longitudinal data analysis for epidemiology: a practical guide.

Links:

To the Longitudinal Studies Unit home page.

To the top of this page.