PHC321: Applied Biostatistics
- Can we predict patients’ baseline HbA1c (in mmol/mol) from their Total Cholesterol (in mmol/L)? Provide a complete statistical investigation using correlation and regression analyses (including the regression equation) with proper written interpretation and visual illustrations.
To predict patients’ baseline HbA1c (in mmol/mol) from their total cholesterol (in mmol/L), the data must satisfy the assumptions of the linear regression model (SPSS Statistics | IBM, n.d.):
- Both variables should be measured on a continuous scale.
- There needs to be a linear relationship between the two variables.
- There should be no significant outliers.
- The residual term “e” should be normally distributed with mean 0 at each value of X.
- The spread of the residuals should be constant across all values of X (homoscedasticity); e should not expand or contract as X increases.
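The outlier and residual assumptions can be checked numerically once a line has been fitted. The following is a minimal sketch in plain Python, not the SPSS procedure used in this report: the data are hypothetical toy values, the function names are illustrative, and the cutoff of 3 standardized residuals is a common rule of thumb rather than a setting taken from the analysis. The residuals are simply divided by the residual standard error, without a leverage adjustment.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def standardized_residuals(x, y):
    """Residuals divided by the residual standard error (n - 2 degrees of freedom)."""
    a, b = fit_line(x, y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = math.sqrt(sum(e ** 2 for e in residuals) / (len(x) - 2))
    return [e / s for e in residuals]

def flag_outliers(x, y, cutoff=3.0):
    """Indices of observations whose standardized residual exceeds the cutoff."""
    z = standardized_residuals(x, y)
    return [i for i, zi in enumerate(z) if abs(zi) > cutoff]

# Hypothetical toy data (NOT the study data): a straight line with small
# alternating deviations and one deliberately aberrant observation.
chol = [3.0 + 0.2 * i for i in range(20)]
hba1c = [60 + 2 * c + (0.5 if i % 2 == 0 else -0.5) for i, c in enumerate(chol)]
hba1c[10] += 40

outliers = flag_outliers(chol, hba1c)
residual_sum = sum(standardized_residuals(chol, hba1c))
print("flagged indices:", outliers)
print("sum of standardized residuals:", residual_sum)
```

Plotting the standardized residuals against X (or against the fitted values) is the usual visual check for equal spread; a funnel shape would suggest the constant-variance assumption is violated.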
Initially, because of some significant outliers, the two variables were not normally distributed, and Spearman’s rank-order correlation showed no significant association between them.
After removing the outliers, Pearson’s correlation analysis showed a significant but weak positive linear relationship between the two variables (r = 0.11, p-value = 0.013), so we can build a linear regression model to predict patients’ baseline HbA1c from their total cholesterol.
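The significance of a Pearson correlation can be checked by hand through the test statistic t = r·sqrt(n − 2) / sqrt(1 − r²). The sketch below assumes n = 497 (the ANOVA total df of 496 plus one) and uses r = 0.111 from the Model Summary table. The p-value uses a normal approximation to the t distribution, which is an assumption made here to stay within the Python standard library; it is reasonable only because the degrees of freedom are large.

```python
import math
from statistics import NormalDist

def pearson_t(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def approx_two_sided_p(t):
    """Two-sided p-value from the standard normal (large-sample approximation)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))

t_stat = pearson_t(0.111, 497)
p_value = approx_two_sided_p(t_stat)
print(f"t = {t_stat:.3f}, approximate p = {p_value:.3f}")
```

The result agrees with the t value reported for the slope in the Coefficients table, as expected in simple linear regression, where the test of the slope and the test of the correlation are equivalent.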
Running a linear regression analysis in SPSS with baseline HbA1c as the dependent variable and total cholesterol as the independent variable gives the following result tables:
Model Summary
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
| --- | --- | --- | --- | --- |
| 1 | .111 (a) | .012 | .010 | 17.08107 |
a. Predictors: (Constant), Total cholesterol (mmol/L) at baseline
ANOVA (a)
| Model | Source | Sum of Squares | df | Mean Square | F | Sig. |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | Regression | 1804.988 | 1 | 1804.988 | 6.186 | .013 (b) |
| | Residual | 144422.597 | 495 | 291.763 | | |
| | Total | 146227.585 | 496 | | | |
a. Dependent Variable: HbA1c (mmol/mol) at baseline
b. Predictors: (Constant), Total cholesterol (mmol/L) at baseline
Coefficients (a)
| Model | Term | Unstandardized B | Std. Error | Standardized Beta | t | Sig. |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | (Constant) | 62.269 | 3.037 | | 20.501 | .000 |
| | Total cholesterol (mmol/L) at baseline | 1.619 | .651 | .111 | 2.487 | .013 |
a. Dependent Variable: HbA1c (mmol/mol) at baseline
The Model Summary table gives R = 0.111 and R Square = 0.012, meaning that only 1.2% of the total variation in the dependent variable, baseline HbA1c, is explained by the independent variable, total cholesterol.
The ANOVA table indicates that the regression model predicts the dependent variable significantly better than a model with no predictor (F(1, 495) = 6.186, p-value = 0.013), although the proportion of variation explained is small.
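The quantities in the Model Summary and ANOVA tables are linked by simple arithmetic, so they can be cross-checked. The short sketch below recomputes the mean squares, F, R Square and the standard error of the estimate from the sums of squares and degrees of freedom reported above; the variable names are illustrative only.

```python
import math

# Values copied from the ANOVA table.
ss_regression = 1804.988
ss_residual = 144422.597
df_regression = 1
df_residual = 495

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_stat = ms_regression / ms_residual                       # F = MS_regression / MS_residual
r_squared = ss_regression / (ss_regression + ss_residual)  # share of variation explained
std_error_estimate = math.sqrt(ms_residual)                # residual standard deviation

print(f"F = {f_stat:.3f}")
print(f"R Square = {r_squared:.3f}")
print(f"Std. Error of the Estimate = {std_error_estimate:.5f}")
```

The recomputed values match the SPSS output to the reported precision, which confirms that the three tables describe the same fitted model.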
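From the Coefficients table, the regression equation is:
Baseline HbA1c = 62.269 + 1.619 × Total Cholesterol
That is, each 1 mmol/L increase in total cholesterol is associated with a predicted baseline HbA1c that is 1.619 mmol/mol higher on average.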
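As an illustration of how the fitted equation is used, the sketch below predicts baseline HbA1c for a given cholesterol value and computes an approximate 95% confidence interval for the slope from the coefficient and standard error in the Coefficients table. The function name and the example cholesterol value of 5.0 mmol/L are hypothetical, and the multiplier 1.96 is a normal approximation to the t critical value with 495 degrees of freedom, so the interval is approximate.

```python
INTERCEPT = 62.269   # (Constant) from the Coefficients table
SLOPE = 1.619        # B for total cholesterol
SLOPE_SE = 0.651     # Std. Error of B

def predict_hba1c(total_cholesterol):
    """Predicted baseline HbA1c (mmol/mol) for a total cholesterol value (mmol/L)."""
    return INTERCEPT + SLOPE * total_cholesterol

# Hypothetical example patient with total cholesterol of 5.0 mmol/L.
predicted = predict_hba1c(5.0)

# Approximate 95% confidence interval for the slope (normal approximation).
ci_lower = SLOPE - 1.96 * SLOPE_SE
ci_upper = SLOPE + 1.96 * SLOPE_SE

print(f"Predicted HbA1c: {predicted:.3f} mmol/mol")
print(f"Approximate 95% CI for slope: ({ci_lower:.3f}, {ci_upper:.3f})")
```

Because the interval excludes zero, it is consistent with the significant slope reported by SPSS. With R Square at 0.012, however, individual predictions carry a large error (the standard error of the estimate is about 17 mmol/mol).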
The following scatterplot gives a graphical representation of the model:
- Drawing on your conclusion in the previous question, can we add other variables to the regression model to control for confounding? Provide at least two confounding variables with proper justification.
Several variables are correlated with baseline HbA1c and could be added to the regression model to make it more precise:
- Duration of oral antidiabetic drug use (r = -0.127, p-value = 0.005)
- Age in years (r = -0.144, p-value = 0.001)
- Duration of lipid drug use (r = -0.097, p-value = 0.042)
- Duration of antihypertensive drug use (r = -0.137, p-value = 0.005)
- Diastolic BP (mmHg) at baseline (r = 0.094, p-value = 0.036)
- Alkaline Phosphatase (IU/L) at baseline (r = 0.11, p-value = 0.015)
All of the above variables are significantly correlated with baseline HbA1c, so they can be added to the regression model.
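Adding a covariate turns the model into a multiple regression, and a variable acts as a confounder when including it changes the cholesterol coefficient. The sketch below illustrates this idea only: it is not the SPSS analysis, the data are hypothetical and constructed without noise so the result is exact, and the helper names `solve` and `ols` are illustrative. It fits ordinary least squares by solving the normal equations in plain Python and compares the crude and age-adjusted cholesterol coefficients.

```python
def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(columns, y):
    """Return [intercept, b1, b2, ...] for y regressed on the given predictor columns."""
    rows = [[1.0] + list(values) for values in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical data (NOT the study data): HbA1c is built exactly as
# 100 + 2 * cholesterol - 0.5 * age, and age is related to cholesterol.
chol = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5]
age = [60.0, 50.0, 55.0, 45.0, 52.0, 40.0]
hba1c = [100 + 2 * c - 0.5 * a for c, a in zip(chol, age)]

crude = ols([chol], hba1c)          # HbA1c on cholesterol only
adjusted = ols([chol, age], hba1c)  # HbA1c on cholesterol and age

print("crude cholesterol slope:   ", round(crude[1], 3))
print("adjusted cholesterol slope:", round(adjusted[1], 3))
```

In this toy example the crude slope is much larger than the adjusted one because age is related to both cholesterol and HbA1c. A comparable shift in the cholesterol coefficient after adding a candidate variable to the real model would support treating that variable as a confounder, rather than merely as a correlate of HbA1c.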