High-level implications and results are outlined below. For a more detailed analysis, please refer to each respective section.
We discuss the steps taken toward the project goal: to build a classification model that reliably predicts the Alzheimer’s stage from baseline data, to analyze cost/accuracy trade-offs, and to determine the importance of individual questions in the ADAS (a type of cognitive test).
Baseline Model
For the first pass, we considered a regularized Logistic Regression with an L2 penalty. We achieved an AUC score of 0.88, though we note somewhat inferior performance for the “transition” class LMCI, with an AUC of only 0.69.
Additionally, we quickly recognized that the Cognitive Tests have the most predictive power in our classification model.
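As an illustration of how a per-class figure such as the LMCI AUC can be obtained, the following is a minimal sketch of a one-vs-rest AUC computed from predicted class probabilities. It assumes a one-vs-rest scheme, which the summary does not confirm. The function names, the class labels other than LMCI, and the toy probabilities are all hypothetical and are not taken from the project data.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(y_true, proba, classes):
    """Score each class against all other classes (assumed one-vs-rest scheme)."""
    result = {}
    for k, cls in enumerate(classes):
        labels = [1 if y == cls else 0 for y in y_true]
        scores = [row[k] for row in proba]
        result[cls] = binary_auc(labels, scores)
    return result


# Toy data only: hypothetical labels and predicted probabilities.
classes = ["CN", "LMCI", "Dementia"]
y_true = ["CN", "CN", "LMCI", "LMCI", "Dementia", "Dementia"]
proba = [
    [0.70, 0.20, 0.10],
    [0.50, 0.40, 0.10],
    [0.40, 0.35, 0.25],
    [0.20, 0.50, 0.30],
    [0.10, 0.45, 0.45],
    [0.10, 0.20, 0.70],
]
per_class_auc = one_vs_rest_auc(y_true, proba, classes)
```

In this toy example the two outer classes are separated perfectly while the middle class is not, which mirrors the kind of gap reported for the transition stage.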
Model Selection
Next, we trained several classification models and picked the one with the best AUC score on the test set. Even though all of our models performed roughly the same, within a ±5 basis point AUC band, LDA and Logistic Regression underperformed on the transition disease stage (“LMCI”).
Ultimately, we decided to continue our analysis of the full model with the Gradient Boosted Model (GBM), which achieved the marginally best AUC score (0.93) and a CV score of 78% on the training set.
Full Model
After selecting the GBM classifier hyperparameters, we achieved an AUC of 0.92 and ~78% CV accuracy on the training set. We consider these results acceptable given the limited time frame and the fact that we restricted ourselves to observations from the baseline visits.
We also learned that including the full set of predictors, beyond the baseline predictors, did not increase the predictive power of our model.
Notably, Cognitive Tests remained the most important feature category, irrespective of our choice of classification model.
Feature Cost Analysis
We performed a forward feature category selection process based on an accuracy/cost trade-off. Three cost categories (Medical, Early Detection, and Invasiveness) were assigned to each category of features. We ran different scenarios for the weights assigned to these cost categories and confirmed that the feature category selection changes accordingly and makes sense given the constraints imposed on the system (for more detailed results, please see the respective notebook).
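The selection process described above can be sketched as a greedy loop. This is a minimal sketch under stated assumptions, not the procedure from the notebook: the objective (accuracy gain minus a weighted cost), the feature category names, the cost values, and the toy accuracy function are all hypothetical. In practice `accuracy_fn` would retrain and cross-validate a model on the chosen categories.

```python
# Hypothetical costs per feature category on a 0-1 scale,
# ordered as (medical, early_detection, invasiveness).
COSTS = {
    "cognitive_tests": (0.20, 0.10, 0.10),
    "mri": (0.80, 0.30, 0.30),
    "csf_biomarkers": (0.60, 0.20, 0.90),
    "demographics": (0.05, 0.10, 0.00),
}

# Hypothetical accuracy contribution of each category (toy numbers).
TOY_GAIN = {
    "cognitive_tests": 0.30,
    "mri": 0.08,
    "csf_biomarkers": 0.06,
    "demographics": 0.02,
}


def toy_accuracy(selected):
    """Stand-in for cross-validated accuracy of a model using `selected`."""
    return 0.40 + sum(TOY_GAIN[c] for c in selected)


def weighted_cost(category, weights):
    """Combine the three cost categories using scenario-specific weights."""
    return sum(w * c for w, c in zip(weights, COSTS[category]))


def forward_select(categories, accuracy_fn, weights):
    """Greedily add the category with the best accuracy gain net of cost.

    Stops when no remaining category has a positive net gain.
    """
    selected = []
    remaining = list(categories)
    current_acc = accuracy_fn(frozenset())
    while remaining:
        best_cat, best_net = None, 0.0
        for cat in remaining:
            acc = accuracy_fn(frozenset(selected + [cat]))
            net = (acc - current_acc) - weighted_cost(cat, weights)
            if net > best_net:
                best_cat, best_net = cat, net
        if best_cat is None:
            break
        selected.append(best_cat)
        remaining.remove(best_cat)
        current_acc = accuracy_fn(frozenset(selected))
    return selected, current_acc


# Scenario 1: only medical cost matters, and only slightly.
low_cost_order, _ = forward_select(COSTS, toy_accuracy, (0.05, 0.0, 0.0))
# Scenario 2: invasiveness is also penalized.
gentle_order, _ = forward_select(COSTS, toy_accuracy, (0.05, 0.0, 0.10))
```

With these toy numbers, the first scenario selects all four categories, while the second drops the most invasive category and promotes the cheapest one, which is the kind of weight-dependent behavior the scenarios were meant to check.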
ADAS Individual Scores
We used a Gradient Boosted Model and a Logistic Regression to analyze the individual questions of the ADAS-13 assessment. The partial dependence plots of the GBM showed that only questions 1, 4, 7, and 14 displayed responses to all three diagnosis classes. A t-test of 100 bootstrapped Logistic Regression models further confirmed that the same four questions were significant at the 90% confidence level, and three of them at the 95% level. Using this information, we trained new GBM and Logistic Regression models on only these four questions as features, and obtained marginally higher testing scores.
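To make the significance check more concrete, here is a minimal sketch of one way to judge a coefficient from its bootstrap replicates. It is an assumption, not the test from the notebook: it swaps in a Wald-style z statistic (mean of the bootstrap coefficients divided by their standard deviation, used as the bootstrap standard error) with normal critical values, since the summary does not specify the exact t-test. The question names and coefficient values are hypothetical, and the lists are far shorter than 100 replicates for readability.

```python
from statistics import NormalDist, mean, stdev


def bootstrap_z(coefs):
    """z statistic: mean coefficient over the bootstrap standard error.

    The standard deviation of the bootstrap replicates is used as the
    standard error of the coefficient (an assumed, Wald-style reading).
    """
    se = stdev(coefs)
    if se == 0:
        raise ValueError("bootstrap coefficients have zero spread")
    return mean(coefs) / se


def significant(coefs, confidence):
    """Two-sided test of 'coefficient differs from zero' at `confidence`."""
    critical = NormalDist().inv_cdf(0.5 + confidence / 2)
    return abs(bootstrap_z(coefs)) > critical


# Hypothetical bootstrap coefficients for two unnamed questions.
toy_coefs = {
    "question_a": [0.8, 1.0, 1.2, 0.9, 1.1],
    "question_b": [-0.5, 0.6, 0.1, -0.2, 0.5],
}
results = {
    name: {level: significant(values, level) for level in (0.90, 0.95)}
    for name, values in toy_coefs.items()
}
```

A question whose coefficient keeps the same sign across resamples passes at both levels, while one whose coefficient flips sign does not.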
Further improvements could include more research into establishing realistic weights for the cost/accuracy optimization model, with the physician fine-tuning the weights to their particular clinic situation.
Additionally, a deep dive into the longitudinal nature of the dataset is a potential source of significant improvement, as we believe the ability to, e.g., track attrition of brain volume over time would likely show a significant correlation with the progression of Alzheimer’s disease. This conjecture is based on our EDA, where we saw a significant decrease in hippocampus volume in patients with Dementia versus Cognitively Normal patients as well as from one of the link
Yet another improvement would be to fine-tune our model with an emphasis on early detection, as we believe this would likely be the best data-driven tool for clinicians, providing actionable feedback that could ultimately improve patient health outcomes.