(a)
- [x] Split the dataset into 70% train / 30% test. Train models with default model parameters.
- [x] Compute accuracy on training and test sets.
- [ ] Report results in a table with columns for training accuracy, test accuracy, and AUC.
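
A minimal sketch of part (a) with scikit-learn. The real dataset is not specified here, so synthetic data from `make_classification` stands in for it (an assumption), and `random_state=0` is fixed only for reproducibility; all other model parameters are left at their defaults.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic binary data stands in for the real dataset.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

# 70% train / 30% test, stratified so both parts keep the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Default hyperparameters; random_state is set only for reproducibility.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    rows.append({
        "model": name,
        "training accuracy": accuracy_score(y_train, model.predict(X_train)),
        "test accuracy": accuracy_score(y_test, model.predict(X_test)),
        # AUC uses the predicted probability of the positive class.
        "AUC": roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]),
    })

results = pd.DataFrame(rows).set_index("model")
print(results.round(3))
```

A large gap between training and test accuracy in this table is the first hint of overfitting, which parts (b) and (c) examine more closely.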
(b)
- [ ] Use 10 repeated train/test splits.
- [ ] Report the mean and standard deviation of test accuracy for each model.
- [ ] Which model has the lowest/highest variance and why?
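
One possible way to run part (b) is `StratifiedShuffleSplit` with `cross_val_score`, which draws 10 independent 70/30 splits and scores every model on the same splits. As before, the synthetic data and the fixed `random_state` values are assumptions for illustration, and the standard deviation is the sample estimate (`ddof=1`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic binary data stands in for the real dataset.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Bagging": BaggingClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

# 10 random 70/30 splits; the integer seed makes every model see the same splits.
cv = StratifiedShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    summary[name] = (scores.mean(), scores.std(ddof=1))
    print(f"{name:<14} mean={scores.mean():.3f}  std={scores.std(ddof=1):.3f}")
```

Comparing the `std` column across models gives the variance ranking asked for in the last item; with only 10 repeats the estimate is itself noisy, so small differences should be read with care.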
(c)
Plot train vs test accuracy for increasing model complexity:
- [ ] For Decision Tree: vary max_depth
- [ ] For Random Forest / Bagging: vary n_estimators
- [ ] For AdaBoost: vary n_estimators
- [ ] Interpret curves in terms of bias–variance trade-off.
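
The sweeps in part (c) could be sketched as a simple loop over parameter values on a single 70/30 split. The parameter grids, the synthetic data, and the output file name `complexity_curves.png` are all assumptions chosen for illustration, not values prescribed by the task.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic binary data stands in for the real dataset.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Each entry: (function building a model for one value, values to sweep).
n_values = [1, 5, 10, 25, 50, 100]
sweeps = {
    "Decision Tree (max_depth)": (
        lambda d: DecisionTreeClassifier(max_depth=d, random_state=0),
        [1, 2, 3, 5, 8, 12]),
    "Random Forest (n_estimators)": (
        lambda n: RandomForestClassifier(n_estimators=n, random_state=0),
        n_values),
    "Bagging (n_estimators)": (
        lambda n: BaggingClassifier(n_estimators=n, random_state=0),
        n_values),
    "AdaBoost (n_estimators)": (
        lambda n: AdaBoostClassifier(n_estimators=n, random_state=0),
        n_values),
}

curves = {}
fig, axes = plt.subplots(1, len(sweeps), figsize=(16, 3.5), sharey=True)
for ax, (title, (make_model, values)) in zip(axes, sweeps.items()):
    train_acc, test_acc = [], []
    for value in values:
        model = make_model(value).fit(X_train, y_train)
        train_acc.append(model.score(X_train, y_train))
        test_acc.append(model.score(X_test, y_test))
    curves[title] = (train_acc, test_acc)
    ax.plot(values, train_acc, marker="o", label="train")
    ax.plot(values, test_acc, marker="o", label="test")
    ax.set_title(title)
    ax.set_xlabel("parameter value")
axes[0].set_ylabel("accuracy")
axes[0].legend()
fig.tight_layout()
fig.savefig("complexity_curves.png")
```

When reading the curves, a training curve that keeps rising while the test curve flattens or drops points to growing variance, whereas both curves staying low points to high bias.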
(d)
- Compute feature importances using mean decrease in impurity (MDI).
- Compute feature importances using permutation importance.
- Plot them as bar charts sorted in descending order.
- Identify the top 3 most important features for each model.
- [ ] Decision Tree
- [ ] Random Forest
- [ ] AdaBoost
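
A hedged sketch of part (d): MDI comes from each fitted model's `feature_importances_` attribute, and permutation importance from `sklearn.inspection.permutation_importance`, evaluated here on the test set. The feature names `x0` to `x9`, the synthetic data, `n_repeats=10`, and the file name `feature_importances.png` are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic binary data stands in for the real dataset.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
feature_names = [f"x{i}" for i in range(X.shape[1])]  # hypothetical names
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "AdaBoost": AdaBoostClassifier(random_state=0),
}

top3 = {}
fig, axes = plt.subplots(len(models), 2, figsize=(10, 9))
for row, (name, model) in zip(axes, models.items()):
    model.fit(X_train, y_train)
    measures = {
        # Mean decrease in impurity, computed from the training data.
        "MDI": model.feature_importances_,
        # Mean drop in test accuracy when one feature column is shuffled.
        "Permutation": permutation_importance(
            model, X_test, y_test, n_repeats=10,
            random_state=0).importances_mean,
    }
    top3[name] = {}
    for ax, (label, values) in zip(row, measures.items()):
        order = np.argsort(values)[::-1]  # indices sorted in descending order
        ax.bar([feature_names[i] for i in order], values[order])
        ax.set_title(f"{name}: {label}")
        top3[name][label] = [feature_names[i] for i in order[:3]]

fig.tight_layout()
fig.savefig("feature_importances.png")
for name, tops in top3.items():
    print(name, tops)
```

The two measures need not agree: MDI is derived from training-set splits and can favor features with many possible split points, while permutation importance reflects the effect on held-out accuracy.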