Pima Indians Diabetes
1 Introduction
This project involves reproducible research with Docker [1]. The Pima Indians Diabetes Dataset is a well-known collection of health information frequently used by students and professionals learning data science and machine learning; you’ll often find it used in examples on websites like Kaggle.
It originally came from a major health study conducted by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). The data focuses specifically on women, aged 21 and older, who are of Pima heritage and lived near Phoenix, Arizona. The main goal when using this dataset is typically to predict whether a person has diabetes (a ‘yes’ or ‘no’ outcome) based on several common medical measurements recorded for each individual.
These measurements include the number of times pregnant, blood glucose levels, blood pressure, skin thickness, insulin levels, body mass index (BMI), age, and a diabetes pedigree function score, which relates to family history. This dataset is popular for practice because it’s easily available and contains realistic challenges found in real-world data.
One particular challenge it’s famous for is the presence of zero values in columns where that’s biologically impossible, like blood pressure, skin thickness, or BMI. This requires data scientists to figure out how to handle these seemingly incorrect or missing values before building predictive models.
2 Workflow
The workflow proceeds according to the chart below.
3 Raw
3.1 File Info
attributes | value |
---|---|
size | 23105 |
isdir | FALSE |
mode | 644 |
mtime | 2025-05-21 18:47:17 |
ctime | 2025-05-21 18:47:17 |
atime | 2025-05-21 18:47:17 |
uid | 0 |
gid | 0 |
uname | root |
grname | root |
4 Recoded Data
One of the problems with the Pima Indians Diabetes dataset is that a number of observations are coded as zero when they should have been coded “NA”. For example, a person’s blood_pressure was sometimes coded as 0, even though nobody can take part in a study with a blood pressure of zero.
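The recoding step can be sketched in a few lines. This is a minimal Python illustration, not the code used to produce this report; the helper name `recode_zeros` and the list of affected columns are assumptions, the latter inferred from the summary tables, where only these five columns have a non-zero minimum after recoding.

```python
# Columns where a zero cannot be a real measurement (assumed list; pregnancies
# and outcome are left alone because 0 is a legitimate value for them).
ZERO_IS_MISSING = ("glucose", "blood_pressure", "skin_thickness", "insulin", "bmi")


def recode_zeros(row):
    """Return a copy of row with impossible zeros replaced by None (i.e. NA)."""
    return {
        name: None if name in ZERO_IS_MISSING and value == 0 else value
        for name, value in row.items()
    }


# Eighth row of the raw-data appendix: blood pressure, skin thickness and
# insulin are all recorded as 0.
raw = {
    "pregnancies": 10, "glucose": 115, "blood_pressure": 0,
    "skin_thickness": 0, "insulin": 0, "bmi": 35.3,
    "dbf": 0.134, "age": 29, "outcome": 0,
}
recoded = recode_zeros(raw)
print(recoded)
```

The `outcome` value of 0 survives the recoding because it is a valid class label rather than a missing measurement.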
4.1 Summary
Normal
variable | vars | mean | sd | median | trimmed | mad | min | max | range |
---|---|---|---|---|---|---|---|---|---|
pregnancies | 1 | 3.30 | 3.02 | 2.00 | 2.88 | 2.97 | 0.00 | 13.00 | 13.00 |
glucose | 2 | 110.64 | 24.78 | 107.00 | 109.09 | 23.72 | 44.00 | 197.00 | 153.00 |
blood_pressure | 3 | 70.88 | 12.16 | 70.00 | 70.71 | 11.86 | 24.00 | 122.00 | 98.00 |
skin_thickness | 4 | 27.24 | 10.03 | 27.00 | 26.86 | 10.38 | 7.00 | 60.00 | 53.00 |
insulin | 5 | 130.29 | 102.48 | 102.50 | 112.27 | 63.01 | 15.00 | 744.00 | 729.00 |
bmi | 6 | 30.86 | 6.56 | 30.10 | 30.55 | 7.12 | 18.20 | 57.30 | 39.10 |
dbf | 7 | 0.43 | 0.30 | 0.34 | 0.38 | 0.22 | 0.08 | 2.33 | 2.25 |
age | 8 | 31.19 | 11.67 | 27.00 | 28.98 | 7.41 | 21.00 | 81.00 | 60.00 |
outcome | 9 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
Diabetic
variable | vars | mean | sd | median | trimmed | mad | min | max | range |
---|---|---|---|---|---|---|---|---|---|
pregnancies | 1 | 4.87 | 3.74 | 4.00 | 4.60 | 4.45 | 0.00 | 17.00 | 17.00 |
glucose | 2 | 142.32 | 29.60 | 140.00 | 142.04 | 34.84 | 78.00 | 199.00 | 121.00 |
blood_pressure | 3 | 75.32 | 12.30 | 74.50 | 75.22 | 11.12 | 30.00 | 114.00 | 84.00 |
skin_thickness | 4 | 33.00 | 10.33 | 32.00 | 32.74 | 8.90 | 7.00 | 99.00 | 92.00 |
insulin | 5 | 206.85 | 132.70 | 169.50 | 184.28 | 74.87 | 14.00 | 846.00 | 832.00 |
bmi | 6 | 35.41 | 6.61 | 34.30 | 34.91 | 5.71 | 22.90 | 67.10 | 44.20 |
dbf | 7 | 0.55 | 0.37 | 0.45 | 0.50 | 0.31 | 0.09 | 2.42 | 2.33 |
age | 8 | 37.07 | 10.97 | 36.00 | 36.28 | 11.86 | 21.00 | 70.00 | 49.00 |
outcome | 9 | 1.00 | 0.00 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.00 |
4.2 Outliers
4.3 Missing
5 Imputed Data
Any outliers more than three standard deviations from the mean are set to NA. Counting both the improperly coded values and the outliers set to NA, there were a total of XXX NAs.
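The three-standard-deviation rule can be illustrated as follows. This is a hedged Python sketch, not the report's own code: the function name `mask_outliers` is hypothetical, the sample standard deviation is assumed, and the rule is applied to a single column of toy values (whether the report applies it per outcome group is not stated).

```python
from statistics import mean, stdev


def mask_outliers(values, n_sd=3.0):
    """Replace values more than n_sd sample SDs from the mean with None.

    Entries that are already None (missing) are ignored when computing the
    mean and standard deviation and are passed through unchanged.
    """
    observed = [v for v in values if v is not None]
    centre, spread = mean(observed), stdev(observed)
    return [
        None if v is not None and abs(v - centre) > n_sd * spread else v
        for v in values
    ]


# Toy insulin-like column: 20 ordinary values, one extreme value, one NA.
base = [90, 95, 100, 105, 110] * 4
values = base + [846, None]
cleaned = mask_outliers(values)
print(cleaned)
```

Because the extreme value itself inflates the mean and standard deviation, the rule only triggers when there are enough ordinary observations around it, which is why the toy column uses 20 of them.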
5.1 Summary
Diabetic
variable | vars | mean | sd | median | trimmed | mad | min | max | range |
---|---|---|---|---|---|---|---|---|---|
pregnancies | 1 | 4.73 | 3.55 | 4.00 | 4.52 | 4.45 | 0.00 | 13.00 | 13.00 |
glucose | 2 | 142.21 | 29.60 | 140.00 | 141.92 | 34.84 | 78.00 | 199.00 | 121.00 |
blood_pressure | 3 | 75.23 | 11.79 | 74.50 | 75.23 | 11.12 | 40.00 | 108.00 | 68.00 |
skin_thickness | 4 | 31.62 | 8.86 | 32.00 | 31.76 | 8.90 | 7.00 | 56.00 | 49.00 |
insulin | 5 | 197.44 | 112.11 | 173.00 | 183.91 | 86.73 | 14.00 | 510.00 | 496.00 |
bmi | 6 | 35.12 | 6.03 | 34.20 | 34.80 | 5.63 | 22.90 | 53.20 | 30.30 |
dbf | 7 | 0.53 | 0.31 | 0.44 | 0.49 | 0.30 | 0.09 | 1.39 | 1.31 |
age | 8 | 36.93 | 10.78 | 36.00 | 36.19 | 11.86 | 21.00 | 67.00 | 46.00 |
outcome* | 9 | 2.00 | 0.00 | 2.00 | 2.00 | 0.00 | 2.00 | 2.00 | 0.00 |
Normal
variable | vars | mean | sd | median | trimmed | mad | min | max | range |
---|---|---|---|---|---|---|---|---|---|
pregnancies | 1 | 3.30 | 3.02 | 2.00 | 2.88 | 2.97 | 0.00 | 13.00 | 13.00 |
glucose | 2 | 110.58 | 24.72 | 107.00 | 109.00 | 23.72 | 44.00 | 197.00 | 153.00 |
blood_pressure | 3 | 70.75 | 11.42 | 70.00 | 70.58 | 11.86 | 38.00 | 108.00 | 70.00 |
skin_thickness | 4 | 26.84 | 9.91 | 27.00 | 26.38 | 10.38 | 7.00 | 60.00 | 53.00 |
insulin | 5 | 119.50 | 78.85 | 100.00 | 107.70 | 59.30 | 15.00 | 510.00 | 495.00 |
bmi | 6 | 30.84 | 6.44 | 30.30 | 30.58 | 7.12 | 18.20 | 52.30 | 34.10 |
dbf | 7 | 0.41 | 0.25 | 0.33 | 0.38 | 0.21 | 0.08 | 1.46 | 1.38 |
age | 8 | 30.91 | 11.08 | 27.00 | 28.89 | 7.41 | 21.00 | 68.00 | 47.00 |
outcome* | 9 | 1.00 | 0.00 | 1.00 | 1.00 | 0.00 | 1.00 | 1.00 | 0.00 |
5.2 Outliers
5.3 Correlogram
5.4 Missing
6 Models
6.1 KNN
The K-Nearest Neighbors (KNN) algorithm is a foundational and intuitive method in supervised machine learning, applicable to both classification and regression problems, with roots tracing back to non-parametric statistics work by Fix and Hodges in 1951 and formal analysis by Cover and Hart in 1967. Its core theory rests on the simple idea that similar data points exist in close proximity in the feature space. KNN is considered an instance-based, or “lazy,” learning algorithm because it doesn’t build an explicit model during training; instead, it memorizes the entire labeled training dataset. When predicting an outcome for a new, unseen data point, the algorithm identifies the ‘K’ closest data points (the “nearest neighbors”) from the training set based on a chosen distance metric, typically Euclidean distance. For classification, the prediction is the most frequent class label among these K neighbors (majority vote), while for regression, the prediction is usually the average or median of the neighbors’ values. The performance of KNN hinges critically on the choice of ‘K’ (balancing bias and variance), the selection of an appropriate distance metric, and often requires feature scaling, as distance calculations are sensitive to the range and units of the input variables.
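The prediction step described above can be sketched in plain Python. This is an illustrative toy, not the tuned model from this report: `knn_predict` is a hypothetical helper, the two features are made-up values assumed to be already scaled to a common range, and Euclidean distance is used.

```python
from collections import Counter
from math import dist


def knn_predict(train_x, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Rank training indices by Euclidean distance to the query point.
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy training set with two scaled features and a binary label.
train_x = [(0.10, 0.20), (0.20, 0.10), (0.15, 0.30),
           (0.80, 0.90), (0.90, 0.80), (0.85, 0.70)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_x, train_y, (0.20, 0.20)))  # near the class-0 cluster
print(knn_predict(train_x, train_y, (0.80, 0.80)))  # near the class-1 cluster
```

There is no training step beyond storing `train_x` and `train_y`, which is what "lazy" learning means here. For a binary outcome, an odd `k` avoids tied votes.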
6.2 Logistic Regression
Logistic regression is a fundamental statistical method and supervised learning algorithm primarily used for binary classification problems, aiming to predict the probability of an observation belonging to one of two outcomes (e.g., yes/no, pass/fail). Developed significantly by statistician David Cox in the mid-20th century, it’s widely applied across various fields. Unlike linear regression, it models the probability (p) of the default class (typically ‘1’) using the logistic function (also known as the sigmoid function):
\[ p = \frac{1}{1 + \exp(-z)} \]
This S-shaped function takes an input z, which is a linear combination of the predictor variables \(x_1, \dots, x_n\), and transforms it into a probability value between 0 and 1. The linear combination input (z) actually represents the log-odds (or logit) of the event occurring. This means logistic regression models the log-odds as a linear function of the predictors:
\[ \log\left(\frac{p}{1-p}\right) = z = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n \]
As a parametric model, logistic regression learns the coefficients \(\beta_i\) from the training data, typically through an iterative optimization process called Maximum Likelihood Estimation (MLE). The final output probability (p) is then usually converted into a class prediction by applying a threshold (commonly 0.5).
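The relationship between the linear predictor, the probability, and the class label can be checked numerically. The Python sketch below assumes made-up coefficients and inputs (they are not estimates from this report) and only covers prediction, not the maximum likelihood fit.

```python
from math import exp, log


def sigmoid(z):
    """Logistic function: maps any real z to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def predict_proba(intercept, coefs, x):
    """p = sigmoid(beta_0 + beta_1 * x_1 + ... + beta_n * x_n)."""
    z = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return sigmoid(z)


# Hypothetical coefficients and one observation with two scaled predictors.
beta_0, betas = -1.0, (0.8, 0.5)
x = (1.5, 0.4)

p = predict_proba(beta_0, betas, x)
log_odds = log(p / (1.0 - p))        # recovers z, the linear predictor
label = 1 if p >= 0.5 else 0         # threshold at 0.5
print(round(p, 4), round(log_odds, 4), label)
```

Taking the log-odds of `p` returns the linear predictor (here 0.4), which is the second equation above read from right to left.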
6.3 MARS
Multivariate Adaptive Regression Splines (MARS) is a non-parametric regression technique adept at modeling complex, non-linear relationships between predictor variables and a response variable. The algorithm operates in two stages, beginning with a forward pass that iteratively adds piecewise linear basis functions (hinge functions) to capture non-linearities and interactions. At the heart of most MARS algorithms is the model’s predictive equation, which takes the general form:
\[ \hat{f}(x) = \beta_0 + \sum_{m=1}^{M} \beta_m B_m(x) \]
where \(\hat{f}(x)\) is the predicted response, \(\beta_0\) is the intercept, \(B_m(x)\) are the basis functions (often hinge functions or products thereof), and \(\beta_m\) are their coefficients, determined through a process that typically leads to an initially overfit model. Subsequently, a backward pruning pass systematically removes less contributory terms, typically using Generalized Cross-Validation, to enhance model generalization and select the final set of M basis functions. This adaptive approach allows MARS to automatically identify important variables and their interactions, offering a flexible alternative to traditional linear models.
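To make the predictive equation concrete, the Python sketch below evaluates an already-fitted MARS-style model. The knots, coefficients, and helper names (`hinge`, `mars_predict`) are invented for illustration; the forward and backward passes that would choose them are not shown.

```python
def hinge(x, knot, direction=1):
    """max(0, x - knot) for direction +1, max(0, knot - x) for direction -1."""
    return max(0.0, direction * (x - knot))


def mars_predict(intercept, terms, x):
    """Evaluate f(x) = beta_0 + sum over m of beta_m * B_m(x).

    Each term is (beta_m, [(feature_index, knot, direction), ...]). A basis
    function with more than one hinge is their product, i.e. an interaction.
    """
    total = intercept
    for beta, hinges in terms:
        basis = 1.0
        for j, knot, direction in hinges:
            basis *= hinge(x[j], knot, direction)
        total += beta * basis
    return total


# Hypothetical model on two features (index 0 and index 1).
terms = [
    (0.01, [(0, 120.0, 1)]),                   # max(0, x0 - 120)
    (-0.02, [(0, 120.0, -1)]),                 # max(0, 120 - x0)
    (0.001, [(0, 120.0, 1), (1, 30.0, 1)]),    # interaction of two hinges
]
print(mars_predict(0.2, terms, (140.0, 35.0)))
print(mars_predict(0.2, terms, (100.0, 35.0)))
```

Each hinge is zero on one side of its knot, so the first and third terms contribute only when feature 0 exceeds 120, and the second only when it falls below 120.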
7 Results
7.1 Optimal Tuning Parameters
mtry | trees | tree_depth | learn_rate | loss_reduction | sample_size | .config |
---|---|---|---|---|---|---|
5 | 1750 | 2 | 0.004217 | 1e-07 | 0.2875 | Preprocessor1_Model14 |
7.2 Model Results
7.3 Receiver Operating Characteristic Curve
The Receiver Operating Characteristic (ROC) curve is a fundamental tool in machine learning and statistics for evaluating the performance of binary classification models. Originating from signal detection theory developed during World War II to analyze radar signals, the ROC curve graphically illustrates a classifier’s diagnostic ability across all possible classification thresholds. It plots the True Positive Rate (TPR), also known as sensitivity or recall (the proportion of actual positives correctly identified), on the Y-axis against the False Positive Rate (FPR), the proportion of actual negatives incorrectly identified as positive (equal to 1 minus specificity), on the X-axis. Each point on the curve corresponds to a specific threshold used to convert the model’s continuous output (like a probability score) into a binary decision (0 or 1); varying this threshold traces the curve. A model with better discriminative power will have a curve that bows towards the top-left corner (representing 100% TPR and 0% FPR), while a model performing no better than random chance lies along the diagonal line (TPR = FPR). The overall performance across all thresholds is commonly summarized by the Area Under the Curve (AUC or AUROC), a value between 0 and 1 where 0.5 corresponds to random chance, 1.0 to perfect classification, and values below 0.5 to rankings that are worse than chance. The AUC equals the probability that the model ranks a randomly chosen positive instance higher than a randomly chosen negative one.
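The construction of the curve and its area can be sketched directly from the definitions. The following Python example uses six made-up labels and scores (not this report's predictions) and hypothetical helper names. It sweeps the threshold over the observed scores and integrates with the trapezoidal rule.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs from sweeping the threshold over all scores."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Toy data: 1 = positive, 0 = negative, with a predicted score for each case.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
area = auc(curve)
print(curve)
print(area)
```

In this toy example 8 of the 9 positive-negative pairs are ranked correctly, and the trapezoidal area is the same 8/9, which matches the ranking interpretation of the AUC.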
7.4 Model ROC Curves
7.5 Models Ranked
rank | wflow_id | .config | .metric | mean |
---|---|---|---|---|
1 | base_xgboost | Preprocessor1_Model14 | roc_auc | 0.8458820 |
2 | base_random_forest | Preprocessor1_Model19 | roc_auc | 0.8457684 |
3 | base_mars | Preprocessor1_Model19 | roc_auc | 0.8457522 |
4 | normalized_neural_network | Preprocessor1_Model09 | roc_auc | 0.8429870 |
5 | normalized_mlp_brulee | Preprocessor1_Model18 | roc_auc | 0.8386580 |
6 | normalized_log_regression | Preprocessor1_Model18 | roc_auc | 0.8341937 |
7 | base_k_nrst_nghbr | Preprocessor1_Model11 | roc_auc | 0.8320292 |
8 | normalized_k_nrst_nghbr | Preprocessor1_Model11 | roc_auc | 0.8320292 |
9 | base_rpart | Preprocessor1_Model14 | roc_auc | 0.8168669 |
7.6 Best Tuning Parameters
base_xgboost
mtry | trees | tree_depth | learn_rate | loss_reduction | sample_size | .config |
---|---|---|---|---|---|---|
5 | 1750 | 2 | 0.004217 | 1e-07 | 0.2875 | Preprocessor1_Model14 |
7.7 Confusion Matrix
base_xgboost
8 Conclusion
The purpose of this report was to build a reproducible pipeline so that the analysis can be recreated from scratch. One caveat is that the results from the ML models were poor: the number of false negatives is too high, in my opinion. In the real world, a number of people would have been told that they did not have diabetes when in fact they did. This would have to be addressed in future modeling, with the hope that accuracy for both the positives and the negatives would increase.
Report generated at 2025-05-21 19:13:54.801701
Appendix - Raw Data
pregnancies | glucose | blood_pressure | skin_thickness | insulin | bmi | dbf | age | outcome |
---|---|---|---|---|---|---|---|---|
6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |
3 | 78 | 50 | 32 | 88 | 31.0 | 0.248 | 26 | 1 |
10 | 115 | 0 | 0 | 0 | 35.3 | 0.134 | 29 | 0 |
2 | 197 | 70 | 45 | 543 | 30.5 | 0.158 | 53 | 1 |
8 | 125 | 96 | 0 | 0 | 0.0 | 0.232 | 54 | 1 |
Appendix - Benchmarks
Altair RapidMiner was used to double-check both algorithm selection and performance. All of its algorithms performed worse than the tuned models above, and none of them stood out.
Model | AUC | Standard Deviation | Gains | Total Time | Training Time (1,000 Rows) | Scoring Time (1,000 Rows) |
---|---|---|---|---|---|---|
Generalized Linear Model | 0.663 | 0.120 | 6 | 1045 | 50.781 | 22.801 |
Deep Learning | 0.660 | 0.095 | 0 | 2160 | 248.698 | 26.059 |
Naive Bayes | 0.658 | 0.120 | 0 | 1797 | 31.250 | 94.463 |
Random Forest | 0.646 | 0.152 | -6 | 4980 | 15.625 | 91.205 |
Decision Tree | 0.643 | 0.068 | 0 | 1023 | 14.323 | 16.287 |
Support Vector Machine | 0.595 | 0.075 | 0 | 4084 | 66.406 | 91.205 |
Gradient Boosted Trees | 0.591 | 0.129 | 0 | 7558 | 169.271 | 45.603 |
Fast Large Margin | 0.489 | 0.114 | 0 | 1269 | 33.854 | 22.801 |
Logistic Regression | 0.343 | 0.094 | 0 | 1052 | 54.688 | 22.801 |