Diabetes Predictor

In this project I work with healthcare data to explore the health of Pima Indian women.
Using Python, I first cleaned and explored the data. I then applied statistical methods and data visualization techniques to reveal hidden patterns, correlations, and insights within the data. This analysis paved the way for predictive modeling: using machine learning, I took on the role of a predictor, trying to forecast possible diabetes diagnoses.

This notebook dives into the Pima Indian Diabetes Dataset.
Originally from the National Institute of Diabetes and Digestive and Kidney Diseases, it contains information on 768 women from a population near Phoenix, Arizona, USA.
The dataset is available here: Link to Kaggle.
My objective is to clean and organize the dataset through an exploratory data analysis (EDA), visualize the cleaned data to understand its statistical distributions, and create a model to predict diabetes for a new person outside the original data.


Exploratory Data Analysis and Machine Learning (Python)

We will use the following Python libraries:

  • pandas: A data manipulation and analysis library. We use it for loading the dataset into a DataFrame.
  • NumPy: A library for numerical computations.
  • scikit-learn: A machine learning library. We use it for creating and evaluating the classification model. Specifically, we use the RandomForestClassifier, train_test_split, confusion_matrix, and model evaluation classes and functions.
  • matplotlib: A data visualization library. We use it for creating plots to visualize the data.
  • seaborn: A data visualization library based on matplotlib. We use it for creating more complex plots.

1. Importing and Overview of the Dataset

We start by importing the necessary Python libraries and loading the dataset. The data is read into a pandas DataFrame, a two-dimensional tabular data structure with labeled axes, which is a common structure for statistical data.

The dataset comprises 768 rows and 9 columns.
The available columns are: Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, and Outcome.

2. Preliminary Data Exploration

No columns contain missing values.
However, we identified unrealistic zero values in Glucose, BloodPressure, SkinThickness, Insulin, and BMI. These columns should naturally never be zero, since such values are biologically implausible (e.g., a blood pressure of zero).
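A quick way to surface these implausible zeros is to count them per column. The sketch below uses a tiny toy DataFrame standing in for the real dataset (the values are illustrative only); the same one-liner works on the full 768-row frame.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are illustrative only)
df = pd.DataFrame({
    "Glucose":       [148, 0, 183, 89],
    "BloodPressure": [72, 66, 64, 0],
    "SkinThickness": [35, 29, 0, 23],
    "Insulin":       [0, 0, 0, 94],
    "BMI":           [33.6, 26.6, 23.3, 28.1],
})

# Columns where a zero is biologically implausible
zero_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
zero_counts = (df[zero_cols] == 0).sum()
print(zero_counts)
```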

3. Data Cleaning & Preprocessing

I decided to replace the zero values with the median of each respective column.
The median is less sensitive to outliers than the mean, which makes it a safer choice for imputation.
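One common way to implement this is to mark the zeros as missing and then fill them with each column's median. The sketch below shows the pattern on a small illustrative sample; the same two lines apply unchanged to the full dataset.

```python
import numpy as np
import pandas as pd

# Illustrative sample; the same steps apply to the full 768-row dataset
df = pd.DataFrame({
    "Glucose": [148.0, 0.0, 183.0, 89.0],
    "BMI":     [33.6, 0.0, 23.3, 28.1],
})

zero_cols = ["Glucose", "BMI"]
# Treat zeros as missing, then impute with each column's median
# (median() skips NaN by default, so the zeros don't bias it)
df[zero_cols] = df[zero_cols].replace(0, np.nan)
df[zero_cols] = df[zero_cols].fillna(df[zero_cols].median())
print(df)
```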

4. Exploratory Data Analysis (EDA)

Histograms are displayed for each feature to understand the data distributions after replacing the zero values.
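In pandas this is a one-liner: `DataFrame.hist` draws one histogram per numeric column. The sketch below uses synthetic stand-in data and the non-interactive Agg backend so it runs headless; in the notebook, `df` is the cleaned DataFrame.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in data; in the notebook this is the cleaned DataFrame
df = pd.DataFrame({
    "Glucose": rng.normal(120, 30, 200),
    "BMI":     rng.normal(32, 7, 200),
})

# One histogram per column, on a shared grid of axes
axes = df.hist(bins=20, figsize=(8, 4))
plt.tight_layout()
plt.savefig("histograms.png")
```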

5. Statistics and Data Visualization

Pregnancies: Most of the data is clustered around the lower values, suggesting that a majority of the women in the dataset have fewer pregnancies. The outliers on the upper end indicate some women with a higher number of pregnancies than typical.
Glucose: The data distribution is fairly symmetric, centered around the median. There aren’t any noticeable outliers, indicating that glucose levels in the dataset are mostly within a typical range.
BloodPressure: The distribution is again reasonably symmetric. There are a few outliers on both the lower and upper ends, suggesting some instances of unusually low or high blood pressure.
SkinThickness: The majority of the data is concentrated in a particular range, but there are some noticeable outliers on the upper end. This suggests that while most women have skin thickness measurements within a typical range, some instances are unusually high.
Insulin: Most of the data is clustered towards the lower end, with several outliers on the higher side. This indicates that while many women in the dataset have lower insulin levels, there are also some with significantly elevated levels.
BMI: The data distribution appears fairly symmetric around the median. However, there are outliers on the upper end, suggesting some instances of very high BMI.
DiabetesPedigreeFunction: Most of the data is towards the lower range, but there are several outliers on the upper end. This indicates that while most women have a lower diabetes pedigree function score, some have a significantly higher score, possibly suggesting a stronger genetic disposition.
Age: The majority of the women in the dataset appear to be on the younger side, as indicated by the data clustering towards the lower range. There are outliers on the higher end, indicating the presence of older individuals in the dataset.

Glucose & Outcome: There’s a moderate correlation of 0.47, suggesting that glucose levels have a noticeable relationship with the diabetes outcome.
Age & Pregnancies: There’s a moderate correlation of 0.54, indicating that as age increases, the number of pregnancies tends to increase as well.
BMI & SkinThickness: There’s a moderate correlation of 0.39, suggesting a relationship between body mass index and skin fold thickness.
Most other correlations are relatively low, suggesting limited linear relationships between them. However, remember that correlation doesn’t imply causation, and non-linear relationships won’t be captured well by this metric.
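The correlations above come from a Pearson correlation matrix, which pandas computes with `DataFrame.corr`. The sketch below demonstrates the call on toy Age/Pregnancies columns with a built-in dependence (the generating formula is invented for illustration); in the notebook, `corr` is computed on the full cleaned DataFrame and visualized as a seaborn heatmap.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300
age = rng.integers(21, 70, n).astype(float)
# Toy dependence so a positive correlation is visible (formula is invented)
pregnancies = np.clip((age - 20) / 8 + rng.normal(0, 1.5, n), 0, None)
df = pd.DataFrame({"Age": age, "Pregnancies": pregnancies})

corr = df.corr()  # Pearson correlation by default
print(corr.round(2))
```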

I will now focus on the relation between Glucose & Outcome, discovered in our correlation analysis.

With Diabetes (Red): Individuals with diabetes often have glucose levels centered around 140-150.
Without Diabetes (Blue): Those without diabetes typically have glucose levels around 100-110.
Overlap: There’s an overlapping region, showing that while glucose is indicative, it’s not the sole marker for diabetes. Some with elevated glucose might be prediabetic or have other conditions.

6. Model selection and split

I chose the Random Forest classifier, a robust ensemble learning method known for its high accuracy and its ability to handle large datasets.
I split the data into training and test sets (80/20) and trained the model.
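The split-and-train step looks roughly like the sketch below. It uses a synthetic stand-in for the 8 Pima features and binary Outcome (via `make_classification`) so it runs self-contained; the `random_state` values are assumptions, not the notebook's actual seeds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 8 Pima features and binary Outcome
X, y = make_classification(n_samples=768, n_features=8, random_state=42)

# 80/20 split, as in the notebook
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out 20%
```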

7. Prediction

I made predictions for two hypothetical individuals: a young woman who eats well and goes to the gym, and an older, overweight woman.
The young woman is predicted as non-diabetic, while the older woman is predicted as diabetic.
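Predicting for new individuals amounts to building a one-row-per-person DataFrame with the same feature columns and calling `model.predict`. In the sketch below both the tiny training set and the two profiles use invented values for illustration; in the notebook, the model is fitted on the full cleaned dataset.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
        "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Tiny illustrative training set (values invented for this sketch);
# in the notebook the model is fitted on the full cleaned dataset
X_train = pd.DataFrame([
    [1, 95, 70, 20, 80, 22.0, 0.3, 25],
    [0, 100, 72, 22, 85, 24.0, 0.2, 30],
    [8, 170, 88, 40, 200, 38.0, 0.9, 55],
    [10, 160, 90, 42, 210, 40.0, 1.1, 60],
], columns=cols)
y_train = [0, 0, 1, 1]  # 0 = non-diabetic, 1 = diabetic

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

new_people = pd.DataFrame([
    [1, 90, 68, 20, 79, 21.5, 0.25, 24],   # young, healthy profile
    [9, 165, 92, 41, 205, 39.5, 1.0, 58],  # older, overweight profile
], columns=cols)
print(model.predict(new_people))
```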

8. Model Evaluation

The model’s accuracy is 74.7%, meaning it makes correct predictions about three-quarters of the time.
While this is a decent rate, aiming for higher accuracy is always beneficial, especially in medical applications.

The Random Forest model’s performance on the test data is summarized as follows:

True Negatives (78): The model correctly confirmed that 78 individuals did not have diabetes. This means it’s doing a great job in many cases.
True Positives (37): It rightly identified 37 individuals with diabetes, showcasing its potential in catching actual cases.
False Positives (21): In 21 cases, the model was overly cautious and predicted diabetes when there wasn’t any. It’s always better to be safe, but we can fine-tune this further.
False Negatives (18): It missed 18 cases where diabetes was present. It’s a reminder that there’s room to make the model even better.
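The counts above can be turned back into the headline metrics with a few lines of arithmetic; the reported 74.7% accuracy follows directly from them, and precision and recall add useful context for a medical screening task.

```python
# Metrics recomputed from the confusion-matrix counts reported above
tn, fp, fn, tp = 78, 21, 18, 37

total = tn + fp + fn + tp
accuracy  = (tp + tn) / total   # share of correct predictions
precision = tp / (tp + fp)      # how often a "diabetic" call is right
recall    = tp / (tp + fn)      # share of actual cases caught

print(f"accuracy:  {accuracy:.1%}")   # 74.7%
print(f"precision: {precision:.1%}")
print(f"recall:    {recall:.1%}")
```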