EDA With Python

Exploratory Data Analysis (EDA) is the first and most important step in the data analysis workflow.
It allows analysts and scientists to understand the patterns, relationships, and anomalies within the data, making it ready for a deeper analysis or the development of predictive models. This process involves a series of procedures designed to systematically understand the characteristics of the dataset, regardless of its size or complexity.

Choosing the Right Tools for EDA

An EDA can be performed using a variety of different tools, ranging from basic spreadsheet software to advanced data analysis platforms. Each tool has its strengths for EDA:
• Spreadsheet Software (e.g., Excel, Google Sheets): Accessible and straightforward, suitable for quick and small exploratory tasks.
• Statistical Software (e.g., R, SPSS, SAS): Offers powerful statistical analysis capabilities and is highly valued in certain fields, such as academia.
• Business Intelligence Tools (e.g., Tableau, Power BI): Excellent for data visualization and commonly used for business analytics.
• Programming Languages (e.g., Python, Julia): Provide flexibility and depth, enabling data manipulation, analysis, and modeling in one tool.

Personally, I always use Python for EDA. It is a favorite for many because it’s free, easy to learn, and has libraries like Pandas for data manipulation, and Matplotlib and Seaborn for visualization. It’s widely used by companies, with more than 57% (and increasing) of data scientist positions requiring proficiency in Python (source). Python, known for its simplicity, versatility, and vast library ecosystem, is the dominant tool in data analysis. Its adoption is expected to grow by another 50% in 2024, indicating a strong trend toward its use in various technological domains, including data science and artificial intelligence (source).
Python’s capabilities, combined with the extensive support and resources available to developers, make it an indispensable tool in data science and analytics.

Steps of EDA

For this post I will use a dataset called “Billionaires Statistics Dataset”, available on Kaggle. The dataset contains data about billionaires around the world and is compiled from various sources, including official reports, government databases, financial institutions, and reputable publications such as Forbes, Bloomberg, and the World Bank.

My goal is to show how to perform a basic EDA with Python and how to write comments that explain the results obtained, because it is important not only to write proper code and showcase the results of the analysis, but also to explain the findings to the public.

I will describe every line of code with an explanation before and a sample comment after.


Load the Data and Initial Review

The goal of this first section in Exploratory Data Analysis (EDA) is to load the dataset into a Python environment and get an initial sense of its content. This involves understanding the basic structure of the data, including the number of entries (rows), features (columns), and the type of data each feature contains (numeric, categorical, dates, etc.). This step is crucial as it lays the foundation for the entire EDA process, enabling you to plan subsequent steps effectively.

I will now load the dataset into the Python environment using a library called pandas, known for its efficiency and ease of use in handling structured data.
The read_csv function from pandas is typically used for this purpose, given that many datasets are in CSV (Comma Separated Values) format.

After loading, it’s important to visualize the first few entries of the dataset using the head() method. This provides a quick snapshot of the data, including sample values for each column.
This step helps in getting a preliminary idea of the data’s nature and the type of information each column holds.
To understand the data structure, the info() method is used to obtain a summary of the dataset, including the total number of entries, the number of non-null entries per column, and the data type of each column (e.g., integer, float, object for text, or boolean). This method helps in identifying if there are missing values in the dataset and understanding the kind of data manipulation or cleaning that might be needed in the future steps.
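A minimal sketch of this first step might look like the following. The file name is an assumption (adjust it to the path of your downloaded Kaggle file); a tiny inline sample stands in for the real CSV so the snippet is self-contained:

```python
from io import StringIO
import pandas as pd

# In practice you would pass the path of the downloaded Kaggle file, e.g.:
# df = pd.read_csv("Billionaires Statistics Dataset.csv")  # hypothetical file name
# Here a tiny inline sample stands in for the real file so the snippet runs on its own.
sample = StringIO(
    "rank,finalWorth,personName,age,country\n"
    "1,211000,Bernard Arnault,74,France\n"
    "2,180000,Elon Musk,51,United States\n"
)
df = pd.read_csv(sample)

print(df.head())   # first rows: a quick look at sample values per column
df.info()          # row count, non-null counts, and dtype of each column
```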

The “Billionaires Statistics Dataset” contains 2,640 entries and 35 columns.
It offers a wide range of data about billionaires, including their rank, net worth, personal details, and various statistics related to their country of residence.
Here are some key columns included in the dataset (you can always find information about the features from the source that provided the dataset):
– rank: The billionaire’s rank based on their net worth.
– finalWorth: The final worth of the billionaire in millions of USD.
– personName: Name of the billionaire.
– age: Age of the billionaire.
– country: The billionaire’s country.
– city: The city where the billionaire resides.
– source: Source of wealth.
– industries: Industries the billionaire is involved in.
– selfMade: Indicates whether the billionaire is self-made.
– gender: Gender of the billionaire.
– birthDate: Birthdate of the billionaire.
– organization, title, state, residenceStateRegion: Various details about the billionaire’s business and residence.
– birthYear, birthMonth, birthDay: Components of the billionaire’s birthdate.
– education, networth: Additional details about the billionaire’s education and net worth.

Data Cleaning

Remove Unnecessary Columns

A preliminary (but not mandatory) step is to remove columns that are not necessary for the analysis, with drop().

I decided to remove some columns (10 of the 35) that are not essential for the EDA at this stage.
This will simplify the process of data cleaning.
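Since the exact columns dropped are not listed in the text, the snippet below uses a few illustrative names from the dataset’s schema to show the pattern with drop():

```python
import pandas as pd

# Small stand-in for the loaded dataset; in the post, df holds all 35 columns.
df = pd.DataFrame({
    "rank": [1, 2],
    "finalWorth": [211000, 180000],
    "birthYear": [1949, 1971],
    "birthMonth": [3, 6],
    "birthDay": [5, 28],
})

# The post removes 10 of the 35 columns; the exact list is not shown,
# so these names are illustrative.
df = df.drop(columns=["birthYear", "birthMonth", "birthDay"])
print(df.columns.tolist())
```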

Remove Duplicates

The first step of Data Cleaning is always to check and remove duplicate rows that can skew the analysis, leading to misleading results. Use duplicated().sum() to identify and drop_duplicates() to remove duplicate entries.
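A sketch of this check, on a small stand-in DataFrame:

```python
import pandas as pd

# Stand-in data with one fully duplicated row
df = pd.DataFrame({
    "personName": ["A", "B", "B"],
    "finalWorth": [100, 50, 50],
})

n_dupes = df.duplicated().sum()      # number of fully duplicated rows
print(f"Duplicated rows: {n_dupes}")
if n_dupes > 0:
    df = df.drop_duplicates()        # keep the first occurrence of each row
```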

There are no duplicated entries in the dataset and we can continue with data cleaning.

Handling Missing Values

Another mandatory step in this initial part of the EDA is to handle missing values, which can compromise the integrity of the analysis. We can check for them using isnull().sum().
It is possible to drop them with dropna(), or impute them with fillna(), using the mean/median/mode, forward/backward fill, a placeholder, or a prediction.

For organization: I decided to drop the column, since there are too many missing values.
For country, city, and first name: I decided to keep the columns and replace missing values with “Unknown” to indicate that the data is not available.
For age and birthDate: these columns have some missing values, but to avoid interfering with the later analysis I will leave them as NaN and NaT (the missing-value indicators for numbers and dates, respectively).
For the rest of the columns: the remaining columns with missing values are related to the country (e.g., gdp_country or population_country). For the countries that are not unknown, it is possible to fill in the data manually, if it can be found in a trustworthy source. For the moment I will leave them as missing.
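The decisions above can be sketched as follows, on a small stand-in frame (column names follow the dataset’s schema):

```python
import numpy as np
import pandas as pd

# Stand-in frame with a few of the columns discussed above
df = pd.DataFrame({
    "organization": [np.nan, np.nan, "LVMH"],
    "country": ["France", np.nan, "United States"],
    "city": [np.nan, "Austin", "Paris"],
    "age": [74.0, np.nan, 51.0],
})

print(df.isnull().sum())                    # missing values per column

df = df.drop(columns=["organization"])      # too many missing values
df[["country", "city"]] = df[["country", "city"]].fillna("Unknown")
# age (and birthDate) are left as NaN/NaT so that numeric and date
# operations keep skipping them automatically
```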

Ensure Data Type Consistency

Ensuring data type consistency involves verifying and, if necessary, converting the data types of columns in your dataset to match their content and the needs of your analysis. You can convert a column to its appropriate data type using the astype() method in pandas for basic type changes (e.g., converting to int, float, str, or bool), or specific functions like pd.to_datetime for handling dates.
We already saw before the info of the dataset (step 1), so let’s jump into the code:

To optimize the analysis, I converted the date column to the datetime format and the categorical columns to the category type.
The column “gdp_country” was incorrectly stored as “object”; I converted it to float after removing the dollar sign that was interfering with the conversion.
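A sketch of these conversions; the sample GDP strings below assume both a dollar sign and thousands separators, so both are stripped before the float cast:

```python
import pandas as pd

# Stand-in frame; gdp_country mimics the raw text format (e.g. "$2,715,518,274,227")
df = pd.DataFrame({
    "birthDate": ["1949-03-05", "1971-06-28"],
    "gender": ["M", "M"],
    "gdp_country": ["$2,715,518,274,227", "$21,427,700,000,000"],
})

df["birthDate"] = pd.to_datetime(df["birthDate"])   # object -> datetime64
df["gender"] = df["gender"].astype("category")      # object -> category

# Strip the dollar sign (and, if present, thousands separators) before casting
df["gdp_country"] = (
    df["gdp_country"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
```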

Summary Statistics

This step generates descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset’s distributions. Use describe() for an overview of all the numeric columns.
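For example (the numbers here are illustrative stand-ins, not the dataset’s actual statistics):

```python
import pandas as pd

# Illustrative numeric columns (finalWorth is in millions of USD)
df = pd.DataFrame({
    "age": [74, 51, 92, 65],
    "finalWorth": [211000, 180000, 106000, 4600],
})

stats = df.describe()   # count, mean, std, min, quartiles, max per numeric column
print(stats)
```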

Comment (about the most important columns or the ones with weird values):
rank: Indicates the billionaire’s rank based on final worth. The ranking varies from 1 to 2540, which shows the dataset covers a broad range of billionaires from the richest to those at the lower end of the billionaire spectrum.
finalWorth: Reflects the net worth of each billionaire in millions of dollars. The mean net worth is approximately $4.6 billion, but the standard deviation is high, about $9.8 billion, indicating a significant disparity in the wealth of individuals within the list. The maximum net worth is an extraordinary $211 billion, highlighting the vast difference between the top and the average billionaires.
age: The ages range from 18 to 101 years, showcasing a wide diversity in the ages of billionaires. The median age is 65 years, which implies that half of the billionaires are younger than 65 and half are older, providing insights into the distribution of ages in the financial elite.

Data Visualization

Using visual methods to understand relationships, patterns, and trends within the data. Employ Matplotlib and Seaborn for visual exploration. Use histograms (hist()), boxplots (boxplot()), and scatter plots (scatterplot()) to examine distributions and relationships between variables.

The histogram shows the age distribution of billionaires. It appears slightly left-skewed, with a longer tail toward younger ages, but the majority of billionaires are in their 60s and 70s.

This bar chart shows the countries with the highest number of billionaires. It provides a clear view of which countries dominate the billionaire list; as we will see later, a country’s GDP and population can influence this measure.

The pie chart indicates a significant disparity in the number of male and female billionaires, with males dominating the dataset. This discrepancy highlights gender representation in the context of global wealth distribution, suggesting that males are much more prevalent in the billionaire category than females.

The bar chart shows that the “Finance & Investments” category produces the most billionaires, given its direct connection to wealth management and capital growth.
In second place is the “Manufacturing” category, traditionally a strong source of wealth, especially in economies transitioning from developing to industrialized status.
Third is “Technology”, reflecting the massive wealth generated in this sector by high-impact global companies and startups.

Outlier Identification

Detecting data points that deviate significantly from the norm is crucial, because outliers can distort your analysis. Visual tools (like boxplots), along with IQR (Interquartile Range) calculations, help identify these points and decide how to handle them.

The boxplot displays the distribution of the final worth of billionaires in millions of USD. It shows a significant number of outliers on the right, indicating that there are billionaires whose net worth is considerably higher than the majority.

Here, I created a function that detects outliers using the interquartile range for any given column.
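The function itself is not reproduced in the text, so here is a sketch of what such an IQR-based detector could look like, demonstrated on a toy age column rather than the real data:

```python
import pandas as pd

def detect_outliers_iqr(df, column):
    """Return rows whose `column` value falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = df[(df[column] < lower) | (df[column] > upper)]
    print(f"{column}: Q1={q1}, Q3={q3}, IQR={iqr}, "
          f"bounds=({lower}, {upper}), {len(outliers)} outliers")
    return outliers

# Toy example: only the 18-year-old falls below the lower bound
toy = pd.DataFrame({"age": [18, 56, 60, 65, 70, 75, 80]})
out = detect_outliers_iqr(toy, "age")
```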

– The first quartile (Q1) is 56 years.
– The third quartile (Q3) is 75 years.
– The Interquartile Range (IQR) is 19 years.
– The lower bound for outliers is 27.5 years, and the upper bound is 103.5 years.
– There are 9 outliers, indicating that 9 billionaires are significantly younger than the majority. The boxplot visually confirms the presence of outliers.

Correlation Analysis

Exploring the relationships between variables to understand how they interact with each other.
Correlation Matrix: Use DataFrame.corr() to calculate correlation coefficients between variables, identifying potential relationships.
Heatmaps: Visualize the correlation matrix with Seaborn’s heatmap() function to easily identify highly correlated variables or patterns.
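A sketch of these two steps (stand-in numbers; the actual coefficients come from the full dataset):

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Stand-in numeric columns from the dataset's schema
df = pd.DataFrame({
    "finalWorth": [211000, 180000, 106000, 4600],
    "age": [74, 51, 92, 65],
    "cpi_country": [110.0, 117.2, 180.4, 117.2],
    "life_expectancy_country": [82.5, 78.5, 69.4, 78.5],
})

corr = df.corr()   # pairwise Pearson correlation coefficients

sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.tight_layout()
plt.savefig("correlation_heatmap.png")
```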

The heatmap visualizes the correlation matrix of the numerical variables in the dataset. Each cell in the heatmap shows the correlation coefficient between the variables, ranging from -1 to 1:
A coefficient close to 1 indicates a strong positive correlation (as one variable increases, the other tends to increase as well).
A coefficient close to -1 indicates a strong negative correlation (as one variable increases, the other tends to decrease).
A coefficient around 0 indicates no linear relationship between the variables.
In this case there is no strong positive or negative correlation with the main variable “finalWorth”.
Some correlations are noticeable, however, such as a strong negative one between “cpi_country” and “life_expectancy_country”.
