In this guide, I will show you the best ways to find datasets for projects that showcase your skills and belong in your portfolio. I personally use these resources when choosing projects to work on.
15+ Dataset Examples Included.
The Importance of Discovering the Right Dataset for Your Projects
Consider this: every significant project in data analytics began with a dataset.
Whether it is about predicting market trends, creating machine learning models, or uncovering hidden patterns in social behaviors, the journey always starts with a meticulous selection of data.
So, the search for the right dataset is not only a task but also an opportunity to improve your skills, challenge your understanding, and adapt to the complexities of real-world data analytics.
This search is crucial, and you should invest time and effort in it, because a carefully chosen dataset not only aligns with your project’s goals but also shows you what you can achieve. It challenges you to apply your skills in data cleaning, visualization, and machine learning in new and demanding contexts, and to see where you can improve.
Setting Clear Project Objectives: Examples for Every Type of Project
The first step in finding the right dataset for your portfolio or personal projects is to set clear, actionable objectives for your data analytics idea. Before starting with the research, it’s important to define what you hope to achieve. Are you looking to build a machine learning model to predict future trends? Or are you interested in performing exploratory data analysis to uncover hidden patterns?
Having a clear goal in mind makes the search process significantly easier. It allows you to focus your efforts on finding datasets that are relevant to your project’s objectives.
Choosing the right dataset involves considering the dataset’s relevance, quality, and the necessary preprocessing efforts. It’s also essential to assess whether the data aligns with your project’s objectives, such as the ones listed in the next sections.
Data Cleaning
Seek datasets with missing values, outliers, or incorrect entries. Examples include customer databases with incomplete profiles or sales records with anomalies.
Example here, 61 columns ready to be cleaned and analyzed: Food Choices – College students’ food and cooking preferences.
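A quick way to judge whether a dataset offers enough cleaning work is to count the missing values and flag the outliers in each column. The sketch below is only an illustration using the Python standard library: the sample rows and the column names (age, weekly_spend) are made up for this example and are not taken from the Food Choices dataset, and the 1.5 * IQR rule is just one common convention for spotting outliers.

```python
import csv
import io
import statistics

# Hypothetical sample standing in for a real CSV file.
# The columns and values are illustrative only.
SAMPLE_CSV = """age,weekly_spend
19,25
21,
20,30
,28
22,400
20,27
21,26
"""


def profile_column(values):
    """Count missing entries and flag outliers with the 1.5 * IQR rule."""
    present = [float(v) for v in values if v.strip() != ""]
    missing = len(values) - len(present)
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]
    return {"missing": missing, "outliers": outliers}


def profile_csv(text):
    """Build a per-column report for a CSV given as a string."""
    rows = list(csv.DictReader(io.StringIO(text)))
    columns = list(rows[0].keys())
    return {col: profile_column([row[col] for row in rows]) for col in columns}


report = profile_csv(SAMPLE_CSV)
for column, summary in report.items():
    print(column, summary)
```

On a real file, you would read the text from disk instead of using the inline sample. A dataset where most columns report missing values or outliers is a good candidate for a cleaning project.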
Data Visualization
Look for datasets with time series, geographical data, or categorical data. Ideal examples include weather patterns over the years or sales distribution across different regions.
Example here, a very simple dataset about the medals in the Summer Olympics in 2021 in Tokyo: Tokyo 2020 Olympic Medal Count.
Machine Learning
Classification
Datasets with labeled examples, such as email spam identification.
For example, the famous IRIS dataset, widely used in statistics and machine learning: IRIS Dataset.
Regression
Continuous data, like predicting house prices based on features.
Find an example here and try to predict the insurance cost: Insurance Forecast by using Linear Regression.
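To make the idea of regression concrete, here is a minimal sketch of a least-squares line fit with a single predictor, written with the Python standard library only. The ages and charges are synthetic numbers invented for this example, not values from the insurance dataset, and a real project would normally use several features and a dedicated library.

```python
from statistics import fmean


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (slope, intercept)."""
    mean_x, mean_y = fmean(xs), fmean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic, illustrative values: age of the customer and yearly charges.
ages = [20, 30, 40, 50, 60]
charges = [2100, 2900, 4100, 4900, 6000]

slope, intercept = fit_line(ages, charges)
predicted = intercept + slope * 45  # estimated charges for a 45-year-old

print(f"slope={slope:.1f}, intercept={intercept:.1f}, predicted={predicted:.1f}")
```

A dataset suits a regression project when the target (here, the charges) is a continuous number and at least some features show a visible relationship with it.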
Clustering
Unlabeled data to identify patterns or groups, such as customer segmentation.
In this example you can aim to determine the types of customers (target customers) who can easily convert into loyal customers, using your favorite clustering method: Mall Customer Segmentation Data.
Exploratory Data Analysis (EDA)
Choose datasets with a mix of variables (numerical, categorical) to uncover underlying patterns. Healthcare data with patient metrics or retail data with purchase histories are excellent examples.
Example here: Superstore.
Natural Language Processing (NLP)
Text-rich datasets like social media posts for sentiment analysis or news articles for topic modeling.
Example here: IMDB Reviews (50k).
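As a first experiment with a text dataset, a very simple lexicon-based sentiment score can show whether the reviews carry a usable signal. The sketch below is only a toy baseline: the word lists and the two reviews are invented for this example, and a real project would use a proper sentiment lexicon or a trained model.

```python
import re

# Tiny hypothetical word lists, for illustration only.
POSITIVE = {"great", "good", "excellent", "loved", "wonderful"}
NEGATIVE = {"bad", "boring", "terrible", "awful", "hated"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Number of positive words minus number of negative words."""
    tokens = tokenize(text)
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


# Made-up reviews standing in for rows of a real dataset.
reviews = [
    "A wonderful film, I loved the cast.",
    "Boring plot and terrible pacing.",
]
scores = [sentiment_score(review) for review in reviews]

for review, score in zip(reviews, scores):
    print(score, review)
```

A score above zero suggests a positive review and a score below zero a negative one. Comparing these scores with the labels in the dataset gives you a baseline to beat with a machine learning model.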
Time Series Analysis
Datasets with chronological entries, like stock prices or web traffic data, to forecast future trends. Example here: Bitcoin Price (2014-2023).
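A simple moving average is a common first baseline for this kind of data: you smooth the series and use the latest average as a naive forecast for the next period. The sketch below assumes a short list of made-up daily prices, not real Bitcoin data, and the window of 3 is an arbitrary choice for the example.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values, in order."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[end - window:end]) / window
        for end in range(window, len(values) + 1)
    ]


# Made-up daily closing prices, for illustration only.
prices = [100, 102, 101, 105, 107, 110]

averages = moving_average(prices, window=3)
naive_forecast = averages[-1]  # latest average used as the next-day guess

print(averages)
print(round(naive_forecast, 2))
```

Any forecasting model you build later should do better than this baseline, otherwise the extra complexity is not paying off.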
This targeted approach not only saves time but also ensures that the dataset you choose will be beneficial for your learning journey or portfolio development. Whether it’s selecting a dataset with a wide variety of features for machine learning or one with numerous entries for robust statistical analysis, defining your project goals is a crucial step in the dataset discovery process.
In the next section, I will describe my personal favorite online platforms, where I usually search for datasets for my projects or case studies.
Online Platforms and Data Repositories: My Personal Favorites
Kaggle is a dataset repository and also a community for data science enthusiasts, offering competitions, datasets, and a platform for collaboration. Its vast collection spans numerous topics, making it an important resource for analytics projects.
It makes it easy to search and filter by criteria such as usability rating or file size.
Example: Pizza Restaurant Sales.
A specialized repository for machine learning datasets, it categorizes datasets by ML task, making it easier to find data suited for specific projects.
Example: Car Evaluation.
As the home to a large number of datasets from the US government (290k+ datasets available), this site offers data across a wide range of fields. It’s particularly useful for projects with a focus on societal, environmental, or economic studies.
It has sources from organizations such as NASA, universities, cities, states, and national institutes.
Example: Crimes – 2001 to Present by City of Chicago.
This Reddit community is an excellent place for discovering diverse datasets, sharing resources, and discussing data with over 164,000 members. It’s a dynamic source for both finding and requesting datasets.
Example: Banned books across US state prisons.
This website has a dozen ready-to-use datasets. If you are a beginner, I personally suggest using this website and working on the Insurance Policies and Call Center datasets.
You need to log in to data.world to see and download the datasets (it’s free).
Example: Call Center.
Specialized Data Collections
This section dives into a curated selection of unique and powerful datasets tailored for specific analytical needs. These specialized data collections can strengthen your data analytics skills, opening new ways for exploration and insight.
Offers data from a crawl of over 5 billion web pages, ideal for projects requiring vast text data.
Provides moderate-resolution satellite images of Earth, useful for environmental and geographical analyses.
Contains comprehensive dumps of Wikipedia content, including edit history and activity data. This source is very useful for projects involving natural language processing or historical data analysis.
Known for its economic and financial datasets, Quandl supports projects aiming at predicting economic indicators or stock prices. This repository is a goldmine for those interested in finance and economics.
Offers datasets across various categories, including artificial intelligence, cloud computing, and more. This portal is backed by one of the largest technical professional organizations globally.
Bonus: use AI to create a personalized dataset suitable for your needs (tutorial 2024)
I created a step-by-step tutorial on how to use the free version of ChatGPT to create a dataset with the variables you choose and the number of entries you want. You can read it here and start creating your first dataset with the help of AI today: link to the tutorial.
Conclusion
The journey to find the right dataset for your data analytics project is an integral part of your growth in the field. By leveraging the resources and strategies outlined in this guide, you’re on your way to developing a fantastic portfolio that showcases your analytical skills. Each dataset you work with brings you one step closer to mastering the art of data science. Remember, the dataset you choose can significantly impact the insights you derive and the conclusions you draw, making this process a critical element of your project’s success.