Warsaw Weather Analysis

In this weather data analysis project, I analyzed daily meteorological data from Warsaw, Poland, covering 1993 to 2022.

I used Python for the initial data preprocessing, Exploratory Data Analysis (EDA), and statistical analysis, and I built a machine learning model to test whether average temperatures have increased over the years.

I then used Microsoft Excel to create a dynamic, interactive dashboard, to show that even common spreadsheet software can deliver very good results in terms of data visualization and user interface.

This project blends data science techniques with visualization tools, applied to a topic of personal relevance: Warsaw is the city where I currently live.

The objective of this project is to analyze daily weather data from Warsaw, Poland, to identify trends and patterns. The dataset includes daily weather readings from 1993 to 2022, including the average, maximum, and minimum temperatures (TAVG, TMAX, TMIN), precipitation (PRCP), and snow depth (SNWD).
The data is publicly available on Kaggle (Here).

Contents

Exploratory Data Analysis, Machine Learning and Visualization (Python)

I will use the following Python libraries:

  • pandas: A data manipulation and analysis library. I used it for loading the dataset into a DataFrame, a two-dimensional tabular data structure with labeled axes. It also provides useful functions for preprocessing data, such as converting data types, handling missing values and grouping data.
  • NumPy: A library for numerical computations. I used it for calculating the absolute values of the Z-scores in the outlier detection part of the EDA.
  • scikit-learn: A machine learning library. In this project, I used it for creating and evaluating the linear regression model. Specifically, LinearRegression, train_test_split, mean_squared_error, and r2_score classes and functions.
  • matplotlib: A data visualization library. I used it for creating plots to visualize the data, such as the scatter plot of actual vs predicted average temperatures.
  • seaborn: A data visualization library based on matplotlib. I used it for creating more complex plots, such as the box-plots in the EDA, the line-plot in the seasonality analysis, and the heat map of monthly average temperatures.
  • scipy: A library for scientific computing. I used the stats module from this library for calculating the Z-scores in the outlier detection part of the EDA.

1. Importing Libraries and Loading Data

Let’s start by importing the necessary Python libraries and loading the dataset. The data is read into a pandas DataFrame, a two-dimensional tabular data structure with labeled axes, which is a common structure for statistical data.
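As a minimal sketch of this step, the snippet below loads a few inline rows into a DataFrame; the column names (DATE, TAVG, TMAX, TMIN, PRCP, SNWD) follow the dataset description above, while the values and the in-memory CSV are stand-ins for the real Kaggle file:

```python
import io
import pandas as pd

# A few synthetic rows standing in for the Kaggle CSV; with the real
# file you would call pd.read_csv() on its path instead.
csv_text = """DATE,TAVG,TMAX,TMIN,PRCP,SNWD
1993-01-01,-2.1,0.4,-4.5,0.0,30
1993-01-02,-1.5,1.0,-3.9,1.2,28
1993-07-01,19.3,25.1,13.2,0.0,
"""
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)  # (3, 6)
```

Note that the empty SNWD field in the July row is read as NaN, which is exactly the kind of gap handled in the preprocessing step below.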

2. Data Cleaning and Preprocessing

In this step, I preprocessed the data to prepare it for the subsequent analysis.
I converted the ‘DATE’ column to a datetime format and extracted the year, month, and day into separate columns.

I then handled missing data by imputing each missing value with the median of the same month in the same year.

Notice that some missing values remain, but only in the ‘SNWD’ column (snow depth).

The snow-depth data is missing during summer. I addressed these remaining missing values in the ‘SNWD’ column by filling the summer months with 0, since it is reasonable to assume there is no snow in summer.
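The preprocessing steps above can be sketched as follows on a tiny synthetic sample (the month range treated as "summer" here, June to August, is an illustrative assumption):

```python
import numpy as np
import pandas as pd

# Synthetic sample standing in for the Warsaw data.
df = pd.DataFrame({
    "DATE": ["1993-01-01", "1993-01-02", "1993-01-03", "1993-07-01", "1993-07-02"],
    "TAVG": [-2.0, np.nan, -4.0, 19.0, 21.0],
    "SNWD": [30.0, 28.0, 26.0, np.nan, np.nan],
})

# Convert DATE to datetime and extract year, month, and day.
df["DATE"] = pd.to_datetime(df["DATE"])
df["YEAR"] = df["DATE"].dt.year
df["MONTH"] = df["DATE"].dt.month
df["DAY"] = df["DATE"].dt.day

# Impute missing values with the median of the same month in the same year.
for col in ["TAVG", "SNWD"]:
    df[col] = df.groupby(["YEAR", "MONTH"])[col].transform(
        lambda s: s.fillna(s.median()))

# SNWD stays missing in summer (no values to take a median from),
# so fill those months with 0.
summer = df["MONTH"].isin([6, 7, 8])
df.loc[summer, "SNWD"] = df.loc[summer, "SNWD"].fillna(0)

print(df["TAVG"].tolist())  # the January NaN becomes the January median, -3.0
```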

3. Exploratory Data Analysis (EDA)

Next, I will perform an exploratory data analysis to understand the data’s main characteristics.
I will generate descriptive statistics, such as count, mean, standard deviation, minimum, maximum, and quartiles for each variable.

In the next step, I detected and counted the number of outliers using the Z-score method, where an outlier is defined as a data point that is more than three standard deviations away from the mean.
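A minimal sketch of the Z-score method, here applied to a synthetic precipitation series with one obvious outlier:

```python
import numpy as np
from scipy import stats

# Synthetic daily precipitation values; 80.0 is a deliberate outlier.
prcp = np.array([0.0, 1.2, 0.5, 0.0, 2.1, 0.3, 0.0, 1.0, 0.4, 0.8,
                 0.0, 1.5, 0.2, 0.0, 0.9, 1.1, 0.0, 0.6, 1.8, 80.0])

# A point is an outlier if its |Z-score| exceeds 3, i.e. it lies more
# than three standard deviations from the mean.
z = np.abs(stats.zscore(prcp))
n_outliers = int((z > 3).sum())
print(n_outliers)  # 1
```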

Now let’s visualize the distribution of temperatures using box-plots.

From the EDA, it is possible to observe that the variables have different ranges, with the temperature variables spanning a wider range than precipitation and snow depth.
Several outliers are also identified in each variable, likely due to extreme weather events.

4. Machine Learning: Linear Regression Model

I will now use a linear regression model to predict the average temperature based on the year.

First, I grouped the data by year and calculated the mean temperature for each year.
I split the data into a training set and a test set, so the model learns from the training data and its performance is evaluated on unseen test data.
I then trained the model, made predictions on the test set, and evaluated the model’s performance by calculating the Root Mean Squared Error (RMSE) and the R-squared (R2) score.
Finally, I visualized the actual versus predicted average temperatures using a scatter plot.
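These steps can be sketched as follows; the yearly means here are synthetic (a mild upward trend plus random noise) rather than the real Warsaw values:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic yearly mean temperatures: a slight warming trend plus noise.
rng = np.random.default_rng(0)
years = np.arange(1993, 2023)
tavg = 8.0 + 0.05 * (years - 1993) + rng.normal(0, 0.5, years.size)
yearly = pd.DataFrame({"YEAR": years, "TAVG": tavg})

# Split into training and test sets, fit, and predict.
X = yearly[["YEAR"]]
y = yearly["TAVG"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Evaluate with RMSE and the R2 score.
rmse = mean_squared_error(y_test, pred) ** 0.5
r2 = r2_score(y_test, pred)
print(round(rmse, 2), round(r2, 2), model.coef_[0] > 0)
```

The scatter plot of actual vs. predicted values would then be a `plt.scatter(y_test, pred)` on these arrays.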

The linear regression model shows that there is a positive correlation between the year and the average temperature, indicating a trend of increasing temperatures over time.
The model’s performance is reasonable, with an R2 score of about 0.38, suggesting that approximately 38% of the variation in average temperature can be explained by the year.

5. Seasonality Analysis

I will now conduct a seasonality analysis to understand how temperature patterns change throughout the year.

I grouped the data by month and calculated the mean temperature for each month.
I then visualized the monthly average temperatures using a line plot.

I also created a heatmap to visualize the average temperature for each month across the years.
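The heatmap boils down to a month-by-year pivot table passed to seaborn; sketched here with synthetic values (the real grid is 12 months by 30 years):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripted use
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Synthetic monthly averages for two months across two years.
df = pd.DataFrame({
    "YEAR":  [1993, 1993, 1994, 1994],
    "MONTH": [1, 7, 1, 7],
    "TAVG":  [-3.0, 19.0, -1.0, 21.0],
})

# Rows = months, columns = years, cells = average temperature.
pivot = df.pivot_table(index="MONTH", columns="YEAR", values="TAVG")
print(pivot.shape)  # (2, 2)

fig, ax = plt.subplots()
sns.heatmap(pivot, annot=True, cmap="coolwarm", ax=ax)
fig.savefig("monthly_heatmap.png")
```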

The seasonality analysis reveals a clear pattern of temperatures varying with the seasons, with higher temperatures in the summer months (June to September) and lower temperatures in the winter months (December to February).
The heatmap shows how the average temperature for each month has changed over the years, with some indication of rising temperatures over time.

6. Extreme Weather Events Analysis

Now I will analyze extreme weather events, such as extremely hot days, cold days and heavy rainfall days.

I defined a “hot day” as a day when the temperature is above the 90th percentile.
I created a bar plot with a trend line to show how the number of hot days has changed over the years, with some indication of rising temperatures over time.
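The hot-day count per year and the trend line can be sketched like this; the daily maxima are synthetic, and the two-year span is just for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic daily maximum temperatures for two years.
rng = np.random.default_rng(1)
dates = pd.date_range("1993-01-01", "1994-12-31", freq="D")
df = pd.DataFrame({"DATE": dates, "TMAX": rng.normal(13, 9, len(dates))})
df["YEAR"] = df["DATE"].dt.year

# A "hot day" is above the 90th percentile of TMAX.
threshold = df["TMAX"].quantile(0.9)
hot_days = df[df["TMAX"] > threshold].groupby("YEAR").size()
print(int(hot_days.sum()))

# Slope of the yearly counts -- the trend line drawn over the bar plot.
slope, intercept = np.polyfit(hot_days.index, hot_days.values, 1)
```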

In this analysis, “extreme” events are defined as those falling in the top 1% of temperatures and precipitation. I calculated the number of these extreme events for each year and visualized their frequency over time using line plots.
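A sketch of the top-1% count for one variable (precipitation here), again on synthetic data:

```python
import numpy as np
import pandas as pd

# Synthetic daily precipitation for two years.
rng = np.random.default_rng(2)
dates = pd.date_range("1993-01-01", "1994-12-31", freq="D")
df = pd.DataFrame({"DATE": dates, "PRCP": rng.exponential(2.0, len(dates))})
df["YEAR"] = df["DATE"].dt.year

# "Extreme" = above the 99th percentile; count such days per year,
# keeping years with zero events in the result.
p99 = df["PRCP"].quantile(0.99)
extreme = (df[df["PRCP"] > p99]
           .groupby("YEAR").size()
           .reindex(df["YEAR"].unique(), fill_value=0))
print(int(extreme.sum()))
```

The same pattern, applied to TMAX and TMIN, gives the extremely hot and cold day counts.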

The analysis of extreme weather events shows that the number of extremely hot and cold days and heavy rainfall days varies from year to year.
It’s important to note that these extreme events could be related to climate change, but further analysis would be needed to confirm this.


By the end of this project, we have gained a deeper understanding of the weather trends and patterns in Warsaw, Poland, over nearly 30 years. This analysis can serve as a basis for further studies, such as climate change research, or can be used in applications like weather forecasting.

Dashboard Creation (Microsoft Excel)

(Google Drive link down below)

I decided to use Microsoft Excel to create a dashboard to showcase all the potential in the tool, not only for calculation, but also for data visualization.
I focused on creating an interactive dashboard (any day or period from 1993 to 2022 can be selected) that is also visually appealing.

This dashboard was prepared to offer users a clear understanding of Warsaw’s weather patterns over the years.

Disclaimer:
The workbook must be downloaded and opened in Microsoft Excel.

  1. Overview:
    The “Dashboard_WarsawWeather.xlsx” Excel workbook delivers a comprehensive visualization and analysis of Warsaw’s daily weather data, spanning from 1993 to 2022.
    This file is an essential tool for understanding weather patterns, trends, and significant events in Warsaw over the years.
  2. Data Source:
    The weather data integrated into this dashboard originates from a dataset publicly accessible on Kaggle (Link). The dataset encompasses daily metrics such as average, maximum, and minimum temperatures, precipitation levels, and snow depths.
  3. Workbook Sheet Structure:
    For user convenience and easy navigation, the workbook is organized as follows:
    Data: Contains the raw data and the columns created for the project.
    Cards: Dedicated to generating the cards and the Timeline Slicer for the Detailed Time Analysis.
    Seasonality: Crafts the monthly temperature graph to illustrate seasonality.
    Probability: Analyzes precipitation and snow trends in the Seasonality Analysis.
    Timeline: Dedicated to the yearly trend analysis graphs.
    Colours: Displays the color palette employed in the dashboard.
  4. How to Use:
    Dashboard Viewing: Navigate to the “Dashboard” sheet to access the curated visual insights.
    Seasonality Analysis: Data here is categorized by month. The “Probability of Day with Snow per Month” graph represents the likelihood of encountering snow in Warsaw, not necessarily the probability of snowfall on any given day.
    Detailed Time Analysis: Use the Timeline Slicer to pinpoint specific daily/monthly/quarterly/yearly values from 01/01/1993 to 31/12/2022.
    Yearly Trend Analysis: This section groups data by year for a lucid understanding of evolving trends.
    If you wish to inspect or modify the raw data, head to the “Data” sheet. Please be cautious when altering data to maintain dashboard accuracy.
  5. Data Cleaning and Preprocessing:
The data was explored and manipulated with Python.
  6. Notes & Considerations:
    The “warsaw_filled.xlsx” dashboard aims to provide a comprehensive view of Warsaw’s weather patterns, but it’s essential to approach the data with an analytical mindset and consider potential external factors that might affect the readings.
    The workbook is optimized for use in the latest versions of Microsoft Excel. Display or functionality might vary slightly in older versions or different spreadsheet software.
  7. Feedback & Suggestions:
    Your feedback is invaluable. If you have insights, corrections, or any queries regarding the data and its presentation, please refer back to this README or contact the creator.
  8. Credits & Acknowledgments:
    A big thank you to the original data providers on Kaggle for making this dataset available to the public.