Time Series Forecasting on COVID 19 using ARIMA

Time Series Basics

Time series data is data observed at successive points in time, usually evenly spaced (for example, once a day). Daily airline ticket sales are a time series. However, a time element alone does not make a set of events a time series: the dates of major airline disasters are randomly spaced and therefore do not form one. Random processes of that kind are known as point processes.
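The difference between an evenly spaced series and a point process can be checked mechanically. The sketch below is only an illustration: the helper name is_evenly_spaced and the sample dates are made up for this example and do not come from any dataset.

```python
from datetime import date, timedelta


def is_evenly_spaced(timestamps):
    """Return True when every gap between consecutive timestamps is identical."""
    gaps = {later - earlier for earlier, later in zip(timestamps, timestamps[1:])}
    return len(gaps) <= 1


# Daily observations: one value per day, so every gap is exactly one day.
daily = [date(2021, 1, 1) + timedelta(days=i) for i in range(7)]

# Irregular event dates (illustrative values only): the gaps differ, as in a point process.
events = [date(2021, 1, 1), date(2021, 1, 4), date(2021, 2, 20), date(2021, 2, 21)]

print(is_evenly_spaced(daily))   # True
print(is_evenly_spaced(events))  # False
```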

Time Series have several key features such as trend, seasonality, and noise.

Variation

One of the most important features of a time series is variation. Variations are patterns in the time series data. A time series whose patterns repeat over known, fixed periods is said to have seasonality, a general term for variation that repeats periodically. Variation is usually grouped into four categories: seasonal, cyclic, trend, and irregular fluctuations.

Seasonal variation is usually defined as variation with an annual period, such as sweater sales being higher in winter and lower in summer. Cyclic variation occurs at other fixed periods, such as the daily variation in temperature. Both seasonal and cyclic variation are examples of seasonality in a time series data set.

Trends are long-term changes in the mean level, relative to the number of observations. Irregular fluctuations are what remains once trend and periodic variation have been accounted for.
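One simple way to separate a trend from a repeating pattern is a centered moving average taken over exactly one period. The sketch below uses a synthetic series and an assumed weekly period of 7; neither comes from the COVID dataset, and the helper name centered_moving_average is chosen only for this example.

```python
import math

PERIOD = 7  # assumed: daily data with a weekly pattern

# Synthetic series: linear trend plus a repeating weekly pattern.
series = [0.5 * t + 3.0 * math.sin(2 * math.pi * t / PERIOD) for t in range(8 * PERIOD)]


def centered_moving_average(values, window):
    """Average each point with its neighbours over an odd-length window."""
    half = window // 2
    return [
        sum(values[i - half : i + half + 1]) / window
        for i in range(half, len(values) - half)
    ]


half = PERIOD // 2
# trend[k] estimates the mean level at time k + half.
trend = centered_moving_average(series, PERIOD)
# Subtracting the trend leaves the repeating (seasonal) part.
seasonal = [series[k + half] - trend[k] for k in range(len(trend))]

print([round(v, 2) for v in trend[:3]])     # slowly rising level
print([round(v, 2) for v in seasonal[:7]])  # one full weekly pattern
```

Averaging over a full period cancels the repeating pattern, so what is left is the long-term change in the mean level.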

Steps for ARIMA implementation

The general steps to implement an ARIMA model are:

Load the data: The first step in building the model is to load the dataset.
Preprocessing: The preprocessing steps depend on the dataset. They include creating timestamps, converting the dtype of the date/time column, and making the series univariate.
Make the series stationary: To satisfy the model's stationarity assumption, check the stationarity of the series and apply the required transformations.
Determine the d value: The number of times the differencing operation was applied to make the series stationary is taken as the d value.
Determine the p and q values: Read the values of p and q from the auto_arima model.
Fit the ARIMA model: Using the processed data and the parameter values from the previous steps, fit the ARIMA model.
Predict values on the validation set: Predict the future values.
Calculate MAPE: To check the performance of the model, compute the MAPE from the predictions and the actual values on the validation set.
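The differencing step and its link to the d value can be sketched in a few lines. The series below is synthetic and built so that two rounds of differencing flatten it; in practice a stationarity test, rather than inspection, would decide when to stop. The helper name difference is chosen only for this example.

```python
def difference(values):
    """First-order differencing: y'[t] = y[t] - y[t-1]."""
    return [b - a for a, b in zip(values, values[1:])]


# Synthetic series with a quadratic trend: t**2 for t = 0..9.
series = [t * t for t in range(10)]

once = difference(series)  # still rising: a linear trend remains
twice = difference(once)   # constant: the trend is gone

print(once)   # [1, 3, 5, 7, 9, 11, 13, 15, 17]
print(twice)  # [2, 2, 2, 2, 2, 2, 2, 2]
```

Because two differencing passes were needed here, d would be 2 for this toy series.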

Let’s explore the state-wise COVID-19 case data for India. You can access the data here:

https://api.covid19india.org/csv/latest/state_wise_daily.csv

https://www.kaggle.com/sudalairajkumar/covid19-in-india

Import Packages

Processing the Data

Pandas makes this easy. Let’s quickly check the head of the data (the first 5 rows) to see what default format it comes in:

The data contains daily state-wise counts of confirmed, recovered and death cases.

Checking the null values in the dataset:

Converting the “Date_YMD” column into a proper timestamp format with the help of pandas:

Setting “Date_YMD” as the index, so that our forecasting analysis can interpret these values as dates.
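The preprocessing steps described so far might look roughly like the sketch below. To keep it self-contained, it reads a tiny inline stand-in for the downloaded CSV; the rows and counts are illustrative, not real data, and only the Date_YMD and KA column names are taken from the text above.

```python
import io

import pandas as pd

# Tiny stand-in for the downloaded CSV; the rows are illustrative, not real counts.
csv_text = """Date_YMD,Status,KA
2021-01-01,Confirmed,10
2021-01-02,Confirmed,12
2021-01-03,Confirmed,
2021-01-04,Confirmed,15
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())          # first rows, in the default format
print(df.isnull().sum())  # number of null values per column

# Parse the text dates into timestamps and use them as the index.
df["Date_YMD"] = pd.to_datetime(df["Date_YMD"])
df1 = df.set_index("Date_YMD")
print(df1.index)
```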

We are predicting on a daily basis, so we resample the data to daily frequency and take the mean, which leaves the original daily values unchanged. If we were predicting weekly, monthly or yearly, the resampled mean values would differ.

For example, resampling on a weekly, monthly and yearly basis (in pandas 2.2 and later, the aliases 'M' and 'Y' are deprecated in favour of 'ME' and 'YE'):

y = df1['KA'].resample('W').mean()

y = df1['KA'].resample('M').mean()

y = df1['KA'].resample('Y').mean()
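The effect of the resampling frequency can be seen on a small example. The sketch below uses four weeks of hypothetical daily values in a KA column (the numbers are placeholders, not case counts) and compares daily and weekly means.

```python
import pandas as pd

# Four weeks of hypothetical daily values; 2021-01-04 is a Monday.
index = pd.date_range("2021-01-04", periods=28, freq="D")
df1 = pd.DataFrame({"KA": range(28)}, index=index)

daily = df1["KA"].resample("D").mean()   # already one row per day, so values are unchanged
weekly = df1["KA"].resample("W").mean()  # one row per week, labelled by the Sunday that ends it

print(len(daily), len(weekly))  # 28 4
print(weekly.tolist())          # [3.0, 10.0, 17.0, 24.0]
```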

Split the data into train and test sets, build the model on the training set, and evaluate its forecasts against the test set.
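For a time series, the split should keep chronological order rather than shuffle the rows. The sketch below assumes an 80/20 split on a hypothetical daily series; the ratio is an assumption for illustration and is not stated in this article.

```python
import pandas as pd

# Hypothetical daily series standing in for one state's counts.
index = pd.date_range("2021-01-01", periods=100, freq="D")
y = pd.Series(range(100), index=index, name="KA")

# Keep time order: the earliest 80% trains the model, the most recent 20% is held out.
split_point = int(len(y) * 0.8)
train, test = y.iloc[:split_point], y.iloc[split_point:]

print(train.index.max(), test.index.min())
```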

Plotly is a collaborative, web-based graphing and analytics platform. It allows users to import, copy and paste or stream data to be analyzed and visualized. For analysis and styling graphs, Plotly offers a Python sandbox (NumPy supported), datagrid, and GUI. Python scripts can be saved, shared, and collaboratively edited in Plotly.

Analysis

Fig: Statewise cases with detailed information

Analysis of a Trend

Fig: Trend Analysis

Analysis of Top 10 highest confirmed cases

Analysis of Top 10 highest cured cases

Analysis of Top 10 highest death cases

Analysis of death, confirmed and recovered cases in percentage for the top 10 states

Fig: Top 10 states with highest death cases (in percentage)

Fig: Top 10 states with highest confirmed cases (in percentage)

Fig: Top 10 states with highest recovered cases (in percentage)

Analysis of vaccinated males

Analysis of vaccinated females

Analysis of total vaccinations administered to males and females

Analysis of Covaxin vs Covishield

Auto ARIMA

In the basic ARIMA model, we must supply the p, d, and q values ourselves. We obtain them with statistical techniques: differencing the series to remove non-stationarity, and plotting the ACF and PACF graphs. In Auto ARIMA, the model itself searches for the p, d, and q values best suited to the data set, to provide better forecasting.

Once the whole dataset is fitted with the auto_arima model, it reports the best-fit values of p, d and q for our data.

The best model selected is ARIMA (5, 1, 2), that is, (p, d, q) = (5, 1, 2).

The last three zeros (0, 0, 0) are the seasonal order; they indicate that no seasonal component was fitted to our data.

Fig: Summary of the model

Now we fit the ARIMA model to the whole dataset with the best-fit values of p, d and q, to check how accurately the model reproduces the training data.

Fig: Graphical representation Actual vs Prediction

Fig: comparing actual cases vs prediction

Checking Accuracy and Error

Mean Absolute Percentage Error (MAPE)

MAPE has an advantage over MAE or RMSE: it is unit-free, so it is safe to use for comparing forecast performance across time series with different units.
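MAPE is the mean of the absolute errors expressed as a fraction of the actual values, multiplied by 100. A minimal sketch follows; the function name mape and the sample numbers are illustrative only. Note that MAPE is undefined whenever an actual value is zero, which matters for daily counts that can be zero.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Illustrative numbers: errors of 10%, 10% and 0%.
actual = [100, 200, 400]
predicted = [110, 180, 400]

print(round(mape(actual, predicted), 2))  # 6.67
```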

Using the ARIMA model, we achieved an accuracy of 91%, which suggests the model is well suited to this time series prediction task.

Prediction

Again, we fit the data to the ARIMA model and train it.

Predicting the confirmed cases for the next two weeks from the current date using the trained model above:

Following the steps above, we were able to train and test our model for confirmed cases, death cases, cured cases, gender-wise vaccinations, and brand-wise vaccinations (Covaxin and Covishield).

Let’s reorganize this set of predictions by creating a dataframe that contains our future forecast.

Now that we have evaluated our model and are satisfied with its performance, the next step is to refit it for all the states and then forecast into the real future.
