Dengue — Machine learning prediction models using time-series weather data.

Authors : Roumaissa OMARI & Fadi EL CHEIKH TAHA

Project Managers : Max Cohen and Serhat YILDIRIM

Git Repo : https://github.com/HubHetic/DengAI

INTRODUCTION

What is Dengue?

Dengue is a mosquito-borne disease that occurs in tropical and subtropical regions of the world. In mild cases, symptoms are similar to the flu: fever, rash, and muscle and joint pain. In severe cases, dengue can cause severe bleeding, low blood pressure, and even death.

Dengue fever occurs mainly in the intertropical zone. According to current WHO estimates, there may be 50 to 100 million cases worldwide each year.

Mosquito carrying dengue fever

Mosquito contamination cycle

The Aedes aegypti mosquito carries many viruses, including the one that causes dengue fever, a disease that threatens between one third and one half of the world’s 7.5 billion people.

Because mosquitoes carry it, the transmission dynamics of dengue are related to climate variables such as temperature and precipitation. Although the relationship to climate is complex, a growing number of scientists argue that climate change is likely to produce distributional shifts that will have significant public health implications worldwide.

In recent years, dengue fever has been spreading. Historically, the disease has been most prevalent in Southeast Asia and the Pacific islands. These days, many of the nearly half a billion cases per year occur in Latin America.

GETTING THE DATA

The data was provided by the DrivenData competition as CSV files. It consists of environmental data collected by various U.S. Federal Government agencies, from the Centers for Disease Control and Prevention to the National Oceanic and Atmospheric Administration in the U.S. Department of Commerce.

Our goal was to create a machine learning model able to accurately predict the number of weekly Dengue cases at two locations: San Juan in Puerto Rico, and Iquitos in Peru.

San Juan Dengue Fever Total cases per week


  • We can see that the time series has seasonality. Seasonality refers to a periodic pattern within years related to the calendar day, month, quarter, etc…
  • We can see that the time series does not appear to have a trend. There is no long-run upward or downward direction in the series.
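One simple way to check the seasonality observation numerically is to look at the autocorrelation of the series at a lag of about 52 weeks. The sketch below is only an illustration: it uses a synthetic weekly series rather than the competition data, and the helper name `autocorr` is hypothetical.

```python
import numpy as np

def autocorr(series, lag):
    """Pearson correlation between the series and itself shifted by `lag` steps."""
    x = np.asarray(series, dtype=float)
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])

# Synthetic weekly series with a yearly cycle plus noise (illustration only).
rng = np.random.default_rng(0)
weeks = np.arange(52 * 6)
cases = 30 + 20 * np.sin(2 * np.pi * weeks / 52) + rng.normal(0, 3, weeks.size)

yearly = autocorr(cases, 52)     # same point in the yearly cycle
half_year = autocorr(cases, 26)  # opposite point in the yearly cycle
```

For a seasonal series, `yearly` should be strongly positive and `half_year` strongly negative. A series without seasonality would show values close to zero at both lags.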

PREPROCESSING

The data in this project is a time series. Time-series data requires particular care and pre-processing for machine learning. Missing values have to be handled carefully before fitting any predictive model. Here is a view of the variables in our data set and their missing values.

We chose to fill each missing value with the value before it (forward fill). Since we are working with a time series, where consecutive weeks tend to be similar, this seemed the most suitable way to fill the gaps.
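As a rough illustration of filling each gap with the previous value, here is a minimal sketch using pandas. The column names come from the data set, but the values are made up, and this is not necessarily the exact call used in the project.

```python
import numpy as np
import pandas as pd

# Toy weekly frame with gaps (values are made up for illustration).
df = pd.DataFrame(
    {
        "station_avg_temp_c": [26.4, np.nan, 27.1, np.nan, np.nan, 28.0],
        "station_min_temp_c": [20.0, 21.1, np.nan, 22.2, 22.8, np.nan],
    }
)

# Forward fill: each missing value takes the last observed value above it.
filled = df.ffill()

remaining_gaps = int(filled.isna().sum().sum())
```

Note that forward fill cannot fill a gap in the very first row, since there is no earlier value to copy. Such rows need separate handling.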

Now that our data set has no missing values, we can standardize the data so that it can be used with the models we have chosen to predict Dengue cases.

Consider columns as variables. To standardize a column, the mean of the column is subtracted from each value, and the result is divided by the standard deviation of the column. The resulting columns have a standard deviation of 1 and a mean that is very close to zero.
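The column-wise operation described above can be sketched in a few lines of NumPy. The function name `standardize` and the sample values are assumptions made for illustration.

```python
import numpy as np

def standardize(X):
    """Subtract each column's mean and divide by its standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    return (X - mean) / std

# Two made-up columns on very different scales.
X = np.array([[290.0, 14.0], [295.0, 16.0], [300.0, 18.0], [305.0, 20.0]])
Z = standardize(X)

column_means = Z.mean(axis=0)  # very close to 0 for every column
column_stds = Z.std(axis=0)    # 1 for every column
```

scikit-learn's `StandardScaler` performs the same computation and also stores the mean and scale, which is convenient when the same transformation has to be applied to new data.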

Distribution of Data before and after Standardization


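After standardization, all features (columns) share a comparable scale, and in our plots their distributions look roughly normal. Standardization only shifts and rescales each column; it does not change the shape of its distribution.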

We then select the variables most strongly correlated with our target variable.

Correlation between our target variable and the others


  • We can see that reanalysis_specific_humidity_g_per_kg and reanalysis_dew_point_temp_k are strongly correlated with total cases. These variables are related to the humidity of the climate.
  • Average and minimum temperature are also strongly correlated with the target variable.

The best features to keep are therefore: reanalysis_specific_humidity_g_per_kg, reanalysis_dew_point_temp_k, station_avg_temp_c, and station_min_temp_c.
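A correlation-based ranking like the one described above could be sketched as follows with pandas. The data here is synthetic, the column `noise_feature` is hypothetical, and keeping the top two features is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Synthetic data: cases depend on humidity, temperature follows humidity loosely.
rng = np.random.default_rng(42)
n = 200
humidity = rng.normal(17.0, 1.5, n)
df = pd.DataFrame(
    {
        "reanalysis_specific_humidity_g_per_kg": humidity,
        "station_avg_temp_c": 27 + 0.5 * (humidity - 17) + rng.normal(0, 1.0, n),
        "noise_feature": rng.normal(0, 1, n),  # hypothetical, unrelated column
        "total_cases": 10 * humidity + rng.normal(0, 5, n),
    }
)

# Correlation of every feature with the target, strongest first.
corr = df.corr()["total_cases"].drop("total_cases")
ranked = corr.abs().sort_values(ascending=False)
top_features = ranked.head(2).index.tolist()
```

Ranking by absolute value keeps strongly negative correlations as well, since those features are just as informative as strongly positive ones.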

BUILDING RANDOM FOREST MODEL

After all the data preparation work, we used scikit-learn to build our Random Forest model.

We import the random forest regression model from scikit-learn. X represents the normalized features, and y is the target we want to predict: the total cases.

We instantiate the model and fit (scikit-learn’s name for training) the model on the training data.
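The import, instantiate, and fit steps might look like the sketch below. The data is synthetic, and the hyperparameters (`n_estimators=100`, `random_state=0`) are assumptions, not the values used in the project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the four standardized features and weekly case counts.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.maximum(0, 20 + 8 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 2, 300))

# Instantiate the model, then fit it on the training data.
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

predictions = model.predict(X)
train_r2 = model.score(X, y)  # R^2 on the same data the model was trained on
```

For a regressor, `score` returns the coefficient of determination R². Computed on the training data it tends to be optimistic, so a held-out set of later weeks gives a more honest estimate.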

This graph is a visualization of the predicted and actual cases.

Predictions vs Actual total cases

That looks pretty good! Our model has learned to predict the total cases for each week on the training set with 93% accuracy. Because this figure is measured on the training data, it is likely to be optimistic for unseen weeks.

BUILDING RNN LSTM MODEL

We used the Keras and TensorFlow libraries to construct our model.

Long Short-Term Memory (LSTM) networks are an extension of recurrent neural networks (RNNs) that extends their memory. They are therefore well suited to learning from important events that are separated by long time lags.

The units of an LSTM are used as building blocks for the layers of an RNN, which is then often called an LSTM network.

LSTMs allow RNNs to remember their inputs over a long period. This is because LSTMs hold their information in a memory, much like a computer’s memory: the LSTM can read, write, and delete data from its memory.
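To make the read, write, and delete picture concrete, here is a minimal NumPy sketch of a single LSTM time step using the standard gate equations. The weight shapes, the gate ordering inside the stacked matrices, and the random values are assumptions chosen for illustration; Keras manages all of this internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    f = sigmoid(z[0:H])          # forget gate: how much old memory to delete
    i = sigmoid(z[H:2 * H])      # input gate: how much new information to write
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much memory to read out
    g = np.tanh(z[3 * H:4 * H])  # candidate values to write
    c = f * c_prev + i * g       # updated cell state (the memory)
    h = o * np.tanh(c)           # hidden state passed on to the next step
    return h, c

rng = np.random.default_rng(0)
D, H = 4, 3  # input features and hidden units (arbitrary sizes)
W = rng.normal(0, 0.1, (4 * H, D))
U = rng.normal(0, 0.1, (4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):  # run five time steps
    h, c = lstm_step(x, h, c, W, U, b)
```

The cell state `c` is the memory: the forget gate scales down what was stored, the input gate controls what is added, and the output gate controls how much of the memory is exposed through `h`.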

The base model we started with was as follows:

Then we experimented with adding a dropout layer of varying proportions, increasing the model’s performance significantly.
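The original model listing is not reproduced here. As a rough idea of what an LSTM model with a dropout layer can look like in Keras, here is a minimal sketch. The layer sizes, the dropout proportion, the input window, and the optimizer are all assumptions, not the project's actual settings.

```python
import numpy as np
from tensorflow import keras

n_steps, n_features = 1, 4  # assumed input window length and feature count

model = keras.Sequential(
    [
        keras.Input(shape=(n_steps, n_features)),
        keras.layers.LSTM(50),      # assumed number of units
        keras.layers.Dropout(0.2),  # assumed dropout proportion
        keras.layers.Dense(1),      # one output: the weekly case count
    ]
)
model.compile(loss="mae", optimizer="adam")

# Forward pass on a dummy batch of two samples to check the output shape.
dummy_output = model(np.zeros((2, n_steps, n_features), dtype="float32"))
```

Dropout randomly zeroes a fraction of the LSTM outputs during training only, which tends to reduce overfitting on small data sets like this one.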

This led to the final function used to fit the LSTM model on the dataset:

This function takes a data set, first converts the time series into a supervised learning problem, then normalizes the values and fits our three-layer RNN-LSTM model on them.
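The conversion and normalization steps can be sketched as below. The helper names `series_to_supervised` and `min_max_scale`, the window length, the min-max scaling choice, and the assumption that the target sits in the last column are all hypothetical. For simplicity, this sketch scales before building the windows.

```python
import numpy as np

def min_max_scale(values):
    """Rescale each column to the range [0, 1] (assumes no constant columns)."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(axis=0), values.max(axis=0)
    return (values - lo) / (hi - lo)

def series_to_supervised(values, n_in=1):
    """Pair each row with the `n_in` rows that precede it.

    Returns X with shape (samples, n_in, columns) and y holding the last
    column of the current row, assumed here to be the target.
    """
    values = np.asarray(values, dtype=float)
    X = np.stack([values[i - n_in:i] for i in range(n_in, len(values))])
    y = values[n_in:, -1]
    return X, y

# Toy series: one feature column and one target column, ten weeks long.
data = np.column_stack([np.arange(10.0), np.arange(10.0) * 2])
scaled = min_max_scale(data)
X, y = series_to_supervised(scaled, n_in=3)
```

Each sample in `X` is a window of the three previous weeks (including the past values of the target), and the matching entry in `y` is the target for the week that follows. The resulting 3D shape is the input layout that Keras LSTM layers expect.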

This model uses the mean absolute error to evaluate its performance.
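For reference, the mean absolute error is the average of the absolute differences between actual and predicted values. A minimal sketch with made-up numbers:

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

actual = [4, 5, 4, 3, 6]     # made-up weekly case counts
predicted = [5, 5, 2, 3, 7]  # made-up predictions
mae = mean_absolute_error(actual, predicted)  # (1 + 0 + 2 + 0 + 1) / 5
```

The error is expressed in the same unit as the values it is computed on, so an error measured on normalized values is not directly comparable to one measured in cases per week.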

MAKING PREDICTIONS

Here’s a chart showing the number of predicted vs. actual cases:

Predictions vs Actual cases with RNN model

The mean absolute error of 0.014 (the average absolute difference between actual and predicted Dengue cases, presumably measured on the normalized values, since the fitting function normalizes the data) shows there is room for improvement. Still, the model accurately predicts the massive spike of over 150 cases around week 90.

Information like this can be extremely valuable to decision-makers in a region.

CONCLUSION

With these graphs, we have completed an entire end-to-end machine learning prediction project! To improve our model, we could try different hyperparameters (settings), test other algorithms, or, best of all, gather more data!

We hope everyone who made it this far enjoyed the read. We want to thank everyone who made this project possible: our school HETIC and its General Manager Frédéric Sitterlé, who allowed Deepnet to offer us their project. A huge thanks to Max Cohen and Serhat Yildirim. They have been excellent mentors throughout this project, showed us how accessible machine learning is, and made us more passionate about the potential of data science to make the world a better place!
