Dengue — Machine learning prediction models using time-series weather data.

Authors : Roumaissa OMARI & Fadi EL CHEIKH TAHA

Project Managers : Max Cohen and Serhat YILDIRIM

Git Repo : https://github.com/HubHetic/DengAI

INTRODUCTION

Dengue is a mosquito-borne disease that occurs in tropical and subtropical regions of the world. In mild cases, symptoms are similar to the flu: fever, rash, and muscle and joint pain. In severe cases, dengue can cause severe bleeding, low blood pressure, and even death.

Dengue fever occurs mainly throughout the intertropical zone. According to current World Health Organization (WHO) estimates, there may be 50 to 100 million cases worldwide each year.

Mosquito carrying dengue fever

Mosquito contamination cycle

The Aedes aegypti mosquito carries many viruses, including dengue, a disease that threatens between one third and one half of the world’s 7.5 billion people.

Because mosquitoes carry it, the transmission dynamics of dengue are related to climate variables such as temperature and precipitation. Although the relationship to climate is complex, a growing number of scientists argue that climate change is likely to produce distributional shifts that will have significant public health implications worldwide.

In recent years dengue fever has been spreading. Historically, the disease has been most prevalent in Southeast Asia and the Pacific islands. These days many of the nearly half-billion cases per year are occurring in Latin America.

GETTING THE DATA

  • Environmental data collected by various U.S. Federal Government agencies, from the Centers for Disease Control and Prevention to the National Oceanic and Atmospheric Administration in the U.S. Department of Commerce.

Our goal

Visualise the weekly reported cases in San Juan

San Juan Dengue Fever Total cases per week


  • We can see that the time series has seasonality. Seasonality refers to a pattern that repeats within each year and is tied to the calendar (day, month, quarter, and so on).
  • We can see that the time series does not appear to have a trend: there is no long-run upward or downward direction in the series.
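Both observations can also be checked numerically. The sketch below is only an illustration on a synthetic weekly series (not the San Juan data): it averages cases per calendar month to expose the seasonal profile and fits a straight line to look for a trend. The series, its amplitude, and the noise level are all assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic weekly series (NOT the San Juan data): a yearly cycle plus noise.
rng = np.random.default_rng(1)
weeks = pd.date_range("1990-04-30", periods=52 * 6, freq="W-MON")
yearly_cycle = 25 * np.sin(2 * np.pi * weeks.dayofyear.to_numpy() / 365.25)
cases = pd.Series(
    np.clip(30 + yearly_cycle + rng.normal(0, 5, len(weeks)), 0, None),
    index=weeks,
    name="total_cases",
)

# Seasonality: average cases per calendar month across all years.
# A large gap between the highest and lowest month points to a seasonal pattern.
monthly_profile = cases.groupby(cases.index.month).mean()

# Trend: slope of a straight line fitted through the whole series,
# in cases per week. A value near zero suggests no long-run direction.
slope = np.polyfit(np.arange(len(cases)), cases.to_numpy(), 1)[0]

print(monthly_profile.round(1))
print(f"fitted slope: {slope:.3f} cases per week")
```

On real data, the same two lines (the `groupby` on the month and the `polyfit` slope) would be applied to the weekly `total_cases` column indexed by date.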

PREPROCESSING

We chose to fill each missing value with the value that precedes it (forward fill). Since we are working with time series, where consecutive weeks tend to be similar, this seemed the most suitable way to fill the gaps.
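As a minimal sketch of this step, assuming the data sits in a pandas DataFrame indexed by week, forward filling is a single call to `ffill`. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical weekly weather readings with gaps (values are made up).
weather = pd.DataFrame(
    {
        "station_avg_temp_c": [26.4, np.nan, 27.1, np.nan, np.nan, 28.0],
        "station_min_temp_c": [np.nan, 21.7, 22.2, 22.8, np.nan, 23.3],
    },
    index=pd.date_range("1990-04-30", periods=6, freq="W-MON"),
)

# Forward fill: each missing value takes the last observed value before it.
filled = weather.ffill()

# A gap at the very start has no earlier value to copy, so it stays NaN.
remaining_gaps = int(filled.isna().sum().sum())

print(filled)
print("remaining missing values:", remaining_gaps)
```

Note the one case forward fill cannot handle: a missing value in the first row has nothing before it and needs separate treatment (for example a backward fill).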

Since our dataset no longer has missing values, we can move on to standardizing the data so that it can be used with the models we chose to predict dengue cases.

Consider columns as variables. When a column is standardized, the mean of the column is subtracted from each value, and the result is divided by the standard deviation of the column. The resulting columns have a standard deviation of 1 and a mean that is very close to zero.
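The computation described above can be sketched with scikit-learn's `StandardScaler`, here compared against the same formula written by hand. The small feature matrix is invented for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature matrix: rows are weeks, columns are weather variables.
X = np.array(
    [
        [295.2, 25.4, 20.0],
        [296.1, 26.7, 22.2],
        [297.3, 27.9, 22.8],
        [296.8, 28.3, 23.3],
    ]
)

# Standardize each column: subtract its mean, divide by its standard deviation.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The same computation by hand (population standard deviation, ddof=0).
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print("column means:", X_scaled.mean(axis=0).round(6))
print("column stds: ", X_scaled.std(axis=0).round(6))
```

Keeping the fitted `scaler` object around is useful: new data should be transformed with `scaler.transform`, reusing the means and standard deviations learned from the training data.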

Distribution of Data before and after Standardization


After standardization the features (columns) share a common scale, and most of them look approximately normally distributed.

We then select the variables that are most strongly correlated with our target variable.

Correlation between our target variable and the others


  • We can see that `reanalysis_specific_humidity_g_per_kg` and `reanalysis_dew_point_temp_k` are strongly correlated with total cases. Both variables describe the humidity of the climate.
  • Average and minimum temperatures are also strongly correlated with the target variable.

Then, the best features to keep are: reanalysis_specific_humidity_g_per_kg, reanalysis_dew_point_te, station_avg_temp_c, and station_min_temp_c.
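A selection like this can be sketched with pandas by ranking features on the absolute value of their correlation with the target. The data below is synthetic, `unrelated_feature` is a hypothetical column added for contrast, and the 0.3 cut-off is an arbitrary assumption rather than the rule used in the project.

```python
import numpy as np
import pandas as pd

# Synthetic data (NOT the real dataset): three features tied to humidity
# and one hypothetical feature that carries no signal.
rng = np.random.default_rng(0)
n_weeks = 200
humidity = rng.normal(16.5, 1.5, n_weeks)
frame = pd.DataFrame(
    {
        "reanalysis_specific_humidity_g_per_kg": humidity,
        "reanalysis_dew_point_temp_k": 295.0
        + 0.9 * (humidity - 16.5)
        + rng.normal(0, 0.3, n_weeks),
        "station_avg_temp_c": 27.0
        + 0.5 * (humidity - 16.5)
        + rng.normal(0, 0.8, n_weeks),
        "unrelated_feature": rng.normal(0, 1, n_weeks),  # hypothetical column
    }
)
frame["total_cases"] = 30 + 8 * (humidity - 16.5) + rng.normal(0, 10, n_weeks)

# Pearson correlation of every feature with the target.
correlations = frame.corr()["total_cases"].drop("total_cases")

# Rank by absolute correlation and keep features above an assumed threshold.
threshold = 0.3
ranked = correlations.abs().sort_values(ascending=False)
selected = ranked[ranked > threshold].index.tolist()

print(ranked.round(2))
print("selected features:", selected)
```

Ranking on the absolute value matters because a strongly negative correlation is just as informative for a model as a strongly positive one.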

BUILDING RANDOM FOREST MODEL

We import the random forest regression model from scikit-learn. X represents the normalized features, and y is the target we want to predict: the total cases.

We instantiate the model and fit (scikit-learn’s name for training) the model on the training data.
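These two steps might look like the sketch below. It uses random stand-in data in place of the four standardized features and the weekly case counts, and the hyperparameters (`n_estimators=100`, `random_state=42`) are assumptions rather than the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Stand-in data: 300 weeks, 4 standardized features, synthetic case counts.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = np.clip(20 + 10 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 3, 300), 0, None)

# Instantiate the model, then fit (train) it on the training data.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X, y)

# Predict on the same data the model was trained on and score the fit.
train_pred = model.predict(X)
train_r2 = r2_score(y, train_pred)

print(f"R^2 on the training set: {train_r2:.2f}")
```

A score computed this way only describes the fit on the training data. Holding out the most recent weeks and scoring on them would give a more honest picture of how the model handles unseen data.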

This graph is a visualization of the predicted and actual cases.

Predictions vs Actual total cases

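That looks pretty good! Our model has learned to predict the total cases for each week on the training set with 93% accuracy. Because this score is computed on the training data, it is an optimistic estimate of how the model would perform on unseen weeks.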

BUILDING RNN LSTM MODEL

Long Short-Term Memory (LSTM) networks are an extension of recurrent neural networks that extends their memory. They are therefore well suited to learning from important events that are separated by long time lags.

The units of an LSTM are used as building blocks for the layers of an RNN, which is then often called an LSTM network.

LSTMs allow RNNs to remember their inputs over a long period. This is because LSTMs hold their information in a memory that works much like a computer’s memory: the LSTM can read, write, and delete data from it.

The base model we started with was as follows :

Then we experimented with adding a dropout layer with varying dropout rates, which improved the model’s performance significantly.
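For illustration only, an LSTM model with a dropout layer can be sketched in Keras as follows. This is not the project's architecture: the helper names `reshape_for_lstm` and `build_lstm_model`, the number of units, the dropout rate, and the optimizer are all assumptions. The only detail taken from the article is the use of mean absolute error as the loss.

```python
import numpy as np


def reshape_for_lstm(features):
    """Reshape (samples, features) into the 3-D (samples, timesteps, features)
    layout that LSTM layers expect, using a single timestep per sample."""
    features = np.asarray(features, dtype="float32")
    return features.reshape((features.shape[0], 1, features.shape[1]))


def build_lstm_model(n_timesteps, n_features, units=50, dropout_rate=0.2):
    """Hypothetical LSTM -> Dropout -> Dense stack (all sizes are assumptions)."""
    # TensorFlow is imported here so the reshape helper above stays usable
    # on machines where TensorFlow is not installed.
    from tensorflow.keras import Input
    from tensorflow.keras.layers import LSTM, Dense, Dropout
    from tensorflow.keras.models import Sequential

    model = Sequential(
        [
            Input(shape=(n_timesteps, n_features)),
            LSTM(units),            # recurrent layer holding the memory cells
            Dropout(dropout_rate),  # randomly drops activations during training
            Dense(1),               # single output: predicted cases
        ]
    )
    # Mean absolute error as the training loss.
    model.compile(loss="mae", optimizer="adam")
    return model
```

With TensorFlow installed, a model for four features would be created with `build_lstm_model(1, 4)` and trained with `model.fit(reshape_for_lstm(X), y, epochs=..., batch_size=...)`. Trying several values of `dropout_rate` is one way to run the experiment described above.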

This led to the final function to fit the LSTM model on the dataset:

This function takes a dataset, first converts the time series into a supervised learning problem, then normalizes the values and fits them on our three-layer RNN-LSTM model.
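The first two steps, framing the series for supervised learning and normalizing it, can be sketched as below. The helper `series_to_supervised` is a hypothetical implementation of the idea (pairing each week with the weeks before it), and scaling to the range [0, 1] with `MinMaxScaler` is an assumption about which normalization was used.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler


def series_to_supervised(frame, n_lags=1):
    """Pair each row with the n_lags rows that precede it.

    Columns suffixed (t-k) are the inputs, columns suffixed (t) the targets.
    Rows without a complete history are dropped.
    """
    lagged = [
        frame.shift(lag).add_suffix(f"(t-{lag})") for lag in range(n_lags, 0, -1)
    ]
    current = frame.add_suffix("(t)")
    return pd.concat(lagged + [current], axis=1).dropna()


# Made-up weekly case counts.
weekly = pd.DataFrame({"total_cases": [4, 5, 4, 3, 6]})

# Each row now holds the two previous weeks followed by the current week.
supervised = series_to_supervised(weekly, n_lags=2)

# Normalize every column to the range [0, 1] (assumed normalization).
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(supervised)

print(supervised)
print(scaled.round(2))
```

The last column of `scaled` would serve as the target and the remaining columns as inputs, reshaped to three dimensions before being passed to the LSTM.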

This model uses the mean absolute error to evaluate its performance.
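As a small worked example with invented numbers, the mean absolute error is the average of the absolute differences between actual and predicted values. scikit-learn's `mean_absolute_error` and the hand-written formula give the same result.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Invented values for four weeks.
actual = np.array([12, 30, 150, 45])
predicted = np.array([10, 34, 140, 45])

# Absolute errors are 2, 4, 10 and 0, so the mean is 16 / 4.
mae = mean_absolute_error(actual, predicted)
mae_manual = np.mean(np.abs(actual - predicted))

print("MAE:", mae)
```

If the target was scaled to [0, 1] before training, an MAE computed on the scaled values has to be multiplied by the original range (maximum minus minimum) to be expressed in cases.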

MAKING PREDICTIONS

Predictions vs Actual cases with RNN model

The mean absolute error of 0.014 (the average absolute difference between actual and predicted dengue cases) shows there is room for improvement. Still, the model accurately predicts the massive spike of over 150 cases around week 90.

Information like this can be extremely valuable to decision-makers in a region.

CONCLUSION

Finally, we hope everyone who made it this far enjoyed reading. We want to thank everyone who made this project possible, starting with our school HETIC and its General Manager Frédéric Sitterlé, who allowed Deepnet to offer us their project. A big thank you to Max Cohen and Serhat Yildirim: they have been excellent mentors throughout this project, showed us how accessible machine learning can be, and made us more passionate about the potential of data science to make the world a better place!

