Date : 29/04/2021
During our training at HETIC, our school’s association Hub HETIC proposed in January to our class, in collaboration with the association Deepnet, the development of four extra-curricular big data projects.
One of them was to develop a machine learning algorithm that finds the information sheet of a food product on Open Food Facts from a photo. The idea is then to build applications on top of this algorithm, for example rating the quality of a product from an image, as Yuka does with barcodes.
We worked on it for three months, building a database of reliable images for the model and an architecture that, with a model not retrained on our data, gets results close to 15%.
After showing you the data, how we understood it and the research we did, we will present our solution, Image Embedding, and how we adapted it to our problem.
Now, follow us through this project that fascinated us and made us sweat a lot.
A large and diverse data set
One of the first difficulties one can encounter in a machine learning project is to find usable data for our algorithms.
Open Food Facts is a free, online, crowdsourced database of food products. It gathers data on well over a million food products from around the world.
After downloading the data from the website via its API (20 GB for 1.8 million observations), we analyzed it: each observation has a unique barcode, a product name, a URL to the product image, as well as secondary information such as nutrition facts or ecological impact.
One problem with crowdsourced data is that its quality depends on the users. Not all of them are familiar with data and the problems of exploiting it, so some data may be mislabeled or missing. The images submitted by users are no exception, so two product sheets can look completely different: figure 1 shows an image that we would like to keep out of training and testing, unlike the image in figure 2.
As we need properly framed images for training and cannot automatically check the quality of the images we want to keep, we assumed that the more fields a product page has filled in, the better the quality of the associated image.
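The completeness heuristic can be sketched as follows. This is a minimal illustration and not the cleaning code of the project: the field names, the example records and the `min_filled` threshold are assumptions chosen for the example.

```python
def count_filled_fields(product):
    """Count the fields of a product record that carry a real value."""
    return sum(
        1 for value in product.values()
        if value is not None and str(value).strip() != ""
    )


def keep_well_documented(products, min_filled):
    """Keep only the records with at least `min_filled` filled-in fields."""
    return [p for p in products if count_filled_fields(p) >= min_filled]


# Hypothetical records: the field names are illustrative only.
products = [
    {"code": "0001", "product_name": "Milk", "image_url": "http://example.org/a.jpg",
     "brands": "BrandA", "nutriscore_grade": "b"},
    {"code": "0002", "product_name": "", "image_url": "http://example.org/b.jpg",
     "brands": None, "nutriscore_grade": ""},
]

selected = keep_well_documented(products, min_filled=4)
print([p["code"] for p in selected])
```

The threshold trades quantity for quality: a higher `min_filled` keeps fewer products, but their pages are more complete.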
Quality, image shape for good and bad data
After building a data set of 300,000 observations, we scraped the images of the associated products. However, a visual inspection showed that some products still had poorly framed images. So we decided to focus only on the high-quality contributors of Open Food Facts.
We decided to keep only the observations of one contributor, named “kiliweb”, a former member of the Yuka app. We then obtained 120,000 images for 97,000 distinct products. Link to the kiliweb page: https://fr.openfoodfacts.org/editeur/kiliweb
To test the performance of the different models, we need products that are among the 97,000 products of the dataset, but with images the model has not seen. Thanks to the “product name” field, we retrieved the products with at least 3 different images in the dataset. We ended up with a training dataset of 107k images and a test dataset of 3k images.
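The way the test set is carved out can be sketched like this. It is a simplified illustration under assumptions: the record layout and the `split_train_test` helper are hypothetical, and the real notebook may select the held-out image differently.

```python
from collections import defaultdict


def split_train_test(records, min_images=3):
    """Hold out one image per product name that has at least `min_images` images."""
    by_name = defaultdict(list)
    for record in records:
        by_name[record["product_name"]].append(record)

    train, test = [], []
    for group in by_name.values():
        if len(group) >= min_images:
            test.append(group[0])      # one held-out image for evaluation
            train.extend(group[1:])    # the remaining images stay searchable
        else:
            train.extend(group)
    return train, test


# Hypothetical records: one row per image.
records = [
    {"barcode": "1", "product_name": "cola"},
    {"barcode": "2", "product_name": "cola"},
    {"barcode": "3", "product_name": "cola"},
    {"barcode": "4", "product_name": "milk"},
]

train, test = split_train_test(records)
print(len(train), len(test))
```

Holding out a single image per product guarantees that every test product still has other images in the training set to be matched against.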
If you want to understand in more detail how we sorted our data, the process is available on GitHub at: https://github.com/HubHetic/Open-Food-Fact/blob/main/notebook/cleaning_exploration_data.ipynb
Now that we have our images, we need to extract the product information from its picture. However, a picture means nothing to a computer.
Understanding the scope of the problem
Did you say image recognition?
For humans, image recognition is intuitive and obvious. For computers it is not so easy. A computer image is stored as a cluster of pixels, a grid where each pixel has information about contrast and color.
A computer cannot by itself make connections between the pixels and deduce product-specific information from them. It can see that there is a difference between a Batman action figure and a bottle of milk, but it cannot tell which is which, because nobody taught it to recognize a Batman action figure or a bottle of milk.
In addition, an image in matrix form can weigh more than a megabyte, so comparing images pixel by pixel takes a lot of computing time, and images do not always have the same dimensions, which prevents a direct comparison.
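To give an order of magnitude, here is a small calculation of the raw size of an image once decoded into a matrix. The dimensions are assumptions for illustration: 1000 × 1000 stands for a typical photo, and 224 × 224 is a common input size for ImageNet models.

```python
def raw_size_bytes(width, height, channels=3, bytes_per_value=1):
    """Size of an uncompressed image: one value per pixel and per color channel."""
    return width * height * channels * bytes_per_value


print(raw_size_bytes(1000, 1000))  # 3000000 bytes, about 3 MB
print(raw_size_bytes(224, 224))    # 150528 bytes once resized
```

Resizing every image to the same small shape both reduces the amount of data and makes the images comparable.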
However, there is a deep learning method designed to extract specific information from each image and do machine learning with it: the Convolutional Neural Network.
Convolutional Neural Network
What is it and how is it useful?
A convolutional neural network (ConvNet/CNN) is a deep learning algorithm that applies filters to an input image to create feature maps. These maps summarise the features specific to the product in the image, and the network uses them to classify what it sees. It is divided into two parts: a convolutional part that acts as the filter and a perceptron for classification.
The CNN filter is composed of several steps, such as convolution, which we will detail. The convolution uses a matrix of numbers called a kernel, smaller than the input matrix, and slides it across the input. The numbers in the kernel are configured so that the output matrix highlights one characteristic of the image, as in the following figure, where the “Vertical” kernel picks out the vertical lines of the input matrix.
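The convolution step can be reproduced by hand on a tiny example. This is a sketch in plain Python, written for readability rather than speed. The kernel values and the toy image are assumptions, not the ones from the figure.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products.

    As in most deep learning libraries, the kernel is not flipped
    (strictly speaking this is a cross-correlation).
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            total = 0
            for di in range(k_rows):
                for dj in range(k_cols):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# A kernel that responds to vertical edges (illustrative values).
vertical_kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]

# 4x5 image: bright on the left (1), dark on the right (0).
image = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
]

feature_map = convolve2d(image, vertical_kernel)
print(feature_map)  # [[3, 3, 0], [3, 3, 0]]
```

The output is high where the window straddles the bright-to-dark boundary and zero over the uniform dark area, which is how the kernel "finds" the vertical line.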
If you want to know more about recent computer vision, I advise reading this article https://towardsdatascience.com/learning-computer-vision-41398ad9941f
The CNN has several positive features for our project: it has excellent accuracy, it automatically detects important features without any human supervision, and it is computationally efficient and can run on many different machines.
It also has some shortcomings. It cannot explain how it chooses which features to keep (black box), and it is a supervised learning model: it needs labels to classify the outputs, a part of the problem we will come back to later.
One of the concerns with a CNN is finding the combination of convolution and max-pooling steps that gives the best accuracy. There is no method to guess the best combination in advance, so the scientific community has built and tested many models to find the best results, both in computation time and in performance. We selected three in particular: VGG16, MobileNet and EfficientNet.
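Max-pooling, mentioned above, shrinks a feature map by keeping only the largest value of each small window. A minimal sketch in plain Python, assuming non-overlapping 2 × 2 windows (the toy values are illustrative):

```python
def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping `size` x `size` window.

    Rows and columns that do not fill a complete window are dropped.
    """
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(
                feature_map[i * size + di][j * size + dj]
                for di in range(size)
                for dj in range(size)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 0, 5, 6],
    [1, 2, 7, 8],
]

pooled = max_pool2d(feature_map)
print(pooled)  # [[4, 2], [2, 8]]
```

Each pooling step divides the width and height by two here, which is why the number and placement of these steps has such an impact on both speed and accuracy.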
VGG16, MobileNet, EfficientNet
Pre-trained models are already available on the Internet. We can easily load them with Keras or TensorFlow.
VGG16, or OxfordNet, is a deep convolutional network for large-scale image recognition. The model reached 92.7% top-5 test accuracy on ImageNet in 2014. We started with this model for image retrieval, then turned to faster and less memory-hungry models.
MobileNet was developed by Google and is designed for mobile vision applications. It is a little less accurate than VGG16 but has a lower computational complexity.
EfficientNet is also from Google, but it is newer (2019). It is more accurate than MobileNet for a similar computation time. Together with MobileNet, it is the model we kept throughout the project.
We also used EfficientNetB3, a larger variant of EfficientNet. It is slower but more accurate.
One of the main problems encountered in this project was how to train the CNN model on our data. As previously stated, CNNs need labels to learn to distinguish images, and the only consistent label in our data is the product name. But training a classification model with 97,000 different labels is extremely complicated, if not impossible.
Moreover, our data was entered by hand. A Coca-Cola product can be written 10 different ways, or with specific product details in the title.
We also have an imbalance problem: we have about 4000 product entries for the most represented brands, but only 2000 for the least represented.
Another label idea we considered was based on packaging (box, bottle, cardboard, plastic, shapes), but after analysis, this field is barely filled in on Open Food Facts and is therefore not usable.
The problem seemed unsolvable, so we got around the labelling problem by implementing an unsupervised deep learning algorithm: Image Embedding!
What is it and how is it useful?
Image embedding is based on the fact that objects with similar characteristics are more or less the same thing. This ability is innate in humans and allows us to easily distinguish between different types of objects. Take two cats: you can tell whether they are the same breed just by their fur and eye colour, even if you don’t know the name of the breed. Image embedding applies the same principle.
We therefore developed an architecture implementing the different stages of Image Embedding. We coded an algorithm that passes an image through a CNN model (mobile_net or efficient_net) and retrieves the output of the flatten layer, which is the most compact representation of the image and takes the form of a vector. The models are pre-trained on the ImageNet database.
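The vectorization step can be sketched with Keras as follows, assuming TensorFlow 2 and NumPy are installed. The names `build_encoder` and `embed_batch` are hypothetical and are not the functions of the repository. The sketch also makes one simplifying assumption: it uses global average pooling (`pooling="avg"`) to obtain a vector, instead of retrieving a flatten layer.

```python
import numpy as np
import tensorflow as tf


def build_encoder(weights="imagenet"):
    """MobileNet without its classification head: one vector per image."""
    return tf.keras.applications.MobileNet(
        input_shape=(224, 224, 3),
        include_top=False,   # drop the ImageNet classifier
        weights=weights,     # "imagenet" downloads the pre-trained weights
        pooling="avg",       # assumption: average pooling instead of a flatten layer
    )


def embed_batch(encoder, images):
    """Turn RGB images of shape (N, 224, 224, 3), values 0-255, into vectors."""
    x = tf.keras.applications.mobilenet.preprocess_input(images.astype("float32"))
    return encoder.predict(x, verbose=0)
```

Calling `embed_batch(build_encoder(), images)` on a NumPy array of resized images returns one 1024-dimensional vector per image with this configuration. Swapping in another model from `tf.keras.applications` changes the vector size and the matching `preprocess_input` function.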
We run each image of the train dataset (108,000 images) through this algorithm and store the resulting vectors in a dataset, with a column holding the barcode of each image.
Now that we have the vector database, we use the K-Nearest Neighbors (KNN) algorithm, a machine learning algorithm that computes the distances between vectors. Given a new vector, we can find the closest vector to it, look it up via its index in the database and return the most similar image.
Thanks to the indexes, we find the barcodes of the closest images and return either the characteristics of the product in the image or the image itself.
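The lookup from a query vector to barcodes can be sketched with a brute-force search in plain Python. This is an assumption made for readability and not the KNN implementation used in the project. The toy vectors, the barcodes and the `nearest_barcodes` helper are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_barcodes(query, vectors, barcodes, k=3):
    """Return the barcodes of the k stored vectors closest to `query`."""
    ranked = sorted(range(len(vectors)), key=lambda i: euclidean(query, vectors[i]))
    return [barcodes[i] for i in ranked[:k]]


# Toy 3-dimensional embeddings (real ones have hundreds of dimensions or more).
vectors = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.0, 0.9, 0.4],
    [0.1, 0.8, 0.5],
]
barcodes = ["111", "222", "333", "444"]  # hypothetical barcodes, same order as vectors

print(nearest_barcodes([0.9, 0.1, 0.05], vectors, barcodes, k=2))  # ['111', '222']
```

The position of a vector in the list plays the role of the index: it is the only link needed between a neighbour and its barcode.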
Once the KNN was trained, we wrote a “show_image” function that takes as input the path of an image and a number nb_images, then displays that image along with the nb_images most similar images returned by the Image Embedding algorithm.
If you want to have fun testing our architecture, you can find it on GitHub via the link: https://github.com/HubHetic/Open-Food-Fact
Now that you know everything about our research and the theory behind our work, let us show you some concrete results !
We tested VGG16 at the beginning, but the process was far too slow and the first results were not very conclusive, so we moved to better-performing CNN models.
We mainly worked with MobileNet and EfficientNetB3. MobileNet is expected to run faster, but to be less accurate than EfficientNetB3.
First of all, we will discuss the computation times for the different stages of the model, namely the vectorization of our images with the CNN model to create a dataset, and the training of the KNN on the resulting dataset. Note that we worked on a computer equipped with an Intel(R) Core(TM) i9-10900KF processor and did not use a GPU. We had 108,979 images to vectorize.
Vectorize database :
- MobileNet : execution time : 6011.29s (≃1h40)
- EfficientNetB3 : execution time : 24865.8s (≃7h)
Training the knn :
- MobileNet : 41.6s
- EfficientNetB3 : 89.3s
As you can see, MobileNet is much faster than EfficientNetB3. The computation times probably seem huge, but this is due to the use of a CPU: with a GPU, we could expect them to be 20 to 50 times shorter.
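To put these figures in perspective, the short calculation below converts the measured vectorization times into hours and minutes and applies the 20x to 50x speed-up. That speed-up is an estimate, not a measurement, and `format_duration` is a helper written for this example.

```python
def format_duration(seconds):
    """Format a duration in seconds as 'H h MM min' (seconds are truncated)."""
    total_minutes = int(seconds // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} h {minutes:02d} min"


# Vectorization times measured on CPU, in seconds.
cpu_times = {"MobileNet": 6011.29, "EfficientNetB3": 24865.8}

for name, seconds in cpu_times.items():
    best, worst = seconds / 50, seconds / 20  # assumed 20x to 50x GPU speed-up
    print(f"{name}: {format_duration(seconds)} on CPU, "
          f"roughly {best:.0f}s to {worst:.0f}s on a GPU")
```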
We are going to compare the results for several images from MobileNet and EfficientNet and draw conclusions from them.
In all of our screenshots, the first one is from MobileNet and the second from EfficientNetB3. The first image is the one we ask our model to predict; the 8 others are the predictions of the model, i.e. the ones that most closely resemble our base image.
First, let’s try with fries from McCain.
The results obtained by mobile_net and efficient_net show that the algorithm manages to find products very similar to the image of the product, and it does identify it as a packet of fries. However, the original product is not among the results, and the brand appears only once. It seems that the model favors photos taken with the same framing.
And the best for last, a Coca-Cola bottle:
As said before, both models favor product images with the same framing, which may reveal that the features extracted by the CNN models lack abstraction.
Overall, EfficientNetB3 performs better. Both models gave the same result for the first image returned, but EfficientNetB3 returns many more Coke bottles than MobileNet, which is consistent with what we expected.
We do get some strange results, mainly with MobileNet and sometimes with EfficientNetB3.
One last detail: our models actually work pretty well for visually consistent items with the same colors, such as these white beans. As we didn’t retrain our models, they are made for general object recognition. We can guess that items with identical shape, colors and lighting will work pretty well.
Performance of models
To measure the performance of our models, we took from our dataset the names of items that appear more than 3 times. We took one occurrence of each and created our test dataset. Our goal was to measure accuracy by looking for the same name among the 10 nearest pictures found.
Here are our results: the vertical axis shows the percentage of test images with at least one occurrence of the right name among the N nearest pictures, with N on the horizontal axis.
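The metric plotted here can be computed as in the sketch below. The product names and rankings are made-up toy data, and `hit_rate_at_n` is a hypothetical helper, not a function of the repository.

```python
def hit_rate_at_n(true_names, ranked_names, n):
    """Share of test images whose product name appears among the n nearest results."""
    hits = sum(
        1 for truth, ranked in zip(true_names, ranked_names)
        if truth in ranked[:n]
    )
    return hits / len(true_names)


# One entry per test image: the expected name and the names of its nearest pictures.
true_names = ["cola", "beans", "fries", "milk"]
ranked_names = [
    ["cola", "soda", "water"],
    ["peas", "beans", "corn"],
    ["chips", "crisps", "nuggets"],
    ["juice", "cream", "milk"],
]

for n in (1, 2, 3):
    print(n, hit_rate_at_n(true_names, ranked_names, n))
# 1 0.25
# 2 0.5
# 3 0.75
```

Because the comparison is an exact match on the name, two spellings of the same product count as a miss, which penalizes the score on hand-entered names.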
As you can see, the results are pretty bad. We fell into our own trap: we knew that our data could not be labeled reliably, and yet we used labels to measure our performance. The results seen previously are decent, but here it’s a total mess. I guess we proved our point: using labels is really a bad idea.
Finally, we succeeded in fulfilling the objective given to us: to develop an image recognition architecture that can find a product registered on Open Food Facts from a photo. Using EfficientNetB3 as the image encoder is our best answer to the problem in terms of accuracy. If you want speed, you should use MobileNet.
As mentioned previously, we didn’t and couldn’t train a CNN in the traditional way. However, there is a solution: use an autoencoder. The idea is simple: take a picture as input, encode it with the convolutional part of a CNN, decode it back and try to reproduce the original picture as accurately as possible. No labels are needed, and we can use the first part as our next vectorization solution.
It should solve our new problem!
The main thing we’ll all remember is that it was our first real big data project.
We had to work on something without being sure it would work. Some of our computations needed more than 10 hours, so we had to adapt, change our ways of working, try to go faster, optimize, try different things, and of course fail. All of this while our machines couldn’t really handle the load: only François’s machine was fast enough to run the computations. We had to keep this in mind throughout the project.
One big part of the project was teaching ourselves how to learn. At the beginning we were limited by what we knew, and we had to learn a lot of new things such as neural networks, CNNs and image embedding.
So we started reading scientific papers and looking for code in git repositories. At first we understood nothing, but with time and patience, slowly but surely, we started to connect the dots, to the point of being able to understand this knowledge and apply it to our project. We also made a lot of progress in scraping, data analysis and architecture design.
Creating those new habits is one of the main takeaways of the project.
While gaining experience in this domain, we learnt more about what the job of a data engineer or data scientist is and what they are supposed to do. It even showed us different paths for our careers and made us think about our future.
Finally, we worked with Max and Serhat, who gave us professional feedback on things that seemed basic to us but are crucial in the professional world, such as documenting our code correctly or simply writing proper git commits. All of this makes for better communication between the members of a project.
What next ?
This step is only the beginning of the features and technologies that we would like to bring to this project.
The Triplet loss function, another approach to image recognition that does not require a KNN algorithm, could make our model much better. With enough data preparation and additional research we could explore it.
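For reference, the triplet loss compares an anchor image with a positive (same product) and a negative (different product), and is zero once the negative is farther from the anchor than the positive by at least a margin. A minimal sketch on toy vectors, assuming Euclidean distance and an arbitrary margin of 0.2 (some formulations use squared distances instead):

```python
import math


def euclidean(u, v):
    """Euclidean distance between two vectors of the same length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)


anchor = [0.0, 0.0]

# Well separated: the positive is close, the negative is far, so there is no penalty.
print(triplet_loss(anchor, positive=[0.0, 1.0], negative=[3.0, 4.0]))  # 0.0

# Badly separated: the negative is closer than the positive, so the loss is positive.
print(triplet_loss(anchor, positive=[3.0, 4.0], negative=[0.0, 1.0]))
```

Training a network with this loss requires building such triplets from the data, which is the data preparation mentioned above.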
An autoencoder would allow us to train our model on our own data in an unsupervised way, increase performance and test our model on lower-quality images. The flaw of this approach is the computation time.
We wanted to build an API for our application but, having little experience in the field, it would have taken too much time. The idea is appealing in any case.
These are technologies that we want to implement soon.
We’d also like to thank Max Cohen and Serhat Yildirim, who led and helped us in this project. Working with them was a great opportunity to learn new things and try out our ideas. Thanks again to them for giving us some of their time.
And we would like to extend our thanks to all the people who have supported us during this project directly or indirectly.
GitHub, Image embeddings: https://github.com/rom1504/image_embeddings
GitHub, project: https://github.com/HubHetic/Open-Food-Fact
Romain Beaumont on Medium: https://rom1504.medium.com/image-embeddings-ed1b194d113e