Food Image Segmentation with fast.ai
Image segmentation is awesome! There are endless possibilities for application, and one of them is food segmentation. Many machine learning models and architectures are available for this task, such as FastFCN, Mask R-CNN, and U-Net.
I will explore how easily anybody can implement an ML model using the fastai library and obtain decent results on an image segmentation task. I focus on food image segmentation because it has many applications: calorie estimation, diet management, industrial compliance, and consumer satisfaction (e.g. analysing food leftovers).
Dataset quirks
This work is based on lesson 3 of fastai course v3, which applies U-Net to the CamVid dataset; the fast.ai v1 package is used throughout this article. The dataset used is the UNIMIB2016 Food Database, created by the University of Milano-Bicocca, Italy. It is one of the few publicly available, pixel-segmented food datasets, containing 1,027 images of food trays with 73 classes of food and 3,616 labelled food instances. However, one issue encountered was that many labels were highly similar to each other (the various types of pasta, for example).
Another concern was the relatively low number of occurrences of some classes: 17 types of food appeared fewer than 10 times out of the 3,616 food instances in the dataset!
We could clean up the dataset by finding the English translations for each food category, grouping similar labels together, and removing or relabelling those with low occurrences, but that can be the topic of another article. Thankfully, GitHub user binayakpokhrel has provided his list of annotations at https://github.com/binayakpokhrel/datasets as part of his work at Leapfrog Technology.
Loading in Images and Annotations
The annotations are given as polygonal coordinates, so we need to convert them to what fast.ai expects: mask images. We do this using draw.polygon from skimage, which fills the pixels inside the boundary defined by the coordinates with an integer corresponding to the class.
After drawing, the mask can be saved with PIL (or most imaging libraries) in the .png format, which is lossless and therefore preserves the class indices. All masks are saved to a separate folder under the same file names as their corresponding images, which makes it easier for fast.ai to collate the images and masks later on. A rough sketch of this conversion is shown below.
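The sketch below shows one way to do this; the annotation structure, variable names, and helper function are illustrative assumptions rather than the exact UNIMIB2016 format.

```python
import numpy as np
from skimage import draw
from PIL import Image

def polygons_to_mask(polygons, class_ids, height, width, out_path):
    """Rasterise polygon annotations into a single-channel class-index mask.

    `polygons` is assumed to be a list of (rows, cols) coordinate arrays and
    `class_ids` the matching integer labels; both are illustrative, not the
    original UNIMIB2016 annotation format.
    """
    mask = np.zeros((height, width), dtype=np.uint8)        # 0 = background
    for (rows, cols), cls in zip(polygons, class_ids):
        rr, cc = draw.polygon(rows, cols, shape=mask.shape)  # pixels inside the polygon
        mask[rr, cc] = cls                                   # fill with the class index
    Image.fromarray(mask).save(out_path)                     # PNG is lossless, so indices survive
    return mask
```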
We load in the dataset in just 3 lines of code. img_path points to the folder containing all the original photos, while label_path points to the mask folder.
fast.ai automatically overlays the masks onto the corresponding images. Running data.show_batch displays a few images from the dataset, as seen below.
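A minimal version of this loading code, in the style of the fastai v1 CamVid notebook, might look like the following; the paths, the codes file, and the batch and image sizes are assumptions for illustration.

```python
from fastai.vision import *

img_path = Path('data/UNIMIB2016/images')     # original tray photos (assumed layout)
label_path = Path('data/UNIMIB2016/masks')    # mask PNGs with the same file names
codes = np.loadtxt('codes.txt', dtype=str)    # one class name per line (hypothetical file)

get_y_fn = lambda x: label_path/f'{x.stem}.png'

data = (SegmentationItemList.from_folder(img_path)
        .split_by_rand_pct(0.2)
        .label_from_func(get_y_fn, classes=codes)
        .transform(get_transforms(), size=256, tfm_y=True)  # transform masks together with images
        .databunch(bs=8)
        .normalize(imagenet_stats))

data.show_batch(rows=2, figsize=(8, 8))       # masks are overlaid on the photos
```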
Model Training and Results
We use the U-Net architecture, which is provided in the fastai library. The model consists of an ‘encoder’ path on the left and a ‘decoder’ path on the right. At certain points, the outputs of the encoder stages are also passed directly to the decoder stages of matching feature-map size via skip connections (the grey arrows in the U-Net diagram).
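To make the skip connection idea concrete, here is a minimal PyTorch sketch of one decoder stage; it is not fastai's actual implementation, just an illustration of how the encoder output is concatenated with the upsampled decoder features.

```python
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    """One U-Net style decoder stage: upsample, concatenate the encoder
    feature map of matching size (the grey arrow), then convolve."""
    def __init__(self, enc_ch, dec_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(dec_ch, dec_ch, kernel_size=2, stride=2)
        self.conv = nn.Conv2d(enc_ch + dec_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, enc_feat, dec_feat):
        dec_feat = self.up(dec_feat)                 # double the spatial resolution
        x = torch.cat([enc_feat, dec_feat], dim=1)   # skip connection: reuse encoder detail
        return torch.relu(self.conv(x))
```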
In a few lines of code, we can initialise a unet_learner from the fastai library. The metrics are customised to this dataset: no_bg_acc excludes background pixels from the accuracy calculation, since the majority of pixel labels in this dataset are background. This gives a better indication of the food segmentation accuracy. The other parameters follow the CamVid notebook from fastai.
Very quickly, we are able to begin training and validation. We train for 5 epochs with fit_one_cycle (the one-cycle policy) and obtain an accuracy of almost 90%. That’s pretty good!
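The no_bg_acc implementation below is my guess, modelled on the acc_camvid metric from the CamVid notebook; it assumes the class list contains a 'background' entry and reuses the data object from above.

```python
from fastai.vision import *

bg_code = np.where(codes == 'background')[0][0]   # index of the background class (assumed name)

def no_bg_acc(input, target):
    """Pixel accuracy computed only over non-background ground-truth pixels."""
    target = target.squeeze(1)
    mask = target != bg_code
    return (input.argmax(dim=1)[mask] == target[mask]).float().mean()

learn = unet_learner(data, models.resnet34, metrics=no_bg_acc, wd=1e-2)
```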
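A sketch of the training step; the learning rate here is illustrative, and in practice it would be read off the lr_find plot.

```python
learn.lr_find()                    # sweep learning rates to find a sensible range
learn.recorder.plot()              # inspect the loss-vs-lr curve

lr = 1e-3                          # illustrative choice, not the article's exact value
learn.fit_one_cycle(5, slice(lr))  # 5 epochs with the one-cycle policy
```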
Running learn.show_results gives some samples of the predictions. On the left are the ground truth labels, and on the right are the predicted masks. One thing we notice is that some foods are not even labelled in the ground truth images! This is because the modified annotations we used do not include them, probably due to their low number of occurrences.
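For reference, the inspection step is a one-liner (the figure size is arbitrary):

```python
learn.show_results(rows=3, figsize=(10, 12))  # ground truth on the left, predictions on the right
```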
Conclusion
We have seen how easy it is to apply a pre-trained, publicly available ML model to a publicly available dataset. With this food segmentation capability, we could go on to estimate calories per food tray, or even the amount of food left over by looking at images of returned trays! The possibilities are endless.
References
All images used are from my own work, unless otherwise stated.
Future work
- Clean up code and upload notebooks + data onto github
- Obtain more comprehensive ground truth labels by manually consolidating the class labels.