Masks and face coverings are here to stay. This is a guide for training robust AI models without crossing the privacy line.
The Zenus co-founders demonstrating live facial analysis with masks.
Authors: Panos Moutafis, Ph.D. & Rakshak Talwar & Mary Lim
Masks and face coverings have been prevalent in many cultures and work environments for decades. But if you are reading this in the year 2021, we can read your mind — you are thinking about the pandemic! Masks became a must-have accessory in our daily lives due to Covid-19.
There is a lack of datasets with people wearing masks
Analyzing people’s faces has a wide range of applications, from retail stores to corporate campuses and experiential marketing. The question is: how do we train robust AI models without access to vast datasets of people wearing masks? If this keeps you up at night, we have excellent news!
Now, is she smiling or not?
Data Augmentation
Our team addressed the lack of datasets with people wearing masks by using data augmentation techniques. One may enhance existing datasets (or publicly available ones) by overlaying masks on top of people’s faces. Training facial analysis models such as face detection and sex prediction becomes an easier task once you do this.
A researcher at Georgia Tech is managing an open-source project called “Mask The Face.” The source code is available on GitHub and can be used to convert face datasets into masked-face datasets.
Running the software package on your images is a cinch!
cd MaskTheFace
# Generic
python mask_the_face.py --path <path-to-file-or-dir> --mask_type <type-of-mask> --verbose --write_original_image
# Example
python mask_the_face.py --path 'data/office.jpg' --mask_type 'N95' --verbose --write_original_image
Source code and further details can be found on the GitHub project page.
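To cover several mask types in one pass, the command above can be scripted. The sketch below is a minimal illustration and not part of MaskTheFace: `build_mask_commands` and `run_mask_commands` are hypothetical helpers that only reuse the flags shown above. Any mask type name other than 'N95' is an assumption and should be checked against the project page before use.

```python
import subprocess
from typing import List, Sequence


def build_mask_commands(path: str, mask_types: Sequence[str]) -> List[List[str]]:
    """Return one mask_the_face.py invocation per requested mask type.

    Only the flags from the example above are used; nothing is executed here.
    """
    return [
        [
            "python", "mask_the_face.py",
            "--path", path,
            "--mask_type", mask_type,
            "--verbose",
            "--write_original_image",
        ]
        for mask_type in mask_types
    ]


def run_mask_commands(commands: List[List[str]], cwd: str = "MaskTheFace") -> None:
    """Run each command from inside the cloned repository (assumed location)."""
    for command in commands:
        subprocess.run(command, cwd=cwd, check=True)


# Build (but do not run) the command for the example image.
commands = build_mask_commands("data/office.jpg", ["N95"])
```

Calling `run_mask_commands(commands)` would then execute each invocation in turn and stop at the first failure.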
Deviations
Using the software extensively is expected to produce a few inconsistent results. In some cases, it works extremely well and addresses variations in head pose and lighting conditions. In other cases, the algorithm will miss faces or misplace the mask in the presence of strong head pose variation and lighting aberrations. This is due to the performance of the face detector used in the project.
Left image: The face on the bottom right was detected and mask augmentation was applied. Right image: The face in the upper left corner was not detected, and thus, no mask augmentation was performed.
Even though some faces in the training set will not be masked, this is generally okay because the overall dataset should comprise both masked and unmasked faces (see the Training section below). In addition, one may swap in a more robust face detector for better results. All in all, the method is resilient and practical.
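One way to end up with a dataset that contains both masked and unmasked faces is to pick, per image, either the original or its augmented copy. The sketch below is an assumption about how this could be done, not a description of our pipeline: the helper name, the `masked_fraction` parameter, and the file names are all hypothetical. Faces the detector missed simply have no masked copy and stay unmasked.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[str, int, bool]  # (image path, class label, is_masked)


def mix_masked_and_unmasked(
    originals: List[Tuple[str, int]],
    masked_paths: Dict[str, str],
    masked_fraction: float = 0.5,
    seed: int = 0,
) -> List[Sample]:
    """Choose the masked copy of an image with probability `masked_fraction`.

    `masked_paths` maps an original path to its augmented copy and only
    contains entries for faces where mask augmentation succeeded.
    """
    if not 0.0 <= masked_fraction <= 1.0:
        raise ValueError("masked_fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the split is reproducible
    samples: List[Sample] = []
    for path, label in originals:
        masked_path = masked_paths.get(path)
        if masked_path is not None and rng.random() < masked_fraction:
            samples.append((masked_path, label, True))
        else:
            samples.append((path, label, False))
    return samples


# Hypothetical file names, for illustration only.
originals = [("a.jpg", 0), ("b.jpg", 1), ("c.jpg", 0)]
masked_paths = {"a.jpg": "a_N95.jpg", "b.jpg": "b_N95.jpg"}  # c.jpg was missed
samples = mix_masked_and_unmasked(originals, masked_paths, masked_fraction=0.5)
```

Keeping the `is_masked` flag on each sample makes it easy to report accuracy separately for masked and unmasked faces later on.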
Illustration of a mask augmentation applied correctly with a variety of masks.
As you may see in the corresponding sample pictures, the referenced method produces realistic results. One may also choose from a variety of different masks to increase the diversity of face coverings in the dataset. The options include different patterns, colors, and intensity values.
Mimic real-world scenarios with varying colors and types of masks
Training
There are many different ways to train facial analysis models, whether for detection, recognition, sex, age group, or sentiment. For the purposes of this guide, we will focus on sex prediction, assuming the images have already been cropped and aligned by a detection module that is robust to occluded faces.
We trained our own classification task head, which receives feature maps from a battle-tested backbone. The code below is for illustration purposes only and shows what would be a small but important component of a much larger system. Nonetheless, the principles remain the same, and they allowed us to achieve high accuracy across a wide range of tasks.
import torch.nn as nn
from torchvision import models
# Set this to the number of classes your full model will classify for
NUM_CLASSES = 2
# instantiate a model of your choice conveniently from torchvision
# In this example we use ResNet-152
# More can be found here https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html
model_to_finetune = models.resnet152(pretrained=True)
num_features = model_to_finetune.fc.in_features # read the number of high-level features the backbone produces
# this is your final classification layer from which you obtain predictions
# in this toy example we just use a single fully connected layer, but you are free to use a more complex topology
model_to_finetune.fc = nn.Linear(num_features, NUM_CLASSES)
# You can use the model within a comprehensive training script to leverage the pre-trained backbone of your choice
# This is much more efficient than building out and training a full network from scratch
This is sample code for illustration purposes only. It is not what Zenus has, does, or will use.
Discussion
The simple process of data augmentation results in a test accuracy of over 96% for masked faces. The real-world performance is also extremely reliable. We ran our algorithm on a short video to illustrate how well the sex prediction model performs.
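When reporting a number like the one above, it helps to measure masked and unmasked faces separately. The sketch below is a generic illustration with made-up toy records, not our evaluation code or our results; the function name and record layout are assumptions.

```python
from typing import Dict, Iterable, Tuple

Record = Tuple[int, int, bool]  # (true label, predicted label, is_masked)


def accuracy_by_subset(records: Iterable[Record]) -> Dict[str, float]:
    """Compute classification accuracy separately for masked and unmasked faces."""
    correct = {"masked": 0, "unmasked": 0}
    total = {"masked": 0, "unmasked": 0}
    for true_label, predicted_label, is_masked in records:
        key = "masked" if is_masked else "unmasked"
        total[key] += 1
        correct[key] += int(true_label == predicted_label)
    # Skip subsets with no samples to avoid dividing by zero.
    return {key: correct[key] / total[key] for key in total if total[key] > 0}


# Toy records for illustration only; these are not real measurements.
toy_records = [(0, 0, True), (1, 0, True), (1, 1, False)]
subset_accuracy = accuracy_by_subset(toy_records)
```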
Illustration of Zenus’ trained model working with and without face masks.
We have trained our algorithms to show us their predictions and also their confidence in the predictions they make. Because our system captures multiple impressions of the same face during inference, we have configured it to use only high-confidence predictions to further increase its robustness.
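The idea of keeping only high-confidence impressions can be sketched as follows. This is a minimal illustration under stated assumptions, not what Zenus runs: the 0.9 threshold, the confidence-weighted vote, and the function name are all hypothetical choices.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Impression = Tuple[str, str, float]  # (face id, predicted label, confidence)


def aggregate_confident_predictions(
    impressions: Iterable[Impression],
    min_confidence: float = 0.9,  # assumed threshold, tune for your use case
) -> Dict[str, Optional[str]]:
    """Combine several impressions of each face into one prediction.

    Impressions below `min_confidence` are ignored. The remaining ones vote,
    weighted by confidence. A face with no confident impression maps to None,
    meaning no prediction is reported for it.
    """
    votes: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    result: Dict[str, Optional[str]] = {}
    for face_id, label, confidence in impressions:
        result.setdefault(face_id, None)
        if confidence >= min_confidence:
            votes[face_id][label] += confidence
    for face_id, label_scores in votes.items():
        result[face_id] = max(label_scores, key=label_scores.get)
    return result


# Toy impressions for illustration only.
impressions = [
    ("face_a", "female", 0.95),
    ("face_a", "male", 0.55),   # dropped: below the threshold
    ("face_a", "female", 0.92),
    ("face_b", "male", 0.60),   # dropped: face_b ends up with no prediction
]
predictions = aggregate_confident_predictions(impressions)
```

Returning None instead of a low-confidence guess keeps uncertain faces out of the aggregate statistics rather than forcing a label onto them.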
Considerations
Before we share our concluding remarks it is important to highlight a few key considerations.
First and foremost, there is a big difference between, say, age prediction and facial recognition. Identifying a person becomes more difficult when increasing the database size. This is not the case for other types of facial analysis. As a result, any loss of information will have a greater impact on recognition performance compared with the accuracy reduction when detecting demographic data.
Personal identification vs. Anonymous statistics
The implications for real-world use cases are also quite different. Facial recognition often focuses on security applications, such as access control, where small errors can be very costly. On the other hand, headcount and demographic analytics are leveraged for understanding the target audience at a high level. Small deviations from the ground truth typically have negligible impact on these insights.
Can you tell which expressions correspond to a happier face with and without mask?
The impact of masks and face coverings is also less pronounced for classification tasks compared with regression problems. For example, a regression model would need to be trained in order to quantify the happiness level of a person based on their facial expressions. Such models produce scores at inference time which tend to behave differently for masked versus non-masked faces.
An AI system can be only as good as the data provided to it
Last but not least, we would like to emphasize the importance of people’s right to identify with different genders. A trained AI model can be only as good as the data provided to it and the underlying biological and biometric differences. Excluding predictions whose confidence score is low (see the Discussion section) and focusing on the right use cases is extremely important.
What the future holds
Developing artificial intelligence models requires careful planning and exhaustive testing. This is particularly true when it comes to applications intertwined with intimate data such as a person’s face.
We are humans and still have many things to learn. Nonetheless, our team remains extremely bullish about the future of facial analysis even in the presence of masks and face coverings.
We will continue to work diligently and incorporate safeguards which protect people’s privacy. You are invited to support our mission and join our journey!
Stay safe. Stay positive. Test negative.
Some images are sourced from FairFace and used under the CC-BY-4.0 license. A few of the images were modified with mask augmentation.
Additional images and videos are sourced from Zenus, Wikimedia Commons, Pexels, and Unsplash. They may have been modified from their original form.