What do you learn from this blogpost?

This article teaches you how to use ML classifiers for image classification based on pixel analysis. It also shows how to evaluate classifier performance in terms of accuracy and speed.


What’s the matter with masks?

Mask mandates have become a common practice in the efforts to tackle the Covid-19 pandemic. Enforcement, however, can be problematic without the help of digital tools. Using cameras and image classification in public places can help in several ways:

  • self control: showing people with TV screens whether they wear their mask correctly
  • peer control: making people who don’t adhere to the rules visible
  • group control: helping enforcers find the people who don’t adhere to mandates, e.g. in a theater or a stadium

These three use cases are certainly debatable - but we believe that there are ethical ones in the realm of self- and group control. An assumed use case for such an ML-solution would be the monitoring of mask mandate compliance when entering a small event.


What do you need?

To tackle this problem, you need a computer with a current Python version installed, a GitHub account if you want to work on the code in a team, and a cloud service if you want to share the raw data (pictures) with other people. You also need the data itself, in our case portrait pictures of people with and without masks.

Last but not least, we recommend using Jupyter Notebooks for working on the code, as they let you add more structure and annotations than a plain editor.



Getting Started

Before we start, we want to acknowledge Aurélien Géron’s book “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”. We learned a lot from working through it, and his work often provided a starting point for our approaches.


Get your data

For this approach, you need labeled portrait pictures of people who wear masks and who do not wear them (correctly). While we used the following two datasets, feel free to use any other dataset that fits the requirements above.

Regarding the data, it is important to check how the pictures are organized. Are they sorted into different folders depending on whether a mask is worn? Is an additional CSV file attached that specifies whether the pictures show correctly worn masks? Depending on the answers, you will need to access the image files differently in your code.
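For the folder-based layout, the labels can be derived from the name of the folder a picture sits in. The following is a minimal sketch under assumptions: the folder names `with_mask` and `without_mask` and the helper names are hypothetical and will differ depending on the dataset you use.

```python
import os

# Hypothetical folder names; adapt them to the dataset you downloaded.
POSITIVE_FOLDER = "with_mask"
NEGATIVE_FOLDER = "without_mask"

def label_from_path(filepath):
    """Return True/False depending on the folder the picture is stored in."""
    folder = os.path.basename(os.path.dirname(filepath))
    if folder == POSITIVE_FOLDER:
        return True
    if folder == NEGATIVE_FOLDER:
        return False
    raise ValueError(f"Unexpected folder name: {folder!r}")

def collect_labeled_files(root_dir):
    """Walk root_dir and return parallel lists of file paths and labels."""
    filepaths, labels = [], []
    for root, _dirs, files in os.walk(root_dir):
        for name in sorted(files):
            path = os.path.join(root, name)
            filepaths.append(path)
            labels.append(label_from_path(path))
    return filepaths, labels
```

If your dataset ships a CSV file instead, the same idea applies: read the file (for example with pandas) and map each file name to its label.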


Choose your method

After you have your data collected, you need to decide on the method you want to apply. We decided on the following process:

  1. Establishing a baseline with untuned ML algorithms
  2. Tuning the best performers on Accuracy by altering hyperparameters
  3. Tuning the best performers on Speed by PCA

To begin, you need to choose the algorithms that you want to use for your baseline. We chose the following six classifiers in our project:

  • Support Vector Machines
  • K-Nearest-Neighbours
  • Stochastic Gradient Descent
  • Random Forest
  • Decision Trees
  • Logistic Regression


Choose your evaluation metrics

For the evaluation of the classifier performance, you need to choose evaluation metrics. In our project, we chose the following ones…

…for Accuracy:

  • Cross Validation Scores show us how the classifier performs on different folds of the training set.
  • Confusion Matrices are graphical representations of how well the classifiers predict the true state. In our case, besides overall accuracy, a very low number of false positives is especially important.
  • Precision & Recall Scores are the scores which determine the share of true positives in all instances classified as positive (precision) and in all positive instances in reality (recall). They are numerical representations based on the confusion matrix.

… for Speed:

  • A long Training Time is problematic in our kind of project as we are limited on computing resources.
  • A slow Prediction Speed of new Pictures would be problematic in the use case.
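To make the accuracy metrics above more concrete, here is a small hand calculation of the confusion-matrix counts and of precision and recall in plain Python. The labels are made-up toy values (True meaning a correctly worn mask), and the function names are ours for illustration only; in the project itself the scikit-learn functions are used.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives, true negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return tp, fp, fn, tn

def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    tp, fp, fn, _tn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up toy labels, not taken from our data
y_true = [True, True, True, False, False, True]
y_pred = [True, True, False, True, False, True]

print(confusion_counts(y_true, y_pred))   # (3, 1, 1, 1)
print(precision_recall(y_true, y_pred))   # (0.75, 0.75)
```

A false positive here is a picture classified as "mask worn correctly" although it is not, which is the error we most want to avoid.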



Data, Training and Evaluation

Within our project, we started with plain sequential coding. This of course led to a lot of work when, for example, changing the ML classifiers. At some point we switched to writing everything as functions in advance and combining them into pipelines at the end. We recommend this approach, as it lets you easily run your code with the classifiers and data you want. Moreover, it leads to fewer mistakes, because you do not need to copy and paste fixes for minor bugs in code that is used several times.



In the following, however, we will not always display our full functions, but only the lines of code relevant to the approach. We often simplified the code with regard to additional options and to how we store data and results, to better fit this blogpost. If you want a holistic insight into our work, check out our GitHub Repository.


Import the packages needed

Before we walk you through our code, we want to have you all set. Therefore, we recommend importing the following in advance.

# General imports
import os
import numpy as np
import pandas as pd

# For plotting
import matplotlib as mpl
import matplotlib.pyplot as plt

# For image preprocessing
from PIL import Image as PIL_Image

# For splitting the data into training and test set
from sklearn.model_selection import train_test_split

# For measuring training and prediction time
import timeit

# For metrics and validation methods
from sklearn.model_selection import cross_val_score, cross_val_predict
# Note: plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay is its replacement
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, precision_recall_curve,
                             ConfusionMatrixDisplay)

# The classifiers
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm, tree
from sklearn.linear_model import SGDClassifier

# For grid search
from sklearn.model_selection import GridSearchCV

# For dimensionality reduction
from sklearn.decomposition import PCA


Preprocess the pictures

The first thing to code is a function for picture preprocessing. As you need numerical inputs for ML, we used the Python package NumPy to create a numerical representation of all image files. As this process is time consuming for large pictures, we decided to reduce the image resolution to 24x24 pixels. Depending on your machine and the amount of your data, we recommend a similar image resolution.

The following code chunk shows how to reduce the image resolution and how to create the numerical representations for all pictures in your filepath. Don’t forget to also create a vector with the corresponding labels (correct/incorrect or TRUE/FALSE).

# Define a function that lists all pictures to include
def list_files(directory):
    r = []
    for root, dirs, files in os.walk(directory):
        for name in files:
            r.append(os.path.join(root, name))
    return r

# Use the function on your filepath
filepaths = list_files("YOUR FILEPATH HERE")

# Set target resolution (pixels per side)
target_pixel = 24

# Start with an empty array of shape (0, #pixels*3) to stack onto
rgb_data = np.empty((0, target_pixel * target_pixel * 3))

# Loop through all pictures in the file list
for filename in filepaths:
      # Open picture and make sure it has exactly three color channels
      pic = PIL_Image.open(filename).convert("RGB")
      # Reduce size from original format to target format
      pic_resized = pic.resize((target_pixel, target_pixel))
      # Extract RGB data
      pic_data = np.array(pic_resized)
      # Reshape the 3D array (24, 24, 3) into a 1D array
      help_array = np.reshape(pic_data, (pic_data.size,))
      # Stack each array onto the others to get one larger array of shape (#obs, #pixels*3)
      rgb_data = np.vstack((rgb_data, help_array))

At this point we only show the relevant code for reducing the image resolution and extracting the RGB data into vectors. If you are interested in how we saved the data into dictionaries and pickle files to make them easily accessible later, have a look at our data processing notebook on GitHub.
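The flattening step can also be illustrated without any image files. The sketch below uses random arrays as stand-ins for resized pictures (they are not real data) and collects the flattened vectors in a list before stacking them once, which is an alternative to stacking inside the loop; the names `flatten_image` and `rgb_data_demo` are ours.

```python
import numpy as np

target_pixel = 24

def flatten_image(pic_data):
    """Flatten a (height, width, 3) RGB array into a 1D feature vector."""
    return pic_data.reshape(-1)

# Synthetic stand-ins for resized pictures (random pixel values, not real data)
rng = np.random.default_rng(42)
fake_pictures = [
    rng.integers(0, 256, size=(target_pixel, target_pixel, 3), dtype=np.uint8)
    for _ in range(5)
]

# Collect the flattened vectors in a list and stack them once at the end
rows = [flatten_image(pic) for pic in fake_pictures]
rgb_data_demo = np.stack(rows)

print(rgb_data_demo.shape)  # (5, 1728)
```

Each row holds the R, G and B values of one pixel next to each other, pixel after pixel, which matters later when feature importances are summed per pixel.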


Load and Split your data

To test your trained and tuned classifiers at the end of your project, you want a test set that your classifiers have never seen before. We used the scikit-learn function train_test_split with a test size of 10% to split our data. As input, we used the rgb_data of the pictures as well as the corresponding labels.

# Split into test and training data set
rgb_data_train, rgb_data_test, labels_train, labels_test = train_test_split(rgb_data, labels, test_size=0.10, random_state=42)
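If your classes are imbalanced, it can help to keep the class proportions equal in both parts of the split. The sketch below shows this with the `stratify` argument of train_test_split on synthetic data; we did not use stratification in the split above, so treat this as an optional variation, and note that all `_demo` names and numbers are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 "pictures" with 1728 features, 30% positives
rng = np.random.default_rng(42)
X_demo = rng.random((100, 1728))
y_demo = np.array([True] * 30 + [False] * 70)

# stratify=y_demo keeps the share of positives the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=42, stratify=y_demo
)

print(X_train.shape, X_test.shape)   # (90, 1728) (10, 1728)
print(y_train.sum(), y_test.sum())   # 27 3
```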


Train the Classifiers

The next classic step is the training of the actual classifiers. While we don’t use it in the pipeline below, we wanted to show you how it’s done either way.

# Definition of function that trains classifier on specified data set
def train_clasf(classifier_x, rgb_train, labels_train):
    
    #Train model
    classifier_x.fit(rgb_train, labels_train)


Evaluate the Classifiers

Here, we code the evaluation methods mentioned above, which we imported from scikit-learn. As inputs to the function, we use the same three inputs as described above: the classifier, the RGB data and the labels.

# Define Evaluation function
def eval_clasf(classifier_x, rgb_data, labels):

    # Evaluate classifier
    pred = cross_val_predict(classifier_x, rgb_data, labels, cv=3) # cross-validated predictions with 3 folds

    # Running cross validation score
    cvs = cross_val_score(classifier_x, rgb_data, labels, cv=3, scoring="accuracy").round(3)
    
    # Initialize eval dictionary and store values
    evaluation_scores = {}
    evaluation_scores["Precision Score"] = precision_score(labels, pred).round(3)
    evaluation_scores["Recall Score"] = recall_score(labels, pred).round(3)
    evaluation_scores["Confusion matrix"] = confusion_matrix(labels, pred)
    evaluation_scores["Cross Validation Accuracy Scores"] = cvs
    evaluation_scores["Cross Validation Accuracy Score Mean"] = cvs.mean().round(3)
    evaluation_scores["Cross Validation Accuracy Score Std"] = cvs.std().round(5)
    
    return evaluation_scores

Storing the evaluation scores in a dictionary, we save and show them later on in our project. If you are interested in how we did it, please check our main notebook on GitHub.


Prepare your Pipeline

After having specified the functions for the different steps, you can write a pipeline function that runs them for the classifiers and dataset that you want to use. We pass the classifiers as a list here and loop through them within the function.

# Defining pipeline function for load, training, evaluating of baseline
def train_eval(list_of_classifiers, rgb_data, labels):
  
    # define empty dictionary
    results = {}
    
    # Run the different classifiers
    for clasf in list_of_classifiers:
        eval_scores = eval_clasf(clasf, rgb_data, labels)
        results[str(clasf)] = eval_scores
      
    return results

In this function, the scores of each classifier are stored under their own key so that they are not overwritten in the loop. If you adapt the pipeline, make sure your results are stored in a similar way. Check out the main notebook on GitHub (https://github.com/the-palmo/hertie-ml-project_face-mask-detection/blob/main/02_Face-Mask-Classification_Main-Project.ipynb) for further insights.
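To keep results across notebook sessions, the nested results dictionary can be written to a pickle file. This is a minimal sketch and not the exact storage code from our notebook: the helper names, the file name and the scores are placeholders.

```python
import os
import pickle
import tempfile

def save_results(results, filepath):
    """Write the results dictionary to a pickle file."""
    with open(filepath, "wb") as f:
        pickle.dump(results, f)

def load_results(filepath):
    """Read a results dictionary back from a pickle file."""
    with open(filepath, "rb") as f:
        return pickle.load(f)

# Placeholder values, only to illustrate the nested structure (not our results)
demo_results = {
    "RandomForestClassifier(random_state=42)": {"Precision Score": 0.0},
    "SGDClassifier(random_state=42)": {"Precision Score": 0.0},
}

# Hypothetical file location in the system's temp directory
demo_path = os.path.join(tempfile.gettempdir(), "baseline_results_demo.pkl")
save_results(demo_results, demo_path)
restored = load_results(demo_path)
print(restored == demo_results)  # True
```

Pickle also handles NumPy arrays such as the confusion matrix, so the dictionary returned by the pipeline can be stored as it is.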


Run your Baseline

Now, at the end of the first part, you can specify the classifiers to use and run the pipeline. As we ran into trouble with the KNN and Logistic Regression classifiers (several hours of runtime and a repeatedly crashing kernel), we only ran the other four classifiers in the end. If you want to get the same results as we did, don’t forget to set the random state to 42.

# Define classifiers to test in baseline
classifier_RandomForest = RandomForestClassifier(random_state=42)
classifier_DecTree = tree.DecisionTreeClassifier(random_state=42)
classifier_LinSVC = svm.LinearSVC(max_iter=4000, tol=1e-3, random_state=42) # linear variant, as the standard kernel-based SVC is impractical for large datasets
classifier_SGD = SGDClassifier(random_state=42)

# Combine classifiers to a list through which you can loop
classifiers_baseline= [classifier_RandomForest, classifier_DecTree, classifier_LinSVC, classifier_SGD]

Let’s run the pipeline.

# Actual running and evaluating of classifiers
training_results = train_eval(classifiers_baseline, rgb_data_train, labels_train)
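Since runtime is one of our evaluation criteria, it is useful to measure how long each call takes. The helper below is a small sketch with a hypothetical name (`timed_call`); it wraps any function call, such as an evaluation run for one classifier, and returns the elapsed time alongside the result.

```python
import timeit

def timed_call(func, *args, **kwargs):
    """Run func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = timeit.default_timer()
    result = func(*args, **kwargs)
    elapsed = timeit.default_timer() - start
    return result, elapsed

# Toy workload standing in for an evaluation call such as eval_clasf(...)
result, seconds = timed_call(sum, range(1_000_000))
print(result, round(seconds, 3))
```

In the pipeline, the call would look like `timed_call(eval_clasf, clasf, rgb_data_train, labels_train)`, and the elapsed time could be stored next to the scores.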


Our Baseline results

From our Baseline, we got the following results.



Pick the best Performers

Based on these results, we picked the best performers for tuning: we only continued to investigate the Random Forest Classifier and the SGD Classifier, as the former is the most precise and the latter the fastest classifier. We did not further investigate the Decision Tree Classifier because, in terms of accuracy, it would only become a version of the Random Forest Classifier and presumably would not perform as well. We dropped the Linear SVC Classifier due to its extraordinarily long evaluation runtime, as we were limited in time.



Tuning

After establishing our baseline, we optimized our best performing classifiers. To do so, we again need to define functions.

In a first step we optimize the accuracy of our RF and SGD classifiers through hyperparameter tuning using a gridsearch. Then in a second step we take the optimized classifiers, reduce their training time and boost their prediction speed.


Tune on Accuracy

As the results of our baseline for the RF and SGD are already very good, tuning the hyperparameters will only result in minor improvements. Nonetheless, we perform a grid search that tests 18 different hyperparameter configurations on 3 folds of the training set - keep in mind that this step is very resource intensive.

The following code chunk shows the function for hyperparameter tuning, which scoring we use, and how to save the best parameters. The inputs for the function are the classifier to tune, the parameter grid for the grid search, the rgb_train data and the labels_train.

def hyper_tune_grid_search(classifier_x, param_grid, rgb_data_train, labels_train):

    # Perform Grid-Search
    grid_search = GridSearchCV(classifier_x, param_grid, cv=3,
                            scoring=["precision", "recall", "accuracy"],
                            refit = "precision",
                            n_jobs = 4,
                            verbose = 3,
                            return_train_score = True)
    
    grid_search.fit(rgb_data_train, labels_train)
    
    # Extract best parameters
    best_params_dict = {}
    best_params_dict[str(classifier_x)] = grid_search.best_params_
    
    return best_params_dict

The next step is to define the varying hyperparameters for the RF and SGD classifiers respectively before calling the above function. At this point, we selected only hyperparameters that affect accuracy and chose a computationally manageable number of different values, as we were constrained in computational power.

Feel free to experiment with the grid search parameters if you have the time and computational power!

# Random Forest Classifier
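param_grid_RandomForest = [
    {"n_estimators": [10, 100, 250],
     # "sqrt" replaces "auto", which newer scikit-learn versions no longer accept
     # (for classifiers, "auto" meant "sqrt")
     "max_features": ["sqrt", "log2", None],
     "bootstrap": [True, False]}    
]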

# SGD
param_grid_SGD = [
    {"penalty": ["l1", "l2"],
     "alpha": [1e-4, 1e-2, 1],
     "max_iter": [100, 1000, 10000]}    
]
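Before launching a grid search, it is worth checking how many model fits it will trigger. The sketch below counts the combinations of the SGD grid with plain Python; the helper name `count_configurations` is ours, and the Random Forest grid can be checked the same way (it also has 3 x 3 x 2 = 18 combinations).

```python
from itertools import product

def count_configurations(param_grid):
    """Count the parameter combinations in a list-of-dicts grid."""
    total = 0
    for grid in param_grid:
        total += len(list(product(*grid.values())))
    return total

# Copy of the SGD grid from above
param_grid_sgd_demo = [
    {"penalty": ["l1", "l2"],
     "alpha": [1e-4, 1e-2, 1],
     "max_iter": [100, 1000, 10000]}
]

n_configs = count_configurations(param_grid_sgd_demo)
n_fits = n_configs * 3  # 3 cross-validation folds per configuration

print(n_configs, n_fits)  # 18 54
```

With `refit` enabled, the grid search adds one final fit of the best configuration on the whole training set.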

Let’s tune the classifiers on Accuracy now!

classifier_RandomForest_hyper = RandomForestClassifier(random_state=42)
best_params_rf = hyper_tune_grid_search(classifier_RandomForest_hyper, param_grid_RandomForest, rgb_data_train, labels_train)

classifier_SGD_hyper = SGDClassifier(random_state=42)
best_params_sgd = hyper_tune_grid_search(classifier_SGD_hyper, param_grid_SGD, rgb_data_train, labels_train)
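The dictionaries returned by the tuning function are keyed by the string representation of the classifier. The following sketch shows one way to turn such an entry into a new classifier; the dictionary is built by hand for illustration, with the parameter values taken from the SGD configuration used later in this post.

```python
from sklearn.linear_model import SGDClassifier

base = SGDClassifier(random_state=42)

# Hand-built stand-in for the output of hyper_tune_grid_search
best_params_sgd_demo = {
    str(base): {"penalty": "l2", "alpha": 0.01, "max_iter": 100}
}

# Look up the parameters and pass them on as keyword arguments
params = best_params_sgd_demo[str(base)]
tuned = SGDClassifier(random_state=42, **params)

print(tuned.get_params()["alpha"])  # 0.01
```

Unpacking the dictionary avoids retyping the values by hand and keeps the tuned classifier in sync with the grid search output.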


Tune on Speed

Now that we have boosted the accuracy, let’s turn to improving these tuned algorithms with respect to processing speed.

Images are prone to having a lot of features. As every pixel in our case is represented by 3 RGB values, the baseline configuration has 24 x 24 x 3 = 1728 features. Presumably, a lot of these features - especially at the edges - do not play a major role in detecting a mask. Below we plot the feature importance of the Random Forest Classifier - which confirms that in fact only a few features play an important role.


# Defining the function to plot the per-pixel feature importance
def plot_digit(data):
    image = data.reshape(target_pixel, target_pixel)
    plt.imshow(image, cmap = mpl.cm.hot,
               interpolation="nearest")
    plt.axis("off")
    
# Adding feature importances from 3 RGB values to one pixel
# (classifier_RandomForest_dim_red must already be fitted on the 1728 RGB features)
importances = classifier_RandomForest_dim_red.feature_importances_
feature_imp_sum = importances.reshape(-1, 3).sum(axis=1)

# Plotting feature importance sum for every pixel
plot_digit(feature_imp_sum)

cbar = plt.colorbar(ticks=[feature_imp_sum.min(), feature_imp_sum.max()])
cbar.ax.set_yticklabels(['Not important', 'Very important'])

Given that we find only a limited set of features to be important, we perform a Principal Component Analysis to reduce the dimensionality of the dataset without losing too much variance. Indeed, this reduces our set of features to only 184 dimensions that are necessary to explain 95% of the dataset variance.

# Defining how much variance you want to have explained:
threshold = 0.95
pca = PCA(n_components=threshold)
rgb_data_reduced = pca.fit_transform(rgb_data)
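The code above fits the PCA on the full dataset. A common variation, sketched below on synthetic data, is to fit the PCA on the training part only and then apply the same transformation to the test part, so that no information from the test set enters the preprocessing. This is an alternative ordering rather than what we did, and all `_demo` names and the data are made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic data with 5 underlying directions plus a little noise
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 60))
X_demo = latent @ mixing + 0.01 * rng.normal(size=(200, 60))

X_train, X_test = train_test_split(X_demo, test_size=0.10, random_state=42)

# Keep as many components as needed to explain 95% of the variance
pca_demo = PCA(n_components=0.95)
X_train_reduced = pca_demo.fit_transform(X_train)  # fit on training data only
X_test_reduced = pca_demo.transform(X_test)        # reuse the fitted projection

print(pca_demo.n_components_)
print(pca_demo.explained_variance_ratio_.sum())
```

The attribute `n_components_` tells you how many dimensions were kept, which is how a number like the 184 dimensions mentioned above can be read off.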



Run Classifiers on the Test Set

We defined a function that trains a classifier on the training set and then runs it on the test set to compute the evaluation metrics.

def test_clasf(classifier_x, rgb_data_train, rgb_data_test, labels_train, labels_test):
    # Training
    start_time_train = timeit.default_timer()
    classifier_x.fit(rgb_data_train, labels_train)
    time_elapsed_train = timeit.default_timer() - start_time_train

    # Predictions
    start_time_pred = timeit.default_timer()
    pred=classifier_x.predict(rgb_data_test)
    time_elapsed_pred = timeit.default_timer() - start_time_pred

    # Calculate evaluation metrics
    evaluation_scores = {}
    evaluation_scores["Precision Score"] = precision_score(labels_test, pred).round(3)
    evaluation_scores["Recall Score"] = recall_score(labels_test, pred).round(3)
    evaluation_scores["Accuracy Score"] = accuracy_score(labels_test, pred).round(3)
    evaluation_scores["Confusion matrix"] = confusion_matrix(labels_test, pred)
    evaluation_scores["Prediction time in seconds"] = round(time_elapsed_pred, 3)
    evaluation_scores["Training time in seconds"] = round(time_elapsed_train, 3)
    
    return evaluation_scores

The following code chunk calls the above function and uses the differently tuned classifiers as inputs.

# Splitting dimensionality reduced data set
rgb_data_reduced_train, rgb_data_reduced_test, labels_train, labels_test = train_test_split(rgb_data_reduced, labels, random_state=42, test_size=0.10)

# Define new classifiers with hyperparameters from Grid Search
classifier_RandomForest_test_speed_tuned = RandomForestClassifier(random_state=42, n_estimators=250, max_features="sqrt", bootstrap=False) # "sqrt" is what "auto" meant for classifiers
classifier_SGD_test_speed_tuned = SGDClassifier(random_state=42, penalty="l2", alpha=0.01, max_iter=100)

# Train classifier and evaluate on test set
eval_scores_RF_speed_tuned = test_clasf(classifier_x = classifier_RandomForest_test_speed_tuned, 
                                       rgb_data_train=rgb_data_reduced_train, 
                                       rgb_data_test=rgb_data_reduced_test, 
                                       labels_train=labels_train, 
                                       labels_test=labels_test)

eval_scores_SGD_speed_tuned = test_clasf(classifier_x = classifier_SGD_test_speed_tuned, 
                                       rgb_data_train=rgb_data_reduced_train, 
                                       rgb_data_test=rgb_data_reduced_test, 
                                       labels_train=labels_train, 
                                       labels_test=labels_test)


Results on the Test Set

We get the following results:

Please note that the prediction time here is not displayed per image, but for predicting the whole test set (10% of the full data).
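To compare prediction speed with the requirements of a live use case, the total time can be converted into a time per image. The helper below is a small sketch with a hypothetical name, and the numbers in the example call are placeholders, not our measurements.

```python
def per_image_time_ms(total_seconds, n_images):
    """Convert the prediction time for a whole set into milliseconds per image."""
    if n_images <= 0:
        raise ValueError("n_images must be positive")
    return total_seconds * 1000 / n_images

# Placeholder numbers for illustration only
print(per_image_time_ms(0.5, 1000))  # 0.5
```

In the evaluation function, `n_images` would simply be the number of rows in the test set.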

Conclusion(s)

This article shows how to successfully implement ML algorithms for face mask detection. It further indicates that ML algorithms based on image classification of pixel representations can indeed solve the problem at hand reliably and fast. We achieve accuracy scores of well above 99% with the tuned RF classifier trained on all features.

Additionally, our comparison reveals that prediction time is a non-factor for either classifier. When faced with resource constraints, e.g. when the classifier is trained multiple times per day, the SGD classifier might be worth considering: it trains much faster and still has an overall accuracy of well above 96%. This is especially beneficial for the assumed use case of monitoring gatherings, as the algorithm can be constantly updated and retrained, even several times per hour. This conclusion is of course limited by some factors described below.

Data limitations

As discussed above, high-quality data is pivotal for real-world applications. We limited our approach to portrait-style pictures, consisting of a large set of images artificially augmented with masks and a smaller real-world set. Thus our classifiers only work on portrait-style pictures and are not able to identify masks in groups of people or in inputs that are not portraits.

No use of ensemble methods

When analyzing and tuning multiple classifiers, implementing an ensemble approach can be beneficial. However, we decided against this for two reasons. Firstly, our results were already very accurate and there was considerable overlap in the errors, which offered only limited scope for improvement. Secondly, ensemble classifiers would by design yield longer prediction and training times.
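The overlap in errors mentioned above can be quantified by comparing which samples each classifier gets wrong. The following sketch uses made-up toy predictions and hypothetical function names; it is meant to illustrate the idea, not to reproduce our analysis.

```python
def misclassified(y_true, y_pred):
    """Return the set of indices where the prediction is wrong."""
    return {i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p}

def error_overlap(y_true, pred_a, pred_b):
    """Share of all errors that both classifiers make (shared / union)."""
    errors_a = misclassified(y_true, pred_a)
    errors_b = misclassified(y_true, pred_b)
    union = errors_a | errors_b
    if not union:
        return 0.0
    return len(errors_a & errors_b) / len(union)

# Made-up toy labels and predictions
y_true = [1, 1, 0, 0, 1, 0]
pred_rf = [1, 0, 0, 1, 1, 0]    # wrong at indices 1 and 3
pred_sgd = [1, 0, 0, 1, 0, 0]   # wrong at indices 1, 3 and 4

print(error_overlap(y_true, pred_rf, pred_sgd))
```

A value close to 1 means the classifiers fail on the same pictures, so combining them would correct few mistakes.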

Limited hyperparameter tuning

The project could be improved with additional runs of hyperparameter tuning and by including further hyperparameters.