Automating photo-ID of Whales and Dolphins

Machine Learning

We present a classical machine learning and an advanced deep learning approach to automate the photo-ID of whales and dolphins on the species and individual level.

Maren Rieker , Victor Möslein , Reed Garvin , Dinah Rabe
2022-05-11
Example photo-ID over time

Figure 1: Example photo-ID over time

Man-made climate change is one of the main challenges for humanity and of concern for many scientific disciplines investigating its multiple, complex impacts on nature and societies. One of the ecosystems impacted are marine ecosystems. Research concerning marine mammals makes an important contribution to our knowledge about changes in the oceans, as marine mammals are indicators of ocean ecosystem health. Therefore their protection and conservation is a crucial task. Up until today, most observation and tracking of individual animals and populations is done manually by humans - a time consuming and error-prone method. Motivated by a Kaggle competition funded by the research collaboration ”Happywhale”, we aim to contribute to the efforts by automating whale and dolphin photo-ID to significantly reduce image identification times. We investigated the use of conventional Machine Learning and a Deep Learning algorithm for this image classification task. As expected, we find that conventional machine learning classifiers under-perform in correctly distinguishing between subtle characteristics of the natural markings of whales and dolphins and therefore state-of-the-art deep learning models are needed for this complex task.

Background

For several years, the impacts of man-made climate change have been specifically visible through the alteration and destruction of the marine ecosystem. A worrying development is the change in the migratory behavior of whales, dolphins and sharks as it is a strong indicator for the destruction of the ecosystems they are living in. The protection and conservation of marine mammals is crucial to the balance and therefore the health of ecosystems. To be able to carry out meaningful conservation efforts, the first steps are understanding the status quo of different animal populations and their migration patterns by identification and monitoring of individual animals.

Currently, the majority of research institutions still rely on resource-intensive and sometimes inaccurate manual matching of photographs by the human eye. This slow and imprecise practice naturally limits the scope and impact of existing research. The automatized image classification of whales and dolphins could enable a scale of study previously unaffordable or impossible. There are first attempts of using Machine Learning in marine biology and environmental protection to address this challenge. Most of the research either uses recordings of dolphin or whale sounds for classification (Huang et al. 2016) or photographs of fins (Hughes and Burghardt 2017; Renò et al. 2020) given that these are the two most common types of documenting (sounds or images) wild animals.

Current methods of these automated classification attempts are mostly Deep Learning (DL) approaches using convolutional neural networks (CNN). Regarding standard Machine Learning (ML) techniques, however, there is not much research to be found. This might be due to the challenging task of distinguishing between unique – but often very subtle – characteristics of the natural markings of whales and dolphins.

In this article, we present a classical machine learning approach for automated photo-ID and test it against a the-state-of-the-art deep learning approach to see, if both approaches would be feasible. Our approaches focuses mainly on prediction precision and accuracy, and only secondary on prediction and training time, and is limited to images of dorsal fins and lateral body views. We broke down the task into three main challenges:

  1. Removing the animal from the rest of the picture using Image Segmentation,
  2. Using and testing ML classifiers for species classification and
  3. Implementing a DL model to extend the use case from the classification of species to individual animals.

The dataset was provided by the Kaggle competition initiator “Happywhale” and includes over 50.000 images of fins and lateral body views of dolphins and whales in different pixel sizes. To be able to focus on the models, we based our apporaches on a pre-cropped dataset provided by a fellow Kaggler. Based on the state-of-the-art deep learning methods for image segmentation we use Tracer to remove the background from the images. In a next step we established a baseline for species prediction with a Softmax Logistic Regression model. Thereafter, two advanced decision tree models, a Random Forest and a XGBoost, are implemented and tuned in terms of speed and classification precision and accuracy. As the final part of our project, a Deep Learning algorithm capable of predicting the species as well as individual animals was implemented.

Our results show that the subtle differences between 26 whale and dolphin species cannot be detected by classic ML classifiers with great precision. We demonstrate that while it is indeed possible to train the widely used ML models Logistic Regression, Random Forest and XGBoost on a large dataset of images, we show that the classifiers will never succeed in reaching precision and accuracy scores high enough for them to be applied on real-world tasks of predictions on image data with only minimal differences between classes. Consequently, our analysis finds that advanced Deep Learning models are needed, and that they achieve great accuracy and precision in computer vision tasks and are specifically well-suited to automate mundane image recognition jobs like photo-ID.

Automated photo-ID with ML & DL: The task of automating the identification of species or individuals through photo-ID is not entirely new to the research world. So far, most of the proposed solutions to this problem have been based on Convolutional Neural Networks (CNN) and related Deep Learning methods. Contributions addressing the effectiveness of conventional ML classifiers to challenges as complex as the one at hand remain scarce. Faaeq et al. (2018) apply Machine Leaning algorithms to animal classification of a data set of 19 different animals (Faaeq, Gürüler, and Peker 2018). Differently to our dataset, the animals in their data were seen in their entirety on the image, and were of entirely distinct species (e.g. Lion, Elephant, Deer). Logistic Regression produced the best accuracy in their case (0.98). Regarding speed, Random Forest delivered the best results, beating Logistic Regression by a factor of eight. However, the researchers conclude that more powerful image classification models, specifically Deep Learning methods, are needed for classification tasks on more complex image datasets. Dhar and Guha (2021) find XGBoost to perform best in predicting the correct fish species, scoring a significantly higher precision score than Random Forest, k-Nearest-Neighbor (kNN) and Support Vector Machines (SVM) (Dhar and Guha 2021). In their research, two different datasets containing 483 respectively 10 different fish species were used. Again, the images in the dataset depicted the entire fish, which have significant differences in shape, colors and other features, as opposed to only body parts with little differences, as in our dataset. However, these two papers provided us with guidance as to which Machine Learning models are most promising for our project.

Image Segmentation: In the past, image segmentation algorithms with different approaches have emerged, but the emergence of Deep Learning models has “caused a paradigm shift” (Minaee et al. 2021) in the field of image segmentation, because they perform so well. Image segmentation with Deep Learning is a widely researched field of computer vision with wide and diverse applications in real life, like autonomous driving, medical imaging, video surveillance, and saliency detection (Ghosh et al. 2019). There is a plethora of widely used algorithms for different fields of application which can be differentiated by being (un-)supervised, their type of learning and the level of segmentation (ibid.). For this part of the project, we used Tracer, a convolutional attention-guided tool, as it proved to perform remarkably well at a relatively low computation time (Lee, Shin, and Han 2021).

The Happywhale Dataset

Since our project is based on a Kaggle competition1, there was a pre-split dataset provided, containing 27,956 images in the test set and 51,033 images in the training set of whales and dolphins. In the course of the project we decided to use a pre-cropped dataset provided by a fellow Kaggle-competitor2 to focus on the implementation of the classifiers. Additionally, a .CSV-file is provided that contains filename (of the corresponding image), individual-ID and the species. Images focus on dorsal fins and lateral body views. As the corresponding labels to the test-set images were not published after the end of the Kaggle competition we could only use the 51,033 images in the training set as the basis for our ML and DL models. Important to note is that the dataset is quite unbalanced (see figure 2), both with regards to species as well as individuals. There are 8 (out of 26) species which, pre- splitting, were only represented less than 200 times compared to the top three classes that are represented each over 5000 times.

Distribution of Species across the dataset

Figure 2: Distribution of Species across the dataset

Data Preprocessing: Image Segmentation

Separating the animals from the background was an elementary step of the picture pre-processing. In our case, the background of the pictures was not relevant for our research interest and therefore did not have to be included in the images we trained our models with. Contrary, the background would only have been disturbing, inflated the data size and disrupted the ML/DL pipelines by making more elaborate calculations necessary. For the segmentation of the images, we used Tracer (Lee, Shin, and Han 2021) which is a deep learning model that works with already cropped images and a specified pixel size. Because our images had the size 512x512, we employed the TRACER-efficient-5 model. Tracer derives which elements in the training image are part of the object and which are part of the background and are going to be coloured white. In order to make this decision, the program sets thresholds in the variation in the pixels. Tracer repeatedly attempts (Epochs) to find the edge of the object. After the first attempts at finding the edge, this attempt is removed and compared to the final attempt at mass edged attention module, creating the segmented image. This comparison attempts for any loss that may have occurred to be accounted for, therefore producing a more accurate segmented image.

Example image pre and post segmentation

Figure 3: Example image pre and post segmentation

In addition the following data cleaning and wrangling steps had to be performed:

Experimental Setup

Conventional Machine Learning Approach

Originally, it was planned to tune, train and test the ML models first on the images before PCA is applied and then a second time after PCA has been applied on the data to validate the assumption that a large amount of white pixels can be dropped from the segmented images without loosing any relevant information for classification. But running any machine learning model on our available hardware (there was no constant availability of the Hertie Server) was impossible even for a reduced resolution of the images of 48 x 48 (which results in 6,912 features per image). For this reason, the application of PCA had to be moved to the beginning.

To establish a baseline, we are implementing a Softmax Regression using the Scikit Learn Multiclass Logistic Regression module. Researchers often rely on Softmax as a conventional supervised ML method for the good classification performance (see Related Work section). As Softmax delivers similarly accurate results as kNN and SVM and needs significantly less time when working with large datasets, it is the appropriate method for us.

Building on that baseline we implemented and tuned a Random Forest (RF) model and XGBoost with early stopping in a second step, to be able to compare the three selected ML models. RF classifications are popular for achieving similarly good prediction results as kNN, SVM and Softmax, while being significantly faster than those, which is why we are choosing it as a second ML model. Additionally to their good training speed, RF has a number of other advantages that make it useful for our classification task. Most importantly, RF is not sensitive to over-fitting and can handle unbalanced datasets, which enables trying out different hyperparameter configurations. The XGBoost model was selected due to its high popularity for classification tasks in the context of Kaggle competitions and to compare a bagging and a boosting approach to the problem at hand. While being generally slower than Random Forest and Logistic Regression, XGBoost is suitable for the application at hand since the low training speed is not the only crucial criterion and XGBoost is known for achieving better classification than Random Forest. Being also a tree-based method, using XGBoost allowed us to compare two similar ML classifiers on the same hyperparameters.

Due to constraints of the computational power available to us, we could neither use Scikit-Learn’s GridSearchCV nor RandomizedSearchCV. Therefore, a Learning Curve approach was chosen to overcome those computational barriers. This method also uses GridSearchCV, however with a pre-specified maximal depth of the trees and the only parameter to be tuned being the number of trees (i.e., number of estimators). The learning curve depicts the precision score vs. number of trees, as well as the training time vs. number of trees for both a training set and a testing set used in the cross validation. Therefore, the best number of trees can easily be identified, which delivers the best precision while not risking over-fitting and having a good training time.

Deep Learning Aproach

In addition, we used a deep learning approach to achieve two outcomes - one being our submission for the Kaggle competition, which looks at individuals classification, and secondly, species-based prediction for comparing our results with the non Deep Learning portion of this project.

The Deep Learning model we have employed is a combination of the features of a transformer and convolutions. The EfficientNet model uses a combined Convolutional Neural Network (CNN) and is a more efficient version of ResNet trained on Image Net. We have chosen this model for two reasons: the performance recommendation of other Kaggle users in the competition and how lightweight this model is. This ensures high performance, great accuracy and a model that can efficiently, in terms of computational capacity, be carried out on the Hertie Server.

To ensure peak learning of the model, we used two Keras packages that stop the learning when it is no longer improving. This helps to protect the model from over-fitting and ensures that it is as accurate and precise as possible. The overall training time was, on average, 8 hours when using individual IDs as classifiers, with each epoch taking an average of 40 seconds.

Evaluation

For the evaluation in our Machine Learning approach we used the classical metrics to evaluate classification models: the confusion matrix and the accuracy, precision, recall and F1 score. To have an initial check for over-fitting we evaluate the classifiers both on the training and a separate evaluation set. For our purposes, it is most desirable to have a good ratio of true positive outcomes among positive classified images. In line with this description of the importance of a low number of false positives over the low number of false negatives, a higher precision score is more important than a higher recall score regarding the quality of our outcome. We, additionally, use the F1-score, which is a sub-contrary mean of recall and precision, to combine them. Overall, the accuracy and precision are the two most important metrics for us, as these are the ones also used in the Kaggle competition, our Deep Learning classification, and in most image classification literature.

For the evaluation of our Deep Learning classifier we used Categorical Crossentropy which is a loss function that is used in mutually-exclusive multi-class classification tasks. These are tasks where an example can only belong to one out of many possible categories, and the model must decide which one. For evaluating the results of predicting individuals, the Mean Average Precision score at 5 was calculated by Kaggle. This allowed us to predict five IDs per image, whereby a score of one (best score) is reached if one of those five predictions is correct. For the purpose of species prediction, the accuracy score and precision scores were used as the primary metric, to make the results of the ML models comparable.

Results

For the ML models the following results on our evaluation scores were obtained after predicting the labels in the test set. The tuned XGBoost Classifier yielded the best scores in every category except for the training time, with a top precision score of 11.8% and accuracy of 17%. The training time of 26 minutes 48 seconds is significantly higher than that of the Softmax (LG) and Random Forest models. An overview of the evaluation results is given in the following table:

Results of the Machine Learing Models

Figure 4: Results of the Machine Learing Models

For the ML models it was expected to achieve a low performance. In that regard, the scores of our models achieve better results than expected, especially as the precision score is above 10% for each classifier. An issue that persists across all our models is the high degree of over-fitting to the training data. A highly uncommon result was obtained during the tuning of hyperparameters using the Learning Curve, namely that the precision decreased with increasing number of trees.

Learning Curve of Random Forest Classifier

Figure 5: Learning Curve of Random Forest Classifier

Figure 5 illustrates on the left side that the best precision is obtained with just 10 trees. The line for the training set is not visible, but follows a similar behavior. The precision for the training set with 10 trees is at 0.69, indicating over-fitting. However, the precision on both the training and the test set drops dramatically already for 50 trees while the training time increases significantly when increasing the number of trees. This unusual behavior can only be explained because of the complexity of the dataset and the incapacity of Random Forest to handle the subtle differences between the species.

The results in the Deep Learning models were varied. The Baseline model had a mean average precision score of 0.00005 after 200 epochs, but by switching from EfficientNetB0 to B1 and increasing the number of epochs to 500, we were able to increase the score to 11%. Running the adjusted model, however, more than doubled the running time from around 3 to around 8 hours. The loss of the baseline model is around 1.5, and for the tuned model around 1. Our model in predicting species was already quite high in accuracy at 60% after only two epochs, showing its performance when presented with 26 vs over 11,000 classes. Its loss score was 1.3 and it ran in only three minutes.

A summary of the evaluation results is provided in the following table:

Results of the Deep Learing Model

Figure 6: Results of the Deep Learing Model

Figure 7 below shows the loss score of the model over time as we searched for individuals over a span of 500 epochs. As you can see, it does not change the outcome much after 450 epochs so for future research it would be recommended to switch to a B2 EfficientNet.

Loss Graph of the Individual IDs Tuned Model

Figure 7: Loss Graph of the Individual IDs Tuned Model

The Deep Learning models were expected to achieve higher precision than 0.00005, however after changing our parameters after our Baseline results by increasing the number of epochs and switching to a different version of EfficientNet we were able to increase both our accuracy and precision score, while decreasing our loss. As a result, we were able to deduce that increasing the number of epochs would increase precision by comparing our results to those of other Kagglers.

Analysis and Limitations

Next to our quantitative assessment we looked more deeply at errors and their origin in the dataset. We, firstly, created error tables that enable us to identify the image IDs for pictures that were predicted wrongly. Next, a confusion matrix was used to get a better understanding of the error distribution. Considering the multiple steps of our project, the first possible source for errors is the pre-cropped dataset which includes distorted and blurred images that could be considered outliers. Furthermore, considering the imbalance of our dataset, there is a large amount of errors stemming from an under-representation of species. There are multiple classes which the ML classifiers were not able predict correctly at all. This could be due to the fact that there are some species which do not have very unique or distinguishable fins (or no fins at all, such as beluga whales). Therefore separately training and setting the thresholds for each species could be a way forward to improve prediction success.

Our research was bound by certain limitations such as resource constraints, especially in terms of time and computational power. Keeping that in mind, in the following we discuss where there is room for possible improvements. The project could have been improved with more extensive hyperparameter tuning, and by including further hyperparameters. We do not expect the results for the conventional ML models to increase in a way that they are comparable to the DL model, but certainly some improvements in prediction performance could be reached. As discussed, DL models dominate the literature, therefore we would suggest to focus with improvements on the DL model. For the implemented Deep Learning model, it would have been beneficial to have more GPU power and training time to train it more extensively. The leading participants of the Kaggle competition submitted multiple hundreds of attempts and retrained their models accordingly.

Conclusion

This project confirms current research practice that conventional ML algorithms are not optimized for solving complex classification tasks in a very precise and quick manner, and that Deep Learning approaches, such as CNN, are needed for tasks including but not limited to automating photo-ID and matching. We achieved and accuracy scores of 17% with the conventional ML models on species identification, but accuracy of over 60% with a only limited trained Deep Learning model. Especially when aiming for the original goal of the Kaggle Competition to identify individual animals, a extensively trained DL model is needed to build on the 11% accuracy achieved.

WHEN THERE IS A WHALE, THERE IS A WAY!

Acknowledgments

As explained above, we decided in the course of the project to rely on a pre-cropped dataset. We are therefore grateful to the Kaggle User Phalanx for providing it. In addition, we thank Kaggle User Adnan Pen for inspiring us to use Tracer for the Image Segmentation part.

Dhar, Prashengit, and Sunanda Guha. 2021. “Fish Image Classification by XgBoost Based on Gist and GLCM Features.” International Journal Information Technology and Computer Science 4: 17–23.
Faaeq, Ainuddin, Hüseyin Gürüler, and Musa Peker. 2018. “Image Classification Using Manifold Learning Based Non-Linear Dimensionality Reduction.” In 2018 26th Signal Processing and Communications Applications Conference (SIU), 1–4. Ieee.
Ghosh, Swarnendu, Nibaran Das, Ishita Das, and Ujjwal Maulik. 2019. “Understanding Deep Learning Techniques for Image Segmentation.” ACM Comput. Surv. 52 (4). https://doi.org/10.1145/3329784.
Huang, Ho Chun, John Joseph, Ming Jer Huang, and Tetyana Margolina. 2016. “Automated Detection and Identification of Blue and Fin Whale Foraging Calls by Combining Pattern Recognition and Machine Learning Techniques.” In OCEANS 2016 MTS/IEEE Monterey, 1–7. IEEE.
Hughes, Benjamin, and Tilo Burghardt. 2017. “Automated Visual Fin Identification of Individual Great White Sharks.” International Journal of Computer Vision 122 (3): 542–57.
Lee, Min Seok, WooSeok Shin, and Sung Won Han. 2021. “TRACER: Extreme Attention Guided Salient Object Tracing Network.” arXiv Preprint arXiv:2112.07380.
Minaee, Shervin, Yuri Y. Boykov, Fatih Porikli, Antonio J Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. 2021. “Image Segmentation Using Deep Learning: A Survey.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–1. https://doi.org/10.1109/TPAMI.2021.3059968.
Renò, Vito, Gennaro Gala, Pierluigi Dibari, Roberto Carlucci, Carmelo Fanizza, Giovanna Castellano, Gennaro Vessio, Giovanni Dimauro, and Rosalia Maglietta. 2020. “Innovative Classification of Dolphins Using Deep Neural Networks and GrabCut.” In. https://doi.org/10.23919/SpliTech49282.2020.9243820.
Tan, Mingxing, and Quoc V. Le. 2019. “EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks.” CoRR abs/1905.11946. http://arxiv.org/abs/1905.11946.

  1. The competition can be found here: https://www.kaggle.com/c/happy-whale-and-dolphin↩︎

  2. The pre-cropped dataset can be found here: https://www.kaggle.com/datasets/phalanx/whale2-cropped-dataset↩︎

References

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/Whale-way, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".