\documentclass[a4paper]{article}
% To compile PDF run: latexmk -pdf {filename}.tex

\usepackage{graphicx} % Used to insert images into the paper
\usepackage{float}
\interfootnotelinepenalty=10000 % Stops footnotes overflowing onto the next page
\usepackage[justification=centering]{caption} % Used for captions
\captionsetup[figure]{font=small} % Makes captions small
\newcommand\tab[1][0.5cm]{\hspace*{#1}} % Defines a new command to use 'tab' in text
\usepackage[comma, numbers]{natbib} % Used for the bibliography
\usepackage{amsmath} % Math package
% Enable that parameters of \cref{}, \ref{}, \cite{}, ... are linked so that a reader can click on the number and jump to the target in the document
\usepackage{hyperref}
% Enable \cref{...} and \Cref{...} instead of \ref: type of reference included in the link
\usepackage[capitalise,nameinlink]{cleveref}
% UTF-8 encoding
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc} % Support umlauts in the input
% Easier compilation
\usepackage{bookmark}
\usepackage{xcolor}
\newcommand{\todo}[1]{\marginpar{{\textsf{TODO}}}{\textbf{\color{red}[#1]}}}

\begin{document}

\title{What is Waldo?}
\author{Kelvin Davis \and Jip J. Dekker \and Anthony Silvestere}
\maketitle

\begin{abstract}
The famous brand of picture puzzles ``Where's Waldo?'' relates well to many unsolved image classification problems. This offers us the opportunity to test different image classification methods on a data set that is both small enough to compute in a reasonable time span and easy for humans to understand. In this report we compare the well known machine learning methods Naive Bayes, Support Vector Machines, $k$-Nearest Neighbors, and Random Forest against the neural network architectures LeNet, Convolutional Neural Networks, and Fully Convolutional Neural Networks. Our comparison shows that, although the different neural network architectures achieve the highest accuracy, some of the other methods come close with only a fraction of the training time.
\end{abstract}

\section{Introduction}
\tab Almost every child around the world knows about ``Where's Waldo?'', also known as ``Where's Wally?'' in some countries. This famous puzzle book has spread its way across the world and is published in more than 25 different languages. The idea behind the books is to find the character Waldo, shown in \Cref{fig:waldo}, in the different pictures in the book. This is, however, not as easy as it sounds. Every picture in the book is full of tiny details and Waldo is only one out of many. The puzzle is made even harder by the fact that Waldo is not always fully depicted; sometimes it is just his head or his torso popping out from behind something else. Lastly, the reason that even adults will have trouble spotting Waldo is the fact that the pictures are full of ``red herrings'': things that look like (or are colored like) Waldo, but are not actually Waldo. \\
\begin{figure}[ht]
\includegraphics[scale=0.35]{waldo.png}
\centering
\caption{
A headshot of the character Waldo, or Wally. Pictures of Waldo are copyrighted by Martin Handford and are used under the fair-use policy.
}
\label{fig:waldo}
\end{figure}

\par The task of finding Waldo relates to a lot of real-life image recognition tasks. Fields like mining, astronomy, surveillance, radiology, and microbiology often have to analyse images (or scans) to find the tiniest details, sometimes undetectable by the human eye. These tasks are especially hard when the things you are looking for are similar to the rest of the image. Such tasks are thus generally performed using computers to identify possible matches. \\
\par ``Where's Waldo?'' offers us a great tool to study this kind of problem in a setting that is humanly tangible. In this report we will try to identify Waldo in the puzzle images using different classification methods. Every image will be split into different segments and every segment will have to be classified as either being Waldo or not Waldo. We will compare various classification methods, from the more classical machine learning methods, like naive Bayes classifiers, to the current state of the art, neural networks. In \Cref{sec:background} we will introduce the different classification methods, \Cref{sec:method} will explain the way in which these methods are trained and how they will be evaluated, \Cref{sec:results} will discuss the results, and \Cref{sec:conclusion} will offer our final conclusions.

\section{Background} \label{sec:background}
\tab The classification methods used can be separated into two groups: classical machine learning methods and neural network architectures. Many of the classical machine learning algorithms have variations and improvements for various purposes; however, for this report we will use only their basic versions. In contrast, we will use several different neural network architectures, as these are currently the most widely used methods for image classification.

\subsection{Classical Machine Learning Methods}
\tab The following paragraphs give only brief descriptions of the different classical machine learning methods used in this report. For further reading we recommend ``Supervised machine learning: A review of classification techniques''~\cite{MLReview}.

\paragraph{Naive Bayes Classifier} \cite{naivebayes} is a classification method based on Bayes' theorem, shown in \Cref{eq:bayes}. Bayes' theorem allows us to calculate the probability of an event taking into account prior knowledge of conditions related to the event in question. In classification this allows us to calculate the probability that a new instance has a certain class based on its features. We then assign the class that has the highest probability.
\begin{equation}
\label{eq:bayes}
P(A\mid B)=\frac {P(B\mid A)\,P(A)}{P(B)}
\end{equation}
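As a purely illustrative example (with made-up numbers, not taken from our data set): suppose that 1\% of all image segments contain Waldo and that a red-and-white striped pattern occurs in 90\% of the segments containing Waldo but in only 5\% of the segments without him. Bayes' theorem then gives
\begin{equation*}
P(\text{Waldo}\mid\text{stripes})
= \frac{0.90 \times 0.01}{0.90 \times 0.01 + 0.05 \times 0.99}
\approx 0.15,
\end{equation*}
so even a strongly indicative feature leaves considerable uncertainty on its own; the naive Bayes classifier therefore combines the evidence of all features (assuming they are independent) before assigning the most probable class.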
\paragraph{$k$-Nearest Neighbors} ($k$-NN) \cite{knn} is one of the simplest machine learning algorithms. It classifies a new instance based on its ``distance'' to the known instances. It finds the $k$ closest instances to the new instance and assigns the new instance the class that the majority of those $k$ closest instances have. The method has to be configured in several ways: the value of $k$, the distance measure, and (depending on $k$) a tie-breaking measure all have to be chosen.

\paragraph{Support Vector Machine} (SVM) \cite{svm} has been very successful in many classification tasks. The method is based on finding boundaries between the different classes. The boundaries are defined as functions on the features of the instances. The boundaries are optimized to leave the largest possible margin between the boundary and the training instances on both sides. Originally the boundaries were linear functions, but more recent developments allow for the training of non-linear boundaries~\cite{svmnonlinear}. Once training has defined the boundaries, new instances are classified according to which side of the boundary they fall on.
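\par To make the difference between linear and non-linear boundaries concrete, the short sketch below fits one SVM with a linear kernel and one with a non-linear (RBF) kernel using the scikit-learn interface. It is an illustration only: the data is randomly generated stand-in data, not the preprocessed Waldo images used later in this report.
\begin{verbatim}
import numpy as np
from sklearn.svm import SVC

# Stand-in data: 100 flattened 64x64 RGB "images" with binary labels.
rng = np.random.default_rng(0)
X_train = rng.random((100, 64 * 64 * 3))
y_train = rng.integers(0, 2, size=100)

linear_svm = SVC(kernel="linear")  # linear decision boundary
rbf_svm = SVC(kernel="rbf")        # non-linear boundary via the RBF kernel

linear_svm.fit(X_train, y_train)
rbf_svm.fit(X_train, y_train)

# A new instance is assigned the class of the side of the boundary
# it falls on.
print(rbf_svm.predict(X_train[:5]))
\end{verbatim}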
\paragraph{Random Forest} \cite{randomforest} is a method based on classification decision trees. In a decision tree a new instance is classified by going down a (binary) tree. Each non-leaf node contains a selection criterion that decides which of its branches to follow. Every leaf node contains the class that will be assigned to the instance if that node is reached. With other training methods, decision trees have a tendency to overfit\footnote{Overfitting occurs when a model learns from the data too specifically, and loses its ability to generalise its predictions for new data (resulting in a loss of prediction accuracy).}, but in a random forest a multitude of decision trees is trained with a certain degree of randomness and the mean of these trees is used, which avoids this problem.

\subsection{Neural Network Architectures}
\tab There are many well-established architectures for neural networks, depending on the task being performed. In this paper the focus is placed on convolutional neural networks, which have been proven to classify images effectively~\cite{NIPS2012_4824}. One of the pioneering works in the field, the LeNet~\cite{726791} architecture, will be implemented to compare against two rudimentary networks with more depth. These networks have been constructed to improve on the LeNet architecture by extracting more features, condensing image information, and allowing for more parameters in the network. The difference between the two networks is their use of convolutional and dense layers. The Convolutional Neural Network (CNN) contains dense layers in the final stages of the network. The Fully Convolutional Network (FCN) contains only one dense layer for the final binary classification step; it instead contains an extra convolutional layer, resulting in an increased ability for the network to abstract the input data relative to the other two configurations. \\
\begin{figure}[H]
\includegraphics[scale=0.50]{LeNet}
\centering
\captionsetup{width=0.90\textwidth}
\caption{Representation of the LeNet neural network model architecture, including convolutional layers and pooling (subsampling) layers~\cite{726791}}
\label{fig:LeNet}
\end{figure}

\section{Method} \label{sec:method}
\tab In order to effectively utilize the aforementioned modeling and classification techniques, a key consideration is the data they are acting on. A dataset containing Waldo and non-Waldo images was obtained from an open database\footnote{``The Open Database License (ODbL) is a license agreement intended to allow users to freely share, modify, and use [a] Database while maintaining [the] same freedom for others''~\cite{openData}} hosted on the predictive modeling and analytics competition framework Kaggle~\cite{kaggle}. The distinction between images containing Waldo, and those that do not, was provided by the separation of the images into different sub-directories. It was therefore necessary to preprocess these images before they could be utilized by the proposed machine learning algorithms.

\subsection{Image Processing} \label{imageProcessing}
The Waldo image database consists of images of size 64$\times$64, 128$\times$128, and 256$\times$256 pixels obtained by dividing complete ``Where's Waldo?'' puzzles. Within each set of images, those containing Waldo are located in a folder called \texttt{waldo}, and those not containing Waldo in a folder called \texttt{not\_waldo}. Since ``Where's Waldo?'' puzzles are usually densely populated and contain fine details, the 64$\times$64 pixel set of images was selected to train and evaluate the machine learning models. This set provides the added benefit of containing the most individual images of the three size groups. \\
Each of the 64$\times$64 pixel images was inserted into a NumPy~\cite{numpy} array of images, and a binary value was inserted into a separate list at the same index. These binary values form the labels for each image (Waldo or not Waldo). Color normalization was performed on each image so that artifacts in an image's color profile correspond to meaningful features of the image (rather than to the photographic method).\\
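\par The sketch below illustrates this loading and labelling step. It is a simplified illustration only: the folder names follow the dataset layout described above, but the root path, the use of the Pillow library for reading the image files, and the per-image normalization shown (zero mean, unit variance) are assumptions made for the purpose of the example rather than a record of the exact preprocessing script.
\begin{verbatim}
import os
import numpy as np
from PIL import Image

def load_images(root="64"):
    """Read the 64x64 images from the waldo/ and not_waldo/ folders."""
    images, labels = [], []
    for label, folder in enumerate(["not_waldo", "waldo"]):
        for name in sorted(os.listdir(os.path.join(root, folder))):
            img = np.asarray(Image.open(os.path.join(root, folder, name)),
                             dtype=np.float64)
            # Per-image color normalization: zero mean, unit variance.
            img = (img - img.mean()) / (img.std() + 1e-8)
            images.append(img)
            labels.append(label)  # 1 = Waldo, 0 = not Waldo
    return np.array(images), np.array(labels)

# images, labels = load_images()  # requires the dataset on disk
\end{verbatim}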
Each original puzzle is broken down into many images but contains only one Waldo. Although Waldo might span multiple 64$\times$64 pixel squares, this means that the non-Waldo data far outnumbers the Waldo data. To combat the bias introduced by the skewed data, all Waldo images were artificially augmented by performing random rotations and reflections, and by introducing random noise, to produce new images. In this way, each original Waldo image was used to produce an additional 10 variations, which were inserted into the image array. This provided more variation in the true positives of the data set and assisted in the development of more robust methods by exposing each technique to variations of the image during the training phase. \\
Despite the additional data, there were still ten times more non-Waldo images than Waldo images. Therefore, it was necessary to cull the non-Waldo data so that there was an even split of Waldo and non-Waldo images, improving the representation of true positives in the image data set. Following preprocessing, the images (and associated labels) were divided into a training and a test set with a 3:1 split. \\
\subsection{Neural Network Training}\label{nnTraining}
\tab The neural networks used to classify the images were supervised learning models, requiring training on a dataset of typical images. Each network was trained using the preprocessed training dataset and labels for 25 epochs (an epoch being one forward and backward pass of all the data) in batches of 150. The number of epochs was chosen to balance training time against overfitting of the training data, given the current model parameters. The batch size is the number of images sent through each pass of the network. Using the entire dataset in a single batch would train the network quickly, but decrease the network's ability to learn unique features from the data. Passing one image at a time may allow the model to learn more about each image; however, it would also increase the training time and the risk of overfitting the data. Therefore the batch size was chosen to maintain training accuracy while minimizing training time.

\subsection{Neural Network Testing}\label{nnTesting}
\tab After training each network, a separate test set of images (and labels) was used to evaluate the models. The result of this testing was expressed primarily in the form of an accuracy (percentage). These results, as well as those of the other methods presented in this paper, are given in \Cref{tab:results}.

% Kelvin Start
\subsection{Benchmarking}\label{benchmarking}
\tab In order to benchmark the neural networks, the performance of these algorithms is evaluated against other machine learning algorithms. We use Support Vector Machine, $k$-Nearest Neighbors ($k=5$), Naive Bayes, and Random Forest classifiers, as provided in scikit-learn~\cite{scikit-learn}.
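\par A condensed sketch of this benchmarking loop is shown below. It assumes the preprocessed images have been flattened into one feature vector per image; apart from $k=5$ for the nearest-neighbour classifier, all constructor arguments are scikit-learn defaults, and the stand-in data is random rather than the actual Waldo data.
\begin{verbatim}
import time
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

# Stand-in data; in the experiments the flattened Waldo images are used.
rng = np.random.default_rng(0)
X_train, y_train = rng.random((300, 64 * 64 * 3)), rng.integers(0, 2, 300)
X_test, y_test = rng.random((100, 64 * 64 * 3)), rng.integers(0, 2, 100)

classifiers = {
    "Support Vector Machine": SVC(),
    "K Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
    "Gaussian Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(),
}

for name, clf in classifiers.items():
    start = time.time()
    clf.fit(X_train, y_train)             # training time is measured here
    elapsed = time.time() - start
    accuracy = clf.score(X_test, y_test)  # fraction of correct predictions
    print(f"{name}: accuracy={accuracy:.4f}, training time={elapsed:.2f}s")
\end{verbatim}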
\subsection{Performance Metrics}\label{performance-metrics}
\tab To evaluate the performance of the models, we record the time taken by each model to train on the training data, as well as the accuracy with which the model makes predictions. We calculate accuracy as
\[a = \frac{|\text{correct predictions}|}{|\text{predictions}|} = \frac{tp + tn}{tp + tn + fp + fn}\]
where \(tp\) is the number of true positives, \(tn\) is the number of true negatives, \(fp\) is the number of false positives, and \(fn\) is the number of false negatives.
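\par As a small illustration of this metric, the sketch below computes the four counts and the resulting accuracy from a vector of predictions; the variable names are purely illustrative.
\begin{verbatim}
import numpy as np

def accuracy(y_true, y_pred):
    """Accuracy from the counts of true/false positives/negatives."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy([1, 0, 1, 0, 1], [1, 0, 0, 0, 1]))  # 0.8
\end{verbatim}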
\section{Results} \label{sec:results}
\tab The time taken to train each of the neural networks and traditional approaches was measured and recorded alongside their accuracy (evaluated using a separate test dataset) in \Cref{tab:results}.

\begin{table}[H]
\centering
\renewcommand{\arraystretch}{1.5} % Adds some space to the table
\begin{tabular}{|c|c|c|}
\hline
\textbf{Method} & \textbf{Test Accuracy} & \textbf{Training Time (s)}\\
\hline
LeNet & 89.81\% & 58.13\\
\hline
CNN & \textbf{95.63\%} & 113.81\\
\hline
FCN & 94.66\% & 117.69\\
\hline
Support Vector Machine & 84.47\% & 7.87\\
\hline
K Nearest Neighbours & 70.87\% & 0.25\\
\hline
Gaussian Naive Bayes & 82.52\% & \textbf{0.13}\\
\hline
Random Forest & 95.14\% & 0.28\\
\hline
\end{tabular}
\captionsetup{width=0.80\textwidth}
\caption{Comparison of the accuracy and training time of each neural network and traditional machine learning technique}
\label{tab:results}
\end{table}

\par We can see in these results that the deep neural networks outperform our benchmark classification models in terms of the accuracy they achieve. However, the time required to train these networks is significantly greater. An additional consideration is the extra layer of abstraction present in the FCN but not in the CNN. This may indicate that the FCN can achieve better accuracies, given more training time (epochs). \\
\par Of the benchmark classifiers, we see the best performance with Random Forest and the worst performance with K Nearest Neighbours. This is consistent with the models' abilities to learn hidden relationships within the data (and K Nearest Neighbours' lack thereof). The accuracy of the random forest approach was unexpected, as the neural networks have had success in image classification tasks previously. However, this may be due to the random forest method's ability to avoid overfitting. The low training time of the classical methods could be due to their requiring only one pass over the data to train the model. Neural networks require more passes the more they abstract the data (e.g.\ through convolutions).

\section{Conclusion} \label{sec:conclusion}
\tab Images from the ``Where's Waldo?'' puzzle books are ideal images with which to test image classification techniques. Their tendency towards hidden objects and ``red herrings'' makes them challenging to classify, and the density of detail they contain makes them interesting to approach with machine learning. \\
\par In our experiments we show a comparison of machine learning methods, including deep learning, for the task of classifying an image as containing Waldo or not. The convolutional neural network architecture performed best at this task with an accuracy of 95.63\%, followed closely by the random forest approach with an accuracy of 95.14\%. The random forest, however, had a much lower training time of 0.28 seconds. Considering the training time, the random forest approach would appear to be the most suited to the task. \\
\par It would be interesting to investigate several of these methods further, including further varying the hyperparameters of the neural networks. However, there may also be much more insight to be gained by exploring the classical algorithms.

\clearpage % Ensures that the references are on a separate page
\bibliographystyle{alpha}
\bibliography{references}

\end{document}