\documentclass[a4paper]{article}

% To compile PDF run: latexmk -pdf {filename}.tex

\usepackage{graphicx} % Used to insert images into the paper
\usepackage{float}
\usepackage[justification=centering]{caption} % Used for captions
\captionsetup[figure]{font=small} % Makes figure captions small

\interfootnotelinepenalty=10000 % Stops footnotes overflowing onto the next page

\newcommand\tab[1][0.5cm]{\hspace*{#1}} % Defines a new command to use 'tab' in text

\usepackage[comma, numbers]{natbib} % Used for the bibliography
\usepackage{amsmath} % Math package

% Link the arguments of \cref{}, \ref{}, \cite{}, ... so that a reader can
% click on the number and jump to the target in the document
\usepackage{hyperref}

% Enable \cref{...} and \Cref{...} instead of \ref: type of reference included in the link
\usepackage[capitalise,nameinlink]{cleveref}

% UTF-8 encoding
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc} % Support umlauts in the input

% Easier compilation
\usepackage{bookmark}

\usepackage{xcolor}
\newcommand{\todo}[1]{\marginpar{{\textsf{TODO}}}{\textbf{\color{red}[#1]}}}
\begin{document}

\title{What is Waldo?}
\author{Kelvin Davis \and Jip J. Dekker \and Anthony Silvestere}
\maketitle
\begin{abstract}
The famous brand of picture puzzles ``Where's Waldo?'' relates well to many
unsolved image classification problems. This offers us the opportunity to
test different image classification methods on a data set that is both small
enough to compute in a reasonable time span and easy for humans to
understand. In this report we compare the well-known machine learning
methods Naive Bayes, Support Vector Machines, $k$-Nearest Neighbors, and
Random Forest against the neural network architectures LeNet, a deeper
Convolutional Neural Network, and a Fully Convolutional Neural Network. Our
comparison shows that, although the different neural network architectures
have the highest accuracy, some other methods come close with only a
fraction of the training time.
\end{abstract}
\section{Introduction}

Almost every child around the world knows about ``Where's Waldo?'', also
known as ``Where's Wally?'' in some countries. This famous puzzle book has
spread its way across the world and is published in more than 25 different
languages. The idea behind the books is to find the character Waldo, shown
in \Cref{fig:waldo}, in the different pictures in the book. This is,
however, not as easy as it sounds. Every picture in the book is full of tiny
details and Waldo is only one among many. The puzzle is made even harder by
the fact that Waldo is not always fully depicted; sometimes it is just his
head or his torso popping out from behind something else. Lastly, the reason
that even adults will have trouble spotting Waldo is the fact that the
pictures are full of ``red herrings'': things that look like (or are colored
as) Waldo, but are not actually Waldo.
\begin{figure}[ht]
\includegraphics[scale=0.35]{waldo.png}
\centering
\caption{
A headshot of the character Waldo, or Wally. Pictures of Waldo are
copyrighted by Martin Handford and are used under the fair-use policy.
}
\label{fig:waldo}
\end{figure}
The task of finding Waldo relates to many real-life image recognition tasks.
Fields like mining, astronomy, surveillance, radiology, and microbiology
often have to analyse images (or scans) to find the tiniest details,
sometimes undetectable by the human eye. These tasks are especially hard
when the things you are looking for are similar to the rest of the image.
Such tasks are thus generally performed using computers to identify possible
matches.
``Where's Waldo?'' offers us a great tool to study this kind of problem in a
setting that is humanly tangible. In this report we will try to identify
Waldo in the puzzle images using different classification methods. Every
image will be split into different segments and every segment will have to
be classified as either being Waldo or not Waldo. We will compare various
classification methods, from more classical machine learning, like naive
Bayes classifiers, to the current state of the art, Neural Networks. In
\Cref{sec:background} we introduce the different classification methods,
\Cref{sec:method} explains the way in which these methods are trained and
how they are evaluated, \Cref{sec:results} discusses the results, and
\Cref{sec:conclusion} offers our final conclusions.
\section{Background} \label{sec:background}

The classification methods used can be separated into two groups: classical
machine learning methods and neural network architectures. Many of the
classical machine learning algorithms have variations and improvements for
various purposes; however, for this report we will be using only their basic
versions. In contrast, we will use several different neural network
architectures, as neural networks are currently the most widely used method
for image classification.
\subsection{Classical Machine Learning Methods}

The following paragraphs give only brief descriptions of the different
classical machine learning methods used in this report. For further detail
we recommend ``Supervised machine learning: A review of classification
techniques'' \cite{MLReview}.
\paragraph{Naive Bayes Classifier}

\cite{naivebayes} is a classification method based on Bayes' theorem, shown
in \Cref{eq:bayes}. Bayes' theorem allows us to calculate the probability of
an event taking into account prior knowledge of conditions related to the
event in question. In classification this allows us to calculate the
probability that a new instance has a certain class based on its features.
We then assign the class that has the highest probability.
\begin{equation}
\label{eq:bayes}
P(A\mid B)=\frac{P(B\mid A)\,P(A)}{P(B)}
\end{equation}
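Concretely, for an instance with features $x_1, \ldots, x_n$, the naive
Bayes classifier applies \Cref{eq:bayes} under the (naive) assumption that
the features are conditionally independent given the class $c$. Since the
denominator is the same for every class, this yields the decision rule
\begin{equation*}
\hat{c} = \operatorname*{arg\,max}_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c).
\end{equation*}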
\paragraph{$k$-Nearest Neighbors}

($k$-NN) \cite{knn} is one of the simplest machine learning algorithms. It
classifies a new instance based on its ``distance'' to the known instances.
It finds the $k$ closest instances to the new instance and assigns the new
instance the class held by the majority of those $k$ closest instances. The
method has to be configured in several ways: the value of $k$, the distance
measure, and (depending on $k$) a tie-breaking measure all have to be
chosen.
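As a minimal sketch of this configuration, assuming flattened image arrays
and matching label vectors (\texttt{X\_train}, \texttt{y\_train},
\texttt{X\_test}) as produced in \Cref{sec:method}, a $k$-NN classifier
could be set up in Scikit-Learn~\cite{scikit-learn} as follows:

\begin{verbatim}
from sklearn.neighbors import KNeighborsClassifier

# k = 5 neighbours; the Euclidean distance used here is also
# Scikit-Learn's default metric.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X_train, y_train)          # flattened images and labels
predictions = knn.predict(X_test)
\end{verbatim}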
\paragraph{Support Vector Machine}

(SVM) \cite{svm} has been very successful in many classification tasks. The
method is based on finding boundaries between the different classes. The
boundaries are defined as functions on the features of the instances, and
are optimized to leave the largest possible margin between the boundaries
and the training instances on both sides. Originally the boundaries were
linear functions, but more recent developments allow for the training of
non-linear boundaries~\cite{svmnonlinear}. Once the training has defined the
boundaries, new instances are classified according to which side of a
boundary they fall on.
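To illustrate the distinction between linear and non-linear boundaries: in
Scikit-Learn~\cite{scikit-learn} this is a matter of choosing a kernel (a
sketch, not necessarily our benchmark configuration):

\begin{verbatim}
from sklearn.svm import SVC

linear_svm = SVC(kernel="linear")  # linear decision boundary
rbf_svm = SVC(kernel="rbf")        # non-linear boundary via the
                                   # (Gaussian) kernel trick
\end{verbatim}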
\paragraph{Random Forest}

\cite{randomforest} is a method that is based on classification decision
trees. In a decision tree a new instance is classified by going down a
(binary) tree. Each non-leaf node contains a selection criterion for its
branches. Every leaf node contains the class that will be assigned to the
instance if the node is reached. Trained on their own, decision trees have a
tendency to overfit\footnote{Overfitting occurs when a model learns from the
data too specifically, and loses its ability to generalise its predictions
for new data (resulting in a loss of prediction accuracy).}, but in a random
forest a multitude of decision trees is trained with a certain degree of
randomness, and the mean of these trees is used, which avoids this problem.
\subsection{Neural Network Architectures}

There are many well-established architectures for Neural Networks, depending
on the task being performed. In this paper, the focus is placed on
convolutional neural networks, which have been proven to classify images
effectively \cite{NIPS2012_4824}. One of the pioneering works in the field,
the LeNet \cite{726791} architecture, will be implemented to compare against
two rudimentary networks with more depth. These networks have been
constructed to improve on the LeNet architecture by extracting more
features, condensing image information, and allowing for more parameters in
the network. The difference between the two networks lies in their use of
convolutional and dense layers. The Convolutional Neural Network (CNN)
contains dense layers in the final stages of the network, whereas the Fully
Convolutional Network (FCN) contains only one dense layer for the final
binary classification step. The FCN instead uses an extra convolutional
layer, resulting in an increased ability for the network to abstract the
input data relative to the other two configurations. \\
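Purely to illustrate this structural difference (the layer counts and sizes
below are assumptions for the sketch, not the exact configurations used),
the two deeper networks could look as follows in a Keras-style API:

\begin{verbatim}
from tensorflow.keras import layers, models

# CNN: dense layers in the final stages of the network.
cnn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])

# FCN: an extra convolutional layer and only one dense layer
# for the final binary classification step.
fcn = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(128, 3, activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),
])
\end{verbatim}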
\begin{figure}[H]
\includegraphics[scale=0.50]{LeNet}
\centering
\captionsetup{width=0.90\textwidth}
\caption{Representation of the LeNet Neural Network model architecture, including convolutional layers and pooling (subsampling) layers \cite{726791}.}
\label{fig:LeNet}
\end{figure}
\section{Method} \label{sec:method}

In order to effectively utilize the aforementioned modeling and
classification techniques, a key consideration is the data they are acting
on. A dataset containing Waldo and non-Waldo images was obtained from an
Open Database\footnote{``The Open Database License (ODbL) is a license
agreement intended to allow users to freely share, modify, and use [a]
Database while maintaining [the] same freedom for
others''\cite{openData}} hosted on the predictive modeling and analytics
competition framework, Kaggle~\cite{kaggle}. The distinction between images
containing Waldo, and those that do not, was provided by the separation of
the images into different sub-directories. It was therefore necessary to
preprocess these images before they could be utilized by the proposed
machine learning algorithms.
\subsection{Image Processing} \label{imageProcessing}

The Waldo image database consists of images of size 64$\times$64,
128$\times$128, and 256$\times$256 pixels obtained by dividing complete
``Where's Waldo?'' puzzles. Within each set of images, those containing
Waldo are located in a folder called \texttt{waldo}, and those not
containing Waldo, in a folder called \texttt{not\_waldo}. Since ``Where's
Waldo?'' puzzles are usually densely populated and contain fine details, the
64$\times$64 pixel set of images was selected to train and evaluate the
machine learning models. These images provide the added benefit of being the
most numerous of the three size groups. \\
Each of the 64$\times$64 pixel images was inserted into a
NumPy~\cite{numpy} array of images, and a binary value was inserted into a
separate list at the same index. These binary values form the labels for
each image (Waldo or not Waldo). Color normalization was performed on each
image so that artifacts in an image's color profile correspond to meaningful
features of the image (rather than to the photographic method used).\\
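A minimal sketch of this loading and labelling step (the file extension and
the particular normalization, per-channel standardization, are assumptions;
the directory layout is as described above):

\begin{verbatim}
import numpy as np
from pathlib import Path
from PIL import Image

images, labels = [], []
for label, folder in enumerate(["not_waldo", "waldo"]):
    for path in sorted(Path(folder).glob("*.jpg")):
        img = np.asarray(Image.open(path), dtype=np.float64)
        # One possible color normalization: zero mean and unit
        # variance per color channel.
        img = (img - img.mean(axis=(0, 1))) / img.std(axis=(0, 1))
        images.append(img)
        labels.append(label)

X, y = np.stack(images), np.array(labels)
\end{verbatim}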
Each original puzzle is broken down into many images but contains only one
Waldo. Even though Waldo might span multiple 64$\times$64 pixel squares,
this means that the non-Waldo data far outnumbers the Waldo data. To combat
the bias introduced by the skewed data, all Waldo images were artificially
augmented by performing random rotations and reflections, and by introducing
random noise into the image, to produce new images. In this way, each
original Waldo image was used to produce an additional 10 variations, which
were inserted into the image array. This provided more variation in the true
positives of the data set and assisted in the development of more robust
methods by exposing each technique to variations of the image during the
training phase. \\
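A minimal sketch of such an augmentation step with NumPy~\cite{numpy} (here
the random rotations are restricted to multiples of 90 degrees and the noise
scale is an arbitrary choice):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def augment(image, n_variations=10, noise_scale=0.05):
    """Produce randomly rotated, reflected, and noisy copies."""
    variations = []
    for _ in range(n_variations):
        img = np.rot90(image, k=rng.integers(4))   # random rotation
        if rng.random() < 0.5:
            img = np.fliplr(img)                   # random reflection
        img = img + rng.normal(0.0, noise_scale, img.shape)
        variations.append(img)
    return variations
\end{verbatim}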
Despite the additional data, there were still ten times more non-Waldo
images than Waldo images. It was therefore necessary to cull the non-Waldo
data so that there was an even split of Waldo and non-Waldo images,
improving the representation of true positives in the image data set.
Following preprocessing, the images (and associated labels) were divided
into a training and a test set with a 3:1 split. \\
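Assuming the culled image array \texttt{X} and label array \texttt{y}, this
split could be produced with Scikit-Learn (the stratification shown is an
addition to keep the class balance equal in both sets):

\begin{verbatim}
from sklearn.model_selection import train_test_split

# 3:1 split: one quarter of the data is held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
\end{verbatim}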
\subsection{Neural Network Training}\label{nnTraining}

The neural networks used to classify the images were supervised learning
models, requiring training on a dataset of typical images. Each network was
trained using the preprocessed training dataset and labels for 25 epochs (an
epoch being one forward and backward pass over all the data) in batches of
150. The number of epochs was chosen to balance training time against
overfitting of the training data, given the current model parameters. The
batch size is the number of images sent through each pass of the network.
Using the entire dataset in a single batch would train the network quickly,
but decrease the network's ability to learn unique features from the data.
Passing one image at a time may allow the model to learn more about each
image; however, it would also increase the training time and the risk of
overfitting the data. The batch size was therefore chosen to maintain
training accuracy while minimizing training time.
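In a Keras-style API this training procedure amounts to a single call; the
optimizer and loss below are assumptions for the sketch, only the epoch
count and batch size are taken from the text:

\begin{verbatim}
# `model` is one of the architectures (LeNet, CNN, or FCN),
# already built with a Keras-style API.
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=25, batch_size=150)
\end{verbatim}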
\subsection{Neural Network Testing}\label{nnTesting}

\tab After training each network, a separate test set of images (and labels)
was used to evaluate the models. The result of this testing was expressed
primarily in the form of an accuracy (percentage). These results, along with
those of the other methods presented in this paper, are given in
\Cref{tab:results}.
\subsection{Benchmarking}\label{benchmarking}

In order to benchmark the Neural Networks, the performance of these
algorithms is evaluated against other Machine Learning algorithms. We use
Support Vector Machine, $k$-Nearest Neighbors ($k=5$), Naive Bayes, and
Random Forest classifiers, as provided by Scikit-Learn~\cite{scikit-learn}.
\subsection{Performance Metrics}\label{performance-metrics}

To evaluate the performance of the models, we record the time taken by each
model to train on the training data, and the accuracy with which the model
makes predictions. We calculate accuracy as
\[a = \frac{|\text{correct predictions}|}{|\text{predictions}|} = \frac{tp + tn}{tp + tn + fp + fn}\]
where \(tp\) is the number of true positives, \(tn\) is the number of true
negatives, \(fp\) is the number of false positives, and \(fn\) is the number
of false negatives.
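As a sketch of how these two metrics could be recorded for one of the
Scikit-Learn benchmark classifiers (the measurement harness is illustrative,
not the exact code used):

\begin{verbatim}
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

clf = RandomForestClassifier()
start = time.perf_counter()
clf.fit(X_train, y_train)           # flattened image vectors
training_time = time.perf_counter() - start

accuracy = accuracy_score(y_test, clf.predict(X_test))
print(f"accuracy={accuracy:.4f}, training time={training_time:.2f}s")
\end{verbatim}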
\section{Results} \label{sec:results}

The time taken to train each of the neural networks and traditional
approaches was measured and recorded alongside their accuracy (evaluated
using a separate test dataset) in \Cref{tab:results}.
\begin{table}[H]
\centering
\renewcommand{\arraystretch}{1.5} % Adds some space to the table
\begin{tabular}{|c|c|c|}
\hline
\textbf{Method} & \textbf{Test Accuracy} & \textbf{Training Time (s)}\\
\hline
LeNet & 89.81\% & 58.13\\
\hline
CNN & \textbf{95.63\%} & 113.81\\
\hline
FCN & 94.66\% & 117.69\\
\hline
Support Vector Machine & 83.50\% & 5.90\\
\hline
$k$-Nearest Neighbors & 67.96\% & 0.22\\
\hline
Gaussian Naive Bayes & 85.44\% & \textbf{0.15}\\
\hline
Random Forest & 92.23\% & 0.92\\
\hline
\end{tabular}
\captionsetup{width=0.80\textwidth}
\caption{Comparison of the accuracy and training time of each neural
network and traditional machine learning technique.}
\label{tab:results}
\end{table}
\par
We can see in these results that the deep neural networks outperform our
benchmark classification models in terms of the accuracy they achieve.
However, the time required to train these networks is significantly greater.
An additional consideration is the extra layer of abstraction present in the
FCN but not in the CNN. This may indicate that the FCN can achieve better
accuracies, given more training time (epochs). \\

\par
Of the benchmark classifiers, we see the best performance with Random Forest
and the worst performance with $k$-Nearest Neighbors. This is consistent
with the models' abilities to learn hidden relationships within the data
(and $k$-Nearest Neighbors' lack thereof). The accuracy of the random forest
approach was unexpected, as neural networks have previously had success in
image classification tasks. However, this may be due to the random forest
method's ability to avoid overfitting. The low training time of the
classical methods could be due to their requiring only one pass over the
data to train the model. Neural networks require more passes the more they
abstract the data (e.g., through convolutions).
\section{Conclusion} \label{sec:conclusion}

Images from the ``Where's Waldo?'' puzzle books are ideal for testing image
classification techniques. Their abundance of hidden objects and ``red
herrings'' makes them challenging to classify, but because they are drawings
they remain tangible for the human eye.

In our experiments we show that, given unspecialized methods, Neural
Networks perform best on this kind of image classification task. No matter
which architecture is used, their accuracy is very high. One has to note,
though, that Random Forest performed surprisingly well, coming close to the
performance of the better Neural Networks. Especially when training time is
taken into account, it is the clear winner.

It would be interesting to investigate several of these methods further.
There might be quite a lot of ground that could be gained by using
specialized variants of these classification algorithms.
\clearpage % Ensures that the references are on a separate page
\bibliographystyle{alpha}
\bibliography{references}

\end{document}