\documentclass[a4paper]{article}

% To compile PDF run: latexmk -pdf {filename}.tex

\usepackage{graphicx} % Used to insert images into the paper
\usepackage{float}

\usepackage[justification=centering]{caption} % Used for captions
\captionsetup[figure]{font=small} % Makes figure captions small

\newcommand\tab[1][0.5cm]{\hspace*{#1}} % Defines a new command to use 'tab' in text

\usepackage[comma, numbers]{natbib} % Used for the bibliography
\usepackage{amsmath} % Math package

% Link the arguments of \cref{}, \ref{}, \cite{}, ... so that a reader can click on the number and jump to the target in the document
\usepackage{hyperref}

% Enable \cref{...} and \Cref{...} instead of \ref: the type of reference is included in the link
\usepackage[capitalise,nameinlink]{cleveref}

% UTF-8 encoding
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc} % Support umlauts in the input

% Easier compilation
\usepackage{bookmark}

\usepackage{xcolor}
\newcommand{\todo}[1]{\marginpar{{\textsf{TODO}}}{\textbf{\color{red}[#1]}}}

\begin{document}

\title{What is Waldo?}
\author{Kelvin Davis \and Jip J. Dekker \and Anthony Silvestere}
\maketitle

\begin{abstract}
The famous brand of picture puzzles ``Where's Waldo?'' relates well to many
unsolved image classification problems. This offers us the opportunity to
test different image classification methods on a data set that is both small
enough to compute in a reasonable time span and easy for humans to
understand. In this report we compare the well-known machine learning
methods Naive Bayes, Support Vector Machines, $k$-Nearest Neighbors, and
Random Forest against the neural network architectures LeNet, a deeper
Convolutional Neural Network, and a Fully Convolutional Neural Network.
\todo{I don't like this big summation but I think it is the important
information}
Our comparison shows that \todo{...}
\end{abstract}

\section{Introduction}

Almost every child around the world knows about ``Where's Waldo?'', also
known as ``Where's Wally?'' in some countries. This famous puzzle book has
spread across the world and is published in more than 25 different
languages. The idea behind the books is to find the character ``Waldo'',
shown in \Cref{fig:waldo}, in the different pictures in the book. This is,
however, not as easy as it sounds. Every picture in the book is full of tiny
details, and Waldo is only one among many. The puzzle is made even harder by
the fact that Waldo is not always fully depicted; sometimes only his head or
his torso pops out from behind something else. Lastly, the reason that even
adults have trouble spotting Waldo is the fact that the pictures are full of
``red herrings'': things that look like (or are colored as) Waldo, but are
not actually Waldo.

\begin{figure}[ht]
\centering
\includegraphics[scale=0.35]{waldo.png}
\caption{
A headshot of the character ``Waldo'', or ``Wally''. Pictures of Waldo
are copyrighted by Martin Handford and are used under the fair-use policy.
}
\label{fig:waldo}
\end{figure}

The task of finding Waldo relates to many real-life image recognition
tasks. Fields like mining, astronomy, surveillance, radiology, and
microbiology often have to analyse images (or scans) to find the tiniest
details, sometimes undetectable by the human eye. These tasks are
especially hard when the things you are looking for are similar to the rest
of the image. Such tasks are thus generally performed using computers to
identify possible matches.

``Where's Waldo?'' offers us a great tool to study this kind of problem in a
setting that is humanly tangible. In this report we will try to identify
Waldo in the puzzle images using different classification methods. Every
image will be split into segments and every segment will have to be
classified as either ``Waldo'' or ``not Waldo''. We will compare various
classification methods, from the more classical machine learning methods,
like naive Bayes classifiers, to the current state of the art, neural
networks. In \Cref{sec:background} we will introduce the different
classification methods, \Cref{sec:method} will explain the way in which
these methods are trained and how they will be evaluated,
\Cref{sec:results} will discuss the results, and \Cref{sec:conclusion} will
offer our final conclusions.

\section{Background} \label{sec:background}

The classification methods used can be separated into two groups: classical
machine learning methods and neural network architectures. Many of the
classical machine learning algorithms have variations and improvements for
various purposes; however, for this report we will be using only their
basic versions. In contrast, we will use several different neural network
architectures, as this approach is currently the most used for image
classification.

\subsection{Classical Machine Learning Methods}

The following paragraphs give only brief descriptions of the different
classical machine learning methods used in this report. For further reading
we recommend ``Supervised machine learning: A review of classification
techniques'' \cite{MLReview}.

\paragraph{Naive Bayes Classifier}

\cite{naivebayes} is a classification method based on Bayes' theorem, shown
in \Cref{eq:bayes}. Bayes' theorem allows us to calculate the probability
of an event taking into account prior knowledge of conditions related to
the event in question. In classification this allows us to calculate the
probability that a new instance belongs to a certain class based on its
features. We then assign the class that has the highest probability.

\begin{equation}
\label{eq:bayes}
P(A\mid B)=\frac {P(B\mid A)\,P(A)}{P(B)}
\end{equation}

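As a small illustration, the following sketch applies this rule with
Scikit-Learn's Gaussian Naive Bayes classifier; the feature matrix
\texttt{X} and labels \texttt{y} are hypothetical stand-ins, not our Waldo
data.

\begin{verbatim}
# Minimal Gaussian Naive Bayes sketch on hypothetical 2-feature data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])  # features
y = np.array([1, 1, 0, 0])                                      # class labels

model = GaussianNB()
model.fit(X, y)

# P(class | features) for a new instance, via Bayes' theorem;
# the class with the highest posterior probability is assigned.
print(model.predict_proba([[0.85, 0.15]]))
print(model.predict([[0.85, 0.15]]))
\end{verbatim}
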
\paragraph{$k$-Nearest Neighbors}

($k$-NN) \cite{knn} is one of the simplest machine learning algorithms. It
classifies a new instance based on its ``distance'' to the known instances.
It finds the $k$ closest instances to the new instance and assigns the new
instance the class held by the majority of those $k$ closest instances. The
method has to be configured in several ways: the value of $k$, the distance
measure, and (depending on $k$) a tie-breaking measure all have to be
chosen.

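A minimal sketch of these configuration choices, using Scikit-Learn's
$k$-NN implementation on hypothetical data:

\begin{verbatim}
# k-NN sketch: the value of k and the distance measure are the main
# configuration choices (the data here is hypothetical).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])

# k = 3 neighbours with Euclidean distance.
model = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
model.fit(X, y)  # k-NN merely stores the instances; there is no real "training"
print(model.predict([[0.2, 0.1]]))  # majority class among the 3 nearest
\end{verbatim}
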
\paragraph{Support Vector Machine}

(SVM) \cite{svm} has been very successful in many classification tasks. The
method is based on finding boundaries between the different classes. The
boundaries are defined as functions on the features of the instances, and
are optimized to leave the largest possible margin between the boundary and
the training instances on both sides. Originally the boundaries were linear
functions, but more recent developments allow for the training of
non-linear boundaries~\cite{svmnonlinear}. Once the training has defined
the boundaries, new instances are classified according to the side of the
boundary on which they fall.

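The following sketch contrasts a linear boundary with a non-linear (kernel)
boundary using Scikit-Learn's SVM implementation; the data is hypothetical.

\begin{verbatim}
# SVM sketch: a linear boundary versus a non-linear (RBF kernel) boundary.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])

linear = SVC(kernel="linear").fit(X, y)  # maximum-margin linear boundary
nonlin = SVC(kernel="rbf").fit(X, y)     # kernel trick: non-linear boundary

# New instances are classified by the side of the boundary they fall on.
print(linear.predict([[0.2, 0.3]]), nonlin.predict([[0.2, 0.3]]))
\end{verbatim}
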
\paragraph{Random Forest}

\cite{randomforest} is a method based on classification decision trees. In
a decision tree a new instance is classified by going down a (binary) tree.
Each non-leaf node contains a selection criterion over its branches, and
every leaf node contains the class that will be assigned to the instance if
that node is reached. With other training methods, decision trees have a
tendency to overfit, but in a random forest a multitude of decision trees
is trained with a certain degree of randomness and the consensus of these
trees is used, which avoids this problem.

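A minimal sketch of a random forest in Scikit-Learn, again on hypothetical
data:

\begin{verbatim}
# Random Forest sketch: many randomised decision trees vote on the class.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])

# 100 trees, each trained on a bootstrap sample with random feature
# subsets; the ensemble consensus counters the overfitting of single trees.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([[0.2, 0.1]]))
\end{verbatim}
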
\subsection{Neural Network Architectures}

There are many well established architectures for neural networks,
depending on the task being performed. In this paper, the focus is placed
on convolutional neural networks, which have been proven to effectively
classify images \cite{NIPS2012_4824}. One of the pioneering works in the
field, the LeNet architecture~\cite{726791}, will be implemented to compare
against two rudimentary networks with more depth. These networks have been
constructed to improve on the LeNet architecture by extracting more
features, condensing image information, and allowing for more parameters in
the network. The difference between the two deeper networks lies in their
use of convolutional and dense layers. The convolutional neural network
contains dense layers in the final stages of the network, whereas the Fully
Convolutional Network (FCN) contains only one dense layer for the final
binary classification step. The FCN instead includes an extra convolutional
layer, resulting in an increased ability for the network to abstract the
input data relative to the other two configurations.

\todo{Insert image of LeNet from slides}

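As a concrete reference point, the sketch below builds a LeNet-style
network in Keras; the framework choice, layer sizes, and activations here
are illustrative assumptions rather than our exact configuration. The two
deeper variants differ mainly in swapping final dense layers for
convolutional ones.

\begin{verbatim}
# LeNet-style CNN sketch (layer sizes and activations are assumptions).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(6, (5, 5), activation="tanh", input_shape=(64, 64, 3)),
    layers.AveragePooling2D((2, 2)),
    layers.Conv2D(16, (5, 5), activation="tanh"),
    layers.AveragePooling2D((2, 2)),
    layers.Flatten(),                       # dense layers in the final stages
    layers.Dense(120, activation="tanh"),
    layers.Dense(84, activation="tanh"),
    layers.Dense(1, activation="sigmoid"),  # binary: Waldo / not Waldo
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
\end{verbatim}
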
\section{Method} \label{sec:method}

In order to effectively utilize the aforementioned modeling and
classification techniques, a key consideration is the data they are acting
on. A dataset containing Waldo and non-Waldo images was obtained from an
open database\footnote{``The Open Database License (ODbL) is a license
agreement intended to allow users to freely share, modify, and use [a]
Database while maintaining [the] same freedom for
others''~\cite{openData}} hosted on the predictive modeling and analytics
competition framework Kaggle. The distinction between images containing
Waldo, and those that do not, was provided by the separation of the images
into different sub-directories. It was therefore necessary to preprocess
these images before they could be utilized by the proposed machine learning
algorithms.

\subsection{Image Processing} \label{imageProcessing}

The Waldo image database consists of images of size 64$\times$64,
128$\times$128, and 256$\times$256 pixels obtained by dividing complete
Where's Waldo? puzzles. Within each set of images, those containing Waldo
are located in a folder called `waldo', and those not containing Waldo in a
folder called `not\_waldo'. Since Where's Waldo? puzzles are usually densely
populated and contain fine details, the 64$\times$64 pixel set of images
was selected to train and evaluate the machine learning models. These
images provide the added benefit of being the most numerous of the three
size groups. \\

Each of the 64$\times$64 pixel images was inserted into a
Numpy~\cite{numpy} array of images, and a binary value was inserted into a
separate list at the same index. These binary values form the labels for
each image (Waldo or not Waldo). Colour normalisation was performed on each
image so that artefacts in an image's colour profile correspond to
meaningful features of the image (rather than to the photographic method).\\

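A sketch of this preprocessing step is given below; the folder layout
follows the dataset description, while the library choices (PIL for
reading, per-channel standardisation for the colour normalisation) are
assumptions.

\begin{verbatim}
# Sketch: read the 'waldo' / 'not_waldo' folders into a Numpy array with a
# parallel list of binary labels, then colour-normalise per channel.
import os
import numpy as np
from PIL import Image

images, labels = [], []
for label, folder in [(1, "64/waldo"), (0, "64/not_waldo")]:
    for name in os.listdir(folder):
        img = np.asarray(Image.open(os.path.join(folder, name)), dtype=float)
        images.append(img)
        labels.append(label)

images = np.stack(images)
# Colour normalisation: zero mean, unit variance per colour channel, so
# colour artefacts reflect image content rather than photographic method.
mean = images.mean(axis=(0, 1, 2), keepdims=True)
std = images.std(axis=(0, 1, 2), keepdims=True)
images = (images - mean) / std
labels = np.array(labels)
\end{verbatim}
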
Each original puzzle is broken down into many images but contains only one
Waldo. Even though Waldo might span multiple 64$\times$64 pixel squares,
the non-Waldo data far outnumbers the Waldo data. To combat the bias
introduced by the skewed data, all Waldo images were artificially augmented
by performing random rotations and reflections, and by introducing random
noise, to produce new images. In this way, each original Waldo image was
used to produce an additional 10 variations, which were inserted into the
image array. This provided more variation in the true positives of the data
set and assists in the development of more robust methods by exposing each
technique to variations of the image during the training phase. \\

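A sketch of such an augmentation routine follows; the specific
transformation parameters are assumptions consistent with the description
above.

\begin{verbatim}
# Augmentation sketch: each Waldo image yields 10 extra variants via
# random rotations, reflections, and noise.
import numpy as np

rng = np.random.default_rng(0)

def augment(image, n_variants=10, noise_scale=0.05):
    variants = []
    for _ in range(n_variants):
        img = np.rot90(image, k=rng.integers(0, 4))        # random rotation
        if rng.random() < 0.5:
            img = np.fliplr(img)                           # random reflection
        img = img + rng.normal(0, noise_scale, img.shape)  # random noise
        variants.append(img)
    return variants
\end{verbatim}
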
Despite the additional data, there were still over ten times as many
non-Waldo images as Waldo images. It was therefore necessary to cull the
non-Waldo data so that there was an even split of Waldo and non-Waldo
images, improving the representation of true positives in the image data
set. Following preprocessing, the images (and associated labels) were
divided into a training and a test set with a 3:1 split. \\

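The balancing and splitting steps might look as follows; this is a sketch,
with \texttt{images} and \texttt{labels} coming from the earlier
preprocessing sketch and the variable names being illustrative.

\begin{verbatim}
# Sketch: cull non-Waldo images to an even split, then split data 3:1.
import numpy as np
from sklearn.model_selection import train_test_split

waldo = images[labels == 1]
not_waldo = images[labels == 0]
keep = np.random.default_rng(0).choice(len(not_waldo), size=len(waldo),
                                       replace=False)
X = np.concatenate([waldo, not_waldo[keep]])
y = np.concatenate([np.ones(len(waldo)), np.zeros(len(waldo))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # 3:1 train/test split
\end{verbatim}
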
\subsection{Neural Network Training}\label{nnTraining}

The neural networks used to classify the images were supervised learning
models, requiring training on a dataset of typical images. Each network was
trained using the preprocessed training dataset and labels for 25 epochs
(one epoch being a forward and backward pass of all data) in batches of
150. The number of epochs was chosen to give the networks sufficient
training passes while preventing overfitting\footnote{Overfitting occurs
when a model learns from the data too specifically, and loses its ability
to generalise its predictions for new data (resulting in a loss of
prediction accuracy).} of the training data, given the current model
parameters. The batch size is the number of images sent through each pass
of the network. Using the entire dataset in each pass would train the
network quickly, but decrease the network's ability to learn unique
features from the data. Passing one image at a time may allow the model to
learn more about each image, but it would also increase the training time
and the risk of overfitting the data. The batch size was therefore chosen
to maintain training accuracy while minimising training time.

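In Keras terms (assuming a compiled model such as the LeNet sketch in
\Cref{sec:background}), this training step reduces to:

\begin{verbatim}
# Train for 25 epochs in batches of 150 images, as described above
# (a sketch; assumes the model and data split from the earlier sketches).
history = model.fit(X_train, y_train, epochs=25, batch_size=150)
\end{verbatim}
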
\subsection{Neural Network Testing}\label{nnTesting}

After training each network, a separate test set of images (and labels) was
used to evaluate the models. The result of this testing was expressed
primarily in the form of an accuracy (percentage). These results, as well
as those of the other methods presented in this paper, are given in Figure
\todo{insert ref to results here} of the Results section.
\todo{***********}

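For the networks, this evaluation amounts to the following sketch, assuming
the Keras model and test split from the earlier sketches:

\begin{verbatim}
# Evaluate the trained network on the held-out test set.
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {accuracy:.2%}")
\end{verbatim}
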
\subsection{Benchmarking}\label{benchmarking}

In order to benchmark the neural networks, the performance of these
algorithms is evaluated against other machine learning algorithms. We use
Support Vector Machine, $K$-Nearest Neighbors (\(K=5\)), Naive Bayes, and
Random Forest classifiers, as provided in Scikit-Learn~\cite{scikit-learn}.

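A sketch of this baseline setup follows; flattening the image arrays into
feature vectors is an assumption, since these Scikit-Learn classifiers
expect two-dimensional input.

\begin{verbatim}
# The four Scikit-Learn baselines used for benchmarking.
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X_train_flat = X_train.reshape(len(X_train), -1)
X_test_flat = X_test.reshape(len(X_test), -1)

baselines = {
    "SVM": SVC(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),  # K = 5
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(),
}
for name, clf in baselines.items():
    clf.fit(X_train_flat, y_train)
    print(name, clf.score(X_test_flat, y_test))
\end{verbatim}
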
\subsection{Performance Metrics}\label{performance-metrics}

To evaluate the performance of the models, we record the time taken by each
model to train on the training data, as well as statistics about the
predictions the models make on the test data. These prediction statistics
include:

\begin{itemize}
\item
\textbf{Accuracy:}
\[a = \dfrac{|correct\ predictions|}{|predictions|} = \dfrac{tp + tn}{tp + tn + fp + fn}\]
\item
\textbf{Precision:}
\[p = \dfrac{|Waldo\ predicted\ as\ Waldo|}{|predicted\ as\ Waldo|} = \dfrac{tp}{tp + fp}\]
\item
\textbf{Recall:}
\[r = \dfrac{|Waldo\ predicted\ as\ Waldo|}{|actually\ Waldo|} = \dfrac{tp}{tp + fn}\]
\item
\textbf{F1 Measure:} \[f1 = \dfrac{2pr}{p + r}\] where \(tp\) is the
number of true positives, \(tn\) is the number of true negatives,
\(fp\) is the number of false positives, and \(fn\) is the number of
false negatives.
\end{itemize}

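These four metrics can be computed directly with Scikit-Learn; in the
sketch below, \texttt{y\_test} and the fitted classifier come from the
earlier sketches.

\begin{verbatim}
# Compute the four performance metrics for one classifier's predictions.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_pred = clf.predict(X_test_flat)
print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("f1:       ", f1_score(y_test, y_pred))
\end{verbatim}
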
\emph{Accuracy} is a common performance metric used in machine learning;
however, in classification problems where the training data is heavily
biased toward one category, a model will sometimes learn to optimize its
accuracy by classifying all instances as one category. That is, the
classifier will classify all images that do not contain Waldo as not
containing Waldo, but will also classify all images containing Waldo as not
containing Waldo. Thus we use other metrics to measure performance as well.

\emph{Precision} returns the percentage of instances classified as Waldo
that are actually Waldo. \emph{Recall} returns the percentage of actual
Waldos that were predicted as Waldo. In the case of a classifier that
classifies nothing as Waldo, the recall would be 0. The \emph{F1-Measure}
is a combination of precision and recall that heavily penalizes classifiers
that perform poorly in either precision or recall.
% Kelvin End

\section{Results} \label{sec:results}

\section{Conclusion} \label{sec:conclusion}

\clearpage % Ensures that the references are on a separate page
\pagebreak
\bibliographystyle{alpha}
\bibliography{references}

\end{document}