\documentclass[a4paper]{article}

% To compile PDF run: latexmk -pdf {filename}.tex

\usepackage{graphicx} % Used to insert images into the paper
\usepackage{float}

\usepackage[justification=centering]{caption} % Used for captions
\captionsetup[figure]{font=small} % Makes figure captions small

\newcommand\tab[1][0.5cm]{\hspace*{#1}} % Defines a new command to use 'tab' in text

\usepackage[comma, numbers]{natbib} % Used for the bibliography
\usepackage{amsmath} % Math package

% Link the arguments of \cref{}, \ref{}, \cite{}, ... so that a reader can click on the number and jump to the target in the document
\usepackage{hyperref}

% Enable \cref{...} and \Cref{...} instead of \ref: the type of reference is included in the link
\usepackage[capitalise,nameinlink]{cleveref}

% UTF-8 encoding
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc} % Support umlauts in the input

% Easier compilation
\usepackage{bookmark}

\usepackage{xcolor}
\newcommand{\todo}[1]{\marginpar{{\textsf{TODO}}}{\textbf{\color{red}[#1]}}}

\begin{document}

\title{What is Waldo?}
\author{Kelvin Davis \and Jip J. Dekker \and Anthony Silvestere}
\maketitle

\begin{abstract}
The famous brand of picture puzzles ``Where's Waldo?'' relates well to many
unsolved image classification problems. This offers us the opportunity to
test different image classification methods on a data set that is both small
enough to compute in a reasonable time span and easy for humans to
understand. In this report we compare the well-known machine learning
methods Naive Bayes, Support Vector Machines, $k$-Nearest Neighbors, and
Random Forest against the neural network architectures LeNet, a deeper
Convolutional Neural Network, and a Fully Convolutional Neural Network.
\todo{I don't like this big summation but I think it is the important
information}
Our comparison shows that \todo{...}
\end{abstract}

\section{Introduction}

Almost every child around the world knows about ``Where's Waldo?'', also
known as ``Where's Wally?'' in some countries. This famous puzzle book has
spread across the world and is published in more than 25 different
languages. The idea behind the books is to find the character ``Waldo'',
shown in \Cref{fig:waldo}, in the different pictures in the book. This is,
however, not as easy as it sounds. Every picture in the book is full of tiny
details, and Waldo is only one among many. The puzzle is made even harder by
the fact that Waldo is not always fully depicted; sometimes only his head or
his torso pops out from behind something else. Lastly, the reason that even
adults have trouble spotting Waldo is the fact that the pictures are full of
``red herrings'': things that look like (or are colored as) Waldo, but are
not actually Waldo.

\begin{figure}[ht]
\centering
\includegraphics[scale=0.35]{waldo.png}
\caption{
A headshot of the character ``Waldo'', or ``Wally''. Pictures of Waldo
are copyrighted by Martin Handford and are used under the fair-use policy.
}
\label{fig:waldo}
\end{figure}

The task of finding Waldo relates to many real-life image recognition
tasks. Fields like mining, astronomy, surveillance, radiology, and
microbiology often have to analyse images (or scans) to find the tiniest
details, sometimes undetectable by the human eye. These tasks are
especially hard when the things you are looking for are similar to the rest
of the image. Such tasks are thus generally performed using computers to
identify possible matches.

``Where's Waldo?'' offers us a great tool to study this kind of problem in a
setting that is humanly tangible. In this report we will try to identify
Waldo in the puzzle images using different classification methods. Every
image will be split into segments and every segment will have to be
classified as either ``Waldo'' or ``not Waldo''. We will compare various
classification methods, from the more classical machine learning methods,
like naive Bayes classifiers, to the current state of the art, neural
networks. In \Cref{sec:background} we will introduce the different
classification methods, \Cref{sec:method} will explain the way in which
these methods are trained and how they will be evaluated,
\Cref{sec:results} will discuss the results, and \Cref{sec:conclusion} will
offer our final conclusions.

\section{Background} \label{sec:background}

The classification methods used can be separated into two groups: classical
machine learning methods and neural network architectures. Many of the
classical machine learning algorithms have variations and improvements for
various purposes; however, for this report we will be using only their
basic versions. In contrast, we will use several different neural network
architectures, as this approach is currently the most used for image
classification.

\subsection{Classical Machine Learning Methods}

The following paragraphs give only brief descriptions of the different
classical machine learning methods used in this report. For further reading
we recommend ``Supervised machine learning: A review of classification
techniques'' \cite{MLReview}.

\paragraph{Naive Bayes Classifier}

\cite{naivebayes} is a classification method based on Bayes' theorem, shown
in \Cref{eq:bayes}. Bayes' theorem allows us to calculate the probability
of an event taking into account prior knowledge of conditions related to
the event in question. In classification this allows us to calculate the
probability that a new instance belongs to a certain class based on its
features. We then assign the class that has the highest probability.

\begin{equation}
\label{eq:bayes}
P(A\mid B)=\frac {P(B\mid A)\,P(A)}{P(B)}
\end{equation}

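As a small illustration, the following sketch applies this rule with
Scikit-Learn's Gaussian Naive Bayes classifier; the feature matrix
\texttt{X} and labels \texttt{y} are hypothetical stand-ins, not our Waldo
data.

\begin{verbatim}
# Minimal Gaussian Naive Bayes sketch on hypothetical 2-feature data.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]])  # features
y = np.array([1, 1, 0, 0])                                      # class labels

model = GaussianNB()
model.fit(X, y)

# P(class | features) for a new instance, via Bayes' theorem;
# the class with the highest posterior probability is assigned.
print(model.predict_proba([[0.85, 0.15]]))
print(model.predict([[0.85, 0.15]]))
\end{verbatim}
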
\paragraph{$k$-Nearest Neighbors}

($k$-NN) \cite{knn} is one of the simplest machine learning algorithms. It
classifies a new instance based on its ``distance'' to the known instances.
It finds the $k$ closest instances to the new instance and assigns the new
instance the class held by the majority of those $k$ closest instances. The
method has to be configured in several ways: the value of $k$, the distance
measure, and (depending on $k$) a tie-breaking measure all have to be
chosen.

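A minimal sketch of these configuration choices, using Scikit-Learn's
$k$-NN implementation on hypothetical data:

\begin{verbatim}
# k-NN sketch: the value of k and the distance measure are the main
# configuration choices (the data here is hypothetical).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])

# k = 3 neighbours with Euclidean distance.
model = KNeighborsClassifier(n_neighbors=3, metric="euclidean")
model.fit(X, y)  # k-NN merely stores the instances; there is no real "training"
print(model.predict([[0.2, 0.1]]))  # majority class among the 3 nearest
\end{verbatim}
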
\paragraph{Support Vector Machine}

(SVM) \cite{svm} has been very successful in many classification tasks. The
method is based on finding boundaries between the different classes. The
boundaries are defined as functions on the features of the instances, and
are optimized to leave the largest possible margin between the boundary and
the training instances on both sides. Originally the boundaries were linear
functions, but more recent developments allow for the training of
non-linear boundaries~\cite{svmnonlinear}. Once the training has defined
the boundaries, new instances are classified according to the side of the
boundary on which they fall.

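The following sketch contrasts a linear boundary with a non-linear (kernel)
boundary using Scikit-Learn's SVM implementation; the data is hypothetical.

\begin{verbatim}
# SVM sketch: a linear boundary versus a non-linear (RBF kernel) boundary.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])

linear = SVC(kernel="linear").fit(X, y)  # maximum-margin linear boundary
nonlin = SVC(kernel="rbf").fit(X, y)     # kernel trick: non-linear boundary

# New instances are classified by the side of the boundary they fall on.
print(linear.predict([[0.2, 0.3]]), nonlin.predict([[0.2, 0.3]]))
\end{verbatim}
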
\paragraph{Random Forest}

\cite{randomforest} is a method based on classification decision trees. In
a decision tree a new instance is classified by going down a (binary) tree.
Each non-leaf node contains a selection criterion over its branches, and
every leaf node contains the class that will be assigned to the instance if
that node is reached. With other training methods, decision trees have a
tendency to overfit, but in a random forest a multitude of decision trees
is trained with a certain degree of randomness and the consensus of these
trees is used, which avoids this problem.

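A minimal sketch of a random forest in Scikit-Learn, again on hypothetical
data:

\begin{verbatim}
# Random Forest sketch: many randomised decision trees vote on the class.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])

# 100 trees, each trained on a bootstrap sample with random feature
# subsets; the ensemble consensus counters the overfitting of single trees.
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(model.predict([[0.2, 0.1]]))
\end{verbatim}
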
\subsection{Neural Network Architectures}

There are many well established architectures for neural networks,
depending on the task being performed. In this paper, the focus is placed
on convolutional neural networks, which have been proven to effectively
classify images \cite{NIPS2012_4824}. One of the pioneering works in the
field, the LeNet architecture~\cite{726791}, will be implemented to compare
against two rudimentary networks with more depth. These networks have been
constructed to improve on the LeNet architecture by extracting more
features, condensing image information, and allowing for more parameters in
the network. The difference between the two deeper networks lies in their
use of convolutional and dense layers. The convolutional neural network
contains dense layers in the final stages of the network, whereas the Fully
Convolutional Network (FCN) contains only one dense layer for the final
binary classification step. The FCN instead includes an extra convolutional
layer, resulting in an increased ability for the network to abstract the
input data relative to the other two configurations.

\todo{Insert image of LeNet from slides}

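As a concrete reference point, the sketch below builds a LeNet-style
network in Keras; the framework choice, layer sizes, and activations here
are illustrative assumptions rather than our exact configuration. The two
deeper variants differ mainly in swapping final dense layers for
convolutional ones.

\begin{verbatim}
# LeNet-style CNN sketch (layer sizes and activations are assumptions).
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(6, (5, 5), activation="tanh", input_shape=(64, 64, 3)),
    layers.AveragePooling2D((2, 2)),
    layers.Conv2D(16, (5, 5), activation="tanh"),
    layers.AveragePooling2D((2, 2)),
    layers.Flatten(),                       # dense layers in the final stages
    layers.Dense(120, activation="tanh"),
    layers.Dense(84, activation="tanh"),
    layers.Dense(1, activation="sigmoid"),  # binary: Waldo / not Waldo
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
\end{verbatim}
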
\section{Method} \label{sec:method}

In order to effectively utilize the aforementioned modeling and
classification techniques, a key consideration is the data they are acting
on. A dataset containing Waldo and non-Waldo images was obtained from an
open database\footnote{``The Open Database License (ODbL) is a license
agreement intended to allow users to freely share, modify, and use [a]
Database while maintaining [the] same freedom for
others''~\cite{openData}} hosted on the predictive modeling and analytics
competition framework Kaggle. The distinction between images containing
Waldo, and those that do not, was provided by the separation of the images
into different sub-directories. It was therefore necessary to preprocess
these images before they could be utilized by the proposed machine learning
algorithms.

\subsection{Image Processing} \label{imageProcessing}

The Waldo image database consists of images of size 64$\times$64,
128$\times$128, and 256$\times$256 pixels obtained by dividing complete
Where's Waldo? puzzles. Within each set of images, those containing Waldo
are located in a folder called `waldo', and those not containing Waldo in a
folder called `not\_waldo'. Since Where's Waldo? puzzles are usually densely
populated and contain fine details, the 64$\times$64 pixel set of images
was selected to train and evaluate the machine learning models. These
images provide the added benefit of being the most numerous of the three
size groups. \\

Each of the 64$\times$64 pixel images was inserted into a
Numpy~\cite{numpy} array of images, and a binary value was inserted into a
separate list at the same index. These binary values form the labels for
each image (Waldo or not Waldo). Colour normalisation was performed on each
image so that artefacts in an image's colour profile correspond to
meaningful features of the image (rather than to the photographic method).\\

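A sketch of this preprocessing step is given below; the folder layout
follows the dataset description, while the library choices (PIL for
reading, per-channel standardisation for the colour normalisation) are
assumptions.

\begin{verbatim}
# Sketch: read the 'waldo' / 'not_waldo' folders into a Numpy array with a
# parallel list of binary labels, then colour-normalise per channel.
import os
import numpy as np
from PIL import Image

images, labels = [], []
for label, folder in [(1, "64/waldo"), (0, "64/not_waldo")]:
    for name in os.listdir(folder):
        img = np.asarray(Image.open(os.path.join(folder, name)), dtype=float)
        images.append(img)
        labels.append(label)

images = np.stack(images)
# Colour normalisation: zero mean, unit variance per colour channel, so
# colour artefacts reflect image content rather than photographic method.
mean = images.mean(axis=(0, 1, 2), keepdims=True)
std = images.std(axis=(0, 1, 2), keepdims=True)
images = (images - mean) / std
labels = np.array(labels)
\end{verbatim}
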
Each original puzzle is broken down into many images but contains only one
Waldo. Even though Waldo might span multiple 64$\times$64 pixel squares,
the non-Waldo data far outnumbers the Waldo data. To combat the bias
introduced by the skewed data, all Waldo images were artificially augmented
by performing random rotations and reflections, and by introducing random
noise, to produce new images. In this way, each original Waldo image was
used to produce an additional 10 variations, which were inserted into the
image array. This provided more variation in the true positives of the data
set and assists in the development of more robust methods by exposing each
technique to variations of the image during the training phase. \\

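A sketch of such an augmentation routine follows; the specific
transformation parameters are assumptions consistent with the description
above.

\begin{verbatim}
# Augmentation sketch: each Waldo image yields 10 extra variants via
# random rotations, reflections, and noise.
import numpy as np

rng = np.random.default_rng(0)

def augment(image, n_variants=10, noise_scale=0.05):
    variants = []
    for _ in range(n_variants):
        img = np.rot90(image, k=rng.integers(0, 4))        # random rotation
        if rng.random() < 0.5:
            img = np.fliplr(img)                           # random reflection
        img = img + rng.normal(0, noise_scale, img.shape)  # random noise
        variants.append(img)
    return variants
\end{verbatim}
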
Despite the additional data, there were still over ten times as many
non-Waldo images as Waldo images. It was therefore necessary to cull the
non-Waldo data so that there was an even split of Waldo and non-Waldo
images, improving the representation of true positives in the image data
set. Following preprocessing, the images (and associated labels) were
divided into a training and a test set with a 3:1 split. \\

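The balancing and splitting steps might look as follows; this is a sketch,
with \texttt{images} and \texttt{labels} coming from the earlier
preprocessing sketch and the variable names being illustrative.

\begin{verbatim}
# Sketch: cull non-Waldo images to an even split, then split data 3:1.
import numpy as np
from sklearn.model_selection import train_test_split

waldo = images[labels == 1]
not_waldo = images[labels == 0]
keep = np.random.default_rng(0).choice(len(not_waldo), size=len(waldo),
                                       replace=False)
X = np.concatenate([waldo, not_waldo[keep]])
y = np.concatenate([np.ones(len(waldo)), np.zeros(len(waldo))])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)  # 3:1 train/test split
\end{verbatim}
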
\subsection{Neural Network Training}\label{nnTraining}

The neural networks used to classify the images were supervised learning
models, requiring training on a dataset of typical images. Each network was
trained using the preprocessed training dataset and labels for 25 epochs
(one epoch being a forward and backward pass of all data) in batches of
150. The number of epochs was chosen to give the networks sufficient
training passes while preventing overfitting\footnote{Overfitting occurs
when a model learns from the data too specifically, and loses its ability
to generalise its predictions for new data (resulting in a loss of
prediction accuracy).} of the training data, given the current model
parameters. The batch size is the number of images sent through each pass
of the network. Using the entire dataset in each pass would train the
network quickly, but decrease the network's ability to learn unique
features from the data. Passing one image at a time may allow the model to
learn more about each image, but it would also increase the training time
and the risk of overfitting the data. The batch size was therefore chosen
to maintain training accuracy while minimising training time.

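In Keras terms (assuming a compiled model such as the LeNet sketch in
\Cref{sec:background}), this training step reduces to:

\begin{verbatim}
# Train for 25 epochs in batches of 150 images, as described above
# (a sketch; assumes the model and data split from the earlier sketches).
history = model.fit(X_train, y_train, epochs=25, batch_size=150)
\end{verbatim}
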
\subsection{Neural Network Testing}\label{nnTesting}

After training each network, a separate test set of images (and labels) was
used to evaluate the models. The result of this testing was expressed
primarily in the form of an accuracy (percentage). These results, as well
as those of the other methods presented in this paper, are given in Figure
\todo{insert ref to results here} of the Results section.
\todo{***********}

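For the networks, this evaluation amounts to the following sketch, assuming
the Keras model and test split from the earlier sketches:

\begin{verbatim}
# Evaluate the trained network on the held-out test set.
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test accuracy: {accuracy:.2%}")
\end{verbatim}
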
\subsection{Benchmarking}\label{benchmarking}

In order to benchmark the neural networks, the performance of these
algorithms is evaluated against other machine learning algorithms. We use
Support Vector Machine, $K$-Nearest Neighbors (\(K=5\)), Naive Bayes, and
Random Forest classifiers, as provided in Scikit-Learn~\cite{scikit-learn}.

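A sketch of this baseline setup follows; flattening the image arrays into
feature vectors is an assumption, since these Scikit-Learn classifiers
expect two-dimensional input.

\begin{verbatim}
# The four Scikit-Learn baselines used for benchmarking.
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

X_train_flat = X_train.reshape(len(X_train), -1)
X_test_flat = X_test.reshape(len(X_test), -1)

baselines = {
    "SVM": SVC(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),  # K = 5
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(),
}
for name, clf in baselines.items():
    clf.fit(X_train_flat, y_train)
    print(name, clf.score(X_test_flat, y_test))
\end{verbatim}
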
\subsection{Performance Metrics}\label{performance-metrics}

To evaluate the performance of the models, we record the time taken by each
model to train on the training data, as well as statistics about the
predictions the models make on the test data. These prediction statistics
include:

\begin{itemize}
\item
\textbf{Accuracy:}
\[a = \dfrac{|correct\ predictions|}{|predictions|} = \dfrac{tp + tn}{tp + tn + fp + fn}\]
\item
\textbf{Precision:}
\[p = \dfrac{|Waldo\ predicted\ as\ Waldo|}{|predicted\ as\ Waldo|} = \dfrac{tp}{tp + fp}\]
\item
\textbf{Recall:}
\[r = \dfrac{|Waldo\ predicted\ as\ Waldo|}{|actually\ Waldo|} = \dfrac{tp}{tp + fn}\]
\item
\textbf{F1 Measure:} \[f1 = \dfrac{2pr}{p + r}\] where \(tp\) is the
number of true positives, \(tn\) is the number of true negatives,
\(fp\) is the number of false positives, and \(fn\) is the number of
false negatives.
\end{itemize}

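These four metrics can be computed directly with Scikit-Learn; in the
sketch below, \texttt{y\_test} and the fitted classifier come from the
earlier sketches.

\begin{verbatim}
# Compute the four performance metrics for one classifier's predictions.
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

y_pred = clf.predict(X_test_flat)
print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("f1:       ", f1_score(y_test, y_pred))
\end{verbatim}
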
\emph{Accuracy} is a common performance metric used in machine learning;
however, in classification problems where the training data is heavily
biased toward one category, a model will sometimes learn to optimize its
accuracy by classifying all instances as one category. That is, the
classifier will classify all images that do not contain Waldo as not
containing Waldo, but will also classify all images containing Waldo as not
containing Waldo. Thus we use other metrics to measure performance as well.

\emph{Precision} returns the percentage of instances classified as Waldo
that are actually Waldo. \emph{Recall} returns the percentage of actual
Waldos that were predicted as Waldo. In the case of a classifier that
classifies nothing as Waldo, the recall would be 0. The \emph{F1-Measure}
is a combination of precision and recall that heavily penalizes classifiers
that perform poorly in either precision or recall.
% Kelvin End

\section{Results} \label{sec:results}

\section{Conclusion} \label{sec:conclusion}

\clearpage % Ensures that the references are on a separate page
\pagebreak
\bibliographystyle{alpha}
\bibliography{references}

\end{document}