diff --git a/wk8/week8.tex b/wk8/week8.tex index f07edd2..c574cc4 100644 --- a/wk8/week8.tex +++ b/wk8/week8.tex @@ -12,6 +12,8 @@ \usepackage[utf8]{inputenc} %support umlauts in the input % Easier compilation \usepackage{bookmark} +\usepackage{natbib} +\usepackage{graphicx} \begin{document} \title{Week 8 - Quantitative data analysis} @@ -25,8 +27,161 @@ \section{Method} \label{sec:method} + The purpose of this report is to re-analyse the data presented in the paper by + \cite{dong2018methods}, which investigates the effect that protests (as an + example of disruptive social behaviours in general) have on consumer + behaviours. \cite{dong2018methods} hypothesise that protests reduce + consumer activity in the area surrounding the event, and suggest that + consumer spending could be used as an additional non-traditional economic + indicator and as a gauge of consumer sentiment. Consumer spending was analysed + using credit card transaction data from a metropolitan area within a country + that is part of the Organisation for Economic Co-operation and Development + (OECD). Although \cite{dong2018methods} investigate temporal and spatial + effects on consumer spending, for the purposes of this analysis, only the + spatial effect of variables (with relation to the geographical distance from + the event) is considered. The dataset consists of variables measured as a + function of the distance from the event (in km), including: the number of + customers, the median spending amount, the number of transactions, and the + total sales amount. + + The re-analysis is conducted on the data provided by + \cite{dong2018methods}, using Python in conjunction with packages such as + pandas, matplotlib, numpy and seaborn, to process and visualise the data. As + noted above, only the spatial data and the variables listed are + considered, for the reference days and for the change occurring on Day 62 (the day of + the first socially disruptive event). 
The distribution of the difference between + the reference period and Day 62 is visualised by plotting a histogram for each + variable. Since the decrease of each of the variables from the reference period + to Day 62 is provided, the mean and the median of these distributions can be + used to perform a one-sample (as we are given the difference) hypothesis + test to assess whether the protests on Day 62 had a discernible effect. + + Assuming the mean of each variable over the reference period is the midpoint + between their respective maximum and minimum values, we can reconstruct + approximate actual values for Day 62 (given the decrease in value on Day 62 + from the reference period). By comparing these values to the range over the + reference period, another assessment can be made to determine whether the data + presents a discernible effect on consumer spending as a result of social + disruption, scaling with distance. + + Although time series data was not explicitly provided, by reading values + off a graph in \cite{dong2018methods} we can quantify the decrease + in number of customers and median spending on Day 62 using information about the + reference days (from 43 to 61). After collecting the values for each of the + reference days (43-61), the mean and standard deviation of this sample can be + calculated. Assuming a normal distribution of the data, we can calculate a + z-score for each observation on Day 62, and use this to assess the original + hypothesis. + + By performing each of the above tests, a re-analysis will be conducted on + the hypothesis of \cite{dong2018methods} that consumer spending decreases + as a result of social events such as protests. In the Results section, we will + perform the statistical analyses described above. The results of these tests + will then be explored in the Discussion section, along with assumptions and + limitations of the tests and what can be concluded from them. 
+ \section{Results} \label{sec:results} + For each of the variables in the given data (number of customers, median + spending amount, number of transactions, and sales totals) we construct a + histogram of the decrease of each (on Day 62). We then compute the mean and + median of the data so we can proceed to perform a one-sample hypothesis test. + + \begin{figure}[ht] + \centering + \includegraphics[width=\textwidth]{distr.png} + \caption{Distribution of each of the variables recorded in the data, as a function of the distance from an event} + \label{fig:distr} + \end{figure} + + Using a mean/median of the reference period, obtained by taking the midpoint of the minimum and maximum values for each distance measure, a value can be reconstructed for the measurement on Day 62 (for each location) using: + + \begin{equation} + \textrm{value} = \frac{\textrm{min} + \textrm{max}}{2} - \textrm{decrease}. + \tag{1} + \end{equation} +\\ + We can then plot the maximum and minimum values for the reference period, as well as the reconstructed Day 62 values, to observe the behaviour of consumer spending after the event. + + \begin{figure}[ht] + \centering + \includegraphics[width=\textwidth]{effect.png} + \caption{The reconstructed values for Day 62 of each variable plotted against their respective minimums and maximums over the reference period} + \label{fig:effect} + \end{figure} + + Using the data recorded for each of the three distances, the mean and standard deviation of the reference period can be calculated. The z-score for each observed value on Day 62 can be computed using: + + \begin{equation} + \textrm{Z} = \frac{\textrm{X} - \mu}{\sigma}, + \tag{2} + \end{equation} +\\ + where X is the observed value, and $\mu$ and $\sigma$ are the mean and standard deviation (respectively) of the reference period. 
+ + \begin{table}[ht] + \centering + \begin{tabular}{|l|l|r|r|} + \hline + \textbf{Variable} & \textbf{Distance} & \textbf{X} & \textbf{Z} \\ + \hline + \textbf{Customers} & \textless 2km & -0.600 & -6.87798 \\ + \textbf{Customers} & 2km - 4km & -0.200 & -3.33253 \\ + \textbf{Customers} & \textgreater 4km & -0.100 & -3.70740 \\ + \textbf{Median Spending} & \textless 2km & -0.200 & -3.05849 \\ + \textbf{Median Spending} & 2km - 4km & -0.100 & -1.46508 \\ + \textbf{Median Spending} & \textgreater 4km & -0.035 & -1.99199 \\ + \hline + \end{tabular} + \caption{The $Z$ score computed using Equation 2 and the temporal data} + \label{my-label} + \end{table} + \section{Discussion} \label{sec:discussion} + As shown in each of the subplots of Figure 1, the mean and median values of + the decrease in each of the distributions are greater than zero (note: higher + values of the decrease variable indicate a larger decrease/negative change). + These mean and median values can be used to perform a one-sample hypothesis + test. Since each of the mean/median values is greater than + zero, we can infer that the event had a net decreasing effect on the number of + customers, median spending amount, number of transactions, and total sales + amount. + + In Figure \ref{fig:effect} values were approximated for each variable on Day + 62, using Equation 1, and plotted against the minimum and maximum values of + the respective variables. This allows us to visually assess whether the + reconstructed value for Day 62 lies outside the range of recorded values for + the reference period, and so presents uncharacteristic behaviour. A decrease is + evident in each of the variables after the event has occurred (on Day 62) + within a distance of approximately 2 km, and appears to stabilise thereafter. 
+ This provides support to \cite{dong2018methods}'s hypothesis that consumer + spending is affected by socially disruptive events, and also provides evidence + for the notion of spatial scaling of this effect (based on the event location). + It is important to note that the approximation used in this technique is + subject to a level of error, because the mean/median of + the reference data is idealised as the midpoint between the minimum and maximum values + provided. + + Reading values off a graph in \cite{dong2018methods} provided time series + data (divided into three radii) to analyse. This data was collected by + visually estimating the values from the graph, which inherently introduces + a source of error. However, by computing the z-score as described in Equation + 2, Table 1 was constructed. Each of the z-score values + in the table is negative, indicating a decrease in both the number of + customers and median spending on Day 62. The much larger magnitude of z-scores + for the \textless 2km distance ring for both variables is in agreement with earlier + discussion, strengthening the hypothesis of the spatial correlation of + consumer spending. + + All of the above tests agree on the spatial and temporal correlation of + consumer spending and socially disruptive events. With the limited data + available, we can therefore concur with the hypothesis of Dong et al. that + consumer spending decreases in the area around disruptive social behaviour, + after finding the temporal correlation on Day 62, as well as the spatially + decreasing effect further from the event. + + \bibliographystyle{humannat} + \bibliography{references} + \end{document}
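
The two calculations the report relies on, the midpoint reconstruction in Equation 1 and the z-score in Equation 2, could be sketched in Python roughly as follows. This is only an illustrative sketch, not the code used for the report: the function names `reconstruct_day62` and `z_score` are hypothetical, the use of the sample standard deviation (`ddof=1`) is an assumption since the report does not say which estimator was used, and the numbers at the bottom are made up purely to exercise the functions.

```python
import numpy as np


def reconstruct_day62(ref_min, ref_max, decrease):
    """Equation 1: midpoint of the reference range minus the reported decrease."""
    ref_min = np.asarray(ref_min, dtype=float)
    ref_max = np.asarray(ref_max, dtype=float)
    decrease = np.asarray(decrease, dtype=float)
    return (ref_min + ref_max) / 2.0 - decrease


def z_score(x, reference):
    """Equation 2: (X - mu) / sigma over the reference-period observations.

    Assumption: sigma is the sample standard deviation (ddof=1).
    """
    reference = np.asarray(reference, dtype=float)
    mu = reference.mean()
    sigma = reference.std(ddof=1)
    return (x - mu) / sigma


if __name__ == "__main__":
    # Made-up values for illustration only; not taken from the paper.
    reference_days = [0.02, -0.01, 0.00, 0.03, -0.04]
    print(z_score(-0.2, reference_days))
    print(reconstruct_day62([10.0], [20.0], [3.0]))
```

A strongly negative z-score from `z_score` would correspond to a Day 62 observation well below the reference-period mean, which is how the values in the table are read in the Discussion.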