%************************************************
\chapter{Multiway Structural Entropic Distance}\label{ch:multiway_structural_entropic_distance}
%************************************************
%\begin{flushleft}{\slshape
% At the time, Nixon was normalizing relations with China.
% I figured that if he could normalize relations, then so could I} \\ \medskip
% --- Edgar Codd
%\end{flushleft}
%
Much of the research on similarity search focuses on the similarity (or distance) between two objects. For many situations, however, it would be useful to ascertain the mutual similarity of a set of objects. Applications include similarity joins, clustering and cluster analysis, and potentially many more techniques that depend upon these.
The calculation of density, or multi-way divergence as the analogue of distance, has rarely been used directly; it is typically calculated through some compound function over the set of pair-wise distances within the set of objects: for example, the mean intra-set distance or the mean distance from each object to a centroid.
%If the divergence function is used frequently, a large number of calls may be required of the, typically expensive, distance metric, making this type of operation prohibitively expensive.
In this chapter, we derive a function to calculate the divergence of a set of objects that is based on the structural entropic distance discussed in chapter \ref{ch:sim_structured_data}. This new metric, which is grounded in information theory, avoids the problem of repeated calls to the distance metric by providing a direct, calculable notion of multi-way divergence that does not rely on approximation. It reuses the notion of complexity offered in the definition of the original distance metric and reverts to that definition when the size of the set is two. It is bounded, giving a maximum value when there is no commonality, which allows absolute comparisons to be made. Finally, it is only a little more expensive to evaluate than a single distance.
We show that multi-way divergence offers a cheaper alternative to compound functions whilst giving similar, or better, semantic properties. We believe it could be applied usefully to construct new metric indices or to execute more complex queries, such as similarity joins, efficiently.
\section{Multiway distances}
The notion of a multi-way distance metric is not new, with motivation coming from geometry and topology. Recently, a number of papers have analysed multi-way generalisations of existing metrics \cite{Warrens2009, Warrens2010, Deza2000797}, considering whether various axioms are preserved by these generalisations. For example, a metric space with a simple dyadic metric satisfies the following axioms:
%
\begin{equation*}
\begin{aligned}
d(x, y) &= 0 \Leftrightarrow x = y \\
d(x, y) &= d(y, x) \\
d(x, y) &\leq d(x, z) + d(y, z)
\end{aligned}
\end{equation*}
%
The first of these is simply generalised to $D(x_1, \ldots, x_n) = 0$ if and only if all $x_i$ are equal; the second to $D(x_{\pi(1)}, \ldots, x_{\pi(n)}) = D(x_1, \ldots, x_n)$ for every permutation $\pi$ of $\{1, 2, \ldots, n\}$; while the third, the triangle inequality, has been generalised in numerous ways \cite{Warrens2010}, one of which is the polyhedron inequality over $n + 1$ elements $x_1, \ldots, x_{n+1}$:
%
\begin{equation*}
(n - 1) \cdot D(x_1, \ldots, x_n) \leq \sum_{i = 1}^n D(x_1, \ldots, x_{i-1}, x_{i + 1},\ldots, x_{n+1})
\end{equation*}
%
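For instance, instantiating the polyhedron inequality with $n = 2$ recovers the familiar triangle inequality over three elements:
%
\begin{equation*}
D(x_1, x_2) \leq D(x_2, x_3) + D(x_1, x_3)
\end{equation*}
%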
These generalised axioms are interesting in their own right, and the first two are desirable properties of a multi-way divergence function; it is not yet clear, however, whether the polyhedron inequality aids similarity search.
\subsection{Compound metrics}
Deza and Rosenberg introduced a multi-way extension of the star distance in \cite{Deza2000797}, while the perimeter distance \cite{Warrens2010} gives a geometrical ``average distance''. Both of these, however, are compound functions over pair-wise distances, and neither fundamentally generalises any specific metric.
\subsection{Multiway structural entropic distance (MSED)}
Let $X_1, \ldots, X_n$ be the sources emitting the streams $E_1, \ldots, E_n$, and let $Y$ be the notional source emitting the merged stream $F = \{f_1, f_2, \ldots, f_m\}$, where $\forall E_i,\, E_i \subseteq F$; then
\begin{equation}
P_Y(f_i) = \frac{\sum_{j=1}^n P_{X_j}(f_i)}{n}
\end{equation}
Recall from chapter \ref{ch:sim_structured_data} that the complexity of a source $X_i$ is the base $b$ raised to the power of the entropy $H_b(X_i) = -\sum_{e \in E_i} P_{X_i}(e) \log_b P_{X_i}(e)$ of that source, that is, $C(X_i) = b^{H_b(X_i)}$. For example, a source that is uniform over $k$ distinct events has entropy $\log_b k$ and therefore complexity $k$, whatever base is chosen.
Applying the same principle as before, namely taking the ratio of the complexity of the merged stream to the geometric mean of the individual complexities, we arrive at the ratio
\begin{equation}
\frac{C(Y)}{\sqrt[n]{\prod_{i=1}^n C(X_i)}}
\end{equation}
which gives a value in the range $[1, n]$; we scale this to $[0,1]$ to give the multiway structural entropic distance:
\begin{equation}
MSED(X_1, \ldots, X_n) = \frac{1}{n - 1} \cdot \left(\frac{C(Y)}{\sqrt[n]{\prod_{i=1}^n C(X_i)}} - 1\right)
\end{equation}
As required, the merged complexity $C(Y)$ equals the common individual complexity when all structures are identical, and equals $n$ times the geometric mean of the individual complexities when there is no common structure; the complexity ratio is therefore $1$ and $n$ respectively, and MSED takes the values $0$ and $1$.
Using the same mapping to vector spaces as before, this generalised function remains a ratio of the complexity of a centroid to the geometric mean of its neighbours' complexities, and has the following properties:
\begin{itemize}
\item If all vectors are identical it gives 0
\item If the vectors share no common events it gives 1
\item All other inputs give a value strictly between these bounds
\end{itemize}
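To make the computation concrete, the following is a minimal sketch in Python; it is illustrative only, and assumes that each object has already been reduced to a finite distribution mapping events to probabilities (the function names and this dictionary representation are illustrative choices, not part of any accompanying implementation).
\begin{verbatim}
import math

def entropy(p, base=2.0):
    # Shannon entropy H_b of a distribution {event: probability}
    return -sum(q * math.log(q, base) for q in p.values() if q > 0.0)

def complexity(p, base=2.0):
    # complexity C(X) = b ** H_b(X)
    return base ** entropy(p, base)

def msed(dists, base=2.0):
    # multiway structural entropic distance of n >= 2 distributions
    n = len(dists)
    # merged (notional) source Y: the event-wise mean of the inputs
    events = set().union(*dists)
    merged = {e: sum(p.get(e, 0.0) for p in dists) / n for e in events}
    # ratio of the merged complexity to the geometric mean of the
    # individual complexities, which lies in [1, n] ...
    geo_mean = math.prod(complexity(p, base) for p in dists) ** (1.0 / n)
    ratio = complexity(merged, base) / geo_mean
    # ... rescaled to [0, 1]
    return (ratio - 1.0) / (n - 1.0)

# identical inputs give 0; disjoint supports give 1
a = {"x": 0.5, "y": 0.5}
b = {"u": 0.5, "v": 0.5}
print(msed([a, a, a]))                              # 0.0
print(msed([a, a, b]))                              # between 0 and 1
print(msed([{"x": 1.0}, {"y": 1.0}, {"z": 1.0}]))   # 1.0
\end{verbatim}
Note that the individual complexities depend only on their own distributions, so when many divergences are computed over the same collection they need only be computed once and cached; each call then requires only the complexity of the merged distribution and some simple arithmetic.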
Consider now the metric space axioms described at the beginning of this section. We have stated that the lower bound is achieved when all elements are the same, so the first axiom holds. Since all operations involved in both the numerator and the denominator are commutative, total symmetry holds too. We do not yet know whether the polyhedron inequality holds.
\subsection{Relationship to Jensen-Shannon divergence}
As we saw in the last chapter, SED is related to the Jensen-Shannon divergence. A similar relationship exists between the two multi-way versions. Recall that the Jensen-Shannon divergence may be expressed in vector notation as:
\begin{equation}
JSD(\mathbf{x}, \mathbf{y}) = H_b(\frac{1}{2} (\mathbf{x} + \mathbf{y})) - \frac{1}{2}(H_b(\mathbf{x}) + H_b(\mathbf{y}))
\end{equation}
When defining Jensen-Shannon divergence, Lin\cite{} also described a multi-way generalisation:
\begin{equation}
JSD(\mathbf{x^{(1)}},\ldots, \mathbf{x^{(n)}}) = H_b(\frac{1}{n}\sum_{i = 1}^n \mathbf{x^{(i)}}) - \frac{1}{n} \sum_{i = 1}^n H_b(\mathbf{x^{(i)}})
\end{equation}
\begin{mytheorem}{$MSED(\mathbf{x^{(1)}},\ldots, \mathbf{x^{(n)}}) = \frac{b^{JSD(\mathbf{x^{(1)}},\ldots, \mathbf{x^{(n)}})} - 1}{n - 1}$}
\begin{proof}\label{proof:JS}
\begin{align}
MSED(\mathbf{x^{(1)}},\ldots, \mathbf{x^{(n)}}) &= \frac{1}{n - 1} \cdot \left(\frac{C(\frac{1}{n}\sum_{i = 1}^n \mathbf{x^{(i)}})}{\sqrt[n]{\prod_{i=1}^n \strut C(\mathbf{x^{(i)}})}} - 1\right)\\
&= \frac{1}{n - 1} \cdot \left(\frac{b^{H_b(\frac{1}{n}\sum_{i = 1}^n \mathbf{x^{(i)}})}}{\sqrt[n]{\prod_{i=1}^n \strut b^{H_b(\mathbf{x^{(i)}})}}} - 1\right)\\
&= \frac{1}{n - 1} \cdot \left(\frac{b^{H_b(\frac{1}{n}\sum_{i = 1}^n \mathbf{x^{(i)}})}}{b^{\frac{1}{n} \sum_{i=1}^n H_b(\mathbf{x^{(i)}})}} - 1\right)\\
%
&= \frac{1}{n - 1} \cdot (b^{H_b(\frac{1}{n}\sum_{i = 1}^n \mathbf{x^{(i)}}) -\frac{1}{n}\sum_{i=1}^n H_b(\mathbf{x^{(i)}})} - 1)\\
&= \frac{b^{JSD(\mathbf{x^{(1)}},\ldots, \mathbf{x^{(n)}})} - 1}{n - 1}
\end{align}
\end{proof}
\end{mytheorem}
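Two consequences are worth noting. First, for a fixed set size $n$ and base $b$, MSED is a strictly increasing function of the multi-way Jensen-Shannon divergence, so the two always rank sets of the same size in the same order. Second, setting $n = 2$ gives
\begin{equation*}
MSED(\mathbf{x}, \mathbf{y}) = b^{JSD(\mathbf{x}, \mathbf{y})} - 1
\end{equation*}
which is consistent with the dyadic relationship between SED and the Jensen-Shannon divergence referred to above.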
%\subsection{old-Generalisation}
%Consider the algebraic rewrite of SED introduced in chapter 3 which culminates with final equation:
%
%\begin{myMaths}
%d(X,Y) = 2 \cdot \left( \prod_{e \in X \cap Y}
%\left(u_e^{u_e} \cdot v_e^{v_e} \right)^{w}\right) - 1
%\end{myMaths}
%
%The product term itself is composed of three elements -- $u_e$, $v_e$, and $w_e$. $u_e$ and $v_e$ can be thought of themselves as relative frequencies of the particular frequency of term $e$, and $u_e + v_e = 1$. While $w_e$ is the mean of the relative frequencies of $e$.
%
%With these two points in mind, the formula generalises to more than two variables (i.e. $d(X_1, \ldots, X_n)$) as follows. First, since $u_e$ and $v_e$ may be thought of as relative frequencies, replace these with $u_{ie} = \frac{f_{X_i}(e)} {\sum_{X} f_{X}(e)}$. Then likewise with $w_e$, the mean, replace with $w_e = \frac{\sum_{X} f_{X}(e)}{n}$. Now the product term can be written as
%
%\begin{myMaths}
%\prod_{e \in \bigcap X} \prod_{i = 1}^{n} u_{ie}^{u_{ie}w_e}
%\end{myMaths}
%Next, notice that this product term is now in the range $[\frac{1}{n}, 1]$ and thus requires scaling to the range $[0,1]$ as before, giving the generalised multivariate distance function:
%
%\begin{myMaths} d(X_1, \ldots, X_n) = \frac{n \cdot \left(\prod_{e \in \bigcap X} \prod_{i = 1}^{n} u_{ie}^{u_{ie}w_e}\right) - 1 }{n - 1}
%\end{myMaths}
\section{Correlations}
\begin{figure}
\centering
\subfloat[triples with mean structural entropic distance]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/2.png}
\label{fig:triples}
}%
~ %add desired spacing between images, e. g. ~, \quad, \qquad etc.
%(or a blank line to force the subfigure onto a new line)
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/4.png}
\label{fig:5-tuples}
}
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/8.png}
\label{fig:5-tuples}
}
~
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/16.png}
\label{fig:5-tuples}
}
\caption[Correlation with divergence in 15-dimensional space]{Correlation with divergence in 15-dimensional space}\label{fig:corr-mean}
\end{figure}
%We compared our function to three other measures of multi-way divergence, based both on the SED metric and also on Euclidean distance. Set sizes of 3, 4 and 5 were chosen over randomly generated 5 and 15 dimensional probability vectors.
%\begin{table}
%
%\centering
%\begin{tabularx}{0.95\linewidth}{X l l l}
%\toprule
%\multicolumn{3}{r}{PCC} \\
%\cmidrule(r){2-4}
%5 dimension Structural entropic distance & 3-tuple & 4-tuple & 5-tuple\\
%\midrule
%mean intra-cluster distance & 0.991285 & 0.9918386 & 0.9943856\\
%mean distance to a centroid & 0.9950909 & 0.9952373 & 0.9959319\\
%max intra-cluster distance & 0.9365853 & 0.9027764 & 0.9021573\\
%\midrule
%\multicolumn{3}{l}{5 dimension Euclidean distance} \\
%\midrule
%mean intra-cluster distance & 0.9482936 & 0.9584011 & 0.9669735\\
%mean distance to a centroid & 0.9478603 & 0.9565286 & 0.9634166\\
%max intra-cluster distance & 0.9059921 & 0.8853846 & 0.8865668\\
%\midrule
%\multicolumn{3}{l}{15 dimension Structural entropic distance} \\
%\midrule
%mean intra-cluster distance & 0.9924866 & 0.990572 & 0.9915913\\
%mean distance to a centroid & 0.9958805 & 0.9950946 & 0.9956354\\
%max intra-cluster distance & 0.9166376 & 0.854482 & 0.8198081\\
%\midrule
%\multicolumn{3}{l}{15 dimension Euclidean distance} \\
%\midrule
%mean intra-cluster distance & 0.9645352 & 0.9657067 & 0.9677485\\
%mean distance to a centroid & 0.964599 & 0.9649238 & 0.9662083\\
%max intra-cluster distance & 0.8966909 & 0.8426229 & 0.8024388\\
%\bottomrule
%\end{tabularx}
%\caption{Pearson's correlation coefficients for multi-way divergence}
%\label{term_table}
%\end{table}%
%The maximum distance within each cluster is taken by running through all pair-wise combinations and selecting the largest.
The best correlation is with the mean distance to a centroid. This measure is computed by constructing a centroid as the mean vector and then averaging the distances from every member of the cluster to it. Since multi-way divergence is itself a ratio of the complexity of a centroid to the geometric mean of the individual complexities, it has much in common with this definition. Even when the comparison distance metric is Euclidean distance, a very strong correlation exists (figure \ref{fig:5-tuples}).
The mean intra-cluster distance also correlates well with multi-way divergence. Figure \ref{fig:triples} shows the correlation with the mean intra-cluster distance, which appears to be strongest at the lower end, the range most commonly of interest.
Multi-way divergence correlates -- less strongly than with the other methods -- with the maximum intra-cluster distance under both SED and Euclidean distance, and the correlation drops as the cluster size increases. Rather than assessing mutual similarity, the maximum distance simply describes the two farthest points in the cluster. These two points must lie on the cluster perimeter and describe the spread of points across the space; using only two points, however, fails to account for the spread in other dimensions, which is further evidenced by the drop in correlation in the higher-dimensional space.
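For reference, the compound measures used in this comparison can be sketched as follows; this is illustrative Python only, assuming plain numeric vectors and Euclidean distance as the pair-wise metric (any other pair-wise metric, such as SED, could be substituted for the \texttt{dist} argument).
\begin{verbatim}
import math
from itertools import combinations

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_intra_cluster(points, dist=euclidean):
    # mean of all pair-wise distances: quadratic in the set size
    pairs = list(combinations(points, 2))
    return sum(dist(x, y) for x, y in pairs) / len(pairs)

def mean_to_centroid(points, dist=euclidean):
    # mean distance to the mean-vector centroid: linear in the set size
    n = len(points)
    centroid = [sum(col) / n for col in zip(*points)]
    return sum(dist(p, centroid) for p in points) / n

def max_intra_cluster(points, dist=euclidean):
    # largest pair-wise distance: the two farthest points in the set
    return max(dist(x, y) for x, y in combinations(points, 2))
\end{verbatim}
The number of pair-wise distance evaluations makes the difference in cost explicit: the mean and maximum intra-cluster distances require $\binom{n}{2}$ calls, the centroid method $n$ calls, whereas the multi-way divergence is computed directly without any.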
%\begin{figure}
% \centering
% \begin{subfigure}[b]{0.3\textwidth}
% \centering
% \includegraphics[width=\textwidth]{images/triples-max.png}
% \caption{triples}
% \label{fig:triples}
% \end{subfigure}%
% ~ %add desired spacing between images, e. g. ~, \quad, \qquad etc.
% %(or a blank line to force the subfigure onto a new line)
% \begin{subfigure}[b]{0.3\textwidth}
% \centering
% \includegraphics[width=\textwidth]{images/4-tuples-max.png}
% \caption{4-tuples}
% \label{fig:4-tuples}
% \end{subfigure}
% ~ %add desired spacing between images, e. g. ~, \quad, \qquad etc.
% %(or a blank line to force the subfigure onto a new line)
% \begin{subfigure}[b]{0.3\textwidth}
% \centering
% \includegraphics[width=\textwidth]{images/5-tuples-max.png}
% \caption{5-tuples}
% \label{fig:5-tuples}
% \end{subfigure}
% \caption{Correlations with maximum inter-cluster distance joining 15-dimensional spaces}\label{fig:corr-max}
%\end{figure}
\begin{figure}
\centering
\subfloat[triples with mean structural entropic distance]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/3_mean.png}
\label{fig:triples}
}%
~ %add desired spacing between images, e. g. ~, \quad, \qquad etc.
%(or a blank line to force the subfigure onto a new line)
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/3_geometric.png}
\label{fig:5-tuples}
}
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/3_harmonic.png}
\label{fig:5-tuples}
}
~
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/3_centroid.png}
\label{fig:5-tuples}
}
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/3_max.png}
\label{fig:5-tuples}
}
~
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/3_min.png}
\label{fig:5-tuples}
}
\caption[Correlation with divergence in 15-dimensional space]{Correlation with divergence in 15-dimensional space}\label{fig:corr-mean}
\end{figure}
\begin{figure}
\centering
\subfloat[triples with mean structural entropic distance]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/16_Mean.png}
\label{fig:triples}
}%
~ %add desired spacing between images, e. g. ~, \quad, \qquad etc.
%(or a blank line to force the subfigure onto a new line)
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/16_Geometric.png}
\label{fig:5-tuples}
}
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/16_Harmonic.png}
\label{fig:5-tuples}
}
~
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/16_Centroid.png}
\label{fig:5-tuples}
}
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/16_Max.png}
\label{fig:5-tuples}
}
~
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/16_Min.png}
\label{fig:5-tuples}
}
\caption[Correlation with divergence in 15-dimensional space]{Correlation with divergence in 15-dimensional space}\label{fig:corr-mean}
\end{figure}
\begin{figure}
\centering
\subfloat[triples with mean structural entropic distance]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/pearson.png}
\label{fig:triples}
}%
~ %add desired spacing between images, e. g. ~, \quad, \qquad etc.
%(or a blank line to force the subfigure onto a new line)
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/spearman.png}
\label{fig:5-tuples}
}
\subfloat[5-tuples with mean euclidean distance to a centroid]{
\includegraphics[width=0.5\textwidth]{gfx/correlations/kendall.png}
\label{fig:5-tuples}
}
\caption[Correlation with divergence in 15-dimensional space]{Correlation with divergence in 15-dimensional space}\label{fig:corr-mean}
\end{figure}
%\section{Performance}
%When many calculations are performed over a given metric space, the complexity of individual vectors need only be calculated once and stored for later use, thus amortising the cost of the calculation when multiple calls to divergence are required. The only calculation required is the complexity of the centroid, followed by some simple arithmetic, making the calculation really quite cheap.
%
%The performance of all the metrics depends on the number of distance calls required: divergence is constant, mean distance to centroid is linear, and mean distance is quadratic. We measured this to see the practical impact, Figure \ref{fig:performance} shows the time in nanoseconds to evaluate each function. The mean inter-cluster distance is an order of magnitude slower than divergence when the set size is just 5.
%\begin{figure}
% \centering
% \subfloat[Performance vs SED]{
% \includegraphics[width=0.5\textwidth]{gfx/sed_times.png}
% \label{fig:triples}
% }%
% ~ %add desired spacing between images, e. g. ~, \quad, \qquad etc.
% %(or a blank line to force the subfigure onto a new line)
% \subfloat[Performance vs Euclidean Distance]{
% \includegraphics[width=0.5\textwidth]{gfx/euc_times.png}
% \label{fig:5-tuples}
% }
%
% \caption[Performance]{Performance}\label{fig:msed_performance}
%\end{figure}
%We have recently made the observation that the function we had derived as a distance function $d$ can be logically generalised to a definition over multiple operands; that is, it is a special case of a more general function $D$ over a set of ensembles, such that $d(e_1,e_2) = D(\{e_1,e_2\})$.
%
%Our original ensemble distance is defined in terms of complexity and merge functions ($C$ and $M$ respectively) over ensembles as follows:
%\[
%d(e_1,e_2) = \frac{C(M(e_1,e_2))}{\sqrt{C(e_1) \cdot C(e_2)}} - 1
%\]
%%
%The notional merge operator%
%\footnote{the name derives from ensembles used to model structured data} is an averaging operation over ensembles: $M(e_1,e_2)$ forms a new ensemble, with events drawn from $e_1 \cup e_2$, each assigned the mean probability of that event in $e_1$ and $e_2$. The complexity function $C(e)$ is $x^{H_x(e)}$ where $H_x(e)$ is the Shannon entropy \cite{}%
%\footnote{\cite{} gives much deeper mathematical insight of great value in this context} of the ensemble $e$ using base $x$ logarithms.
%
%Our primary observation is that the semantics underlying the derivation of this form can be generalised to a set $E$ of ensembles, with $n$ elements denoted by $e_i$:
%%
%\[
%D(E) = \left(\frac{C(M'(E))}{\sqrt[n]{ \prod_i {C(e_i) }}} - 1\right) \cdot \frac{1}{n-1}
%\]
%%
%from which it can be seen how the special case of $n=2$ reverts to the original distance function.
\section{Summary}