Authors:
(1) Andrew Draganov, Aarhus University (all authors contributed equally to this research);
(2) David Saulpic, Université Paris Cité & CNRS;
(3) Chris Schwiegelshohn, Aarhus University.
2 Preliminaries and Related Work
2.3 Coresets for Database Applications
4 Reducing the Impact of the Spread
4.1 Computing a crude upper-bound
4.2 From Approximate Solution to Reduced Spread
5 Fast Compression in Practice
5.1 Goal and Scope of the Empirical Analysis
5.3 Evaluating Sampling Strategies
5.4 Streaming Setting and 5.5 Takeaways
8 Proofs, Pseudo-Code, and Extensions and 8.1 Proof of Corollary 3.2
8.2 Reduction of k-means to k-median
8.3 Estimation of the Optimal Cost in a Tree
As discussed, we focus our study on linear- and sublinear-time sampling strategies. Generally speaking, we consider compression algorithms through the lens of three requirements:
• Finding the compression should not take much longer than the time to read in the dataset.
• The size of the compressed data should not depend on the size of the original dataset.
• Candidate solutions on the compressed data are provably good on the original data.
If these requirements are satisfied then, when analyzing runtimes on large datasets, it is always preferable to compress the dataset and then perform the task in question on the compressed representation.
Definition 2.1. A strong ε-coreset is a pair (Ω, w) of a subset Ω ⊆ P and weights w : Ω → R≥0 such that, for any candidate solution C,
(1 − ε) · cost(P, C) ≤ Σ_{p∈Ω} w(p) · cost(p, C) ≤ (1 + ε) · cost(P, C).
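To make the guarantee concrete, the sketch below (our own illustration, not code from the paper) measures the relative gap between the weighted coreset cost and the full-data cost for a single candidate k-means solution C; a pair (Ω, w) is a strong ε-coreset precisely when this gap is at most ε for every possible C, not just the candidates one happens to test.

```python
import numpy as np

def clustering_cost(points, centers, weights=None):
    """Weighted k-means cost: sum of (weighted) squared distances to the nearest center."""
    dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    closest = dists.min(axis=1)
    if weights is None:
        weights = np.ones(len(points))
    return float((weights * closest).sum())

def coreset_relative_error(P, Omega, w, C):
    """|cost(Omega, w, C) - cost(P, C)| / cost(P, C) for a single candidate solution C."""
    full = clustering_cost(P, C)
    compressed = clustering_cost(Omega, C, weights=w)
    return abs(compressed - full) / full
```

Checking this quantity over a suite of candidate solutions is the evaluation strategy we return to in Section 5; verifying it over all C is exactly what is computationally hard.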
Much of the work on coresets has sought the smallest possible coreset across metric spaces and downstream objectives. The most prominent examples are in Euclidean space [7, 40, 16, 42, 27], with much of the focus on obtaining the optimal size in the k-means and k-median setting. Recently, a lower bound [44] showed that the group sampling algorithm developed in [25, 27, 28] is optimal.
Among other linear-time methods based on sensitivity sampling, we are only aware of the lightweight coresets approach [6], wherein one samples with respect to the candidate solution C = {µ}, the mean of the dataset, rather than a candidate solution with k centers. This runs in O(nd) time but provides a weaker guarantee: one incurs an additive error of ε · cost(P, {µ}). We note that this idea generalizes to sensitivity sampling with any candidate solution C that has fewer than k centers. We discuss this in more depth in Section 5.2.
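As a rough sketch of how such a sampler looks (following our reading of [6]; the function name and the equal mixing of the uniform and cost-proportional terms are our assumptions), one samples points with probability driven by their squared distance to the mean and reweights by the inverse sampling probability:

```python
import numpy as np

def lightweight_coreset(P, m, rng=None):
    """Sensitivity sampling w.r.t. the single-center solution C = {mu}; runs in O(nd) time."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(P)
    mu = P.mean(axis=0)
    sq_dists = ((P - mu) ** 2).sum(axis=1)            # cost(p, {mu}) for every point
    q = 0.5 / n + 0.5 * sq_dists / sq_dists.sum()     # mix uniform and cost-proportional terms
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])                      # inverse-probability weights keep the estimated cost unbiased
    return P[idx], weights
```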
Lastly, the BICO coreset algorithm [38] utilizes the SIGMOD test-of-time winning BIRCH [58] clustering method to obtain k-means coresets, while the Streamkm++ method [1] uses k-means++ [2] to obtain a coreset whose size depends on n and is exponential in d. While both were developed to perform quickly on a data stream, we show in Section 5 that they do not provide strong coreset guarantees in practice for reasonable coreset sizes.
All efficient coreset constructions are probabilistic, making coresets difficult to evaluate. For example, it is co-NP-hard to check whether a candidate compression is a weak coreset [57] [4]. Therefore, although coreset algorithms succeed with some high probability, it is unclear how to computationally verify this. We refer to [57] for further discussion on this topic and discuss our evaluation metrics in Section 5.
MapReduce [5] is one of the most popular architectures for large-scale data analysis (see for example [17, 29, 33] and references therein for MapReduce algorithms for clustering). Within this context, strong coresets are ‘embarrassingly parallel’ and have a natural synergy with the MapReduce framework due to the following two properties. First, if two coresets are computed on subsets of data, then their union is a coreset for the union of the data subsets. Second, many coreset constructions produce a compression with size completely independent of n, allowing the coreset to be stored in memory-constrained environments.
Using these two properties, one can compress the data in a single round of MapReduce as follows [36]. The data is partitioned randomly among the m entities, each of which computes a coreset and sends it to a central host. By the first property, the union of these coresets is a coreset for the full dataset. Thus, the host now possesses a small summary of the full dataset with strong coreset guarantees. By the second property, this summary’s size does not depend on the size of the original dataset – in particular, the total size of the messages received by the host is independent of n. Lastly, by the coreset guarantee, any solution that is good w.r.t. the summary is good w.r.t. the original data (and vice versa).
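A minimal sketch of this one-round aggregation follows (the names distributed_coreset and build_coreset are ours; in an actual MapReduce job each part would be processed on a different worker and only the weighted pieces would be sent to the host):

```python
import numpy as np

def distributed_coreset(P, num_workers, per_worker_size, build_coreset, rng=None):
    """Partition P randomly, build a coreset on each part (the 'map' step),
    and return the union of the weighted pieces (the 'reduce' step)."""
    rng = np.random.default_rng() if rng is None else rng
    parts = np.array_split(P[rng.permutation(len(P))], num_workers)
    pieces = [build_coreset(part, per_worker_size) for part in parts]  # run in parallel in practice
    Omega = np.vstack([points for points, _ in pieces])
    w = np.concatenate([weights for _, weights in pieces])
    # Total size is num_workers * per_worker_size, independent of len(P).
    return Omega, w
```

Because the union of coresets of disjoint parts is a coreset of their union, the host can run any clustering algorithm on (Ω, w) alone; build_coreset could be the lightweight sampler sketched above or any strong coreset construction.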
This idea allows one to compute an approximate solution to k-means and k-median, and do so efficiently, in MapReduce: each computation entity needs to communicate merely O(k) bits, where k is the number of clusters. The computational burden can therefore be completely parallelized up to the time required to compute a coreset in each entity – precisely the focus of this paper. We explore similar aggregation strategies experimentally in Section 5.4.
A common technique for designing fast algorithms is to embed the Euclidean space into a tree metric using quadtree embeddings. The central idea is that any hypercube in the input space can be split into 2^d sub-cubes. This can be represented as a tree, where each sub-cube has the original hypercube as its parent. Randomly shifting the original hypercube and setting the weight of each branch appropriately then preserves the expected distance between points in different sub-cubes within an O(√d log ∆) factor. Here, ∆ is the spread of the point set, i.e. the ratio of the largest distance to the smallest non-zero distance. Given this context, we now introduce quadtree embeddings more formally:
Lemma 2.2 (Lemma 11.9 in [39]). The distances in the tree metric dT satisfy, for all p, q, ∥p − q∥ ≤ E[dT(p, q)] ≤ O(d log ∆) · ∥p − q∥, where the expectation is taken over the random shift s of the decomposition.
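The following toy sketch (ours; it ignores the exact edge weights and constants of Lemma 2.2) illustrates the mechanism: apply one random shift to all points, repeatedly halve the grid cell size, and charge two points a distance proportional to the diameter of the first cell that separates them.

```python
import numpy as np

def quadtree_distance(p, q, spread, shift, max_levels=60):
    """Toy tree distance in a randomly shifted quadtree over [0, spread)^d.
    The cell side halves at each level; once p and q fall into different cells,
    their tree distance is (up to constants) the diameter of that cell."""
    d = len(p)
    side = 2.0 * spread                       # root cell contains all shifted points
    for _ in range(max_levels):
        if not np.array_equal(np.floor((p + shift) / side),
                              np.floor((q + shift) / side)):
            return 2.0 * d * side             # proxy for the sum of edge weights below the split
        side /= 2.0
    return 0.0                                # never separated: treat the points as identical

rng = np.random.default_rng(0)
d, spread = 2, 8.0
shift = rng.uniform(0, spread, size=d)        # one random shift shared by the whole point set
print(quadtree_distance(np.array([0.2, 0.3]), np.array([5.0, 6.0]), spread, shift))
```

In expectation over the shift, nearby points are only separated at fine levels of the tree, which is what drives the distortion bound of Lemma 2.2 and its dependence on the spread ∆.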
This paper is available on arxiv under CC BY 4.0 DEED license.
[2] Indeed, sublinear algorithms always require some assumption on the input to provide guarantees, see [10, 30, 45, 51].
[3] For k-means, an (α, β) bicriteria approximation is an algorithm that computes a solution C satisfying cost(P, C) ≤ α · OPT and |C| ≤ β · k.