Testing Properties of Collections of Distributions

We propose a framework for studying property testing of collections of distributions, where the number of distributions in the collection is a parameter of the problem. Previous work on property testing of distributions considered single distributions or pairs of distributions. We suggest two models that differ in the way the algorithm is given access to samples from the distributions. In one model the algorithm may ask for a sample from any distribution of its choice, and in the other the choice of the distribution is random. Our main focus is on the basic problem of distinguishing between the case that all the distributions in the collection are the same (or very similar), and the case that it is necessary to modify the distributions in the collection in a non-negligible manner so as to obtain this property. We give almost tight upper and lower bounds for this testing problem, as well as study an extension to a clusterability property. One of our lower bounds directly implies a lower bound on testing independence of a joint distribution, a result which was left open by previous work.


Introduction
In recent years, several works have investigated the problem of testing various properties of data that is most naturally thought of as samples of an unknown distribution. More specifically, the goal in testing a specific property is to distinguish the case that the samples come from a distribution that has the property from the case that the samples come from a distribution that is far (usually in terms of the ℓ1 norm, but other norms have been studied as well) from any distribution that has the property. To give just a few examples, such tasks include testing whether a distribution is uniform [26, 40] or similar to another known distribution [13], and testing whether a joint distribution is independent [11]. Related tasks concern sublinear estimation of various measures of a distribution, such as its entropy [10, 27] or its support size [42]. Recently, general techniques have been designed to obtain nearly tight lower bounds on such testing and estimation problems [48, 49].
In this work we consider the setting in which one receives data which is most naturally thought of as samples of several distributions, for example, when studying purchase patterns in several geographic locations, or the behavior of linguistic data among varied text sources. Such data could also be generated when samples of the distributions come from various sensors that are each part of a large sensor-net. In these examples, it may be reasonable to assume that the number of such distributions might be quite large, even on the order of a thousand or more. However, for the most part, previous research has considered properties of at most two distributions [12, 48]. We propose new models of property testing that apply to properties of several distributions. We then consider the complexity of testing properties within these models, beginning with properties that we view as basic and expect to be useful as building blocks for future work. We focus on quantifying the dependence of the sample complexities of the testing algorithms on the number of distributions that are being considered, as well as on the size of the domain of the distributions.

Our contributions

1. The models
We begin by proposing two models that describe possible access patterns to multiple distributions D_1, . . ., D_m over the same domain [n]. In these models there is no explicit description of the distributions; the algorithm is only given access to them via samples. In the first model, referred to as the sampling model, at each time step the algorithm receives a pair (i, j), where j is selected uniformly in [m] and i ∈ [n] is distributed according to D_j. In the second model, referred to as the query model, at each time step the algorithm is allowed to specify j ∈ [m] and receives i that is distributed according to D_j. It is immediate that any algorithm in the sampling model can also be used in the query model. On the other hand, as is implied by our results, there are property testing problems which have a significantly larger sample complexity in the sampling model than in the query model.
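To make the two access patterns concrete, here is a small sketch in Python (our own illustration; the function names and the representation of each D_j as an explicit probability vector are assumptions for the sake of the example, not part of the paper's model):

```python
import random

def sample_model(D, rng):
    """Uniform sampling model: the algorithm receives a pair (i, j),
    where j is uniform in [m] and i ~ D_j (0-based indices here)."""
    j = rng.randrange(len(D))
    i = rng.choices(range(len(D[j])), weights=D[j])[0]
    return (i, j)

def query_model(D, j, rng):
    """Query model: the algorithm chooses j and receives i ~ D_j."""
    return rng.choices(range(len(D[j])), weights=D[j])[0]

rng = random.Random(0)
D = [[0.5, 0.5], [1.0, 0.0]]   # m = 2 distributions over n = 2
i = query_model(D, 1, rng)     # D_2 is a point mass on element 0, so i = 0
```

Note that a query-model algorithm can always simulate the sampling model by drawing j uniformly itself, which is the formal sense in which queries are at least as powerful as samples.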
In both models the task is to distinguish between the case that the tested distributions have the property and the case that they are ε-far from having the property, for a given distance parameter ε. Distance to the property is measured in terms of the average ℓ1 distance between the tested distributions and the closest collection of distributions that has the property. In all of our results, the dependence of the algorithms on the distance parameter ε is (inverse) polynomial. Hence, for the sake of succinctness, in what follows we do not mention this dependence explicitly. We note that the sampling model can be extended to allow the choice of the distribution (that is, the index j) to be non-uniform (i.e., determined by a weight w_j), with the distance measure adapted accordingly.

Testing equivalence in the sampling model
One of the first properties of distributions studied in the property testing model is that of determining whether two distributions over the domain [n] are identical (alternatively, very close) or far, according to the ℓ1 distance. In [13], an algorithm is given that uses Õ(n^{2/3}) samples and distinguishes between the case that the two distributions are ε-far and the case that they are O(ε/√n)-close. This algorithm has been shown to be nearly tight (in terms of the dependence on n) by Valiant [49]. Valiant also shows that distinguishing between the case that the distributions are ε-far and the case that they are β-close, for two constants ε and β, requires an almost linear dependence on n.
Our main focus is on a natural generalization, which we refer to as the equivalence property of distributions D_1, . . ., D_m, in which the goal of the tester is to distinguish the case in which all distributions are the same (or, slightly more generally, that there is a distribution D* to which all the distributions in the collection are very close) from the case in which there is no such distribution, i.e., the collection is ε-far from equivalence. To solve this problem in the (uniform) sampling model with sample complexity Õ(n^{2/3}·m) (which ensures, with high probability, that each distribution is sampled Ω(n^{2/3} log m) times), one can make m − 1 calls to the algorithm of [13] to check that every distribution is close to D_1.
Our algorithms. We show that one can obtain a better dependence of the sample complexity on m. Specifically, we give two algorithms, one with sample complexity Õ(n^{2/3}m^{1/3} + m) and the other with sample complexity Õ(n^{1/2}m^{1/2} + n). The first result in fact holds even when, for each sample pair (i, j), the distribution D_j (which generated i) is not necessarily selected uniformly, and furthermore, the weights according to which it is selected are unknown. The second result holds when the selection is non-uniform but the weights are known. Moreover, the second result extends to the case in which the tester should pass collections of distributions that are close for each element, to within a multiplicative factor of (1 ± ε/c) for some constant c > 1, and for sufficiently large frequencies. Thus, starting from the known result for m = 2, as long as n ≥ m the complexity grows as Õ(n^{2/3}m^{1/3}), and for m ≥ n as Õ(n^{1/2}m^{1/2}). Both of our algorithms build on the close relation between testing equivalence and testing independence of a joint distribution over [n] × [m], which was studied in [11]. The Õ(n^{2/3}m^{1/3} + m) algorithm follows from [11] after we fill in a certain gap in the analysis of their algorithm, due to an imprecision of a claim given in [12]. The Õ(n^{1/2}m^{1/2} + n) algorithm exploits the fact that j is selected uniformly (or, more generally, according to a known weight w_j) to improve on the Õ(n^{2/3}m^{1/3} + m) algorithm in the case that m ≥ n.
Almost matching lower bounds. We show that the behavior of the upper bounds on the sample complexity of the problem is not just an artifact of our algorithms, but rather (almost) captures the complexity of the problem. Namely, we give almost matching lower bounds of Ω(n^{2/3}m^{1/3}) (for n = Ω(m log m)) and Ω(n^{1/2}m^{1/2}) (for every n and m). The latter lower bound can be viewed as a generalization of a lower bound given in [13], but the analysis is somewhat more subtle.
Our lower bound of Ω(n^{2/3}m^{1/3}) consists of two parts. The first is a general theorem concerning testing symmetric properties of collections of distributions. This theorem extends a central lemma of Valiant [49] on which he builds his lower bounds, and in particular the lower bound of Ω(n^{2/3}) for testing whether two distributions are identical or far from each other (i.e., the case of equivalence for m = 2). The second part is a construction of two collections of distributions to which the theorem is applied (where the construction is based on the one proposed in [11] for testing independence). As in [49], the lower bound is shown by focusing on the similarity between the typical collision statistics of a family of collections of distributions that have the property and a family of collections of distributions that are far from having the property. However, since many more types of collisions are expected to occur in the case of collections of distributions, our proof outline is more intricate and requires new ways of upper bounding the probabilities of certain types of events.

Testing clusterability in the query model
The second property that we consider is a natural generalization of the equivalence property. Namely, we ask whether the distributions can be partitioned into at most k subsets (clusters) such that, within each cluster, the distance between every two distributions is (very) small. We study this property in the query model, and give an algorithm whose complexity does not depend on the number of distributions and whose dependence on n is Õ(n^{2/3}); the dependence on k is almost linear. The algorithm works by combining the diameter clustering algorithm of [3] (for points in a general metric space, where the algorithm has access to the corresponding distance matrix) with the closeness-of-distributions tester of [13]. Note that the results of [49] imply that this is tight to within polylogarithmic factors in n.
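The greedy diameter-clustering skeleton behind such an algorithm can be sketched as follows (our own simplified illustration: we use exact ℓ1 distances on explicit probability vectors in place of the closeness tester of [13], and we scan all distributions instead of sampling roughly Θ̃(k) of them as the query-model tester would). If the greedy scan produces more than k representatives, the collection cannot be partitioned into k clusters of small diameter:

```python
def l1(p, q):
    # exact l1 distance between two explicit probability vectors
    return sum(abs(a - b) for a, b in zip(p, q))

def greedy_reps(D, thresh):
    """Greedily pick representative indices: a distribution becomes a new
    representative if it is thresh-far (in l1) from all current ones.
    The number of representatives lower-bounds the number of clusters
    needed when each cluster must have diameter at most thresh."""
    reps = []
    for j in range(len(D)):
        if all(l1(D[j], D[r]) > thresh for r in reps):
            reps.append(j)
    return reps
```

For example, three distributions forming two well-separated groups yield two representatives, so they are 2-clusterable but not 1-clusterable at that threshold.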

Implications of our results
As noted previously, in the course of proving the lower bound of Ω(n^{2/3}m^{1/3}) for the equivalence property, we prove a general theorem concerning the testability of symmetric properties of collections of distributions (which extends a lemma in [49]). This theorem may have applications in proving other lower bounds on collections of distributions. A further byproduct of our research concerns the sample complexity of testing whether a joint distribution is independent. More precisely, the following question is considered in [11]: Let Q be a distribution over pairs of elements drawn from [n] × [m] (without loss of generality, assume n ≥ m); what is the sample complexity, in terms of m and n, required to distinguish independent joint distributions from those that are far (in terms of the ℓ1 distance) from the nearest independent joint distribution? The lower bound claimed in [11] contains a known gap in the proof. Similar gaps in the lower bounds of [13] for testing the closeness of distributions and of [10] for estimating the entropy of a distribution were settled by the work of [49], which applies to symmetric properties. Since independence is not a symmetric property, the work of [49] cannot be directly applied here. In this work, we show that the lower bound of Ω(n^{2/3}m^{1/3}) indeed holds. Furthermore, by the aforementioned correction of the upper bound of Õ(n^{2/3}m^{1/3}) from [11], we get nearly tight bounds on the complexity of testing independence.

Open problems and further research
There are many interesting directions to pursue concerning the testing of properties of collections of distributions, and because of the applicability of the model to a wide range of circumstances, we expect that new directions will present themselves. Here we give a few examples. One natural extension of our results is to give algorithms for testing the property of clusterability for k > 1 in the sampling model. One may also consider testing properties of collections of distributions that are defined by certain measures of the distributions, and which may be less sensitive to the exact form of the distributions. For example, a very basic measure is the mean (expected value) of a distribution, when we view the domain [n] as integers instead of element names, or when we consider other domains. Given this measure, we may consider testing whether the distributions all have similar means (or whether they must be modified significantly so that this holds). It is not hard to verify that this property can be quite easily tested in the query model by selecting Θ(1/ε) distributions uniformly and estimating the mean of each. On the other hand, in the sampling model an Ω(√m) lower bound is quite immediate even for n = 2 (and a constant ε). We are currently investigating whether the complexity of this problem (in the sampling model) is in fact higher, and it would be interesting to consider other measures as well.
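For illustration, the query-model mean tester described above can be sketched as follows (our own toy version; exact means computed from explicit probability vectors stand in for the sample-based estimates a real tester would use on Θ(1/ε) uniformly selected distributions):

```python
def mean(Dj):
    # expected value of a distribution over the integer domain {0, ..., n-1}
    return sum(i * p for i, p in enumerate(Dj))

def means_similar(D, indices, tol):
    # accept iff the selected distributions' means all lie within tol of
    # one another (a real tester would estimate each mean from samples)
    mus = [mean(D[j]) for j in indices]
    return max(mus) - min(mus) <= tol
```

Two identical fair coins pass this check, while a point mass on 0 next to a point mass on 1 fails it.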

Organization
In this extended abstract we focus on one result: the lower bound of Ω(n^{2/3}m^{1/3}) for testing equivalence. We give all the details for this result, where the more technical parts can be found in the appendix. All other results are provided in the full version of this paper [33].

Preliminaries
Let [n] def= {1, . . ., n}, and let D = (D_1, . . ., D_m) be a list of m distributions, where each D_j : [n] → [0, 1] satisfies Σ_{i∈[n]} D_j(i) = 1. For a property P of lists of distributions and 0 ≤ ε ≤ 1, we say that D is ε-far from (having) P if (1/m)·Σ_{j=1}^m ‖D_j − D*_j‖_1 > ε for every list D* = (D*_1, . . ., D*_m) that has the property P (note that ‖D_j − D*_j‖_1 is twice the statistical distance between the two distributions).
Given a distance parameter , a testing algorithm for a property P should distinguish between the case that D has the property P and the case that it is -far from P. We consider two models within which this task is performed.
1. The Query Model. In this model the testing algorithm may indicate an index 1 ≤ j ≤ m of its choice, and it gets a sample i distributed according to D_j.
2. The Sampling Model. In this model the algorithm cannot select ("query") a distribution of its choice. Rather, it may obtain a pair (i, j) where j is selected uniformly (we refer to this as the Uniform sampling model) and i is distributed according to D_j.
We also consider a generalization in which there is an underlying weight vector w = (w_1, . . ., w_m) (where Σ_{j=1}^m w_j = 1), and the distribution D_j is selected according to w. In this case the notion of ε-far needs to be modified accordingly. Namely, we say that D is ε-far from P with respect to w if Σ_{j=1}^m w_j·‖D_j − D*_j‖_1 > ε for every list D* = (D*_1, . . ., D*_m) that has the property P. We consider two variants of this non-uniform model: the Known-Weights sampling model, in which w is known to the algorithm, and the Unknown-Weights sampling model, in which w is unknown.
A main focus of this work is on the following property. We shall say that a list D = (D_1, . . ., D_m) of m distributions over [n] belongs to P^eq_{m,n} (or has the property P^eq_{m,n}) if D_j = D_{j'} for all 1 ≤ j, j' ≤ m.
3 A lower bound of Ω(n^{2/3}m^{1/3}) for testing equivalence in the uniform sampling model when n = Ω(m log m)

In this section we prove the following theorem:

Theorem 1 For every ε ≤ 1/20 and for n > cm log m, where c is some sufficiently large constant, any testing algorithm for the property P^eq_{m,n} in the uniform sampling model requires Ω(n^{2/3}m^{1/3}) samples.
The proof of Theorem 1 consists of two parts. The first is a general theorem (Theorem 2) concerning testing symmetric properties of lists of distributions. This theorem extends a lemma of Valiant [49, Lem. 4.5.4] (which leads to what Valiant refers to as the "Wishful Thinking Theorem"). The second part is a construction of two lists of distributions to which Theorem 2 is applied. Our analysis uses a technique called Poissonization [47] (which was used in the past in the context of lower bounds for testing and estimating properties of distributions in [42, 48, 49]), and hence we first introduce some preliminaries concerning Poisson distributions. We later provide some intuition regarding the benefits of Poissonization. All missing details of the analysis can be found in the appendix.

Preliminaries concerning Poisson distributions
For a positive real number λ, the Poisson distribution poi(λ) takes the value x ∈ N (where N = {0, 1, 2, . . .}) with probability poi(x; λ) = e^{−λ}λ^x/x!. The expectation and variance of poi(λ) are both λ. For λ_1 and λ_2 we shall use the bound ‖poi(λ_1) − poi(λ_2)‖_1 ≤ 2|λ_1 − λ_2| on the ℓ1 distance between the corresponding Poisson distributions (for a proof see, for example, [42]). For a vector λ̃ = (λ_1, . . ., λ_d) of positive real numbers, the corresponding multivariate Poisson distribution poi(λ̃) is the product distribution poi(λ_1) × · · · × poi(λ_d). We shall sometimes consider vectors λ̃ whose coordinates are indexed by vectors ā = (a_1, . . ., a_m) ∈ N^m, and will use λ̃(ā) to denote the coordinate of λ̃ that corresponds to ā. Thus for our purposes we shall use the following generalized lemma.

Lemma 1 Let λ̃ and λ̃′ be two vectors of positive real values indexed by a common set I, and let (I_1, . . ., I_ℓ) be any partition of I. Then ‖poi(λ̃) − poi(λ̃′)‖_1 ≤ Σ_{r=1}^ℓ ‖poi(λ̃(I_r)) − poi(λ̃′(I_r))‖_1.

We shall also make use of the following lemma.

Lemma 2 For any two d-dimensional vectors λ̃ = (λ_1, . . ., λ_d) and λ̃′ = (λ′_1, . . ., λ′_d) of positive reals, ‖poi(λ̃) − poi(λ̃′)‖_1 ≤ √(2·Σ_{i=1}^d (λ′_i − λ_i + λ_i ln(λ_i/λ′_i))).
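The basic Poisson facts above are easy to check numerically. The following sketch (our own, not from the paper) verifies the normalization and the expectation of poi(λ), and illustrates the ℓ1 bound ‖poi(λ_1) − poi(λ_2)‖_1 ≤ 2|λ_1 − λ_2|:

```python
import math

def poi_pmf(x, lam):
    # poi(x; lam) = e^{-lam} * lam^x / x!
    return math.exp(-lam) * lam ** x / math.factorial(x)

def l1_poisson(lam1, lam2, cutoff=100):
    # truncated numerical l1 distance between poi(lam1) and poi(lam2);
    # the tail beyond the cutoff is negligible for small lam
    return sum(abs(poi_pmf(x, lam1) - poi_pmf(x, lam2)) for x in range(cutoff))
```

For instance, l1_poisson(3.0, 3.5) is well below the bound 2·|3.0 − 3.5| = 1.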
The next two notations will play an important technical role in our analysis. For a list of distributions D = (D_1, . . ., D_m), an integer κ and a vector ā = (a_1, . . ., a_m) ∈ N^m, let p_{D,κ}(i; ā) def= Π_{j=1}^m poi(a_j; κ·D_j(i)). (2) That is, for a fixed choice of a domain element i ∈ [n], consider performing m independent trials, one for each distribution D_j, where in trial j we select a nonnegative integer according to the Poisson distribution poi(λ) for λ = κ·D_j(i). Then p_{D,κ}(i; ā) is the probability of the joint event that we get an outcome of a_j in trial j, for each j ∈ [m]. Let λ̃_{D,κ} be a vector whose coordinates are indexed by all ā ∈ N^m, such that λ̃_{D,κ}(ā) def= Σ_{i=1}^n p_{D,κ}(i; ā). (3) That is, λ̃_{D,κ}(ā) is the expected number of times we get the joint outcome (a_1, . . ., a_m) if we perform the probabilistic process defined above independently for every i ∈ [n].
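These two quantities translate directly into code. The following sketch (our own, with each D_j given as an explicit probability vector) computes p_{D,κ}(i; ā) and λ̃_{D,κ}(ā):

```python
import math

def poi_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

def p_hist(D, kappa, i, a):
    # p_{D,kappa}(i; a): probability that, independently for each j,
    # a_j is drawn from poi(kappa * D_j(i))
    out = 1.0
    for j, aj in enumerate(a):
        out *= poi_pmf(aj, kappa * D[j][i])
    return out

def lam_entry(D, kappa, a):
    # lambda_{D,kappa}(a): expected number of elements with histogram a
    n = len(D[0])
    return sum(p_hist(D, kappa, i, a) for i in range(n))
```

For example, for m = 2 uniform distributions over n = 4 elements with κ = 2, the expected number of elements that are never sampled is n·e^{−2κ/n} = 4e^{−1}.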

Testability of symmetric properties of lists of distributions
In this subsection we state an important building block in the proof of Theorem 1.
Theorem 2 Let D^+ and D^− be two lists of m distributions over [n], all of whose frequencies are at most δ/(κ·m), where κ is some positive integer and 0 < δ < 1. If

‖poi(λ̃_{D^+,κ}) − poi(λ̃_{D^−,κ})‖_1 is upper bounded by a sufficiently small constant, (4)

then testing in the uniform sampling model any symmetric property of distributions such that D^+ has the property, while D^− is Ω(1)-far from having the property, requires Ω(κ·m) samples.
A high-level discussion of the proof of Theorem 2. For an element i ∈ [n] and a distribution D_j, j ∈ [m], let α_{i,j} be the number of times the pair (i, j) appears in the sample (when the sample is selected according to some sampling model). Thus (α_{i,1}, . . ., α_{i,m}) is the sample histogram of the element i. The histogram of the elements' histograms is called the fingerprint of the sample. That is, the fingerprint indicates, for every ā ∈ N^m, the number of elements i such that (α_{i,1}, . . ., α_{i,m}) = ā. As shown in [13], when testing symmetric properties of distributions, it can be assumed without loss of generality that the testing algorithm is provided only with the fingerprint of the sample. Furthermore, since the number, n, of elements is fixed, it suffices to give the tester the fingerprint of the sample without the 0̄ = (0, . . ., 0) entry.
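In code, the fingerprint of a sample of (i, j) pairs can be computed as follows (a small illustration in our own notation, with 0-based indices):

```python
from collections import Counter

def fingerprint(sample, n, m):
    """Map each nonzero histogram a (a tuple of length m) to the number
    of elements i whose per-distribution sample counts equal a; the
    all-zeros histogram is omitted, as in the text."""
    counts = Counter(sample)               # (i, j) -> alpha_{i,j}
    fp = {}
    for i in range(n):
        a = tuple(counts[(i, j)] for j in range(m))
        if any(a):
            fp[a] = fp.get(a, 0) + 1
    return fp
```

For example, a sample in which element 0 is drawn once from each of two distributions and element 2 is drawn twice from the first yields the fingerprint {(1, 1): 1, (2, 0): 1}.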
In order to prove Theorem 2, we would like to show that the distributions of the fingerprints when the sample is generated according to D^+ and when it is generated according to D^− are similar, for a sample size that is below the lower bound stated in the theorem. For each choice of an element i ∈ [n] and a distribution D_j, the number of times the pair (i, j) appears in the sample, i.e., α_{i,j}, depends on the number of times the other pairs appear, simply because the total number of samples is fixed. Furthermore, for each histogram ā, the number of elements with sample histogram identical to ā is dependent on the number of times the other histograms appear, again because the number of samples is fixed. For instance, if there are three elements and four samples, and we know that one element has histogram (0, 1) and another has histogram (1, 1), then the third element cannot have histogram (2, 0). In addition, the fingerprint entries are dependent because the number of elements is fixed.
We thus see that the distribution of the fingerprints is rather difficult to analyze (and therefore it is difficult to bound the statistical distance between two such distributions). We would therefore like to break as many of the above dependencies as possible. To this end we define a slightly different process for generating the samples, which involves Poissonization [47]. In the Poissonized process the number of samples we take from each distribution D_j, denoted κ′_j, is distributed according to a Poisson distribution. We prove that, while the overall number of samples the Poissonized process takes is larger only by a constant factor than in the uniform process, with very high probability κ′_j > κ_j for every j, where κ_j is the number of samples taken from D_j in the uniform sampling model. This implies that if we prove a lower bound for algorithms that receive samples generated by the Poissonized process, then we obtain a related lower bound for algorithms that work in the uniform sampling model.
As opposed to the process that takes a fixed number of samples according to the uniform sampling model, the benefit of the Poissonized process is that the α_{i,j}'s determined by this process are independent. Therefore, the sample histogram of element i is completely independent of the sample histograms of the other elements. We get that the fingerprint distribution is a generalized multinomial distribution, which fortunately for us has been studied by Roos [43] (the connection is due to Valiant [48]). See details in Appendix B.
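The Poissonized process is straightforward to simulate. The following sketch (our own) draws each count α_{i,j} independently from poi(κ·D_j(i)), which is exactly the independence property exploited above:

```python
import math, random

def poisson_draw(lam, rng):
    # inverse-CDF sampling from poi(lam); adequate for small lam
    u = rng.random()
    x, p = 0, math.exp(-lam)
    c = p
    while u > c:
        x += 1
        p *= lam / x
        c += p
    return x

def poissonized_sample(D, kappa, rng):
    # alpha[i][j] ~ poi(kappa * D_j(i)); all n*m counts are independent
    n, m = len(D[0]), len(D)
    return [[poisson_draw(kappa * D[j][i], rng) for j in range(m)]
            for i in range(n)]
```

In particular, an element with D_j(i) = 0 is never sampled from D_j, since poi(0) is identically 0.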

Proof of Theorem 1
In this subsection we show how to apply Theorem 2 to two lists of distributions, D^+ and D^−, which we will define shortly, where D^+ ∈ P^eq = P^eq_{m,n} while D^− is (1/20)-far from P^eq. Recall that by the premise of Theorem 1, n ≥ cm log m for some sufficiently large constant c > 1. In the proof it will be convenient to assume that m is even and that n (which corresponds in the lemma below to 2t) is divisible by 4. It is not hard to verify that the general case can be reduced to this one. In order to define D^−, we shall need the next lemma.
Lemma 3 For every two even integers t and m, there exists a 0/1-valued matrix M with t rows and m columns for which the following holds:
1. Every row of M sums to m/2 and every column of M sums to t/2.
2. For every integer 2 ≤ x < m/2, and for every subset S ⊆ [m] of size x, the number of rows i such that M[i, j] = 1 for every j ∈ S is at most (1 + o(1))·t/2^x.

We first define D^+, in which all distributions are identical. Specifically, for each j ∈ [m]: D^+_j(i) = 1/(n^{2/3}m^{1/3}) for 1 ≤ i ≤ n^{2/3}m^{1/3}/2, D^+_j(i) = 1/n for n/2 < i ≤ n, and D^+_j(i) = 0 otherwise. (5)

We now turn to defining D^−. Let M be a matrix as in Lemma 3 for t = n/2. For every j ∈ [m]: D^−_j(i) = D^+_j(i) for 1 ≤ i ≤ n/2, and D^−_j(i) = (2/n)·M[i − n/2, j] for n/2 < i ≤ n. (6)

For both D^+ and D^−, we refer to the elements 1 ≤ i ≤ n^{2/3}m^{1/3}/2 as the heavy elements, and to the elements n/2 < i ≤ n as the light elements. Observe that each heavy element has exactly the same probability weight, 1/(n^{2/3}m^{1/3}), in all distributions D^+_j and D^−_j. On the other hand, for each light element i, while D^+_j(i) = 1/n (for every j), in D^− we have that D^−_j(i) = 2/n for half of the distributions (those selected by M), and D^−_j(i) = 0 for the other half (those not selected by M). We later use the properties of M to bound the ℓ1 distance between the fingerprint distributions of D^+ and D^−.
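A toy instance of this construction can be sketched as follows (our own simplified parameters: we fix a small number h of heavy elements carrying total mass 1/2, and an arbitrary M satisfying the row/column sum conditions for t = 4 and m = 2, rather than the asymptotic setting of the lemma). The point is only that each D^−_j is a valid distribution agreeing with D^+ on the heavy elements:

```python
def build_plus(n, h):
    # D^+_j for every j: h heavy elements share total mass 1/2,
    # light elements n/2 .. n-1 get mass 1/n each
    D = [0.0] * n
    for i in range(h):
        D[i] = 0.5 / h
    for i in range(n // 2, n):
        D[i] = 1.0 / n
    return D

def build_minus_j(n, h, M, j):
    # D^-_j: same heavy part; light element n/2 + r gets 2/n iff M[r][j] = 1
    D = [0.0] * n
    for i in range(h):
        D[i] = 0.5 / h
    for r in range(n // 2):
        if M[r][j]:
            D[n // 2 + r] = 2.0 / n
    return D
```

Each column of M summing to t/2 is exactly what makes every D^−_j sum to 1.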
A high-level discussion. To gain some intuition before delving into the detailed proof, consider first the special case that m = 2, which was studied by Valiant [48]; indeed the construction is the same as the one he analyzes (and was initially proposed in [12]). In this case each heavy element has probability weight Θ(1/n^{2/3}), and we would like to establish a lower bound of Ω(n^{2/3}) on the number of samples required to distinguish between D^+ and D^−. That is, we would like to show that the corresponding fingerprint distributions, when the sample is of size o(n^{2/3}), are very similar.
The first main observation is that since the probability weight of light elements is Θ(1/n) in both D^+ and D^−, the probability that a light element will appear more than twice in a sample of size o(n^{2/3}) is very small. That is (using the fingerprints-of-histograms notation we introduced previously), for each ā = (a_1, a_2) such that a_1 + a_2 > 2, the sample will not (with high probability) include any light element i such that α_{i,1} = a_1 and α_{i,2} = a_2 (for both D^+ and D^−). Moreover, the expected number of elements i such that (α_{i,1}, α_{i,2}) = (1, 0) is the same in D^+ and D^−, as is the variance (by symmetry, the same applies to (0, 1)). Thus, most of the difference between the fingerprint distributions is due to the numbers of elements i such that (α_{i,1}, α_{i,2}) ∈ {(1, 1), (2, 0), (0, 2)}. For these settings of ā we do expect to see a non-negligible difference for light elements between D^+ and D^− (in particular, we cannot get the (1, 1) histogram for light elements in D^−, as opposed to D^+).
Here is where the heavy elements come into play. Recall that in both D^+ and D^− the heavy elements have the same probability weight, so that the expected number of heavy elements i such that (α_{i,1}, α_{i,2}) = (1, 1) (and similarly for (2, 0) and (0, 2)) is the same for D^+ and D^−. However, intuitively, the variance of these numbers for the heavy elements "swamps" the differences between the light elements, so that it is not possible to distinguish between D^+ and D^−. The actual proof, which formalizes (and quantifies) this intuition, considers the difference between the values of the vectors λ̃_{D^+,κ} and λ̃_{D^−,κ} (as defined in Equation (3)) in the coordinates corresponding to ā such that a_1 + a_2 = 2. We can then apply Lemmas 1 and 2 to obtain Equation (4) in Theorem 2.
Turning to m > 2, it is no longer true that in a sample of size o(n^{2/3}m^{1/3}) we will not get histogram vectors ā such that Σ_{j=1}^m a_j > 2 for light elements. Thus we have to deal with many more vectors ā (of dimension m) and bound the total contribution of all of them to the difference between the fingerprints of D^+ and of D^−. To this end we partition the set of all possible histogram vectors into several subsets according to their weight Σ_{j=1}^m a_j, and depending on whether all a_j's are in {0, 1} or there exists at least one a_j such that a_j ≥ 2. In particular, to deal with the former (whose number, for each choice of weight x, is relatively large, i.e., roughly (m choose x)), we use the properties of the matrix M based on which D^− is defined. We note that from the analysis we see that, similarly to the case m = 2, we need the variance of the heavy elements to play a role only for the cases where Σ_{j=1}^m a_j = 2, while in the other cases the total contribution of the light elements is rather small.
In the remainder of this section we provide the details of the analysis.
We next introduce some more notation, which will be used throughout the remainder of the proof of Theorem 1. Let S_x be the set of vectors that contain exactly x coordinates that are 1, and all the rest 0 (which corresponds to an element that was sampled once or 0 times by each distribution). Let A_x be the set of vectors whose coordinates sum up to x and that contain at least one coordinate that is at least 2 (which corresponds to an element that was sampled at least twice by at least one distribution). More formally, for any integer x, we define the following two subsets of N^m: S_x def= {ā ∈ N^m : Σ_{j=1}^m a_j = x and a_j ∈ {0, 1} for all j} and A_x def= {ā ∈ N^m : Σ_{j=1}^m a_j = x and a_j ≥ 2 for some j}. In what follows we work towards establishing that Equation (4) in Theorem 2 holds for D^+ and D^−, where δ is a constant to be determined later. We shall use the shorthand λ̃^+ for λ̃_{D^+,κ}, and λ̃^− for λ̃_{D^−,κ} (recall that the notation λ̃_{D,κ} was introduced in Equation (3)). By the definition of λ̃^+, for each ā ∈ N^m, [...] (7), and by the construction of M, for every light element i, [...] (8). Hence, λ̃^+(ā) and λ̃^−(ā) differ only in the term corresponding to the contribution of the light elements. Equations (7) and (8) demonstrate why we choose M with the specific properties defined in Lemma 3. First, in order for every D^−_j to be a probability distribution, we want each column of M to sum up to exactly t/2. We also want each row of M to sum up to exactly m/2, so that each light element receives weight 2/n in exactly half of the distributions. Finally, we would have liked |I_M(ā)|·Π_{j=1}^m 2^{a_j} to equal n/2 for every ā (here I_M(ā) denotes the set of rows of M that have a 1 in every coordinate j for which a_j ≥ 1). This would imply that λ̃^+(ā) and λ̃^−(ā) are equal. As we show below, this is in fact true for every ā ∈ S_1. For vectors ā ∈ S_x with x > 1, the second condition in Lemma 3 ensures that |I_M(ā)| is sufficiently close to (n/2)·(1/2^x). This property of M is not necessary in order to bound the contribution of the vectors in A_x; the bound that we give for those vectors is less tight, but since there are fewer such vectors, it suffices.
We start by considering the contribution to Equation (4) of histogram vectors ā ∈ S_1 (i.e., vectors of the form (0, . . ., 0, 1, 0, . . ., 0)), which correspond to the number of elements that are sampled by only one distribution, and only once. We prove that in the Poissonized sampling model, for every ā ∈ S_1, the number of elements with such a sample histogram is distributed exactly the same under D^+ and D^−.
We now turn to bounding the contribution to Equation (4) of histogram vectors ā ∈ A_2 (i.e., vectors of the form (0, . . ., 0, 2, 0, . . ., 0)), which correspond to the number of elements that are sampled by only one distribution, twice.
By Equations (7), (8) and (9) it follows that [...] and that [...]. By Equations (10) and (11) we have that [...]. By Equation (12) and the fact that [...], the lemma follows by applying Lemma 2.
Recall that for a subset I of N^m, poi(λ̃(I)) denotes the multivariate Poisson distribution restricted to the coordinates of λ̃ that are indexed by the vectors in I. We deal separately with S_x for 2 ≤ x < m/2 and for x ≥ m/2, where our main efforts are with respect to the former, as the latter corresponds to very low probability events. The proofs of the next lemmas appear in Appendix C.

Lemma 7 For m ≥ 16, n ≥ cm ln m (where c is a sufficiently large constant) and for δ ≤ 1/16, [...].

We finally turn to the contribution of ā ∈ A_x such that x ≥ 3, where the proof of the next lemma is similar to the proof of Lemma 8.

Lemma 9
For n ≥ m and δ ≤ 1/4, [...].

We are now ready to finalize the proof of Theorem 1.
Proof of Theorem 1: Let D^+ and D^− be as defined in Equations (5) and (6), respectively, and recall that κ = δ·n^{2/3}/m^{2/3} (where δ will be set subsequently). By the definition of the distributions in D^+ and D^−, the probability weight assigned to each element is at most 1/(n^{2/3}m^{1/3}) = δ/(κ·m), as required by Theorem 2. By Lemma 4, D^− is (1/20)-far from P^eq. Therefore, it remains to establish that Equation (4) holds for D^+ and D^−. Consider the following partition of N^m ∖ {0̄}: [...], where {ā}_{ā∈T} denotes the list of all singletons of elements in T. By Lemma 1 it follows that [...]. For δ < 1/16 we get by Lemmas 5-9 that [...]. Hence, we obtain that [...].

Thus, the lemma follows by induction on ℓ.
Proof of Lemma 2: In order to prove the lemma we shall use the KL-divergence between distributions.
Namely, for two distributions p_1 and p_2 over a domain X, D_KL(p_1 ‖ p_2) def= Σ_{x∈X} p_1(x) ln(p_1(x)/p_2(x)). We have that [...], where in the last inequality we used the fact that ln x ≤ x − 1 for every x > 0. Therefore, we obtain that [...], where in Equation (13) we used the facts that Σ_{i∈N} poi(i; λ) = 1 and Σ_{i∈N} poi(i; λ)·i = λ. The ℓ1 distance is related to the KL-divergence by Pinsker's inequality, ‖D − D′‖_1 ≤ √(2·D_KL(D ‖ D′)), and thus we obtain the lemma.
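The facts used in this computation yield the closed form D_KL(poi(λ) ‖ poi(λ′)) = λ′ − λ + λ ln(λ/λ′), and together with Pinsker's inequality this can be checked numerically. The following sketch is our own sanity check, not part of the proof:

```python
import math

def poi_pmf(x, lam):
    return math.exp(-lam) * lam ** x / math.factorial(x)

def kl_poisson(lam1, lam2):
    # closed form: D_KL(poi(lam1) || poi(lam2)) = lam2 - lam1 + lam1*ln(lam1/lam2)
    return lam2 - lam1 + lam1 * math.log(lam1 / lam2)

def kl_numeric(lam1, lam2, cutoff=100):
    # truncated direct computation of the KL-divergence
    return sum(poi_pmf(x, lam1) * math.log(poi_pmf(x, lam1) / poi_pmf(x, lam2))
               for x in range(cutoff))

def l1_poisson(lam1, lam2, cutoff=100):
    return sum(abs(poi_pmf(x, lam1) - poi_pmf(x, lam2)) for x in range(cutoff))
```

The truncated sums agree with the closed form, and the ℓ1 distance indeed stays below the Pinsker bound √(2·D_KL).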
The next lemma will be used in the proof of Theorem 2.
B Missing details for Subsection 3.2

Lemma 11 Assume that there exists a tester T in the uniform sampling model for a property P of lists of m distributions, that takes a sample of size s = κ·m where κ ≥ c log m for some sufficiently large constant c, and works for every ε ≥ ε_0 where ε_0 is a constant (and whose success probability is at least 2/3). Then there exists a tester T′ for P in the Poissonized uniform sampling model with parameter 4κ, that works for every ε ≥ ε_0 and whose success probability is at least 19/30.

Proof: Roughly speaking, the tester T′ tries to simulate T if it has a sufficiently large sample, and otherwise it guesses the answer. More precisely, let D = (D_1, . . ., D_m) be a list of m distributions. For each j ∈ [m] let κ_j denote the random variable that equals the number of samples that are selected according to D_j in the uniform sampling model, when the total number of samples is κ·m. Thus, κ_j ~ Bin(κm, 1/m). By [5, Thm. A.12], for each j ∈ [m], Pr[κ_j ≥ 2κ] < (e/4)^κ. Now consider a tester T′ that receives κ′_j samples from each D_j where κ′_j ~ poi(4κ). By Lemma 10, for each j we have that Pr[κ′_j < 2κ] < (3/4)^κ. Suppose T′ also selects κ_1, . . ., κ_m according to the distribution induced by the uniform sampling model. If κ′_j ≥ κ_j for each j, then T′ simulates T on the union of the first κ_j samples that it got for each j. Otherwise it outputs "accept" or "reject" with equal probability.
By taking a union bound over all j ∈ [m] we get that the probability that for every j ∈ [m] both κ_j ≤ 2κ and κ'_j ≥ 2κ hold (so that κ'_j ≥ κ_j) is at least 1 − m((e/4)^κ + (3/4)^κ), which is greater than 4/5 for κ > c log m and a sufficiently large constant c. Therefore, the success probability of T' is at least (4/5) · (2/3) + (1/5) · (1/2) = 19/30.

Given Lemma 11 it suffices to consider samples that are generated in the Poissonized uniform sampling model. The process for generating a sample {α_{i,1}, ..., α_{i,m}}_{i∈[n]} (recall that α_{i,j} is the number of times that element i was selected by distribution D_j) in the κ-Poissonized model is equivalent to the following process: for each i ∈ [n] and j ∈ [m], independently select α_{i,j} according to poi(κ · D_j(i)) (see [23], p. 216). Thus the probability of getting a particular histogram ā_i = (a_{i,1}, ..., a_{i,m}) for element i is p_{D,κ}(i; ā_i) (as defined in Equation (2)). We can represent the event that the histogram of element i is ā_i by a Bernoulli random vector b̄_i that is indexed by all ā ∈ ℕ^m, is 1 in the coordinate corresponding to ā_i, and is 0 elsewhere. Given this representation, the fingerprint of the sample corresponds to Σ_{i=1}^n b̄_i. In fact, we would like b̄_i to be of finite dimension, so we consider only a finite (but sufficiently large) number of possible histograms. Under this relaxation, b̄_i = (0, ..., 0) corresponds to the case that the sample histogram of element i is not in the set of histograms we consider. Roos's theorem, stated next, shows that the distribution of the fingerprints can be approximated by a multivariate Poisson distribution (the Poisson here is related to the fact that the fingerprints' distributions are generalized multinomial distributions, and is not related to the Poisson from the Poissonization process). For simplicity, the theorem is stated for vectors b̄_i that are indexed directly, that is, b̄_i = (b_{i,1}, ..., b_{i,h}).
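As a sanity check on the simulation step of Lemma 11, the following sketch (with hypothetical parameters m and κ, not taken from the paper) estimates the probability that the Poissonized sample sizes dominate the binomial ones, i.e., that T' can actually run T rather than guess.

```python
import math
import random

def binomial_count(trials, p, rng):
    """Number of successes in `trials` independent Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(trials))

def poisson_sample(lam, rng):
    """Sample from poi(lam) via Knuth's product-of-uniforms method."""
    threshold, k, prod = math.exp(-lam), 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= threshold:
            return k
        k += 1

rng = random.Random(0)
m, kappa, trials = 20, 50, 500

# kappa_j ~ Bin(kappa*m, 1/m): samples landing on D_j in the uniform sampling model.
# kappa'_j ~ poi(4*kappa): samples drawn for D_j in the Poissonized model.
# T' can simulate T on D_j whenever kappa_j <= 2*kappa <= kappa'_j.
ok = 0
for _ in range(trials):
    kappa_j = binomial_count(kappa * m, 1.0 / m, rng)
    kappa_j_prime = poisson_sample(4 * kappa, rng)
    ok += (kappa_j <= 2 * kappa <= kappa_j_prime)

print(ok / trials)  # very close to 1: T' rarely has to guess
```

Both tail events in the proof, Bin(κm, 1/m) exceeding 2κ and poi(4κ) falling below 2κ, are exponentially unlikely in κ, which is what the empirical frequency reflects.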
We next show how to obtain a bound on sums of the form given in Equation (14) under appropriate conditions.

Proof: The bound follows from a chain of inequalities, where the inequality in Equation (16) holds for δ ≤ 1/2 and the one following it is obtained from the auxiliary bound stated there, and the proof is completed.
Proof of Theorem 2: By the first premise of the theorem, D^+_j(i), D^-_j(i) ≤ δ/(κm) for every i ∈ [n] and j ∈ [m]. By Lemma 12 this implies that Equation (15) holds both for D = D^+ and for D = D^-. Combining this with Theorem 3 we get that the ℓ1 distance between the fingerprint distribution when the sample is generated according to D^+ (in the κ-Poissonized model, see Definition 1) and the distribution poi(λ^{D^+,κ}) is at most (88/5) · 2δ = 176δ/5, and an analogous statement holds for D^-. By applying the premise in Equation (4) (concerning the ℓ1 distance between poi(λ^{D^+,κ}) and poi(λ^{D^-,κ})) and the triangle inequality, we get that the ℓ1 distance between the two fingerprint distributions is smaller than 2 · (176/5)δ + (16/30 − 352δ/5) = 16/30, which implies that the statistical difference is smaller than 8/30, and thus it is not possible to distinguish between D^+ and D^- in the κ-Poissonized model with success probability at least 1/2 + 4/30 = 19/30. By Lemma 11 we get the desired result.

C Missing details for Subsection 3.3
Proof of Lemma 3: Recall that we consider selecting a matrix M randomly as follows. Denote the first t/2 rows of M by F. For each row in F, pick, independently of the other t/2 − 1 rows in F, a uniformly random half of its elements to be 1 and the other half to be 0. Rows t/2+1, ..., t are the negations of rows 1, ..., t/2, respectively. Thus, in each row and each column of M, exactly half of the elements are 1 and the other half are 0.
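A minimal sketch of this construction (with hypothetical parameter values), verifying the row and column balance claimed above:

```python
import random

def build_matrix(t, m, rng):
    """The t x m matrix M from the proof of Lemma 3: each of the first t/2
    rows gets a uniformly random half of its entries set to 1; rows
    t/2+1..t are the negations of rows 1..t/2."""
    assert t % 2 == 0 and m % 2 == 0
    top = []
    for _ in range(t // 2):
        row = [1] * (m // 2) + [0] * (m // 2)
        rng.shuffle(row)
        top.append(row)
    return top + [[1 - v for v in row] for row in top]

M = build_matrix(8, 6, random.Random(1))

# Exactly half 1's in every row and in every column, as claimed.
assert all(sum(row) == 3 for row in M)
assert all(sum(M[i][j] for i in range(8)) == 4 for j in range(6))
```

The column balance is automatic: whatever number of 1's a column receives among the first t/2 rows, the negated rows contribute exactly the complementary count.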
Consider a fixed choice of x. For each row i between 1 and t, each subset of columns S ⊆ [m] of size x, and b ∈ {0, 1}, define the indicator random variable I_{S,i,b} to be 1 if and only if M[i,j] = b for every j ∈ S. Let E_{S,b} denote the expected value of Σ_{i=1}^{t/2} I_{S,i,b}. From the fact that rows t/2+1, ..., t are the negations of rows 1, ..., t/2 it follows that Σ_{i=t/2+1}^{t} I_{S,i,1} = Σ_{i=1}^{t/2} I_{S,i,0}. Therefore, the expected number of rows 1 ≤ i ≤ t such that M[i,j] = 1 for every j ∈ S is simply E_{S,1} + E_{S,0}, which is at most t · (1/2)^x and at least t · (1/2)^x · (1 − 2x²/m). By the additive Chernoff bound,

Pr[ |Σ_{i=1}^{t/2} I_{S,i,b} − E_{S,b}| > √(tx ln m / 2) ] ≤ 2 exp(−2(t/2)(2x ln m)/t) = 2m^{−2x} .

Thus, by taking a union bound (over b ∈ {0, 1}),

Pr[ |Σ_{i=1}^{t} I_{S,i,1} − (E_{S,1} + E_{S,0})| > √(2tx ln m) ] < 4m^{−2x} .

By taking a union bound over all subsets S we get that M has the desired properties with probability greater than 0.

Proof of Lemma 9: We first observe that |A_x| ≤ m^{x−1} for every x. To see why this is true, observe that |A_x| equals the number of possibilities of arranging x − 1 balls, one of which is a "special" ("double") ball, in m bins. By Equations (7) and (8) (and the fact that |x − y| ≤ max{x, y} for all positive real numbers x, y), we obtain the bound in Equation (17), where we used the fact that n ≥ m, and the bound in Equation (18), which holds for δ ≤ 1/4. The lemma follows by applying Equation (1).

For ā ∈ ℕ^m, let sup(ā) def= {j : a_j ≠ 0} denote the support of ā, and let I_M(ā) def= {i : D^-_j(i) = 2/n for all j ∈ sup(ā)}. Note that in terms of the matrix M (based on which D^- is defined), I_M(ā) consists of the rows in M whose restriction to the support of ā contains only 1's. In terms of D^-, it corresponds to the set of light elements that might have a sample histogram of ā (when sampling according to D^-).

Lemma 4 D^- is (1/20)-far from P_eq for every m > 5 and n ≥ c ln m, where c is a sufficiently large constant.

Proof of Lemma 4: Consider any ā ∈ S_2. By Lemma 3, setting t = n/2, the size of I_M(ā), i.e., the number of light elements ℓ such that D^-_j(ℓ) = 2/n for every j ∈ sup(ā), is at most (n/2) · (1/4 + √(8 ln m / n)). The same upper bound holds for the number of light elements ℓ such that D^-_j(ℓ) = 0 for every j ∈ sup(ā). This implies that for every j ≠ j' in [m], for at least n/2 − n · (1/4 + √(8 ln m / n)) of the light elements ℓ, we have that D^-_j(ℓ) = 2/n while D^-_{j'}(ℓ) = 0, or D^-_{j'}(ℓ) = 2/n while D^-_j(ℓ) = 0. Therefore, ‖D^-_j − D^-_{j'}‖_1 ≥ 1/2 − 2√(8 ln m / n), which, for n ≥ c ln m and a sufficiently large constant c, is at least 1/4. Thus, by the triangle inequality, for every distribution D* at most one j ∈ [m] can satisfy ‖D^-_j − D*‖_1 < 1/8, so Σ_{j=1}^{m} ‖D^-_j − D*‖_1 ≥ (m − 1)/8, which is greater than m/20 for m > 5.

The following lemma is used in the proof of Theorem 2 (Appendix B).

Lemma 12 Given a list D = (D_1, ..., D_m) of m distributions over [n] and a real number 0 < δ ≤ 1/2 such that for all i ∈ [n] and all j ∈ [m], D_j(i) ≤ δ/(mκ) for some integer κ, we have that

Σ_{ā ∈ ℕ^m \ {0̄}} (Σ_{i=1}^{n} p_{D,κ}(i; ā)²) / (Σ_{i=1}^{n} p_{D,κ}(i; ā)) ≤ 2δ .  (15)
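The distance calculation in the proof of Lemma 4 can be checked numerically. The sketch below (with hypothetical parameters n and m) builds the light part of the distributions D^-_j from a random matrix M as in the construction, and verifies that all pairwise ℓ1 distances clear the 1/4 threshold.

```python
import random

# Hypothetical parameters (not from the paper): n elements, m distributions,
# t = n/2 light elements; D^-_j puts weight 2/n on light element l iff M[l][j] = 1.
n, m = 1 << 14, 8
t = n // 2
rng = random.Random(2)

top = []
for _ in range(t // 2):
    row = [1] * (m // 2) + [0] * (m // 2)
    rng.shuffle(row)
    top.append(row)
M = top + [[1 - v for v in row] for row in top]  # negated rows, as in Lemma 3

def light_l1(j, jp):
    """l1 distance between D^-_j and D^-_jp restricted to the light elements."""
    return sum(abs(M[l][j] - M[l][jp]) * 2.0 / n for l in range(t))

min_dist = min(light_l1(j, jp) for j in range(m) for jp in range(j + 1, m))

# Lemma 4's calculation gives roughly 1/2 - 2*sqrt(8*ln(m)/n), and in
# particular at least 1/4 once n >= c*ln(m); here the bound holds comfortably.
assert min_dist >= 0.25
```

With n this large the deviation term 2√(8 ln m / n) is small, so every pair of distributions is well separated, matching the concentration argument in the proof.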