Why simple hash functions work: exploiting the entropy in a data stream

Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealized model is unrealistic because a truly random hash function requires an exponential number of bits to describe. Alternatively, one can provide rigorous bounds on performance when explicit families of hash functions are used, such as 2-universal or O(1)-wise independent families. For such families, performance guarantees are often noticeably weaker than for ideal hashing.
In practice, however, it is commonly observed that simple hash functions, including 2-universal hash functions, perform as predicted by the idealized analysis for truly random hash functions. In this paper, we try to explain this phenomenon. We demonstrate that the strong performance of universal hash functions in practice can arise naturally from a combination of the randomness of the hash function and the data. Specifically, following the large body of literature on random sources and randomness extraction, we model the data as coming from a "block source," whereby each new data item has some "entropy" given the previous ones. As long as the (Rényi) entropy per data item is sufficiently large, it turns out that the performance when choosing a hash function from a 2-universal family is essentially the same as for a truly random hash function. We describe results for several sample applications, including linear probing, balanced allocations, and Bloom filters.


Introduction
Hashing is at the core of many fundamental algorithms and data structures, including all varieties of hash tables [20], Bloom filters and their many variants [7], summary algorithms for data streams [21], and many others. Traditionally, applications of hashing are analyzed as if the hash function is a truly random function (a.k.a. a "random oracle") mapping each data item independently and uniformly to the range of the hash function. However, this idealized model is unrealistic, because a truly random function mapping {0, 1}^n to {0, 1}^m requires an exponential (in n) number of bits to describe.
For this reason, a line of theoretical work, starting with the seminal paper of Carter and Wegman [8] on universal hashing, has sought to provide rigorous bounds on performance when explicit families of hash functions are used, e.g., ones whose description and computational complexity are polynomial in n and m. While many beautiful results of this type have been obtained, they are not always as strong as we would like. In some cases, the types of hash functions analyzed can be implemented very efficiently (e.g., universal or O(1)-wise independent hash functions), but the performance guarantees are noticeably weaker than for ideal hashing. (A recent motivating example is the analysis of linear probing under 5-wise independence [25], discussed more below.) In other cases, the performance guarantees are (essentially) optimal, but the hash functions are more complex and expensive (e.g., with a super-linear time or space requirement). For example, if at most T items are going to be hashed, then a T-wise independent hash function will have precisely the same behavior as an ideal hash function. But a T-wise independent hash function mapping to {0, 1}^m requires at least T · m bits to represent, which is often too large. For some applications, it has been shown that less independence, such as O(log T)-wise independence, suffices, e.g., [36, 26], but such functions are still substantially less efficient than 2-universal hash functions. A series of works [38, 24, 11] have improved the time complexity of (almost) T-wise independence to a constant number of word operations, but the space complexity necessarily remains at least T · m.
In practice, however, the performance of standard universal hashing seems to match what is predicted for ideal hashing. This phenomenon was experimentally observed long ago in the setting of Bloom filters [31]; other reported examples include [6, 10, 26, 30, 32]. Thus, it does not seem truly necessary to use the more complex hash functions for which this kind of performance can be proven. We view this as a significant gap between the theory and practice of hashing.
In this paper, we aim to bridge this gap. Specifically, we suggest that it is due to the use of worst-case analysis. Indeed, in some cases, it can be proven that there exist sequences of data items for which universal hashing does not provide optimal performance. But these bad sequences may be pathological cases that are unlikely to arise in practice. That is, the strong performance of universal hash functions in practice may arise from a combination of the randomness of the hash function and the randomness of the data.
Of course, doing an average-case analysis, whereby each data item is independently and uniformly distributed in {0, 1}^n, is also very unrealistic (not to mention that it trivializes many applications). Here we propose that an intermediate model, previously studied in the literature on randomness extraction [9], may be an appropriate data model for hashing applications. Under the assumption that the data fits this model, we show that relatively weak hash functions achieve essentially the same performance as ideal hash functions.
Our model We model the data as coming from a random source in which the data items can be far from uniform and have arbitrary correlations, provided that each (new) data item is sufficiently unpredictable given the previous items. This is formalized by Chor and Goldreich's notion of a block source [9], where we require that the i-th item (block) X_i has at least some k bits of "entropy" conditioned on the previous items (blocks) X_1, . . ., X_{i−1}. There are various choices for the entropy measure that can be used here; Chor and Goldreich use min-entropy, but most of our results hold even for the less stringent measure of Rényi entropy.
We believe that a block source is a plausible model for many real-life data sources, provided the entropy k required per block is not too large. However, in some settings, the data may have structure that violates the block-source property, in which case our results will not apply. Indeed, recent experimental and theoretical results [40, 27] have identified some natural classes of data sets (e.g., where the items are densely packed in an interval) on which existing universal hash families perform poorly (e.g., when used in linear probing, as described below).
Our work is very much in the same spirit as previous works that have examined intermediate models between worst-case and average-case analysis of algorithms for other kinds of problems. Examples include the semi-random graph model of Blum and Spencer [5], and the smoothed analysis of Spielman and Teng [39]. Interestingly, Blum and Spencer's semi-random graph models are based on Santha and Vazirani's model of semi-random sources [35], which in turn were the precursor to the Chor-Goldreich model of block sources [9]. Chor and Goldreich suggest using block sources as an input model for communication complexity, but surprisingly it seems that no one has considered them as an input model for hashing applications.
Our results Our first observation is that standard results in the literature on randomness extractors already imply that universal hashing performs nearly as well as ideal hashing, provided the data items have enough entropy [3, 17, 9, 44]. Specifically, if we have T data items coming from a block source (X_1, . . ., X_T) where each data item has Rényi entropy at least m + 2 log(T/ε), and H is a random 2-universal hash function mapping to {0, 1}^m, then (H(X_1), . . ., H(X_T)) has statistical difference at most ε from T uniform and independent elements of {0, 1}^m. Thus, any event that would occur with some probability p under ideal hashing now occurs with probability p ± ε. This allows us to automatically translate existing results for ideal hashing into results for universal hashing in our model.
In our remaining results, we focus on reducing the amount of entropy required from the data items. Assuming our hash function has a description size o(mT), we must have at least (1 − o(1)) · m bits of entropy per item for the hashing to "behave like" ideal hashing (because the entropy of (H(X_1), . . ., H(X_T)) is at most the sum of the entropies of H and the X_i's). The standard analysis mentioned above requires an additional 2 log(T/ε) bits of entropy per block. In the randomness extraction literature, the additional entropy required is typically not significant because log(T/ε) is much smaller than m. However, it can be significant in our applications. For example, a typical setting is hashing T = Θ(M) items into 2^m = M bins. Here m + 2 log(T/ε) ≥ 3m − O(1), and thus the standard analysis requires 3 times more entropy than the lower bound of (1 − o(1)) · m. (The bounds obtained for the specific applications mentioned below are even larger, sometimes due to the need for a subconstant ε = o(1) and sometimes due to the fact that several independent hash values are needed for each item.) We use a variety of general techniques to reduce the entropy required. These include switching from statistical difference (equivalently, ℓ_1 distance) to Rényi entropy (equivalently, ℓ_2 distance or collision probability) and/or Hellinger distance (corresponding to ℓ_{1/2} distance under appropriate normalization) throughout the analysis, and decoupling the probability that a hash function is "good" from the uniformity of the hashed values h(X_i). In particular, we reduce the entropy required for (H(X_1), . . ., H(X_T)) to be ε-close to uniform in statistical distance from m + 2 log(T/ε) to m + log T + 2 log(1/ε), which we show is tight. We can reduce the entropy required even further for some applications by measuring the quality of the output differently (not using statistical distance) or by using 4-wise independent hash functions (which also have very fast implementations [40]).
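To make the entropy accounting concrete, here is a back-of-the-envelope calculation with illustrative parameters of our own choosing (m = 20, T = 2^20, ε = 2^{-10}), comparing the standard requirement with the improved one:

\[
m + 2\log(T/\varepsilon) = 20 + 2(20 + 10) = 80 \text{ bits per item},
\qquad
m + \log T + 2\log(1/\varepsilon) = 20 + 20 + 20 = 60 \text{ bits per item},
\]

versus the trivial lower bound of roughly m = 20 bits per item.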
Applications We illustrate our approach with several specific applications. Here we informally summarize the results; definitions and discussions appear in Sections 3 and 4. In the following discussion, T is the number of data items to be hashed, M is the size of the hash table, and we focus on the typical setting where T = Θ(M).

The classic analysis of Knuth [20] gives a tight bound for the insertion time in a hash table with linear probing in terms of the "load" of the table (the number of items divided by the size of the table), under the assumption that an idealized, truly random hash function is used. Resolving a longstanding open problem, Pagh, Pagh, and Ružić [25] recently showed that while pairwise independence does not suffice to bound the insertion time in terms of the load alone (for worst-case data), such a bound is possible with 5-wise independent hashing. However, their bound for 5-wise independent hash functions is polynomially larger than the bound for ideal hashing. We show that 2-universal hashing actually achieves the same asymptotic performance as ideal hashing, provided that the data comes from a block source with roughly 3 log M bits of (Rényi) entropy per item, where M is the size of the hash table.
With standard chained hashing, when T items are hashed into T buckets by a single truly random hash function, the maximum load is known to be (1 + o(1)) · (log T / log log T) with high probability [15, 28]. In contrast, Alon et al. [1] show that for a natural family of 2-universal hash functions, it is possible for an adversary to choose a set of T items so that the maximum load is always Ω(√T). Our results in turn show that 2-universal hashing achieves the same performance as ideal hashing asymptotically, provided that the data comes from a block source with roughly 2 log T bits of (Rényi) entropy per item.

[Table 1: Each entry denotes the (Rényi) entropy required per item to ensure that the performance of the given application is "close" to the performance when using truly random hash functions. In all cases, the bounds omit additive terms that depend on how close a performance is desired, and we restrict to the (standard) case that the size of the hash table is linear in the number of items being hashed.]
With the balanced allocation paradigm [2], it is known that when T items are hashed to T buckets, with each item being sequentially placed in the least loaded of d choices (e.g., d = 2), the maximum load is log log T / log d + O(1) with high probability. We show that the same result holds when the hash function is chosen from a 2-universal hash family, provided the data items come from a block source with roughly (d + 1) log T bits of entropy per data item.
Bloom filters [4] are data structures for approximately storing sets, in which membership tests can result in false positives with some bounded probability. We begin by showing that there is a constant gap in the false positive probability for worst-case data when O(1)-wise independent hash functions are used instead of truly random hash functions. On the other hand, we show that if the data comes from a block source with roughly 3 log M bits of (Rényi) entropy per item, where M is the size of the Bloom filter, then the false positive probability with 2-universal hashing asymptotically matches that of ideal hashing.
A summary of required (Rényi) entropy per item for the above applications can be found in Table 1.

Preliminaries
Notation [N] denotes the set {1, . . ., N}. All logs are base 2. For a random variable X and an event E, we write X|_E for the random variable X conditioned on E.
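Since universal hash families appear throughout the paper, the following minimal Python sketch of the classic Carter-Wegman family h_{a,b}(x) = ((a·x + b) mod p) mod M from [8] may help fix ideas. The construction is standard, but the particular prime and parameters below are illustrative choices of ours, not necessarily those used in the experiments cited above.

```python
import random

def make_universal_hash(p, M):
    """Sample h_{a,b}(x) = ((a*x + b) % p) % M with a != 0.

    For a prime p at least as large as the universe size, this family is
    2-universal: for any fixed x != y, Pr[h(x) == h(y)] <= 1/M over (a, b).
    """
    a = random.randrange(1, p)
    b = random.randrange(0, p)
    return lambda x: ((a * x + b) % p) % M

# Example: hash integer keys into M = 1024 buckets using a Mersenne prime modulus.
h = make_universal_hash(p=(1 << 61) - 1, M=1024)
print(h(42), h(43))
```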

Hashing worst-case data
In this section, we describe the four main hashing applications we study in this paper (linear probing, chained hashing, balanced allocations, and Bloom filters) and survey mostly known results about what can be achieved for worst-case data.

Linear probing
A hash table using linear probing stores a sequence x = (x_1, . . ., x_T) of data items from [N] using M memory locations. Given a hash function h : [N] → [M], we place the data items x_1, . . ., x_T sequentially as follows. The data item x_i first attempts placement at h(x_i), and if this location is already filled, locations (h(x_i) + 1) mod M, (h(x_i) + 2) mod M, and so on are tried until an empty location is found. The ratio α = T/M is referred to as the load of the table. The efficiency of linear probing is measured by the insertion time for a new data item. (Other measures, such as the average time to search for items already in the table, are also often studied, and our results can be generalized to these measures as well.)

Definition 3.1. Given a hash function h : [N] → [M], a sequence x = (x_1, . . ., x_T) of data items from [N] stored via linear probing using h, and an extra data item y ∉ x, we define the insertion time Time_LP(h, x, y) to be the value of j such that y is placed at location h(y) + (j − 1) mod M.
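A minimal Python sketch of the insertion procedure and the quantity Time_LP just defined (the hash family and parameters are illustrative assumptions of ours):

```python
import random

def insert_linear_probing(table, h, y):
    """Insert y into `table` (a list of length M, None = empty slot) using hash h.

    Returns j such that y lands at location (h(y) + j - 1) mod M, i.e. the
    insertion time Time_LP(h, x, y) for the items already in the table.
    """
    M = len(table)
    for j in range(1, M + 1):
        loc = (h(y) + j - 1) % M
        if table[loc] is None:
            table[loc] = y
            return j
    raise RuntimeError("table is full")

# Fill a table to load alpha ~ 0.7 with a 2-universal hash, then time one more insertion.
M, p = 1024, (1 << 31) - 1
a, b = random.randrange(1, p), random.randrange(p)
h = lambda x: ((a * x + b) % p) % M
table = [None] * M
for x in range(717):
    insert_linear_probing(table, h, x)
print(insert_linear_probing(table, h, 10**6))
```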
It is well known that with ideal hashing (i.e., hashing using truly random hash functions), the expected insertion time can be bounded quite tightly as a function of the load [20].

Theorem 3.2 ([20]). Let H : [N] → [M] be a truly random hash function. For every sequence x ∈ [N]^T and every y ∉ x,
E[Time_LP(H, x, y)] ≤ (1/2) · (1 + (1/(1 − α))^2),
where α = T/M is the load.
Resolving a longstanding open problem, Pagh, Pagh, and Ružić [25] recently showed that the expected lookup time can be bounded in terms of α using only O(1)-wise independence. Specifically, with 5-wise independence, the expected time for an insertion is O(1/(1 − α)^{2.5}) for any sequence x. On the other hand, in [25] it is also shown that there are examples of sequences x and pairwise independent hash families such that the expected time for a lookup is logarithmic in T (even though the load α is independent of T). In contrast, our work demonstrates that pairwise independent hash functions yield expected lookup times that are asymptotically the same as under the idealized analysis, assuming there is sufficient entropy in the data items themselves.

Chained hashing
A hash table using chained hashing stores a set x = {x_1, . . ., x_T} ∈ [N]^T in an array of M buckets. Let h be a hash function mapping [N] to [M]. We place each item x_i in the bucket h(x_i). The load of a bucket when the process terminates is the number of items in it.
Definition 3.3. Given a hash function h : [N] → [M] and a sequence x = {x_1, . . ., x_T} of data items from [N] stored via chained hashing using h, we define the maximum load MaxLoad_CH(x, h) to be the maximum load among the buckets after all data items have been placed.
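A minimal Python sketch of MaxLoad_CH (again with an illustrative 2-universal hash function; the input distribution is our own choice):

```python
import random
from collections import Counter

def max_load_chained(items, h):
    """MaxLoad_CH(x, h): place each item in bucket h(item); return the largest bucket load."""
    return max(Counter(h(x) for x in items).values())

M = T = 1 << 14
p = (1 << 31) - 1
a, b = random.randrange(1, p), random.randrange(p)
h = lambda x: ((a * x + b) % p) % M
items = [random.getrandbits(40) for _ in range(T)]
print(max_load_chained(items, h))   # for random-looking data, typically O(log T / log log T)
```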
Gonnet [15] proved that when M = T, the expected maximum load is asymptotically log T / log log T. This bound also holds with high probability, as noted in [28]. More precisely, we have:

Theorem 3.4 ([15, 28]). Let H : [N] → [T] be a truly random hash function. For every sequence x of T distinct data items from [N], with probability 1 − o(1),
MaxLoad_CH(x, H) = (1 + o(1)) · log T / log log T,
where the o(1) terms tend to zero as T → ∞.
The calculation underlying this theorem requires that the hash function be Ω(log T / log log T)-wise independent. Indeed, Alon et al. [1] demonstrate that this result does not hold in general for 2-universal hash functions. For example, they show that when the hash function is chosen from the (2-universal) family of linear transformations from F^2 to F, for a finite field F whose size T = |F| is a square, it is possible for an adversary to choose a set of T items so that the maximum load is always at least √T.

Balanced allocations
A hash table using the balanced allocation paradigm [2] with d ∈ N choices stores a sequence x = (x_1, . . ., x_T) ∈ [N]^T in an array of M buckets. Let h be a hash function mapping [N] to [M]^d, where we view the components of h(x) as (h_1(x), . . ., h_d(x)). We place the items sequentially by putting x_i in the least loaded of the d buckets h_1(x_i), . . ., h_d(x_i) at time i (breaking ties arbitrarily), where the load of a bucket at time i is the number of items from x_1, . . ., x_{i−1} placed in it.
Definition 3.5. Given h : [N] → [M]^d and a sequence x = (x_1, . . ., x_T) of data items from [N] stored via the balanced allocation paradigm (with d choices) using h, we define the maximum load MaxLoad_BA(x, h) to be the maximum load among the buckets at time T + 1.
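A minimal Python sketch of the d-choice placement rule and MaxLoad_BA (with d = 1 this degenerates to chained hashing; the hash functions and data below are illustrative assumptions):

```python
import random

def max_load_balanced(items, hash_fns, M):
    """Place each item in the least loaded of its candidate buckets h_1(x), ..., h_d(x)
    (ties broken arbitrarily) and return the maximum load, i.e. MaxLoad_BA."""
    load = [0] * M
    for x in items:
        best = min((h(x) for h in hash_fns), key=lambda bucket: load[bucket])
        load[best] += 1
    return max(load)

M = T = 1 << 14
p = (1 << 31) - 1
def fresh_hash():
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda x: ((a * x + b) % p) % M

items = [random.getrandbits(40) for _ in range(T)]
print(max_load_balanced(items, [fresh_hash() for _ in range(2)], M))   # d = 2 choices
```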
In the case that the number T of items is the same as the number M of buckets and we do balanced allocation with d = 1 choice (i.e., chained hashing), it is proved [28] that the maximum load is Θ(log T / log log T) with high probability. Remarkably, when the number of choices d is two or larger, the maximum load drops to be double-logarithmic.

Theorem 3.6 ([2, 41]). For every d ≥ 2 and γ > 0, there is a constant c such that the following holds. Let H be a truly random hash function mapping [N] to [T]^d. For every sequence x ∈ [N]^T of distinct data items, we have MaxLoad_BA(x, H) ≤ log log T / log d + c with probability at least 1 − 1/T^γ.

There are other variations on this scheme, including the asymmetric version due to Vöcking [41] and cuckoo hashing [26]; we choose to study the original setting for simplicity.
The asymmetric scheme has recently been studied under explicit functions [43], similar to those of [11]. At this point, we know of no non-trivial upper or lower bounds for the balanced allocation paradigm using families of hash functions with constant independence, although performance has been tested empirically [6]. Such bounds have been a long-standing open question in this area.

Bloom filters
A Bloom filter [4] represents a set x = {x_1, . . ., x_T} ⊂ [N] using an array of M bits and ℓ hash functions. For our purposes, it will be somewhat easier to work with a segmented Bloom filter, where the M bits are partitioned into ℓ disjoint subarrays of size M/ℓ, with one subarray for each hash function. We assume that M/ℓ is an integer. (This splitting does not substantially change the results from the standard approach of having all ℓ hash functions map into a single array of size M.) As in the previous section, we denote the components of a hash function h : [N] → [M/ℓ]^ℓ as providing hash values h(x) = (h_1(x), . . ., h_ℓ(x)) ∈ [M/ℓ]^ℓ in the natural way. The Bloom filter is initialized by setting all bits to 0, and then setting the h_i(x_j)-th bit of the i-th subarray to 1 for all i ∈ [ℓ] and j ∈ [T]. Given a data item y, one tests for membership in x by accepting if the h_i(y)-th bit of the i-th subarray is 1 for all i ∈ [ℓ], and rejecting otherwise. Clearly, if y ∈ x, then the algorithm will always accept. However, the algorithm may err if y ∉ x.
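A minimal Python sketch of the segmented Bloom filter just described, with ℓ subarrays of size M/ℓ and one hash function per subarray (parameters and hash family are illustrative assumptions of ours):

```python
import random

class SegmentedBloomFilter:
    def __init__(self, M, ell, hash_fns):
        assert M % ell == 0
        self.seg = M // ell                       # each subarray has M/ell bits
        self.bits = [[0] * self.seg for _ in range(ell)]
        self.hash_fns = hash_fns                  # one hash function per subarray

    def add(self, x):
        for i, h in enumerate(self.hash_fns):
            self.bits[i][h(x) % self.seg] = 1

    def query(self, y):
        # Accept iff the h_i(y)-th bit of the i-th subarray is 1 for every i.
        return all(self.bits[i][h(y) % self.seg] for i, h in enumerate(self.hash_fns))

p = (1 << 31) - 1
def fresh_hash():
    a, b = random.randrange(1, p), random.randrange(p)
    return lambda x: (a * x + b) % p

bf = SegmentedBloomFilter(M=1 << 16, ell=4, hash_fns=[fresh_hash() for _ in range(4)])
for x in range(10000):
    bf.add(x)
print(bf.query(5), bf.query(10**9))   # True; the second answer may be a false positive
```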
For truly random families of hash functions, it is easy to compute the false positive probability: for y ∉ x, each of the ℓ subarrays independently has its tested bit set with probability 1 − (1 − ℓ/M)^T, so the false positive probability is exactly (1 − (1 − ℓ/M)^T)^ℓ. In the typical case that M = Θ(T), the asymptotically optimal number of hash functions is ℓ = (M/T) · ln 2, and the false positive probability is approximately 2^{−ℓ}.
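For instance, under the illustrative choice M = 8T (a parameter setting of our own, used only to plug in numbers), the optimal number of hash functions and the resulting false positive probability work out roughly to

\[
\ell = (M/T)\cdot \ln 2 = 8\ln 2 \approx 5.5,
\qquad
\text{false positive probability} \approx 2^{-\ell} \approx 0.02 .
\]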
We now turn to the worst-case performance of Bloom filters under O(1)-wise independence. It is folklore that 2-universal hash functions can be used with a constant-factor loss in space efficiency. Indeed, a union bound shows that Pr[h_i(y) ∈ h_i(x)] is at most T · (ℓ/M), compared to 1 − (1 − ℓ/M)^T in the case of truly random hash functions. This can be generalized to s-wise independent families using the following inclusion-exclusion formula.

Lemma 3.9. Let H : [N] → [M/ℓ] be a hash function chosen at random from a family H (where ℓ | M). For every sequence x ∈ [N]^T, every y ∉ x, and every even s ≤ T, we have:

If H is an s-universal hash family, then the first s − 1 terms of the outer sum above are the same as for a truly random function (namely (−1)^{j+1} · (T choose j) · (ℓ/M)^j). This gives the following bound.

Proposition 3.10. Let s be an even constant. Let H be an s-universal family mapping [N] to [M/ℓ] (where ℓ divides M), and let H = (H_1, . . ., H_ℓ) be a random hash function from H. For every sequence x ∈ [N]^T of T ≤ M/ℓ data items and every y ∉ x, we have:

Proof. By Lemma 3.9, for each i = 1, . . ., ℓ, we have:

where the last inequality follows by observing that the sum is alternating and thus bounded by its leading terms. Thus,

Notice that in the common case that ℓ = Θ(1) and T ≤ M/(2ℓ), so that the false positive probability is constant, the above bound differs from the one for ideal hashing by an amount that shrinks rapidly with s. However, when s is constant, the difference remains an additive constant. Another way of interpreting this is that to obtain a given guarantee on the false positive probability using O(1)-wise independence instead of ideal hashing, one must pay a constant factor in the space for the Bloom filter. The following proposition shows that no better bound can be proved based solely on O(1)-wise independence.

Proposition 3.11. Let s be an even constant. For all N, M, ℓ, T ∈ N such that M/ℓ is a prime power and T < min{M/ℓ, N}, there exists an (s + 1)-wise independent family of hash functions H mapping [N] to [M/ℓ], a sequence x ∈ [N]^T of data items, and a y ∈ [N] \ x, such that if H = (H_1, . . ., H_ℓ) is a random hash function from H, we have:

Proof. Let q = M/ℓ, and let F be the finite field of size q. Associate the elements of [M/ℓ] with elements of F, and similarly for the first M/ℓ elements of [N]. Let H_1 consist of all polynomials f : F → F of degree at most s; this is an (s + 1)-wise independent family. Let H_2 consist of any (s + 1)-wise independent family mapping the remaining elements of [N] to F. Let H be the family of all such functions f ∪ g. We let x be an arbitrary sequence of T distinct elements of F, and y any other element of F.
Again we compute the false positive probability using Lemma 3.9. The first s terms can be computed exactly as before, using (s + 1)-wise independence. For the terms beyond s, we observe that when |U| ≥ s, it is the case that h_i(y) = h_i(x_k) for all k ∈ U if and only if h_i = f ∪ g for a constant polynomial f. The reason is that no nonconstant polynomial of degree at most s can take on the same value more than s times. The probability that a random polynomial of degree at most s is a constant polynomial is 1/q^s = (ℓ/M)^s.
Again, to bound the false positive probability, we simply raise the above to the ℓ-th power.
Hashing block sources

Block sources
We view our data items as being random variables distributed over a finite set of size N, which we identify with [N]. We use the following quantities to measure the amount of randomness in a data item. For a random variable X, the max probability of X is mp(X) = max_x Pr[X = x], and the collision probability of X is cp(X) = ∑_x Pr[X = x]^2. Measuring these quantities is equivalent to measuring the min-entropy H_∞(X) = log(1/mp(X)) and the Rényi entropy H_2(X) = log(1/cp(X)). Note that mp(X)^2 ≤ cp(X) ≤ mp(X) (so H_∞(X) ≤ H_2(X) ≤ 2 · H_∞(X)), so min-entropy and Rényi entropy are always within a factor of 2 of each other.
We model a sequence of data items as a sequence (X_1, . . ., X_T) of correlated random variables where each item is guaranteed to have some entropy even conditioned on the previous items.

Definition 4.1 (Block Sources). A sequence of random variables (X_1, . . ., X_T) taking values in [N]^T is a block source with collision probability p per block (respectively, max probability p per block) if for every i ∈ [T] and every (x_1, . . ., x_{i−1}) in the support of (X_1, . . ., X_{i−1}), we have cp(X_i | X_1 = x_1, . . ., X_{i−1} = x_{i−1}) ≤ p (respectively, mp(X_i | X_1 = x_1, . . ., X_{i−1} = x_{i−1}) ≤ p).

When max probability is used as the measure of entropy, this is precisely the model of sources suggested in the randomness extractor literature by Chor and Goldreich [9]. We will mainly use the collision probability formulation as the entropy measure, since it makes our results more general.

Definition 4.2. (X_1, . . ., X_T) is a block K-source if it is a block source with collision probability p = 1/K per block.
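As a toy illustration of the block-source model (a synthetic example of our own, not a claim about real data), the following Python sketch emits a correlated stream in which each item depends on the previous item plus k fresh random bits; conditioned on the past, each item is uniform over 2^k values, so the stream is a block 2^k-source.

```python
import random

def block_source(T, k, n=64):
    """Yield T items from [2^n]; item i mixes item i-1 with k fresh random bits.

    Conditioned on all previous items, each new item takes 2^k equally likely
    values (the map fresh -> (prev*c + fresh) mod 2^n is injective for k <= n),
    so every block has min-entropy, and hence Renyi entropy, at least k.
    """
    prev, c = 0, 6364136223846793005       # arbitrary odd mixing constant
    for _ in range(T):
        fresh = random.getrandbits(k)      # the k bits of new entropy per block
        prev = (prev * c + fresh) % (1 << n)
        yield prev

stream = list(block_source(T=10**5, k=30))
```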
The following simple fact relates the collision probability of a joint distribution with its marginals.

Lemma 4.3. Let (Y, X) be jointly distributed random variables, where Y takes values in a set of size L. Then cp(X) ≤ L · cp(Y, X).

Proof. It follows by an application of the Cauchy-Schwarz inequality.
Let (X, Y) be jointly distributed random variables. We can define the conditional collision probability of X conditioned on Y as follows.

We note that in general, the chain rule (i.e., H(X, Y) = H(X) + H(Y | X)) must be handled with care in this setting. The first term is cp(X) by definition. For the second term, note that when X is uniformly distributed, the distribution of X remains uniform after conditioning on X = X'. Thus,

On the other hand, the following lemma says that, as in the case of Shannon entropy, conditioning can only decrease the entropy.

Lemma 4.6. Let (X, Y, Z) be jointly distributed random variables. We have:

Proof. For the first inequality, we have:

For the second inequality, observe that for every y in the support of Y, we have:

Extracting randomness
A randomness extractor [23] can be viewed as a family of hash functions with the property that for any random variable X with enough entropy, if we pick a random hash function h from the family, then h(X) is "close" to being uniformly distributed on the range of the hash function. Randomness extractors are a central object in the theory of pseudorandomness and have many applications in theoretical computer science. Thus there is a large body of work on the construction of randomness extractors. (See the surveys [22, 37].) A major emphasis in this line of work is constructing extractors where it takes extremely few (e.g., a logarithmic number of) random bits to choose a hash function from the family. This parameter is less crucial for us, so instead our emphasis is on using simple and very efficient hash functions (e.g., universal hash functions) and minimizing the amount of entropy needed from the source X. To do this, we will measure the quality of a hash family in ways that are tailored to our applications, and thus we do not necessarily work with the standard definitions of extractors.
In requiring that the hashed value h(X) be "close" to uniform, the standard definition of an extractor uses the most natural measure of "closeness." Specifically, for random variables X and Y taking values in [N], their statistical difference is defined as ∆(X, Y) = max_{S ⊆ [N]} |Pr[X ∈ S] − Pr[Y ∈ S]|; we say X and Y are ε-close if ∆(X, Y) ≤ ε. The classic Leftover Hash Lemma shows that universal hash functions are randomness extractors with respect to statistical difference.
Lemma 4.7 (The Leftover Hash Lemma [3, 17]). Let H : [N] → [M] be a random hash function from a 2-universal family H. For every random variable X taking values in [N] with cp(X) ≤ 1/K, we have cp(H(X) | H) ≤ 1/M + 1/K, and the random variable (H, H(X)) is (1/2) · √(M/K)-close to (H, U_[M]).

Notice that the above lemma says that the joint distribution of (H, H(X)) is ε-close to uniform (for ε = (1/2) · √(M/K)); a family of hash functions achieving this property is referred to as a "strong" randomness extractor. Up to some loss in the parameter ε (which we will later want to save), this strong extraction property is equivalent to saying that with high probability over h ← H, the random variable h(X) is close to uniform. The above formulation of the Leftover Hash Lemma, passing through collision probability, is attributed to Rackoff [18].
To prove the lemma, let (H', X') be an i.i.d. copy of (H, X). We have:

Also, since H is uniformly distributed, by Lemma 4.5,

The proof then relies on the fact that if the collision probability of a random variable is close to that of the uniform distribution, then the random variable is close to uniform in statistical difference. This fact is captured (in a more general form) by the following lemma.
Proof. By the premise of the lemma,

The "in particular" part and item (b) follow by bounding the statistical difference between X and the uniform distribution on [M].

While the bound on statistical difference given by Lemma 4.8 (b) is simpler to state, Lemma 4.8 (a) often provides substantially stronger bounds. To see this, suppose there is a bad event S of vanishing density, i.e., |S| = o(M), and we would like to say that Pr[X ∈ S] = o(1). Using Lemma 4.8 (b), we would need K = ω(M), i.e., cp(X) = (1 + o(1))/M. But applying Lemma 4.8 (a) with f equal to the characteristic function of S, we get the desired conclusion assuming only K = Ω(M), i.e., cp(X) = O(1/M). Variations of Lemma 4.8 (a) can be obtained by using Hölder's inequality instead of Cauchy-Schwarz in the proof; these variations provide bounds in terms of Rényi entropy of different orders (and different moments of f).

The classic approach to extracting randomness from block sources is to simply apply a (strong) randomness extractor, like the one in Lemma 4.7, to each block of the source, and use a union bound over the blocks. The bound on the distance from the uniform distribution grows linearly with the number of blocks.
Theorem 4.9 ([9, 44]). Let H : [N] → [M] be a random hash function from a 2-universal family H. For every block source (X_1, . . ., X_T) with collision probability 1/K per block, the random variable (H, H(X_1), . . ., H(X_T)) is (T/2) · √(M/K)-close to (H, U_{[M]^T}).

Thus, if we have enough entropy per block, universal hash functions behave just like ideal hash functions. How much entropy do we need? To achieve an error ε with the above theorem, we need K ≥ MT^2/(4ε^2). In the next section, we will explore how to improve the quadratic dependence on ε and T.
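Concretely, the requirement K ≥ MT^2/(4ε^2) quoted above comes from setting the error bound of Theorem 4.9 to at most ε:

\[
\frac{T}{2}\sqrt{\frac{M}{K}} \le \varepsilon
\quad\Longleftrightarrow\quad
\sqrt{K} \ge \frac{T\sqrt{M}}{2\varepsilon}
\quad\Longleftrightarrow\quad
K \ge \frac{MT^{2}}{4\varepsilon^{2}} .
\]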

Optimized block-source extraction
In this section, we present several optimized variants of Theorem 4.9. Working with statistical distance, we shave a factor of √T from Theorem 4.9, which translates into a factor-of-T saving in the K needed for the distribution of hashed values to be ε-close to uniform. Recall that a block K-source (X_1, . . ., X_T) is simply a block source with collision probability 1/K per block.

Theorem 4.10. Let H : [N] → [M] be a random hash function from a 2-universal family H. For every block K-source (X_1, . . ., X_T), the random variable (H, H(X_1), . . ., H(X_T)) is (1/2) · √(MT/K)-close to (H, U_{[M]^T}).

Recall that Theorem 4.9 is proven by passing to statistical distance first, and then measuring the growth of distance using statistical distance, which incurs a linear loss in the number of blocks T. As the linear loss in statistical distance is tight in the worst case, we instead measure the growth of distance using Hellinger distance (cf. [14]), and only pass to statistical distance at the end.
In addition to working with the stringent notion of statistical distance, it turns out that for several applications it suffices to ensure that the hashed value (H(X_1), . . ., H(X_T)) has (or is statistically close to having) sufficiently small collision probability, say within an O(1) factor of that of the uniform distribution. We prove theorems of this form with smaller required entropy from the block source; Theorem 4.11 uses only 2-universal hash functions, and Theorem 4.12 achieves better bounds using 4-wise independent hash functions.

Theorem 4.11. Let H : [N] → [M] be a random hash function from a 2-universal family H. For every block K-source (X_1, . . ., X_T) and every ε > 0, the random variable (H, H(X_1), . . ., H(X_T)) is ε-close to a random variable (H, Z) with cp(H, Z) ≤ (1 + 2MT/(εK))/(|H| · M^T).

Theorem 4.12. Let H : [N] → [M] be a random hash function from a 4-wise independent family H. For every block K-source (X_1, . . ., X_T) and every ε > 0, the random variable (H, H(X_1), . . ., H(X_T)) is ε-close to a random variable (H, Z) with cp(H, Z) ≤ (1 + γ)/(|H| · M^T), where γ = 2 · (MT + √(2MT^2/ε))/K. In particular, if K ≥ MT + √(2MT^2/ε), then (H, Z) has collision probability O(1/(|H| · M^T)).

Note that by Lemma 4.3, the conclusions of Theorems 4.11 and 4.12 imply that the collision probability cp(Z) is at most (1 + 2MT/(εK))/M^T and (1 + γ)/M^T, respectively.
We shall prove the above three theorems in the following subsections. As the proof of Theorem 4.10 is more involved, we first prove Theorems 4.11 and 4.12 in Sections 4.3.1 and 4.3.2, and then prove Theorem 4.10 in Section 4.3.3. We further provide several lower bounds in Section 4.4 showing that the above theorems are optimal in several respects.

Small collision probability using 2-universal hash functions
Let H : [N] → [M] be a random hash function from a 2-universal family H. We first study the conditions under which the hashed sequence is close to having small collision probability. This requirement is less stringent than (H, Y) being ε-close to uniform in statistical distance, and so requires fewer bits of entropy.
The starting point of our analysis is the Leftover Hash Lemma stated in Lemma 4.7 above, which asserts that if cp(X) ≤ 1/K, then cp(H(X) | H) ≤ 1/M + 1/K. Using the Leftover Hash Lemma, we show that for every hashed block Y_i, the conditional collision probability cp(Y_i | H, Y_{<i}) is small. By the definition of a block K-source, for every x_{<i} in the support of X_{<i}, we have cp(X_i | X_{<i} = x_{<i}) ≤ 1/K. By the Leftover Hash Lemma, we have cp(H(X_i) | H, X_{<i} = x_{<i}) ≤ 1/M + 1/K, which by definition can be rewritten as a bound on the conditional collision probability of the hashed block. Noting that the collision probability is at least 1/M, Markov's inequality implies that with probability at least 1 − ε over (h, y) ← (H, Y), the corresponding conditional collision probability is not much larger than 1/M. We proceed in the following two steps to finish the proof of Theorem 4.11.
1. First, we show how to fix the ε-fraction of bad (h, y)'s. Namely, we modify at most an ε-fraction of the distribution (H, Y) to obtain a distribution (H, Z) = (H, Z_1, . . ., Z_T) such that for every (h, z) in the support, the conditional collision probability of every block is small.

2. Then we show that the above condition is sufficient to imply that (H, Z) itself has small collision probability.

The first step is captured by the following statement: there exists a distribution (H, Z) = (H, Z_1, . . ., Z_T) such that (H, Z) is ε-close to (H, Y), for every (h, z) ∈ supp(H, Z) the conditional collision probabilities satisfy the bound above, and, furthermore, the marginal distribution of H is unchanged.
Proof. We define the distribution (H, Z) as follows.

It is easy to check that conditions (i)-(iii) hold and that (iv) the marginal distribution of H is unchanged.
Proof. By the Arithmetic Mean-Geometric Mean inequality, the inequality in the premise implies that it suffices to prove the corresponding bound on cp(X). We prove the inequality by induction on T. The base case T = 1 is trivial. Suppose the inequality is true for T − 1; bounding cp(X) by cp(X_1) times the maximum conditional collision probability of the remaining blocks then gives the claim, as desired.

Small collision probability using 4-wise independent hash functions
Here we improve the bound of the previous section using 4-wise independent hash functions. The improvement comes from the fact that when we use 4-wise independent hash functions, we have a concentration result on the conditional collision probability for each block, via the following lemma.
The lemma follows from the following calculation.
We can then replace the application of Markov's inequality in the proof of Theorem 4.11 by Chebyshev's inequality to get a stronger result. Formally, we prove the following lemma, which suffices to prove Theorem 4.12.
Proof of Lemma 4.17. Recall that our goal is to upper bound the probability that the value deviates from its mean by 2/(MK^2 ε). Our strategy is to bound the variance of a properly defined random variable, and then apply Chebyshev's inequality. By Lemma 4.16, for every i ∈ [T], we have the stated variance bound.

Fix i ∈ [T], and let us bound the variance of the i-th block. There are two issues to take care of. First, the variance we have is conditioned on X_{<i} instead of Y_{<i}. Second, even when conditioning on X_{<i}, it is possible that the variance is too large: the reason is that, conditioned on different values X_{<i} = x_{<i}, the collision probability of (Y_i | X_{<i} = x_{<i}) may have different expectations over h ← H. Thus, we have to subtract the mean first, and define f accordingly. Now, for every x_{<i} ∈ supp(X_{<i}), f(H, x_{<i}) has mean 0 and variance at most 2/(MK^2). It follows that Var[f(H, X_{<i})] is also at most 2/(MK^2).

We now deal with the issue of conditioning on X_{<i} versus Y_{<i}, and define g accordingly. We claim that the relevant conditional collision probabilities can be controlled in terms of g; indeed, this follows from Lemma 4.6 and the definitions of f and g. Also note that g(H, Y_{<i}) has mean 0 and small variance.

The above argument holds for every block i ∈ [T]. Taking the average over the blocks and applying Chebyshev's inequality to the random variable (1/T) · ∑_i g(H, Y_{<i}) gives the desired result: with probability 1 − ε over (h, y) ← (H, Y), the deviation is at most the required bound.

Statistical distance to uniform distribution
Let H : [N] → [M] be a random hash function from a 2-universal family H. Let X = (X_1, . . ., X_T) be a block K-source over [N]^T. In this subsection, we study the statistical distance between the distribution of the hashed sequence (H, Y) = (H, H(X_1), . . ., H(X_T)) and the uniform distribution (H, U_{[M]^T}).
As mentioned above, the classic result in Theorem 4.9 shows that (H, Y) is (T/2) · √(M/K)-close to (H, U_{[M]^T}). The result was proven by passing to statistical distance first, and then measuring the growth of statistical distance using a hybrid argument, which incurs a linear loss in the number of blocks T. Since, without further information, the hybrid argument is tight, to avoid a linear loss in T we have to measure the increase of distance over the blocks in another way, and pass to statistical distance only at the end. It turns out that the Hellinger distance (cf. [14]) is a good measure for our purposes. Like statistical distance, Hellinger distance is a distance measure on distributions, and it takes values in [0, 1]. The following standard lemma says that the two distance measures are closely related. We remark that the lemma is tight in both directions even if Y is the uniform distribution.

Lemma 4.19 (cf. [14]). Let X and Y be two random variables over [M]. Then d_H(X, Y)^2 ≤ ∆(X, Y) ≤ √2 · d_H(X, Y), where d_H denotes the Hellinger distance.

In particular, the lemma allows us to upper-bound the statistical distance by upper-bounding the Hellinger distance. Since our goal is to bound the distance to uniform, it is convenient to work with the following definition.

Definition 4.20 (Bhattacharyya Coefficient to Uniform). Let X be a random variable over [M]. The Bhattacharyya coefficient of X to uniform is C(X) = ∑_{x ∈ [M]} √(Pr[X = x]/M).
(In general, the Bhattacharyya coefficient of random variables X and Y is defined to be ∑_x √(Pr[X = x] · Pr[Y = x]); it factors over coordinates when X and Y are independent random variables, so the Bhattacharyya coefficient is well-behaved with respect to products, unlike statistical distance.) By Lemma 4.19, if the Bhattacharyya coefficient C(X) is close to 1, then X is close to uniform in statistical distance. Recall that collision probability behaves similarly: if the collision probability cp(X) is close to 1/M, then X is close to uniform. In fact, by a suitable normalization, we can view the collision probability as the 2-norm of X, and the Bhattacharyya coefficient as the 1/2-norm of X. Let f(x) = M · Pr[X = x]; then cp(X) = E_x[f(x)^2]/M and C(X) = E_x[f(x)^{1/2}], where x is uniform over [M].

We now discuss our approach to proving Theorem 4.10. We want to show that (H, Y) is close to uniform. All we know is that the conditional collision probability cp(Y_i | H, Y_{<i}) is close to 1/M for every block. If all blocks were independent, then the overall collision probability cp(H, Y) would be small, and so (H, Y) would be close to uniform. However, this is not true without independence, since the 2-norm tends to over-weight heavy elements. In contrast, the 1/2-norm does not suffer from this problem. Therefore, our approach is to show that small conditional collision probability implies a large Bhattacharyya coefficient. Formally, we have the following lemma.

Lemma 4.21. Let X = (X_1, . . ., X_T) be jointly distributed random variables over [M_1] × · · · × [M_T]. If the conditional collision probability of each block is small, then the Bhattacharyya coefficient C(X) is correspondingly close to 1.

With this lemma, the proof of Theorem 4.10 is immediate.

We proceed to prove Lemma 4.21. The main idea is to use Hölder's inequality to relate two different norms. We recall Hölder's inequality.

Lemma 4.22 (Hölder's inequality [13]).
• Let F, G be two non-negative functions from [M] to R, and let p, q > 0 satisfy 1/p + 1/q = 1. Let x be a uniformly random index over [M]. We have E_x[F(x) · G(x)] ≤ E_x[F(x)^p]^{1/p} · E_x[G(x)^q]^{1/q}.

• In general, let F_1, . . ., F_n be non-negative functions from [M] to R, and let p_1, . . ., p_n > 0 satisfy 1/p_1 + · · · + 1/p_n = 1. We have E_x[F_1(x) · · · F_n(x)] ≤ ∏_{i=1}^{n} E_x[F_i(x)^{p_i}]^{1/p_i}.

Towards proving Lemma 4.21, we first prove the following lemma relating the collision probability and the Bhattacharyya coefficient of a random variable, which may be of independent interest. The lemma is in fact a special case of Lemma 4.21 with T = 1.

Lemma 4.23. Let X be a random variable over [M] with cp(X) ≤ α/M. Then the Bhattacharyya coefficient of X satisfies C(X) ≥ 1/√α. That is, the Hellinger distance satisfies d_H(X, U_[M])^2 ≤ 1 − 1/√α.

Proof. We use Hölder's inequality to relate the two notions. To do so, we express them using the normalization mentioned before: let f(x) = M · Pr[X = x], so that E_x[f(x)] = 1, E_x[f(x)^2] = M · cp(X) ≤ α, and C(X) = E_x[f(x)^{1/2}]. We now apply Hölder's inequality with F = f^{2/3}, G = f^{1/3}, p = 3, and q = 3/2. We have 1 = E_x[f(x)] ≤ E_x[f(x)^2]^{1/3} · E_x[f(x)^{1/2}]^{2/3} ≤ α^{1/3} · C(X)^{2/3}, which gives C(X) ≥ 1/√α.

Proof of Lemma 4.21. We prove it by induction on T. The base case T = 1 is exactly Lemma 4.23 above. Suppose the lemma is true for T − 1; we show that it is true for T. To apply the induction hypothesis, we consider the conditional random variables (X_2, . . ., X_T | X_1 = x) for every x ∈ [M_1]. For every x ∈ [M_1] and j = 2, . . ., T, we define g_j(x) to be the "normalized" conditional collision probability. By the induction hypothesis, and by definition, these quantities control the Bhattacharyya coefficients of the conditional distributions. We then use Hölder's inequality twice. Let us first summarize the constraints we have: by definition, E_x[f(x) g_j(x)] ≤ α_j for j = 2, . . ., T. Now, we apply the second version of Hölder's inequality with exponents p_1 = 2/(T + 1) and p_j = 1/(T + 1) for j = 2, . . ., T, which gives a bound whose first term remains to be controlled; it remains to lower bound that first term by 1/α_1. We apply Hölder again with F = f^{2/(T+2)}, G = f^{T/(T+2)}, p = T + 2, and q = (T + 2)/(T + 1). Combining the inequalities, we obtain the claimed bound.

Lower bounds
In this section, we provide lower bounds on the entropy needed from the data items. We show that if K is not large enough, then for every hash family H there exists a block K-source X = (X_1, . . ., X_T) such that the hashed sequence Y = (H(X_1), . . ., H(X_T)) does not satisfy the desired closeness requirements to uniform (possibly in conjunction with the hash function H).

Lower bound for statistical distance to uniform distribution
Let us first consider the requirement that the joint distribution (H, Y) be ε-close to uniform. When there is only one block, this is exactly the requirement for a "strong extractor." The lower bound in the extractor literature, due to Radhakrishnan and Ta-Shma [29], shows that K ≥ Ω(M/ε^2) is necessary, which is tight up to a constant factor. Our goal is to show that when hashing T blocks, the value of K required for each block increases by a factor of T. Intuitively, each block will produce some error (i.e., the hashed value is not close to uniform), and the overall error will accumulate over the blocks, so we need to inject more randomness per block to reduce the error. Indeed, we use this intuition to show that K ≥ Ω(MT/ε^2) is necessary for the hashed sequence to be ε-close to uniform, matching the upper bound in Theorem 4.10. Note that the lower bound holds even for a truly random hash family. Formally, we prove the following theorem.
To prove the theorem, we need to find such an X for every hash family H. Following the intuition, we find an X that incurs a certain error on a single block, and take X = (X_1, . . ., X_T) to be T i.i.d. copies of X. More precisely, we first find a K-source X such that for an Ω(1)-fraction of hash functions h ∈ H, h(X) is Ω(ε/√T)-far from uniform. This step is the same as the lower bound proof for extractors [29], which uses the probabilistic method. We pick X to be a random flat K-source, i.e., a uniform distribution over a random set of size K, and show that X satisfies the desired property with nonzero probability. The next step is to measure how the error accumulates over independent blocks. Note that for a fixed hash function h, the hashed sequence (h(X_1), . . ., h(X_T)) consists of T i.i.d. copies of h(X). Reyzin [33] has shown that the statistical distance increases by a factor of √T when we have T independent copies, for small T. However, Reyzin's result only shows an increase up to distance O(δ^{1/3}), where δ is the statistical distance of the original random variables. We improve Reyzin's result to show that the Ω(√T) growth continues until the distance reaches some absolute constant. We then use it to show that the joint distribution (H, Y) is far from uniform.
The following lemma corresponds to the first step.
Lemma 4.25. Let N and M be positive integers, and let ε ∈ (0, 1/4) and δ ∈ (0, 1) be real numbers such that N ≥ M/ε^2. Let H : [N] → [M] be a random hash function from a hash family H. Then there exists an integer K = Ω(δ^2 M/ε^2) and a flat K-source X over [N] such that, with probability at least δ over h ← H, h(X) is ε-far from uniform.

Proof. Let K = αM/ε^2 for some α to be determined later. Let X be a random flat K-source, i.e., the uniform distribution U_S over a uniformly random set S ⊆ [N] of size K. We claim that for every hash function h : [N] → [M], inequality (4.3) holds for some absolute constant c. Let us assume (4.3) and prove the lemma first. Since the claim holds for every hash function h, we may average over h ← H; thus, there exists a flat K-source U_S such that, with probability at least δ over h ← H, h(U_S) is ε-far from uniform. The lemma follows by setting α = min{δ^2/c^2, 1/32}. We proceed to prove (4.3). It suffices to show that for every y ∈ [M], with probability at least 1 − c·√α over a random U_S, the deviation of Pr[h(U_S) = y] from 1/M is at least 4ε/M, where c is another absolute constant. That is,

We have the following bound on this probability, for some absolute constant c′:
Intuitively, the probability in the claim is maximized when the set T has size NL/K, so that L = E_S[|S ∩ T|], and the claim follows by observing that in this case the distribution has deviation Θ(√L) and each possible outcome has probability O(1/√L). The formal proof of the claim is in Appendix A; it proceeds by expressing the probability in terms of binomial coefficients and estimating them using Stirling's formula.

The next step is to measure the increase of statistical distance over independent random variables.

Lemma 4.27. Let X and Y be random variables over [M] such that ∆(X, Y) ≥ ε. Let X = (X_1, . . ., X_T) be T i.i.d. copies of X, and let Y = (Y_1, . . ., Y_T) be T i.i.d. copies of Y. We have ∆(X, Y) ≥ min{ε_0, c · √T · ε}, where ε_0, c are absolute constants.
Proof. We prove the lemma via the following two claims. The first claim reduces the lemma to the special case in which X is a Bernoulli random variable with bias Ω(ε) and Y is a uniform coin. The second claim proves the special case.

For our first claim, we make use of the notion of a randomized function. Recall that with a randomized function, the output f(x) for an input x is a random variable that may take on different values each time f(x) is evaluated.

where ε_0, c are absolute constants independent of ε and T.
Proof. For x ∈ {0, 1}^T, let the weight wt(x) of x be the number of 1's in x. Let S be the subset of {0, 1}^T consisting of strings of small weight. (This choice of S is the main source of improvement in our proof compared to that of Reyzin [33], who instead considers the set of all x with weight at most T/2.) For every x ∈ S, we have

By standard results on large deviations, we have

Combining the above two inequalities, we obtain the claim for some absolute constants c, ε_0, which completes the proof.
Note that applying the same randomized function f to two random variables X and Y cannot increase the statistical distance, i.e., ∆(f(X), f(Y)) ≤ ∆(X, Y). The lemma then follows immediately from the above two claims, applied with f_1, . . ., f_T independent copies of the randomized function defined in Claim 4.28 and with ε_0, c the absolute constants from Claim 4.29.
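As a quick numerical sanity check of the Ω(√T · ε) growth (a small experiment of our own, using the biased-coin special case from Claim 4.28), the statistical distance between T i.i.d. copies of a coin with bias 1/2 + ε and T fair coins can be computed exactly, since both product distributions depend only on the number of 1's:

```python
from math import comb

def product_distance(eps, T):
    """Exact statistical distance between T i.i.d. Bernoulli(1/2 + eps) bits and T fair bits.

    Grouping strings by their weight w reduces the sum over {0,1}^T to a sum
    over the two induced binomial distributions."""
    p, q = 0.5 + eps, 0.5
    total = 0.0
    for w in range(T + 1):
        pw = comb(T, w) * p**w * (1 - p)**(T - w)
        qw = comb(T, w) * q**T
        total += abs(pw - qw)
    return total / 2

for T in (1, 4, 16, 64, 256):
    print(T, product_distance(0.02, T))   # grows roughly like sqrt(T)*eps before saturating
```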
Proof of Theorem 4.24. The absolute constant ε_0 in the theorem is half of the ε_0 in Lemma 4.27. By Lemma 4.25, there is a flat K-source X such that for a 1/2-fraction of hash functions h ∈ H, h(X) is (2ε/(c·√T))-far from uniform. We set X = (X_1, . . ., X_T) to be T independent copies of X. Consider a hash function h such that h(X) is (2ε/(c·√T))-far from uniform. By Lemma 4.27, (h(X_1), . . ., h(X_T)) is 2ε-far from uniform. Note that this holds for a 1/2-fraction of hash functions h. It follows that (H, H(X_1), . . ., H(X_T)) is ε-far from uniform.

Lower bound for small collision probability
In this subsection, we prove lower bounds on the entropy needed per item to ensure that the sequence of hashed values is close to having small collision probability. Since this requirement is less stringent than being close to uniform, less entropy is needed from the source. The interesting setting in applications is to require the hashed sequence (H, Y) = (H, H(X_1), . . ., H(X_T)) to be ε-close to having collision probability O(1/(|H| · M^T)). Recall that in this setting, instead of requiring K ≥ MT/ε^2, K ≥ Ω(MT/ε) is sufficient for 2-universal hash functions (Theorem 4.11), and K ≥ Ω(MT + T·√(M/ε)) is sufficient for 4-wise independent hash functions (Theorem 4.12). The main improvement from 2-universal to 4-wise independent hashing is the better dependency on ε. Indeed, it can be shown that if we use truly random hash functions, we can reduce the dependency on ε to log(1/ε). Since we are now proving lower bounds for arbitrary hash families, we focus on the dependency on M and T. Specifically, our goal is to show that K = Ω(MT) is necessary. More precisely, we show that when K ≪ MT, it is possible for the hashed sequence (H, Y) to be 0.99-far from any distribution that has collision probability less than 100/(|H| · M^T).
We use the same strategy as in the previous subsection to prove this lower bound. Fixing a hash family H, we take T independent copies (X_1, . . ., X_T) of the worst-case X found in Lemma 4.25, and show that (H, H(X_1), . . ., H(X_T)) is far from having small collision probability. The new ingredient is to show that when we have T independent copies and K ≪ MT, then (h(X_1), . . ., h(X_T)) is very far from uniform (say, 0.99-far) for many h ∈ H. We then argue that in this case we cannot reduce the collision probability of (h(X_1), . . ., h(X_T)) by changing a small fraction of the distribution, which implies that the overall distribution (H, Y) is far from any distribution (H', Z) with small collision probability. Formally, we prove the following theorem.

Theorem 4.30. Let N, M, and T be positive integers such that N ≥ MT. Let δ ∈ (0, 1) and α > 1 be real numbers such that α < δ^3 · e^{T/32}/128. Let H : [N] → [M] be a random hash function from a hash family H. There exists an integer K = Ω(δ^2 MT / log(α/δ)) and a block K-source X = (X_1, . . ., X_T) such that (H, H(X_1), . . ., H(X_T)) is (1 − δ)-far from any distribution with collision probability at most α/(|H| · M^T).

Think of α and δ as constants. Then the theorem says that K = Ω(MT) is necessary for the hashed sequence (H, H(X_1), . . ., H(X_T)) to be close to having small collision probability, matching the upper bound in Theorem 4.11. In the previous proof, we used Lemma 4.27 to measure the increase of distance over blocks. However, that lemma can only measure the progress up to some small constant. It is known that if the number of copies T is larger than Ω(1/ε^2), where ε is the statistical distance of the original copies, then the statistical distance goes to 1 exponentially fast. Formally, we use the following lemma.

Lemma 4.31 ([34]). Let X and Y be random variables over [M] such that ∆(X, Y) ≥ ε. Let X = (X_1, . . ., X_T) be T i.i.d. copies of X, and let Y = (Y_1, . . ., Y_T) be T i.i.d. copies of Y. We have ∆(X, Y) ≥ 1 − exp(−Ω(T · ε^2)).

We remark that Lemmas 4.27 and 4.31 are incomparable. In the parameter range of Lemma 4.27, Lemma 4.31 only gives ∆(X, Y) ≥ Ω(T ε^2) instead of Ω(√T · ε). To argue that the overall distribution is far from having small collision probability, we introduce the following notion of nonuniformity.

Definition 4.32. Let X be a random variable over [M] with probability mass function p. X is (δ, β)-nonuniform if for every function q : [M] → R such that 0 ≤ q(x) ≤ p(x) for all x ∈ [M] and ∑_x q(x) ≥ δ, we have ∑_x q(x)^2 > β/M.

Intuitively, X being (δ, β)-nonuniform means that even if we remove a (1 − δ)-fraction of the probability mass of X, the "collision probability" remains greater than β/M. In particular, X is (1 − δ)-far from any random variable Y with cp(Y) ≤ β/M.

Lemma 4.33. Let X be a random variable over [M] that is (1 − δ^2/(4β))-far from uniform. Then X is (δ, β)-nonuniform; in particular, X is (1 − δ)-far from any random variable Y with cp(Y) ≤ β/M.

Proof. Let p be the probability mass function of X, and let q : [M] → R be a function such that 0 ≤ q(x) ≤ p(x) for every x ∈ [M] and ∑_x q(x) ≥ δ.

We are ready to prove Theorem 4.30.

Lower bounds for the distribution of hashed values only
We can extend our lower bounds to the distribution of the hashed sequence Y = (H(X_1), . . ., H(X_T)) alone (without H), for both closeness requirements, at the price of losing the dependency on ε and incurring some dependency on the size of the hash family. Let 2^d = |H| be the size of the hash family. The dependency on d is necessary. Intuitively, the hashed sequence Y contains at most T·m bits of entropy, and the input (H, X_1, . . ., X_T) contains at least d + T·k bits of entropy. When d is large enough, it is possible that all the randomness of the hashed sequence comes from the randomness of the hash family. Indeed, if H is T-wise independent (which is possible with d ≥ T·m), then (H(X_1), . . ., H(X_T)) is uniform whenever X_1, . . ., X_T are all distinct. Therefore, K = Ω(T^2) (independent of M) suffices to make the hashed values close to uniform in this case.

Think of α and δ as constants. Then the theorem says that when the hash function contains d ≤ T/(32 ln 2) − O(1) bits of randomness, K = Ω(MT/d) is necessary for the hashed sequence to be close to uniform. For example, in some typical hashing applications, N = poly(M) and the hash function is 2-universal or O(1)-wise independent. In this case, d = O(log M) and we need K = Ω(MT/log M). (Recall that our upper bound in Theorem 4.11 says that K = O(MT) suffices.)

Proof. We will deduce the theorem from Theorem 4.30. Replacing the parameter α by α · 2^d in Theorem 4.30, we know that there exists an integer K = Ω(δ^2 MT/d · log(α/δ)) and a block K-source X = (X_1, . . ., X_T) such that (H, Y) = (H, H(X_1), . . ., H(X_T)) is (1 − δ)-far from any distribution with collision probability at most α · 2^d/(|H| · M^T) = α/M^T.

One limitation of the above lower bound is that it only works when d ≤ T/(32 ln 2) − O(1). For example, the lower bound cannot be applied when the hash function is T-wise independent. Although d = Ω(T) may not be interesting in practice, for the sake of completeness we provide another simple lower bound to cover this parameter regime.

Theorem 4.36. Let N, M, T be positive integers, and let δ ∈ (0, 1), α > 1, and d > 0 be real numbers. Let H : [N] → [M] be a random hash function from a hash family H of size at most 2^d. Suppose K ≤ N is an integer such that K ≤ (δ^2/(4α · 2^d))^{1/T} · M. Then there exists a block K-source X = (X_1, . . ., X_T) such that Y = (H(X_1), . . ., H(X_T)) is (1 − δ)-far from any distribution Z = (Z_1, . . ., Z_T) with cp(Z) ≤ α/M^T. In particular, Y is (1 − δ)-far from uniform.
Again, think of α and δ as constants. The theorem says that K = Ω(M/2^{d/T}) is necessary for the hashed sequence to be close to uniform. In particular, when d = Θ(T), K = Ω(M) is necessary.

Proof. Let X be any flat K-source, i.e., a uniform distribution over a set of size K. We simply take X = (X_1, . . ., X_T) to be T independent copies of X. Note that Y has support at most as large as that of (H, X). Thus, |supp(Y)| ≤ 2^d · K^T ≤ (δ^2/(4α)) · M^T. Therefore, Y is (1 − δ^2/(4α))-far from uniform. By Lemma 4.33, Y is (1 − δ)-far from any distribution Z = (Z_1, . . ., Z_T) with cp(Z) ≤ α/M^T.

Lower bound for 2-universal hash functions
In this subsection, we show that Theorem 4.11 is almost tight in the following sense. We show that there exist K = Ω(MT/(ε · log(1/ε))), a 2-universal hash family H, and a block K-source X such that (H, Y) is ε-far from having collision probability 100/(|H| · M^T). The improvement over Theorem 4.30 is the almost-tight dependency on ε. Recall that Theorem 4.11 says that for a 2-universal hash family, K = O(MT/ε) suffices; the upper and lower bounds differ by a factor of log(1/ε). In particular, our result for 4-wise independent hash functions (Theorem 4.12) cannot be achieved with 2-universal hash functions. The lower bound can further be extended to the distribution of the hashed sequence Y = (H(X_1), . . ., H(X_T)) alone, as in the previous subsection. Furthermore, since the 2-universal hash family we use has small size, we only pay a factor of O(log M) in the lower bound on K. Formally, we prove the following results.
Basically, the idea is to show that the Markov inequality applied in the proof of Theorem 4.11 (see inequality (4.1)) is tight for a single block. More precisely, we show that there exist a 2-universal hash family H and a K-source X such that with probability ε over h ← H, cp(h(X)) ≥ 1/M + Ω(1/(Kε)). Intuitively, if we take T = Θ((Kε/M) · log(α/ε)) independent copies of such an X, then the collision probability will satisfy cp(h(X_1), . . ., h(X_T)) ≥ (1 + Ω(M/(Kε)))^T/M^T ≥ α/(ε · M^T), and so the overall collision probability is cp(H, Y) ≥ α/(|H| · M^T). Formally, we analyze our construction below using Hellinger distance, and show that the collision probability remains high even after modifying a Θ(ε)-fraction of the distribution.
Proof of Theorem 4.37. Fix a prime power M and ε > 0, and identify [M] with the finite field F of size M. Let t be an integer parameter such that M^{t−1} > 1/ε. Recall that the set H_0 of linear functions {h_a : F^t → F}_{a ∈ F^t}, where h_a(x) = Σ_i a_i·x_i, is 2-universal. Note that picking a random hash function h ← H_0 is equivalent to picking a random vector a ← F^t. Two special properties of H_0 are that (i) when a = 0, the whole domain F^t is sent to 0 ∈ F, and (ii) the size of the hash family |H_0| equals the size of the domain, namely |F^t|. We will use H_0 as a building block in our construction.
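For concreteness, here is a small sketch of the building-block family H_0, restricted to a prime modulus M so that arithmetic mod M implements the field F (the proof allows any prime power; the function names below are ours, for illustration only):

import random

def sample_h0(M, t):
    # Sample h_a from the 2-universal family H_0 of linear maps F^t -> F (M prime).
    a = [random.randrange(M) for _ in range(t)]
    def h(x):
        # x is a length-t tuple with entries in {0, ..., M-1}
        return sum(a_i * x_i for a_i, x_i in zip(a, x)) % M
    return a, h

For distinct inputs x ≠ y, the event h_a(x) = h_a(y) is the event that a is orthogonal to the nonzero vector x − y, which happens with probability exactly 1/M, so the family is 2-universal; and indeed a = 0 maps everything to 0, matching property (i).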
We proceed to construct the hash family H. We partition the domain [N] into several sub-domains and apply a different hash function to each sub-domain. Let s be an integer parameter to be determined later. We require N ≥ s·M^t, and partition [N] into D_0, D_1, . . ., D_s, where each of D_1, . . ., D_s has size M^t and is identified with F^t, and D_0 is the remaining part of [N]. In our construction, the data X will never come from D_0; thus, w.l.o.g., we can assume D_0 is empty. For every i = 1, . . ., s, we use a linear hash function h_{a_i} ∈ H_0 to send D_i to F. Thus, a hash function h ∈ H consists of s linear hash functions (h_{a_1}, . . ., h_{a_s}) and can be described by s vectors a_1, . . ., a_s ∈ F^t. Note that to make H 2-universal, it suffices to pick a_1, . . ., a_s pairwise independently. Specifically, we identify F^t with the finite field F' of size M^t, and pick (a_1, . . ., a_s) by choosing a, b ← F' and outputting (a + α_1·b, a + α_2·b, . . ., a + α_s·b) for some distinct constants α_1, . . ., α_s ∈ F'. It is easy to verify that the resulting family H is indeed 2-universal, and |H| = M^{2t}. We next define a single-block K-source X that makes the Markov inequality (4.1) tight: we simply take X to be the uniform distribution over D_1 ∪ · · · ∪ D_s, so that K = s·M^t. Note that h_{a,b} is bad (in the sense defined below) with probability roughly s/M^t, up to lower-order terms. We set s = 4εM^t ≤ M^t. It follows that with probability at least 2ε over h ← H, the collision probability satisfies cp(h(X)) ≥ 1/M + 1/(4Kε), as we intuitively desired. However, instead of working with the collision probability directly, we need to use the Bhattacharyya coefficient to measure the growth of the distance to uniform (see Definition 4.18). The following claim upper-bounds the Bhattacharyya coefficient of h(X) for bad hash functions h; the proof of the claim is deferred to the end of this section.
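To make the two-seed structure concrete, here is a sketch specialized to t = 1 and prime M, so that the field of size M^t coincides with F and each coefficient "vector" a_i is a single field element (the actual proof needs t with M^{t−1} > 1/ε and arithmetic in the field of size M^t; the code and names are ours):

import random

def sample_H(M, s):
    # One hash function h = (h_{a_1}, ..., h_{a_s}), where a_i = a + alpha_i * b (mod M),
    # specialized to t = 1: sub-domain D_i is a copy of {0, ..., M-1} and h_{a_i}(x) = a_i * x mod M.
    assert 1 <= s <= M          # need s distinct constants alpha_1, ..., alpha_s in the field
    alphas = list(range(s))     # distinct field elements alpha_i
    a, b = random.randrange(M), random.randrange(M)
    seeds = [(a + alpha * b) % M for alpha in alphas]   # pairwise-independent seeds
    def h(i, x):
        # hash element x of sub-domain D_i (0-indexed here)
        return (seeds[i] * x) % M
    return seeds, h

Since (a, b) ↦ (a + α_i·b, a + α_j·b) is a bijection on F × F whenever α_i ≠ α_j, any two of the seeds are uniform and independent, which is what makes the combined family 2-universal.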
Claim 4.39. Suppose h is a bad hash function as defined above. Then the Bhattacharyya coefficient of h(X) satisfies C(h(X)) ≤ 1 − M/(64Kε).
Consider the distribution (h(X_1), . . ., h(X_T)) for a bad hash function h ∈ H. From the above claim, the Bhattacharyya coefficient satisfies C(h(X_1), . . ., h(X_T)) = C(h(X))^T ≤ (1 − M/(64Kε))^T. By Lemma 4.19 and the definition of the Bhattacharyya coefficient, this yields a corresponding lower bound on the distance of (h(X_1), . . ., h(X_T)) from uniform, and by Lemma 4.33 it is far from any distribution with small collision probability. In sum, given M, ε, α, t satisfying the premise of the theorem, we set K = 4εM^t · M^t and have proved that for every N ≥ K and T = Θ((Kε/M)·ln(α/ε)), the conclusion of the theorem holds. It remains to prove Claim 4.39.
Recall that |H| = M^{2t}. Theorem 4.38 follows from Theorem 4.37 by exactly the same argument as in the proof of Theorem 4.35.

Linear probing
We consider data items that come as a block K-source (X_1, . . ., X_{T−1}, X_T), where the item Y = X_T to be inserted is the last block. An immediate application of Theorem 4.10, using just a 2-universal hash family, gives that if K ≥ MT/ε^2, the resulting distribution of the element hashes is ε-close to uniform. The effect of the ε statistical difference on the expected insertion time is at most εT, because the maximum insertion time is T. That is, if we let E_U be the expected time for an insertion when using a truly random hash function, and E_P be the expected time for an insertion using pairwise independent hash functions, we have E_P ≤ E_U + εT. A natural choice is ε = o(1/T), so that the εT term is o(1), giving that K needs to be ω(MT^3) = ω(M^4) in the standard case where T = αM for a constant α ∈ (0, 1) (which we assume henceforth). An alternative interpretation is that with probability 1 − ε, our hash table behaves exactly as though a truly random hash function were used. In some applications a constant ε may be sufficient, in which case K = O(M^2) suffices.
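As a concrete (and purely illustrative) way to compare E_P against E_U empirically, one can insert T = αM keys with a standard pairwise hash of the form ((a·x + b) mod p) mod M, for a prime p larger than the key universe, and measure the average insertion cost; everything below (names, parameters) is our own sketch, not part of the paper:

import random

def avg_linear_probe_cost(keys, M, p=(1 << 61) - 1):
    # Insert the given keys (fewer than M of them, all < p) into a table of size M
    # by linear probing with h(x) = ((a*x + b) % p) % M, and return the mean probe count.
    a, b = random.randrange(1, p), random.randrange(p)
    table = [None] * M
    total = 0
    for x in keys:
        pos, probes = ((a * x + b) % p) % M, 1
        while table[pos] is not None:
            pos, probes = (pos + 1) % M, probes + 1
        table[pos] = x
        total += probes
    return total / len(keys)

# Example: load alpha = 0.5 with keys drawn from some data source of interest.
# print(avg_linear_probe_cost(random.sample(range(1 << 30), 512), 1024))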
Better results can be obtained by applying Lemma 4.8 in conjunction with Theorem 4.11 or Theorem 4.12. In particular, for linear probing, the standard deviation σ of the insertion time is known (see, e.g., [16, p. 52]) and is O(1/(1 − α)^2). With a 2-universal family, as long as K ≥ MT/ε, Theorem 4.11 gives that the resulting hash values are ε-close to a block source with collision probability at most (1 + 2MT/(εK))/M^T. Using this, we apply Lemma 4.8 to bound the expected insertion time in terms of E_U, σ, and ε; choosing ε = o(1/T) gives that E_P and E_U are the same up to lower-order terms when K is ω(M^3). Theorem 4.12 gives a further improvement: for K ≥ MT + √(2MT^2/ε), the analogous bound holds, and choosing ε = o(1/T) now allows K to be only ω(M^2).
In other words, the Rényi entropy needs only to be 2 log M + ω(1) bits when using 4-wise independent hash functions, and 3 log M + ω(1) bits for 2-universal hash functions. These numbers seem quite reasonable for practical situations. We formalize the result for the case of 2-universal hash functions as follows.

Theorem 5.1. Let H be chosen at random from a 2-universal hash family H mapping [N] to [M]. Here α = T/M is the load and σ = O(1/(1 − α)^2) is the standard deviation of the insertion time in the case of truly random hash functions.

Chained hashing
We can follow essentially the same line of argument as in the previous section. Recall that here T elements are hashed into a table of size M = T. Theorem 4.10 again implies that, using just a 2-universal hash family, if K ≥ MT/ε^2 = T^2/ε^2, the resulting distribution of the element hashes is ε-close to uniform. In this case, if we let E_U be the expected maximum load when using a truly random hash function, and E_P be the expected maximum load using a 2-universal hash function, we again have E_P ≤ E_U + εT, and similarly having K be ω(T^4) suffices.
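A quick empirical counterpart (again only an illustrative sketch with a standard ((a·x + b) mod p) mod M hash, not code from the paper):

import random

def chained_max_load(keys, M, p=(1 << 61) - 1):
    # Hash keys (< p) into M buckets with h(x) = ((a*x + b) % p) % M; return the largest bucket size.
    a, b = random.randrange(1, p), random.randrange(p)
    loads = [0] * M
    for x in keys:
        loads[((a * x + b) % p) % M] += 1
    return max(loads)

# Example with M = T: print(chained_max_load(random.sample(range(1 << 30), 1024), 1024))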
Similarly, extending the argument for Theorem 5.1, we deduce an analogous bound on the expected maximum load, in which the o(1) term goes to zero as T → ∞ and σ is the standard deviation of the maximum load in the case of a truly random hash function.
However, here we get a cleaner "high-probability" result (Theorem 5.3) by using Theorem 4.11.

Proof. Set M = T^d. Note that the value of MaxLoad_BA(x, h) can be determined from the hashed sequence (h(x_1), . . ., h(x_T)) ∈ [M]^T alone, and does not otherwise depend on the data sequence x or the hash function h. Thus we can let S ⊆ [M]^T be the set of all sequences of hashed values that produce an allocation with a max load greater than (log log T)/(log d) + c. By Theorem 3.6, we can choose the constant c such that, for sufficiently large T, the probability that (I(x_1), . . ., I(x_T)) lands in S is at most 1/T^{3γ}, where I is a truly random hash function mapping [N] to [T]^d. (Small values of T can be handled by increasing the constant c in the theorem.)

Theorem 5.4. For every d ≥ 2 and γ > 0, there is a constant c such that the following holds. Let H be chosen at random from a 4-wise independent hash family H mapping [N] to [T]^d. For every block K-source X taking values in [N]^T with K ≥ T^{d+1} + 2T^{(d+2+γ)/2}, we have the same conclusion as in Theorem 5.3.

Proof. The proof is identical to that of Theorem 5.3, except that we use Theorem 4.12 instead of Theorem 4.11 and set K = T^{d+1} + 2T^{(d+2+γ)/2} = MT + √(2MT^2/ε).
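For intuition about the process that MaxLoad_BA describes, here is a small illustrative simulation of d-choice allocation (each item goes to the least loaded of its d hashed choices); the hash functions and names are ours, chosen only for the sketch:

import random

def balanced_alloc_max_load(keys, T_bins, d, p=(1 << 61) - 1):
    # Each key gets d bin choices from d independently seeded hashes h_j(x) = ((a_j*x + b_j) % p) % T_bins,
    # i.e., a hashed value in [T_bins]^d; the key is placed in the least-loaded of its choices.
    seeds = [(random.randrange(1, p), random.randrange(p)) for _ in range(d)]
    loads = [0] * T_bins
    for x in keys:
        choices = [((a * x + b) % p) % T_bins for a, b in seeds]
        loads[min(choices, key=lambda j: loads[j])] += 1
    return max(loads)

# Example: print(balanced_alloc_max_load(random.sample(range(1 << 30), 4096), 4096, d=2))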

Bloom filters
We consider the following setting: our block source takes on values in [N]^{T+1}, producing a collection (x_1, . . ., x_T, y) = (x, y), where x constitutes the set represented by the filter, and y represents an additional data item that will not be equal to any data item of x (with high probability). We first take advantage of a result of [19], which reduces the number of required hash functions to two (Theorem 5.5). The restriction to prime integers in that statement is not strictly necessary in general; for more complete statements of when two truly random hash functions suffice, see [19].
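The way this two-hash-function idea is usually implemented is to derive the k probe positions as g_i(x) = h_1(x) + i·h_2(x) mod M; the sketch below is our own illustration of that pattern (with ad-hoc hash choices), not the precise construction or conditions analyzed in [19]:

import random

class TwoHashBloomFilter:
    # Bloom filter whose k indices per item are derived from two hash values:
    # g_i(x) = (h1(x) + i*h2(x)) % m for i = 0, ..., k-1.
    def __init__(self, m, k, p=(1 << 61) - 1):
        self.m, self.k, self.p = m, k, p
        self.bits = bytearray(m)
        self.h1 = (random.randrange(1, p), random.randrange(p))
        self.h2 = (random.randrange(1, p), random.randrange(p))

    def _indices(self, x):
        (a1, b1), (a2, b2) = self.h1, self.h2
        u = ((a1 * x + b1) % self.p) % self.m
        v = ((a2 * x + b2) % self.p) % self.m
        return [(u + i * v) % self.m for i in range(self.k)]

    def add(self, x):
        for j in self._indices(x):
            self.bits[j] = 1

    def might_contain(self, x):
        return all(self.bits[j] for j in self._indices(x))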
If we allow the false positive probability to increase by some ε > 0 over that for truly random hash functions, we can use Theorem 4.10 to immediately obtain a parallel to Theorem 5.5. If we set ε = o(1), then we obtain the same asymptotic false positive probabilities as with truly random hash functions. When T = Θ(M), the Rényi entropy per block needs only to be 3 log M + ω(1) bits for 2-universal hash functions.
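For reference, the idealized false-positive probability being matched is the classical estimate for a Bloom filter with truly random hash functions; in generic notation (an array of m bits, n inserted items, k hash functions, not necessarily the M and T of this section),

$$
\Pr[\text{false positive}] \approx \Bigl(1-\bigl(1-\tfrac{1}{m}\bigr)^{kn}\Bigr)^{k} \approx \bigl(1-e^{-kn/m}\bigr)^{k}.
$$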

Alternative approaches
The results we have described in Section 5 rely on very general arguments, referring to the collision probability of the entire sequence of hashed data values. We suggest, however, that it may prove useful in the future to view the results of hashing block sources in this paper as a collection of tools that can be applied in various ways to specific applications. For example, here we present a variant of Theorems 4.11 and 4.12, asserting that the hashed values are close to a block source with bounded collision probability per block, which may yield improved results in some cases.

Theorem 6.1. Let H : [N] → [M] be a random hash function from a 2-universal family H. For every block K-source (X_1, . . ., X_T) and every ε > 0, the random variable Y = (H(X_1), . . ., H(X_T)) is ε-close to a block source Z with collision probability 1/M + T/(εK) per block.

Theorem 6.2. Let H : [N] → [M] be a random hash function from a 4-wise independent family H. For every block K-source (X_1, . . ., X_T) and every ε > 0, the random variable Y = (H(X_1), . . ., H(X_T)) is ε-close to a block source Z with collision probability 1/M + 1/K + (2T/(εM))·(1/K) per block.

Theorems 6.1 and 6.2 can be proved in a similar way to Theorems 4.11 and 4.12; instead of applying Markov's/Chebyshev's inequality to the whole sequence once, here we apply the inequalities to each block to achieve the stronger conclusion.
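To see where the per-block bound in Theorem 6.1 comes from, a sketch of the standard argument (the paper's numbered lemmas are not reproduced here): for each block i, conditioned on the previous blocks, 2-universality gives

$$
\mathbb{E}_{H}\bigl[\mathrm{cp}\bigl(H(X_i)\mid X_1,\dots,X_{i-1}\bigr)\bigr]\;\le\;\frac{1}{M}+\mathrm{cp}\bigl(X_i\mid X_1,\dots,X_{i-1}\bigr)\;\le\;\frac{1}{M}+\frac{1}{K},
$$

so by Markov's inequality the excess of this collision probability over 1/M exceeds T/(εK) with probability at most ε/T, and a union bound over the T blocks leaves total failure probability at most ε, which is the shape of Theorem 6.1.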
We sketch an example of how these results can be applied to more specific arguments for an application. In the standard layered induction argument for balanced allocations [2], the following key step is used. Suppose that there are at most β_i·T buckets with load at least i throughout the process. Then (using truly random hash functions) the probability that a data item with d choices lands in a bin with i or more balls already present is bounded above by (β_i)^d. When using 2-universal hash functions, we can bound this probability, but with slightly weaker results. The choices for a data item correspond to the hash of one of the blocks from the input block source. Let S be the set, of relative size at most (β_i)^d, of possible hash values for the item's choices that would place the item in a bin with i or more balls. We can bound the probability that the item hashes to a value in S by bounding the collision probability per block (via Theorem 6.1) and applying Lemma 4.8 with f equal to the characteristic function of S. We have applied this technique to generalize the standard layered induction proof of [2] to this setting. This approach turns out to require slightly less entropy from the source for 2-universal hash functions than Theorem 5.3, but the loss incurred in applying Lemma 4.8 means that the analysis only works for d ≥ 3 choices and that the maximum load changes by a constant factor (although the O(log log T) behavior is still apparent). We omit the details.

Conclusion
We have started to build a link between previous work on randomness extraction and the practical performance of simple hash functions, specifically 2-universal hash functions. In the future, we hope that there will be a collaboration between theory and systems researchers aimed at fully understanding the behavior of hashing in practice. Indeed, while our view of data as coming from a block source is a natural initial suggestion, theory-systems interaction could lead to more refined and realistic models for real-life data (and in particular, provide estimates for the amount of entropy in the data). A complementary direction is to show that hash functions used in practice (such as those based on cryptographic functions, which may not even be 2-universal) behave similarly to truly random hash functions for these data models. Some results in this direction can be found in [12].

We use the following bound on binomial coefficients, which can be derived from Stirling's formula.

Theorem 3.4 ([15]). Let H be a truly random hash function mapping [N] to [T]. For every sequence x ∈ [N]^T of distinct data items, we have E[MaxLoad_CH(x, H)] = (1 + o(1))·(log T)/(log log T), and there is a function g(T) = o(1) such that

Lemma 4.16. Let H : [N] → [M] be a random hash function from a 4-wise independent family H, and let X be a random variable taking values in [N] with cp(X) ≤ 1/K. Then we have a corresponding bound on Var_{h←H}[cp(h(X))].

Definition 4.18 (Hellinger distance). Let X and Y be two random variables taking values in [M]. The Hellinger distance between X and Y is d(X, Y) = (1 − C(X, Y))^{1/2}, where C(X, Y) = Σ_{m∈[M]} (Pr[X = m]·Pr[Y = m])^{1/2} is the Bhattacharyya coefficient of X and Y.
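Two standard facts about these quantities (under the normalization above) are what drive the proof of Theorem 4.37: the Bhattacharyya coefficient is multiplicative over independent coordinates, and statistical distance is at least the squared Hellinger distance, i.e.,

$$
C\bigl((W_1,\dots,W_T),\,(Z_1,\dots,Z_T)\bigr)=\prod_{i=1}^{T} C(W_i,Z_i)
\qquad\text{and}\qquad
\mathrm{SD}(W,Z)\;\ge\;1-C(W,Z),
$$

for independent coordinates W_i and Z_i. Thus a per-block bound such as C(h(X)) ≤ 1 − M/(64Kε) (as in Claim 4.39, reading C(h(X)) as the coefficient between h(X) and the uniform distribution) compounds to at most (1 − M/(64Kε))^T ≤ e^{−MT/(64Kε)} over T independent blocks, so the statistical distance from uniform approaches 1 once T = Ω((Kε/M)·log(α/ε)).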
D_1 ∪ · · · ∪ D_s, and so K = s·M^t. Consider a hash function h_{a,b} ∈ H. If all the coefficient vectors a_i are nonzero and distinct, then h_{a,b}(X) is the uniform distribution. If exactly one a_i = 0, then h_{a,b} sends M^t + (s − 1)·M^{t−1} elements of [N] to 0, and (s − 1)·M^{t−1} elements to each nonzero y ∈ F. Let us call such h_{a,b} bad hash functions. Thus, if h_{a,b} is bad, then cp(h_{a,b}(X)) ≥ 1/M + 1/(4Kε).

[N] to [M] = [T]^d and x is an arbitrary sequence of distinct data items. We are interested in the quantity Pr[MaxLoad_BA(X, H) > (log log T)/(log d) + c] = Pr[(H(X_1), . . ., H(X_T)) ∈ S]. Set ε = 1/(2T^γ) and K = 2T^{d+1+γ} = MT/ε. By Theorem 4.11, (H(X_1), . . ., H(X_T)) is ε-close to a random variable Z with collision probability at most (1 + 2MT/(εK))/M^T = 3/M^T. Thus, applying Lemma 4.8 with f the characteristic function of S and µ = E[f(U_{[M]^T})] ≤ 1/T^{3γ}, we have Pr[(H(X_1), . . ., H(X_T)) ∈ S] ≤ Pr[Z ∈ S] + ε
L, as desired. Note that when N > 2K, such a T exists. Finally, observe that β^2 < L implies R ≥ 1, and f(T)/f(T + 1) = ((T − R + 1)(N − T))/((T + 1)(N − T − K + R)). It follows that f(T) is increasing when T ≤ 2R and decreasing when T ≥ N − 2K + 2R. Therefore, f(T) ≤ f(2R) = O(1/L) for T ≤ 2R, and f(T) ≤ f(N − 2K + 2R) = O(1/L) for T ≥ N − 2K + 2R, which completes the proof.

H is s-universal if for every sequence of distinct elements x_1, . . ., x_s ∈ [N], we have Pr[H(x_1) = · · · = H(x_s)] ≤ 1/M^{s−1}. The description size of H ∈ H is the number of bits needed to describe H, which is simply log |H|. For a hash family H mapping [N] → [M] and k ∈ N, we define H^k to be the family mapping [N] → [M]^k consisting of the functions of the form h(x) = (h_1(x), . . ., h_k(x)), where each h_i ∈ H. Observe that if H is s-wise independent (resp., s-universal), then so is H^k. However, the description size and computation time for functions in H^k are k times larger than for H.
E[f(X)] is also denoted by E_{x←X}[f(x)]. For a finite set S, U_S denotes a random variable uniformly distributed on S.

Hashing. Let H be a family (multiset) of hash functions h : [N] → [M], and let H be uniformly distributed over H. We use h ← H to denote that h is sampled according to the distribution H. We say that H is a truly random family if H is the set of all functions mapping [N] to [M], i.e., the N random variables {H(x)}_{x∈[N]} are independent and uniformly distributed over [M]. For s ∈ N, H is s-wise independent (a.k.a. strongly s-universal [42]) if for every sequence of distinct elements x_1, . . ., x_s ∈ [N], the random variables H(x_1), . . ., H(x_s) are independent and uniformly distributed over [M].
We use Lemmas 4.14 and 4.15 below to formalize the above two steps.
Lemma 4.14. Let (H, Y) = (H, Y_1, . . ., Y_T) be jointly distributed random variables over H × [M]^T such that, with probability at least 1 − ε over (h, y) ← (H, Y), the average conditional collision probability (1/T)·Σ_{i=1}^{T} cp(Y_i | h, y_1, . . ., y_{i−1}) is bounded as stated.

This holds for the stated fraction of hash functions h; by the first statement of Lemma 4.34 below, this implies that (H, Y) is (1 − δ)-far from any distribution (H', Z) with collision probability α/(|H|·M^T).

Lemma 4.34. Let (H, Y) be a joint distribution over H × [M] such that the marginal distribution H is uniform over H, and let ε, δ, α be positive real numbers.