
Estimating Event-specific Counts in Streaming Data

By Arun Jagota, Oct 2022



The simple and compelling CountMin Sketch


Imagine a stream of incoming symbols a, b, a, c, a, …. We’d like to know how many times a certain symbol has arrived.

This problem has many uses. To calculate the number of times a certain query was made to Google, a certain video watched on YouTube, or a certain song purchased from iTunes. And many others.

We can solve this problem using a hashmap whose keys are the distinct symbols seen in the stream thus far and whose values are their frequencies. Any symbol’s count may then be looked up from this hashmap.
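As a minimal sketch of this exact-counting baseline in Python (the stream contents are illustrative):

from collections import Counter

stream = ["a", "b", "a", "c", "a"]

counts = Counter()      # keys: distinct symbols seen; values: their frequencies
for symbol in stream:
    counts[symbol] += 1

print(counts["a"])      # 3 -- exact, but memory grows with the number of distinct symbols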

This approach consumes a lot of memory when there are billions of distinct symbols.

If we are willing to sacrifice some accuracy in our counts, in a way that will get clarified as we proceed through this post, we can solve this problem using an extremely compact data structure. Moreover, the data structure will be updated in a streaming fashion.

Okay, let’s start getting into it. First, the table of contents shows what we will cover.

Single Hash Function
Randomizing the Hash Function
Example
Very High-Dimensional Universes
Sparse Representations
Multiple Hash Functions
Example
Estimated Symbol’s Count
Example
Why Use Multiple Hash Functions?
Choosing The Hash Functions
Running Time
Mergeability
Design Choices And Tradeoffs: A Thought Experiment
Very High-Dimensional Use Cases
Text Documents As Symbols
Configuring The CountMinSketch On Text Documents
Count Estimation Example
The Expected Effects Of Our Configuration Choices
Summary
Further Reading

Single Hash Function

Let’s start from the hashmap approach and “evolve our way” to the CountMinSketch. For illustration, as we go along, we will use a running scenario in which the symbols are 32-bit binary vectors.

The hashmap approach uses a map C of counters. Initially, the map is empty. It’s as if all symbols in the universe have 0 counts. When a new symbol v arrives in the stream, we increment C[v] by 1. We look up the frequency of any symbol v at any time as C[v].

This approach’s memory requirement is the number of distinct symbols in the stream, which in the worst case is 2³². This may be an issue if the stream has already had many distinct arrivals, say billions. Even if the stream thus far has received only a modest number of distinct symbols, we cannot guarantee that C’s size will not blow up eventually.

Now let’s start generalizing, with the intent of preventing C from blowing up.

Let h(v) denote a particular hash function we apply to a symbol v. In the description of the previous paragraph, let’s replace every occurrence of C[v] with C[h(v)]. The hashmap approach corresponds to using the identity hash function h(v) = v.

This is where things start getting interesting. Now let’s use a hash function whose range is much smaller than the size of the universe. For example, h(v) returns the first 16 bits of v. The range of h(v) then has size 2¹⁶, about 65K. So C will never need more than 65K counters, which is far fewer than 2³².

We must be paying a price. We are: we can now have collisions. Say two symbols u and v have the same hash value, i.e., h(u) equals h(v). The counts of u and v are inter-mixed into the same counter C[h(u)]. That said, while we cannot discern either u’s or v’s true count, we can say with certainty that C[h(u)] is an upper bound on both. For example, should C[h(u)] be 0, we know with certainty that neither u nor v ever arrived in the stream.

To summarize

To better bound the number of counters, we use a hash function whose range is constricted. While this potentially sacrifices accuracy, our estimated counts remain upper bounds.
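As a minimal sketch of this constricted-range counter in Python, with “extract the first 16 bits” as the hash (the function names here are illustrative, not from any particular library):

NUM_COUNTERS = 1 << 16          # the range of h has size 2^16, not 2^32

def h(v):
    # Extract the first (most significant) 16 bits of a 32-bit value.
    return v >> 16

C = [0] * NUM_COUNTERS

def update(v):
    C[h(v)] += 1

def count_upper_bound(v):
    return C[h(v)]              # exact if v is collision-free, otherwise an overestimate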

Randomizing the Hash Function

The hash function “extract the first 16 bits” is biased towards the first sixteen bit positions. A simple randomization approach removes this bias.

During the initialization phase, first generate a random 32-bit binary number in which exactly 16 bits are 1-valued. (The randomness is in the positions of these 1-values.) Record this number and use it as a mask when computing the hash value of any symbol.

Let’s illustrate this in the example below.

Example

Here we have restricted the universe to 8-bit numbers and used a mask with exactly four 1-bits, so the hashed value is a 4-bit number: it collects value’s bits at the positions where mask is 1.

mask  0 1 1 0 0 1 0 1
value 1 1 1 0 0 0 0 1
hash    1 1     0   1

Very High-Dimensional Universes

In our running example, the binary vectors were 32 bits long. There are use cases in which the bit vectors are much longer, millions or billions of bits. We will cover them in a later section. The CountMinSketch is especially appealing there.

Sparse Representations

To this point, we have implicitly assumed that C is an array indexed by the chosen hash function’s range. Let m denote the size of this range. In many use cases, the hash function we’d like to use may have a very large range; that is, m is very large, possibly in the billions.

Even in low-dimensional universes, the chosen hash function may have a very large range. We’ve seen an example earlier, h(v) = v in which m is 2³² on the universe of 32-bit binary vectors. Clearly, for a higher-dimensional universe, this becomes even more likely.

A natural representation of C in such cases is as a hashmap. C’s keys are the values of h(v) observed in the stream thus far. C[h(v)] is, as usual, h(v)’s count.

This representation is very compact when the number of distinct values of h(v) observed in the stream is much smaller than m. This is often the case in practice.
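A minimal sketch of this sparse variant, again with the illustrative “first 16 bits” hash:

from collections import defaultdict

def h(v):
    return v >> 16      # "extract the first 16 bits", as before

C = defaultdict(int)    # a counter exists only for hash values seen so far

def update(v):
    C[h(v)] += 1        # memory grows with the number of distinct h(v), not with m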

Multiple Hash Functions

We’ll use multiple hash functions h1, …, hn, all with the same range 1, 2, …, m. We’ll generalize our map of counters to a matrix of counters: the rows index the hash functions, and the columns index the hash values. C[i, hi(v)] will store the counter that the ith hash function maps the symbol v to.

On the arrival of a symbol v into the stream, we increment C[i][hi(v)] by 1 for every i from 1 to n, as in the sketch below and the example that follows.
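As a minimal sketch in Python (the class and names are illustrative, not a library API; indices are 0-based in code):

class CountMinSketch:
    def __init__(self, hash_fns, m):
        self.hash_fns = hash_fns                # h1, ..., hn, each mapping a symbol to 0..m-1
        self.C = [[0] * m for _ in hash_fns]    # n x m matrix of counters

    def update(self, v):
        # On the arrival of v, increment one counter per hash function.
        for i, h in enumerate(self.hash_fns):
            self.C[i][h(v)] += 1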

Example

For convenience, we will use a universe that is a higher-base variant of our running one.

Our symbols will be 3-digit numbers in base 4. That is, v = v1 v2 v3 where each vi is 0, 1, 2, or 3. We will use three hash functions h1(v), h2(v), and h3(v), one per digit. hi(v) will equal vi, the value of v’s ith digit. For example, h2(132) equals 3, its second digit.

C is a 3×4 matrix since we have three hash functions, each hashing into the same four values.

Say the symbols 011 and 232 arrive four and seven times respectively into the stream. In the end, regardless of the order in which they arrive, C is

  0 1 2 3
1 4 0 7 0
2 0 4 0 7
3 0 4 7 0

C’s rows index digit positions and its columns index digit values.

Let’s explain C[2][1]. It is at least 4 because 011 occurred four times in the stream and its second digit is 1. It is no more than 4 because there are no collisions since the only other symbol that occurred in the stream, 232, has a different value at the second position.

Estimated Symbol’s Count

Recall that for any symbol v, we’d like to estimate the number of times it has arrived in the stream till now.

When using a single hash function, the symbol’s estimated count was simply C[h(v)]. When we use multiple hash functions, the occurrences of v influence n counters C[i, hi(v)] for i ranging from 1 to n. How do we go from these n counts to a single estimate?

Recall that in a previous section we noted that, for any fixed hash function h, C[h(v)] is an upper bound on v’s true count. In the setting of n hash functions this means that, for i ranging from 1 to n, every C[i, hi(v)] is an upper bound on v’s true count. Therefore, the smallest of these upper bounds serves as the best estimate of this true count. That is, we have

CountUB(v): The minimum of C[i][hi(v)] for i from 1 to n
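A hedged sketch of this query, assuming the illustrative CountMinSketch class from earlier:

def estimate(cms, v):
    # CountUB(v): the minimum of C[i][hi(v)] over i = 1..n.
    return min(cms.C[i][h(v)] for i, h in enumerate(cms.hash_fns))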

Example

Now let’s calculate count estimates for a couple of numbers in our universe. One will turn out to be an exact count, the other a significant overestimate.

Consider 232. CountUB(232) is the minimum of the counts in the three cells C[1][2], C[2][3], and C[3][2] of the matrix below.

  0 1 2 3
1 4 0 7 0
2 0 4 0 7
3 0 4 7 0

First, note that each of these cells is collision-free. It records only the number of occurrences of 232 in the stream. 011, the only other symbol the stream receives, does not influence any of these cells. This is because the value sets in the two symbols — {2, 3} and {0, 1} respectively — are disjoint. Consequently, the minimum of these three cells, 7, is exactly 232’s true count.

Next, consider 212. CountUB(212) is the minimum of the counts in the three cells C[1][2], C[2][1], and C[3][2] of the matrix below.

  0 1 2 3
1 4 0 7 0
2 0 4 0 7
3 0 4 7 0

CountUB(212) is 4, whereas 212’s true count is zero. All three cells have collisions relative to 212: it collides with 232 in the first and third positions and with 011 in the second position.
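Putting the pieces together, here is how this toy base-4 example might be reproduced with the illustrative sketches above (symbols passed as digit strings):

def digit_hash(i):
    # hi(v): the ith base-4 digit of a symbol given as a 3-character string.
    return lambda v: int(v[i])

cms = CountMinSketch([digit_hash(i) for i in range(3)], m=4)

for _ in range(4):
    cms.update("011")
for _ in range(7):
    cms.update("232")

print(estimate(cms, "232"))     # 7 -- all three cells are collision-free, so exact
print(estimate(cms, "212"))     # 4 -- every cell collides, so an overestimate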

Why Use Multiple Hash Functions?

Instead of using multiple hash functions, why not suitably expand the range of a single one? Consider the hash function “extract first 16 bits” on our running universe example. It uses ~65K counters. Say we add a second hash function “extract the next 16 bits”. Now the two combined use ~130K counters. We could use a single hash function “extract the first 17 bits” and get the same number of counters.

The intuition for using multiple hash functions revolves around the use of the minimum operation in estimating a symbol’s count. For example, for an estimated symbol’s count to be exact, it suffices for just one cell involved in the minimum operation to be collision-free.

The benefits of using multiple hash functions, and in particular in the use of the minimum operation, will come out more clearly in a later section on very high-dimensional use cases.

Choosing The Hash Functions

The hash functions should be independent. The idea is to minimize collisions; that is, to diversify.

As an extreme example of the opposite type, consider when all the hash functions are identical. We get none of the benefits of using multiple hash functions, yet incur all of the costs, in additional memory and computation requirements.

Here is a concrete example of four sensible, independent hash functions on our universe of 32-bit numbers: h1, h2, h3, and h4 extract the first, second, third, and fourth chunks of 8 bits respectively.

Running Time

Assuming a hash function may be evaluated in constant time, finding and incrementing the appropriate counters in the counters matrix C runs in time proportional to n, the number of hash functions. Computing any symbol’s estimated count also runs in time proportional to n as it involves looking up n cells and computing their minimum.

Mergeability

This is the property that merging sketches created individually from two streams yields the same sketch as one would get from a single stream that is the merger of the two.

Mergeability is very useful in distributed settings. It allows sketches to be built from disparate streams in a distributed fashion, then merged as needed to estimate global counts, i.e., as if from the merger of all the streams.

CountMinSketch is mergeable so long as all the sketches use the same hash functions.

This may be expressed as follows:

C(S1) + C(S2) = C(S1 + S2)

Here C(S) denotes the sketch constructed from stream S, the ‘+’ in the LHS represents the sketch merge operation, and the ‘+’ in the RHS is the stream merge operation.

Merging two CountMinSketches is simple: just add up their matrices component-wise. This is depicted below.

C[i,j] = C1[i,j] + C2[i,j] for all i, j

C is the merger of C1 and C2.
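A minimal sketch of the merge, assuming the illustrative CountMinSketch class from earlier and that both sketches share the same hash functions and dimensions:

def merge(c1, c2):
    # Requires that c1 and c2 were built with identical hash functions.
    merged = CountMinSketch(c1.hash_fns, m=len(c1.C[0]))
    for i in range(len(c1.C)):
        for j in range(len(c1.C[i])):
            merged.C[i][j] = c1.C[i][j] + c2.C[i][j]
    return merged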

Design Choices And Tradeoffs: A Thought Experiment

How do the number of hash functions and the sizes of their ranges affect various tradeoffs?

The following thought experiment will help us think through such issues.

Our universe will be 32-bit numbers. We will examine various configurations of n = 1, 2, 4, 8, …, 32.

Configuration n will use n hash functions h1, …, hn. hi will extract the ith chunk of 32/n bits from a 32-bit number.

Configuration n=1 uses a single hash function h(v) = v. At the other extreme, configuration n=32 uses 32 hash functions hi(v) = vi, where i ranges from 1 to n.

The n=1 configuration delivers fully accurate counts. However, in the worst case, it can need 2³² counters. At the other extreme, n=32, we need only 32×2 = 64 counters. Obviously, the estimated symbol counts can then be quite inaccurate.

It’s clear that as n increases, the memory bound improves exponentially. This is because whereas the number of rows in C increases linearly with n, the number of columns decreases exponentially.
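To make this concrete, a quick computation of the counter budget n · 2^(32/n) for each configuration:

for n in [1, 2, 4, 8, 16, 32]:
    print(n, n * 2 ** (32 // n))    # n rows, 2^(32/n) columns

# n=1: 4294967296   n=2: 131072   n=4: 1024   n=8: 128   n=16: 64   n=32: 64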

It’s less clear how accuracy degrades.

Somewhere between these two extremes may be a sweet spot for the particular tradeoff of memory requirements versus accuracy that the user deems acceptable. The nature of the data arriving in the stream obviously affects this tradeoff as well.

Very High-Dimensional Use Cases

Text Documents As Symbols

Say our ‘symbols’ are text documents arriving into the stream. For instance, tweets arriving into the Twitter stream. Say for each document, we’d like to estimate the number of times it has been observed in the stream.

Following common practice, let’s represent a document by a suitable vector in the vector space model. The vector’s dimensionality is the number of terms in the lexicon, often in the millions. The ith component represents the ith term in the lexicon.

We may set the value of the ith component of a document’s vector to the frequency, in the document, of the ith lexicon term. Or a binarized version of this frequency. Or a more advanced variant that adjusts this frequency with the term’s IDF.

Okay so now we have encoded documents as vectors of the same fixed dimensionality. We now seek to estimate the counts of distinct vectors observed in the stream.

Note that distinct vector counts are not necessarily distinct document counts. This is because multiple documents may have the same vector encoding.

This can be a good thing. By suitably choosing a document’s vector encoding, we can ignore minor differences we may not care about.

Configuring The CountMinSketch On Text Documents

What is a sensible hash function in this setting? Here is one: the so-called bit-sampling hash function hi(v) = vi, which we encountered earlier in this post. It simply extracts v’s ith component, which in our setting corresponds to a particular term in the lexicon.

The value vi depends on our vector encoding. We can make it the term’s occurrence indicator (boolean), or its frequency in the document, or something more elaborate. Our choice implicitly defines what we are counting. For example, when v is a boolean vector encoding, the hash function h_data(v) returns true or false according to whether the word data appears in the document encoded by v.

In the rest of this section, we will limit ourselves to bit-sampling hash functions that are boolean.

Next question: should we use multiple hash functions? If we can afford to use as many hash functions as there are terms in the lexicon (which could be in the millions), then yes, that would be good. If not, we might first pick a random subset of the lexicon of the size we can afford and use the hash functions that correspond to it, as sketched below.
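An illustrative sketch of this setup, reusing the hypothetical CountMinSketch class from earlier, with documents represented as sets of words (the lexicon and budget are toy values):

import random

lexicon = ["data", "the", "of", "mining", "computer"]   # a toy lexicon

def term_hash(term):
    # Boolean bit-sampling hash: 1 if the term occurs in the document, else 0.
    return lambda doc: 1 if term in doc else 0

budget = 3                                  # how many hash functions we can afford
tracked = random.sample(lexicon, budget)    # a random subset of the lexicon
cms = CountMinSketch([term_hash(t) for t in tracked], m=2)

cms.update({"data", "of"})                  # a document, represented as a set of words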

Count Estimation Example

What would CountUB(v) be for a particular v? Let’s illustrate with a toy example. An interesting discussion will flow out of this example.

In this example, for ease of reading, we will use “estimating the frequency of the document” as short for “estimating the frequency of the document’s vector encoding”.

Say our lexicon comprises exactly five words: data, the, of, mining, and computer. Say we want to estimate the frequency of the document { data, of } in the stream.

First, let’s identify the particular counters we need in this estimation. These are C[data, 1], C[of, 1], C[the, 0], C[mining, 0], and C[computer, 0]. The first two count the number of documents in the stream that contain the words data and of respectively. The next three count the number of documents in the stream that do not contain the words the, mining, and computer respectively. We take the minimum of these five counters and return that as our estimate.

Now the informal discussion. It will help to imagine that the lexicon is all the distinct terms appearing in Wikipedia articles, and that the stream’s documents are drawn from this corpus.

We would expect C[data, 1] to be much smaller than C[of, 1] since of is a much more common word than data. We would expect C[the, 0] to be much smaller than C[mining, 0] or C[computer, 0] since most Wikipedia articles have the word the in them. Not so for mining or computer.

So we would expect that the estimated frequency of our document in the stream will be the minimum of C[data, 1] and C[the, 0]. In other words, the smaller of: “the number of Wikipedia articles containing the word data” and “the number of Wikipedia articles not containing the word the”.

The Expected Effects Of Our Configuration Choices

It’s intuitive that as we increase the number of hash functions, i.e. the number of lexicon terms whose presence and absence counts we track, the estimation accuracy should improve. The larger the number of lexicon words we track, the closer we can expect the minimum over their presence and absence counts in a particular document to approach the true count.

Summary

This post has covered the problem of estimating the number of occurrences of a certain symbol in a possibly unbounded stream. It has focused on the CountMinSketch, a data structure with accompanying algorithms that is especially attractive in the streaming setting. It operates in a single pass over the data and can be configured to use as little or as much memory as we want, the latter for improved accuracy or for very high-dimensional use cases.

Further Reading

1. Count–min sketch, Wikipedia: https://en.wikipedia.org/wiki/Count%E2%80%93min_sketch
2. CS168 lecture notes, Stanford: https://web.stanford.edu/class/cs168/l/l2.pdf

