Estimating Approximate Frequent Values Using Space-Saving

Space-Saving is a space and time efficient way of estimating frequent values in data.

In this Topic:

Overview

Snowflake provides an implementation of the Space-Saving algorithm by Metwally, Agrawal and Abbadi. It can be accessed using the APPROX_TOP_K family of functions. APPROX_TOP_K_COMBINE uses the parallel Space-Saving algorithm outlined by Cafaro, Pulimeno and Tempesta.

The error of the algorithm heavily depends upon how skewed the data is, and the number of counters used in the algorithm. As data becomes more skewed, or more counters are used, the output will be more accurate.

SQL Functions

The following Aggregate Functions are provided for using Space-Saving to estimate frequent values:

  • APPROX_TOP_K: Returns an approximation of frequent values in the input.
  • APPROX_TOP_K_ACCUMULATE: Skips the final estimation step and returns the Space-Saving state at the end of an aggregation.
  • APPROX_TOP_K_COMBINE: Combines (i.e. merges) input states into a single output state.
  • APPROX_TOP_K_ESTIMATE: Computes a cardinality estimate of a Space-Saving state produced by APPROX_TOP_K_ACCUMULATE and APPROX_TOP_K_COMBINE.

Implementation Notes

Each counter in our implementation tracks an item and its frequency. Notably, our implementation does not track the epsilon values of counters, as they are only useful for giving guarantees about the output of the algorithm, they are not used for the algorithm itself.

The maximum number of counters is set to 100 thousand. In this case, there are 100 thousand counters stored in memory, but only a fraction of these are stored in an exported state.

The maximum number of k is 100 thousand. This value is automatically reduced if all the values cannot fit in the output.

In most cases, the runtime of our implementation does not depend on the number of counters. Our implementation ensures the number of counters does not have a noticeable effect on the runtime of the algorithm.

Each counter in each aggregation state uses a constant amount of memory overhead of around 100 bytes. Thus, if an aggregation uses c counters and there are g aggregation groups, the aggregation will use c * g * 100B of memory, plus memory to store the values. If this memory exceeds the total memory budget, memory is spilled to disk. This is far less memory than the exact version would use, especially when there a large number of unique values.