SQL and JavaScript Table UDFs

A table function returns a set of rows, and can be accessed in the FROM clause of a query. In addition to pure SQL table UDFs, Snowflake supports JavaScript table UDFs, which can leverage the full JavaScript language.

In this Topic:

SQL Table UDFs

The function body of a SQL table UDF must be a SELECT statement. The return types specified in the RETURNS clause must match the types of the expressions in the corresponding positions of the SELECT statement in the function body.

A SQL UDF is defined as a table function by specifying a return type containing the TABLE keyword. The return type specifies the names and types of the columns of the table:

CREATE OR REPLACE FUNCTION <name> ( <arguments> )
  RETURNS TABLE ( <output_columns> )
  ...
  AS '<sql_expression>';

For example:

CREATE OR REPLACE FUNCTION get_countries_for_user ( id number )
  RETURNS TABLE (country_code char, country_name varchar)
  AS 'select distinct c.country_code, c.country_name
      from user_addresses a, countries c
      where a.user_id = id
      and c.country_code = a.country_code';

A SQL table UDF is typically called in the FROM clause of a query, using the TABLE keyword. The columns defined in the table UDF can appear anywhere that a normal table column can be used.

For example:

select *
from table(get_countries_for_user(123)) cc
where cc.country_code in ('US','FR','CA');

+--------------+---------------+
| COUNTRY_CODE | COUNTRY_NAME  |
|--------------+---------------+
| FR           | France        |
| US           | United States |
+--------------+---------------+

JavaScript Table UDFs

A JavaScript table UDF is defined using the same syntax as a regular SQL table UDF, but with the addition of the LANGUAGE keyword. In addition, instead of a SQL block in the function definition, JavaScript code is passed:

CREATE OR REPLACE FUNCTION <name> ( <arguments> )
  RETURNS TABLE ( <output_columns> )
  ...
  LANGUAGE JAVASCRIPT
  ...
  AS '<javascript_code>';

The JavaScript code must meet the following requirements for the table UDF to be valid:

  • The code must define a single literal JavaScript object.
  • The defined object must include a function called processRow.

Note

If the JavaScript code does not meet these requirements, the table UDF will still be created; however, it will fail when called in a query.

Similar to SQL table UDFs, the standard way to call a JavaScript table UDF is through a lateral join. For example:

SELECT * FROM tab, TABLE(js_tudf(tab.c1, tab.c2));

In this example, tab is the table, with columns c1 and c2, and js_tudf is the name of the UDF.

JavaScript table UDFs can also be called using a partition. For example:

SELECT * FROM tab, TABLE(js_tudf(tab.c1, tab.c2) OVER (PARTITION BY tab.c3 ORDER BY tab.c1));

JavaScript Table UDF Object Callbacks

Snowflake interacts with the user-defined object by invoking various callback functions during the execution of the query. The following skeleton outlines all available callbacks and their expected signature:

{
   processRow: function (row, rowWriter, context) {/*...*/},
   finalize: function (rowWriter, context) {/*...*/},
   initialize: function (argumentInfo, context) {/*...*/},
}
processRow

This callback function is invoked once for each row in the input relation. The arguments are passed in the row object. For each of the arguments defined in the CREATE FUNCTION DDL, there is a property on the row object with the same name in all uppercase. The value of this property is the value of the argument for the current row converted to a JavaScript value.

The rowWriter argument is used by the user-supplied code to produce output rows. A single function, writeRow, is defined on the object. The writeRow function takes one argument, the row object which is a single row in the output table represented as a JavaScript object. For each column defined in the RETURNS clause of the CREATE FUNCTION DDL, a corresponding property can be defined on the row object. The value of a property on the row object will become the value for the corresponding column in the output relation. Any output columns without a corresponding property on the row object will have the value NULL in the result table.

finalize

This callback function is invoked once for each partition, after all rows have been passed to processRow.

It can be used to output any state that may have been aggregated in processRow using the same row writer as is passed to processRow.

initialize

This callback function is invoked once for each partition prior to any invocations of processRow; rows cannot be produced by this method.

initialize can be used to set up any state needed during the result computation. In addition, initialize is passed metadata about the arguments to the user-defined function in the argumentInfo object. argumentInfo has a property for each input column defined in the DDL with metadata about that column. Each entry is an object with the following values:

:type: String. The type of this column. :isConst: Boolean. If true, the value of this column is constant (i.e. is the same for every row). :constValue: If isConst (as defined above) is true, this entry contains the constant value of the column; otherwise, this field is undefined.

Note that only the processRow callback is required; the rest are optional.

In addition, other functions and properties can be defined on the object for use in the UDF. All functions take a context object; this is reserved for future use and currently is empty. While the arguments are positional and can be named anything, this topic uses the above names for the remaining discussion and examples.

Note

Modifying this object can yield undefined behavior.

JavaScript Table UDF Partitioning

For this discussion, we define an instance of a JavaScript table UDF as one instance of the JavaScript object used to represent the UDF in Snowflake.

For improved performance when invoking a JavaScript table UDF, several instances can execute in parallel. For this to work, the rows in the input need to be partitioned between the different instances of the UDF. Partitioning is specified when calling a JavaScript table UDF:

  • All rows in a partition are passed to the same UDF instance through the processRow function.
  • initialize and finalize are each invoked once per partition.

Note that a given UDF instance can “see” multiple partitions; it is therefore important to use the initialize and finalize to specifically set up and tear down partitions.

The following examples illustrate three different methods for calling a JavaScript table UDF, demonstrating the different ways to use partitioning.

No Explicit Partitioning

SELECT * FROM tab, TABLE(js_tudf(tab.c1, tab.c2));

In this example, the Snowflake execution engine partitions the input itself according to multiple factors, such as the size of the warehouse processing the UDF and the cardinality of the input relation. When running in this mode, the user code can make no assumptions about partitions. This is most useful when the UDF only needs to look at rows in isolation to produce its output and no state is aggregated across rows.

Explicit Partitioning

SELECT * FROM tab, TABLE(js_tudf(tab.c1, tab.c2) OVER (PARTITION BY <expr> ORDER BY <expr>));

In this example, a partition consists of all rows for which the partition by expressions evaluates to the same value. The rows are passed to processRow in the order defined by the ORDER BY expression.

Explicit Partitioning with an Empty OVER Clause

SELECT * from tab, TABLE(js_tudf(tab.c1, tab.c2) OVER ());

An empty OVER clause means that every row belongs to the same partition (i.e the entire input relation is one partition).

Note

You should exercise caution when calling a Javascript table UDF with an empty OVER clause because this limits Snowflake to creating one instance of the UDF and, therefore, is unable to parallelize the computation.

JavaScript Table UDF Result Columns

This section describes the columns that are produced by a JavaScript Table UDF:

  • All columns defined in the RETURNS clause of the CREATE FUNCTION are returned in the output relation.
  • All columns passed into the UDF are also returned.

There is a distinction between rows produced in the processRow callback and rows produced by finalize:

  • When a row is produced in processRow, Snowflake can correlate it to an input, namely the one passed into the function as the row argument. Note that if a given processRow produces more than one row, input attributes are duplicated into each row.

    For rows produced in processRow, all input columns are duplicated in the output relation.

  • In the finalize callback, Snowflake is unable to correlate it to any single row because there is no “current row” to correlate to.

    For rows produced in the finalize callback, only the columns used in the PARTITION BY clause are duplicated (as these are the same for any row in the current partition); all other attributes are NULL. If no PARTITION BY clause is specified, all input attributes are NULL.

Sample JavaScript Table UDF

The following simple JavaScript table UDF illustrates all the APIs and various output columns. It simply returns all rows as-is and provides a count of number characters seen in each partition. It also illustrates how to share state across a partition using the THIS reference. Note that the example uses an initialize callback to initialize the counter to zero; this is needed because a given UDF instance can be used to process multiple partitions.

-- setup for the example
CREATE TABLE parts (p FLOAT, s STRING);

INSERT INTO parts VALUES (1, 'michael'), (1, 'kelly'), (1, 'brian');
INSERT INTO parts VALUES (2, 'clara'), (2, 'maggie'), (2, 'reagan');

-- creation of the UDF
CREATE OR REPLACE FUNCTION "CHAR_SUM"(INS STRING)
    RETURNS TABLE (NUM FLOAT)
    LANGUAGE JAVASCRIPT
    AS '{
    processRow: function (row, rowWriter) {
      this.ccount = this.ccount + 1;
      this.csum = this.csum + row.INS.length;
      rowWriter.writeRow({NUM: row.INS.length});
    },
    finalize: function (rowWriter, i) {
     rowWriter.writeRow({NUM: this.csum});
    },
    initialize: function() {
     this.ccount = 0;
     this.csum = 0;
    }}';

This example illustrates the CHAR_SUM table UDF called on the parts table with no partitioning:

SELECT * FROM parts, TABLE(char_sum(s));

+--------+---------+-----+
| P      | S       | NUM |
+--------+---------+-----+
| 1      | michael | 7   |
| 1      | kelly   | 5   |
| 1      | brian   | 5   |
| 2      | clara   | 5   |
| 2      | maggie  | 6   |
| 2      | reagan  | 6   |
| [NULL] | [NULL]  | 34  |
+--------+---------+-----+

With no partitioning specified, Snowflake defines partitions. In this example, due to the small number of rows, only one partition is created (i.e. only one invocation of finalize is executed). Note how the final row has NULL values in the input columns.

Same example as before, but with explicit partitioning:

SELECT * FROM parts, TABLE(char_sum(s) OVER (PARTITION BY p));

+--------+---------+-----+
| P      | S       | NUM |
+--------+---------+-----+
| 1      | michael | 7   |
| 1      | kelly   | 5   |
| 1      | brian   | 5   |
| 1      | [NULL]  | 17  |
| 2      | clara   | 5   |
| 2      | maggie  | 6   |
| 2      | reagan  | 6   |
| 2      | [NULL]  | 17  |
+--------+---------+-----+

This example partitions over the p column, yielding two partitions. For each partition, a single row is returned in the finalize callback, yielding a total of two rows, distinguished by the NULL value in the s column. Because p is the PARTITION BY column, the rows created in finalize have the value of p that defines the current partition.