Secure UDFs

This topic covers concepts and syntax, including examples, for using secure UDFs (user-defined functions).

In this Topic:

Overview

Why Should I Use Secure UDFs?

Some of the internal optimizations for SQL UDFs require access to the underlying data in the base tables. This access might allow data that is hidden from users of the UDF to be exposed indirectly through programmatic methods. Secure UDFs do not utilize these optimizations, ensuring that users do not have even indirect access to the underlying data.

In addition, the SQL expression or JavaScript code used to create a UDF, also known as the UDF definition or text, is visible to users in the following commands and interfaces:

For security or privacy reasons, you might not wish to expose the underlying tables or algorithmic details for a UDF. With secure UDFs, the definition and details are visible only to authorized users (i.e. users who are granted the role that owns the view).

When Should I Use a Secure UDF?

SQL UDFs should be defined as secure when they are specifically designated for data privacy (i.e. to limit access to sensitive data that should not be exposed to all users of the underlying tables).

Secure UDFs should not be used for SQL UDFs that are defined for query convenience, such as those created for simplifying querying data for which users do not need to understand the underlying data representation. This is because the Snowflake query optimizer, when evaluating secure UDFs, bypasses the optimizations used for regular UDFs. This might reduce query performance for secure UDFs.

Tip

When deciding whether to use a secure UDF, you should consider the purpose of the UDF and weigh the trade-off between data privacy/security and query performance.

Also, if your data is sensitive enough that you decide that accesses via one type of object (such as UDFs) should be secure, then you should strongly consider ensuring that accesses via other types of objects (such as views) are also secure.

For example, if you only allow secure UDFs to access a given table, then any views that you allow to access the same table probably also should be secure.

How Might Data be Exposed by a Regular UDF?

Users might wonder how the security of a regular UDF could be circumvented. In certain/specific situations, a user might be able to deduce information about rows that the user cannot see directly. For more details, see Pushdown (in this topic).

Pushdown

To understand how data can be exposed indirectly, it helps to understand the concept of pushdown.

What is Pushdown?

Pushdown improves performance by filtering out unneeded rows as early as possible during query processing. Pushdown can also reduce memory consumption. However, pushdown can allow confidential data to be exposed indirectly.

Consider the following query:

SELECT col1
  FROM tab1
  WHERE location = 'New York';

One approach to processing the query is:

  1. Read all rows from the table into memory (i.e. execute the FROM clause).
  2. Scan the rows in memory, filtering out any rows that do not match New York (i.e. execute the WHERE clause).
  3. Select col1 from the rows still remaining in memory (i.e. execute the SELECT list).

You can think of this as a “load first, filter later” strategy, which is straight-forward, but inefficient.

It’s usually more efficient to filter as early as possible. Early filtering is called “pushing the filter down deeper into the query plan”, or simply “pushdown”.

In our example query, it would be more efficient to tell the table-scanning code not to load records that don’t match the WHERE clause. This doesn’t save filtering time (every row’s location must still be read once), but it can save considerable memory and reduce subsequent processing time because there are fewer rows to process.

In some cases, you can process the data even more efficiently. For example, suppose that the data is partitioned by state (i.e. all the data for New York is in one micro-partition, all the data for Florida is in another micro-partition, and so on). In this scenario:

  • Snowflake does not need to store all the rows in memory.
  • Snowflake does not need to read all the rows.

We loosely define this as another form of “pushdown”.

The principle of “pushing down the filters” applies to a wide range of queries. Often, the filter that is the most selective (screens out the most data) is pushed deepest (executed earliest) to reduce the work that the remaining query must do.

Pushdown can be combined with other techniques, such as clustering (sorting/ordering the data), to reduce the amount of irrelevant data that needs to be read, loaded, and processed.

How do Secure UDFs Protect Data?

If the optimizer re-orders the filters in a way that allows a general filter to run before the appropriate filter(s) used to secure data are applied, underlying details could be exposed. Therefore, the solution is to prevent the optimizer from pushing down certain types of filters (more generally, to prevent the optimizer from using certain types of optimizations, including but not limited to filter pushdown) if those optimizations are not safe.

Declaring a UDF as “secure” tells the optimizer to not push down certain filters (more generally, not to use certain optimizations). However, preventing certain types of optimizations can impact performance.

Example of Indirect Data Exposure Through Pushdown

The following example shows one way that pushdown could indirectly result in the exposure of underlying details about a query. This example focuses on views, but the same principles apply to UDFs.

Suppose there is a table that stores information about patients:

CREATE TABLE patients
  (patient_ID INTEGER,
   category VARCHAR,      -- 'PhysicalHealth' or 'MentalHealth'
   diagnosis VARCHAR
   );

INSERT INTO patients (patient_ID, category, diagnosis) VALUES
  (1, 'Mental', 'paranoia'),
  (2, 'Physical', 'lung cancer');

There are two views, one of which shows mental health information and one of which shows physical health information:

CREATE VIEW mental_health_view AS
  SELECT * FROM patients WHERE category = 'MentalHealth';

CREATE VIEW physical_health_view AS
  SELECT * FROM patients WHERE category = 'PhysicalHealth';

Most users don’t have direct access to the table. Instead, users are assigned one of two roles:

  • MentalHealth, which has privileges to read from mental_health_view, or
  • PhysicalHealth, which has privileges to read from physical_health_view.

Now suppose that a doctor with privileges only on physical health data wants to know whether there are currently any mental health patients in the table. The doctor can construct a query similar to the following:

SELECT * FROM physical_health_view
  WHERE 1/IFF(category = 'MentalHealth', 0, 1) = 1;

This query is equivalent to:

SELECT * FROM patients
  WHERE
    category = 'PhysicalHealth' AND
    1/IFF(category = 'MentalHealth', 0, 1) = 1;

There are (at least) two methods that Snowflake uses to process this query.

  • Method 1:
    1. Read all the rows in the patients table.
    2. Apply the view’s security filter (i.e. filter out the rows for which the category is not PhysicalHealth).
    3. Apply the WHERE clause in the query (i.e. filter based on WHERE 1/IFF(category = 'MentalHealth', 0, 1) = 1).
  • Method 2 changes the order of the filters, so that the query executes as follows:
    1. Read all the rows in the patients table.
    2. Apply the WHERE clause in the query (i.e. filter based on WHERE 1/IFF(category = 'MentalHealth', 0, 1) = 1).
    3. Apply the view’s security filter (i.e. filter out the rows for which the category is not PhysicalHealth).

Logically, these two sequences seem equivalent; they return the same set of rows. However, depending on how selective these two filters are, one order of processing might be faster, and Snowflake’s query planner might choose the plan that executes faster.

Suppose that the optimizer chooses the second plan, in which the clause WHERE 1/IFF(category = 'MentalHealth', 0, 1) = 1 is executed before the security filter. If the patients table has any rows in which category = 'MentalHealth', then the IFF function returns 0 for that row, and the clause effectively becomes WHERE 1/0 = 1, so the statement causes a divide-by-zero error. The user with physical_health_view privileges does not see any rows for people with mental health issues, but can deduce that at least one person in the mental health category exists.

Note that this technique does not always result in exposing underlying details; it relies heavily on the choices that the query planner makes, as well as on how the views (or UDFs) are written. But this example shows that a user can deduce information about rows that the user cannot view directly.

Creating Secure UDFs

Secure UDFs are defined using the SECURE keyword with the standard DDL for functions:

  • To create a secure function, specify the SECURE keyword in the CREATE FUNCTION command.
  • To convert an existing function to a secure function or back to a regular function, set/unset the SECURE keyword in the
  • ALTER FUNCTION command.

Interacting with Secure UDFs

Viewing the Definition for Secure Functions

The definition of a secure function is only exposed to authorized users, i.e. users who have been granted the role that owns the function. If an unauthorized user uses any of the following commands or interfaces, the function definition is not displayed by any of the following:

Determining if a Function is Secure

The IS_SECURE column in the FUNCTIONS Information Schema view identifies whether a function is secure. For example:

select is_secure from information_schema.functions where function_name = 'MYFUNCTION';

Viewing Secure Function Details in Query Profile

The internals of a secure function are not exposed in Query Profile (in the web interface). This is the case even for the owner of the secure function, since non-owners might have access to an owner’s Query Profile.

Best Practices for Secure UDFs

Secure UDFs prevent users from possibly being exposed to data from rows of tables that are filtered by the function. However, there are still ways that a data owner might inadvertently expose information about the underlying data if UDFs are not constructed carefully. In this section, we discuss some potential pitfalls to avoid.

Sequence-generated Columns

A common practice for generating surrogate keys is to use a sequence or auto-increment column. If these keys are exposed to users who do not have access to all of the underlying data, then a user might be able to guess details of the underlying data distribution.

For example, suppose that we have a function get_widgets_function() that exposes the ID column. If ID is generated from a sequence, then a user of get_widgets_function() could deduce the total number of widgets created between the creation timestamps of two widgets that the user has access to. Consider the following query and result:

select * from table(get_widgets_function()) order by created_on;

------+-----------------------+-------+-------+-------------------------------+
  ID  |         NAME          | COLOR | PRICE |          CREATED_ON           |
------+-----------------------+-------+-------+-------------------------------+
...
 315  | Small round widget    | Red   | 1     | 2017-01-07 15:22:14.810 -0700 |
 1455 | Small cylinder widget | Blue  | 2     | 2017-01-15 03:00:12.106 -0700 |
...

Based on the result, the user might suspect that 1139 widgets (1455 - 315) were created between January 7 and January 15. If this information is too sensitive to expose to users of a function, you can use any of the following alternatives:

  • Do not expose the sequence-generated column as part of the function.
  • Use randomized identifiers (e.g. generated by UUID_STRING) instead of sequence-generated values.
  • Programmatically obfuscate the identifiers.

Scanned Data Size

For queries containing secure functions, Snowflake does not expose the amount of data scanned (either in terms of bytes or micro-partitions) or the total amount of data. This is to protect the information from users who only have access to a subset of the data. However, users might still be able to make observations about the quantity of underlying data based on performance characteristics of queries. For example, a query that runs twice as long might process twice as much data. While any such observations are approximate at best, in some cases it might be undesirable for even this level of information to be exposed.

In such cases, it is best to materialize data per user/role instead of exposing functions on the base data to users. In the case of the widgets table, a table would be created for each role that has access to widgets, which contains only the widgets accessible by that role, and a role would be granted access to its table. This is much more cumbersome than using a single function, but for extremely high-security situations, this might be warranted.

Secure UDFs and Data Sharing

When using secure UDFs with data sharing, the CURRENT_ACCOUNT function can be used to authorize users from a specific account to access rows in a base table.

Note that Snowflake does not recommend using CURRENT_ROLE or CURRENT_USER in secure UDFs that will be shared with other accounts, since the owner of the data being shared does not typically control the roles and users in the account with which it is being shared.