Querying data protected by differential privacy

This topic helps an analyst run queries against data protected by differential privacy (that is, privacy-protected tables and views), and understand and adjust the results returned by the queries.

To execute a query against a privacy-protected table, a user must have the SELECT privilege on the table.

Limitations

  • Differential privacy supports a subset of Snowflake data types, operators, query syntax, and functions. For a list of supported SQL that you can use in a query, see Differential privacy SQL reference.

  • Queries against privacy-protected tables have additional latency because Snowflake must run additional computations to determine how much noise to add. Currently, this latency is at least 7 seconds, but can be longer for queries that output multiple rows in the result, for example, when using GROUP BY clauses that result in many groups.

Query Fundamentals

This section discusses the basic components of a query that will succeed when run against a privacy-protected table. It includes:

Aggregating data

All queries against a privacy-protected table must aggregate results rather than retrieve individual records. Not every part of a query needs to use an aggregation function as long as the final result is aggregated.

With the exception of a COUNT function, a query cannot aggregate a column unless the column has a privacy domain.

For a list of supported aggregations, see Aggregate functions.

Using joins

The following sections provide guidelines for using joins in a differentially private query:

To learn about the implications that joining two privacy-protected tables has on privacy domains, see Privacy domains and joins.

Join operators

Each join must be an equi join that uses a single operator. For example, t1.c1 == t2.c1 is supported, but col1 > col2 and col1 + 10 = col2 are not. Unconditioned joins are not supported.

Joins must use the JOIN operator. The WHERE syntax for joins is not supported. For more information about join syntax, see Implementing Joins.

Supported joins

Joins in a differentially private query must be one of the following:

  • INNER

  • { LEFT | RIGHT | FULL } OUTER

  • NATURAL

Both sides of the join must have the same query pattern. For example, the following joins are supported:

Both sides are identifiers

SELECT COUNT(*)
FROM t1 INNER JOIN t2 ON t1.a=t2.a;
Copy

Both sides are subqueries

SELECT COUNT(*)
FROM (SELECT a, SUM(b) FROM t1 GROUP BY a) AS g1
    INNER JOIN (SELECT * FROM t2) AS g2
    ON g1.a=g2.a;
Copy

Joining an identifier with a subquery is currently not supported.

For information about the supported query syntax related to joins, see Query syntax.

Using entity keys in joins

When working with tables protected with entity-level privacy, you can minimize the amount of noise by including the entity key column as part of the join key, especially if it doesn’t semantically change the query.

For example, consider the following tables where the entity is customers:

Table

Description

customers

Customer directory, where each row is a customer and has a customer_id.

transactions

Customer transactions, where each customer can have multiple transactions.

transaction_lines

Unique items that were purchased in a transaction. There can be multiple rows in a single transaction.

If they are following best practices, the data provider has structured the data so that each of these tables has the entity key customer_id. For this data schema, each transaction line can only belong to one transaction, and each transaction can only belong to one customer. This relationship is not evident from the data itself, so without additional information the amount of noise added for differential privacy will be higher than it needs to be.

You can minimize the amount of noise by including the entity key customer_id as part of the join key, even if it is redundant. For example, joining the table transactions with transaction_lines typically only requires the join key transaction_id. However, joining on both transaction_id and customer_id will result in a lower amount of noise.

Data types and privacy domains

When joining two tables, the data types of the join key columns from either side must be the same. For differential privacy, the data type of a column includes whether or not it has a privacy domain.

For example, if you had a privacy-protected table transactions and an unprotected table product_lookup, and you wanted to join them on product_id, the product_id column in both tables must be the same data type (for example, a string) and must each have a privacy domain.

To meet this requirement, the administrator for the analyst might need to define a privacy domain just like the data provider defines them. For information on how to set a privacy domain for a table, see Setting a privacy domain.

Uniqueness requirement

Joins can potentially duplicate rows of data, which can cause the amount of noise added to a query result to become unbounded. To ensure that privacy-protected data is not duplicated in a join, the join key (that is, the columns on which the tables are joined) for privacy-protected tables must match only one record in the other table. This means that when joining with a privacy-protected table, the join key on the opposite side must be de-duplicated.

Important

The uniqueness requirement for joins doesn’t always apply to queries against tables that are protected by entity-level privacy. For entity-level privacy, queries must de-duplicate on the entity key before the aggregation. As long as this is done after a join but before the aggregation, the join doesn’t need to be on de-duplicated data. For more information about meeting these requirements, see Querying data protected by entity-level privacy.

To satisfy the uniqueness requirement for joins, the query can use a GROUP BY on a subset of the join columns to group duplicate rows into one result.

For example, suppose the patients table is protected by differential privacy and the geo_lookup table is not. The analyst wants to join these two tables on zip_code so that they can filter the patients table on State. In order to ensure that the records in the privacy-protected patients table are not duplicated, the query must de-duplicate the zip_code table on the join key. This must be done explicitly even if the geo_lookup table is already unique on zip_code. This ensures that Snowflake can correctly account for privacy.

SELECT COUNT(*)
  FROM patients
  LEFT JOIN (SELECT zip_code, ANY_VALUE(state) AS residence_state
            FROM geo_lookup
            GROUP BY zip_code)
  USING zip_code
  WHERE birth_state = residence_state;
Copy

Querying data protected by entity-level privacy

Most data providers use an entity key to implement entity-level privacy when configuring differential privacy. When a table is protected by entity-level privacy, Snowflake does not allow aggregates on fields if there might be an unbounded number of rows per entity. This means queries must meet the following requirements:

  • At some point in the query, the privacy-protected table must be deduplicated on the entity key. Operations that can be used to deduplicate data are:

    • COUNT( DISTINCT <entity_key_column> )

    • GROUP BY <entity_key_column>

    • UNION (but not UNION ALL) when only the entity key is projected.

  • If a join uses a join key other than the entity key column, that join cannot occur between the deduplication and the final SELECT clause with aggregation.

Note

If the data provider implemented row-level privacy, the deduplication requirement for joins is different. For more information about these requirements, see Uniqueness requirement.

To help illustrate the requirements for entity-level privacy, suppose you have a privacy-protected table patients with the entity key column patient_id. You also have a non-sensitive, unprotected table geo_lookup. The following examples show a query that fails followed by a re-written version that succeeds.

Example: Deduplication

The following query fails because it doesn’t meet the deduplication requirement. Even though the table patients might already be unique on patient_id, the query fails because it does not explicitly deduplicate.

SELECT COUNT(*)
  FROM patients
  WHERE insurance_type = 'Commercial';
Copy

To re-write the query so it succeeds, include a distinct count on the entity key column in order to explicitly deduplicate on the entity key. For example:

SELECT COUNT(DISTINCT patient_id)
  FROM patients
  WHERE insurance_type = 'Commercial';
Copy
Example: Location of join

The following query fails even though it is using a GROUP BY clause to meet the deduplication requirement. It fails because the table is being joined with another table using a column that is not the entity key column.

SELECT AVG(bmi)
  FROM (SELECT patient_id, ANY_VALUE(zip_code) AS zip_code
    FROM patients
    GROUP BY patient_id) AS p
  JOIN geo_lookup AS g
    ON p.zip_code = g.zip_code
  WHERE state='CA';
Copy

To re-write the query so it succeeds, use the GROUP BY clause after the join. The join cannot occur in between the deduplication and the SELECT clause with aggregation.

SELECT AVG(bmi)
  FROM (SELECT patient_id, ANY_VALUE(bmi) as bmi, ANY_VALUE(state) as state
      FROM patients AS p
      JOIN geo_lookup AS g
        ON p.zip_code = g.zip_code
      GROUP BY patient_id)
  WHERE state='CA';
Copy

Executing transaction-level queries

The deduplication requirement for entity-level differential privacy does not prevent you from executing transaction-level queries. However, you must first group the data to the entity-level, and then aggregate on those groups.

For example, suppose you have a table doctor_visits and that the data provider has defined an entity key patient_id to implement entity-level privacy. A transaction-level query might be: “How many doctor visits weren’t for a regular checkup?” The following is an example of how to write this query:

SELECT SUM(num_visits)
  FROM (SELECT SUM((visit_reason<>'Regular checkup')::INT) AS num_visits
        WHERE visit_reason IS NOT NULL
        GROUP BY patient_id)
  WHERE num_visits > 0 AND num_visits < 20;
Copy

The subquery groups by patient_id to deduplicate the data. The aggregate column num_visits captures the number of visits per patient that were not for a regular checkup. The query then aggregates again on that per-patient column to get the total number of visits. Note that the WHERE clause on the outer query is required in order to specify a privacy domain on the data.

Note

While not a requirement, a best practice when joining tables protected by entity-level differential privacy is to include the entity key column as part of the join key (if it doesn’t semantically change the query). For more information, see Using entity keys in joins.

Understanding query results

Queries against a privacy-protected table don’t return the exact value of an aggregation. Differential privacy introduces noise into the result so it becomes an approximation of the actual value. The returned value differs enough from the actual value to conceal whether an individual’s data is included in the aggregation. This applies to all queries except for a query that returns the total number of rows in the privacy-protected table, for example, SELECT COUNT(*) FROM t.

An analyst needs to be able to determine whether the noise introduced into the result has decreased the usefulness of the query. Snowflake uses a noise interval to help analysts interpret the results. A noise interval is a closed mathematical interval that, in most cases, includes the actual value of the aggregation. There is a 95% chance that the actual result of a query falls within the noise interval.

Adding the following functions to a query allows the analyst to use the noise interval to make decisions about the utility of a query:

  • DP_ESTIMATE_LOW — Returns the lower bound of the noise interval. The actual value is most likely to be equal to or larger than this number.

  • DP_ESTIMATE_HIGH — Returns the upper bound of the noise interval. The actual value is most likely to be equal to or smaller than this number.

To use these functions, pass in the alias of an aggregated column in the main query. For example, the following query returns the sum of the num_claims column along with the noise interval for that aggregation:

SELECT SUM(num_claims) AS sum_claims,
      DP_INTERVAL_LOW(sum_claims),
      DP_INTERVAL_HIGH(sum_claims)
  FROM t1;
Copy

The output might be:

+--------------+--------------------------------+----------------------------------+
|  sum_claims  |  dp_estimate_low("sum_claims") |  dp_estimate_high("sum_claims")  |
|--------------+--------------------------------+----------------------------------+
|  50          |  35                            |    75                            |
+--------------+--------------------------------+----------------------------------+

In this case, the return value is a sum of 50. But the analyst has also determined with 95% certainty that the actual value of the aggregation is between 35 and 75.

Tip

For information about techniques that can potentially reduce noise in results, see

Understanding your privacy budget

Snowflake manages your privacy budget so the privacy loss of a single query cannot exceed the privacy budget, which would prevent you from running subsequent queries. Snowflake determines how much privacy loss is acceptable for each query and introduces enough noise to ensure that a query does not exceed that level. As an analyst, you don’t need to manage or consider your privacy budget as you run queries.