Querying data protected by differential privacy¶
This topic helps an analyst run queries against data protected by differential privacy (that is, privacy-protected tables and views), and understand and adjust the results returned by the queries.
To execute a query against a privacy-protected table, a user must have the SELECT privilege on the table.
Limitations
Differential privacy supports a subset of Snowflake data types, operators, query syntax, and functions. For a list of supported SQL that you can use in a query, see Differential privacy SQL reference.
Queries against privacy-protected tables take longer because Snowflake must run additional computations to determine how much noise to add. For basic queries, this latency is at least 7 seconds. Complex queries, such as the following, can take much longer:
Queries with many joins and sub-queries.
Queries that output multiple rows in the result, for example, when using GROUP BY clauses that result in hundreds or thousands of groups.
In tables protected by differential privacy, fields in the outermost SELECT clause can only have aggregation, GROUP BY, or DP_INTERVAL_LOW/HIGH applied. Other actions, such as math and concatenation, are not allowed. Examples:
SELECT key, COUNT(*) AS 'c', DP_INTERVAL_LOW('c') FROM T GROUP BY key
Succeeds: No unsupported actions taken on
key
,COUNT(*)
, orc
.SELECT key, 1 + COUNT(*) AS 'c', DP_INTERVAL_LOW('c') FROM T GROUP BY key
Fails:
1 + COUNT(*)
is specified on a field in the outermost SELECT clause.SELECT key, AVG(1 + x) AS 'avg', DP_INTERVAL_LO('avg') FROM T GROUP BY key
Succeeds:
AVG
, a permitted aggregation, happens after1 + x
in the outermost SELECT clause.SELECT key, COUNT(x) AS 'c', DP_INTERVAL_LOW('c') FROM (SELECT key, 1 + income AS x FROM table) GROUP BY key
Succeeds:
1 + income
is applied in a nested SELECT clause, which is allowed.
Query Fundamentals¶
This section discusses the basic components of a query that will succeed when run against a privacy-protected table. It includes:
Aggregating data¶
All queries against a privacy-protected table must aggregate results rather than retrieve individual records. Not every part of a query needs to use an aggregation function as long as the final result is aggregated.
With the exception of a COUNT function, a query cannot aggregate a column unless the column has a privacy domain.
For a list of supported aggregations, see Aggregate functions.
Using joins¶
The following sections provide guidelines for using joins in a differentially private query:
To learn about the implications that joining two privacy-protected tables has on privacy domains, see Privacy domains and joins.
Join operators¶
Each join must be an equi join that uses a single operator. For example, t1.c1 == t2.c1
is supported, but col1 > col2
and
col1 + 10 = col2
are not. Unconditioned joins are not supported.
Joins must use the JOIN operator. The WHERE syntax for joins is not supported. For more information about join syntax, see Implementing Joins.
Supported joins¶
Joins in a differentially private query must be one of the following:
INNER
{ LEFT | RIGHT | FULL } OUTER
NATURAL
Both sides of the join must have the same query pattern. For example, the following joins are supported:
Both sides are identifiers
SELECT COUNT(*) FROM t1 INNER JOIN t2 ON t1.a=t2.a;
Both sides are subqueries
SELECT COUNT(*) FROM (SELECT a, SUM(b) FROM t1 GROUP BY a) AS g1 INNER JOIN (SELECT * FROM t2) AS g2 ON g1.a=g2.a;
Joining an identifier with a subquery is currently not supported.
For information about the supported query syntax related to joins, see Query syntax.
Using entity keys in joins¶
When working with tables protected with entity-level privacy, you can minimize the amount of noise by including the entity key column as part of the join key, especially if it doesn’t semantically change the query.
For example, consider the following tables where the entity is customers:
Table
Description
customers
Customer directory, where each row is a customer and has a
customer_id
.
transactions
Customer transactions, where each customer can have multiple transactions.
transaction_lines
Unique items that were purchased in a transaction. There can be multiple rows in a single transaction.
If they are following best practices, the data provider has structured the data so that each of these tables has the entity key
customer_id
. For this data schema, each transaction line can only belong to one transaction, and each transaction can only belong to
one customer. This relationship is not evident from the data itself, so without additional information the amount of noise added for
differential privacy will be higher than it needs to be.
You can minimize the amount of noise by including the entity key customer_id
as part of the join key, even if it is redundant. For
example, joining the table transactions
with transaction_lines
typically only requires the join key transaction_id
. However,
joining on both transaction_id
and customer_id
will result in a lower amount of noise.
Data types and privacy domains¶
When joining two tables, the data types of the join key columns from either side must be the same. For differential privacy, the data type of a column includes whether or not it has a privacy domain.
For example, if you had a privacy-protected table transactions
and an unprotected table product_lookup
, and you wanted to join them
on product_id
, the product_id
column in both tables must be the same data type (for example, a string) and must each have a privacy
domain.
To meet this requirement, the administrator for the analyst might need to define a privacy domain just like the data provider defines them. For information on how to set a privacy domain for a table, see Setting a privacy domain.
Uniqueness requirement¶
Joins can potentially duplicate rows of data, which can cause the amount of noise added to a query result to become unbounded. To ensure that privacy-protected data is not duplicated in a join, the join key (that is, the columns on which the tables are joined) for privacy-protected tables must match only one record in the other table. This means that when joining with a privacy-protected table, the join key on the opposite side must be de-duplicated.
Important
The uniqueness requirement for joins doesn’t always apply to queries against tables that are protected by entity-level privacy. For entity-level privacy, queries must de-duplicate on the entity key before the aggregation. As long as this is done after a join but before the aggregation, the join doesn’t need to be on de-duplicated data. For more information about meeting these requirements, see Querying data protected by entity-level privacy.
To satisfy the uniqueness requirement for joins, the query can use a GROUP BY on a subset of the join columns to group duplicate rows into one result.
For example, suppose the patients
table is protected by differential privacy and the geo_lookup
table is not. The analyst wants to
join these two tables on zip_code
so that they can filter the patients
table on State
. In order to ensure that the records in
the privacy-protected patients
table are not duplicated, the query must de-duplicate the zip_code
table on the join key. This must
be done explicitly even if the geo_lookup
table is already unique on zip_code
. This ensures that Snowflake can correctly account
for privacy.
SELECT COUNT(*)
FROM patients
LEFT JOIN (SELECT zip_code, ANY_VALUE(state) AS residence_state
FROM geo_lookup
GROUP BY zip_code)
USING zip_code
WHERE birth_state = residence_state;
df_patients = session.table("patients")
df_geo = session.table("geo_lookup")
df_geo_deduped = df_geo.group_by("zip_code").agg(f.any_value("state").as_("state"))
df_patients.join(df_geo_deduped, on="zip_code", join_type="left")\
.where(f.col("birth_state") == f.col("residence_state")).select(f.count("*"))
Querying data protected by entity-level privacy¶
Most data providers use an entity key to implement entity-level privacy when configuring differential privacy. When a table is protected by entity-level privacy, Snowflake does not allow aggregates on fields if there might be an unbounded number of rows per entity. This means queries must meet the following requirements:
At some point in the query, the privacy-protected table must be deduplicated on the entity key. Operations that can be used to deduplicate data are:
COUNT( DISTINCT <entity_key_column> )
GROUP BY <entity_key_column>
UNION (but not UNION ALL) when only the entity key is projected.
If a join uses a join key other than the entity key column, that join cannot occur between the deduplication and the final SELECT clause with aggregation.
Note
If the data provider implemented row-level privacy, the deduplication requirement for joins is different. For more information about these requirements, see Uniqueness requirement.
To help illustrate the requirements for entity-level privacy, suppose you have a privacy-protected table patients
with the entity key
column patient_id
. You also have a non-sensitive, unprotected table geo_lookup
. The following examples show a query that fails
followed by a re-written version that succeeds.
- Example: Deduplication
The following query fails because it doesn’t meet the deduplication requirement. Even though the table
patients
might already be unique onpatient_id
, the query fails because it does not explicitly deduplicate.SELECT COUNT(*) FROM patients WHERE insurance_type = 'Commercial';
To re-write the query so it succeeds, include a distinct count on the entity key column in order to explicitly deduplicate on the entity key. For example:
SELECT COUNT(DISTINCT patient_id) FROM patients WHERE insurance_type = 'Commercial';
- Example: Location of join
The following query fails even though it is using a GROUP BY clause to meet the deduplication requirement. It fails because the table is being joined with another table using a column that is not the entity key column.
SELECT AVG(bmi) FROM (SELECT patient_id, ANY_VALUE(zip_code) AS zip_code FROM patients GROUP BY patient_id) AS p JOIN geo_lookup AS g ON p.zip_code = g.zip_code WHERE state='CA';
To re-write the query so it succeeds, use the GROUP BY clause after the join. The join cannot occur in between the deduplication and the SELECT clause with aggregation.
SELECT AVG(bmi) FROM (SELECT patient_id, ANY_VALUE(bmi) as bmi, ANY_VALUE(state) as state FROM patients AS p JOIN geo_lookup AS g ON p.zip_code = g.zip_code GROUP BY patient_id) WHERE state='CA';
Executing transaction-level queries¶
The deduplication requirement for entity-level differential privacy does not prevent you from executing transaction-level queries. However, you must first group the data to the entity-level, and then aggregate on those groups.
For example, suppose you have a table doctor_visits
and that the data provider has defined an entity key patient_id
to implement
entity-level privacy. A transaction-level query might be: “How many doctor visits weren’t for a regular checkup?” The following is an
example of how to write this query:
SELECT SUM(num_visits)
FROM (SELECT SUM((visit_reason<>'Regular checkup')::INT) AS num_visits
WHERE visit_reason IS NOT NULL
GROUP BY patient_id)
WHERE num_visits > 0 AND num_visits < 20;
The subquery groups by patient_id
to deduplicate the data. The aggregate column num_visits
captures the number of visits per
patient that were not for a regular checkup. The query then aggregates again on that per-patient column to get the total number of visits.
Note that the WHERE clause on the outer query is required in order to
specify a privacy domain on the data.
Note
While not a requirement, a best practice when joining tables protected by entity-level differential privacy is to include the entity key column as part of the join key (if it doesn’t semantically change the query). For more information, see Using entity keys in joins.
Understanding query results¶
Queries against a privacy-protected table don’t return the exact value of an aggregation. Differential privacy introduces
noise into the result so it becomes an approximation of the actual value. The returned value
differs enough from the actual value to conceal whether an individual’s data is included in the aggregation. This applies to all queries
except for a query that returns the total number of rows in the privacy-protected table, for example, SELECT COUNT(*) FROM t
.
An analyst needs to be able to determine whether the noise introduced into the result has decreased the usefulness of the query. Snowflake uses a noise interval to help analysts interpret the results. A noise interval is a closed mathematical interval that, in most cases, includes the actual value of the aggregation. There is a 95% chance that the actual result of a query falls within the noise interval.
Adding the following functions to a query allows the analyst to use the noise interval to make decisions about the utility of a query:
DP_INTERVAL_LOW — Returns the lower bound of the noise interval. The actual value is most likely to be equal to or larger than this number.
DP_INTERVAL_HIGH — Returns the upper bound of the noise interval. The actual value is most likely to be equal to or smaller than this number.
To use these functions, pass in the alias of an aggregated column in the main query. For example, the following query returns the sum of the
num_claims
column along with the noise interval for that aggregation:
SELECT SUM(num_claims) AS sum_claims,
DP_INTERVAL_LOW(sum_claims),
DP_INTERVAL_HIGH(sum_claims)
FROM t1;
dp_interval_low = f.function('dp_interval_low')
dp_interval_high = f.function('dp_interval_high')
dpdf_cohort.group_by().agg(f.sum("num_claims").alias("sum_claims"),
dp_interval_low("sum_claims"),
dp_interval_low("sum_claims")
).show()
The output might be:
+--------------+--------------------------------+----------------------------------+
| sum_claims | dp_interval_low("sum_claims") | dp_interval_high("sum_claims") |
|--------------+--------------------------------+----------------------------------+
| 50 | 35 | 75 |
+--------------+--------------------------------+----------------------------------+
In this case, the return value is a sum of 50. But the analyst has also determined with 95% certainty that the actual value of the aggregation is between 35 and 75.
Tip
For information about techniques that can potentially reduce noise in results, see
Tracking privacy budget spending¶
You can use the ESTIMATE_REMAINING_DP_AGGREGATES function to estimate how many more queries you can run
within the current budget window (that is, until the cumulative privacy loss is reset to 0). The estimate is based on the number of
aggregates, not queries. For example, the query SELECT SUM(age), COUNT(age) FROM T
contains two aggregate functions: SUM(age)
and
COUNT(age)
.
When executing the ESTIMATE_REMAINING_DP_AGGREGATES function, be sure to use the exact conditions you’re using to execute queries, for example, the same user, role, and account.
If you’re running a query that uses multiple tables, you should run ESTIMATE_REMAINING_DP_AGGREGATES once per table, then use the lowest
NUMBER_OF_REMAINING_DP_AGGREGATES
value as the estimated usage cap.
The following example shows how a series of queries affect how much of the privacy budget’s limit has been spent (that is, the cumulative privacy loss of the queries) and the estimated number of remaining aggregates.
1. Initial check
Let’s look at privacy budget numbers on the table my_table
. You’ve never run any queries on this table.
SELECT * FROM TABLE(SNOWFLAKE.DATA_PRIVACY.ESTIMATE_REMAINING_DP_AGGREGATES('my_table'));
No budget used so far:
+-----------------------------------+--------------+---------------+--------------+
| NUMBER_OF_REMAINING_DP_AGGREGATES | BUDGET_LIMIT | BUDGET_WINDOW | BUDGET_SPENT |
|-----------------------------------+--------------+---------------+--------------|
| 996 | 233 | WEEKLY | 0.0 |
+-----------------------------------+--------------+---------------+--------------+
2. Run a query
Let’s run a query with one aggregate function and check our numbers again:
SELECT SUM(salary) FROM my_table;
-- results omitted ...
SELECT * FROM TABLE(SNOWFLAKE.DATA_PRIVACY.ESTIMATE_REMAINING_DP_AGGREGATES('my_table'));
Estimate of remaining aggregate calls has dropped by one and the cumulative privacy loss (budget spent) has increased.
+-----------------------------------+--------------+---------------+--------------+
| NUMBER_OF_REMAINING_DP_AGGREGATES | BUDGET_LIMIT | BUDGET_WINDOW | BUDGET_SPENT |
|-----------------------------------+--------------+---------------+--------------|
| 995 | 233 | WEEKLY | 0.6 |
+-----------------------------------+--------------+---------------+--------------+
3. Run another query with two aggregate functions
SELECT SUM(age), COUNT(age) FROM my_table GROUP BY STATE;
-- results omitted ...
SELECT * FROM TABLE(SNOWFLAKE.DATA_PRIVACY.ESTIMATE_REMAINING_DP_AGGREGATES('my_table'));
Estimated remaining queries has dropped by two. Remember, this is an estimate.
+-----------------------------------+--------------+---------------+--------------+
| NUMBER_OF_REMAINING_DP_AGGREGATES | BUDGET_LIMIT | BUDGET_WINDOW | BUDGET_SPENT |
|-----------------------------------+--------------+---------------+--------------|
| 993 | 233 | WEEKLY | 1.8 |
+-----------------------------------+--------------+---------------+--------------+
4. Rerun a query
Let’s rerun a previous query to show that privacy budget is always charged, even on identical queries. A duplicate query incurs the same privacy loss each time it runs (that is, it spends the same amount of privacy budget).
SELECT SUM(age), COUNT(age) FROM T GROUP BY STATE;
-- results omitted ...
SELECT * FROM TABLE(SNOWFLAKE.DATA_PRIVACY.ESTIMATE_REMAINING_DP_AGGREGATES('my_table'));
Same charge for the query as before: 1.2 units of privacy loss.
+-----------------------------------+--------------+---------------+--------------+
| NUMBER_OF_REMAINING_DP_AGGREGATES | BUDGET_LIMIT | BUDGET_WINDOW | BUDGET_SPENT |
|-----------------------------------+--------------+---------------+--------------|
| 991 | 233 | WEEKLY | 3.0 |
+-----------------------------------+--------------+---------------+--------------+