About privacy domains¶
Within differential privacy in Snowflake, a privacy domain defines the possible values in a column, similar to a mathematical domain. A privacy domain is either a range of values with a minimum and maximum or an enumerated list of values.
The privacy domain is one factor that Snowflake uses to calculate the amount of noise that must be added to preserve privacy. Because of this, most fields should have a finite privacy domain; otherwise, the amount of noise added would need to be infinite. By default, fields without privacy domains are assumed to have an infinite domain.
Which columns need a privacy domain?¶
With the exception of a COUNT function, a query cannot aggregate a column unless the column has a privacy domain. Similarly, a query cannot
use a column in a GROUP BY clause unless the column has a privacy domain. For example, both the score
and age
columns in the
following example must have privacy domains.
SELECT AVG(age) FROM t1
GROUP BY score;
df.group_by("score").agg(f.mean("age").as_("mean_age")).show()
These requirements don’t apply to subqueries. For example, in the following query, the mean_age
column must have a privacy domain, but
the score
and age
columns don’t, even though they’re used in an aggregation.
SELECT AVG(mean_age)
FROM (SELECT AVG(age) AS mean_age FROM t1 GROUP BY score)
WHERE mean_age >= 20 AND mean_age <= 80;
df.group_by("score").agg(f.mean("age").as_("mean_age")) \
.where((f.col("mean_age") >= 20) & (f.col("mean_age") <= 80)) \
.select(f.mean("mean_age")).show()
Defining a privacy domain¶
While both administrators and the analysts who are running queries can define a privacy domain for a column, they do so in different ways:
An administrator uses CREATE TABLE and ALTER TABLE commands to set a privacy domain for a column. An administrator for the data provider sets privacy domains before giving access to analysts. In some circumstances, an administrator for the analyst might also need to set privacy domains on tables being joined with the data provider’s protected tables. If you’re an administrator who needs to set privacy domains, see Working with privacy domains as an administrator.
An analyst shapes a query to implicitly specify a privacy domain using query elements like filters and column transformations. These privacy domains can be specified for columns without a privacy domain or can narrow a privacy domain set by the data provider. If you’re an analyst who needs to specify or narrow a privacy domain, see Working with privacy domains as an analyst.
Interactions between privacy domains¶
Multiple privacy domains can be involved in a query. There can be an admin-specified privacy domain and an analyst-specified privacy domain on the same column. Alternatively, a query might join two tables on a column that has a privacy domain in both tables.
Snowflake evaluates all privacy domains and calculates the privacy domain to use for the duration of the query. For information about how this query-time privacy domain is determined, see:
Interaction between admin-specified and analyst-specified privacy domains¶
An analyst uses query elements to implicitly specify a privacy domain for a column. For example, filtering on a column defines a privacy domain for it. This analyst-specified privacy domain exists only for the duration of the query; it doesn’t change the privacy domain that an administrator set on the column.
An analyst-specified privacy domain can narrow an admin-specified privacy domain, but can never expand it. The query-time privacy domain is the intersection between the privacy domain specified by the query and the privacy domain set by the administrator. For example, if the data provider set the privacy domain as a range (5, 15) and the query uses filters to specify the privacy domain as a range (0, 10), then the effective, query-time privacy domain is (5, 10).
Similarly, if the administrator set the privacy domain as a list ( ‘blue’, ‘yellow’ ) and the query uses filters to specify a privacy domain of ( ‘orange’, ‘blue’) , the query-time privacy domain is ( ‘blue’ ).
Privacy domains and joins¶
When an analyst joins two tables on a column that has a privacy domain in both tables, the type of join determines the query-time privacy domain. During the duration of the query, the effective privacy domain can be the intersection of the two privacy domains, the union of the two privacy domains, or just one of the privacy domains.
In the following table, domainL
refers to the privacy domain on the join column in the left table and domainR
refers to the privacy
domain on the join column in the right table.
Join type |
Query-time privacy domain |
---|---|
INNER |
Intersection of |
OUTER |
Union of |
LEFT |
|
RIGHT |
|
LEFT SEMI |
Intersection of |
LEFT ANTI |
|
For example, suppose the day
column in t1
has a privacy domain of (1, 100) and the day
column in t2
has a privacy domain of
(0, 90). When an analyst joins t1
and t2
on day
, the query-time privacy domain is (1, 90), which is the intersection of the two
privacy domains.
Values outside a privacy domain¶
A privacy domain defines possible values in a column, not necessarily actual values. The following summarizes what happens to values that are not included in the list or range of the privacy domain.
- Strings
Values in a string column that fall outside the privacy domain are always treated as NULL for the duration of the query. This is true regardless of whether it is an admin-specified privacy domain, an analyst-specified privacy domain, or an intersection of privacy domains.
For example, suppose the data provider set a privacy domain on a column
state
of ('california'
,'oregon'
) and the analyst wrote a query that filters thestate
column to ('nevada'
,'oregon'
). If the query uses thestate
column in a GROUP BY clause, then the result contains two groups:OREGON
andNULL
. TheNULL
group includes all records where the value of thestate
column is notOREGON
along with records where the value of thestate
column is literallyNULL
.
- Numeric, date, and time
Snowflake treats numeric, date, or time values that fall outside the range of a privacy domain differently depending on whether the privacy domain was defined by an administrator or an analyst.
- Admin-specified:
When the data provider defines a range privacy domain that contains a subset of the column’s actual values, the values outside the privacy domain are clamped, meaning they are treated as if they are the nearest value in the domain (the minimum or maximum value). For example, if the privacy domain of a column consists of integers between 1-100, a record with an actual value of 105 is treated as if it has a value of 100 when calculating aggregations. Analysts cannot access values outside the privacy domain.
When a join of two privacy-protected tables results in the intersection of privacy domains, values outside the query-time privacy domain are clamped.
- Analyst-specified:
When an analyst specifies a privacy domain for a column that doesn’t have one or narrows an admin-specified privacy domain, the query itself determines what happens to values that fall outside the privacy domain.
If the query uses a filter (WHERE clause), values outside of the privacy domain are ignored when calculating aggregations.
If the query uses a column transformation, values in the column that are outside of the privacy domain are clamped like an admin-specified privacy domain.
How intermediary query elements affect privacy domains¶
How a query is written can affect whether the range of a privacy domain changes or even whether a privacy domain still exists on a column. This section helps you understand how intermediary parts of a query, that is, parts of the query before the final aggregation, can affect the privacy domain of a column.
- Adding new columns
If a query adds a new column that is based on an existing column, specifying or narrowing a privacy domain on the original column has no effect on the new column.
In the following example, assume the data provider defined the privacy domain on the
score
column as a range between 0 and 100. When the query specifies the privacy domain ofscore
as a range between 1 and 2, it has no effect on the privacy domain of the columnscore_derived
.
SELECT AVG(score_derived)
FROM (SELECT score, score_derived FROM t1 WHERE score <= 2);
df.with_column("score_derived", f.col("score")) \
.where((f.col("score") >= 1) & \
(f.col("score") <= 2)) \
.select(f.mean("score_derived")).show()
For example, the output might be:
--------------------------
|"avg(""SCORE_DERIVED"")"|
--------------------------
|31.8196209349 |
--------------------------
- Using a GROUP BY clause in intermediary aggregations
For intermediary portions of a query, using a GROUP BY clause while aggregating a column removes the privacy domain from the column. As a result, you need to specify a new privacy domain on the column if it is used in the final aggregation of the query.
In the following example, the initial aggregation removes any privacy domain that has been set on the
score
column. The query succeeds only because it sets a privacy domain on the alias of the column before the final aggregation.
SELECT AVG(num_scores)
FROM (SELECT COUNT(score) AS num_scores
FROM t1
GROUP BY age)
WHERE num_scores >= 0 AND num_scores <= 100;
df.group_by("age").agg(f.count("score").as_("num_scores")) \
.where((f.col("num_scores") >= 0) & (f.col("num_scores") <= 100)) \
.select(f.mean("num_scores")).show()