Snowpark Python: Eliminate repeated subqueries in Snowpark-generated queries (Preview)

Attention

This behavior change is in the 2025_04 bundle.

For the current status of the bundle, refer to Bundle History.

Repeated subquery elimination identifies identical sub-DataFrames within a query plan and employs Common Table Expressions (CTEs) to construct the final query. Almost half of the queries with compilation times exceeding one second contain at least one redundant subquery. The benefit of this optimization scales with the quantity and complexity of the identified duplicate subqueries.

To demonstrate the difference between old and new behavior, consider the following DataFrame transformations in Snowpark Python:

df = session.table("test_table")
df1 = df.with_column("a", F.col("A") + 1).filter(df.a > 1)
df1 = df1.union_all(df1)

print(df1.queries["queries"][0])
Copy
Before the change:

Because the union_all above is using the same DataFrame df1 twice, the generated SQL queries will repeat the underlying subquery twice:

( SELECT * FROM ( SELECT "B", "C", ( "A" + 1 ) AS "A" FROM test_table )
  WHERE ( "A" > 1 ) )
UNION ALL
( SELECT * FROM ( SELECT "B", "C", ( "A" + 1 ) AS "A" FROM test_table )
  WHERE ( "A" > 1 ) )
Copy
After the change:

The optimization will detect that df1 is being used twice, will replace the subquery with a CTE expression, and then use that to build the query:

WITH SNOWPARK_TEMP_CTE_7G3ZFVJYBK AS
  ( SELECT * FROM ( SELECT "B", "C", ( "A" + 1 ) AS "A" FROM test_table )
      WHERE ( "A" > 1 ) ) ( SELECT * FROM ( SNOWPARK_TEMP_CTE_7G3ZFVJYBK ) )
  UNION ALL
  ( SELECT * FROM ( SNOWPARK_TEMP_CTE_7G3ZFVJYBK ) )
Copy

Ref: 1995