Query Parquet files directly in your data lake

Note

Snowflake offers other ways to work with Parquet files on Iceberg tables:

Parquet Direct (this topic) is for read-only access to Parquet files you manage in external object storage without ingesting data into Snowflake.

Overview

This topic covers how to create a read-only Apache Iceberg™ table for Snowflake from Apache Parquet™ files that you manage in object storage, otherwise known as Parquet Direct. This option allows you to query Parquet data directly in your data lake, so you don’t have to make a copy of the data or pay for data ingestion. Parquet Direct presents multiple benefits:

  • Cost: Significantly lower cost as compared to full ingestion, and no per-file charge in refresh as compared to external tables
  • Seamless syncing: Automatically refresh tables in Snowflake to reflect changes made to files in storage (add, delete, upsert, schema changes)
  • Hive-style partitioning: Full support for Hive-style partitioning (for example, key=value), which makes it easy for you to modernize legacy datasets
  • Read-only permission model: The permission model doesn’t require write access to your storage, enabling its use in security-conscious and regulated verticals
  • Performance: Iceberg-grade query performance unlike external tables

Note

Parquet Direct tables are read-only, so you can’t perform DML operations (INSERT, UPDATE, DELETE) on these tables through Snowflake. Instead, you can use other engines to perform DML operations directly on the files in cloud storage. For the full list of limitations in this private preview, see Considerations and limitations.

Partitioned tables

To improve query performance, Snowflake strongly recommends that you partition tables created from Parquet source files by using partition columns. Query response time is faster when Snowflake processes only a small part of the data instead of having to scan the entire data set. An Iceberg table definition can include multiple partition columns, which impose a multi-dimensional structure on the external data.

How partition columns work depends on the TABLE_FORMAT you use for the catalog integration:

  • TABLE_FORMAT = NONE: Use when your paths are not Hive-style (key=value). You define partition columns explicitly using expressions that parse the file path stored in the METADATA$FILENAME pseudo-column. For example, you write a regular expression to extract a date or identifier from the path. See Specify a partition column.
  • TABLE_FORMAT = HIVE: Snowflake automatically infers partition column names and values from Hive-style key=value directory names in your file paths (for example, year=2024/month=01/). You don’t need to write partition expressions; Snowflake handles this for you.

To partition a table, your data must be organized using logical paths.

When you use TABLE_FORMAT = NONE, Snowflake computes and adds partitions from your explicit partition column expressions when an Iceberg table is refreshed. When you use TABLE_FORMAT = HIVE, Snowflake infers partition columns from Hive-style key=value path segments when an Iceberg table is refreshed.

If your data lake has files with inconsistently-cased Hive-style partition key segments (for example, some files under Year=2024/Month=01/ and others under year=2024/month=01/), you can set MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE on the table. Snowflake then resolves all casing variants to the same partition column rather than treating them as separate partitions.

Workflow

Use the workflow in this section to create an Iceberg table from Parquet source files.

Note

If you store your Parquet files in Amazon S3, Azure Blob Storage, or Google Cloud Storage, you can create a table that supports automatically refreshing the table data. For setup instructions, see Tutorials.

Step 1: Create an external volume

To create an external volume, complete the instructions for your cloud storage service:

Step 2: Create a catalog integration

When you create a catalog integration for Parquet Direct, you specify CATALOG_SOURCE = OBJECT_STORE and a TABLE_FORMAT that controls how Snowflake interprets your file layout:

TABLE_FORMAT = NONE

Use when your data is not in Hive-style key=value layout, or when you want to define partition columns using path expressions. You define partition columns explicitly in the CREATE ICEBERG TABLE statement using expressions against the METADATA$FILENAME pseudo-column.

TABLE_FORMAT = HIVE

Use when your data uses Hive-style partitioning (for example, year=2024/month=01/data.parquet). Snowflake automatically infers partition column names and values from the key=value directory names in your file paths. You can’t use explicit PARTITION BY expressions with this table format.

Create a catalog integration by using the CREATE CATALOG INTEGRATION (Object storage) command:

-- Standard Parquet Direct (custom path structure or explicit partition expressions)
CREATE OR REPLACE CATALOG INTEGRATION parquet_catalog_int
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = NONE
  ENABLED = TRUE;

-- Parquet Direct with automatic Hive partition inference
CREATE OR REPLACE CATALOG INTEGRATION hive_catalog_int
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = HIVE
  ENABLED = TRUE;

Note

Snowflake does not support creating Iceberg tables from Parquet-based table definitions in the AWS Glue Data Catalog.

Step 3: Create an Iceberg table

Create an Iceberg table by using the CREATE ICEBERG TABLE command. See Syntax for parameters and Examples for common patterns.

Syntax

CREATE [ OR REPLACE ] ICEBERG TABLE [ IF NOT EXISTS ] <table_name>
  [
    --Data column definition
    <col_name> <col_type>
    [ COLLATE '<collation_specification>' ]
    [ [ WITH ] MASKING POLICY <policy_name> [ USING ( <col_name> , <cond_col1> , ... ) ] ]
    [ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]
    [ COMMENT '<string_literal>' ]
    -- In-line constraint
    [ inlineConstraint ]
    -- Additional column definitions (data, virtual, or partition columns)
    [ ,     <col_name> <col_type> ...
      -- Virtual column definition
      |  <col_name> <col_type> AS <expr>
      -- Partition column definition (TABLE_FORMAT = NONE only)
      | <part_col_name> <col_type> AS <part_expr>
      -- In-line constraint
      [ inlineConstraint ]
      [ , ... ]
    ]
    -- Out-of-line constraints
    [ , outoflineConstraint [ ... ] ]
  ]
  [ PARTITION BY ( <part_col_name> [, <part_col_name> ... ] ) ]
  [ EXTERNAL_VOLUME = '<external_volume_name>' ]
  [ CATALOG = <catalog_integration_name> ]
  BASE_LOCATION = '<relative_path_from_external_volume>'
  [ ENABLE_SCHEMA_INFERENCE = { TRUE | FALSE } ]
  [ ENABLE_SCHEMA_EVOLUTION = { TRUE | FALSE } ]
  [ MATCH_BY_COLUMN_NAME = { CASE_INSENSITIVE | CASE_SENSITIVE } ]
  [ AUTO_REFRESH = { TRUE | FALSE } ]
  [ REFRESH_ON_CREATE = { TRUE | FALSE } ]
  [ PATTERN = '<regex_pattern>' ]
  [ REPLACE_INVALID_CHARACTERS = { TRUE | FALSE } ]
  [ [ WITH ] ROW ACCESS POLICY <policy_name> ON ( <col_name> [ , <col_name> ... ] ) ]
  [ [ WITH ] TAG ( <tag_name> = '<tag_value>' [ , <tag_name> = '<tag_value>' , ... ] ) ]
  [ COMMENT = '<string_literal>' ]

Where:

inlineConstraint ::=
  [ CONSTRAINT <constraint_name> ]
  { UNIQUE
    | PRIMARY KEY
    | [ FOREIGN KEY ] REFERENCES <ref_table_name> [ ( <ref_col_name> ) ]
  }
  [ <constraint_properties> ]

For additional inline constraint details, see CREATE | ALTER TABLE … CONSTRAINT.

outoflineConstraint ::=
  [ CONSTRAINT <constraint_name> ]
  { UNIQUE [ ( <col_name> [ , <col_name> , ... ] ) ]
    | PRIMARY KEY [ ( <col_name> [ , <col_name> , ... ] ) ]
    | [ FOREIGN KEY ] [ ( <col_name> [ , <col_name> , ... ] ) ]
      REFERENCES <ref_table_name> [ ( <ref_col_name> [ , <ref_col_name> , ... ] ) ]
  }
  [ <constraint_properties> ]

For additional out-of-line constraint details, see CREATE | ALTER TABLE … CONSTRAINT.

Required parameters

table_name

Specifies the identifier (name) for the table; must be unique for the schema in which the table is created.

In addition, the identifier must start with an alphabetic character and cannot contain spaces or special characters unless the entire identifier string is enclosed in double quotes (for example, "My object"). Identifiers enclosed in double quotes are also case-sensitive.

For more details, see Identifier requirements.

BASE_LOCATION = 'relative_path_from_external_volume'

Specifies a relative path from the table’s EXTERNAL_VOLUME location to a directory where Snowflake can access your Parquet files. The base location must point to a directory and cannot point to a single Parquet file.

Optional parameters

col_name

Specifies the column identifier (name). All the requirements for table identifiers also apply to column identifiers.

For more details, see Identifier requirements and Reserved & limited keywords.

Note

In addition to the standard reserved keywords, the following keywords cannot be used as column identifiers because they are reserved for ANSI-standard context functions:

  • CURRENT_DATE
  • CURRENT_ROLE
  • CURRENT_TIME
  • CURRENT_TIMESTAMP
  • CURRENT_USER

For the list of reserved keywords, see Reserved & limited keywords.

col_type

Specifies the data type for the column.

For details about the data types that can be specified for table columns, see Data type mapping and SQL data types reference.

expr

String that specifies the expression for the column. When queried, the column returns results derived from this expression.

A column can be a virtual column, which is defined using an explicit expression.

METADATA$FILENAME:

A pseudo-column that identifies the name of each Parquet data file included in the table, relative to its path on the external volume.

For example:

If the external volume location is s3://bucket-name/data/warehouse/ and the BASE_LOCATION of the table is default_db/schema_name/table_name/, the absolute location of the Parquet file is s3://bucket-name/data/warehouse/default_db/schema_name/table_name/ds=2023-01-01/file1.parquet.

As a result, the METADATA$FILENAME for this file is default_db/schema_name/table_name/ds=2023-01-01/file1.parquet.

CONSTRAINT ...

Defines an inline or out-of-line constraint for the specified column(s) in the table.

For syntax details, see CREATE | ALTER TABLE … CONSTRAINT. For more information about constraints, see Constraints.

COLLATE 'collation_specification'

Specifies the collation to use for column operations such as string comparison. This option applies only to text columns (VARCHAR, STRING, TEXT, and similar types). For more details, see Collation specifications.

MASKING POLICY = policy_name

Specifies the masking policy to set on a column.

EXTERNAL_VOLUME = 'external_volume_name'

Specifies the identifier (name) for the external volume where Snowflake can access your Parquet data files.

You must specify an external volume if you have not set one at the database or schema level. Otherwise, the Iceberg table defaults to the external volume for the schema, database, or account. The schema takes precedence over the database, and the database takes precedence over the account.

CATALOG = 'catalog_integration_name'

Specifies the identifier (name) of the catalog integration for this table.

You must specify a catalog integration if you have not set one at the database or schema level. Otherwise, the Iceberg table defaults to the catalog integration for the schema, database, or account. The schema takes precedence over the database, and the database takes precedence over the account.

ENABLE_SCHEMA_INFERENCE = { TRUE | FALSE }

Specifies whether Snowflake automatically detects column definitions from your Parquet files when you create the table.

  • TRUE: Snowflake reads sample files at creation time and generates column definitions automatically. For tables that use TABLE_FORMAT = HIVE, partition columns are also inferred from the key=value directory structure. If you include explicit column definitions, Snowflake uses them for those columns and infers the remaining columns.
  • FALSE: You must provide explicit column definitions in the CREATE ICEBERG TABLE statement.

Default: TRUE if no column definition is provided; otherwise FALSE.

If your CREATE ICEBERG TABLE statement omits a column list, Snowflake enables schema inference (and sets ENABLE_SCHEMA_EVOLUTION = TRUE by default for that case) even when you don’t specify these parameters explicitly.

ENABLE_SCHEMA_EVOLUTION = { TRUE | FALSE }

Specifies whether Snowflake automatically evolves the table schema when Parquet file schemas change during future REFRESH operations.

  • TRUE: When a REFRESH detects new columns or eligible type widening in source files, Snowflake adds those changes to the table schema automatically. See Schema evolution for supported schema changes.
  • FALSE: Snowflake does not modify the table schema on REFRESH, even if the source files change.

ENABLE_SCHEMA_EVOLUTION = TRUE requires ENABLE_SCHEMA_INFERENCE = TRUE. If you set ENABLE_SCHEMA_EVOLUTION = TRUE without specifying ENABLE_SCHEMA_INFERENCE, inference is automatically enabled.

Default: FALSE.

AUTO_REFRESH = { TRUE | FALSE }

Specifies whether the table data will be automatically refreshed. This parameter is required only when you create an Iceberg table from Parquet files that supports automatic refreshes. For more information, see Auto-refresh for Amazon S3.

When you enable auto-refresh on a table that previously had auto-refresh disabled, Snowflake queues notifications for files that were added while auto-refresh was off and processes a full backfill refresh asynchronously.

REFRESH_ON_CREATE = { TRUE | FALSE }

Specifies whether Snowflake performs an initial refresh of the table at creation time to register all existing Parquet files in the BASE_LOCATION.

  • TRUE: Snowflake scans the BASE_LOCATION at creation time and registers all matching Parquet files.
  • FALSE: Snowflake creates an empty table with no registered files. Use this option when your data lake contains a very large number of files and you want to register files selectively using sub-path refreshes after creation.

When you use REFRESH_ON_CREATE = FALSE, you must provide explicit column definitions. Schema inference requires files to be present at creation time.

Default: TRUE.

Important

If you create a table with REFRESH_ON_CREATE = FALSE and later set AUTO_REFRESH = TRUE, Snowflake runs the same full backfill refresh described for AUTO_REFRESH above. If you want to avoid that behavior, contact Snowflake Support.

PATTERN = '{regex_pattern}'

A regular expression pattern string, enclosed in single quotes, specifying the filenames and paths on the external stage to match.

If you manage your Parquet source files in Amazon S3, you can use this parameter to avoid reaching the AWS limit for the number of SNS topics that can be created per account. To avoid reaching this limit, do the following:

  1. Create one SNS topic at the bucket level
  2. Create tables, which each have their own regular expression pattern, to logically group them

For more information on this SNS topic limit, see the AWS documentation.

Tip

For the best performance, don’t apply patterns that filter on a large number of files.

MATCH_BY_COLUMN_NAME = { CASE_INSENSITIVE | CASE_SENSITIVE }

Controls how column names in Parquet files are matched against the table schema during refresh.

  • CASE_INSENSITIVE: Snowflake matches columns regardless of casing differences between the Parquet file schema and the table definition. Each time a refresh encounters a column name casing not yet recorded for that column, Snowflake adds the new casing to the column’s internal name mappings. This ensures that data from files using different casings (for example, id, ID, and Id) all populate the same table column. For tables that use Hive-style partitioning, partition key segments in the file path are also matched case-insensitively, so files under paths like Status=active/, status=pending/, and STATUS=done/ resolve to the same partition column.

  • CASE_SENSITIVE: Column names must match exactly, including casing. Files with differently-cased column names won’t populate the mismatched columns. This is the default when the option is not specified. If a table previously used CASE_INSENSITIVE and accumulated name mappings across multiple casings (for example, both id and ID), those mappings are preserved after switching to CASE_SENSITIVE and continue to be honored during refresh.

Default: CASE_INSENSITIVE when this option is supported on your account.

Note

Regardless of this setting, you can’t define two columns in the same table whose names differ only in casing (for example, id and ID). Snowflake enforces this constraint at CREATE TABLE time.

Example: Create a table that resolves column name casing differences across Parquet files:

CREATE ICEBERG TABLE my_table
  CATALOG = myCatalogInt
  EXTERNAL_VOLUME = myExternalVolume
  BASE_LOCATION = 'path/to/parquet/files/'
  ENABLE_SCHEMA_INFERENCE = TRUE
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
REPLACE_INVALID_CHARACTERS = { TRUE | FALSE }

Specifies whether to replace invalid UTF-8 characters with the Unicode replacement character (�) in query results. You can only set this parameter for tables that use an external Iceberg catalog.

  • TRUE replaces invalid UTF-8 characters with the Unicode replacement character.
  • FALSE leaves invalid UTF-8 characters unchanged. Snowflake returns a user error message when it encounters invalid UTF-8 characters in a Parquet data file.

If not specified, the Iceberg table defaults to the parameter value for the schema, database, or account. The schema takes precedence over the database, and the database takes precedence over the account.

Default: FALSE

ROW ACCESS POLICY policy_name ON ( col_name [ , col_name ... ] )

Specifies the row access policy to set on a table.

TAG ( tag_name = 'tag_value' [ , tag_name = 'tag_value' , ... ] )

Specifies the tag name and the tag string value.

The tag value is always a string, and the maximum number of characters for the tag value is 256.

For information about specifying tags in a statement, see Tag quotas.

COMMENT 'string_literal'

Specifies a comment for the column or the table.

Comments can be specified at the column level or the table level. The syntax for each is slightly different.

Partitioning parameters (TABLE_FORMAT = NONE)

Use these parameters to partition your Iceberg table when using TABLE_FORMAT = NONE. For tables that use TABLE_FORMAT = HIVE, partition columns are automatically inferred from key=value directory names and you can’t use explicit partition column definitions.

part_col_name col_type AS part_expr

Defines one or more partition columns in the Iceberg table.

A partition column must evaluate as an expression that parses the path and/or filename information in the METADATA$FILENAME pseudo-column. A partition consists of all data files that match the path and/or filename in the expression for the partition column.

part_col_nameString that specifies the partition column identifier (the name). All the requirements for table identifiers also apply to column identifiers.
col_typeString (constant) that specifies the data type for the column. The data type must match the result of part_expr for the column.
part_exprString that specifies the expression for the column. The expression must include the METADATA$FILENAME pseudocolumn.

Iceberg tables currently support the following subset of functions in partition expressions:

List of supported functions:

[ PARTITION BY ( part_col_name [, part_col_name ... ] ) ]

Specifies any partition columns to evaluate for the Iceberg table.

Usage:

When querying an Iceberg table, include one or more partition columns in a WHERE clause, for example:

... WHERE part_col_name = 'filter_value'

A common practice is to partition the data files based on increments of time; or, if the data files are staged from multiple sources, to partition by a data source identifier and date or timestamp.

Examples

Specify data columns

The following example creates an Iceberg table from Parquet files in object storage.

The example specifies the external volume and catalog integration created previously in this workflow, and provides a value for the required BASE_LOCATION parameter.

CREATE ICEBERG TABLE myTable (
    first_name STRING,
    last_name STRING,
    amount NUMBER,
    create_date DATE
  )
  CATALOG = parquet_catalog_int
  EXTERNAL_VOLUME = myIcebergVolume
  BASE_LOCATION='relative_path_from_external_volume/';

Specify a partition column

The following example creates an Iceberg table from Parquet files in object storage and defines a partition column named sr_returned_date_sk using a path expression (TABLE_FORMAT = NONE).

 CREATE OR REPLACE ICEBERG TABLE store_returns (
  sr_returned_date_sk integer AS
    IFF(
        regexp_substr(METADATA$FILENAME, 'sr_returned_date_sk=(.*)/', 1, 1, 'e') = '__HIVE_DEFAULT_PARTITION__',
        null,
        TO_NUMBER(
          regexp_substr(METADATA$FILENAME, 'sr_returned_date_sk=(.*)/', 1, 1, 'e')
        )
    ),
  sr_return_time_sk         integer                       ,
  sr_item_sk                integer                       ,
  sr_customer_sk            integer                       ,
  sr_cdemo_sk               integer                       ,
  sr_hdemo_sk               integer                       ,
  sr_addr_sk                integer                       ,
  sr_store_sk               integer                       ,
  sr_reason_sk              integer                       ,
  sr_ticket_number          bigint                        ,
  sr_return_quantity        integer                       ,
  sr_return_amt             decimal(7,2)                  ,
  sr_return_tax             decimal(7,2)                  ,
  sr_return_amt_inc_tax     decimal(7,2)                  ,
  sr_fee                    decimal(7,2)                  ,
  sr_return_ship_cost       decimal(7,2)                  ,
  sr_refunded_cash          decimal(7,2)                  ,
  sr_reversed_charge        decimal(7,2)                  ,
  sr_store_credit           decimal(7,2)                  ,
  sr_net_loss               decimal(7,2)
)
PARTITION BY (sr_returned_date_sk)
EXTERNAL_VOLUME = 'exvol'
CATALOG = 'parquet_catalog_int'
BASE_LOCATION = 'store_returns/';

Automatic schema inference

The following example creates an Iceberg table using automatic schema inference. Snowflake detects column definitions from the Parquet files and evolves the schema on future refreshes.

CREATE OR REPLACE ICEBERG TABLE auto_schema_table
  EXTERNAL_VOLUME = 'exvol'
  CATALOG = 'parquet_catalog_int'
  BASE_LOCATION = 'auto_schema_table/';

You can also include column definitions to specify certain columns explicitly. Snowflake uses your definitions for those columns and infers the remaining ones. In this case, you must specify ENABLE_SCHEMA_INFERENCE = TRUE.

CREATE OR REPLACE ICEBERG TABLE auto_schema_table_col_spec (col1 INT)
  EXTERNAL_VOLUME = 'exvol'
  CATALOG = 'parquet_catalog_int'
  BASE_LOCATION = 'auto_schema_table_col_spec/'
  ENABLE_SCHEMA_INFERENCE = TRUE
  ENABLE_SCHEMA_EVOLUTION = TRUE;

Automatic Hive partition inference

The following example creates an Iceberg table over data with Hive-style partitioning (for example, year=2024/month=01/data.parquet). Using a catalog integration with TABLE_FORMAT = HIVE, Snowflake automatically infers the year and month partition columns from the directory names. No partition expressions are required.

-- Catalog integration for Hive partition inference
CREATE OR REPLACE CATALOG INTEGRATION hive_catalog_int
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = HIVE
  ENABLED = TRUE;

-- Iceberg table: partition columns (year, month) are inferred automatically
CREATE OR REPLACE ICEBERG TABLE sales_data
  EXTERNAL_VOLUME = 'exvol'
  CATALOG = 'hive_catalog_int'
  BASE_LOCATION = 'sales/';

Selective file registration

For data lakes with very large numbers of files, you can create an empty table and then register files selectively by refreshing specific paths one at a time. This avoids a large initial bootstrap scan and lets you load only the partitions you need immediately.

Set REFRESH_ON_CREATE = FALSE to skip the initial file registration. You must provide explicit column definitions because schema inference requires files to be present at creation time.

CREATE ICEBERG TABLE my_large_table (
    id INT,
    event_type STRING,
    event_date DATE
)
EXTERNAL_VOLUME = 'exvol'
CATALOG = 'parquet_catalog_int'
BASE_LOCATION = 'data-lake/'
ENABLE_SCHEMA_INFERENCE = FALSE
REFRESH_ON_CREATE = FALSE;

After creation, the table is empty. Use ALTER ICEBERG TABLE ... REFRESH to register files (see Manual refresh). You can register the entire BASE_LOCATION at once, or target specific subdirectories or files:

-- Register all files under the table's BASE_LOCATION
ALTER ICEBERG TABLE my_large_table REFRESH;

-- Register files under a Hive-style partition prefix (for TABLE_FORMAT = HIVE
-- or hand-structured paths that use key=value directory naming)
ALTER ICEBERG TABLE my_large_table REFRESH '/year=2024/month=01/';
ALTER ICEBERG TABLE my_large_table REFRESH '/year=2024/month=02/';

-- Register files under a regular subdirectory (for TABLE_FORMAT = NONE with
-- custom directory layouts)
ALTER ICEBERG TABLE my_large_table REFRESH '/region/us-west/2024/';
ALTER ICEBERG TABLE my_large_table REFRESH '/region/us-east/2024/';

-- Register a specific Parquet file
ALTER ICEBERG TABLE my_large_table REFRESH '/year=2024/month=01/part-00001.parquet';

Note

  • The path argument to REFRESH is relative to the table’s BASE_LOCATION.
  • If you specify a path that does not exist, the refresh proceeds as if no path was specified.
  • When auto-refresh is enabled, you can’t perform a manual refresh. Disable auto-refresh first.
  • For Parquet Direct refresh syntax and examples, see Manual refresh. The ALTER ICEBERG TABLE … REFRESH SQL reference documents refresh for externally managed Iceberg tables, not Parquet Direct path registration.

Refresh the table

After you create an Iceberg table from Parquet files, you can refresh the table data. Refreshing updates the table with the most recent changes to your Parquet files in object storage. You can either automatically refresh the table data or manually refresh the table data.

Note

We recommend setting up automatic refresh for the Parquet source files.

Schema evolution

With ENABLE_SCHEMA_EVOLUTION = TRUE, table refresh synchronizes your table with the following schema changes to the Parquet source files:

  • New columns
  • Type widening for the following scenarios to adhere to the Apache Iceberg specification:
    • int to long
    • float to double
    • decimal(p,s) to decimal(p1,s)
Primitive columns
Original TypeWidened TypeNotes
intlongNUMBER(10,0) → NUMBER(19,0)
floatdoubleBoth map to Snowflake FLOAT
decimal(p1, s)decimal(p2, s)Precision can increase (for example, decimal(5,2) → decimal(7,2))
Struct fields

Type widening applies recursively to struct fields:

Original TypeWidened Type
struct<field:int>struct<field:long>
struct<field:float>struct<field:double>
struct<field:decimal(p1,s)>struct<field:decimal(p2,s)>

The following example shows type widening for a struct field:

struct<p1:int, p2:float, p3:decimal(5,2)>
    ↓
struct<p1:long, p2:double, p3:decimal(8,2)>
Array elements

Type widening applies to array element types:

Original TypeWidened Type
array<int>array<long>
array<float>array<double>
array<decimal(p1,s)>array<decimal(p2,s)>
Map keys and values

Type widening applies to both map keys and values:

Original TypeWidened Type
map<int, V>map<long, V>
map<K, float>map<K, double>
map<K, decimal(p1,s)>map<K, decimal(p2,s)>

The following example shows type widening for map fields:

map<int, string>           → map<long, string>
map<string, float>         → map<string, double>
map<string, decimal(5,2)>  → map<string, decimal(8,2)>

Auto-refresh

Parquet Direct supports event-driven automatic refresh for all three major cloud storage providers:

When you enable auto-refresh on a table that previously had auto-refresh disabled, Snowflake queues notifications for files that were added while auto-refresh was off and processes a full backfill refresh asynchronously.

For setup instructions, see Tutorials. For billing details, see Costs.

Manual refresh

Important

When auto refresh is enabled on a table, you can’t perform a manual refresh on the table. To perform a manual refresh on the table, auto refresh must be disabled.

After you create an Iceberg table from Parquet files, you can refresh the table data using the ALTER ICEBERG TABLE command.

ALTER ICEBERG TABLE [ IF EXISTS ] <table_name> REFRESH ['<relative_path>']

Where:

relative_path

Optional path to a Parquet file or a directory of Parquet files that you want to refresh.

Note

If you specify a relative path that does not exist, the table refresh proceeds as if no relative path was specified.

Refresh all files in the table’s BASE_LOCATION

To manually refresh all of the files in the table’s BASE_LOCATION, omit the relative path argument:

ALTER ICEBERG TABLE myIcebergTable REFRESH;

Refresh files in a subpath from the BASE_LOCATION

To manually refresh a set of Parquet files in a directory, specify a relative path to that directory from the table’s BASE_LOCATION:

ALTER ICEBERG TABLE myIcebergTable REFRESH '/relative/path/to/myParquetDataFiles';

Refresh a particular file

To manually refresh a particular Parquet file, specify a relative path to that file from the BASE_LOCATION:

ALTER ICEBERG TABLE myIcebergTable REFRESH '/relative/path/to/myParquetFile.parquet';

Refresh a particular partition

To manually refresh a particular partition, specify a relative path to that partition from the BASE_LOCATION:

ALTER ICEBERG TABLE store_returns REFRESH '/sr_returned_date_sk=20231201/';

Refresh monitoring

Use SYSTEM$PARQUET_DIRECT_AUTO_REFRESH_STATUS to monitor whether event-driven refresh is running as expected and to diagnose issues for Parquet Direct tables.

Note

This function is specific to Parquet Direct tables (tables created with TABLE_FORMAT = NONE or TABLE_FORMAT = HIVE). For externally managed Iceberg tables that use other catalog sources (such as Delta Direct or REST catalog tables), use SYSTEM$AUTO_REFRESH_STATUS instead.

SYSTEM$PARQUET_DIRECT_AUTO_REFRESH_STATUS('<table_name>')

Arguments

'table_name'

The name of the Parquet Direct Iceberg table for which you want to retrieve the current automated refresh status.

If using the fully qualified name, enclose the entire name in single quotes, including the database and schema. If the table name is case-sensitive or includes any special characters or spaces, you must use double quotes. Enclose the double quotes within the single quotes, for example, '"Table_Name"'.

Returns

The function returns a JSON object containing the following name/value pairs:

{
  "executionState": "<value>",
  "invalidExecutionStateReason": "<value>",
  "pendingFileCount": "<value>",
  "oldestFileTimestamp": "<value>",
  "notificationChannelName": "<value>",
  "numOutstandingMessagesOnChannel": "<value>",
  "lastReceivedMessageTimestamp": "<value>",
  "channelErrorMessage": "<value>",
  "lastErrorRecordTimestamp": "<value>"
}

Where:

executionState

Current execution state of the pipe that Snowflake uses to process auto-refresh notifications for the table.

Values:

  • RUNNING: Automated refresh is running as expected. This status doesn’t indicate whether Snowflake is actively processing event notifications at this moment.
  • STALLED: Automated refresh encountered an error and is attempting to recover.
  • STOPPED: Automated refresh encountered an unrecoverable error and has stopped. You must take action to resume automated refresh.
invalidExecutionStateReason

Error message associated with a STALLED or STOPPED execution state.

pendingFileCount

Number of Apache Parquet™ files queued for processing by the refresh pipe.

oldestFileTimestamp

Timestamp of the oldest file in the pending queue. Snowflake sets this timestamp when the file is added to the queue.

notificationChannelName

Name of the notification channel associated with the table’s auto-refresh pipe.

numOutstandingMessagesOnChannel

Approximate number of pending messages in the notification channel (for example, the SNS queue or Azure Storage queue).

lastReceivedMessageTimestamp

Timestamp of the most recent notification message received from the channel.

channelErrorMessage

Error message from the notification channel, if any.

lastErrorRecordTimestamp

Timestamp of the most recent error message from the notification channel.

Usage notes

  • Calling this function requires a role that has the OWNERSHIP privilege on the Iceberg table.
  • This function only works with Parquet Direct tables (TABLE_FORMAT = NONE or TABLE_FORMAT = HIVE). Calling it on a non-Parquet Direct table returns an error.
  • The function only returns data when AUTO_REFRESH = TRUE is set on the table. If auto-refresh is not enabled, the function returns a message indicating that auto-refresh is not enabled.

Example

Retrieve the automated refresh status for a Parquet Direct table my_parquet_table in the schema db1.schema1:

SELECT SYSTEM$PARQUET_DIRECT_AUTO_REFRESH_STATUS('db1.schema1.my_parquet_table');

Costs

Parquet Direct auto-refresh uses serverless compute and is billed as credits under the AUTOMATED_REFRESH_AND_DATA_REGISTRATION service type. There is no per-file charge: cost is based solely on the compute time required to process each refresh. No warehouse is required or consumed.

Monitoring costs

To view per-table auto-refresh credit usage, query the PIPE_USAGE_HISTORY view view. The Iceberg table name appears in the pipe_name column.

SELECT pipe_name, start_time, end_time, credits_used
  FROM snowflake.account_usage.pipe_usage_history
  WHERE pipe_name = '<your_table_name>'
  ORDER BY start_time DESC;

Tutorials

Auto-refresh for Amazon S3

If you manage your Parquet source files in Amazon S3, you can create an Iceberg table that uses Amazon SNS (Simple Notification Service) for automatic refresh.

This section provides instructions for creating an Iceberg table that automatically refreshes the Parquet source files.

Prerequisite: Create an Amazon SNS topic and subscription

  1. Create an SNS topic in your AWS account to handle all messages for the Snowflake external volume location on your S3 bucket.
  2. Subscribe your target destinations for the S3 event notifications (for example, other SQS queues or AWS Lambda workloads) to this topic. SNS publishes event notifications for your bucket to all subscribers to the topic.

For full instructions, see the SNS documentation.

Step 1: Subscribe the Snowflake SQS queue to your SNS topic

  1. Log in to the AWS Management Console.

  2. From the home dashboard, select Simple Notification Service (SNS).

  3. In the left-hand navigation pane, select Topics.

  4. Locate the topic for your S3 bucket. Note the topic ARN.

  5. Using a Snowflake client, query the SYSTEM$GET_AWS_SNS_IAM_POLICY system function with your SNS topic ARN:

    select system$get_aws_sns_iam_policy('<sns_topic_arn>');
    

    The function returns an IAM policy that grants a Snowflake SQS queue permission to subscribe to the SNS topic.

  6. Return to the AWS Management console. In the left-hand navigation pane, select Topics.

  7. Select the topic for your S3 bucket, then select Edit. The Edit page opens.

  8. Select Access policy - Optional to expand this area of the page.

  9. Merge the IAM policy addition from the SYSTEM$GET_AWS_SNS_IAM_POLICY function results into the JSON document.

  10. To allow S3 to publish event notifications for the bucket to the SNS topic, add an additional policy grant.

For example:

{
 "Sid":"s3-event-notifier",
 "Effect":"Allow",
 "Principal":{
    "Service":"s3.amazonaws.com"
 },
 "Action":"SNS:Publish",
 "Resource":"arn:aws:sns:us-west-2:001234567890:s3_mybucket",
 "Condition":{
    "ArnLike":{
       "aws:SourceArn":"arn:aws:s3:*:*:s3_mybucket"
    }
 }
 }

Merged IAM policy:

{
  "Version":"2008-10-17",
  "Id":"__default_policy_ID",
  "Statement":[
  {
     "Sid":"__default_statement_ID",
     "Effect":"Allow",
     "Principal":{
        "AWS":"*"
     }
     ..
  },
  {
     "Sid":"1",
     "Effect":"Allow",
     "Principal":{
       "AWS":"arn:aws:iam::123456789001:user/vj4g-a-abcd1234"
      },
      "Action":[
        "sns:Subscribe"
      ],
      "Resource":[
        "arn:aws:sns:us-west-2:001234567890:s3_mybucket"
      ]
  },
  {
     "Sid":"s3-event-notifier",
     "Effect":"Allow",
     "Principal":{
        "Service":"s3.amazonaws.com"
     },
     "Action":"SNS:Publish",
     "Resource":"arn:aws:sns:us-west-2:001234567890:s3_mybucket",
     "Condition":{
        "ArnLike":{
           "aws:SourceArn":"arn:aws:s3:*:*:s3_mybucket"
        }
     }
   }
]
 }
  1. Select Save changes.

Step 2: Create an external volume with your AWS SNS topic

To configure an external volume, complete the instructions for Configure an external volume for Amazon S3.

In Step 4: Creating an external volume in Snowflake, specify the following additional parameter:

AWS_SNS_TOPIC = {'<sns_topic_arn>'}

Specifies the Amazon Resource Name (ARN) of the Amazon SNS topic that handles all messages for your external volume location.

For example:

CREATE OR REPLACE EXTERNAL VOLUME auto_refresh_exvol
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-us-east-1'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://s3_mybucket/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::0123456789102:role/my-role'
      AWS_SNS_TOPIC = 'arn:aws:sns:us-east-1:0123456789102:sns_topic'
    )
  );

Step 3: Create a catalog integration

Create a catalog integration by using the CREATE CATALOG INTEGRATION (Object storage) command.

Note

Snowflake does not support creating Iceberg tables from Parquet-based table definitions in the AWS Glue Data Catalog.

The following example creates a catalog integration for Parquet files in object storage.

CREATE OR REPLACE CATALOG INTEGRATION parquet_catalog_int
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = NONE
  ENABLED = TRUE;

Step 4: Create an Iceberg table

Create an Iceberg table by using the CREATE ICEBERG TABLE command, setting the AUTO_REFRESH parameter equal to TRUE.

CREATE OR REPLACE ICEBERG TABLE my_s3_auto_refresh_table (
    first_name STRING,
    last_name STRING,
    amount NUMBER,
    create_date DATE
  )
  CATALOG = parquet_catalog_int
  EXTERNAL_VOLUME = myIcebergVolume
  BASE_LOCATION = 'relative_path_from_external_volume'
  AUTO_REFRESH = TRUE;
Example: Create an Iceberg table from Parquet files using automatic schema inference and evolution with auto refresh
CREATE OR REPLACE ICEBERG TABLE auto_schema_table
  EXTERNAL_VOLUME = 'exvol'
  CATALOG = 'parquet_catalog_int'
  BASE_LOCATION = 'auto_schema_table/'
  ENABLE_SCHEMA_INFERENCE = TRUE
  ENABLE_SCHEMA_EVOLUTION = TRUE
  AUTO_REFRESH = TRUE;

Alternatively, you can include a column definition to provide information about certain columns. Snowflake uses the definition to create those columns, then automatically detects other table columns. In this scenario, you must specify ENABLE_SCHEMA_INFERENCE = TRUE since you include a column definition.

CREATE OR REPLACE ICEBERG TABLE auto_schema_table_col_spec (col1 INT)
  EXTERNAL_VOLUME = 'exvol'
  CATALOG = 'parquet_catalog_int'
  BASE_LOCATION = 'auto_schema_table_col_spec/'
  ENABLE_SCHEMA_INFERENCE = TRUE
  ENABLE_SCHEMA_EVOLUTION = TRUE
  AUTO_REFRESH = TRUE;

Troubleshoot

To track the status of automatic refreshes for your Iceberg table, use the SYSTEM$PARQUET_DIRECT_AUTO_REFRESH_STATUS function.

For example:

SELECT SYSTEM$PARQUET_DIRECT_AUTO_REFRESH_STATUS('my_s3_auto_refresh_table');

Auto-refresh for Azure Blob Storage

If you manage your Parquet source files in Microsoft Azure, you can create an Iceberg table that uses Azure Event Grid for automatic refresh.

This section provides instructions for creating an Iceberg table that automatically refreshes the Parquet source files.

Supported accounts, APIs, and schemas

Snowflake supports the following types of blob storage accounts:

  • Blob storage
  • Data Lake Storage Gen2
  • General-purpose v2

Automatic refresh of your Iceberg table from Parquet files isn’t supported for Microsoft Fabric OneLake. For OneLake Iceberg tables from Parquet files, you must manually refresh the table. See Manual refresh.

Note

Only Microsoft.Storage.BlobCreated and Microsoft.Storage.BlobDeleted events trigger the refreshing of the Parquet source files. Adding new objects to blob storage triggers these events. Renaming a directory or object doesn’t trigger these events. Snowflake recommends that you only send supported events for Iceberg tables from Parquet files to reduce costs, event noise, and latency.

For cloud platform support, triggering automated refreshes of the Parquet source files using Azure Event Grid messages is supported by Snowflake accounts hosted on Microsoft Azure (Azure).

Snowflake supports the following Microsoft.Storage.BlobCreated APIs:

  • CopyBlob
  • PutBlob
  • PutBlockList
  • FlushWithClose
  • SftpCommit

Snowflake supports the following Microsoft.Storage.BlobDeleted APIs:

  • DeleteBlob
  • DeleteFile
  • SftpRemove

For Data Lake Storage Gen2 storage accounts, Microsoft.Storage.BlobCreated events are triggered when clients use the CreateFile and FlushWithClose operations. If the SSH File Transfer Protocol (SFTP) is used, Microsoft.Storage.BlobCreated events are triggered with SftpCreate and SftpCommit operations. The CreateFile or SftpCreate API alone does not indicate a commit of a file in the storage account. If the FlushWithClose or SftpCommit message is not sent, Snowflake does not refresh the Parquet source files.

Snowflake only supports the Azure Event Grid event schema; it doesn’t support the CloudEvents schema with Azure Event Grid.

Iceberg tables for Snowflake from Parquet files that you manage in object storage don’t support storage versioning.

Prerequisites

Before you proceed, ensure you meet the following prerequisites:

  • A role with the account-level CREATE EXTERNAL VOLUME privilege and the CREATE ICEBERG TABLE privilege on a schema.
  • Administrative access to Microsoft Azure. If you aren’t an Azure administrator, ask your Azure administrator to complete the steps in Step 1: Configure the Event Grid subscription.

Step 1: Configure the Event Grid subscription

This section describes how to set up an Event Grid subscription for Azure Storage events using the Azure CLI.

Create a resource group

An Event Grid topic provides an endpoint where the source (that is, Azure Storage) sends events. A topic is used for a collection of related events. Event Grid topics are Azure resources, and must be placed in an Azure resource group.

Execute the following command to create a resource group:

az group create --name <resource_group_name> --location <location>

Where:

  • {<resource_group_name>} is the name of the new resource group.
  • {<location>} is the location, or region in Snowflake terminology, of your Azure Storage account.
Enable the Event Grid resource provider

Execute the following command to register the Event Grid resource provider. Note that this step is only required if you have not previously used Event Grid with your Azure account:

az provider register --namespace Microsoft.EventGrid
az provider show --namespace Microsoft.EventGrid --query "registrationState"
Create a storage account for data files

Execute the following command to create a storage account to store your data files. This account must be either a Blob storage (that is, a BlobStorage kind) or GPv2 (that is, a StorageV2 kind) account, because only these two account types support event messages.

Note

If you already have a Blob storage or GPv2 account, you can use that account instead.

For example, create a Blob storage account:

az storage account create \
  --resource-group <resource_group_name> \
  --name <storage_account_name> \
  --sku Standard_LRS \
  --location <location> \
  --kind BlobStorage \
  --access-tier Hot

Where:

  • {<resource_group_name>} is the name of the resource group you created in Create a resource group.
  • {<storage_account_name>} is the name of the new storage account.
  • {<location>} is the location of your Azure Storage account.
Create a storage account for the storage queue

Execute the following command to create a storage account to host your storage queue. This account must be a GPv2 account, because only this kind of account supports event messages to a storage queue.

Note

If you already have a GPv2 account, you can use that account to host both your data files and your storage queue.

For example, create a GPv2 account:

az storage account create \
  --resource-group <resource_group_name> \
  --name <storage_account_name> \
  --sku Standard_LRS \
  --location <location> \
  --kind StorageV2

Where:

  • {<resource_group_name>} is the name of the resource group you created in Create a resource group.
  • {<storage_account_name>} is the name of the new storage account.
  • {<location>} is the location of your Azure Storage account.
Create a storage queue

A single Azure Queue Storage queue can collect the event messages for many Event Grid subscriptions. For best performance, Snowflake recommends creating a single storage queue to accommodate all of your subscriptions related to Snowflake.

Execute the following command to create a storage queue. A storage queue stores a set of messages, in this case event messages from Event Grid:

az storage queue create \
  --name <storage_queue_name> \
  --account-name <storage_account_name>

Where:

Export the storage account and queue IDs for reference

Execute the following commands to set environment variables for the storage account and queue IDs that will be requested later in these instructions:

Linux or macOS:

export storageid=$(az storage account show \
  --name <data_storage_account_name> \
  --resource-group <resource_group_name> \
  --query id --output tsv)
export queuestorageid=$(az storage account show \
  --name <queue_storage_account_name> \
  --resource-group <resource_group_name> \
  --query id --output tsv)
export queueid="$queuestorageid/queueservices/default/queues/<storage_queue_name>"

Windows:

set storageid=$(az storage account show \
  --name <data_storage_account_name> \
  --resource-group <resource_group_name> \
  --query id --output tsv)
set queuestorageid=$(az storage account show \
  --name <queue_storage_account_name> \
  --resource-group <resource_group_name> \
  --query id --output tsv)
set queueid="%queuestorageid%/queueservices/default/queues/<storage_queue_name>"

Where:

Install the Event Grid extension

Execute the following command to install the Event Grid extension for Azure CLI:

az extension add --name eventgrid
Create the Event Grid subscription

Execute the following command to create the Event Grid subscription. Subscribing to a topic informs Event Grid which events to track:

Linux or macOS:

az eventgrid event-subscription create \
  --source-resource-id $storageid \
  --name <subscription_name> \
  --endpoint-type storagequeue \
  --endpoint $queueid \
  --advanced-filter data.api stringin CopyBlob PutBlob PutBlockList \
    FlushWithClose SftpCommit DeleteBlob DeleteFile SftpRemove

Windows:

az eventgrid event-subscription create \
  --source-resource-id %storageid% \
  --name <subscription_name> \
  --endpoint-type storagequeue \
  --endpoint %queueid% \
  --advanced-filter data.api stringin CopyBlob PutBlob PutBlockList \
    FlushWithClose SftpCommit DeleteBlob DeleteFile SftpRemove

Where:

Step 2: Create a notification integration

A notification integration is a Snowflake object that provides an interface between Snowflake and a third-party cloud message queuing service such as Azure Event Grid.

Note

A single notification integration supports a single Azure Storage queue. Referencing the same storage queue in multiple notification integrations can result in missing data in target tables because event notifications are split between notification integrations.

Retrieve the storage queue URL and tenant ID
  1. Sign in to the Microsoft Azure portal.

  2. Navigate to Storage account » Queue service » Queues. Record the URL for the queue you created in Create a storage queue for reference later. The URL has the following format:

    https://<storage_account_name>.queue.core.windows.net/<storage_queue_name>
    
  3. Navigate to Azure Active Directory » Properties. Record the Tenant ID value for reference later. The directory ID, or tenant ID, is needed to grant Snowflake access to the Event Grid subscription.

Create the notification integration

Create a notification integration by using the CREATE NOTIFICATION INTEGRATION command.

Note

  • Only account administrators (users with the ACCOUNTADMIN role) or a role with the global CREATE INTEGRATION privilege can execute this SQL command.
  • The Azure service principal for notification integrations is different from the service principal created for storage integrations.
CREATE NOTIFICATION INTEGRATION <integration_name>
  ENABLED = true
  TYPE = QUEUE
  NOTIFICATION_PROVIDER = AZURE_STORAGE_QUEUE
  AZURE_STORAGE_QUEUE_PRIMARY_URI = '<queue_URL>'
  AZURE_TENANT_ID = '<directory_ID>';

Where:

For example:

CREATE NOTIFICATION INTEGRATION my_notification_int
  ENABLED = true
  TYPE = QUEUE
  NOTIFICATION_PROVIDER = AZURE_STORAGE_QUEUE
  AZURE_STORAGE_QUEUE_PRIMARY_URI = 'https://myqueue.queue.core.windows.net/mystoragequeue'
  AZURE_TENANT_ID = 'a123bcde-1234-5678-abc1-9abc12345678';
Grant Snowflake access to the storage queue
  1. Execute the DESCRIBE INTEGRATION command to retrieve the consent URL:

    DESC NOTIFICATION INTEGRATION <integration_name>;
    

    Where:

    Note the values in the following columns:

    AZURE_CONSENT_URL:

    URL to the Microsoft permissions request page.

    AZURE_MULTI_TENANT_APP_NAME:

    Name of the Snowflake client application created for your account. In a later step in this section, you will need to grant this application the permissions necessary to obtain an access token on your allowed topic.

  2. In a web browser, navigate to the URL in the AZURE_CONSENT_URL column. The page displays a Microsoft permissions request page.

  3. Select Accept. This action allows the Azure service principal created for your Snowflake account to obtain an access token on any resource inside your tenant. Obtaining an access token succeeds only if you grant the service principal the appropriate permissions on the container (see the next step).

    The Microsoft permissions request page redirects to the Snowflake corporate site (snowflake.com).

  4. Sign in to the Microsoft Azure portal.

  5. Navigate to Azure Active Directory » Enterprise applications. Verify that the Snowflake application identifier you recorded in Step 1 in this section is listed.

    Important

    If you delete the Snowflake application in Azure Active Directory at a later time, the notification integration stops working.

  6. Navigate to Queues » {<storage_queue_name>}, where {<storage_queue_name>} is the name of the storage queue you created in Create a storage queue.

  7. Select Access Control (IAM) » Add role assignment.

  8. Search for the Snowflake service principal. This is the identity in the AZURE_MULTI_TENANT_APP_NAME property in the DESC NOTIFICATION INTEGRATION output (in Step 1). Search for the string before the underscore in the AZURE_MULTI_TENANT_APP_NAME property.

    Important

    • It can take an hour or longer for Azure to create the Snowflake service principal requested through the Microsoft request page in this section. If the service principal is not available immediately, we recommend waiting an hour or two and then searching again.
    • If you delete the service principal, the notification integration stops working.
  9. Grant the Snowflake app the following permissions:

    • Role: Storage Queue Data Message Processor (the minimum required role), or Storage Queue Data Contributor.
    • Assign access to: Azure AD user, group, or service principal.
    • Select: The appDisplayName value.

    The Snowflake application identifier should now be listed under Storage Queue Data Message Processor or Storage Queue Data Contributor (on the same dialog).

Step 3: Create an external volume with your Azure storage queue

To configure an external volume, complete the instructions for Configure an external volume for Azure.

In Step 1: Create an external volume in Snowflake, specify the following additional parameter:

AZURE_STORAGE_QUEUE_PRIMARY_URI = {'<queue_URL>'}

Specifies the URL of the Azure Storage queue that handles all messages for your external volume location.

For example:

CREATE OR REPLACE EXTERNAL VOLUME auto_refresh_exvol
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-azure-location'
      STORAGE_PROVIDER = 'AZURE'
      STORAGE_BASE_URL = 'azure://myaccount.blob.core.windows.net/mycontainer/'
      AZURE_TENANT_ID = 'a123b4c5-1234-123a-a12b-1a23b45678c9'
      AZURE_STORAGE_QUEUE_PRIMARY_URI = 'https://myqueue.queue.core.windows.net/mystoragequeue'
    )
  );

Step 4: Create a catalog integration

Create a catalog integration by using the CREATE CATALOG INTEGRATION (Object storage) command.

Note

Snowflake does not support creating Iceberg tables from Parquet-based table definitions in the Azure environment.

The following example creates a catalog integration for Parquet files in object storage.

CREATE OR REPLACE CATALOG INTEGRATION parquet_catalog_int
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = NONE
  ENABLED = TRUE;

Step 5: Create an Iceberg table

Create an Iceberg table by using the CREATE ICEBERG TABLE command, setting the AUTO_REFRESH parameter equal to TRUE.

CREATE OR REPLACE ICEBERG TABLE my_azure_auto_refresh_table (
    first_name STRING,
    last_name STRING,
    amount NUMBER,
    create_date DATE
  )
  CATALOG = parquet_catalog_int
  EXTERNAL_VOLUME = myIcebergVolume
  BASE_LOCATION = 'relative_path_from_external_volume'
  AUTO_REFRESH = TRUE;
Example: Create an Iceberg table from Parquet files using automatic schema inference with auto refresh
CREATE OR REPLACE ICEBERG TABLE auto_schema_table
  EXTERNAL_VOLUME = 'exvol'
  CATALOG = 'parquet_catalog_int'
  BASE_LOCATION = 'auto_schema_table/'
  ENABLE_SCHEMA_INFERENCE = TRUE
  ENABLE_SCHEMA_EVOLUTION = TRUE
  AUTO_REFRESH = TRUE;

Alternatively, you can include a column definition to provide information about certain columns. Snowflake uses the definition to create those columns, then automatically detects other table columns. In this scenario, you must specify ENABLE_SCHEMA_INFERENCE = TRUE since you include a column definition.

CREATE OR REPLACE ICEBERG TABLE auto_schema_table_col_spec (col1 INT)
  EXTERNAL_VOLUME = 'exvol'
  CATALOG = 'parquet_catalog_int'
  BASE_LOCATION = 'auto_schema_table_col_spec/'
  ENABLE_SCHEMA_INFERENCE = TRUE
  ENABLE_SCHEMA_EVOLUTION = TRUE
  AUTO_REFRESH = TRUE;

Troubleshoot

To track the status of automatic refreshes for your Iceberg table, use the SYSTEM$PARQUET_DIRECT_AUTO_REFRESH_STATUS function.

For example:

SELECT SYSTEM$PARQUET_DIRECT_AUTO_REFRESH_STATUS('my_azure_auto_refresh_table');

Auto-refresh for Google Cloud Storage

If you manage your Parquet source files in Google Cloud Storage (GCS), you can create an Iceberg table that uses Google Cloud Pub/Sub for automatic refresh. Snowflake subscribes to a Pub/Sub subscription you create, so that Snowflake is notified when new files are added to or removed from your GCS bucket.

This section provides instructions for creating an Iceberg table that automatically refreshes the Parquet source files.

Prerequisites

Before you proceed, ensure you meet the following prerequisites:

  • A role with the account-level CREATE EXTERNAL VOLUME privilege and the CREATE ICEBERG TABLE privilege on a schema.
  • A role with the global CREATE INTEGRATION privilege to create a notification integration.
  • A Google Cloud Storage bucket that holds your Parquet source files.
  • Administrative access to Google Cloud to create a Pub/Sub topic, subscription, and manage IAM permissions.

Step 1: Create a Pub/Sub topic and subscription

Set up Google Cloud Pub/Sub so your GCS bucket can send event notifications to Snowflake. For a full walkthrough of the same pattern for external tables, see Refresh external tables automatically for Google Cloud Storage.

Create the Pub/Sub topic

Create a Pub/Sub topic using Cloud Shell or the Cloud SDK. Enable the topic to listen for activity in your GCS bucket:

gsutil notification create -t <topic> -f json -e OBJECT_FINALIZE -e OBJECT_DELETE gs://<bucket-name>

Where:

  • <topic> is the name for the topic.
  • <bucket-name> is the name of your GCS bucket.

If the topic already exists, the command uses it; otherwise, a new topic is created.

For more information, see Using Pub/Sub notifications for Cloud Storage.

Create the Pub/Sub subscription

Create a subscription with pull delivery to the Pub/Sub topic using the Cloud Console, the gcloud command-line tool, or the Cloud Pub/Sub API. For instructions, see Managing topics and subscriptions.

Note

Only Pub/Sub subscriptions that use the default pull delivery are supported with Snowflake. Push delivery isn’t supported.

Retrieve the Pub/Sub subscription ID

Record the subscription ID. You use it when you create the notification integration and external volume in Snowflake.

  1. Log into the Google Cloud Platform Console as a project editor.
  2. From the home dashboard, choose Big Data » Pub/Sub » Subscriptions.
  3. Copy the ID in the Subscription ID column for the topic subscription. The full name has the following format:
projects/<project_id>/subscriptions/<subscription_name>

Step 2: Create a notification integration

Create a notification integration by using the CREATE NOTIFICATION INTEGRATION command.

Note

Only account administrators (users with the ACCOUNTADMIN role) or a role with the global CREATE INTEGRATION privilege can execute this SQL command.

CREATE NOTIFICATION INTEGRATION my_gcs_notification_int
  ENABLED = TRUE
  TYPE = QUEUE
  NOTIFICATION_PROVIDER = GCP_PUBSUB
  GCP_PUBSUB_SUBSCRIPTION_NAME = '<subscription_id>';

Where <subscription_id> is the subscription ID you recorded in Step 1 (for example, projects/project-1234/subscriptions/sub2).

Step 3: Grant Snowflake access to the Pub/Sub subscription

After you create the notification integration, you must grant the Snowflake service account permission to consume messages from your Pub/Sub subscription.

  1. Execute the DESCRIBE INTEGRATION command to retrieve the Snowflake service account:
DESC NOTIFICATION INTEGRATION my_gcs_notification_int;

Note the value in the GCP_PUBSUB_SERVICE_ACCOUNT column. This is the Snowflake service account email that you must grant access.

  1. In the Google Cloud Console, navigate to Pub/Sub » Subscriptions.
  2. Select the subscription you created and click Add Principal.
  3. Add the Snowflake service account email from the GCP_PUBSUB_SERVICE_ACCOUNT column.
  4. Grant the service account the Pub/Sub Subscriber role (roles/pubsub.subscriber).

Step 4: Create an external volume with your Pub/Sub subscription

To configure an external volume, complete the instructions for Configure an external volume for Google Cloud Storage.

In Step 1: Create an external volume in Snowflake, specify the following additional parameter on the storage location:

GCP_PUBSUB_SUBSCRIPTION_NAME = '<subscription_id>'

Specifies the Google Cloud Pub/Sub subscription that handles event messages for your external volume location.

For example:

CREATE OR REPLACE EXTERNAL VOLUME auto_refresh_gcs_exvol
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-gcs-us-central1'
      STORAGE_PROVIDER = 'GCS'
      STORAGE_BASE_URL = 'gcs://my_parquet_bucket/'
      GCP_PUBSUB_SUBSCRIPTION_NAME = 'projects/<project_id>/subscriptions/<subscription_name>'
    )
  );

Step 5: Create a catalog integration

Create a catalog integration by using the CREATE CATALOG INTEGRATION (Object storage) command. Set CATALOG_SOURCE = OBJECT_STORE and TABLE_FORMAT = NONE (or TABLE_FORMAT = HIVE for Hive-style partitioning).

CREATE OR REPLACE CATALOG INTEGRATION parquet_catalog_int
  CATALOG_SOURCE = OBJECT_STORE
  TABLE_FORMAT = NONE
  ENABLED = TRUE;

Step 6: Create an Iceberg table

Create an Iceberg table by using the CREATE ICEBERG TABLE command, setting the AUTO_REFRESH parameter equal to TRUE.

CREATE OR REPLACE ICEBERG TABLE my_gcs_auto_refresh_table (
    first_name STRING,
    last_name STRING,
    amount NUMBER,
    create_date DATE
  )
  CATALOG = parquet_catalog_int
  EXTERNAL_VOLUME = myGcsExternalVolume
  BASE_LOCATION = 'relative_path_from_external_volume'
  AUTO_REFRESH = TRUE;
Example: Create an Iceberg table with schema inference and auto refresh on GCS
CREATE OR REPLACE ICEBERG TABLE auto_schema_gcs_table
  EXTERNAL_VOLUME = 'gcs_exvol'
  CATALOG = 'parquet_catalog_int'
  BASE_LOCATION = 'auto_schema_table/'
  AUTO_REFRESH = TRUE;

Troubleshoot

To track the status of automatic refreshes for your Iceberg table, use the SYSTEM$PARQUET_DIRECT_AUTO_REFRESH_STATUS function.

For example:

SELECT SYSTEM$PARQUET_DIRECT_AUTO_REFRESH_STATUS('my_gcs_auto_refresh_table');

Data type mapping

When you define a column in a CREATE ICEBERG TABLE statement for Parquet source files, you must specify a Snowflake data type that maps to the Parquet data type used in your source files.

Note

In addition to data types that are compatible with Iceberg, the following non-Iceberg data types are also supported:

  • BYTE_ARRAY
  • INT96

For more information, see the data type mapping table below.

The following table shows how Parquet logical types map to physical types, and how the physical types map to Snowflake data types.

Parquet logical typeParquet physical typeSnowflake data type
NoneBOOLEANBOOLEAN

None

INT(bitWidth=8, isSigned=true)

INT(bitWidth=16, isSigned=true)

INT(bitWidth=32, isSigned=true)

INT32INT
None, INT(bitWidth=64, isSigned=true)INT64BIGINT
NoneFLOATFLOAT
NoneDOUBLEFLOAT
DECIMAL(P,S)

INT32

INT64

FIXED_LEN_BYTE_ARRAY(N)

DECIMAL(P,S)
DATEINT32DATE
TIME(isAdjustedToUTC=true, unit=MILLIS)INT32TIME(3)
TIME(isAdjustedToUTC=true, unit=MICROS)INT64TIME(6)
TIME(isAdjustedToUTC=true, unit=NANOS)INT64TIME(9)
NONEINT96TIMESTAMP_LTZ(9)
TIMESTAMP(isAdjustedToUTC=true, unit=MILLIS)INT64TIMESTAMP_NTZ(3)
TIMESTAMP(isAdjustedToUTC=true, unit=MICROS)INT64TIMESTAMP_NTZ(6)
TIMESTAMP(isAdjustedToUTC=true, unit=NANOS)INT64TIMESTAMP_NTZ(9)
STRINGBYTE_ARRAYVARCHAR
ENUMBYTE_ARRAYVARCHAR
JSONBYTE_ARRAYVARCHAR
UUIDFIXED_LEN_BYTE_ARRAY(16)BINARY(16)
NONEFIXED_LEN_BYTE_ARRAY(N)BINARY(L)

NONE


BSON

BYTE_ARRAYBINARY
INTERVALFIXED_LEN_BYTE_ARRAY(12)

BINARY(12)



Snowflake does not support a corresponding data type for the Parquet INTERVAL type, and reads the data from source files as binary data.

The following table shows how Parquet nested data types map to Snowflake data types.

Parquet logical nested typeSnowflake data type
NONEStructured OBJECT
LISTStructured ARRAY
MAPMAP

Considerations and limitations

  • By default, the maximum number of Parquet files that you can use to create an Iceberg table is ~2 million.

    To use more than this limit, contact Snowflake Support for assistance.

  • Parquet files that use any of the following features or data types are not supported:

    • Field IDs.
    • The DECIMAL data type with precision higher than 38.
    • LIST or MAP types with one-level or two-level representation.
    • Unsigned integer types (INT(signed=false)).
    • The FLOAT16 data type.
  • Snowflake does not support creating Iceberg tables from Parquet-based table definitions in the AWS Glue Data Catalog.

  • Generating Iceberg metadata using the SYSTEM$GET_ICEBERG_TABLE_INFORMATION function is not supported.

  • When auto refresh is enabled on a table, you can’t perform a manual refresh on the table. To perform a manual refresh on the table, auto refresh must be disabled.

  • Automatic refresh on Google Cloud Storage requires that your Snowflake account is hosted on Google Cloud Platform.

  • For Parquet Direct tables:

    • You can’t generate Iceberg metadata for these tables.
    • You can’t convert these tables to Snowflake-managed Iceberg tables.
    • You can’t perform DML operations on these tables; they are read only.
    • You can’t perform table maintenance on these tables.
  • The following Snowflake features aren’t supported for Parquet Direct tables:

    • Cloning
    • Replication
    • Append-only Streams and Standard Streams (Insert-only Streams are supported)
    • Dynamic tables
    • Search Optimization
    • Native apps
    • The VARIANT data type
    • The GEOGRAPHY data type
  • CTAS (CREATE TABLE AS SELECT): Creating a Parquet Direct table using CTAS is not supported. Using a Parquet Direct table as the source in a CTAS statement is supported.

  • CREATE TABLE LIKE: Creating a Parquet Direct table using CREATE TABLE LIKE is not supported. Using a Parquet Direct table as the source in CREATE TABLE LIKE is supported.

  • Data sharing is supported for Parquet Direct tables, including local sharing, cross-shard sharing, and Listing Auto Fulfillment (LAF) scenarios. However, Iceberg metadata generation is blocked at the destination.