- Categories:
INFER_SCHEMA¶
Automatically detects the file metadata schema in a set of staged data files that contain semi-structured data and retrieves the column definitions.
The GENERATE_COLUMN_DESCRIPTION function builds on the INFER_SCHEMA function output to simplify the creation of new tables, external tables, or views (using the appropriate CREATE <object> command) based on the column definitions of the staged files.
The CREATE TABLE or CREATE EXTERNAL TABLE command with the USING TEMPLATE clause can be executed to create a new table or external table with the column definitions derived from the INFER_SCHEMA function output.
Note
This feature is currently limited to Apache Parquet, Apache Avro, and ORC files.
Syntax¶
INFER_SCHEMA(
LOCATION => '{ internalStage | externalStage }'
, FILE_FORMAT => '<file_format_name>'
, FILES => '<file_name>' [ , '<file_name>' ] [ , ... ]
, IGNORE_CASE => TRUE | FALSE
)
Where:
internalStage ::= @[<namespace>.]<int_stage_name>[/<path>][/<filename>] | @~[/<path>][/<filename>]externalStage ::= @[<namespace>.]<ext_stage_name>[/<path>][/<filename>]
Arguments¶
LOCATION => '...'
Name of the internal or external stage where the files are stored. Optionally include a path to one or more files in the cloud storage location; otherwise, the INFER_SCHEMA function scans files in all subdirectories in the stage:
@[namespace.]int_stage_name[/path][/filename]
Files are in the specified named internal stage.
@[namespace.]ext_stage_name[/path][/filename]
Files are in the specified named external stage.
@~[/path][/filename]
Files are in the stage for the current user.
Note
This SQL function supports named stages (internal or external) and user stages only. It does not support table stages.
FILES => '<file_name>' [ , '<file_name>' ] [ , ... ]
Specifies a list of one or more files (separated by commas) in a set of staged files that contain semi-structured data. The files must already have been staged in either the Snowflake internal location or external location specified in the command. If any of the specified files cannot be found, the query will be aborted.
The maximum number of files names that can be specified is 1000.
Note
For external stages only (Amazon S3, Google Cloud Storage, or Microsoft Azure), the file path is set by concatenating the URL in the stage definition and the list of resolved file names.
However, Snowflake doesn’t insert a separator implicitly between the path and file names. You must explicitly include a separator (
/
) either at the end of the URL in the stage definition or at the beginning of each file name specified in this parameter.FILE_FORMAT => '<file_format_name>'
Name of the file format object that describes the data contained in the staged files. For more information, see CREATE FILE FORMAT.
IGNORE_CASE => TRUE | FALSE
Specifies whether column names detected from stage files are treated as case sensitive. By default, the value is FALSE, which means that Snowflake preserves the case of alphabetic characters when retrieving column names. If you specify the value as TRUE, column names are treated as case-insensitive and all column names are retrieved as uppercase letters.
Output¶
The function returns the following columns:
Column Name |
Data Type |
Description |
---|---|---|
COLUMN_NAME |
TEXT |
Name of a column in the staged files. |
TYPE |
TEXT |
Data type of the column. |
NULLABLE |
BOOLEAN |
Specifies whether rows in the column can store NULL instead of a value. Currently, the inferred nullability of a column can apply to one data file but not others in the scanned set. |
EXPRESSION |
TEXT |
Expression of the column in the format |
FILENAMES |
TEXT |
Names of the files that contain the column. |
ORDER_ID |
NUMBER |
Column order in the staged files. |
Examples¶
Retrieve the column definitions for Parquet files in the mystage
stage:
-- Create a file format that sets the file type as Parquet.
CREATE FILE FORMAT my_parquet_format
TYPE = parquet;
-- Query the INFER_SCHEMA function.
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@mystage'
, FILE_FORMAT=>'my_parquet_format'
)
);
+-------------+---------+----------+---------------------+--------------------------+----------+
| COLUMN_NAME | TYPE | NULLABLE | EXPRESSION | FILENAMES | ORDER_ID |
|-------------+---------+----------+---------------------+--------------------------|----------+
| continent | TEXT | True | $1:continent::TEXT | geography/cities.parquet | 0 |
| country | VARIANT | True | $1:country::VARIANT | geography/cities.parquet | 1 |
| COUNTRY | VARIANT | True | $1:COUNTRY::VARIANT | geography/cities.parquet | 2 |
+-------------+---------+----------+---------------------+--------------------------+----------+
Similar to the previous example, but specify a single Parquet file in the mystage
stage:
-- Query the INFER_SCHEMA function.
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@mystage/geography/cities.parquet'
, FILE_FORMAT=>'my_parquet_format'
)
);
+-------------+---------+----------+---------------------+--------------------------+----------+
| COLUMN_NAME | TYPE | NULLABLE | EXPRESSION | FILENAMES | ORDER_ID |
|-------------+---------+----------+---------------------+--------------------------|----------+
| continent | TEXT | True | $1:continent::TEXT | geography/cities.parquet | 0 |
| country | VARIANT | True | $1:country::VARIANT | geography/cities.parquet | 1 |
| COUNTRY | VARIANT | True | $1:COUNTRY::VARIANT | geography/cities.parquet | 2 |
+-------------+---------+----------+---------------------+--------------------------+----------+
Retrieve the column definitions for Parquet files in the mystage
stage with IGNORE_CASE specified as TRUE. In the returned output, all column names are retrieved as uppercase letters.
-- Query the INFER_SCHEMA function.
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@mystage'
, FILE_FORMAT=>'my_parquet_format'
, IGNORE_CASE=>TRUE
)
);
+-------------+---------+----------+----------------------------------------+--------------------------+----------+
| COLUMN_NAME | TYPE | NULLABLE | EXPRESSION | FILENAMES | ORDER_ID |
|-------------+---------+----------+---------------------+---------------------------------------------|----------+
| CONTINENT | TEXT | True | GET_IGNORE_CASE ($1, CONTINENT)::TEXT | geography/cities.parquet | 0 |
| COUNTRY | VARIANT | True | GET_IGNORE_CASE ($1, COUNTRY)::VARIANT | geography/cities.parquet | 1 |
+-------------+---------+----------+---------------------+---------------------------------------------+----------+