- Categories:
INFER_SCHEMA¶
Automatically detects the file metadata schema in a set of staged data files that contain semi-structured data and retrieves the column definitions.
The GENERATE_COLUMN_DESCRIPTION function builds on the INFER_SCHEMA function output to simplify the creation of new tables, external tables, or views (using the appropriate CREATE <object> command) based on the column definitions of the staged files.
The CREATE TABLE or CREATE EXTERNAL TABLE command with the USING TEMPLATE clause can be executed to create a new table or external table with the column definitions derived from the INFER_SCHEMA function output.
Note
This function supports Apache Parquet, Apache Avro, ORC, JSON, and CSV files.
Syntax¶
INFER_SCHEMA(
LOCATION => '{ internalStage | externalStage }'
, FILE_FORMAT => '<file_format_name>'
, FILES => ( '<file_name>' [ , '<file_name>' ] [ , ... ] )
, IGNORE_CASE => TRUE | FALSE
, MAX_FILE_COUNT => <num>
, MAX_RECORDS_PER_FILE => <num>
)
Where:
internalStage ::= @[<namespace>.]<int_stage_name>[/<path>][/<filename>] | @~[/<path>][/<filename>]externalStage ::= @[<namespace>.]<ext_stage_name>[/<path>][/<filename>]
Arguments¶
LOCATION => '...'
Name of the internal or external stage where the files are stored. Optionally include a path to one or more files in the cloud storage location; otherwise, the INFER_SCHEMA function scans files in all subdirectories in the stage:
@[namespace.]int_stage_name[/path][/filename]
Files are in the specified named internal stage.
@[namespace.]ext_stage_name[/path][/filename]
Files are in the specified named external stage.
@~[/path][/filename]
Files are in the stage for the current user.
Note
This SQL function supports named stages (internal or external) and user stages only. It does not support table stages.
FILES => ( 'file_name' [ , 'file_name' ] [ , ... ] )
Specifies a list of one or more files (separated by commas) in a set of staged files that contain semi-structured data. The files must already have been staged in either the Snowflake internal location or external location specified in the command. If any of the specified files cannot be found, the query will be aborted.
The maximum number of files names that can be specified is 1000.
Note
For external stages only (Amazon S3, Google Cloud Storage, or Microsoft Azure), the file path is set by concatenating the URL in the stage definition and the list of resolved file names.
However, Snowflake doesn’t insert a separator implicitly between the path and file names. You must explicitly include a separator (
/
) either at the end of the URL in the stage definition or at the beginning of each file name specified in this parameter.FILE_FORMAT => 'file_format_name'
Name of the file format object that describes the data contained in the staged files. For more information, see CREATE FILE FORMAT.
IGNORE_CASE => TRUE | FALSE
Specifies whether column names detected from stage files are treated as case sensitive. By default, the value is FALSE, which means that Snowflake preserves the case of alphabetic characters when retrieving column names. If you specify the value as TRUE, column names are treated as case-insensitive and all column names are retrieved as uppercase letters.
MAX_FILE_COUNT => num
Specifies the maximum number of files scanned from stage. This option is recommended for large number of files that have identical schema across files. Note that this option cannot determine which files are scanned. If you want to scan specific files, use the
FILES
option instead.MAX_RECORDS_PER_FILE => num
Specifies the maximum number of records scanned per file. This option only applies to CSV and JSON files. It is recommended to use this option for large files. But note that this option may affect the accuracy of schema detection.
Output¶
The function returns the following columns:
Column Name |
Data Type |
Description |
---|---|---|
COLUMN_NAME |
TEXT |
Name of a column in the staged files. |
TYPE |
TEXT |
Data type of the column. |
NULLABLE |
BOOLEAN |
Specifies whether rows in the column can store NULL instead of a value. Currently, the inferred nullability of a column can apply to one data file but not others in the scanned set. |
EXPRESSION |
TEXT |
Expression of the column in the format |
FILENAMES |
TEXT |
Names of the files that contain the column. |
ORDER_ID |
NUMBER |
Column order in the staged files. |
Usage notes¶
For CSV files, you can define column names by using the file format option
PARSE_HEADER = [ TRUE | FALSE ]
.If the option is set to TRUE, the first row headers will be used to determine column names.
The default value FALSE will return column names as c*, where * is the position of the column. Note that the SKIP_HEADER option is not supported with PARSE_HEADER = TRUE.
The PARSE_HEADER option isn’t supported for external tables.
For both CSV and JSON files, the following file format options are currently not supported: DATE_FORMAT, TIME_FORMAT, and TIMESTAMP_FORMAT.
The JSON TRIM_SPACE file format option is not supported.
The scientific annotations (e.g. 1E2) in JSON files are retrieved as REAL data type.
All the variations of timestamp data types are retrieved as TIMESTAMP_NTZ without any time zone information.
For both CSV and JSON files, all columns are identified as NULLABLE.
Creating an Iceberg table with the USING TEMPLATE clause (and column definitions derived from INFER_SCHEMA output) isn’t supported.
Examples¶
Retrieve the column definitions for Parquet files in the mystage
stage:
-- Create a file format that sets the file type as Parquet.
CREATE FILE FORMAT my_parquet_format
TYPE = parquet;
-- Query the INFER_SCHEMA function.
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@mystage'
, FILE_FORMAT=>'my_parquet_format'
)
);
+-------------+---------+----------+---------------------+--------------------------+----------+
| COLUMN_NAME | TYPE | NULLABLE | EXPRESSION | FILENAMES | ORDER_ID |
|-------------+---------+----------+---------------------+--------------------------|----------+
| continent | TEXT | True | $1:continent::TEXT | geography/cities.parquet | 0 |
| country | VARIANT | True | $1:country::VARIANT | geography/cities.parquet | 1 |
| COUNTRY | VARIANT | True | $1:COUNTRY::VARIANT | geography/cities.parquet | 2 |
+-------------+---------+----------+---------------------+--------------------------+----------+
Similar to the previous example, but specify a single Parquet file in the mystage
stage:
-- Query the INFER_SCHEMA function.
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@mystage/geography/cities.parquet'
, FILE_FORMAT=>'my_parquet_format'
)
);
+-------------+---------+----------+---------------------+--------------------------+----------+
| COLUMN_NAME | TYPE | NULLABLE | EXPRESSION | FILENAMES | ORDER_ID |
|-------------+---------+----------+---------------------+--------------------------|----------+
| continent | TEXT | True | $1:continent::TEXT | geography/cities.parquet | 0 |
| country | VARIANT | True | $1:country::VARIANT | geography/cities.parquet | 1 |
| COUNTRY | VARIANT | True | $1:COUNTRY::VARIANT | geography/cities.parquet | 2 |
+-------------+---------+----------+---------------------+--------------------------+----------+
Retrieve the column definitions for Parquet files in the mystage
stage with IGNORE_CASE specified as TRUE. In the returned output, all column names are retrieved as uppercase letters.
-- Query the INFER_SCHEMA function.
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@mystage'
, FILE_FORMAT=>'my_parquet_format'
, IGNORE_CASE=>TRUE
)
);
+-------------+---------+----------+----------------------------------------+--------------------------+----------+
| COLUMN_NAME | TYPE | NULLABLE | EXPRESSION | FILENAMES | ORDER_ID |
|-------------+---------+----------+---------------------+---------------------------------------------|----------+
| CONTINENT | TEXT | True | GET_IGNORE_CASE ($1, CONTINENT)::TEXT | geography/cities.parquet | 0 |
| COUNTRY | VARIANT | True | GET_IGNORE_CASE ($1, COUNTRY)::VARIANT | geography/cities.parquet | 1 |
+-------------+---------+----------+---------------------+---------------------------------------------+----------+
Retrieve the column definitions for JSON files in the mystage
stage:
-- Create a file format that sets the file type as JSON.
CREATE FILE FORMAT my_json_format
TYPE = json;
-- Query the INFER_SCHEMA function.
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@mystage/json/'
, FILE_FORMAT=>'my_json_format'
)
);
+-------------+---------------+----------+---------------------------+--------------------------+----------+
| COLUMN_NAME | TYPE | NULLABLE | EXPRESSION | FILENAMES | ORDER_ID |
|-------------+---------------+----------+---------------------------+--------------------------|----------+
| col_bool | BOOLEAN | True | $1:col_bool::BOOLEAN | json/schema_A_1.json | 0 |
| col_date | DATE | True | $1:col_date::DATE | json/schema_A_1.json | 1 |
| col_ts | TIMESTAMP_NTZ | True | $1:col_ts::TIMESTAMP_NTZ | json/schema_A_1.json | 2 |
+-------------+---------------+----------+---------------------------+--------------------------+----------+
Creates a table using the detected schema from staged JSON files.
CREATE TABLE mytable USING TEMPLATE ( SELECT ARRAY_AGG(OBJECT_CONSTRUCT(*)) FROM TABLE( INFER_SCHEMA( LOCATION=>'@mystage/json/', FILE_FORMAT=>'my_json_format' ) ));
Note
Using *
for ARRAY_AGG(OBJECT_CONSTRUCT())
may result in an error if the returned result is larger than 16MB. It is recommended to avoid using *
for larger result sets, and only use the required columns, COLUMN NAME
, TYPE
, and NULLABLE
, for the query. Optional column ORDER_ID
can be included when using WITHIN GROUP (ORDER BY order_id)
.
Retrieve the column definitions for CSV files in the mystage
stage and load the CSV files using MATCH_BY_COLUMN_NAME:
-- Create a file format that sets the file type as CSV.
CREATE FILE FORMAT my_csv_format
TYPE = csv
PARSE_HEADER = true;
-- Query the INFER_SCHEMA function.
SELECT *
FROM TABLE(
INFER_SCHEMA(
LOCATION=>'@mystage/csv/'
, FILE_FORMAT=>'my_csv_format'
)
);
+-------------+---------------+----------+---------------------------+--------------------------+----------+
| COLUMN_NAME | TYPE | NULLABLE | EXPRESSION | FILENAMES | ORDER_ID |
|-------------+---------------+----------+---------------------------+--------------------------|----------+
| col_bool | BOOLEAN | True | $1:col_bool::BOOLEAN | json/schema_A_1.csv | 0 |
| col_date | DATE | True | $1:col_date::DATE | json/schema_A_1.csv | 1 |
| col_ts | TIMESTAMP_NTZ | True | $1:col_ts::TIMESTAMP_NTZ | json/schema_A_1.csv | 2 |
+-------------+---------------+----------+---------------------------+--------------------------+----------+
-- Load the CSV file using MATCH_BY_COLUMN_NAME.
COPY INTO mytable FROM @mystage/csv/
FILE_FORMAT = (
FORMAT_NAME= 'my_csv_format'
)
MATCH_BY_COLUMN_NAME=CASE_INSENSITIVE;