Staging data¶
This topic provides best practices, general guidelines, and important considerations for preparing your data files for loading.
Organizing data by path¶
Both internal (i.e. Snowflake) and external (Amazon S3, Google Cloud Storage, or Microsoft Azure) stage references can include a path (or prefix in AWS terminology). When staging regular data sets, we recommend partitioning the data into logical paths that include identifying details such as geographical location or other source identifiers, along with the date when the data was written.
Organizing your data files by path lets you copy any fraction of the partitioned data into Snowflake with a single command. This allows you to execute concurrent COPY statements that match a subset of files, taking advantage of parallel operations.
For example, if you were storing data for a North American company by geographical location, you might include identifiers such as continent, country, and city in paths along with data write dates:
Canada/Ontario/Toronto/2016/07/10/05/
United_States/California/Los_Angeles/2016/06/01/11/
United_States/New York/New_York/2016/12/21/03/
United_States/California/San_Francisco/2016/08/03/17/
When you create a named stage, you can specify any part of a path. For example, create an external stage using one of the above example paths:
CREATE STAGE my_stage URL='s3://mybucket/United_States/California/Los_Angeles/' CREDENTIALS=(AWS_KEY_ID='1a2b3c' AWS_SECRET_KEY='4x5y6z');
You can also add a path when you stage files in an internal user or table stage. For example, stage mydata.csv
in a
specific path in the t1
table stage:
PUT file:///data/mydata.csv @%t1/United_States/California/Los_Angeles/2016/06/01/11/
When loading your staged data, narrow the path to the most granular level that includes your data for improved data load performance.
Use any of the following options to further confine the list of files to load:
If the file names match except for a suffix or extension, include the matching part of the file names in the path, e.g.:
COPY INTO t1 from @%t1/United_States/California/Los_Angeles/2016/06/01/11/mydata;Add the FILES or PATTERN options (see Options for selecting staged data files), e.g.:
COPY INTO t1 from @%t1/United_States/California/Los_Angeles/2016/06/01/11/ FILES=('mydata1.csv', 'mydata1.csv'); COPY INTO t1 from @%t1/United_States/California/Los_Angeles/2016/06/01/11/ PATTERN='.*mydata[^[0-9]{1,3}$$].csv';