Access cloud service file data with Snowpark Connect for Spark

With Snowpark Connect for Spark, you can interact directly with external cloud storage systems such as Amazon S3, Google Cloud Storage, and Azure Blob. You can read data from cloud storage into Snowflake, process it, and then write it back.

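For example, you can use Snowpark Connect for Spark to perform the following tasks:

Ingest raw data.
Land files (such as CSV, JSON, and Parquet) in S3, Google Cloud, or Azure before moving them into Snowflake.

Export data for downstream use.
Write processed Snowpark DataFrames to cloud storage for ML training, for sharing with external partners, or for further Spark-based analytics.

Create hybrid pipelines.
Keep part of a pipeline in Snowflake while maintaining compatibility with existing data lakes.

Comply with regulations or reduce costs.
Store certain datasets externally because of regulatory, governance, or budget constraints.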

Use the steps in this topic to read from and write to files stored on these cloud service providers. You can access the files either through Snowflake external stages or through direct access.

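Considerations

When you use Snowpark Connect for Spark to work with cloud services, keep the following considerations in mind:

Authentication: Snowpark Connect for Spark doesn't manage cloud credentials automatically. You must configure access keys (AWS) or storage account keys or SAS tokens (Azure), or maintain an external stage yourself. Reads and writes fail when credentials are expired or missing.
Performance: Cloud I/O depends on network bandwidth and object store latency. Reading many small files can significantly degrade performance.
Format support: Verify that the file formats you read and write are supported. Currently, Snowpark Connect for Spark has parity for common formats, including TEXT, CSV, JSON, and Parquet. However, advanced features (such as Parquet partition discovery and JSON schema evolution) might differ from Spark.
Permissions and policies: Writing to cloud buckets requires appropriate IAM and ACL policies. Mismatched policies between Snowflake roles and cloud credentials can cause AccessDenied errors.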
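The format consideration can be enforced with a small guard that runs before any read. The following is a minimal sketch, assuming the format can be inferred from the file extension. The names `SUPPORTED_FORMATS`, `infer_format`, and `read_file` are hypothetical helpers, not part of Snowpark Connect for Spark, and the extension mapping is an assumption.

```python
import os

# Formats listed in this topic as supported. The extension-to-format mapping
# is an assumption made for this sketch.
SUPPORTED_FORMATS = {
    ".txt": "text",
    ".csv": "csv",
    ".json": "json",
    ".parquet": "parquet",
}


def infer_format(path: str) -> str:
    """Return the reader format for a path, or raise if it is not supported."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported file format for {path!r}: {ext or '(none)'}")
    return SUPPORTED_FORMATS[ext]


def read_file(spark, path: str):
    """Read a single file with the reader that matches its extension.

    `spark` is an existing session; `format(...).load(...)` is the standard
    PySpark reader API and is assumed to be available here.
    """
    return spark.read.format(infer_format(path)).load(path)
```

Failing early with a clear message is usually easier to diagnose than a reader error raised later in the pipeline.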

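Best practices

For the most reliable integration with good performance, follow these best practices:

Use secure, temporary credentials, and rotate them frequently.

Partition and bucket data.
When you write Parquet, partition on frequently filtered columns to reduce scan costs. Use fewer, larger files (for example, 100 MB to 500 MB each) instead of many small files.

Validate schemas on write.
Always define schemas explicitly, especially for semi-structured formats such as JSON and CSV. Doing so prevents drift between Snowflake and external data.

Monitor costs.
To reduce costs, consider consolidating files and filtering data before you write. Cloud provider charges accrue per request and per byte scanned.

Standardize API calls.
Follow the documented guidance exactly when you use functions and parameters, and avoid ad hoc variations. Doing so maintains compatibility, prevents regressions, and ensures expected behavior across cloud providers.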
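Several of these practices can be sketched together: an explicit schema on read, a partitioned Parquet write, and an output file count chosen for a target file size. This is a sketch under stated assumptions, not a definitive implementation. The schema columns, the 256 MB target, and every function name are hypothetical. It also assumes that `schema()` with a DDL string, `coalesce()`, and `partitionBy()` behave as they do in PySpark, which might not hold exactly in Snowpark Connect for Spark.

```python
import math

MB = 1024 * 1024

# Hypothetical schema written as a DDL string; replace with your own columns.
EVENTS_SCHEMA = "event_id STRING, event_date DATE, country STRING, amount DOUBLE"


def target_file_count(total_bytes: int, target_file_bytes: int = 256 * MB) -> int:
    """Return how many files are needed so each is roughly target_file_bytes."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_file_bytes))


def read_csv_with_schema(spark, path: str):
    """Read CSV with an explicit schema instead of relying on inference."""
    return spark.read.schema(EVENTS_SCHEMA).option("header", True).csv(path)


def write_compacted_parquet(df, path: str, total_bytes: int) -> None:
    """Write an unpartitioned dataset with a capped number of output partitions."""
    df.coalesce(target_file_count(total_bytes)).write.mode("overwrite").parquet(path)


def write_partitioned_parquet(df, path: str, partition_col: str) -> None:
    """Write one directory per value of a frequently filtered column."""
    df.write.mode("overwrite").partitionBy(partition_col).parquet(path)
```

For example, an estimated 1 GB of output with the default target gives `target_file_count(1024 * MB)`, which is 4 files of about 256 MB each.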

Access using Snowflake external stages

Create an external stage that points to your S3 location by :doc:`configuring secure access to Amazon S3 </user-guide/data-load-s3-config>`.

Read from or write to the external stage.

# Read CSV
spark.read.csv('@<your external stage name>/<file path>')
spark.read.option("header", True).csv('@<your external stage name>/<file path>') # read with header in file

# Write to CSV
df.write.csv('@<your external stage name>/<file path>')
df.write.option("header", True).csv('@<your external stage name>/<file path>') # write with header in file

# Read Text
spark.read.text('@<your external stage name>/<file path>')

# Write to Text
df.write.text('@<your external stage name>/<file path>')
df.write.format("text").mode("overwrite").save('@<your external stage name>/<file path>')

# Read Parquet
spark.read.parquet('@<your external stage name>/<file path>')

# Write to Parquet
df.write.parquet('@<your external stage name>/<file path>')

# Read JSON
spark.read.json('@<your external stage name>/<file path>')

# Write to JSON
df.write.json('@<your external stage name>/<file path>')


Create an external stage that points to your Azure container by :doc:`configuring secure access to Azure </user-guide/data-load-azure-create-stage>`.

Read from or write to the external stage.

# Read CSV
spark.read.csv('@<your external stage name>/<file path>')
spark.read.option("header", True).csv('@<your external stage name>/<file path>') # read with header in file

# Write to CSV
df.write.csv('@<your external stage name>/<file path>')
df.write.option("header", True).csv('@<your external stage name>/<file path>') # write with header in file

# Read Text
spark.read.text('@<your external stage name>/<file path>')

# Write to Text
df.write.text('@<your external stage name>/<file path>')
df.write.format("text").mode("overwrite").save('@<your external stage name>/<file path>')

# Read Parquet
spark.read.parquet('@<your external stage name>/<file path>')

# Write to Parquet
df.write.parquet('@<your external stage name>/<file path>')

# Read JSON
spark.read.json('@<your external stage name>/<file path>')

# Write to JSON
df.write.json('@<your external stage name>/<file path>')


Create an external stage that points to your Google Cloud Storage bucket by :doc:`configuring secure access to Google Cloud </user-guide/data-load-gcs-config>`.

Read from or write to the external stage.

# Read CSV
spark.read.csv('@<your external stage name>/<file path>')
spark.read.option("header", True).csv('@<your external stage name>/<file path>') # read with header in file

# Write to CSV
df.write.csv('@<your external stage name>/<file path>')
df.write.option("header", True).csv('@<your external stage name>/<file path>') # write with header in file

# Read Text
spark.read.text('@<your external stage name>/<file path>')

# Write to Text
df.write.text('@<your external stage name>/<file path>')
df.write.format("text").mode("overwrite").save('@<your external stage name>/<file path>')

# Read Parquet
spark.read.parquet('@<your external stage name>/<file path>')

# Write to Parquet
df.write.parquet('@<your external stage name>/<file path>')

# Read JSON
spark.read.json('@<your external stage name>/<file path>')

# Write to JSON
df.write.json('@<your external stage name>/<file path>')

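Because the read and write calls are the same on every external stage, whichever cloud provider backs it, they can be combined into one reusable step. The following is a minimal sketch. `stage_path` and `csv_to_parquet` are hypothetical helper names, and the stage name and file paths in the usage note are placeholders.

```python
def stage_path(stage_name: str, file_path: str) -> str:
    """Build an '@<stage>/<path>' location string for reader and writer calls."""
    return f"@{stage_name.lstrip('@')}/{file_path.lstrip('/')}"


def csv_to_parquet(spark, stage_name: str, src: str, dst: str):
    """Read a CSV file with a header row from the stage and write it back as Parquet."""
    df = spark.read.option("header", True).csv(stage_path(stage_name, src))
    df.write.mode("overwrite").parquet(stage_path(stage_name, dst))
    return df
```

With an existing session, a call might look like `csv_to_parquet(spark, "my_ext_stage", "raw/orders.csv", "curated/orders")`. Building the location in one place keeps the `@` prefix and the slashes consistent across calls.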

Access using direct access

You can access files on cloud service providers directly by using the steps and code described here.

Set the Spark configuration with your AWS credentials.

# For S3 access to public or private buckets, add these configuration settings
spark.conf.set("spark.hadoop.fs.s3a.connection.ssl.enabled","false")
spark.conf.set("spark.hadoop.fs.s3a.impl","org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.conf.set("spark.jars.packages","org.apache.hadoop:hadoop-aws:3.3.2")

# For private S3 access, please also provide credentials
spark.conf.set("spark.hadoop.fs.s3a.access.key","<AWS_ACCESS_KEY_ID>")
spark.conf.set("spark.hadoop.fs.s3a.secret.key","<AWS_SECRET_ACCESS_KEY>")
spark.conf.set("spark.hadoop.fs.s3a.session.token","<AWS_SESSION_TOKEN>")

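To keep keys out of source code, the credential settings can be filled in from environment variables. The following is a minimal sketch. It assumes the conventional AWS environment variable names; `s3a_credentials_conf` and `apply_conf` are hypothetical helpers, and the configuration keys are the ones used in the settings above.

```python
import os

# Environment variable names follow the usual AWS convention (an assumption);
# the configuration keys are the ones used in this topic.
_ENV_TO_CONF = {
    "AWS_ACCESS_KEY_ID": "spark.hadoop.fs.s3a.access.key",
    "AWS_SECRET_ACCESS_KEY": "spark.hadoop.fs.s3a.secret.key",
    "AWS_SESSION_TOKEN": "spark.hadoop.fs.s3a.session.token",
}


def s3a_credentials_conf(env=None) -> dict:
    """Map AWS credential environment variables to s3a configuration keys.

    Raises KeyError if the access key or secret key is missing. The session
    token is included only when it is set.
    """
    env = os.environ if env is None else env
    conf = {
        _ENV_TO_CONF["AWS_ACCESS_KEY_ID"]: env["AWS_ACCESS_KEY_ID"],
        _ENV_TO_CONF["AWS_SECRET_ACCESS_KEY"]: env["AWS_SECRET_ACCESS_KEY"],
    }
    token = env.get("AWS_SESSION_TOKEN")
    if token:
        conf[_ENV_TO_CONF["AWS_SESSION_TOKEN"]] = token
    return conf


def apply_conf(spark, conf: dict) -> None:
    """Apply each key/value pair to the session configuration."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

With an existing session, `apply_conf(spark, s3a_credentials_conf())` sets the same keys as the hardcoded version, and a missing variable fails immediately instead of at the first read.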

Read from and write to S3 directly.

# Read CSV
spark.read.csv('s3a://<bucket name>/<file path>')
spark.read.option("header", True).csv('s3a://<bucket name>/<file path>') # read with header in file

# Write to CSV
df.write.csv('s3a://<bucket name>/<file path>')
df.write.option("header", True).csv('s3a://<bucket name>/<file path>') # write with header in file

# Read Text
spark.read.text('s3a://<bucket name>/<file path>')

# Write to Text
df.write.text('s3a://<bucket name>/<file path>')
df.write.format("text").mode("overwrite").save('s3a://<bucket name>/<file path>')

# Read Parquet
spark.read.parquet('s3a://<bucket name>/<file path>')

# Write to Parquet
df.write.parquet('s3a://<bucket name>/<file path>')

# Read JSON
spark.read.json('s3a://<bucket name>/<file path>')

# Write to JSON
df.write.json('s3a://<bucket name>/<file path>')


Set the Spark configuration with your Azure credentials.

# For private Azure access, please also provide blob SAS token
#   * Make sure all required permissions are in place before proceeding
spark.conf.set("fs.azure.sas.fixed.token.<storage-account>.dfs.core.windows.net","<Shared Access Token>")

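The SAS token key embeds the storage account name, so each account needs its own entry. The following is a minimal sketch that builds those entries for several accounts at once. `azure_sas_conf` and `apply_conf` are hypothetical helpers, and the key pattern is copied from the setting above.

```python
def azure_sas_conf(tokens_by_account: dict) -> dict:
    """Map storage account names to per-account SAS token configuration entries."""
    conf = {}
    for account, token in tokens_by_account.items():
        if not account or not token:
            raise ValueError("storage account name and SAS token must both be non-empty")
        conf[f"fs.azure.sas.fixed.token.{account}.dfs.core.windows.net"] = token
    return conf


def apply_conf(spark, conf: dict) -> None:
    """Apply each key/value pair to the session configuration."""
    for key, value in conf.items():
        spark.conf.set(key, value)
```

With an existing session, `apply_conf(spark, azure_sas_conf({"<storage-account>": "<Shared Access Token>"}))` is equivalent to the single `spark.conf.set` call above.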

Read from and write to Azure directly.

# Read CSV
spark.read.csv('wasbs://<container name>@<storage account name>.blob.core.windows.net/<bucket name>/<file path>')
spark.read.option("header", True).csv('wasbs://<container name>@<storage account name>.blob.core.windows.net/<bucket name>/<file path>') # read with header in file

# Write to CSV
df.write.csv('wasbs://<container name>@<storage account name>.blob.core.windows.net/<bucket name>/<file path>')
df.write.option("header", True).csv('wasbs://<container name>@<storage account name>.blob.core.windows.net/<bucket name>/<file path>') # write with header in file

# Read Text
spark.read.text('wasbs://<container name>@<storage account name>.blob.core.windows.net/<bucket name>/<file path>')

# Write to Text
df.write.text('wasbs://<container name>@<storage account name>.blob.core.windows.net/<bucket name>/<file path>')
df.write.format("text").mode("overwrite").save('wasbs://<container name>@<storage account name>.blob.core.windows.net/<bucket name>/<file path>')

# Read Parquet
spark.read.parquet('wasbs://<container name>@<storage account name>.blob.core.windows.net/<bucket name>/<file path>')

# Write to Parquet
df.write.parquet('wasbs://<container name>@<storage account name>.blob.core.windows.net/<bucket name>/<file path>')

# Read JSON
spark.read.json('wasbs://<container name>@<storage account name>.blob.core.windows.net/<bucket name>/<file path>')

# Write to JSON
df.write.json('wasbs://<container name>@<storage account name>.blob.core.windows.net/<bucket name>/<file path>')

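Because the `wasbs://` locations are long and repeated in every call, it can help to build them in one place and to confirm that a write can be read back. The following is a minimal sketch. `wasbs_uri` and `parquet_round_trip` are hypothetical helpers, and the row-count comparison is only a coarse check that assumes `count()` behaves as in PySpark.

```python
def wasbs_uri(container: str, storage_account: str, file_path: str) -> str:
    """Build a wasbs:// location for a container, storage account, and path."""
    return (
        f"wasbs://{container}@{storage_account}.blob.core.windows.net/"
        f"{file_path.lstrip('/')}"
    )


def parquet_round_trip(spark, df, uri: str) -> bool:
    """Write df as Parquet, read it back, and compare row counts."""
    df.write.mode("overwrite").parquet(uri)
    return spark.read.parquet(uri).count() == df.count()
```

A mismatch or an exception in the round trip usually points to a credential or permission problem rather than a data problem.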