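Working with DataFrames in Snowpark Python

In Snowpark, the main way to query and process data is through a DataFrame. This topic explains how to work with DataFrames.

To retrieve and manipulate data, you use the DataFrame class. A DataFrame represents a relational dataset that is evaluated lazily: it executes only when a specific action is triggered. In a sense, a DataFrame is like a query that must be evaluated in order to retrieve data.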

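To retrieve data into a DataFrame:

Construct a DataFrame, specifying the source of the data for the dataset.

For example, you can create a DataFrame to hold data from a table, an external CSV file, local data, or the execution of a SQL statement.

Specify how the dataset in the DataFrame should be transformed.

For example, you can specify which columns should be selected, how the rows should be filtered, and how the results should be sorted and grouped.

Execute the statement to retrieve the data into the DataFrame.

To retrieve the data into the DataFrame, you must invoke a method that performs an action (for example, the collect() method).

The next sections explain these steps in more detail.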

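Setting up the examples for this section

Some of the examples in this section use a DataFrame to query a table named sample_product_data. To run these examples, execute the following SQL statements to create the table and fill it with some data.

You can run the SQL statements using Snowpark Python: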

session.sql('CREATE OR REPLACE TABLE sample_product_data (id INT, parent_id INT, category_id INT, name VARCHAR, serial_number VARCHAR, key INT, "3rd" INT)').collect()

[Row(status='Table SAMPLE_PRODUCT_DATA successfully created.')]

session.sql("""
INSERT INTO sample_product_data VALUES
(1, 0, 5, 'Product 1', 'prod-1', 1, 10),
(2, 1, 5, 'Product 1A', 'prod-1-A', 1, 20),
(3, 1, 5, 'Product 1B', 'prod-1-B', 1, 30),
(4, 0, 10, 'Product 2', 'prod-2', 2, 40),
(5, 4, 10, 'Product 2A', 'prod-2-A', 2, 50),
(6, 4, 10, 'Product 2B', 'prod-2-B', 2, 60),
(7, 0, 20, 'Product 3', 'prod-3', 3, 70),
(8, 7, 20, 'Product 3A', 'prod-3-A', 3, 80),
(9, 7, 20, 'Product 3B', 'prod-3-B', 3, 90),
(10, 0, 50, 'Product 4', 'prod-4', 4, 100),
(11, 10, 50, 'Product 4A', 'prod-4-A', 4, 100),
(12, 10, 50, 'Product 4B', 'prod-4-B', 4, 100)
""").collect()

[Row(number of rows inserted=12)]

To verify that the table was created, run:

session.sql("SELECT count(*) FROM sample_product_data").collect()

[Row(COUNT(*)=12)]

Setting up the examples in a Python worksheet

To set up and run these examples in a Python worksheet, create the sample table and set up your Python worksheet.

Create a SQL worksheet and run the following:

CREATE OR REPLACE TABLE sample_product_data
  (id INT, parent_id INT, category_id INT, name VARCHAR, serial_number VARCHAR, key INT, "3rd" INT);

INSERT INTO sample_product_data VALUES
  (1, 0, 5, 'Product 1', 'prod-1', 1, 10),
  (2, 1, 5, 'Product 1A', 'prod-1-A', 1, 20),
  (3, 1, 5, 'Product 1B', 'prod-1-B', 1, 30),
  (4, 0, 10, 'Product 2', 'prod-2', 2, 40),
  (5, 4, 10, 'Product 2A', 'prod-2-A', 2, 50),
  (6, 4, 10, 'Product 2B', 'prod-2-B', 2, 60),
  (7, 0, 20, 'Product 3', 'prod-3', 3, 70),
  (8, 7, 20, 'Product 3A', 'prod-3-A', 3, 80),
  (9, 7, 20, 'Product 3B', 'prod-3-B', 3, 90),
  (10, 0, 50, 'Product 4', 'prod-4', 4, 100),
  (11, 10, 50, 'Product 4A', 'prod-4-A', 4, 100),
  (12, 10, 50, 'Product 4B', 'prod-4-B', 4, 100);

SELECT count(*) FROM sample_product_data;

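Create a Python worksheet, setting the same database and schema context as the SQL worksheet that you used to create the sample_product_data table.

To use the examples in this topic in a Python worksheet, use them within the handler function (for example, main), and use the Session object that is passed into the function to create DataFrames.

For example, call the table method of the session object to create a DataFrame for a table: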

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col

def main(session: snowpark.Session):
  df_table = session.table("sample_product_data")

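To review the output produced by the function, such as by calling the show method of the DataFrame object, use the Output tab.

To examine the value returned by the function, choose the data type of the return value from Settings » Return type, and use the Results tab:

If your function returns a DataFrame, use the default return type of Table.
If your function returns the list of Row objects from the collect method of a DataFrame object, use Variant as the return type.
If your function returns any other value that can be cast to a string, or if your function does not return a value, use String as the return type.

For more information, see Running Python worksheets.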

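Constructing a DataFrame

To construct a DataFrame, you can use the methods and properties of the Session class. Each of the following methods constructs a DataFrame from a different type of data source.

You can run these examples in your local development environment or call them within the main function defined in a Python worksheet.

To create a DataFrame from data in a table, view, or stream, call the table method: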

# Create a DataFrame from the data in the "sample_product_data" table.
df_table = session.table("sample_product_data")

# To print out the first 10 rows, call df_table.show()

To create a DataFrame from specified values, call the create_dataframe method:

# Create a DataFrame with one column named a from specified values.
df1 = session.create_dataframe([1, 2, 3, 4]).to_df("a")
df1.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df1

-------
|"A"  |
-------
|1    |
|2    |
|3    |
|4    |
-------

Create a DataFrame with four columns, "a", "b", "c", and "d":

# Create a DataFrame with 4 columns, "a", "b", "c" and "d".
df2 = session.create_dataframe([[1, 2, 3, 4]], schema=["a", "b", "c", "d"])
df2.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df2

-------------------------
|"A"  |"B"  |"C"  |"D"  |
-------------------------
|1    |2    |3    |4    |
-------------------------

Create another DataFrame with four columns, "a", "b", "c", and "d", this time from a Row object:

# Create another DataFrame with 4 columns, "a", "b", "c" and "d".
from snowflake.snowpark import Row
df3 = session.create_dataframe([Row(a=1, b=2, c=3, d=4)])
df3.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df3

-------------------------
|"A"  |"B"  |"C"  |"D"  |
-------------------------
|1    |2    |3    |4    |
-------------------------

Create a DataFrame and specify a schema:

# Create a DataFrame and specify a schema
from snowflake.snowpark.types import IntegerType, StringType, StructType, StructField
schema = StructType([StructField("a", IntegerType()), StructField("b", StringType())])
df4 = session.create_dataframe([[1, "snow"], [3, "flake"]], schema)
df4.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df4

---------------
|"A"  |"B"    |
---------------
|1    |snow   |
|3    |flake  |
---------------

To create a DataFrame containing a range of values, call the range method:

# Create a DataFrame from a range
# The DataFrame contains rows with values 1, 3, 5, 7, and 9 respectively.
df_range = session.range(1, 10, 2).to_df("a")
df_range.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_range

-------
|"A"  |
-------
|1    |
|3    |
|5    |
|7    |
|9    |
-------
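
Judging by the output above, the arguments in session.range(1, 10, 2) follow the same start, end (exclusive), and step convention as Python's built-in range. As a quick sanity check that needs no Snowflake connection, the sketch below reproduces the expected values with the built-in. The helper name expected_range_values is hypothetical and not part of Snowpark.

```python
def expected_range_values(start, end, step=1):
    """Return the values a (start, end, step) range produces; end is exclusive."""
    return list(range(start, end, step))


# Same arguments as session.range(1, 10, 2) in the example above.
values = expected_range_values(1, 10, 2)
print(values)
```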

To create a DataFrame to hold the data from a file in a stage, use the read property to get a DataFrameReader object. On the DataFrameReader object, call the method that corresponds to the format of the data in the file:

from snowflake.snowpark.types import StructType, StructField, StringType, IntegerType

# Create DataFrames from data in a stage.
df_json = session.read.json("@my_stage2/data1.json")
df_catalog = session.read.schema(StructType([StructField("name", StringType()), StructField("age", IntegerType())])).csv("@stage/some_dir")

To create a DataFrame to hold the results of a SQL query, call the sql method:

# Create a DataFrame from a SQL query
df_sql = session.sql("SELECT name from sample_product_data")
df_sql.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_sql

--------------
|"NAME"      |
--------------
|Product 1   |
|Product 1A  |
|Product 1B  |
|Product 2   |
|Product 2A  |
|Product 2B  |
|Product 3   |
|Product 3A  |
|Product 3B  |
|Product 4   |
--------------

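Although you can use the sql method to execute SELECT statements that retrieve data from tables and staged files, using the table method and the read property gives you better syntax highlighting, error highlighting, and intelligent code completion in development tools.

Specifying how the dataset should be transformed

To specify which columns to select and how to filter, sort, group, and otherwise shape the results, call the DataFrame methods that transform the dataset. To identify columns in these methods, use the col function or an expression that evaluates to a column. See Specifying columns and expressions.

For example:

To specify which rows should be returned, call the filter method: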

# Import the col function from the functions module.
# Python worksheets import this function by default
from snowflake.snowpark.functions import col

# Create a DataFrame for the rows with the ID 1
# in the "sample_product_data" table.

# This example uses the == operator of the Column object to perform an
# equality check.
df = session.table("sample_product_data").filter(col("id") == 1)
df.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df

------------------------------------------------------------------------------------
|"ID"  |"PARENT_ID"  |"CATEGORY_ID"  |"NAME"     |"SERIAL_NUMBER"  |"KEY"  |"3rd"  |
------------------------------------------------------------------------------------
|1     |0            |5              |Product 1  |prod-1           |1      |10     |
------------------------------------------------------------------------------------

To specify the columns that should be selected, call the select method:

# Import the col function from the functions module.
from snowflake.snowpark.functions import col

# Create a DataFrame that contains the id, name, and serial_number
# columns in the "sample_product_data" table.
df = session.table("sample_product_data").select(col("id"), col("name"), col("serial_number"))
df.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df

---------------------------------------
|"ID"  |"NAME"      |"SERIAL_NUMBER"  |
---------------------------------------
|1     |Product 1   |prod-1           |
|2     |Product 1A  |prod-1-A         |
|3     |Product 1B  |prod-1-B         |
|4     |Product 2   |prod-2           |
|5     |Product 2A  |prod-2-A         |
|6     |Product 2B  |prod-2-B         |
|7     |Product 3   |prod-3           |
|8     |Product 3A  |prod-3-A         |
|9     |Product 3B  |prod-3-B         |
|10    |Product 4   |prod-4           |
---------------------------------------

You can also reference columns in the following ways:

# Import the col function from the functions module.
from snowflake.snowpark.functions import col

df_product_info = session.table("sample_product_data")
df1 = df_product_info.select(df_product_info["id"], df_product_info["name"], df_product_info["serial_number"])
df2 = df_product_info.select(df_product_info.id, df_product_info.name, df_product_info.serial_number)
df3 = df_product_info.select("id", "name", "serial_number")

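Each method returns a new DataFrame object that has been transformed. The method does not affect the original DataFrame object. To apply multiple transformations, you can chain method calls, calling each subsequent transformation method on the new DataFrame object returned by the previous method call.

These transformation methods specify how to construct the SQL statement; they do not retrieve data from the Snowflake database. The action methods described in Performing an action to evaluate a DataFrame perform the data retrieval.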
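
To make the distinction between transformations and actions concrete, here is a minimal plain-Python sketch of the idea. LazyQuery is a hypothetical toy class, not the Snowpark implementation: each transformation returns a new object that only records SQL text, and nothing runs until collect() is called with an executor function (here a stand-in that just logs the statement). The toy also flattens everything into one SELECT and ignores call order, which real DataFrames do not.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class LazyQuery:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not Snowpark)."""
    table: str
    columns: Tuple[str, ...] = ("*",)
    conditions: Tuple[str, ...] = ()

    def filter(self, condition: str) -> "LazyQuery":
        # Return a new object; the original is left unchanged.
        return LazyQuery(self.table, self.columns, self.conditions + (condition,))

    def select(self, *columns: str) -> "LazyQuery":
        # Return a new object with a different column list.
        return LazyQuery(self.table, tuple(columns), self.conditions)

    def to_sql(self) -> str:
        sql = f"SELECT {', '.join(self.columns)} FROM {self.table}"
        if self.conditions:
            sql += " WHERE " + " AND ".join(self.conditions)
        return sql

    def collect(self, execute: Callable[[str], list]) -> list:
        # The action: only here is the SQL text handed to the executor.
        return execute(self.to_sql())


executed: List[str] = []  # records every statement "sent to the server"


def fake_execute(sql: str) -> list:
    executed.append(sql)
    return []  # a real executor would return rows


base = LazyQuery("sample_product_data")
query = base.filter("id = 1").select("name", "serial_number")
print(len(executed))  # nothing has been executed yet
query.collect(fake_execute)
print(executed[0])
```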

Joining DataFrames

To join DataFrame objects, call the join method:

# Create two DataFrames to join
df_lhs = session.create_dataframe([["a", 1], ["b", 2]], schema=["key", "value1"])
df_rhs = session.create_dataframe([["a", 3], ["b", 4]], schema=["key", "value2"])
# Create a DataFrame that joins the two DataFrames
# on the column named "key".
df_lhs.join(df_rhs, df_lhs.col("key") == df_rhs.col("key")).select(df_lhs["key"].as_("key"), "value1", "value2").show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_lhs.join(df_rhs, df_lhs.col("key") == df_rhs.col("key")).select(df_lhs["key"].as_("key"), "value1", "value2")

-------------------------------
|"KEY"  |"VALUE1"  |"VALUE2"  |
-------------------------------
|a      |1         |3         |
|b      |2         |4         |
-------------------------------

If both DataFrames have the same column to join on, you can use the following example syntax:

# Create two DataFrames to join
df_lhs = session.create_dataframe([["a", 1], ["b", 2]], schema=["key", "value1"])
df_rhs = session.create_dataframe([["a", 3], ["b", 4]], schema=["key", "value2"])
# If both dataframes have the same column "key", the following is more convenient.
df_lhs.join(df_rhs, ["key"]).show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_lhs.join(df_rhs, ["key"])

-------------------------------
|"KEY"  |"VALUE1"  |"VALUE2"  |
-------------------------------
|a      |1         |3         |
|b      |2         |4         |
-------------------------------

You can also use the & operator to connect join expressions:

# Create two DataFrames to join
df_lhs = session.create_dataframe([["a", 1], ["b", 2]], schema=["key", "value1"])
df_rhs = session.create_dataframe([["a", 3], ["b", 4]], schema=["key", "value2"])
# Use the & operator to combine join expressions. The | and ~ operators work similarly.
df_joined_multi_column = df_lhs.join(df_rhs, (df_lhs.col("key") == df_rhs.col("key")) & (df_lhs.col("value1") < df_rhs.col("value2"))).select(df_lhs["key"].as_("key"), "value1", "value2")
df_joined_multi_column.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_joined_multi_column

-------------------------------
|"KEY"  |"VALUE1"  |"VALUE2"  |
-------------------------------
|a      |1         |3         |
|b      |2         |4         |
-------------------------------

To perform a self-join, you must copy the DataFrame:

# copy the DataFrame if you want to do a self-join
from copy import copy

# Create two DataFrames to join
df_lhs = session.create_dataframe([["a", 1], ["b", 2]], schema=["key", "value1"])
df_rhs = session.create_dataframe([["a", 3], ["b", 4]], schema=["key", "value2"])
df_lhs_copied = copy(df_lhs)
df_self_joined = df_lhs.join(df_lhs_copied, (df_lhs.col("key") == df_lhs_copied.col("key")) & (df_lhs.col("value1") == df_lhs_copied.col("value1")))

When the DataFrames have overlapping columns, Snowpark prepends a randomly generated prefix to those columns in the join result:

# Create two DataFrames to join
df_lhs = session.create_dataframe([["a", 1], ["b", 2]], schema=["key", "value1"])
df_rhs = session.create_dataframe([["a", 3], ["b", 4]], schema=["key", "value2"])
df_lhs.join(df_rhs, df_lhs.col("key") == df_rhs.col("key")).show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_lhs.join(df_rhs, df_lhs.col("key") == df_rhs.col("key"))

-----------------------------------------------------
|"l_av5t_KEY"  |"VALUE1"  |"r_1p6k_KEY"  |"VALUE2"  |
-----------------------------------------------------
|a             |1         |a             |3         |
|b             |2         |b             |4         |
-----------------------------------------------------

You can rename the overlapping columns using Column.alias:

# Create two DataFrames to join
df_lhs = session.create_dataframe([["a", 1], ["b", 2]], schema=["key", "value1"])
df_rhs = session.create_dataframe([["a", 3], ["b", 4]], schema=["key", "value2"])
df_lhs.join(df_rhs, df_lhs.col("key") == df_rhs.col("key")).select(df_lhs["key"].alias("key1"), df_rhs["key"].alias("key2"), "value1", "value2").show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_lhs.join(df_rhs, df_lhs.col("key") == df_rhs.col("key")).select(df_lhs["key"].alias("key1"), df_rhs["key"].alias("key2"), "value1", "value2")

-----------------------------------------
|"KEY1"  |"KEY2"  |"VALUE1"  |"VALUE2"  |
-----------------------------------------
|a       |a       |1         |3         |
|b       |b       |2         |4         |
-----------------------------------------

To avoid random prefixes, you can also specify a suffix to append to the overlapping columns:

# Create two DataFrames to join
df_lhs = session.create_dataframe([["a", 1], ["b", 2]], schema=["key", "value1"])
df_rhs = session.create_dataframe([["a", 3], ["b", 4]], schema=["key", "value2"])
df_lhs.join(df_rhs, df_lhs.col("key") == df_rhs.col("key"), lsuffix="_left", rsuffix="_right").show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_lhs.join(df_rhs, df_lhs.col("key") == df_rhs.col("key"), lsuffix="_left", rsuffix="_right")

--------------------------------------------------
|"KEY_LEFT"  |"VALUE1"  |"KEY_RIGHT"  |"VALUE2"  |
--------------------------------------------------
|a           |1         |a            |3         |
|b           |2         |b            |4         |
--------------------------------------------------
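
The renaming shown in the output above can be sketched in plain Python. This is only an illustration of the visible result, not Snowpark's implementation: suffixed_columns is a hypothetical helper, and it assumes that only column names present on both sides receive a suffix. The suffixes are written in uppercase here because unquoted names are shown in uppercase in the result.

```python
def suffixed_columns(left, right, lsuffix="", rsuffix=""):
    """Return the join's output column names, suffixing names present on both sides."""
    overlap = set(left) & set(right)
    left_out = [c + lsuffix if c in overlap else c for c in left]
    right_out = [c + rsuffix if c in overlap else c for c in right]
    return left_out + right_out


columns = suffixed_columns(["KEY", "VALUE1"], ["KEY", "VALUE2"], "_LEFT", "_RIGHT")
print(columns)
```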

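These examples use DataFrame.col to specify the columns to use in the join. For more ways to specify columns, see Specifying columns and expressions.

If you need to join a table with itself on different columns, you cannot perform the self-join with a single DataFrame. The following examples use a single DataFrame to perform a self-join, which fails because the column expressions for "id" are present in both the left and right sides of the join: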

from snowflake.snowpark.exceptions import SnowparkJoinException

df = session.table("sample_product_data")
# This fails because columns named "id" and "parent_id"
# are in the left and right DataFrames in the join.
try:
  df_joined = df.join(df, col("id") == col("parent_id")) # fails
except SnowparkJoinException as e:
  print(e.message)

You cannot join a DataFrame with itself because the column references cannot be resolved correctly. Instead, create a copy of the DataFrame with copy.copy(), and join the DataFrame with this copy.

# This fails because columns named "id" and "parent_id"
# are in the left and right DataFrames in the join.
try:
  df_joined = df.join(df, df["id"] == df["parent_id"])   # fails
except SnowparkJoinException as e:
  print(e.message)

You cannot join a DataFrame with itself because the column references cannot be resolved correctly. Instead, create a copy of the DataFrame with copy.copy(), and join the DataFrame with this copy.

Instead, use the copy() function from Python's standard copy module to create a clone of the DataFrame object, and use the two DataFrame objects to perform the join:

from copy import copy

# Create a DataFrame object for the "sample_product_data" table for the left-hand side of the join.
df_lhs = session.table("sample_product_data")
# Clone the DataFrame object to use as the right-hand side of the join.
df_rhs = copy(df_lhs)

# Create a DataFrame that joins the two DataFrames
# for the "sample_product_data" table on the
# "id" and "parent_id" columns.
df_joined = df_lhs.join(df_rhs, df_lhs.col("id") == df_rhs.col("parent_id"))
df_joined.count()
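
To see what this self-join computes without connecting to Snowflake, the following plain-Python sketch emulates it over the (id, parent_id) pairs from the sample data set up earlier. It illustrates the join condition only, not how Snowpark executes the query; rows and joined are names chosen for this sketch.

```python
# (id, parent_id) pairs copied from the sample_product_data INSERT statement.
rows = [
    (1, 0), (2, 1), (3, 1),
    (4, 0), (5, 4), (6, 4),
    (7, 0), (8, 7), (9, 7),
    (10, 0), (11, 10), (12, 10),
]

# Inner join: pair each left row with every right row whose parent_id equals the left id.
joined = [
    (lhs_id, rhs_id)
    for lhs_id, _ in rows
    for rhs_id, rhs_parent in rows
    if lhs_id == rhs_parent
]

print(len(joined))
for parent, child in joined:
    print(f"product {parent} -> child product {child}")
```

Each top-level product (ids 1, 4, 7, and 10) matches its two children, so the emulated join yields 8 pairs, which is the value you would expect df_joined.count() to return for this data.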

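Specifying columns and expressions

When calling these transformation methods, you might need to specify columns or expressions that use columns. For example, when calling the select method, you need to specify the columns to select.

To refer to a column, create a Column object by calling the col function in the snowflake.snowpark.functions module: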

# Import the col function from the functions module.
from snowflake.snowpark.functions import col

df_product_info = session.table("sample_product_data").select(col("id"), col("name"))
df_product_info.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_product_info

---------------------
|"ID"  |"NAME"      |
---------------------
|1     |Product 1   |
|2     |Product 1A  |
|3     |Product 1B  |
|4     |Product 2   |
|5     |Product 2A  |
|6     |Product 2B  |
|7     |Product 3   |
|8     |Product 3A  |
|9     |Product 3B  |
|10    |Product 4   |
---------------------

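Note

To create a Column object for a literal, see Using literals as column objects.

When specifying a filter, projection, join condition, and so on, you can use Column objects in an expression. For example:

You can use Column objects with the filter method to specify a filter condition: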

# Specify the equivalent of "WHERE id = 20"
# in a SQL SELECT statement.
df_filtered = df.filter(col("id") == 20)

df = session.create_dataframe([[1, 3], [2, 10]], schema=["a", "b"])
# Specify the equivalent of "WHERE a + b < 10"
# in a SQL SELECT statement.
df_filtered = df.filter((col("a") + col("b")) < 10)
df_filtered.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_filtered

-------------
|"A"  |"B"  |
-------------
|1    |3    |
-------------

You can use Column objects with the select method to define an alias:

df = session.create_dataframe([[1, 3], [2, 10]], schema=["a", "b"])
# Specify the equivalent of "SELECT b * 10 AS c"
# in a SQL SELECT statement.
df_selected = df.select((col("b") * 10).as_("c"))
df_selected.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_selected

-------
|"C"  |
-------
|30   |
|100  |
-------

You can use Column objects with the join method to define a join condition:

dfX = session.create_dataframe([[1], [2]], schema=["a_in_X"])
dfY = session.create_dataframe([[1], [3]], schema=["b_in_Y"])
# Specify the equivalent of "X JOIN Y on X.a_in_X = Y.b_in_Y"
# in a SQL SELECT statement.
df_joined = dfX.join(dfY, col("a_in_X") == col("b_in_Y")).select(dfX["a_in_X"].alias("the_joined_column"))
df_joined.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_joined

-----------------------
|"THE_JOINED_COLUMN"  |
-----------------------
|1                    |
-----------------------

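When referring to columns in two different DataFrame objects that have the same name (for example, when joining the DataFrames on that column), you can use the DataFrame.col method on each DataFrame object to refer to the column in that object (for example, df1.col("name") and df2.col("name")).

The following example demonstrates how to use the DataFrame.col method to refer to a column in a specific DataFrame. The example joins two DataFrame objects that both have a column named key, and uses the Column.as_ method to rename the columns in the newly created DataFrame: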

# Create two DataFrames to join
df_lhs = session.create_dataframe([["a", 1], ["b", 2]], schema=["key", "value"])
df_rhs = session.create_dataframe([["a", 3], ["b", 4]], schema=["key", "value"])
# Create a DataFrame that joins two other DataFrames (df_lhs and df_rhs).
# Use the DataFrame.col method to refer to the columns used in the join.
df_joined = df_lhs.join(df_rhs, df_lhs.col("key") == df_rhs.col("key")).select(df_lhs.col("key").as_("key"), df_lhs.col("value").as_("L"), df_rhs.col("value").as_("R"))
df_joined.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_joined

---------------------
|"KEY"  |"L"  |"R"  |
---------------------
|a      |1    |3    |
|b      |2    |4    |
---------------------

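Using double quotes around object identifiers (table names, column names, and so on)

The names of databases, schemas, tables, and stages that you specify must conform to the Snowflake identifier requirements.

Create a table that has case-sensitive columns: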

session.sql("""
  create or replace temp table "10tablename"(
  id123 varchar, -- case insensitive because it's not quoted.
  "3rdID" varchar, -- case sensitive.
  "id with space" varchar -- case sensitive.
)""").collect()
# Add return to the statement to return the collect() results in a Python worksheet

[Row(status='Table 10tablename successfully created.')]

Then add values to the table:

session.sql("""insert into "10tablename" (id123, "3rdID", "id with space") values ('a', 'b', 'c')""").collect()
# Add return to the statement to return the collect() results in a Python worksheet

[Row(number of rows inserted=1)]

Then create a DataFrame for the table and query the table:

df = session.table('"10tablename"')
df.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df

---------------------------------------
|"ID123"  |"3rdID"  |"id with space"  |
---------------------------------------
|a        |b        |c                |
---------------------------------------

When you specify an unquoted name, Snowflake treats the name as uppercase. For example, the following call resolves id123 to the ID123 column:

df.select(col("id123")).collect()
# Prepend a return statement to return the collect() results in a Python worksheet

[Row(ID123='a')]

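If the name does not conform to the identifier requirements, you must use double quotes (") around the name. Use a backslash (\) to escape the double quote character within a string literal. For example, the following table name does not start with a letter or an underscore, so you must use double quotes around the name: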

df = session.table("\"10tablename\"")

Alternatively, you can delimit the string literal with single quotes, so that the double quote characters inside it do not need to be escaped:

df = session.table('"10tablename"')

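When specifying the name of a column, you do not need to use double quotes around the name. The Snowpark library automatically encloses the column name in double quotes if the name does not conform to the identifier requirements: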

df.select(col("3rdID")).collect()
# Prepend a return statement to return the collect() results in a Python worksheet

[Row(3rdID='b')]

As another example, the following calls are equivalent:

df.select(col("id with space")).collect()
# Prepend a return statement to return the collect() results in a Python worksheet

[Row(id with space='c')]

df.select(col("\"id with space\"")).collect()
# Prepend a return statement to return the collect() results in a Python worksheet

[Row(id with space='c')]

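If you have already added double quotes around a column name, the library does not insert additional double quotes around the name.

In some cases, the column name might contain double quote characters: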

session.sql('''
  create or replace temp table quoted(
  "name_with_""air""_quotes" varchar,
  """column_name_quoted""" varchar
)''').collect()
# Prepend a return statement to return the collect() results in a Python worksheet

[Row(status='Table QUOTED successfully created.')]

session.sql('''insert into quoted ("name_with_""air""_quotes", """column_name_quoted""") values ('a', 'b')''').collect()
# Prepend a return statement to return the collect() results in a Python worksheet

[Row(number of rows inserted=1)]

As explained in Identifier requirements, for each double quote character within a double-quoted identifier, you must use two double quote characters (for example, "name_with_""air""_quotes" and """column_name_quoted"""):

df_table = session.table("quoted")
df_table.select("\"name_with_\"\"air\"\"_quotes\"").collect()
# Prepend a return statement to return the collect() results in a Python worksheet

[Row(name_with_"air"_quotes='a')]

df_table.select("\"\"\"column_name_quoted\"\"\"").collect()
# Prepend a return statement to return the collect() results in a Python worksheet

[Row("column_name_quoted"='b')]

When an identifier is enclosed in double quotes (whether you explicitly added the quotes or the library added the quotes for you), Snowflake treats the identifier as case-sensitive:

# The following calls are NOT equivalent!
# The Snowpark library adds double quotes around the column name,
# which makes Snowflake treat the column name as case-sensitive.
df.select(col("id with space")).collect()
# Prepend a return statement to return the collect() results in a Python worksheet

[Row(id with space='c')]

Compare this with the following example, which fails:

from snowflake.snowpark.exceptions import SnowparkSQLException
try:
  df.select(col("ID WITH SPACE")).collect()
except SnowparkSQLException as e:
  print(e.message)

000904 (42000): SQL compilation error: error line 1 at position 7
invalid identifier '"ID WITH SPACE"'
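
The quoting rules described in this section can be summarized in a few lines of plain Python. The sketch below is an illustration under stated assumptions, not the Snowpark library's code: normalize_identifier is a hypothetical helper, and the regular expression is a simplified reading of the unquoted-identifier rule (a letter or underscore, followed by letters, digits, underscores, or dollar signs).

```python
import re

# Simplified pattern for names that are valid without quotes (an assumption).
_UNQUOTED = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")


def normalize_identifier(name: str) -> str:
    """Return the name as it would be resolved under the rules described above."""
    if len(name) >= 2 and name.startswith('"') and name.endswith('"'):
        # Already quoted: keep as is; the name is case-sensitive.
        return name
    if _UNQUOTED.match(name):
        # Unquoted and valid: treated as uppercase.
        return name.upper()
    # Not a valid unquoted identifier: wrap in double quotes,
    # doubling any embedded double quote characters.
    return '"' + name.replace('"', '""') + '"'


for raw in ["id123", "3rdID", "id with space", '"id with space"', "ID WITH SPACE"]:
    print(f"{raw!r:20} -> {normalize_identifier(raw)}")
```

Under these rules, "id with space" and "ID WITH SPACE" normalize to different quoted identifiers, which is consistent with the invalid identifier error shown above.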

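Using literals as column objects

To use a literal in a method that takes a Column object as an argument, create a Column object for the literal by passing the literal to the lit function in the snowflake.snowpark.functions module. For example: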

# Import for the lit and col functions.
from snowflake.snowpark.functions import col, lit

# Keep the rows in which num_items is greater than 5
# (assumes that df has a num_items column).
# Use `lit(5)` to create a Column object for the literal 5.
df_filtered = df.filter(col("num_items") > lit(5))

Casting a column object to a specific type

To cast a Column object to a specific type, call the cast method, and pass in a type object from the snowflake.snowpark.types module. For example, to cast a literal as a NUMBER with a precision of 5 and a scale of 2:

# Import for the lit function.
from snowflake.snowpark.functions import lit

# Import for the DecimalType class.
from snowflake.snowpark.types import DecimalType

decimal_value = lit(0.05).cast(DecimalType(5,2))
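
Precision is the total number of digits and scale is the number of digits after the decimal point, so NUMBER(5, 2) can hold values from -999.99 to 999.99. The plain-Python sketch below uses the standard decimal module to illustrate this. The helper fits_number is hypothetical and exists only for this illustration; it does not reproduce Snowflake's casting or rounding behavior.

```python
from decimal import Decimal


def fits_number(value: str, precision: int, scale: int) -> bool:
    """Check whether a decimal literal fits NUMBER(precision, scale) without rounding."""
    _sign, digits, exponent = Decimal(value).as_tuple()
    fractional_digits = max(-exponent, 0)
    integer_digits = max(len(digits) + exponent, 0)
    return fractional_digits <= scale and integer_digits <= precision - scale


print(fits_number("0.05", 5, 2))     # two fractional digits, no integer digits
print(fits_number("999.99", 5, 2))   # three integer digits, two fractional digits
print(fits_number("1000.00", 5, 2))  # four integer digits
print(fits_number("0.005", 5, 2))    # three fractional digits
```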

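Chaining method calls

Because each method that transforms a DataFrame object returns a new DataFrame object that has the transformation applied, you can chain method calls to produce a new DataFrame that is transformed in additional ways.

The following example returns a DataFrame that is configured to:

Query the sample_product_data table.
Return the row with id = 1.
Select the name and serial_number columns.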

df_product_info = session.table("sample_product_data").filter(col("id") == 1).select(col("name"), col("serial_number"))
df_product_info.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_product_info

-------------------------------
|"NAME"     |"SERIAL_NUMBER"  |
-------------------------------
|Product 1  |prod-1           |
-------------------------------

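In this example:

session.table("sample_product_data") returns a DataFrame for the sample_product_data table.

Although the DataFrame does not yet contain the data from the table, the object does contain the definitions of the columns in the table.

filter(col("id") == 1) returns a DataFrame for the sample_product_data table that is set up to return the row with id = 1.

The DataFrame does not yet contain the matching row from the table. The matching row is not retrieved until you call an action method.

select(col("name"), col("serial_number")) returns a DataFrame that contains the name and serial_number columns for the row in the sample_product_data table that has id = 1.

When you chain method calls, keep in mind that the order of calls is important. Each method call returns a DataFrame that has been transformed. Make sure that subsequent calls work with the transformed DataFrame.

When using Snowpark Python, you might need to make the select and filter method calls in an order different from the order in which you would use the equivalent keywords (SELECT and WHERE) in a SQL statement.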
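
The following plain-Python sketch illustrates why the order matters, using a list of dictionaries as a stand-in for a table. The helpers select_columns and filter_rows are hypothetical and are not Snowpark methods: once a projection has dropped the id column, a later filter on id has nothing to work with.

```python
rows = [
    {"id": 1, "name": "Product 1", "serial_number": "prod-1"},
    {"id": 2, "name": "Product 1A", "serial_number": "prod-1-A"},
]


def select_columns(data, *columns):
    """Keep only the named columns of each row (toy projection)."""
    return [{c: row[c] for c in columns} for row in data]


def filter_rows(data, column, value):
    """Keep only rows whose column equals value (toy filter)."""
    return [row for row in data if row[column] == value]


# Filter first, then project: the id column still exists when it is needed.
ok = select_columns(filter_rows(rows, "id", 1), "name", "serial_number")
print(ok)

# Project first, then filter: the id column has already been dropped.
try:
    filter_rows(select_columns(rows, "name", "serial_number"), "id", 1)
    failed = False
except KeyError as exc:
    failed = True
    print(f"cannot filter on missing column: {exc}")
```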

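Retrieving column definitions

To retrieve the definition of the columns in the dataset for the DataFrame, use the schema property. This property returns a StructType object that contains a list of StructField objects. Each StructField object contains the definition of a column.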

# Import the StructType
from snowflake.snowpark.types import *
# Get the StructType object that describes the columns in the
# underlying rowset.
table_schema = session.table("sample_product_data").schema
table_schema
StructType([StructField('ID', LongType(), nullable=True), StructField('PARENT_ID', LongType(), nullable=True), StructField('CATEGORY_ID', LongType(), nullable=True), StructField('NAME', StringType(), nullable=True), StructField('SERIAL_NUMBER', StringType(), nullable=True), StructField('KEY', LongType(), nullable=True), StructField('"3rd"', LongType(), nullable=True)])

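In the returned StructType object, the column names are always normalized. Unquoted identifiers are returned in uppercase, and quoted identifiers are returned in the exact case in which they were defined.

The following example creates a DataFrame containing the columns named ID and 3rd. For the column name 3rd, the Snowpark library automatically encloses the name in double quotes ("3rd") because the name does not comply with the identifier requirements.

The example reads the schema property and then reads the names property on the returned StructType object to get a list of column names. The names are normalized in the StructType returned by the schema property.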

# Create a DataFrame containing the "id" and "3rd" columns.
df_selected_columns = session.table("sample_product_data").select(col("id"), col("3rd"))
# Print out the names of the columns in the schema.
# This prints ['ID', '"3rd"']
print(df_selected_columns.schema.names)

['ID', '"3rd"']

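Performing an action to evaluate a DataFrame

As mentioned earlier, a DataFrame is lazily evaluated, which means that the SQL statement is not sent to the server for execution until you perform an action. An action causes the DataFrame to be evaluated and sends the corresponding SQL statement to the server for execution.

The following methods perform an action:

Class	Method	Description
`DataFrame`	`collect`	Evaluates the DataFrame and returns the resulting dataset as a `list` of `Row` objects.
`DataFrame`	`count`	Evaluates the DataFrame and returns the number of rows.
`DataFrame`	`show`	Evaluates the DataFrame and prints the rows to the console. This method limits the number of rows to 10 by default.
`DataFrameWriter`	`save_as_table`	Saves the data in the DataFrame to the specified table. See Saving data to a table.

For example, to execute a query against a table and return the results, call the collect method: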

# Create a DataFrame with the "id" and "name" columns from the "sample_product_data" table.
# This does not execute the query.
df = session.table("sample_product_data").select(col("id"), col("name"))

# Send the query to the server for execution and
# return a list of Rows containing the results.
results = df.collect()
# Use a return statement to return the collect() results in a Python worksheet
# return results

To execute the query and return the number of results, call the count method:

# Create a DataFrame for the "sample_product_data" table.
df_products = session.table("sample_product_data")

# Send the query to the server for execution and
# print the count of rows in the table.
print(df_products.count())  # prints 12

To execute a query and print the results to the console, call the show method:

# Create a DataFrame for the "sample_product_data" table.
df_products = session.table("sample_product_data")

# Send the query to the server for execution and
# print the results to the console.
# The query limits the number of rows to 10 by default.
df_products.show()
# To return the DataFrame as a table in a Python worksheet use return instead of show()
# return df_products

-------------------------------------------------------------------------------------
|"ID"  |"PARENT_ID"  |"CATEGORY_ID"  |"NAME"      |"SERIAL_NUMBER"  |"KEY"  |"3rd"  |
-------------------------------------------------------------------------------------
|1     |0            |5              |Product 1   |prod-1           |1      |10     |
|2     |1            |5              |Product 1A  |prod-1-A         |1      |20     |
|3     |1            |5              |Product 1B  |prod-1-B         |1      |30     |
|4     |0            |10             |Product 2   |prod-2           |2      |40     |
|5     |4            |10             |Product 2A  |prod-2-A         |2      |50     |
|6     |4            |10             |Product 2B  |prod-2-B         |2      |60     |
|7     |0            |20             |Product 3   |prod-3           |3      |70     |
|8     |7            |20             |Product 3A  |prod-3-A         |3      |80     |
|9     |7            |20             |Product 3B  |prod-3-B         |3      |90     |
|10    |0            |50             |Product 4   |prod-4           |4      |100    |
-------------------------------------------------------------------------------------

To limit the number of rows to 20 instead:

# Create a DataFrame for the "sample_product_data" table.
df_products = session.table("sample_product_data")

# Limit the number of rows to 20, rather than 10.
df_products.show(20)
# All rows are returned when you use return in a Python worksheet to return the DataFrame as a table
return df_products


-------------------------------------------------------------------------------------
|"ID"  |"PARENT_ID"  |"CATEGORY_ID"  |"NAME"      |"SERIAL_NUMBER"  |"KEY"  |"3rd"  |
-------------------------------------------------------------------------------------
|1     |0            |5              |Product 1   |prod-1           |1      |10     |
|2     |1            |5              |Product 1A  |prod-1-A         |1      |20     |
|3     |1            |5              |Product 1B  |prod-1-B         |1      |30     |
|4     |0            |10             |Product 2   |prod-2           |2      |40     |
|5     |4            |10             |Product 2A  |prod-2-A         |2      |50     |
|6     |4            |10             |Product 2B  |prod-2-B         |2      |60     |
|7     |0            |20             |Product 3   |prod-3           |3      |70     |
|8     |7            |20             |Product 3A  |prod-3-A         |3      |80     |
|9     |7            |20             |Product 3B  |prod-3-B         |3      |90     |
|10    |0            |50             |Product 4   |prod-4           |4      |100    |
|11    |10           |50             |Product 4A  |prod-4-A         |4      |100    |
|12    |10           |50             |Product 4B  |prod-4-B         |4      |100    |
-------------------------------------------------------------------------------------

Note

If you read the schema property to get the definitions of the columns in the DataFrame, you do not need to call an action method.
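As a minimal sketch of the note above, the helper below lists column definitions by reading only the schema property, with no action method involved. The helper name describe_columns is hypothetical; the sketch assumes that df.schema returns a StructType whose fields expose name, datatype, and nullable.

```python
def describe_columns(df):
    """Return (name, datatype, nullable) for each column of a Snowpark DataFrame.

    Only the schema property is read; no action method such as collect() is called.
    """
    return [(field.name, field.datatype, field.nullable) for field in df.schema.fields]
```

For example, describe_columns(session.table("sample_product_data")) would return one tuple per column of the sample table.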

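Saving Data to a Table¶

To save the contents of a DataFrame to a table:

Read the write property to get a DataFrameWriter object.
Call the mode method of the DataFrameWriter object and specify the mode. For more information, see the API documentation. This method returns a new DataFrameWriter object that is configured with the specified mode.
Call the save_as_table method of the DataFrameWriter object to save the contents of the DataFrame to the specified table.

You do not need to call a separate method (for example, collect) to execute the SQL statement that saves the data to the table.

For example: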

df.write.mode("overwrite").save_as_table("table1")


Creating a View From a DataFrame¶

To create a view from a DataFrame, call the create_or_replace_view method, which creates the new view immediately:

import os
database = os.environ["snowflake_database"]  # use your own database and schema
schema = os.environ["snowflake_schema"]
view_name = "my_view"
df.create_or_replace_view(f"{database}.{schema}.{view_name}")


[Row(status='View MY_VIEW successfully created.')]

In a Python worksheet, the worksheet runs in the context of a database and schema, so you can run the following to create a view:

# Define a DataFrame
df_products = session.table("sample_product_data")
# Define a View name
view_name = "my_view"
# Create the view
df_products.create_or_replace_view(f"{view_name}")
# return the view name
return view_name + " successfully created"
my_view successfully created


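Views that you create by calling create_or_replace_view are persistent. If you no longer need a view, you can drop the view manually.

Alternatively, use the create_or_replace_temp_view method, which creates a temporary view. The temporary view is available only in the session in which it is created.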
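The temporary-view variant can be sketched as follows. The helper name publish_temp_view and the default view name are hypothetical; the only Snowpark call used is create_or_replace_temp_view on an existing DataFrame.

```python
def publish_temp_view(df, view_name="my_temp_view"):
    """Create (or replace) a temporary view over df and return the view name.

    The view is visible only in the session that created it, so there is
    nothing to drop manually afterwards.
    """
    df.create_or_replace_temp_view(view_name)
    return view_name
```

A later session.table(view_name) call in the same session could then read from the view.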

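Working With Files in a Stage¶

This section explains how to query data in a file in a Snowflake stage. For other operations on files, use SQL statements.

To query data in files in a Snowflake stage, use the DataFrameReader class:

Read the read property of the Session class to access a DataFrameReader object.
If the files are in CSV format, describe the fields in the file. To do this:
1. Create a StructType object that consists of a list of StructField objects that describe the fields in the file.
2. For each StructField object, specify the following:
  - The name of the field.
  - The data type of the field (specified as an object in the snowflake.snowpark.types module).
  - Whether or not the field is nullable.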
  For example:
  from snowflake.snowpark.types import StructType, StructField, StringType

  schema_for_data_file = StructType([
      StructField("id", StringType()),
      StructField("name", StringType()),
  ])

3. Call the schema method of the DataFrameReader object, passing in the StructType object.

  For example:
  df_reader = session.read.schema(schema_for_data_file)

  The schema method returns a DataFrameReader object that is configured to read files containing the specified fields.

  You do not need to do this for files in other formats (such as JSON). For those files, the DataFrameReader treats the data as a single field of the VARIANT type with the field name $1.
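If you need to specify additional information about how the data should be read (for example, that the data is compressed or that a CSV file uses a semicolon instead of a comma to delimit fields), call the option or options method of the DataFrameReader object.

The option method takes the name and value of the option that you want to set and lets you combine multiple chained calls, whereas the options method takes a dictionary of option names and their corresponding values.

For the names and values of the file format options, see the documentation on CREATE FILE FORMAT.

You can also set the copy options described in the COPY INTO TABLE documentation. Setting copy options can result in a more expensive execution strategy when you retrieve the data into the DataFrame.

The following example sets up the DataFrameReader object to query data in a CSV file that is not compressed and that uses a semicolon as the field delimiter.

df_reader = df_reader.option("field_delimiter", ";").option("COMPRESSION", "NONE")

The option and options methods return a DataFrameReader object that is configured with the specified options.
Call the method corresponding to the format of the file (for example, the csv method), passing in the location of the file.

df = df_reader.csv("@s3_ts_stage/emails/data_0_0_0.csv")

The methods corresponding to the format of a file return a DataFrame object that is configured to hold the data in that file.
Use the DataFrame object methods to perform any transformations needed on the dataset (for example, selecting specific fields or filtering rows).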

For example, to extract the color element from a JSON file in the stage named my_stage:

# Import the sql_expr function from the functions module.
from snowflake.snowpark.functions import sql_expr

df = session.read.json("@my_stage").select(sql_expr("$1:color"))

As explained earlier, for files in formats other than CSV (for example, JSON), the DataFrameReader treats the data in the file as a single VARIANT column with the name $1.

This example uses the sql_expr function in the snowflake.snowpark.functions module to specify the path to the color element.

The sql_expr function does not interpret or modify the input argument. The function only lets you construct SQL expressions and snippets that are not yet supported by the Snowpark API.
Call an action method to query the data in the file.

As is the case with DataFrames for tables, the data is not retrieved into the DataFrame until you call an action method.
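The steps above can be combined into one helper. This is a sketch under stated assumptions: the function name read_semicolon_csv and the example stage path are hypothetical, and the schema argument is expected to be a StructType built as described earlier. It uses the options method with a dictionary, where the earlier example chains two option calls.

```python
def read_semicolon_csv(session, schema, stage_path):
    """Build a DataFrame over an uncompressed, semicolon-delimited CSV file in a stage.

    schema is a StructType describing the fields; stage_path is a stage location
    such as "@my_stage/data.csv" (hypothetical). No data is read until an action
    method such as collect() or show() is called on the returned DataFrame.
    """
    reader = session.read.schema(schema).options(
        {"field_delimiter": ";", "compression": "NONE"}
    )
    return reader.csv(stage_path)
```

Passing all options in one dictionary keeps the reader configuration in a single place, which is convenient when the options come from a configuration file.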

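Working With Semi-Structured Data¶

Using a DataFrame, you can query and access semi-structured data (for example, JSON data). The next sections explain how to work with semi-structured data in a DataFrame.

Traversing Semi-Structured Data
Explicitly Casting Values in Semi-Structured Data
Flattening an Array of Objects into Rows

Note

The examples in these sections use the sample data in Sample Data Used in Examples.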

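Traversing Semi-Structured Data¶

To refer to a specific field or element in semi-structured data, use the following forms of indexing on a Column object:

Use col_object["<field_name>"] to return a Column object for a field in an OBJECT (or a VARIANT that contains an OBJECT).
Use col_object[<index>] to return a Column object for an element in an ARRAY (or a VARIANT that contains an ARRAY).

Note

If the field name or elements in the path are irregular and make it difficult to use the indexing described above, you can use get, get_ignore_case, or get_path as an alternative.

For example, the following code selects the dealership field in the objects in the src column of the sample data: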

from snowflake.snowpark.functions import col

df = session.table("car_sales")
df.select(col("src")["dealership"]).show()

The code prints the following output:

----------------------------
|"""SRC""['DEALERSHIP']"   |
----------------------------
|"Valley View Auto Sales"  |
|"Tindel Toyota"           |
----------------------------

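Note

The values in the DataFrame are surrounded by double quotes because they are returned as string literals. To cast these values to a specific type, see Explicitly Casting Values in Semi-Structured Data.

You can also chain indexing operations to traverse a path to a specific field or element.

For example, the following code selects the name field in the salesperson object: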

df = session.table("car_sales")
df.select(df["src"]["salesperson"]["name"]).show()

The code prints the following output:

------------------------------------
|"""SRC""['SALESPERSON']['NAME']"  |
------------------------------------
|"Frank Beasley"                   |
|"Greg Northrup"                   |
------------------------------------

As another example, the following code selects the first element of the vehicle field, which holds an array of vehicles. The example also selects the price field from the first element.

df = session.table("car_sales")
df.select(df["src"]["vehicle"][0]).show()
df.select(df["src"]["vehicle"][0]["price"]).show()

The code prints the following output:

---------------------------
|"""SRC""['VEHICLE'][0]"  |
---------------------------
|{                        |
|  "extras": [            |
|    "ext warranty",      |
|    "paint protection"   |
|  ],                     |
|  "make": "Honda",       |
|  "model": "Civic",      |
|  "price": "20275",      |
|  "year": "2017"         |
|}                        |
|{                        |
|  "extras": [            |
|    "ext warranty",      |
|    "rust proofing",     |
|    "fabric protection"  |
|  ],                     |
|  "make": "Toyota",      |
|  "model": "Camry",      |
|  "price": "23500",      |
|  "year": "2017"         |
|}                        |
---------------------------

------------------------------------
|"""SRC""['VEHICLE'][0]['PRICE']"  |
------------------------------------
|"20275"                           |
|"23500"                           |
------------------------------------

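As an alternative to the indexing described above, you can use the get, get_ignore_case, or get_path functions if the field name or elements in the path are irregular.

For example, the following lines of code both print the value of a specified field in an object: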

from snowflake.snowpark.functions import col, get, get_path, lit

df.select(get(col("src"), lit("dealership"))).show()
df.select(col("src")["dealership"]).show()

Similarly, the following lines of code both print the value of a field at a specified path in an object:

df.select(get_path(col("src"), lit("vehicle[0].make"))).show()
df.select(col("src")["vehicle"][0]["make"]).show()

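To make the path syntax concrete, here is a plain-Python analogy of what a path such as vehicle[0].make means. It is not Snowpark code: get_by_path is a hypothetical helper that walks an ordinary dict, and the src value is a trimmed copy of one sample record.

```python
import re

def get_by_path(value, path):
    """Walk a nested dict/list using a path such as "vehicle[0].make".

    Returns None when any step is missing, loosely mirroring how a missing
    path yields NULL in SQL.
    """
    # Split the path into field names and bracketed integer indexes.
    tokens = re.findall(r"[^.\[\]]+|\[\d+\]", path)
    for token in tokens:
        try:
            if token.startswith("["):
                value = value[int(token[1:-1])]  # array element
            else:
                value = value[token]             # object field
        except (KeyError, IndexError, TypeError):
            return None
    return value

src = {
    "dealership": "Valley View Auto Sales",
    "vehicle": [{"make": "Honda", "model": "Civic", "price": "20275"}],
}
print(get_by_path(src, "vehicle[0].make"))  # Honda
print(get_by_path(src, "vehicle[1].make"))  # None
```

Each dot-separated name corresponds to one ["<field_name>"] step and each bracketed number to one [<index>] step in the chained indexing form.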

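Explicitly Casting Values in Semi-Structured Data¶

By default, the values of fields and elements are returned as string literals (including the double quotes), as shown in the examples above.

To avoid unexpected results, call the cast method to cast the value to a specific type. For example, the following code prints the values without and with casting: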

# Import the objects for the data types, including StringType.
from snowflake.snowpark.types import *

df = session.table("car_sales")
df.select(col("src")["salesperson"]["id"]).show()
df.select(col("src")["salesperson"]["id"].cast(StringType())).show()

The code prints the following output:

----------------------------------
|"""SRC""['SALESPERSON']['ID']"  |
----------------------------------
|"55"                            |
|"274"                           |
----------------------------------

---------------------------------------------------
|"CAST (""SRC""['SALESPERSON']['ID'] AS STRING)"  |
---------------------------------------------------
|55                                               |
|274                                              |
---------------------------------------------------
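The difference between the two outputs can be illustrated with the standard json module. This is only an analogy, assuming the uncast value is displayed as JSON text while the cast value is displayed as an ordinary string; it does not call Snowpark.

```python
import json

salesperson_id = "55"                         # the value stored in the JSON document
as_variant_text = json.dumps(salesperson_id)  # JSON rendering keeps the double quotes
as_plain_string = salesperson_id              # a string cast yields the bare characters

print(as_variant_text)  # "55"
print(as_plain_string)  # 55
```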

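Flattening an Array of Objects into Rows¶

If you need to "flatten" semi-structured data into a DataFrame (for example, producing a row for every object in an array), call flatten by using the join_table_function method. This is equivalent to the FLATTEN SQL function. If you pass in a path to an object or array, the method returns a DataFrame that contains a row for each field or element in the object or array.

For example, in the sample data, src:customer is an array of objects that contain information about a customer. Each object contains a name and address field.

If you pass this path to the flatten function: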

df = session.table("car_sales")
df.join_table_function("flatten", col("src")["customer"]).show()

the method returns a DataFrame:

----------------------------------------------------------------------------------------------------------------------------------------------------------
|"SRC"                                      |"SEQ"  |"KEY"  |"PATH"  |"INDEX"  |"VALUE"                            |"THIS"                               |
----------------------------------------------------------------------------------------------------------------------------------------------------------
|{                                          |1      |NULL   |[0]     |0        |{                                  |[                                    |
|  "customer": [                            |       |       |        |         |  "address": "San Francisco, CA",  |  {                                  |
|    {                                      |       |       |        |         |  "name": "Joyce Ridgely",         |    "address": "San Francisco, CA",  |
|      "address": "San Francisco, CA",      |       |       |        |         |  "phone": "16504378889"           |    "name": "Joyce Ridgely",         |
|      "name": "Joyce Ridgely",             |       |       |        |         |}                                  |    "phone": "16504378889"           |
|      "phone": "16504378889"               |       |       |        |         |                                   |  }                                  |
|    }                                      |       |       |        |         |                                   |]                                    |
|  ],                                       |       |       |        |         |                                   |                                     |
|  "date": "2017-04-28",                    |       |       |        |         |                                   |                                     |
|  "dealership": "Valley View Auto Sales",  |       |       |        |         |                                   |                                     |
|  "salesperson": {                         |       |       |        |         |                                   |                                     |
|    "id": "55",                            |       |       |        |         |                                   |                                     |
|    "name": "Frank Beasley"                |       |       |        |         |                                   |                                     |
|  },                                       |       |       |        |         |                                   |                                     |
|  "vehicle": [                             |       |       |        |         |                                   |                                     |
|    {                                      |       |       |        |         |                                   |                                     |
|      "extras": [                          |       |       |        |         |                                   |                                     |
|        "ext warranty",                    |       |       |        |         |                                   |                                     |
|        "paint protection"                 |       |       |        |         |                                   |                                     |
|      ],                                   |       |       |        |         |                                   |                                     |
|      "make": "Honda",                     |       |       |        |         |                                   |                                     |
|      "model": "Civic",                    |       |       |        |         |                                   |                                     |
|      "price": "20275",                    |       |       |        |         |                                   |                                     |
|      "year": "2017"                       |       |       |        |         |                                   |                                     |
|    }                                      |       |       |        |         |                                   |                                     |
|  ]                                        |       |       |        |         |                                   |                                     |
|}                                          |       |       |        |         |                                   |                                     |
|{                                          |2      |NULL   |[0]     |0        |{                                  |[                                    |
|  "customer": [                            |       |       |        |         |  "address": "New York, NY",       |  {                                  |
|    {                                      |       |       |        |         |  "name": "Bradley Greenbloom",    |    "address": "New York, NY",       |
|      "address": "New York, NY",           |       |       |        |         |  "phone": "12127593751"           |    "name": "Bradley Greenbloom",    |
|      "name": "Bradley Greenbloom",        |       |       |        |         |}                                  |    "phone": "12127593751"           |
|      "phone": "12127593751"               |       |       |        |         |                                   |  }                                  |
|    }                                      |       |       |        |         |                                   |]                                    |
|  ],                                       |       |       |        |         |                                   |                                     |
|  "date": "2017-04-28",                    |       |       |        |         |                                   |                                     |
|  "dealership": "Tindel Toyota",           |       |       |        |         |                                   |                                     |
|  "salesperson": {                         |       |       |        |         |                                   |                                     |
|    "id": "274",                           |       |       |        |         |                                   |                                     |
|    "name": "Greg Northrup"                |       |       |        |         |                                   |                                     |
|  },                                       |       |       |        |         |                                   |                                     |
|  "vehicle": [                             |       |       |        |         |                                   |                                     |
|    {                                      |       |       |        |         |                                   |                                     |
|      "extras": [                          |       |       |        |         |                                   |                                     |
|        "ext warranty",                    |       |       |        |         |                                   |                                     |
|        "rust proofing",                   |       |       |        |         |                                   |                                     |
|        "fabric protection"                |       |       |        |         |                                   |                                     |
|      ],                                   |       |       |        |         |                                   |                                     |
|      "make": "Toyota",                    |       |       |        |         |                                   |                                     |
|      "model": "Camry",                    |       |       |        |         |                                   |                                     |
|      "price": "23500",                    |       |       |        |         |                                   |                                     |
|      "year": "2017"                       |       |       |        |         |                                   |                                     |
|    }                                      |       |       |        |         |                                   |                                     |
|  ]                                        |       |       |        |         |                                   |                                     |
|}                                          |       |       |        |         |                                   |                                     |
----------------------------------------------------------------------------------------------------------------------------------------------------------

From this DataFrame, you can select the name and address fields from each object in the VALUE field:

df.join_table_function("flatten", col("src")["customer"]).select(col("value")["name"], col("value")["address"]).show()


-------------------------------------------------
|"""VALUE""['NAME']"   |"""VALUE""['ADDRESS']"  |
-------------------------------------------------
|"Joyce Ridgely"       |"San Francisco, CA"     |
|"Bradley Greenbloom"  |"New York, NY"          |
-------------------------------------------------

The following code extends the previous example by casting the values to a specific type and changing the names of the columns:

df.join_table_function("flatten", col("src")["customer"]).select(
    col("value")["name"].cast(StringType()).as_("Customer Name"),
    col("value")["address"].cast(StringType()).as_("Customer Address"),
).show()


-------------------------------------------
|"Customer Name"     |"Customer Address"  |
-------------------------------------------
|Joyce Ridgely       |San Francisco, CA   |
|Bradley Greenbloom  |New York, NY        |
-------------------------------------------
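For intuition about what flattening does, the following plain-Python sketch produces one output row per array element and then picks fields from each VALUE. It is an analogy, not the Snowpark API: flatten_rows is a hypothetical helper, it models only the SEQ, INDEX, and VALUE columns, and car_sales is a trimmed copy of the sample data.

```python
def flatten_rows(rows, array_field):
    """Yield one output row per element of each source row's array_field.

    Mimics a few of the FLATTEN output columns (SEQ, INDEX, VALUE) joined
    back to the source row, for illustration only.
    """
    for seq, src in enumerate(rows, start=1):
        for index, value in enumerate(src.get(array_field, [])):
            yield {"SRC": src, "SEQ": seq, "INDEX": index, "VALUE": value}

car_sales = [
    {"customer": [{"name": "Joyce Ridgely", "address": "San Francisco, CA"}]},
    {"customer": [{"name": "Bradley Greenbloom", "address": "New York, NY"}]},
]

# Select the name and address fields from each flattened VALUE.
customers = [
    (row["VALUE"]["name"], row["VALUE"]["address"])
    for row in flatten_rows(car_sales, "customer")
]
print(customers)
```

A source row whose array held several customers would contribute several output rows, all sharing the same SEQ value.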

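Executing SQL Statements¶

To execute a SQL statement that you specify, call the sql method of the Session class and pass in the statement to be executed. The method returns a DataFrame.

The SQL statement is not executed until you call an action method.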

# Create a temporary stage.
# The collect() method causes this SQL statement to be executed.
session.sql("create or replace temp stage my_stage").collect()
# Prepend a return statement to return the collect() results in a Python worksheet

[Row(status='Stage area MY_STAGE successfully created.')]

# Get the list of the files in the stage.
stage_files = session.sql("ls @my_stage").collect()
# Prepend a return statement to return the collect() results in a Python worksheet

# Resume the operation of a warehouse.
# Note that you must call the collect method to execute
# the SQL statement.
session.sql("alter warehouse if exists my_warehouse resume if suspended").collect()
# Prepend a return statement to return the collect() results in a Python worksheet

[Row(status='Statement executed successfully.')]

# Set up a SQL statement to copy data from a stage to a table.
session.sql("copy into sample_product_data from @my_stage file_format=(type = csv)").collect()
# Prepend a return statement to return the collect() results in a Python worksheet

[Row(status='Copy executed with 0 files processed.')]

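If you want to call methods that transform the DataFrame (for example, filter or select), note that these methods work only if the underlying SQL statement is a SELECT statement. The transformation methods are not supported for other kinds of SQL statements.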

from snowflake.snowpark.exceptions import SnowparkSQLException
from snowflake.snowpark.functions import col

df = session.sql("select id, parent_id from sample_product_data where id < 10")
# Because the underlying SQL statement for the DataFrame is a SELECT statement,
# you can call the filter method to transform this DataFrame.
results = df.filter(col("id") < 3).select(col("id")).collect()
# Prepend a return statement to return the collect() results in a Python worksheet

# In this example, the underlying SQL statement is not a SELECT statement.
df = session.sql("ls @my_stage")
# Calling the filter method results in an error.
try:
    df.filter(col("size") > 50).collect()
except SnowparkSQLException as e:
    print(e.message)


000904 (42000): SQL compilation error: error line 1 at position 104
invalid identifier 'SIZE'
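One possible workaround, sketched here under assumptions, is to collect the rows of the non-SELECT statement first and filter them in Python. The helper name large_stage_files is hypothetical, and the sketch assumes that each collected row can be indexed by the field name "size", as the LS output columns suggest.

```python
def large_stage_files(session, stage_name, min_size):
    """Return the LS rows for files larger than min_size bytes.

    Assumes each returned row exposes a "size" field. The filtering runs in
    Python on the collected rows, not in Snowflake.
    """
    rows = session.sql(f"ls @{stage_name}").collect()
    return [row for row in rows if row["size"] > min_size]
```

Because the filter is applied client-side, this approach suits small result sets such as a stage listing rather than large tables.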

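Submitting Snowpark Queries Concurrently¶

Note

This feature requires Snowpark Library for Python version 1.24 or later and server version 8.46 or later.

Thread-safe session objects let different parts of your Snowpark Python code run concurrently while using the same session. This makes it possible to run multiple operations, such as transformations on multiple DataFrames, at the same time. It is especially useful for queries that the Snowflake server can process independently, and it aligns with a more traditional multithreading approach.

The Global Interpreter Lock (GIL) in Python is a mutex that protects access to Python objects by preventing multiple native threads from executing Python bytecode at the same time. I/O-bound operations can still benefit from Python's threading model because the GIL is released during I/O operations, but CPU-bound threads do not achieve true parallelism because only one thread can execute at a time.

When used inside Snowflake (for example, in stored procedures), the Snowpark Python server manages the GIL by releasing it before submitting queries to Snowflake. This makes true concurrency possible when multiple queries are queued from separate threads. With this management, Snowpark lets multiple threads submit queries concurrently, so the queries can run in parallel.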
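The effect of releasing the GIL while waiting can be seen with a small standard-library experiment. This is an analogy, not Snowpark code: fake_query is a hypothetical stand-in that only sleeps, the way a thread would wait on a submitted query.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_query(name, wait_seconds=0.2):
    """Stand-in for a query submission: the thread only waits, as it would on I/O."""
    time.sleep(wait_seconds)  # sleeping releases the GIL, like a blocking network call
    return f"{name} done"

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(fake_query, ["q1", "q2", "q3"]))
elapsed = time.perf_counter() - start

print(results)
print(f"elapsed: {elapsed:.2f}s (running the three waits back to back would take about 0.6s)")
```

The three waits overlap, so the total time is close to a single wait. Replacing the sleep with a CPU-heavy loop would remove most of that benefit, which is the distinction the paragraph above draws.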

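Benefits of Using Thread-Safe Session Objects in Snowpark¶

The ability to run multiple DataFrame operations concurrently can provide the following benefits:

Improved performance: Thread-safe session objects let you run multiple Snowpark Python queries concurrently, which reduces overall runtime. For example, if you need to process multiple tables independently, you do not have to wait for each table to finish before starting the next one, which significantly reduces the time required to complete the job.
Efficient compute utilization: Submitting queries concurrently makes efficient use of Snowflake's compute resources and reduces idle time.
Usability: Thread-safe session objects integrate with Python's native multithreading APIs, so developers can use Python's built-in tools to control thread behavior and optimize parallel execution.

Thread-safe session objects and async jobs can complement each other, depending on the use case. Async jobs are useful when you do not need to wait for a job to finish; they allow non-blocking execution without managing a thread pool. Thread-safe session objects, on the other hand, are useful when you want to submit multiple queries concurrently from the client side. In some cases, a block of threaded code can also contain async jobs, so the two approaches can be used together effectively.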
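For comparison with the threaded examples in this section, an async-job version might look like the sketch below. The helper name collect_async is hypothetical; it relies on the DataFrame method collect_nowait, which returns an AsyncJob, and on the job's result method, which blocks until that job finishes.

```python
def collect_async(dataframes):
    """Submit every DataFrame's query without blocking, then gather the results.

    collect_nowait() returns an AsyncJob; result() blocks until that job finishes.
    """
    jobs = [df.collect_nowait() for df in dataframes]  # all queries are now in flight
    return [job.result() for job in jobs]
```

No thread pool is needed here because submission itself does not block; threads become useful when each unit of work involves several dependent steps.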

The following examples show how thread-safe session objects can improve your data pipelines.

Example 1: Concurrent Loading of Multiple Tables¶

This example uses three threads to run COPY INTO commands concurrently, loading data from three different CSV files into three separate tables.

import threading
from snowflake.snowpark import Session

# Define the list of tables
tables = ["customers", "orders", "products"]

# Function to copy data from stage to tables
def execute_copy(table_name):
    try:
        # Read data from the stage using DataFrameReader
        df = (
            session.read.option("SKIP_HEADER", 1)
            .option("PATTERN", f"{table_name}[.]csv")
            .option("FORCE", True)
            .csv("@my_stage")
        )

        # Copy data into the target table
        df.copy_into_table(
            table_name=table_name, target_columns=session.table(table_name).columns
        )

    except Exception as e:
        print(f"Failed to copy data into {table_name}, Error: {e}")

# Create an empty list of threads
threads = []

# Loop through and start a thread for each table
for table in tables:
    thread = threading.Thread(target=execute_copy, args=(table,))
    threads.append(thread)
    thread.start()

# Wait for all threads to finish
for thread in threads:
    thread.join()

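Example 2: Concurrent Processing of Multiple Tables¶

This example shows how to use multiple threads to concurrently filter, aggregate, and insert data into a result table from each customer transaction table (transaction_customer1, transaction_customer2, and transaction_customer3).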

from concurrent.futures import ThreadPoolExecutor
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, month, sum, lit

# List of customers
customers = ["customer1", "customer2", "customer3"]

# Define a function to process each customer transaction table
def process_customer_table(customer_name):
    table_name = f"transaction_{customer_name}"

    try:
        # Load the customer transaction table
        df = session.table(table_name)
        print(f"Processing {table_name}...")

        # Filter data by positive values and non null categories
        df_filtered = df.filter((col("value") > 0) & col("category").is_not_null())

        # Perform aggregation: Sum of value by category and month
        df_aggregated = (
            df_filtered.with_column("month", month(col("date")))
            .with_column("customer_name", lit(customer_name))
            .group_by(col("category"), col("month"), col("customer_name"))
            .agg(sum("value").alias("total_value"))
        )

        # Save the processed data into a new result table
        df_aggregated.show()
        df_aggregated.write.save_as_table("aggregate_customers", mode="append")
        print(f"Data from {table_name} processed and saved")

    except Exception as e:
        print(f"Error processing {table_name}: {e}")

# Using ThreadPoolExecutor to handle concurrency
with ThreadPoolExecutor(max_workers=3) as executor:
    # Submit tasks for each customer table
    executor.map(process_customer_table, customers)

# Display the results from the aggregate table
session.table("aggregate_customers").show()

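Limitations of Using Thread-Safe Session Objects¶

If you need to manage multiple transactions concurrently, use multiple session objects, because multiple threads of a single session do not support concurrent transactions.
Changing the session runtime configuration (including Snowflake session variables such as database, schema, and warehouse, and client-side configuration such as cte_optimization_enabled and sql_simplifier_enabled) while other threads are active can lead to unexpected behavior. To avoid conflicts, it is best to use separate session objects when different threads require distinct configurations. For example, if you need to perform operations on different databases in parallel, make sure that each thread has its own session object rather than sharing the same session.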
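The one-session-per-thread pattern described above can be sketched as follows. All names here are hypothetical: make_session stands for whatever factory your code uses to create a session configured for a given database, and work is the function to run against it. The sketch assumes only that the session object has a close method.

```python
from concurrent.futures import ThreadPoolExecutor

def run_per_database(make_session, databases, work):
    """Run work(session, database) in parallel, giving each thread its own session.

    make_session(database) is a caller-supplied factory (hypothetical) that returns
    a session configured for that database. Each session is closed when its task ends.
    """
    def task(database):
        session = make_session(database)
        try:
            return work(session, database)
        finally:
            session.close()

    with ThreadPoolExecutor(max_workers=len(databases)) as executor:
        return list(executor.map(task, databases))
```

Because no session is shared, a configuration change made in one thread cannot affect queries running in another.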

Returning the Contents of a DataFrame as a pandas DataFrame¶

To return the contents of a DataFrame as a pandas DataFrame, use the to_pandas method.

For example:

python_df = session.create_dataframe(["a", "b", "c"])
pandas_df = python_df.to_pandas()

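For results that may not fit comfortably in memory, a batched variant can be sketched with to_pandas_batches, which yields pandas DataFrames one at a time. The helper name count_rows_in_batches is hypothetical, and counting rows is only a placeholder for real per-batch processing.

```python
def count_rows_in_batches(df):
    """Iterate over the result as a sequence of pandas DataFrames and count the rows.

    to_pandas_batches() yields one pandas DataFrame per batch, which avoids
    holding the whole result in memory at once.
    """
    total = 0
    for batch in df.to_pandas_batches():
        total += len(batch)
    return total
```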

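Snowpark DataFrames vs Snowpark pandas DataFrame: Which should I choose?¶

After you install the Snowpark Python library, you can use either the DataFrames API or pandas on Snowflake.

Snowpark DataFrames are modeled after PySpark, while Snowpark pandas extends the Snowpark DataFrame functionality and provides a familiar interface to pandas users to ease migration and adoption. We recommend choosing between the APIs based on your use case and preference.

Use Snowpark pandas if you…	Use Snowpark DataFrames if you…
Prefer working with, or have existing code written in, pandas	Prefer working with, or have existing code written in, Spark
Have a workflow that involves interactive analysis and iterative exploration	Have a workflow that involves batch processing and limited iterative development
Are familiar with DataFrame operations that execute immediately	Are familiar with DataFrame operations that are lazily evaluated
Prefer data to be consistent and ordered during operations	Are okay with data not being ordered
Are okay with slightly slower performance than Snowpark DataFrames in exchange for an easier-to-use API	Prioritize performance over ease of use

From an implementation perspective, Snowpark DataFrames and pandas DataFrames are semantically different. Snowpark DataFrames are modeled after PySpark: they operate on the original data source, get the most recently updated data, and do not maintain order across operations. Snowpark pandas is modeled after pandas: it operates on a snapshot of the data, maintains order during operations, and allows order-based positional indexing. Order maintenance is useful for visually inspecting data in interactive data analysis.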
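Because Snowpark DataFrames do not maintain order, code that depends on row order needs an explicit sort. A minimal sketch, with the hypothetical helper name first_rows_in_order, using the DataFrame methods sort, limit, and collect:

```python
def first_rows_in_order(df, order_column, n=10):
    """Return the first n rows after an explicit sort on order_column.

    Without sort(), the order of rows in a Snowpark DataFrame is not guaranteed.
    """
    return df.sort(order_column).limit(n).collect()
```

For example, first_rows_in_order(session.table("sample_product_data"), "id", 3) would return the three rows with the smallest id values.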

For more information, see Using pandas on Snowflake with Snowpark DataFrames.