snowflake.snowpark.DataFrameAIFunctions.extract
- DataFrameAIFunctions.extract(input_column: Union[snowflake.snowpark.column.Column, str], *, response_format: Optional[Union[Dict[str, str], List]] = None, output_column: Optional[str] = None) -> snowflake.snowpark.DataFrame
Extract structured information from text or files using a response schema.
- Parameters:
input_column – The column (a Column object or a column name as a string) containing the text or FILE data to extract information from. Use to_file for staged file paths.
response_format – The schema describing the information to extract. Supports:
  - Simple object schema (dict) mapping feature names to extraction prompts:
    {'name': 'What is the last name of the employee?', 'address': 'What is the address of the employee?'}
  - Array of strings containing the information to be extracted:
    ['What is the last name of the employee?', 'What is the address of the employee?']
  - Array of arrays containing two strings (feature name and extraction prompt):
    [['name', 'What is the last name of the employee?'], ['address', 'What is the address of the employee?']]
  - Array of strings with colon-separated feature names and extraction prompts:
    ['name: What is the last name of the employee?', 'address: What is the address of the employee?']
output_column – The name of the output column to be appended. If not provided, a column named AI_EXTRACT_OUTPUT is appended.
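The three response_format forms that carry feature names (dict, array of pairs, colon-separated strings) appear to describe the same schema. The sketch below illustrates that equivalence in plain Python. The helper to_pairs is hypothetical and not part of Snowpark, and it is not how the library processes the argument. It deliberately rejects the unnamed array-of-strings form, because this page does not say how output keys are named in that case.

```python
from typing import Dict, List, Union

# Hypothetical type alias and helper (not part of Snowpark), for illustration only.
NamedSchema = Union[Dict[str, str], List[Union[str, List[str]]]]


def to_pairs(response_format: NamedSchema) -> List[List[str]]:
    """Convert a named schema form into [feature_name, prompt] pairs."""
    if isinstance(response_format, dict):
        return [[name, prompt] for name, prompt in response_format.items()]

    pairs = []
    for item in response_format:
        if isinstance(item, str):
            # Colon-separated form: split on the first colon only, so the
            # prompt itself may contain further colons.
            name, sep, prompt = item.partition(":")
            if not sep:
                raise ValueError(f"no feature name in {item!r}")
            pairs.append([name.strip(), prompt.strip()])
        else:
            # Array-of-arrays form: already a [feature_name, prompt] pair.
            name, prompt = item
            pairs.append([name, prompt])
    return pairs


as_dict = {
    "name": "What is the last name of the employee?",
    "address": "What is the address of the employee?",
}
as_pairs = [
    ["name", "What is the last name of the employee?"],
    ["address", "What is the address of the employee?"],
]
as_strings = [
    "name: What is the last name of the employee?",
    "address: What is the address of the employee?",
]

# All three named forms normalize to the same list of pairs.
same_schema = to_pairs(as_dict) == to_pairs(as_pairs) == to_pairs(as_strings)
```

Any of the three values could then be passed as response_format. Which one to use is mostly a matter of readability in the calling code.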
- Returns:
A new DataFrame with an appended column holding a JSON object; the extracted fields are nested under the response key.
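A minimal sketch of unpacking one output value on the client side. It assumes the collected cell arrives either as a JSON string or as an already-parsed dict. The sample string only mirrors the shape described here and is not real output, and the helper extracted_fields is hypothetical.

```python
import json

# Sample value shaped like the documented output (illustrative, not real output).
raw = '{"response": {"city": "San Francisco", "name": "John"}}'


def extracted_fields(value):
    """Return the dict of extracted fields nested under the "response" key."""
    # Assumption: the value is a JSON string or an already-parsed dict.
    parsed = json.loads(value) if isinstance(value, str) else value
    return parsed.get("response", {})


fields = extracted_fields(raw)
city = fields.get("city")
```

Using .get keeps the lookup from raising when a field is absent from the response.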
Examples:
>>> # Extract from text string
>>> from snowflake.snowpark.functions import col
>>> df = session.create_dataframe([
...     ["John Smith lives in San Francisco and works for Snowflake"],
... ], schema=["text"])
>>> result_df = df.ai.extract(
...     input_column="text",
...     response_format={'name': 'What is the first name of the employee?', 'city': 'What is the address of the employee?'},
...     output_column="extracted",
... )
>>> result_df.select("EXTRACTED").show()
--------------------------------
|"EXTRACTED"                   |
--------------------------------
|{                             |
|  "response": {               |
|    "city": "San Francisco",  |
|    "name": "John"            |
|  }                           |
|}                             |
--------------------------------

>>> # Extract using array format
>>> df = session.create_dataframe(
...     [
...         ["Alice Johnson works in Seattle"],
...         ["Bob Williams works in Portland"],
...     ],
...     schema=["text"]
... )
>>> result_df = df.ai.extract(
...     input_column=col("text"),
...     response_format=[["name", "What is the first name?"], ["city", "What city do they work in?"]],
...     output_column="info",
... )
>>> result_df.show()
------------------------------------------------------------
|"TEXT"                          |"INFO"                   |
------------------------------------------------------------
|Alice Johnson works in Seattle  |{                        |
|                                |  "response": {          |
|                                |    "city": "Seattle",   |
|                                |    "name": "Alice"      |
|                                |  }                      |
|                                |}                        |
|Bob Williams works in Portland  |{                        |
|                                |  "response": {          |
|                                |    "city": "Portland",  |
|                                |    "name": "Bob"        |
|                                |  }                      |
|                                |}                        |
------------------------------------------------------------

>>> # Extract lists using List: prefix
>>> df = session.create_dataframe(
...     [["Python, Java, and JavaScript are popular programming languages"]],
...     schema=["text"]
... )
>>> result_df = df.ai.extract(
...     input_column="text",
...     response_format=[["languages", "List: What programming languages are mentioned?"]],
...     output_column="extracted",
... )
>>> result_df.select("EXTRACTED").show()
----------------------
|"EXTRACTED"         |
----------------------
|{                   |
|  "response": {     |
|    "languages": [  |
|      "Python",     |
|      "Java",       |
|      "JavaScript"  |
|    ]               |
|  }                 |
|}                   |
----------------------

>>> # Extract from file
>>> from snowflake.snowpark.functions import to_file
>>> _ = session.sql("CREATE OR REPLACE TEMP STAGE mystage ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')").collect()
>>> _ = session.file.put("tests/resources/invoice.pdf", "@mystage", auto_compress=False)
>>> df = session.create_dataframe([["@mystage/invoice.pdf"]], schema=["file_path"])
>>> result_df = df.ai.extract(
...     input_column=to_file(col("file_path")),
...     response_format=[["date", "What is the invoice date?"], ["amount", "What is the amount?"]],
...     output_column="info",
... )
>>> result_df.select("INFO").show()
--------------------------------
|"INFO"                        |
--------------------------------
|{                             |
|  "response": {               |
|    "amount": "USD $950.00",  |
|    "date": "Nov 26, 2016"    |
|  }                           |
|}                             |
--------------------------------
This function or method is experimental since 1.39.0.