snowflake.snowpark.DataFrameAIFunctions.extract

DataFrameAIFunctions.extract(input_column: Union[snowflake.snowpark.column.Column, str], *, response_format: Optional[Union[Dict[str, str], List]] = None, output_column: Optional[str] = None) → snowflake.snowpark.DataFrame

Extract structured information from text or files using a response schema.

Parameters:

input_column – The column (Column object or column name as string) containing the text or FILE data to extract information from. Use to_file for staged file paths.
response_format –
The schema describing information to extract. Supports:
- Simple object schema (dict) mapping feature names to extraction prompts: {'name': 'What is the last name of the employee?', 'address': 'What is the address of the employee?'}
- Array of strings containing the information to be extracted: ['What is the last name of the employee?', 'What is the address of the employee?']
- Array of arrays containing two strings (feature name and extraction prompt): [['name', 'What is the last name of the employee?'], ['address', 'What is the address of the employee?']]
- Array of strings with colon-separated feature names and extraction prompts: ['name: What is the last name of the employee?', 'address: What is the address of the employee?']
output_column – The name of the output column to be appended. If not provided, a column named AI_EXTRACT_OUTPUT is appended.
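
The three response_format shapes that carry feature names (dict, array of pairs, colon-separated strings) hold the same information. The following is a minimal sketch of how the dict form maps onto the other two. The helpers to_pairs and to_colon_strings are hypothetical names used for illustration only; they are not part of Snowpark.

```python
from typing import Dict, List


def to_pairs(schema: Dict[str, str]) -> List[List[str]]:
    """Hypothetical helper: dict form -> array of [feature name, prompt] pairs."""
    return [[name, prompt] for name, prompt in schema.items()]


def to_colon_strings(schema: Dict[str, str]) -> List[str]:
    """Hypothetical helper: dict form -> array of 'feature name: prompt' strings."""
    return [f"{name}: {prompt}" for name, prompt in schema.items()]


schema = {
    "name": "What is the last name of the employee?",
    "address": "What is the address of the employee?",
}

# Both results describe the same extraction as `schema`, in the array shapes.
pairs = to_pairs(schema)
colon_strings = to_colon_strings(schema)
```

Any of schema, pairs, or colon_strings could then be passed as response_format, per the list of supported shapes above.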

Returns:

A new DataFrame with one appended column. Each value in that column is a JSON object that holds the extracted fields under the response key, alongside an error key.
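
Once rows are collected on the client, the extracted fields can be pulled out of the output object with ordinary JSON handling. The sketch below assumes the cell arrives as a JSON string shaped like the outputs in the examples on this page, and that a non-null error value signals a failed extraction; both are assumptions, and unpack_extract_output is a hypothetical helper, not a Snowpark function.

```python
import json
from typing import Any, Dict, Union


def unpack_extract_output(cell: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: return the 'response' fields from one output cell.

    Assumes the cell is a JSON string (or an already-parsed dict) with
    'error' and 'response' keys, as in the example outputs on this page.
    """
    payload = json.loads(cell) if isinstance(cell, str) else cell
    # Assumption: a non-null 'error' means the extraction did not succeed.
    if payload.get("error") is not None:
        raise ValueError(f"extraction failed: {payload['error']}")
    return payload["response"]


# Sample value copied from the shape of the first example's output.
raw = '{"error": null, "response": {"city": "San Francisco", "name": "John"}}'
fields = unpack_extract_output(raw)
```

In practice the input would come from something like a collected row's output column rather than a literal string.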

Examples:

>>> # Extract from text string
>>> from snowflake.snowpark.functions import col
>>> df = session.create_dataframe([
...     ["John Smith lives in San Francisco and works for Snowflake"],
... ], schema=["text"])
>>> result_df = df.ai.extract(
...     input_column="text",
...     response_format={'name': 'What is the first name of the employee?', 'city': 'In which city does the employee live?'},
...     output_column="extracted",
... )
>>> result_df.select("EXTRACTED").show()
--------------------------------
|"EXTRACTED"                   |
--------------------------------
|{                             |
|  "error": null,              |
|  "response": {               |
|    "city": "San Francisco",  |
|    "name": "John"            |
|  }                           |
|}                             |
--------------------------------


>>> # Extract using array format
>>> df = session.create_dataframe(
...     [
...         ["Alice Johnson works in Seattle"],
...         ["Bob Williams works in Portland"],
...     ],
...     schema=["text"]
... )
>>> result_df = df.ai.extract(
...     input_column=col("text"),
...     response_format=[["name", "What is the first name?"], ["city", "What city do they work in?"]],
...     output_column="info",
... )
>>> result_df.show()
------------------------------------------------------------
|"TEXT"                          |"INFO"                   |
------------------------------------------------------------
|Alice Johnson works in Seattle  |{                        |
|                                |  "error": null,         |
|                                |  "response": {          |
|                                |    "city": "Seattle",   |
|                                |    "name": "Alice"      |
|                                |  }                      |
|                                |}                        |
|Bob Williams works in Portland  |{                        |
|                                |  "error": null,         |
|                                |  "response": {          |
|                                |    "city": "Portland",  |
|                                |    "name": "Bob"        |
|                                |  }                      |
|                                |}                        |
------------------------------------------------------------


>>> # Extract lists using List: prefix
>>> df = session.create_dataframe(
...     [["Python, Java, and JavaScript are popular programming languages"]],
...     schema=["text"]
... )
>>> result_df = df.ai.extract(
...     input_column="text",
...     response_format=[["languages", "List: What programming languages are mentioned?"]],
...     output_column="extracted",
... )
>>> result_df.select("EXTRACTED").show()
----------------------
|"EXTRACTED"         |
----------------------
|{                   |
|  "error": null,    |
|  "response": {     |
|    "languages": [  |
|      "Python",     |
|      "Java",       |
|      "JavaScript"  |
|    ]               |
|  }                 |
|}                   |
----------------------


>>> # Extract from file
>>> from snowflake.snowpark.functions import to_file
>>> _ = session.sql("CREATE OR REPLACE TEMP STAGE mystage ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')").collect()
>>> _ = session.file.put("tests/resources/invoice.pdf", "@mystage", auto_compress=False)
>>> df = session.create_dataframe([["@mystage/invoice.pdf"]], schema=["file_path"])
>>> result_df = df.ai.extract(
...     input_column=to_file(col("file_path")),
...     response_format=[["date", "What is the invoice date?"], ["amount", "What is the amount?"]],
...     output_column="info",
... )
>>> result_df.select("INFO").show()
--------------------------------
|"INFO"                        |
--------------------------------
|{                             |
|  "error": null,              |
|  "response": {               |
|    "amount": "USD $950.00",  |
|    "date": "Nov 26, 2016"    |
|  }                           |
|}                             |
--------------------------------


This method has been experimental since version 1.39.0.