snowflake.snowpark.functions.ai_extract

snowflake.snowpark.functions.ai_extract(input: Union[Column, str], response_format: Union[dict, list]) → Column

Extracts information from an input string or file based on the specified response format.

Parameters:

input –
One of the following:
- A string or Column containing the text to extract information from
- A FILE type Column representing the document to extract information from
response_format –
Information to be extracted in one of the following formats:
- Simple object schema (dict) mapping feature names to extraction prompts: {'name': 'What is the last name of the employee?', 'address': 'What is the address of the employee?'}
- Array of strings containing the information to be extracted: ['What is the last name of the employee?', 'What is the address of the employee?']
- Array of arrays containing two strings (feature name and extraction prompt): [['name', 'What is the last name of the employee?'], ['address', 'What is the address of the employee?']]
- Array of strings with colon-separated feature names and extraction prompts: ['name: What is the last name of the employee?', 'address: What is the address of the employee?']
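
The named formats above carry the same information in different shapes. The following is a minimal sketch, in plain Python, of how one dict schema could be rewritten into the array-of-arrays and colon-separated forms. The helpers `to_pairs` and `to_colon_strings` are hypothetical names introduced for illustration and are not part of Snowpark; the sketch also assumes that the three named forms are interchangeable ways of stating the same request.

```python
from typing import Dict, List


def to_pairs(schema: Dict[str, str]) -> List[List[str]]:
    """Rewrite a {feature: prompt} dict as [[feature, prompt], ...]."""
    return [[name, prompt] for name, prompt in schema.items()]


def to_colon_strings(schema: Dict[str, str]) -> List[str]:
    """Rewrite a {feature: prompt} dict as ['feature: prompt', ...]."""
    return [f"{name}: {prompt}" for name, prompt in schema.items()]


# Hypothetical helpers applied to the schema used in the parameter description.
schema = {
    "name": "What is the last name of the employee?",
    "address": "What is the address of the employee?",
}
pairs = to_pairs(schema)
colon_strings = to_colon_strings(schema)
```

Any of `schema`, `pairs`, or `colon_strings` could then be passed as `response_format`.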

Returns:

A Column containing a JSON object with the extracted information.
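
In the examples on this page, the returned JSON object has an `error` key and a `response` key that holds the extracted features. A minimal sketch of unpacking such a value on the client side follows. It assumes the value has already been fetched as a JSON string; `parse_extraction` is a hypothetical helper, not a Snowpark function, and the envelope shape is inferred only from the example output shown here.

```python
import json
from typing import Any, Dict


def parse_extraction(raw: str) -> Dict[str, Any]:
    """Return the extracted features, or raise if the envelope reports an error."""
    payload = json.loads(raw)
    # Assumption: a non-null "error" means the extraction did not succeed.
    if payload.get("error") is not None:
        raise ValueError(f"ai_extract reported an error: {payload['error']}")
    return payload["response"]


# Sample value copied from the shape of the example output on this page.
raw = '{"error": null, "response": {"city": "San Francisco", "name": "John"}}'
fields = parse_extraction(raw)
```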

Note

You can either ask questions in natural language or describe the information to be extracted (e.g., ‘City, street, ZIP’ instead of ‘What is the address?’).
To extract a list, add ‘List:’ at the beginning of each question.
A maximum of 100 features can be extracted.
Documents must be no more than 125 pages long.
The maximum output length is 512 tokens per question.

Supported file formats: PDF, PNG, PPTX, EML, DOC, DOCX, JPEG, JPG, HTM, HTML, TEXT, TXT, TIF, TIFF.
Files must be less than 100 MB in size.
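
Some of the limits in this note can be checked locally before a file is uploaded. The sketch below is one possible pre-flight check and is not part of Snowpark: `preflight_problems` and the constants are hypothetical names, the file format is guessed from the file extension only, and 100 MB is assumed to mean 100 × 10^6 bytes (the stricter reading). The page count and token limits are not checked.

```python
import os
from typing import List, Union

# Values taken from the note above; the names are illustrative only.
SUPPORTED_EXTENSIONS = {
    "pdf", "png", "pptx", "eml", "doc", "docx", "jpeg",
    "jpg", "htm", "html", "text", "txt", "tif", "tiff",
}
MAX_FEATURES = 100
MAX_FILE_BYTES = 100 * 1000 * 1000  # assumption: decimal megabytes


def preflight_problems(
    filename: str, size_bytes: int, response_format: Union[dict, list]
) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported file format: {ext or '(none)'}")
    if size_bytes >= MAX_FILE_BYTES:
        problems.append("file must be less than 100 MB")
    # len() counts features for both the dict and the list formats.
    if len(response_format) > MAX_FEATURES:
        problems.append("at most 100 features can be extracted")
    return problems


ok = preflight_problems("invoice.pdf", 250_000, {"date": "What is the date of the invoice?"})
bad = preflight_problems("notes.md", 250_000, ["What is the date of the invoice?"])
```

Returning a list of problems rather than raising on the first one lets a caller report every violation at once.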

Examples:

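>>> # The examples assume an existing Snowpark `session`
>>> from snowflake.snowpark.functions import ai_extract, col, to_file
>>> # Extract from text string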
>>> df = session.range(1).select(
...     ai_extract(
...         'John Smith lives in San Francisco and works for Snowflake',
...         {'name': 'What is the first name of the employee?', 'city': 'What city does the employee live in?'}
...     ).alias("extracted")
... )
>>> df.show()
--------------------------------
|"EXTRACTED"                   |
--------------------------------
|{                             |
|  "error": null,              |
|  "response": {               |
|    "city": "San Francisco",  |
|    "name": "John"            |
|  }                           |
|}                             |
--------------------------------


>>> # Extract using array format
>>> df = session.create_dataframe(
...     ["Alice Johnson works in Seattle", "Bob Williams works in Portland"],
...     schema=["text"]
... )
>>> extracted_df = df.select(
...     col("text"),
...     ai_extract(col("text"), [['name', 'What is the first name?'], ['city', 'What city do they work in?']]).alias("info")
... )
>>> extracted_df.show()
------------------------------------------------------------
|"TEXT"                          |"INFO"                   |
------------------------------------------------------------
|Alice Johnson works in Seattle  |{                        |
|                                |  "error": null,         |
|                                |  "response": {          |
|                                |    "city": "Seattle",   |
|                                |    "name": "Alice"      |
|                                |  }                      |
|                                |}                        |
|Bob Williams works in Portland  |{                        |
|                                |  "error": null,         |
|                                |  "response": {          |
|                                |    "city": "Portland",  |
|                                |    "name": "Bob"        |
|                                |  }                      |
|                                |}                        |
------------------------------------------------------------


>>> # Extract lists using List: prefix
>>> df = session.range(1).select(
...     ai_extract(
...         'Python, Java, and JavaScript are popular programming languages',
...         [['languages', 'List: What programming languages are mentioned?']]
...     ).alias("extracted")
... )
>>> df.show()
----------------------
|"EXTRACTED"         |
----------------------
|{                   |
|  "error": null,    |
|  "response": {     |
|    "languages": [  |
|      "Python",     |
|      "Java",       |
|      "JavaScript"  |
|    ]               |
|  }                 |
|}                   |
----------------------



>>> # Extract from file
>>> _ = session.sql("CREATE OR REPLACE TEMP STAGE mystage ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')").collect()
>>> _ = session.file.put("tests/resources/invoice.pdf", "@mystage", auto_compress=False)
>>> df = session.range(1).select(
...     ai_extract(
...         to_file('@mystage/invoice.pdf'),
...         [['date', 'What is the date of the invoice?'], ['amount', 'What is the amount of the invoice?']]
...     ).alias("extracted")
... )
>>> df.show()
--------------------------------
|"EXTRACTED"                   |
--------------------------------
|{                             |
|  "error": null,              |
|  "response": {               |
|    "amount": "USD $950.00",  |
|    "date": "Nov 26, 2016"    |
|  }                           |
|}                             |
--------------------------------
