Document AI

What is Document AI

Document AI is a Snowflake AI feature that uses Arctic-TILT, a proprietary large language model (LLM), to extract data from documents. Document AI processes documents in various formats and extracts information from both text-heavy paragraphs and graphical content, such as logos, handwritten text (for example, signatures), and checkmarks. With Document AI, you can prepare pipelines for continuous processing of new documents of a specific type, such as invoices or financial statements.

Document AI provides both zero-shot extraction and fine-tuning. Zero-shot means that the foundation model can locate and extract information specific to a document type, even if the model has never seen the document before. This is because the foundation model is trained on a large volume of various documents, so the model broadly understands the type of document being processed.

Additionally, you can fine-tune the Snowflake Arctic-TILT model to improve your results by training the model on the documents specific to your use case. The fine-tuned model (including the training data used) is available only to you and is not shared with other Snowflake customers.

When to use Document AI

Document AI is best used when:

  • You want to turn unstructured data from documents into structured data in tables.

  • You want to create pipelines for continuous processing of new documents of a specific type.

  • You want business users with domain knowledge to prepare the model, while data engineers working with SQL prepare the pipelines that automate the processing of new documents.
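
To make the first point concrete, the following is a minimal Python sketch of turning one extraction result into a flat, table-ready row. The JSON shape used here (a metadata key prefixed with a double underscore, and a list of score/value answers per data value) and the names invoice_number, total, and due_date are assumptions for illustration only; check the actual output of your model build before relying on it.

```python
import json


def flatten_extraction(result_json: str) -> dict:
    """Turn one extraction result into a flat row: one column per data value."""
    result = json.loads(result_json)
    row = {}
    for name, answers in result.items():
        # Assumption: metadata keys start with a double underscore; skip them.
        if name.startswith("__"):
            continue
        # Assumption: each data value holds a list of {"score", "value"} answers.
        if not answers:
            row[name] = None
            continue
        # Keep only the highest-scoring answer for this data value.
        best = max(answers, key=lambda a: a.get("score", 0.0))
        row[name] = best.get("value")
    return row


# Hypothetical result for a single invoice document.
sample = json.dumps({
    "__documentMetadata": {"ocrScore": 0.91},
    "invoice_number": [{"score": 0.92, "value": "123/20"}],
    "total": [
        {"score": 0.40, "value": "45.00"},
        {"score": 0.88, "value": "450.00"},
    ],
    "due_date": [],
})

row = flatten_extraction(sample)
print(row)
```

Keeping only the highest-scoring answer is a simplification; a data value that is expected to return a list of values would need every answer preserved instead.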

How Document AI works

Working with Document AI is divided into two phases:

  • Preparing a Document AI model build

    You can think of the model build as representing a single type of document or a single use case; for example, a model build for extracting information from invoice documents. The Document AI model build includes the model, the data values to be extracted, and the documents uploaded to test and train the model.

    You prepare the model build through a Document AI user interface in Snowsight. The interface lets you create a model build, upload documents to test and train the model, define data values (information to be extracted) by asking questions using natural language, evaluate the model, and publish the model build or fine-tune the model to improve the results.

    For more information, see Prepare a Document AI model build.

  • Extracting information from documents

    When the model build is ready, you can begin extracting information from documents by running an extracting query, which uses the <model_build_name>!PREDICT method. You can then use the extracting query to create pipelines for continuous processing with streams and tasks.

    For more information, see Extract information with Document AI.

    Note

    The documents to be processed using the <model_build_name>!PREDICT method must be stored in an internal or external stage.
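
As an illustration of the second phase, the sketch below assembles an extracting query as a string in Python. It assumes that the PREDICT method takes a presigned URL for the staged file plus a model build version number, and that the files are listed through a directory table on the stage; verify both against the Snowflake reference before use. The names invoice_model and invoice_stage are hypothetical.

```python
import re

# Unquoted identifier, optionally qualified as database.schema.name.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){0,2}")


def build_extracting_query(model_build: str, stage: str, version: int) -> str:
    """Build a SELECT that runs <model_build>!PREDICT on every file in a stage."""
    for ident in (model_build, stage):
        # Reject anything that is not a plain identifier to avoid SQL injection.
        if not _IDENT.fullmatch(ident):
            raise ValueError(f"not a valid identifier: {ident!r}")
    return (
        "SELECT RELATIVE_PATH, "
        f"{model_build}!PREDICT("
        f"GET_PRESIGNED_URL(@{stage}, RELATIVE_PATH), {int(version)}) AS extracted "
        f"FROM DIRECTORY(@{stage})"
    )


# Hypothetical model build and stage names.
query = build_extracting_query("invoice_model", "invoice_stage", 1)
print(query)
```

The resulting string can be run in a worksheet or through a Snowflake client. For continuous processing, the same SELECT could read from a stream on the stage inside a task instead of querying the directory table directly.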

Document AI overview

To get started with Document AI, see Tutorial: Create a document processing pipeline with Document AI.

Document AI model version history

All model builds created after August 6, 2024 use a new version of the Arctic-TILT model.

Model version release date

Model version improvements

August 6, 2024

June 21, 2024

  • Extraction of lists of values

  • Checkbox identification

  • Query paraphrasing recognition, which improves the handling of queries phrased as full sentences, such as "Give me the date of the agreement"