Document AI

What is Document AI

Document AI is a Snowflake AI feature that uses Arctic-TILT, a proprietary large language model (LLM), to extract data from documents. Document AI processes documents in various formats and extracts information from both text-heavy paragraphs and graphical content, such as logos, handwritten text (signatures), or checkmarks. With Document AI, you can prepare pipelines for continuous processing of new documents of a specific type, such as invoices or financial statements.

Document AI provides both zero-shot extraction and fine-tuning. Zero-shot means that the foundation model can locate and extract information specific to a document type, even if the model has never seen the document before. This is possible because the foundation model is trained on a large and varied set of documents, so it broadly understands the type of document being processed.

Additionally, you can fine-tune the Snowflake Arctic-TILT model to improve your results by training the model on documents specific to your use case. The fine-tuned model (including the training data used) is available only to you and is not shared with other Snowflake customers.

When to use Document AI

Document AI is best used when:

  • You want to turn unstructured data from documents into structured data in tables.

  • You want to create pipelines for continuous processing of new documents of a specific type.

  • You want business users with domain knowledge to prepare the model, while data engineers working with SQL prepare pipelines to automate the processing of new documents.

How Document AI works

Document AI consists of the following:

  • A user interface to create a model build, evaluate the Document AI model using natural language, and optionally fine-tune the model to improve the results

    You can think of a model build as representing a single type of document or a single use case; for example, a model build for extracting information from invoices. The Document AI model build includes the model, the data values to be extracted, and the documents uploaded to test and train the model.

  • An extracting query, which uses the <model_build_name>!PREDICT method to extract information from documents

    You can then use the extracting query to create pipelines for continuous processing with streams and tasks.
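
The extracting query and a pipeline built on it can be sketched as follows. This is a minimal illustration, not a definitive setup: the model build name invoice_model, its version number 1, the stage invoice_docs_stage (assumed to have a directory table enabled), the table invoice_results, and the warehouse my_wh are all hypothetical names chosen for this example. GET_PRESIGNED_URL is used to pass each staged file to PREDICT.

```sql
-- Hypothetical names: invoice_model, invoice_docs_stage,
-- invoice_docs_stream, invoice_results, my_wh.

-- Extracting query: run the model build on every document in the stage.
-- The second argument (1) is the assumed model build version.
SELECT
  relative_path AS file_name,
  invoice_model!PREDICT(
    GET_PRESIGNED_URL(@invoice_docs_stage, relative_path), 1
  ) AS extracted
FROM DIRECTORY(@invoice_docs_stage);

-- Continuous processing: a stream on the stage records newly added files.
CREATE OR REPLACE STREAM invoice_docs_stream ON STAGE invoice_docs_stage;

-- Target table for the extracted results.
CREATE TABLE IF NOT EXISTS invoice_results (
  file_name STRING,
  extracted VARIANT
);

-- A task runs the extracting query on the new files only,
-- and only when the stream actually contains data.
CREATE OR REPLACE TASK process_new_invoices
  WAREHOUSE = my_wh
  SCHEDULE = '1 minute'
WHEN SYSTEM$STREAM_HAS_DATA('invoice_docs_stream')
AS
  INSERT INTO invoice_results (file_name, extracted)
  SELECT
    relative_path,
    invoice_model!PREDICT(
      GET_PRESIGNED_URL('@invoice_docs_stage', relative_path), 1
    )
  FROM invoice_docs_stream
  WHERE METADATA$ACTION = 'INSERT';

-- Tasks are created in a suspended state; resume to start the schedule.
ALTER TASK process_new_invoices RESUME;
```

Keeping the stream and the task separate means the model runs only on files that arrived since the last run, rather than on the whole stage each time. Depending on how the stage is configured, its directory table may need to be refreshed before the stream sees newly uploaded files.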

Note

The documents to be processed using the <model_build_name>!PREDICT method must be stored in an internal or external stage.
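
As a rough sketch of preparing such a stage, the statements below create an internal stage with a directory table so that staged files can be listed and passed to the extracting query. The stage name invoice_docs_stage is hypothetical, and the server-side encryption setting is an assumption about what Document AI expects for internal stages; check it against the current Snowflake documentation before relying on it.

```sql
-- Hypothetical stage name: invoice_docs_stage.
-- DIRECTORY enables a directory table, so files can be queried by path.
-- ENCRYPTION is an assumed requirement for internal stages (verify).
CREATE STAGE IF NOT EXISTS invoice_docs_stage
  DIRECTORY = (ENABLE = TRUE)
  ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');

-- After uploading documents to the stage, refresh the directory table
-- so that it reflects the current set of files.
ALTER STAGE invoice_docs_stage REFRESH;

-- List the documents that are available for processing.
SELECT relative_path, size, last_modified
FROM DIRECTORY(@invoice_docs_stage);
```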