Prepare a Document AI model build

This topic describes how to prepare a Document AI model build.

You create and manage Document AI model builds in Snowsight. The Document AI model build represents a single type of document; for example, a model build for extracting information from invoice documents. The Document AI model build includes the model, the data values to be extracted, and the documents uploaded to test and train the model.

The Document AI model build is an instance of the DOCUMENT_INTELLIGENCE class. Snowflake provides the DOCUMENT_INTELLIGENCE class in the SNOWFLAKE.ML schema. For more information about classes, see Snowflake classes.

In Snowsight, the Document AI model build view is divided into the following tabs:

  • Build Details: View information about the model build, such as the number of documents, number of data values to extract, model accuracy, and extracting query.

  • Documents: Review the list of documents uploaded to test and train the model.

  • Values: View the list of data values to extract.

For more information about roles and privileges for Document AI, see Setting up Document AI.

Create a Document AI model build

  1. Sign in to Snowsight using an account role that is granted the SNOWFLAKE.DOCUMENT_INTELLIGENCE_CREATOR database role.

  2. In the navigation menu, select AI & ML » Document AI.

  3. Select a warehouse.

    The list of existing model builds appears.

  4. Select + Build.

  5. In the dialog that appears, enter a name for your model build, select its location (database and schema), and then select Create.

    The model build is created.

Note

  • Document AI does not support double quotes around identifiers for the database and schema.

  • Document AI does not support altering a database or a schema where the model build is located.
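Because quoted identifiers are not supported for the location, it can help to check candidate database and schema names before you create the model build. The following is a minimal sketch, assuming Snowflake's general rule for unquoted identifiers (the name starts with a letter or underscore and contains only letters, digits, underscores, and dollar signs). The function name is hypothetical, the check does not cover reserved keywords or length limits, and Document AI might apply further restrictions that are not covered here.

```python
import re

# Assumption: Snowflake's general rule for unquoted identifiers.
_UNQUOTED_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*")


def is_unquoted_identifier(name: str) -> bool:
    """Return True if the name matches the unquoted-identifier pattern."""
    return _UNQUOTED_IDENTIFIER.fullmatch(name) is not None


# Hypothetical candidate names for the model build location.
for candidate in ["DOC_AI_DB", "invoices_2024", "my-schema", "2024_docs", '"Quoted"']:
    print(candidate, is_unquoted_identifier(candidate))
```

Names that fail this check would need double quotes in SQL, so they are best avoided for the database and schema that hold the model build.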

Delete a Document AI model build

Attention

When you delete the Document AI model build, you delete the model and all uploaded documents used to train the model. Before you delete a model build, ensure that it isn’t part of a document processing pipeline. If you delete a model build used in a document processing pipeline, the pipeline will fail.

Snowflake does not keep any model build data, so deleted model builds and training data cannot be recovered; they must be recreated.

To delete a Document AI model build, including the documents uploaded to the model build:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Document AI.

  3. Select a warehouse.

  4. Select the (more) menu next to the model build name, and then select Delete.

  5. To confirm deletion, in the Delete Build dialog, select Delete.

Upload documents to a Document AI model build

To test and train the Document AI model, manually add the documents to your model build in Snowsight.

Note

Before you upload documents to the model build, ensure that the documents meet the requirements listed in Prepare your documents for Document AI.

To upload documents to an existing Document AI model build:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Document AI.

  3. Select a warehouse.

  4. From the list of model builds, select the name of the build to add documents to.

  5. Select the Build Details tab.

  6. Select Upload documents.

  7. Select Browse, or drag the documents into the dialog.

  8. Select Upload.

After you upload a document, you can view its status in the Documents tab.

The document can have one of the following statuses:

  • Processing: The document is being processed by OCR.

  • To review: The OCR process was successful and you can now review the document.

  • In progress: The review is in progress, meaning that you have at least one value defined for this document.

  • Accepted: You reviewed the document and accepted all values.

  • Error: An error occurred during OCR.

Delete documents from a Document AI model build

Attention

You can’t delete documents that were used for training.

When you delete a document, you also delete the reviewed data values in that document.

To delete documents from a Document AI model build:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Document AI.

  3. Select a warehouse.

  4. From the list of model builds, select the name of the model build.

  5. Select the Documents tab.

  6. Select the (more) menu next to the document name, and then select Delete.

  7. To confirm deletion, in the Delete Document dialog, select Delete.

Define values for a Document AI model build

Data values are the information you want to extract from documents. A value consists of a value name and a question asked in natural language. For more information about optimizing questions for the model, see Question optimization for extracting information with Document AI.

To define values for the Document AI model build:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Document AI.

  3. Select a warehouse.

  4. From the list of model builds, select the name of the model build to define values for.

  5. Select the Build Details tab.

  6. Select Define values.

  7. In the Documents review view, select + Value.

  8. For each value, enter a value name and a question.

After you define a value, the model returns an answer to the question together with a confidence score. The confidence score describes how confident the model is that the answer is correct. For example, a confidence score of 0.9 means that the model is 90% confident that the answer is correct.
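One way to use confidence scores is to flag low-confidence answers for closer manual review. The following is a minimal sketch with made-up sample data. The Answer type, the function name, and the 0.9 threshold are all hypothetical choices for illustration and are not part of Document AI.

```python
from typing import Dict, List, NamedTuple, Optional


class Answer(NamedTuple):
    value: Optional[str]  # extracted text, or None if the model returned nothing
    score: float          # confidence score between 0 and 1


def needs_manual_review(answers: Dict[str, Answer], threshold: float = 0.9) -> List[str]:
    """Return the value names whose confidence score is below the threshold."""
    return sorted(name for name, answer in answers.items() if answer.score < threshold)


# Made-up answers for three hypothetical values.
answers = {
    "invoice_number": Answer("INV-1042", 0.97),
    "total_amount": Answer("1,250.00", 0.62),
    "due_date": Answer(None, 0.41),
}
print(needs_manual_review(answers))  # ['due_date', 'total_amount']
```

A suitable threshold depends on how costly a wrong value is in your use case; a high score does not remove the need to review answers.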

Review answers and evaluate results

Before you use the Document AI model to extract information or decide to train the model through fine-tuning, you need to review the answers that the model provides.

When you review the answers, you might encounter the following scenarios:

| Returned answer | User action |
|-----------------|-------------|
| Correct | Select the checkmark. Confirm only the answers that are fully correct. |
| Incorrect | Enter the correct value manually. To review the value provided by the model after you manually changed the value, select the down arrow. |
| List of answers | To remove answers from the list or to add more answers, select the (more) menu. |
| None | If the document contains the answer, enter the value manually. If the document does not contain the answer, confirm the empty response by selecting the checkmark. |

Tip

You can re-order the answers by dragging them.

Evaluate a Document AI model

To evaluate a Document AI model (either the foundation model or the fine-tuned model), analyze the accuracy. Accuracy describes how often the model provides a correct answer. A higher accuracy indicates that the model is better at extraction. To see the accuracy, review the answers to all questions.
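As a rough illustration of the idea, accuracy can be thought of as the share of reviewed answers that were confirmed as correct. The sketch below uses made-up review outcomes and a hypothetical function name; it is not necessarily the exact calculation behind the Model accuracy figure in Snowsight.

```python
from typing import List


def accuracy(review_outcomes: List[bool]) -> float:
    """Fraction of reviewed answers that were confirmed as correct."""
    if not review_outcomes:
        raise ValueError("no reviewed answers")
    return sum(review_outcomes) / len(review_outcomes)


# Made-up review results: True = answer confirmed, False = answer corrected.
outcomes = [True, True, False, True, True, True, False, True]
print(f"{accuracy(outcomes):.2f}")  # 0.75
```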

To view the accuracy:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Document AI.

  3. Select a warehouse.

  4. From the list of model builds, select the name of the model build to evaluate.

  5. Select the Build Details tab, which displays Model accuracy.

If the Document AI model reliably answers your questions and the accuracy is satisfactory, publish the model build. See Publish a Document AI model build.

To improve the results of the Document AI model, train the model. See Train a Document AI model.

Tip

To evaluate the Document AI model after training, review the newly uploaded documents.

Publish a Document AI model build

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Document AI.

  3. Select a warehouse.

  4. From the list of model builds, select the name of the model build to publish.

  5. Select the Build Details tab.

  6. Under Model accuracy, select Publish version.

  7. In the dialog that appears, select Publish to confirm.

After you publish the model build, you can see an extracting query.

If you added new data values (asked new questions) after you trained the model or published the model build, you must publish the model build again.

Train a Document AI model

The foundation model (the Snowflake Arctic-TILT model) is already pre-trained and fine-tuned, but you can improve its accuracy on your documents through supervised fine-tuning. During training, the parameters of the foundation model are adapted to the documents and annotations that you provide. All models that result from training are saved in the model build within your account.

Snowflake recommends reviewing the results for at least 20 documents before training.

Tip

To assess the quality of the model, split your documents into two sets. Review one set of documents, and use the unreviewed documents to assess the model after training.
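The split can be prepared before you upload anything. The following is a minimal sketch, assuming a random split of file names into a set to review and a held-out set; the function name, the 50/50 ratio, and the file names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_documents(
    names: Sequence[str], review_fraction: float = 0.5, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle document names and split them into (to_review, held_out)."""
    if not 0.0 < review_fraction < 1.0:
        raise ValueError("review_fraction must be between 0 and 1")
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = round(len(shuffled) * review_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names.
documents = [f"invoice_{i:03d}.pdf" for i in range(1, 41)]
to_review, held_out = split_documents(documents)
print(len(to_review), len(held_out))  # 20 20
```

Shuffling before the split helps both sets cover the same variety of layouts, so results on the held-out set are more representative.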

To start training the model:

  1. Sign in to Snowsight.

  2. In the navigation menu, select AI & ML » Document AI.

  3. Select a warehouse.

  4. From the list of model builds, select the name of the model build to train.

  5. Select the Build Details tab.

  6. Under Model accuracy, select Train model.

  7. In the dialog that appears, select Start training to confirm.

When the training is complete, a notification appears.

You can now re-evaluate your Document AI model. To see the accuracy of the fine-tuned model, review the second (previously unreviewed) set of documents. You can fine-tune the model multiple times until the results are satisfactory.

You do not need to publish the model build if you trained the model and did not add new data values (ask new questions) after the training.

Note

You can start trainings for multiple model builds at the same time. The trainings are queued, and no more than three trainings run at the same time.

Training time estimation

Training time of a Document AI model depends on both the number of values to be extracted and the number of pages in a document.

The following table lists the estimated training time for a batch of 20 documents (the minimum number required for training) and 10 values, depending on the number of pages in each document.

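| Number of pages in each document | Estimated training time for 20 documents (hours) |
|----------------------------------|--------------------------------------------------|
| 1                                | 0.5                                              |
| 10                               | 1.5                                              |
| 25                               | 4                                                |
| 50                               | 8                                                |
| 75                               | 12.5                                             |
| 100                              | 16.5                                             |
| 125                              | 20.5                                             |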

Note

The table lists estimates; the actual training time might vary. Generally, doubling the number of values or the number of documents doubles the training time.

The maximum training time is 48 hours. If the amount of your data would cause training to exceed that limit, you cannot start the training.
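A rough estimate for other batch sizes can be derived from the table and the doubling rule. The sketch below assumes linear interpolation between table rows and linear scaling with the number of documents and values. Both are simplifications, the function names are hypothetical, and the result is only an approximation of the actual training time.

```python
from bisect import bisect_left

# (pages per document, estimated hours) for 20 documents and 10 values,
# taken from the table above.
BASELINE = [(1, 0.5), (10, 1.5), (25, 4.0), (50, 8.0), (75, 12.5), (100, 16.5), (125, 20.5)]
MAX_TRAINING_HOURS = 48.0


def baseline_hours(pages: float) -> float:
    """Linearly interpolate between table rows (assumption: linear in between)."""
    xs = [p for p, _ in BASELINE]
    if pages < xs[0] or pages > xs[-1]:
        raise ValueError("page count is outside the range covered by the table")
    if pages == xs[0]:
        return BASELINE[0][1]
    i = bisect_left(xs, pages)
    (x0, y0), (x1, y1) = BASELINE[i - 1], BASELINE[i]
    return y0 + (y1 - y0) * (pages - x0) / (x1 - x0)


def estimate_hours(pages: float, documents: int = 20, values: int = 10) -> float:
    """Scale the baseline linearly with documents and values (doubling rule)."""
    return baseline_hours(pages) * (documents / 20) * (values / 10)


# Hypothetical batches: (pages per document, documents, values).
for pages, documents, values in [(50, 40, 10), (100, 40, 20)]:
    hours = estimate_hours(pages, documents, values)
    verdict = "exceeds" if hours > MAX_TRAINING_HOURS else "is within"
    print(f"{hours:.1f} h {verdict} the {MAX_TRAINING_HOURS:.0f}-hour limit")
```

If the estimate exceeds the limit, reducing the number of documents or values in the batch brings it down proportionally under these assumptions.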