Fine-tuning arctic-extract models¶
You can now fine-tune arctic-extract models using the Snowflake Cortex Fine-tuning
function and Snowflake Datasets. The fine-tuned model can then be used for inference with the
AI_EXTRACT function.
Syntax¶
For specific syntax, usage notes, and examples, see the following sections.
FINETUNE ('CREATE') (SNOWFLAKE.CORTEX)¶
Creates a fine-tuning job.
Syntax¶
SNOWFLAKE.CORTEX.FINETUNE(
  'CREATE',
  '@<database>.<schema>.<model_name>',
  'arctic-extract',
  '<training_dataset>'
  [ , '<validation_dataset>' ]
)
Required parameters¶
'CREATE'
Specifies that you want to create a fine-tuning job.
'@<database>.<schema>.<model_name>'
The name of the fine-tuned model to create.
'arctic-extract'
The base model to fine-tune, in this case arctic-extract.
'<training_dataset>'
Dataset object to use for training. For more information, see Dataset requirements.
Optional parameters¶
'<validation_dataset>'
Dataset object to use for validation. For more information, see Dataset requirements.
Note
The options parameter is not supported for fine-tuning arctic-extract models. The number of epochs is automatically
determined by the system.
Access control requirements¶
| Privilege | Object | Notes |
|---|---|---|
| USAGE or OWNERSHIP | DATABASE | The database that the Dataset object is stored in. |
| USAGE or OWNERSHIP | SCHEMA | The schema that the Dataset object is stored in. |
| READ or OWNERSHIP | STAGE | The stage that the document files are stored in. |
| USAGE or OWNERSHIP | SCHEMA | The schema that the fine-tuned model is stored in. |
| CREATE MODEL | SCHEMA | The schema that the fine-tuned model is stored in. |
Additionally, to use the FINETUNE function, the ACCOUNTADMIN role must grant the SNOWFLAKE.CORTEX_USER database role to the user who will call the function. For details, see the LLM functions required privileges topic.
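For example, an ACCOUNTADMIN can run the following, assuming the calling user holds a hypothetical role named my_role:
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE my_role;  -- my_role is illustrative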
Example¶
SELECT SNOWFLAKE.CORTEX.FINETUNE(
  'CREATE',
  '@database.schema.model_name',
  'arctic-extract',
  'snow://dataset/training_ds/versions/2',
  'snow://dataset/validation_ds/versions/4'
);
FINETUNE ('DESCRIBE') (SNOWFLAKE.CORTEX)¶
Describes the properties of a fine-tuning job.
For syntax and parameters, see FINETUNE ('DESCRIBE') (SNOWFLAKE.CORTEX).
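For example, you can pass the job ID returned by the FINETUNE ('CREATE') call:
SELECT SNOWFLAKE.CORTEX.FINETUNE(
  'DESCRIBE',
  'ft_6556e15c-8f12-4d94-8cb0-87e6f2fd2299'
);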
An example output for a successful job when fine-tuning an arctic-extract model:
{
  "base_model": "arctic-extract",
  "created_on": 1717004388348,
  "finished_on": 1717004691577,
  "id": "ft_6556e15c-8f12-4d94-8cb0-87e6f2fd2299",
  "model": "mydb.myschema.my_tuned_model",
  "progress": 1.0,
  "status": "SUCCESS",
  "training_data": "snow://dataset/training_ds/versions/2",
  "trained_tokens": 2670734,
  "training_result": {
    "validation_loss": 1.0138969421386719,
    "training_loss": 0.6477728401547047
  },
  "validation_data": "snow://dataset/validation_ds/versions/4"
}
Dataset requirements¶
The Dataset used for training and validation must contain the following columns:
- File: A string containing the file path to the document for extraction. For example: @db.schema.stage/file.pdf
- Prompt: Key and question pairs for extraction in one of the formats supported by the responseFormat argument of the AI_EXTRACT function. For more information, see AI_EXTRACT.
- Response: A JSON object containing key and response pairs.
Note
Column names are case-insensitive and can be in any order in the Dataset; however, all required columns
(File, Prompt, and Response) must be present for the Dataset to be valid. Additional columns in the Dataset are ignored.
When preparing the Dataset, note the following:
- The schema of the fine-tuned model is the unique set of all questions in the Dataset.
- The answers in the Response column should match the questions in the Prompt column by matching keys in the Prompt and Response columns. You don't have to specify the same set of questions for every document.
- To improve model accuracy, add a prompt and response row for each question, even if the model's default response is correct. This confirms that the default answer is accurate.
For more information about Datasets, see Snowflake Datasets.
Example Dataset¶
The following shows the Prompt and Response values for one row of an example Dataset. The File column of the row contains the stage path to the corresponding document.
Prompt:
{
  "schema": {
    "type": "object",
    "properties": {
      "deductions": {
        "description": "Deductions",
        "type": "object",
        "properties": {
          "deductions_name": {
            "type": "array"
          },
          "current": {
            "type": "array"
          }
        }
      }
    }
  }
}
Response:
{
  "deductions": {
    "deductions_name": [
      "Federal Tax",
      "Wyoming State Tax",
      "SDI",
      "Soc Sec / OASDI",
      "Health Insurance Tax",
      "None"
    ],
    "current": [
      "82.50",
      "64.08",
      "None",
      "13.32",
      "91.74",
      "21.46"
    ]
  }
}
Note
When you create the Dataset, set the response to None if the document does not contain an answer to the question.
Usage notes¶
Snowflake recommends using at least 20 documents for fine-tuning.
Supported file formats for documents are:
- PDF
- PNG
- JPG, JPEG
- TIFF, TIF
The maximum number of pages per document is:
- 64 pages for AWS US West 2 (Oregon) and AWS Europe Central 1 (Frankfurt)
- 125 pages for AWS US East 1 (N. Virginia) and Azure East US 2 (Virginia)
The maximum number of unique document files in the Dataset is 1,000. You can reference the same document file multiple times.
A limit exists on the number of questions and documents in a fine-tuning job: the number of questions multiplied by the total number of pages across all document files in the Dataset must be less than or equal to 50,000.
For example, some valid combinations are:
| Number of questions | Pages per document | Number of document file references [1] |
|---|---|---|
| 10 | 1 | 5,000 |
| 100 | 1 | 500 |
| 10 | 10 | 500 |
| 25 | 10 | 200 |
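Before submitting a job, you can sanity-check this budget against your training data. The following is a minimal sketch, not part of the FINETUNE API, assuming a training table like the my_data_table used in the next section, with flat JSON prompts whose keys are the question names:
-- Count the unique question keys across all prompts; multiply the result by the
-- total page count of your documents and confirm the product is at most 50,000.
SELECT COUNT(DISTINCT q.key) AS num_questions
FROM my_data_table,
     LATERAL FLATTEN(input => PARSE_JSON(p)) q;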
Create a fine-tuning job¶
To create a fine-tuning job, you must create a Dataset object that contains the training data. The following example shows how to
create a Dataset object and use the Dataset to create a fine-tuning job for an arctic-extract model.
Create the table that will contain the training data:
CREATE OR REPLACE TABLE my_data_table (f FILE, p VARCHAR, r VARCHAR);
Populate the table with the training data:
INSERT INTO my_data_table (f, p, r)
  SELECT TO_FILE('@db.schema.stage', '1.pdf'),
         '{"net": "What is the net value?"}',
         '{"net": "3,762.56"}';
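You can add more rows the same way, and rows don't have to share the same question keys. For example, a sketch in which the file name and values are hypothetical:
INSERT INTO my_data_table (f, p, r)
  SELECT TO_FILE('@db.schema.stage', '2.pdf'),  -- hypothetical document
         '{"net": "What is the net value?", "name": "What is the employee name?"}',
         '{"net": "1,204.10", "name": "Jane Doe"}';  -- hypothetical answers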
Create the Dataset object:
CREATE OR REPLACE DATASET my_dataset;
Create a new version of the Dataset that adds the training data, using the FL_GET_STAGE and FL_GET_RELATIVE_PATH functions to get the file paths:
ALTER DATASET my_dataset ADD VERSION 'v1' FROM (
  SELECT FL_GET_STAGE(f) || '/' || FL_GET_RELATIVE_PATH(f) AS "file",
         p AS "prompt",
         r AS "response"
  FROM my_data_table
);
Create a fine-tuning job:
SELECT SNOWFLAKE.CORTEX.FINETUNE(
  'CREATE',
  'my_tuned_model',
  'arctic-extract',
  'snow://dataset/db.schema.my_dataset/versions/v1'
);
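The FINETUNE ('CREATE') call returns a job ID, such as the ft_… value shown in the FINETUNE ('DESCRIBE') example output earlier. You can pass that ID to FINETUNE ('DESCRIBE') to monitor the job's progress and status.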
Use your fine-tuned arctic-extract model for inference¶
To use the fine-tuned arctic-extract model for inference, ensure you have the following privileges on the model object:
- OWNERSHIP
- USAGE
- READ
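For example, to let another role run inference with the model, the model's owner can grant it USAGE. A sketch, assuming a hypothetical role named analyst_role:
GRANT USAGE ON MODEL db.schema.my_tuned_model TO ROLE analyst_role;  -- analyst_role is illustrative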
To use the fine-tuned arctic-extract model for inference with the AI_EXTRACT function,
specify the model using the model parameter as shown in the following example:
SELECT AI_EXTRACT(
  model => 'db.schema.my_tuned_model',
  file => TO_FILE('@db.schema.files', 'document.pdf')
);
You can override the questions used for fine-tuning by using the responseFormat parameter, as shown in the following example:
SELECT AI_EXTRACT(
  model => 'db.schema.my_tuned_model',
  file => TO_FILE('@db.schema.files', 'document.pdf'),
  responseFormat => [['name', 'What is the first name of the employee?'], ['city', 'Where does the employee live?']]
);
For more information, see AI_EXTRACT.