Fine-tuning arctic-extract models¶
You can now fine-tune arctic-extract models using the Snowflake Cortex Fine-tuning
function and Snowflake Datasets. The fine-tuned model can then be used for inference with the
AI_EXTRACT function.
Syntax¶
For specific syntax, usage notes, and examples, see the following sections.
FINETUNE ('CREATE') (SNOWFLAKE.CORTEX)¶
Creates a fine-tuning job.
Syntax¶
SNOWFLAKE.CORTEX.FINETUNE(
  'CREATE',
  '@<database>.<schema>.<model_name>',
  'arctic-extract',
  '<training_dataset>'
  [ , '<validation_dataset>' ]
)
Required parameters¶
'CREATE'
Specifies that you want to create a fine-tuning job.
'@<database>.<schema>.<model_name>'
The name of the fine-tuned model to create.
'arctic-extract'
The base model to fine-tune, in this case arctic-extract.
'<training_dataset>'
Dataset object to use for training. For more information, see Dataset requirements.
Optional parameters¶
'<validation_dataset>'
Dataset object to use for validation. For more information, see Dataset requirements.
Note
The options parameter is not supported for fine-tuning arctic-extract models. The number of epochs is automatically
determined by the system.
Access control requirements¶
| Privilege | Object | Notes |
|---|---|---|
| USAGE or OWNERSHIP | DATABASE | The database that the Dataset object is stored in. |
| USAGE or OWNERSHIP | SCHEMA | The schema that the Dataset object is stored in. |
| READ or OWNERSHIP | STAGE | The stage that the document files are stored in. |
| USAGE or OWNERSHIP | SCHEMA | The schema that the fine-tuned model is stored in. |
| CREATE MODEL | SCHEMA | The schema that the fine-tuned model is stored in. |
Additionally, to use the FINETUNE function, the ACCOUNTADMIN role must grant the SNOWFLAKE.CORTEX_USER database role to the user who will call the function. For details, see the LLM functions required privileges topic.
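For example, an ACCOUNTADMIN can run the following, assuming the calling user holds a hypothetical role named my_role:
GRANT DATABASE ROLE SNOWFLAKE.CORTEX_USER TO ROLE my_role;  -- my_role is illustrative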
Example¶
SELECT SNOWFLAKE.CORTEX.FINETUNE(
  'CREATE',
  '@database.schema.model_name',
  'arctic-extract',
  'snow://dataset/training_ds/versions/2',
  'snow://dataset/validation_ds/versions/4'
);
FINETUNE ('DESCRIBE') (SNOWFLAKE.CORTEX)¶
Describes the properties of a fine-tuning job.
For syntax and parameters, see FINETUNE ('DESCRIBE') (SNOWFLAKE.CORTEX).
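For example, you can pass the job ID returned by the FINETUNE ('CREATE') call:
SELECT SNOWFLAKE.CORTEX.FINETUNE(
  'DESCRIBE',
  'ft_6556e15c-8f12-4d94-8cb0-87e6f2fd2299'
);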
An example output for a successful job when fine-tuning an arctic-extract model:
{
  "base_model": "arctic-extract",
  "created_on": 1717004388348,
  "finished_on": 1717004691577,
  "id": "ft_6556e15c-8f12-4d94-8cb0-87e6f2fd2299",
  "model": "mydb.myschema.my_tuned_model",
  "progress": 1.0,
  "status": "SUCCESS",
  "training_data": "snow://dataset/training_ds/versions/2",
  "trained_tokens": 2670734,
  "training_result": {
    "validation_loss": 1.0138969421386719,
    "training_loss": 0.6477728401547047
  },
  "validation_data": "snow://dataset/validation_ds/versions/4"
}
Dataset requirements¶
The Dataset used for training and validation must contain the following columns:
- File: A string containing the file path to the document for extraction. For example: @db.schema.stage/file.pdf
- Prompt: Key and question pairs for extraction in one of the formats supported by the responseFormat argument of the AI_EXTRACT function. For more information, see AI_EXTRACT.
- Response: A JSON object containing key and response pairs.
Note
Column names are case-insensitive and can be in any order in the Dataset; however, all required columns
(File, Prompt, and Response) must be present for the Dataset to be valid. Additional columns in the Dataset are ignored.
When preparing the Dataset, note the following:
- The schema of the fine-tuned model is the unique set of all questions in the Dataset.
- The answers in the Response column should match the questions in the Prompt column by matching keys in the Prompt and Response columns. You don't have to specify the same set of questions for every document.
- To improve model accuracy, add a prompt and response row for each question, even if the model's default response is correct. This confirms that the default answer is accurate.
For more information about Datasets, see Snowflake Datasets.
Example Dataset¶
The following shows the Prompt and Response values for one row of an example Dataset. The File column of the row contains the stage path to the corresponding document.
Prompt:
{
  "schema": {
    "type": "object",
    "properties": {
      "deductions": {
        "description": "Deductions",
        "type": "object",
        "properties": {
          "deductions_name": {
            "type": "array"
          },
          "current": {
            "type": "array"
          }
        }
      }
    }
  }
}
Response:
{
  "deductions": {
    "deductions_name": [
      "Federal Tax",
      "Wyoming State Tax",
      "SDI",
      "Soc Sec / OASDI",
      "Health Insurance Tax",
      "None"
    ],
    "current": [
      "82.50",
      "64.08",
      "None",
      "13.32",
      "91.74",
      "21.46"
    ]
  }
}
Note
When you create the Dataset, set the response to None if the document does not contain an answer to the question.
Usage notes¶
Snowflake recommends using at least 20 documents for fine-tuning.
Supported file formats for documents are:
- PDF
- PNG
- JPG, JPEG
- TIFF, TIF
The maximum number of pages per document is:
- 64 pages for AWS US West 2 (Oregon) and AWS Europe Central 1 (Frankfurt)
- 125 pages for AWS US East 1 (N. Virginia) and Azure East US 2 (Virginia)
The maximum number of unique document files in the Dataset is 1,000. You can reference the same document file multiple times.
A limit exists on the number of questions and documents in a fine-tuning job: the number of questions multiplied by the total number of pages across all document files in the Dataset must be less than or equal to 50,000.
For example, some valid combinations are:
| Number of questions | Pages per document | Number of document file references [1] |
|---|---|---|
| 10 | 1 | 5,000 |
| 100 | 1 | 500 |
| 10 | 10 | 500 |
| 25 | 10 | 200 |
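Before submitting a job, you can sanity-check this budget against your training data. The following is a minimal sketch, not part of the FINETUNE API, assuming a training table like the my_data_table used in the next section, with flat JSON prompts whose keys are the question names:
-- Count the unique question keys across all prompts; multiply the result by the
-- total page count of your documents and confirm the product is at most 50,000.
SELECT COUNT(DISTINCT q.key) AS num_questions
FROM my_data_table,
     LATERAL FLATTEN(input => PARSE_JSON(p)) q;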
Create a fine-tuning job¶
To create a fine-tuning job, you must create a Dataset object that contains the training data. The following example shows how to
create a Dataset object and use the Dataset to create a fine-tuning job for an arctic-extract model.
Create the table that will contain the training data:
CREATE OR REPLACE TABLE my_data_table (f FILE, p VARCHAR, r VARCHAR);
Populate the table with the training data:
INSERT INTO my_data_table (f, p, r)
  SELECT TO_FILE('@db.schema.stage', '1.pdf'),
         '{"net": "What is the net value?"}',
         '{"net": "3,762.56"}';
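You can add more rows the same way, and rows don't have to share the same question keys. For example, a sketch in which the file name and values are hypothetical:
INSERT INTO my_data_table (f, p, r)
  SELECT TO_FILE('@db.schema.stage', '2.pdf'),  -- hypothetical document
         '{"net": "What is the net value?", "name": "What is the employee name?"}',
         '{"net": "1,204.10", "name": "Jane Doe"}';  -- hypothetical answers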
Create the Dataset object:
CREATE OR REPLACE DATASET my_dataset;
Create a new version of the Dataset that adds the training data, using the FL_GET_STAGE and FL_GET_RELATIVE_PATH functions to get the file paths:
ALTER DATASET my_dataset ADD VERSION 'v1' FROM (
  SELECT FL_GET_STAGE(f) || '/' || FL_GET_RELATIVE_PATH(f) AS "file",
         p AS "prompt",
         r AS "response"
  FROM my_data_table
);
Create a fine-tuning job:
SELECT SNOWFLAKE.CORTEX.FINETUNE(
  'CREATE',
  'my_tuned_model',
  'arctic-extract',
  'snow://dataset/db.schema.my_dataset/versions/v1'
);
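The FINETUNE ('CREATE') call returns a job ID, such as the ft_… value shown in the FINETUNE ('DESCRIBE') example output earlier. You can pass that ID to FINETUNE ('DESCRIBE') to monitor the job's progress and status.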
Use your fine-tuned arctic-extract model for inference¶
To use the fine-tuned arctic-extract model for inference, ensure you have the following privileges on the model object:
- OWNERSHIP
- USAGE
- READ
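For example, to let another role run inference with the model, the model's owner can grant it USAGE. A sketch, assuming a hypothetical role named analyst_role:
GRANT USAGE ON MODEL db.schema.my_tuned_model TO ROLE analyst_role;  -- analyst_role is illustrative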
To use the fine-tuned arctic-extract model for inference with the AI_EXTRACT function,
specify the model using the model parameter as shown in the following example:
SELECT AI_EXTRACT(
  model => 'db.schema.my_tuned_model',
  file => TO_FILE('@db.schema.files', 'document.pdf')
);
You can override the questions used for fine-tuning by using the responseFormat parameter, as shown in the following example:
SELECT AI_EXTRACT(
  model => 'db.schema.my_tuned_model',
  file => TO_FILE('@db.schema.files', 'document.pdf'),
  responseFormat => [['name', 'What is the first name of the employee?'], ['city', 'Where does the employee live?']]
);
For more information, see AI_EXTRACT.