Resource definition and ingestion processes

Once the connector is configured, it can start ingesting data. Usually, however, more information is needed before it can ingest data from the source system. Most systems store data with at least some granularity, be it tables, repositories, files, or reports. The Snowflake Native SDK for Connectors uses the term resource for these units, regardless of what they are called in the source system. Resources are identified, and their ingestion settings customized, through resource_ingestion_definitions. The ingestion itself is organized into ingestion_processes, each of which consists of multiple ingestion_runs. This abstraction makes it easier to track, schedule, and differentiate processes.

Requirements

This section requires at least the following SQL files to be executed during native app installation:

  • ingestion/create_resource.sql

  • ingestion/ingestion_definitions_view.sql

  • ingestion/ingestion_process.sql

  • ingestion/ingestion_run.sql

  • ingestion/resource_ingestion_definition.sql

Resource ingestion definition

A resource ingestion definition is a generic entity that describes a unit of source data in the source system. To keep it as generic as possible, the system-specific options are persisted as variants in the underlying STATE.RESOURCE_INGESTION_DEFINITION table. The Java repository, ResourceIngestionDefinitionRepository, is nevertheless a generic interface, which gives better control over typing.

Because most of the resource ingestion definition can be customized during implementation, it is up to the developer to decide how to use the generic fields and how to make use of them during ingestion.

The most important customizable properties of the resource ingestion definition are:

  • parent_id

This optional parameter allows linking resource definitions with each other, for example, to inherit a part of the configuration.

  • resource_id

This variant should identify a resource in the source system. It should be unique.

  • ingestion_configurations

This property holds the actual configuration of the ingestion. Each definition can have multiple configurations, for example when the same resource should be ingested on two different schedules or saved into multiple sink tables. The property has some required fields, but still leaves room for defining custom configuration and the destination of the data.

  • resource_metadata

This property should contain any additional information that is needed but does not fit into the fields mentioned above.
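The properties above can be pictured as a generic Java type whose type parameters stand in for the connector-specific shapes that are stored as variants. The following is only a minimal sketch under assumptions: the classes RepoId, ResourceDefinitionSketch, and ResourceDefinitionDemo, as well as the keys destinationTable and owner and all sample values, are hypothetical and are not part of the SDK.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical identifier type for a connector that ingests repositories. */
final class RepoId {
    final String organization;
    final String repository;

    RepoId(String organization, String repository) {
        this.organization = organization;
        this.repository = repository;
    }
}

/**
 * Simplified, hypothetical model of a resource ingestion definition.
 * I, C and M are the connector-specific types that end up as variants in storage.
 */
final class ResourceDefinitionSketch<I, C, M> {
    final String id;
    final String parentId;                 // optional link to another definition, may be null
    final I resourceId;                    // identifies the resource in the source system
    final List<C> ingestionConfigurations; // one definition, possibly many configurations
    final M resourceMetadata;              // anything that fits nowhere else, may be null

    ResourceDefinitionSketch(String id, String parentId, I resourceId,
                             List<C> ingestionConfigurations, M resourceMetadata) {
        this.id = id;
        this.parentId = parentId;
        this.resourceId = resourceId;
        this.ingestionConfigurations = List.copyOf(ingestionConfigurations);
        this.resourceMetadata = resourceMetadata;
    }
}

final class ResourceDefinitionDemo {
    /** Builds one definition with two configurations that target two hypothetical sink tables. */
    static ResourceDefinitionSketch<RepoId, Map<String, String>, Map<String, String>> example() {
        return new ResourceDefinitionSketch<>(
                "definition-1",
                null, // no parent definition
                new RepoId("example-org", "example-repo"),
                List.of(Map.of("destinationTable", "ISSUES_HOURLY"),
                        Map.of("destinationTable", "ISSUES_DAILY")),
                Map.of("owner", "data-team"));
    }
}
```

Keeping the three custom parts as type parameters means each connector can choose its own identifier, configuration, and metadata classes while the surrounding structure stays the same.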

Adding new resource

To add a new resource to the system, call the PUBLIC.CREATE_RESOURCE(name VARCHAR, resource_id VARIANT, ingestion_configurations VARIANT, id VARCHAR DEFAULT NULL, enabled BOOLEAN DEFAULT FALSE, resource_metadata VARIANT DEFAULT NULL) procedure. The provided resource is saved in the STATE.RESOURCE_INGESTION_DEFINITION table, and a corresponding ingestion_process is created if the resource is enabled at creation time. This procedure uses the CreateResourceHandler Java class as its entry point.
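As an illustration, the call can be assembled from client code. The sketch below only builds the text of a CALL statement for the three required arguments, in the order given by the signature above, and wraps the JSON payloads in Snowflake's PARSE_JSON function to produce variants. The class CreateResourceCall and its methods are hypothetical helpers, and the JSON contents are left to the caller because the required fields of ingestion_configurations are not listed here. In real code, bind parameters are usually preferable to string concatenation.

```java
final class CreateResourceCall {

    /** Wraps a value in single quotes, escaping backslashes and single quotes for a SQL string literal. */
    static String sqlLiteral(String value) {
        String escaped = value.replace("\\", "\\\\").replace("'", "''");
        return "'" + escaped + "'";
    }

    /**
     * Builds a CALL statement with the three required arguments in positional order:
     * name, resource_id, ingestion_configurations. The optional arguments keep their defaults.
     */
    static String build(String name, String resourceIdJson, String ingestionConfigurationsJson) {
        return "CALL PUBLIC.CREATE_RESOURCE("
                + sqlLiteral(name) + ", "
                + "PARSE_JSON(" + sqlLiteral(resourceIdJson) + "), "
                + "PARSE_JSON(" + sqlLiteral(ingestionConfigurationsJson) + "))";
    }
}
```

Because enabled defaults to FALSE, a resource created this way does not get an ingestion process until it is enabled.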

Viewing resources

The configured resource definitions can be viewed through the provided view PUBLIC.INGESTION_DEFINITIONS. However, it returns only basic information about each resource. Custom configurations are not visible to the end user, especially because some of them can be generated internally by the connector’s logic.

Ingestion process

An ingestion process is an entity that represents the enabled process of ingesting a defined resource. It is created once a resource is added or enabled, and it should be completed once the resource is deleted or disabled. It is comparable to a background process in an operating system: it can be alive without doing any work at a particular moment. While the ingestion is actually running, the process can be moved to the IN_PROGRESS state; otherwise it can remain in the SCHEDULED state. When dispatching work, the scheduler retrieves all SCHEDULED processes and runs ingestion for them.
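The lifecycle described above can be sketched as a small state machine. This is a simplified illustration, not the SDK's scheduler: ProcessStatus, ProcessSketch, and SchedulerSketch are hypothetical names, and COMPLETED is an assumed name for the state a process reaches when its resource is disabled or deleted. Only SCHEDULED and IN_PROGRESS are taken from the text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** SCHEDULED and IN_PROGRESS come from the text; COMPLETED is an assumed name. */
enum ProcessStatus { SCHEDULED, IN_PROGRESS, COMPLETED }

/** Hypothetical, minimal stand-in for an ingestion process. */
final class ProcessSketch {
    final String id;
    ProcessStatus status;

    ProcessSketch(String id, ProcessStatus status) {
        this.id = id;
        this.status = status;
    }
}

final class SchedulerSketch {
    /**
     * Runs the given ingestion for every SCHEDULED process. A process is IN_PROGRESS
     * only while its ingestion runs and returns to SCHEDULED afterwards, so it stays
     * alive for the next dispatch. Returns the ids of the processes that were run.
     */
    static List<String> dispatch(List<ProcessSketch> processes, Consumer<ProcessSketch> ingestion) {
        List<String> dispatched = new ArrayList<>();
        for (ProcessSketch process : processes) {
            if (process.status != ProcessStatus.SCHEDULED) {
                continue; // already running, or completed because the resource was disabled or deleted
            }
            process.status = ProcessStatus.IN_PROGRESS;
            try {
                ingestion.accept(process);
                dispatched.add(process.id);
            } finally {
                process.status = ProcessStatus.SCHEDULED;
            }
        }
        return dispatched;
    }
}
```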

The ingestion process can also be used to define different types of ingestion. Suppose, for example, that the connector loads some data on a daily basis, but some old data has become corrupted and needs to be reloaded. In that case a new process type can be introduced, for example RELOAD. The scheduler can then apply custom logic to perform different operations for different types of processes.
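One way to picture type-specific scheduler logic is a lookup from process type to handler. The sketch below is an assumption-heavy illustration: TypedDispatchSketch, the DEFAULT type name, and the returned descriptions are hypothetical. Only the RELOAD type name is taken from the text.

```java
import java.util.Map;
import java.util.function.Function;

final class TypedDispatchSketch {

    /** Maps a process type to the operation planned for a resource of that type. */
    static final Map<String, Function<String, String>> HANDLERS = Map.of(
            "DEFAULT", resource -> "load new data for " + resource,          // hypothetical type name
            "RELOAD", resource -> "drop and reload all data for " + resource);

    /** Returns a description of the work for the given process type, or fails for unknown types. */
    static String plan(String processType, String resource) {
        Function<String, String> handler = HANDLERS.get(processType);
        if (handler == null) {
            throw new IllegalArgumentException("Unknown process type: " + processType);
        }
        return handler.apply(resource);
    }
}
```

Failing loudly on an unknown type keeps a newly introduced process type from being silently treated as a regular load.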

Ingestion run

An ingestion run is another entity that stores information about past and ongoing ingestion. Its data is more granular than that of the ingestion_process itself. First, ingestion runs should be treated as log data. Second, an ingestion_run is an entry that describes a single invocation within a long-running process. If a resource is ingested once a day, a new ingestion run entry should be created every day, and all of those entries are linked to the same process.
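The relationship between a process and its runs can be sketched as an append-only log keyed by process id. RunEntrySketch and RunLogSketch, along with the fields startedOn and ingestedRows, are hypothetical and only illustrate the one-process-to-many-runs link described above.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical log entry for a single invocation of an ingestion process. */
final class RunEntrySketch {
    final String processId;   // links the run to its long-running process
    final LocalDate startedOn;
    final long ingestedRows;

    RunEntrySketch(String processId, LocalDate startedOn, long ingestedRows) {
        this.processId = processId;
        this.startedOn = startedOn;
        this.ingestedRows = ingestedRows;
    }
}

/** Append-only collection of runs; entries are added and never modified. */
final class RunLogSketch {
    private final List<RunEntrySketch> entries = new ArrayList<>();

    void record(String processId, LocalDate startedOn, long ingestedRows) {
        entries.add(new RunEntrySketch(processId, startedOn, ingestedRows));
    }

    /** Returns every run that belongs to the given process, in insertion order. */
    List<RunEntrySketch> runsFor(String processId) {
        List<RunEntrySketch> result = new ArrayList<>();
        for (RunEntrySketch entry : entries) {
            if (entry.processId.equals(processId)) {
                result.add(entry);
            }
        }
        return result;
    }
}
```

With a daily schedule, three days of ingestion would produce three entries for the same process id, while the process itself remains a single record.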