
snowflake.snowpark.DataFrameAIFunctions.split_text_recursive_character

DataFrameAIFunctions.split_text_recursive_character(text_to_split: Union[snowflake.snowpark.column.Column, str], format: Literal['none', 'markdown'], chunk_size: Union[int, Column], *, overlap: Union[int, Column] = 0, separators: Union[List[str], Column] = ('\n\n', '\n', ' ', ''), output_column: Optional[str] = None) → snowflake.snowpark.DataFrame

Split text into chunks using recursive character-based splitting.

This method splits text by recursively trying a list of separators in order, creating chunks that fit within the specified size limit. It’s useful for breaking down large documents for embedding, RAG, or search indexing.

Parameters:
  • text_to_split – The column (Column object or column name as string) containing the text to split.

  • format –

    The format of your input text, which determines the default separators in the splitting algorithm. Must be one of the following:

    • none: No format-specific separators. Only the separators in the separators field are used for splitting.

    • markdown: Separates on headers, code blocks, and tables, in addition to any separators in the separators field.

  • chunk_size – The maximum number of characters in each chunk. Must be greater than zero. Can be an integer or a Column containing integer values.

  • overlap – Optional number of characters to overlap between consecutive chunks. Defaults to 0 if not provided. Can be an integer or a Column.

  • separators – A list of separator strings to use for splitting, or a Column containing an array of separators. The function tries separators in order until it finds one that produces appropriately sized chunks. Defaults to ["\n\n", "\n", " ", ""].

  • output_column – The name of the output column to be appended. If not provided, a column named SPLIT_TEXT_RECURSIVE_CHARACTER_OUTPUT is appended.

Returns:

A new DataFrame with an appended output column containing an array of text chunks.
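Once rows are collected on the client, the chunk array usually needs to become a plain Python list before it can be passed to an embedding or indexing step. The following is a minimal sketch, assuming the array value arrives either as a JSON-encoded string or as an already-decoded list; `parse_chunks` is a hypothetical helper, not part of Snowpark, and the sample value is copied from the first example below.

```python
import json
from typing import Any, List


def parse_chunks(value: Any) -> List[str]:
    """Normalize a collected chunk-array value into a list of strings.

    Hypothetical helper: assumes the value is None, a JSON-encoded array
    string, or an iterable of chunks.
    """
    if value is None:
        return []
    if isinstance(value, str):
        # Assumption: the array was serialized as JSON text.
        value = json.loads(value)
    return [str(chunk) for chunk in value]


# Sample value shaped like the "CHUNKS" cell in the first example.
raw = (
    '["This is a long document. It", '
    '"It has multiple sentences.", '
    '"And multiple paragraphs."]'
)

for index, chunk in enumerate(parse_chunks(raw)):
    print(index, len(chunk), chunk)
```

Printing the length next to each chunk is a quick way to confirm that every chunk stays within the chunk_size you requested.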

Examples:

>>> # Basic text splitting without format
>>> df = session.create_dataframe([
...     ["This is a long document. It has multiple sentences.\n\nAnd multiple paragraphs."],
... ], schema=["text"])
>>> result_df = df.ai.split_text_recursive_character(
...     text_to_split="text",
...     format="none",
...     chunk_size=30,
...     overlap=5,
...     output_column="chunks"
... )
>>> result_df.show()
-----------------------------------------------------------------------------------------
|"TEXT"                                              |"CHUNKS"                          |
-----------------------------------------------------------------------------------------
|This is a long document. It has multiple senten...  |[                                 |
|                                                    |  "This is a long document. It",  |
|And multiple paragraphs.                            |  "It has multiple sentences.",   |
|                                                    |  "And multiple paragraphs."      |
|                                                    |]                                 |
-----------------------------------------------------------------------------------------


>>> # Split markdown formatted text
>>> from snowflake.snowpark.functions import col
>>> markdown_text = "# Title\n\n## Subtitle\n\nMore content."
>>> df = session.create_dataframe([
...     [markdown_text],
... ], schema=["text"])
>>> result_df = df.ai.split_text_recursive_character(
...     text_to_split=col("text"),
...     format="markdown",
...     chunk_size=25,
...     overlap=3,
...     output_column="md_chunks"
... )
>>> result_df.show()
-------------------------------------
|"TEXT"         |"MD_CHUNKS"        |
-------------------------------------
|# Title        |[                  |
|               |  "# Title",       |
|## Subtitle    |  "## Subtitle",   |
|               |  "More content."  |
|More content.  |]                  |
-------------------------------------


>>> # Custom separators with code
>>> df = session.create_dataframe([
...     ["def hello():\n    print('Hello')\n\ndef world():\n    print('World')"],
... ], schema=["code"])
>>> result_df = df.ai.split_text_recursive_character(
...     text_to_split="code",
...     format="none",
...     chunk_size=30,
...     separators=["\n\n", "\n", "    ", " ", ""],
...     output_column="code_chunks"
... )
>>> result_df.show()
--------------------------------------------
|"CODE"              |"CODE_CHUNKS"        |
--------------------------------------------
|def hello():        |[                    |
|    print('Hello')  |  "def hello():",    |
|                    |  "print('Hello')",  |
|def world():        |  "def world():",    |
|    print('World')  |  "print('World')"   |
|                    |]                    |
--------------------------------------------


>>> # Per-row parameters supplied as columns, with custom separators
>>> df = session.create_dataframe([
...     ["First sentence. Second sentence. Third sentence.", "none", 15, 3],
... ], schema=["text", "fmt", "max_size", "overlap_size"])
>>> result_df = df.ai.split_text_recursive_character(
...     text_to_split=col("text"),
...     format=col("fmt"),
...     chunk_size=col("max_size"),
...     overlap=col("overlap_size"),
...     separators=[". ", " ", ""],
...     output_column="split_text"
... )
>>> result_df.select("text", "split_text").show()
--------------------------------------------------------------------------
|"TEXT"                                            |"SPLIT_TEXT"         |
--------------------------------------------------------------------------
|First sentence. Second sentence. Third sentence.  |[                    |
|                                                  |  "First sentence",  |
|                                                  |  ". Second",        |
|                                                  |  "sentence",        |
|                                                  |  ". Third",         |
|                                                  |  "sentence."        |
|                                                  |]                    |
--------------------------------------------------------------------------

Note

  • The function tries separators in the order provided

  • If no separator produces small enough chunks, it splits by individual characters

  • Overlap helps maintain context between chunks, useful for embedding and retrieval

  • Choose separators appropriate for your content type (e.g., paragraphs for prose, functions for code)
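The separator fallback described in these notes can be illustrated in plain Python. This is a simplified sketch of the general recursive-splitting technique, not Snowflake's implementation: `split_recursive` is a hypothetical name, and the sketch ignores the overlap and format options, so its output can differ from what the method returns.

```python
from typing import List, Sequence


def split_recursive(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Illustrative recursive character splitter (no overlap, no format)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be greater than zero")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by adjacent separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the next separator.
            chunks.extend(split_recursive(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


text = "This is a long document. It has multiple sentences.\n\nAnd multiple paragraphs."
print(split_recursive(text, 30))
# ['This is a long document. It', 'has multiple sentences.', 'And multiple paragraphs.']
```

Because the sketch has no overlap, the second chunk starts at "has" rather than repeating "It" from the end of the first chunk, as the overlap=5 example above does.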

This method has been experimental since version 1.39.0.