snowflake.snowpark.DataFrameAIFunctions.split_text_recursive_character
- DataFrameAIFunctions.split_text_recursive_character(text_to_split: Union[snowflake.snowpark.column.Column, str], format: Literal['none', 'markdown'], chunk_size: Union[int, Column], *, overlap: Union[int, Column] = 0, separators: Union[List[str], Column] = ('\n\n', '\n', ' ', ''), output_column: Optional[str] = None) → snowflake.snowpark.DataFrame
Split text into chunks using recursive character-based splitting.
This method splits text by recursively trying a list of separators in order, creating chunks that fit within the specified size limit. It’s useful for breaking down large documents for embedding, RAG, or search indexing.
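The recursive idea can be illustrated with a small pure-Python sketch. This is not the implementation behind this method: `recursive_split` is a hypothetical helper, it ignores `format` and `overlap`, and its merging rule (greedily pack pieces until the next one would exceed `chunk_size`) is an assumption made for illustration.

```python
from typing import List, Sequence


def recursive_split(
    text: str,
    chunk_size: int,
    separators: Sequence[str] = ("\n\n", "\n", " ", ""),
) -> List[str]:
    """Simplified recursive character splitting (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be greater than zero")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        if len(piece) > chunk_size:
            # The piece is still too large: flush what we have, then
            # retry the piece with the next, finer-grained separator.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the open chunk
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


doc = "This is a long document. It has multiple sentences.\n\nAnd multiple paragraphs."
print(recursive_split(doc, 30))
```

Because the sketch applies no overlap, its chunks start exactly where the previous chunk ended; results from the real method with a non-zero `overlap` will differ.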
- Parameters:
text_to_split – The column (Column object or column name as string) containing the text to split.
format –
The format of your input text, which determines the default separators in the splitting algorithm. Must be one of the following:
none: No format-specific separators. Only the separators in the separators field are used for splitting.
markdown: Separates on headers, code blocks, and tables, in addition to any separators in the separators field.
chunk_size – The maximum number of characters in each chunk. Must be greater than zero. Can be an integer or a Column containing integer values.
overlap – Optional number of characters to overlap between consecutive chunks. Defaults to 0 if not provided. Can be an integer or a Column.
separators – A list of separator strings to use for splitting, or a Column containing an array of separators. The function tries separators in order until it finds one that produces appropriately sized chunks. Defaults to ["\n\n", "\n", " ", ""].
output_column – The name of the output column to be appended. If not provided, a column named SPLIT_TEXT_RECURSIVE_CHARACTER_OUTPUT is appended.
- Returns:
A new DataFrame with an appended output column containing an array of text chunks.
Examples:
>>> # Basic text splitting without format
>>> df = session.create_dataframe([
...     ["This is a long document. It has multiple sentences.\n\nAnd multiple paragraphs."],
... ], schema=["text"])
>>> result_df = df.ai.split_text_recursive_character(
...     text_to_split="text",
...     format="none",
...     chunk_size=30,
...     overlap=5,
...     output_column="chunks"
... )
>>> result_df.show()
-----------------------------------------------------------------------------------------
|"TEXT"                                              |"CHUNKS"                          |
-----------------------------------------------------------------------------------------
|This is a long document. It has multiple senten...  |[                                 |
|                                                    |  "This is a long document. It",  |
|And multiple paragraphs.                            |  "It has multiple sentences.",   |
|                                                    |  "And multiple paragraphs."      |
|                                                    |]                                 |
-----------------------------------------------------------------------------------------

>>> # Split markdown formatted text
>>> from snowflake.snowpark.functions import col
>>> markdown_text = "# Title\n\n## Subtitle\n\nMore content."
>>> df = session.create_dataframe([
...     [markdown_text],
... ], schema=["text"])
>>> result_df = df.ai.split_text_recursive_character(
...     text_to_split=col("text"),
...     format="markdown",
...     chunk_size=25,
...     overlap=3,
...     output_column="md_chunks"
... )
>>> result_df.show()
-------------------------------------
|"TEXT"         |"MD_CHUNKS"         |
-------------------------------------
|# Title        |[                   |
|               |  "# Title",        |
|## Subtitle    |  "## Subtitle",    |
|               |  "More content."   |
|More content.  |]                   |
-------------------------------------

>>> # Custom separators with code
>>> df = session.create_dataframe([
...     ["def hello():\n print('Hello')\n\ndef world():\n print('World')"],
... ], schema=["code"])
>>> result_df = df.ai.split_text_recursive_character(
...     text_to_split="code",
...     format="none",
...     chunk_size=30,
...     separators=["\n\n", "\n", " ", " ", ""],
...     output_column="code_chunks"
... )
>>> result_df.show()
--------------------------------------------
|"CODE"              |"CODE_CHUNKS"         |
--------------------------------------------
|def hello():        |[                     |
| print('Hello')     |  "def hello():",     |
|                    |  "print('Hello')",   |
|def world():        |  "def world():",     |
| print('World')     |  "print('World')"    |
|                    |]                     |
--------------------------------------------

>>> # Custom separators
>>> df = session.create_dataframe([
...     ["First sentence. Second sentence. Third sentence.", "none", 15, 3],
... ], schema=["text", "fmt", "max_size", "overlap_size"])
>>> result_df = df.ai.split_text_recursive_character(
...     text_to_split=col("text"),
...     format=col("fmt"),
...     chunk_size=col("max_size"),
...     overlap=col("overlap_size"),
...     separators=[". ", " ", ""],
...     output_column="split_text"
... )
>>> result_df.select("text", "split_text").show()
--------------------------------------------------------------------------
|"TEXT"                                            |"SPLIT_TEXT"          |
--------------------------------------------------------------------------
|First sentence. Second sentence. Third sentence.  |[                     |
|                                                  |  "First sentence",   |
|                                                  |  ". Second",         |
|                                                  |  "sentence",         |
|                                                  |  ". Third",          |
|                                                  |  "sentence."         |
|                                                  |]                     |
--------------------------------------------------------------------------
Note
- The function tries separators in the order provided.
- If no separator produces small enough chunks, it splits by individual characters.
- Overlap helps maintain context between chunks, which is useful for embedding and retrieval.
- Choose separators appropriate for your content type (e.g., paragraphs for prose, functions for code).
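To make the effect of overlap concrete, here is a minimal pure-Python sketch of fixed-size character windows that share `overlap` characters with their predecessor. It only illustrates the concept: `windows_with_overlap` is a hypothetical helper, it ignores separators entirely, and the requirement that `overlap` be smaller than `chunk_size` is an assumption of the sketch rather than a documented rule of this method.

```python
from typing import List


def windows_with_overlap(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Cut text into character windows that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be greater than zero")
    if not 0 <= overlap < chunk_size:
        # Assumption of this sketch: the window must advance on every step.
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


print(windows_with_overlap("abcdefghij", chunk_size=4, overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a phrase that straddles a chunk boundary still appears whole in at least one chunk, at the cost of producing more chunks for the same text.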
This function or method is experimental since 1.39.0.