snowflake.snowpark.DataFrameAIFunctions.split_text_markdown_header
- DataFrameAIFunctions.split_text_markdown_header(text_to_split: Union[snowflake.snowpark.column.Column, str], headers_to_split_on: Union[Dict[str, str], Column], chunk_size: Union[int, Column], *, overlap: Union[int, Column] = 0, output_column: Optional[str] = None) → snowflake.snowpark.DataFrame
Split Markdown-formatted text into structured chunks based on header levels.
This method segments text using specified Markdown headers and recursively splits each segment to produce chunks of the desired size. It preserves document structure by tracking which headers each chunk falls under.
- Parameters:
text_to_split – The column (Column object or column name as string) containing the Markdown-formatted text to split.
headers_to_split_on – A dictionary mapping Markdown header syntax to metadata field names, or a Column containing such a mapping. For example:
{"#": "header_1", "##": "header_2"}
will split on # and ## headers.
chunk_size – The maximum number of characters in each chunk. Must be greater than zero. Can be an integer or a Column containing integer values.
overlap – Optional number of characters to overlap between consecutive chunks. Defaults to 0 if not provided. Can be an integer or a Column.
output_column – The name of the output column to be appended. If not provided, a column named SPLIT_TEXT_MARKDOWN_HEADER_OUTPUT is appended.
- Returns:
A new DataFrame with an appended output column containing an array of objects. Each object has:
chunk: A string containing the extracted text.
headers: A dictionary containing the Markdown header values under which the chunk is nested.
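After the result is collected to the client, the array can be flattened into one record per chunk. The following is a minimal sketch, assuming the output column arrives as a JSON string with the chunk and headers fields described above. `flatten_chunks` is a hypothetical helper for illustration, not part of Snowpark, and the sample value is hand-written rather than produced by Snowflake.

```python
import json
from typing import Dict, List, Tuple


def flatten_chunks(raw_output: str) -> List[Tuple[str, Dict[str, str]]]:
    """Turn a JSON array of {"chunk", "headers"} objects into (chunk, headers) pairs.

    Hypothetical helper: assumes the collected value is a JSON string.
    """
    return [(item["chunk"], item.get("headers", {})) for item in json.loads(raw_output)]


# Hand-written sample shaped like the output column described above.
sample = json.dumps([
    {"chunk": "This is the intro.", "headers": {"section": "Introduction"}},
    {"chunk": "Some background",
     "headers": {"section": "Introduction", "subsection": "Background"}},
])

pairs = flatten_chunks(sample)
for chunk, headers in pairs:
    # Join the header values into a breadcrumb, in whatever key order the JSON provides.
    breadcrumb = " > ".join(headers.values())
    print(f"[{breadcrumb}] {chunk}")
```

Prefixing each chunk with its header breadcrumb is one way to carry section context into downstream steps such as embedding or search.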
Examples:
>>> # Split a simple Markdown document
>>> df = session.create_dataframe([
...     ["# Introduction\nThis is the intro.\n## Background\nSome background info."],
... ], schema=["document"])
>>> result_df = df.ai.split_text_markdown_header(
...     text_to_split="document",
...     headers_to_split_on={"#": "section", "##": "subsection"},
...     chunk_size=20,
...     overlap=5,
...     output_column="chunks"
... )
>>> result_df.show()
--------------------------------------------------------------
|"DOCUMENT"              |"CHUNKS"                           |
--------------------------------------------------------------
|# Introduction          |[                                  |
|This is the intro.      |  {                                |
|## Background           |    "chunk": "This is the intro.", |
|Some background info.   |    "headers": {                   |
|                        |      "section": "Introduction"    |
|                        |    }                              |
|                        |  },                               |
|                        |  {                                |
|                        |    "chunk": "Some background",    |
|                        |    "headers": {                   |
|                        |      "section": "Introduction",   |
|                        |      "subsection": "Background"   |
|                        |    }                              |
|                        |  },                               |
|                        |  {                                |
|                        |    "chunk": "info.",              |
|                        |    "headers": {                   |
|                        |      "section": "Introduction",   |
|                        |      "subsection": "Background"   |
|                        |    }                              |
|                        |  }                                |
|                        |]                                  |
--------------------------------------------------------------
Note
The function preserves document hierarchy by including the parent headers with each chunk.
Chunks are created by recursive character splitting after the initial header segmentation.
Overlap helps maintain context across chunk boundaries.
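The two stages in the note can be illustrated in plain Python. This is a rough sketch, not Snowflake's implementation: `segment_by_headers` and `window` are hypothetical names, header level is assumed to equal the number of `#` characters, and fixed-width character windows stand in for recursive character splitting. The sketch also requires `overlap` to be smaller than `chunk_size`, which is a constraint of this code only.

```python
from typing import Dict, List, Tuple


def segment_by_headers(
    text: str, headers_to_split_on: Dict[str, str]
) -> List[Tuple[str, Dict[str, str]]]:
    """Split text at the given Markdown headers, tracking the active header path."""
    # Check longer markers first so "##" is tried before "#".
    markers = sorted(headers_to_split_on, key=len, reverse=True)
    active: Dict[int, Tuple[str, str]] = {}  # header level -> (field name, title)
    segments: List[Tuple[str, Dict[str, str]]] = []
    body: List[str] = []

    def flush() -> None:
        # Emit the text collected so far together with its current header path.
        content = "\n".join(body).strip()
        if content:
            headers = {name: title for _, (name, title) in sorted(active.items())}
            segments.append((content, headers))
        body.clear()

    for line in text.splitlines():
        marker = next((m for m in markers if line.startswith(m + " ")), None)
        if marker is None:
            body.append(line)
            continue
        flush()
        level = len(marker)  # assumption: level == number of '#' characters
        # A new header closes every header at the same or a deeper level.
        for closed in [lvl for lvl in active if lvl >= level]:
            del active[closed]
        active[level] = (headers_to_split_on[marker], line[len(marker):].strip())
    flush()
    return segments


def window(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Fixed-width windows; each starts `overlap` characters before the previous one ends."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]


doc = "# Introduction\nThis is the intro.\n## Background\nSome background info."
chunks = [
    {"chunk": piece, "headers": headers}
    for content, headers in segment_by_headers(doc, {"#": "section", "##": "subsection"})
    for piece in window(content, chunk_size=20, overlap=5)
]
for item in chunks:
    print(item)
```

Because this sketch cuts at fixed character offsets rather than at natural separators, its chunk boundaries differ from those the function returns. It is meant only to show how header metadata follows each chunk and how overlap repeats the tail of one chunk at the start of the next.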
This function or method is experimental since 1.39.0.