snowflake.snowpark.DataFrameAIFunctions.split_text_markdown_header

DataFrameAIFunctions.split_text_markdown_header(text_to_split: Union[snowflake.snowpark.column.Column, str], headers_to_split_on: Union[Dict[str, str], Column], chunk_size: Union[int, Column], *, overlap: Union[int, Column] = 0, output_column: Optional[str] = None) → snowflake.snowpark.DataFrame

Split Markdown-formatted text into structured chunks based on header levels.

This method segments text using specified Markdown headers and recursively splits each segment to produce chunks of the desired size. It preserves document structure by tracking which headers each chunk falls under.

Parameters:
  • text_to_split – The column (Column object or column name as string) containing the Markdown-formatted text to split.

  • headers_to_split_on – A dictionary mapping Markdown header syntax to metadata field names, or a Column containing such a mapping. For example: {"#": "header_1", "##": "header_2"} will split on # and ## headers.

  • chunk_size – The maximum number of characters in each chunk. Must be greater than zero. Can be an integer or a Column containing integer values.

  • overlap – Optional number of characters to overlap between consecutive chunks. Defaults to 0 if not provided. Can be an integer or a Column.

  • output_column – The name of the output column to be appended. If not provided, a column named SPLIT_TEXT_MARKDOWN_HEADER_OUTPUT is appended.

Returns:

A new DataFrame with an appended output column containing an array of objects. Each object has:

  • chunk: A string containing the extracted text

  • headers: A dictionary containing the Markdown header values under which the chunk is nested

Examples:

>>> # Split a simple Markdown document
>>> df = session.create_dataframe([
...     ["# Introduction\nThis is the intro.\n## Background\nSome background info."],
... ], schema=["document"])
>>> result_df = df.ai.split_text_markdown_header(
...     text_to_split="document",
...     headers_to_split_on={"#": "section", "##": "subsection"},
...     chunk_size=20,
...     overlap=5,
...     output_column="chunks"
... )
>>> result_df.show()
--------------------------------------------------------------
|"DOCUMENT"             |"CHUNKS"                            |
--------------------------------------------------------------
|# Introduction         |[                                   |
|This is the intro.     |  {                                 |
|## Background          |    "chunk": "This is the intro.",  |
|Some background info.  |    "headers": {                    |
|                       |      "section": "Introduction"     |
|                       |    }                               |
|                       |  },                                |
|                       |  {                                 |
|                       |    "chunk": "Some background",     |
|                       |    "headers": {                    |
|                       |      "section": "Introduction",    |
|                       |      "subsection": "Background"    |
|                       |    }                               |
|                       |  },                                |
|                       |  {                                 |
|                       |    "chunk": "info.",               |
|                       |    "headers": {                    |
|                       |      "section": "Introduction",    |
|                       |      "subsection": "Background"    |
|                       |    }                               |
|                       |  }                                 |
|                       |]                                   |
--------------------------------------------------------------

Note

  • The function preserves document hierarchy by including parent headers for each chunk

  • Chunks are created using recursive character splitting after initial header segmentation

  • Overlap helps maintain context across chunk boundaries
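The notes above can be illustrated with a small, self-contained sketch in plain Python. This is only an approximation of the behavior described here, not the actual server-side implementation: real output trims whitespace and splits recursively at natural boundaries, while this sketch uses a simple sliding character window. The function name `split_markdown` is invented for the illustration.

```python
def split_markdown(text, headers_to_split_on, chunk_size, overlap=0):
    """Sketch of the documented behavior: segment Markdown text by the
    configured headers, then chunk each segment with optional overlap,
    tagging every chunk with its active parent headers."""
    results = []
    active = []        # list of (marker, field, title), outermost first
    segment_lines = []

    def flush():
        # Chunk the accumulated segment. Simplification: a fixed-size
        # sliding window rather than recursive character splitting.
        segment = " ".join(segment_lines).strip()
        segment_lines.clear()
        start = 0
        while start < len(segment):
            results.append({
                "chunk": segment[start:start + chunk_size],
                "headers": {field: title for _, field, title in active},
            })
            if start + chunk_size >= len(segment):
                break
            start += max(chunk_size - overlap, 1)

    # Try longer markers first so "##" is not mistaken for "#".
    markers = sorted(headers_to_split_on, key=len, reverse=True)
    for line in text.splitlines():
        marker = next((m for m in markers if line.startswith(m + " ")), None)
        if marker is None:
            segment_lines.append(line)
            continue
        flush()
        # A new header closes any headers at the same or deeper level,
        # which is how parent headers are preserved per chunk.
        active[:] = [h for h in active if len(h[0]) < len(marker)]
        active.append((marker, headers_to_split_on[marker],
                       line[len(marker):].strip()))
    flush()
    return results
```

Running this on the example document from above yields three chunks: one under `{"section": "Introduction"}` and two under `{"section": "Introduction", "subsection": "Background"}`, mirroring the structure of the `CHUNKS` column shown earlier.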

This function or method is experimental since 1.39.0.