Using the Snowpark XML RowTag Reader

To activate the Snowpark XML RowTag Reader, set the rowTag option before calling xml(): session.read.option("rowTag", "<rowtag>").xml(). Instead of loading the entire document as a single object, this mode splits the file at each element that matches the specified rowTag, loads each matching element as a separate row, and splits each row into multiple columns in a Snowpark DataFrame. The Reader is especially useful for processing only selected elements of an XML file or for ingesting large XML files in a scalable, Snowpark-native way.

Example

Consider the following sample XML file:

<library>
    <book id="1">
        <title>The Art of Snowflake</title>
        <author>Jane Doe</author>
        <price>29.99</price>
        <reviews>
            <review>
                <user>tech_guru_87</user>
                <rating>5</rating>
                <comment>Very insightful and practical.</comment>
            </review>
            <review>
                <user>datawizard</user>
                <rating>4</rating>
                <comment>Great read for data engineers.</comment>
            </review>
        </reviews>
        <editions>
            <edition year="2023" format="Hardcover"/>
            <edition year="2024" format="eBook"/>
        </editions>
    </book>

    <book id="2">
        <title>XML for Data Engineers</title>
        <author>John Smith</author>
        <price>35.50</price>
        <reviews>
            <review>
                <user>xml_master</user>
                <rating>5</rating>
                <comment>Perfect for mastering XML parsing.</comment>
            </review>
        </reviews>
        <editions>
            <edition year="2022" format="Paperback"/>
        </editions>
    </book>
</library>

Snowpark script

df = session.read.option("rowTag", "book").xml("@mystage/books.xml")

This script loads each <book> element of the XML file into its own row, with child elements (for example, <title> and <author>) automatically extracted as columns of type VARIANT.

Output

`_id`	`author`	`editions`	`price`	`reviews`	`title`
"2"	"John Smith"	`{ "edition": { "_format": "Paperback", "_year": "2022" } }`	"35.50"	`{ "review": { "comment": "Perfect for mastering XML parsing.", "rating": "5", "user": "xml_master" } }`	"XML for Data Engineers"
"1"	"Jane Doe"	`{ "edition": [ { "_format": "Hardcover", "_year": "2023" }, { "_format": "eBook", "_year": "2024" } ] }`	"29.99"	`{ "review": [ { "comment": "Very insightful and practical.", "rating": "5", "user": "tech_guru_87" }, { "comment": "Great read for data engineers.", "rating": "4", "user": "datawizard" } ] }`	"The Art of Snowflake"

Each XML element identified by rowTag becomes a row.
Each subelement within that tag becomes a column, stored as VARIANT. Nested elements are captured as nested VARIANT data.
The resulting DataFrame is flattened and columnized and behaves like any other Snowpark DataFrame.
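
The mapping from XML to nested values can be pictured with a small local sketch. It uses only the Python standard library and illustrates the shape seen in the output above (attributes prefixed with _, repeated child elements collected into an array). It is not the Reader's actual implementation and does not touch Snowflake. The helper names element_to_value and rows_for_tag are hypothetical, and the sample is a trimmed copy of the XML above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document (reviews omitted for brevity).
SAMPLE = """
<library>
    <book id="1">
        <title>The Art of Snowflake</title>
        <editions>
            <edition year="2023" format="Hardcover"/>
            <edition year="2024" format="eBook"/>
        </editions>
    </book>
    <book id="2">
        <title>XML for Data Engineers</title>
        <editions>
            <edition year="2022" format="Paperback"/>
        </editions>
    </book>
</library>
"""


def element_to_value(elem):
    """Convert an element into a nested value shaped like the sample output."""
    # Attributes become keys prefixed with "_".
    result = {f"_{name}": value for name, value in elem.attrib.items()}
    for child in elem:
        value = element_to_value(child)
        if child.tag not in result:
            result[child.tag] = value
        elif isinstance(result[child.tag], list):
            # Third or later occurrence of the same tag: extend the list.
            result[child.tag].append(value)
        else:
            # Second occurrence of the same tag: switch to a list.
            result[child.tag] = [result[child.tag], value]
    if not result:
        # Leaf element without attributes: keep its text as a string.
        return (elem.text or "").strip()
    # Simplification: text mixed with attributes or children is ignored.
    return result


def rows_for_tag(xml_text, row_tag):
    """Return one nested dict per element whose tag equals row_tag."""
    root = ET.fromstring(xml_text)
    return [element_to_value(elem) for elem in root.iter(row_tag)]


rows = rows_for_tag(SAMPLE, "book")
for row in rows:
    print(row["_id"], row["title"], row["editions"])
```

In this sketch, a single <edition> child yields an object while two <edition> children yield an array, which mirrors the difference between the two rows in the output above.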

Getting started

Install the Snowpark Python package:
```
pip install snowflake-snowpark-python
```
Upload the XML file to a Snowflake stage:
```
PUT file:///path/to/books.xml @mystage;
```

Use Snowpark to read the XML file:

df = session.read.option("rowTag", "book").xml("@mystage/books.xml")

Use DataFrame methods to transform or save the data:

from snowflake.snowpark.functions import col

df.select(col("`title`"), col("`author`")).show()
df.write.save_as_table("books_table")

Supported options

rowTag (required): The name of the XML element to extract as a row.
rowValidationXSDPath (optional): Stage path to an XSD used to validate each rowTag fragment during load.
mode (optional): By default, the file is loaded without validation. When rowValidationXSDPath is set, mode controls how invalid rows are handled:
- PERMISSIVE: Quarantines invalid rows in _corrupt_record; loads the rest.
- FAILFAST: Stops at the first invalid row and raises an error.

For more information about XML options, see snowflake.snowpark.DataFrameReader.xml.

Validate XML using XSD

To validate each rowTag fragment against an XSD during load, set the XSD path and choose a validation mode:

df = (
    session.read
    .option("rowTag", "book")
    .option("rowValidationXSDPath", "@mystage/schema.xsd")  # validates each row element
    .option("mode", "PERMISSIVE")                           # or "FAILFAST"
    .xml("@mystage/books.xml")
)

PERMISSIVE: Invalid rows are quarantined in a special _corrupt_record column; valid rows load normally.

To persist the result, write the DataFrame to a table with df.write.save_as_table("<table_name>"). The table includes all parsed columns plus an extra _corrupt_record column, which is NULL for valid rows and contains the full XML record for invalid rows (whose other columns are NULL).
+-------------------+
| _corrupt_record   |
| <book id="1"> ... |
| <book id="2"> ... |
+-------------------+

FAILFAST: The read stops at the first invalid row and raises an error.
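
The difference between the two modes can be sketched locally. The following is a simplified illustration in plain Python, not the Reader's implementation: the is_valid callback stands in for XSD validation, and load_rows, has_numeric_price, and the column list are hypothetical names introduced only for this example.

```python
import xml.etree.ElementTree as ET


def has_numeric_price(raw_xml):
    """Stand-in for XSD validation: the row is valid if <price> parses as a number."""
    price = ET.fromstring(raw_xml).findtext("price")
    try:
        float(price)
        return True
    except (TypeError, ValueError):
        return False


def load_rows(fragments, is_valid, columns, mode="PERMISSIVE"):
    """Load row fragments, handling invalid ones according to mode."""
    if mode not in ("PERMISSIVE", "FAILFAST"):
        raise ValueError(f"unsupported mode: {mode}")
    rows = []
    for raw_xml in fragments:
        # Start with every column set to None (the analogue of NULL).
        row = dict.fromkeys(columns)
        if is_valid(raw_xml):
            elem = ET.fromstring(raw_xml)
            for name in columns:
                row[name] = elem.findtext(name)
            row["_corrupt_record"] = None
        elif mode == "FAILFAST":
            # Stop at the first invalid row.
            raise ValueError(f"invalid row: {raw_xml}")
        else:
            # PERMISSIVE: keep the raw XML, leave the other columns empty.
            row["_corrupt_record"] = raw_xml
        rows.append(row)
    return rows


fragments = [
    "<book><title>The Art of Snowflake</title><price>29.99</price></book>",
    "<book><title>Broken Book</title><price>free</price></book>",
]
for row in load_rows(fragments, has_numeric_price, ["title", "price"]):
    print(row)
```

With mode="FAILFAST", the same call raises a ValueError on the second fragment instead of returning any rows.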

Limitations

The Snowpark XML RowTag Reader has the following limitations:

It does not infer a schema; all output columns are of type VARIANT.
It supports only files stored in Snowflake stages; local files are not supported.
It is available only in the Snowpark Python library.