Using the Snowpark XML RowTag Reader¶

You activate the Snowpark XML RowTag Reader by setting the rowTag option on the reader: session.read.option("rowTag", "<rowtag>").xml(). Instead of loading the entire document as a single object, this mode splits the file on the specified rowTag, loads each matching element as a separate row, and splits each row into multiple columns in a Snowpark DataFrame. The Reader is especially useful for processing only selected elements of an XML file or for ingesting large XML files in a scalable, Snowpark-native way.

Example¶

Consider the following sample XML:

<library>
    <book id="1">
        <title>The Art of Snowflake</title>
        <author>Jane Doe</author>
        <price>29.99</price>
        <reviews>
            <review>
                <user>tech_guru_87</user>
                <rating>5</rating>
                <comment>Very insightful and practical.</comment>
            </review>
            <review>
                <user>datawizard</user>
                <rating>4</rating>
                <comment>Great read for data engineers.</comment>
            </review>
        </reviews>
        <editions>
            <edition year="2023" format="Hardcover"/>
            <edition year="2024" format="eBook"/>
        </editions>
    </book>

    <book id="2">
        <title>XML for Data Engineers</title>
        <author>John Smith</author>
        <price>35.50</price>
        <reviews>
            <review>
                <user>xml_master</user>
                <rating>5</rating>
                <comment>Perfect for mastering XML parsing.</comment>
            </review>
        </reviews>
        <editions>
            <edition year="2022" format="Paperback"/>
        </editions>
    </book>
</library>


Snowpark script¶

df = session.read.option("rowTag", "book").xml("@mystage/books.xml")


This loads each <book> element of the XML file into its own row, with child elements (for example, <title> and <author>) automatically extracted as columns of type VARIANT.

Output¶

`_id`	`author`	`editions`	`price`	`reviews`	`title`
"2"	"John Smith"	`{ "edition": { "_format": "Paperback", "_year": "2022" } }`	"35.50"	`{ "review": { "comment": "Perfect for mastering XML parsing.", "rating": "5", "user": "xml_master" } }`	"XML for Data Engineers"
"1"	"Jane Doe"	`{ "edition": [ { "_format": "Hardcover", "_year": "2023" }, { "_format": "eBook", "_year": "2024" } ] }`	"29.99"	`{ "review": [ { "comment": "Very insightful and practical.", "rating": "5", "user": "tech_guru_87" }, { "comment": "Great read for data engineers.", "rating": "4", "user": "datawizard" } ] }`	"The Art of Snowflake"

Each XML element identified by rowTag becomes a row.
Each sub-element of that tag becomes a column, stored as VARIANT. Nested elements are captured as nested VARIANT data.
The resulting DataFrame is flattened and columnized and behaves like any other Snowpark DataFrame.
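
The mapping described above can be illustrated locally with the Python standard library. This is only a sketch of the row and column shape shown in the sample output, not the Snowpark implementation: the helper names `element_to_value` and `rows_for_tag` are hypothetical, and the `_` prefix for attributes and the object-versus-array rule for repeated children are inferred from the output table above.

```python
import json
import xml.etree.ElementTree as ET

# Trimmed version of the sample document above.
SAMPLE = """
<library>
    <book id="1">
        <title>The Art of Snowflake</title>
        <editions>
            <edition year="2023" format="Hardcover"/>
            <edition year="2024" format="eBook"/>
        </editions>
    </book>
    <book id="2">
        <title>XML for Data Engineers</title>
        <editions>
            <edition year="2022" format="Paperback"/>
        </editions>
    </book>
</library>
"""


def element_to_value(elem):
    """Convert an element to a string (leaf) or a nested dict (hypothetical helper)."""
    children = list(elem)
    if not children and not elem.attrib:
        return elem.text  # leaf element: keep its text as a string
    # Attributes become keys with a leading underscore, as in the sample output.
    # Simplification: text mixed with attributes or children is ignored here.
    value = {f"_{name}": attr for name, attr in elem.attrib.items()}
    for child in children:
        child_value = element_to_value(child)
        if child.tag not in value:
            value[child.tag] = child_value
        elif isinstance(value[child.tag], list):
            value[child.tag].append(child_value)
        else:
            # Second occurrence of the same tag: promote the value to a list.
            value[child.tag] = [value[child.tag], child_value]
    return value


def rows_for_tag(xml_text, row_tag):
    """Return one dict per element named row_tag (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [element_to_value(elem) for elem in root.iter(row_tag)]


rows = rows_for_tag(SAMPLE, "book")
for row in rows:
    print(json.dumps(row, sort_keys=True))
```

Each printed line corresponds to one row; the top-level keys play the role of the DataFrame columns.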

Getting started¶

Install the Snowpark Python package:
```
pip install snowflake-snowpark-python
```
Upload your XML files to a Snowflake stage:
```
PUT file:///path/to/books.xml @mystage;
```

Use Snowpark to read the XML file:

df = session.read.option("rowTag", "book").xml("@mystage/books.xml")


Use DataFrame methods to transform or save:

from snowflake.snowpark.functions import col

df.select(col("`title`"), col("`author`")).show()
df.write.save_as_table("books_table")


Supported options¶

rowTag (required): The name of the XML element to extract as a row.
rowValidationXSDPath (optional): Stage path to an XSD used to validate each rowTag fragment during load.
mode (optional): By default, rows load without validation. When rowValidationXSDPath is set:
- PERMISSIVE: Quarantines invalid rows in _corrupt_record; loads the rest.
- FAILFAST: Stops at the first invalid row and raises an error.

For more information about XML options, see snowflake.snowpark.DataFrameReader.xml.
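
The options above can also be collected in a dictionary and passed to the reader in one call. The following is a minimal sketch: `build_xml_options` is a hypothetical helper, not part of Snowpark, and rejecting any mode other than the two listed above is a choice made for this sketch rather than documented reader behavior.

```python
from typing import Dict, Optional

# The two validation modes named in the option list above.
VALID_MODES = {"PERMISSIVE", "FAILFAST"}


def build_xml_options(
    row_tag: str,
    xsd_path: Optional[str] = None,
    mode: Optional[str] = None,
) -> Dict[str, str]:
    """Build the option dictionary for the XML RowTag Reader (hypothetical helper)."""
    if not row_tag:
        raise ValueError("rowTag is required")
    options = {"rowTag": row_tag}
    if xsd_path is not None:
        options["rowValidationXSDPath"] = xsd_path
    if mode is not None:
        if mode not in VALID_MODES:
            raise ValueError(f"mode must be one of {sorted(VALID_MODES)}")
        options["mode"] = mode
    return options


# Usage with an existing Snowpark session (not created here):
# opts = build_xml_options("book", "@mystage/schema.xsd", "PERMISSIVE")
# df = session.read.options(opts).xml("@mystage/books.xml")
```

Keeping the options in one place makes it easier to reuse the same configuration across several files.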

Validate XML using XSD¶

To validate each rowTag fragment against an XSD during load, set the XSD path and choose a validation mode:

df = (
    session.read
    .option("rowTag", "book")
    .option("rowValidationXSDPath", "@mystage/schema.xsd")  # validates each row element
    .option("mode", "PERMISSIVE")                           # or "FAILFAST"
    .xml("@mystage/books.xml")
)


PERMISSIVE: Invalid rows are quarantined in a special _corrupt_record column; valid rows load normally.

To persist the result, write the DataFrame to a table with df.write.save_as_table("<table_name>"). The table will include all parsed columns plus an extra _corrupt_record column: it is NULL for valid rows and contains the full XML records for invalid rows (with the other columns showing NULL).
+-------------------+
| _corrupt_record   |
| <book id="1"> ... |
| <book id="2"> ... |
+-------------------+

FAILFAST: The read stops at the first invalid row and raises an error.
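
The difference between the two modes can be sketched in plain Python. This is an illustration of the behavior described above, not the Snowpark implementation: `load_fragments` is a hypothetical function, and the `is_valid` callback is an assumed stand-in for real XSD validation.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Optional


def load_fragments(
    fragments: List[str],
    is_valid: Callable[[ET.Element], bool],
    columns: List[str],
    mode: str = "PERMISSIVE",
) -> List[Dict[str, Optional[str]]]:
    """Mimic PERMISSIVE and FAILFAST handling of row fragments (hypothetical)."""
    if mode not in ("PERMISSIVE", "FAILFAST"):
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for fragment in fragments:
        elem = ET.fromstring(fragment)
        if is_valid(elem):
            # Valid row: parsed columns are filled, _corrupt_record stays empty.
            row = {name: elem.findtext(name) for name in columns}
            row["_corrupt_record"] = None
        elif mode == "FAILFAST":
            # FAILFAST: stop at the first invalid row.
            raise ValueError(f"invalid row: {fragment}")
        else:
            # PERMISSIVE: keep the raw XML and leave the other columns empty.
            row = {name: None for name in columns}
            row["_corrupt_record"] = fragment
        rows.append(row)
    return rows


# Assumed validation rule for the illustration: every book needs a <price>.
def has_price(elem: ET.Element) -> bool:
    return elem.find("price") is not None


fragments = [
    '<book id="1"><title>A</title><price>1.00</price></book>',
    '<book id="2"><title>B</title></book>',
]
permissive_rows = load_fragments(fragments, has_price, ["title", "price"])
```

In this sketch the second fragment ends up with empty `title` and `price` values and its full XML in `_corrupt_record`, while the same input in FAILFAST mode raises an error instead.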

Limitations¶

The Snowpark XML RowTag Reader has the following limitations:

Does not infer a schema; all output columns are of type VARIANT.
Only supports files stored in Snowflake stages; local files are not supported.
Is available only in the Snowpark Python library.
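
Because the reader works only on staged files, a local file has to be uploaded before it is read. The sketch below combines both steps using `session.file.put`. The function name `stage_and_read_xml` is hypothetical, and it assumes an existing Snowpark session and a stage you can write to. Disabling `auto_compress` is a choice made here so that the staged file name matches the local one.

```python
import os


def stage_and_read_xml(session, local_path: str, stage: str, row_tag: str):
    """Upload a local XML file to a stage, then read it by rowTag (hypothetical helper)."""
    # auto_compress=False keeps the staged file name identical to the local name,
    # so the stage path built below points at the uploaded file.
    session.file.put(local_path, stage, auto_compress=False, overwrite=True)
    file_name = os.path.basename(local_path)
    staged_path = f"{stage.rstrip('/')}/{file_name}"
    return session.read.option("rowTag", row_tag).xml(staged_path)


# Usage with an existing Snowpark session (not created here):
# df = stage_and_read_xml(session, "/path/to/books.xml", "@mystage", "book")
# df.write.save_as_table("books_table")
```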