Using the Snowpark XML RowTag Reader¶

To use the Snowpark XML RowTag Reader, set the rowTag option on the reader: session.read.option("rowTag", "<rowtag>").xml(). Instead of loading the entire document as a single object, this mode splits the file at each element named by rowTag, loads each matching element as a separate row, and splits each row into columns in a Snowpark DataFrame. The Reader is especially useful for processing only selected elements of an XML file or for ingesting large XML files in a scalable, Snowpark-native way.

Example¶

Consider the following sample XML file:

<library>
    <book id="1">
        <title>The Art of Snowflake</title>
        <author>Jane Doe</author>
        <price>29.99</price>
        <reviews>
            <review>
                <user>tech_guru_87</user>
                <rating>5</rating>
                <comment>Very insightful and practical.</comment>
            </review>
            <review>
                <user>datawizard</user>
                <rating>4</rating>
                <comment>Great read for data engineers.</comment>
            </review>
        </reviews>
        <editions>
            <edition year="2023" format="Hardcover"/>
            <edition year="2024" format="eBook"/>
        </editions>
    </book>

    <book id="2">
        <title>XML for Data Engineers</title>
        <author>John Smith</author>
        <price>35.50</price>
        <reviews>
            <review>
                <user>xml_master</user>
                <rating>5</rating>
                <comment>Perfect for mastering XML parsing.</comment>
            </review>
        </reviews>
        <editions>
            <edition year="2022" format="Paperback"/>
        </editions>
    </book>
</library>


Snowpark script¶

df = session.read.option("rowTag", "book").xml("@mystage/books.xml")


This loads each <book> element from the XML file into its own row, with child elements (for example, <title> and <author>) automatically extracted as columns of type VARIANT.

Output¶

`_id`	`author`	`editions`	`price`	`reviews`	`title`
"2"	"John Smith"	`{ "edition": { "_format": "Paperback", "_year": "2022" } }`	"35.50"	`{ "review": { "comment": "Perfect for mastering XML parsing.", "rating": "5", "user": "xml_master" } }`	"XML for Data Engineers"
"1"	"Jane Doe"	`{ "edition": [ { "_format": "Hardcover", "_year": "2023" }, { "_format": "eBook", "_year": "2024" } ] }`	"29.99"	`{ "review": [ { "comment": "Very insightful and practical.", "rating": "5", "user": "tech_guru_87" }, { "comment": "Great read for data engineers.", "rating": "4", "user": "datawizard" } ] }`	"The Art of Snowflake"

Each XML element identified by rowTag becomes a row.
Each child element within that tag becomes a column stored as VARIANT. Nested elements are captured as nested VARIANT data. As the output above shows, XML attributes appear with a leading underscore (for example, _id and _year).
The resulting DataFrame is flattened and columnized and behaves like any other Snowpark DataFrame.
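
The mapping described above can be illustrated locally with the Python standard library. This is only a sketch of the row and column shape shown in the output table, not the Reader's actual implementation. The helper names to_value and rows_for_tag are hypothetical, and the sample is a shortened version of the XML above.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample document (assumption: same shape as above).
SAMPLE = """<library>
  <book id="1">
    <title>The Art of Snowflake</title>
    <editions>
      <edition year="2023" format="Hardcover"/>
      <edition year="2024" format="eBook"/>
    </editions>
  </book>
  <book id="2">
    <title>XML for Data Engineers</title>
    <editions>
      <edition year="2022" format="Paperback"/>
    </editions>
  </book>
</library>"""


def to_value(elem):
    """Convert an element to a string (plain leaf) or a nested dict."""
    children = list(elem)
    if not children and not elem.attrib:
        return (elem.text or "").strip()
    # Attributes become keys with a leading underscore, as in the output table.
    value = {f"_{name}": text for name, text in elem.attrib.items()}
    for child in children:
        item = to_value(child)
        if child.tag not in value:
            value[child.tag] = item
        elif isinstance(value[child.tag], list):
            value[child.tag].append(item)
        else:
            # A repeated child tag turns the entry into a list.
            value[child.tag] = [value[child.tag], item]
    return value


def rows_for_tag(xml_text, row_tag):
    """Return one dict per element named row_tag (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return [to_value(elem) for elem in root.iter(row_tag)]


rows = rows_for_tag(SAMPLE, "book")
for row in rows:
    print(row)
```

Note how the first book's editions entry holds a list of two edition objects while the second holds a single object, which mirrors the difference between the two rows in the output table.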

Getting started¶

Install the Snowpark Python library:
```
pip install snowflake-snowpark-python
```
Upload your XML file to a Snowflake stage:
```
PUT file:///path/to/books.xml @mystage;
```

Use Snowpark to read the XML file:

df = session.read.option("rowTag", "book").xml("@mystage/books.xml")


Use DataFrame methods to transform or save:

from snowflake.snowpark.functions import col  # needed for col() below

df.select(col("`title`"), col("`author`")).show()
df.write.save_as_table("books_table")

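
The steps above can also be combined into one function that runs entirely from Python. The following is a minimal sketch, assuming an already created Snowpark session. The function name ingest_xml is hypothetical. Uploading with session.file.put and auto_compress=False is an assumption made here so that the staged file keeps its original name, such as books.xml, instead of gaining a .gz suffix.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only needed for type hints; the function itself works on any session object.
    from snowflake.snowpark import DataFrame, Session


def ingest_xml(
    session: "Session",
    local_file: str,
    stage: str,
    table_name: str,
    row_tag: str,
) -> "DataFrame":
    """Upload an XML file, read it with the RowTag Reader, and save it as a table."""
    # Assumption: skip compression so the staged file name stays unchanged.
    session.file.put(local_file, stage, auto_compress=False)

    # Build the stage path from the file name, e.g. "@mystage/books.xml".
    file_name = local_file.rsplit("/", 1)[-1]
    staged_path = f"{stage}/{file_name}"

    # One row per element named row_tag, as described above.
    df = session.read.option("rowTag", row_tag).xml(staged_path)
    df.write.save_as_table(table_name)
    return df


# Example call, assuming `session` was created with Session.builder:
# df = ingest_xml(session, "file:///path/to/books.xml", "@mystage", "books_table", "book")
```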

Supported options¶

rowTag (Required): The name of the XML element to extract as a row.
rowValidationXSDPath (Optional): Stage path to an XSD that is used to validate each rowTag fragment during load.
mode (Optional): By default, rows are loaded without validation. When rowValidationXSDPath is set, mode selects how invalid rows are handled:
- PERMISSIVE: Quarantines invalid rows in _corrupt_record and loads the rest.
- FAILFAST: Stops at the first invalid row and raises an error.

For more information about XML options, see snowflake.snowpark.DataFrameReader.xml.

Validate XML using XSD¶

To validate each rowTag fragment against an XSD during load, set the XSD path and choose a validation mode:

df = (
    session.read
    .option("rowTag", "book")
    .option("rowValidationXSDPath", "@mystage/schema.xsd")  # validates each row element
    .option("mode", "PERMISSIVE")  # or "FAILFAST"
    .xml("@mystage/books.xml")
)


PERMISSIVE: Invalid rows are quarantined in a special _corrupt_record column; valid rows load normally.

To persist the result, write the DataFrame to a table with df.write.save_as_table("<table_name>"). The table will include all parsed columns plus an extra _corrupt_record column: it is NULL for valid rows and contains the full XML records for invalid rows (with the other columns showing NULL).
+-------------------+
| _corrupt_record   |
+-------------------+
| <book id="1"> ... |
| <book id="2"> ... |
+-------------------+

FAILFAST: The read stops at the first invalid row and returns an error.
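
The difference between the two modes can be modeled locally in a few lines. This is a toy sketch of the behavior described above, not the Reader's implementation: the function names parse_book and load_rows are hypothetical, and a missing <price> element stands in for a real XSD validation failure.

```python
import xml.etree.ElementTree as ET

COLUMNS = ("title", "price")


def parse_book(fragment):
    """Return the parsed columns, or None when the fragment is invalid.

    Assumption: 'invalid' means a missing <price>, a stand-in for XSD validation.
    """
    elem = ET.fromstring(fragment)
    if elem.find("price") is None:
        return None
    return {name: elem.findtext(name) for name in COLUMNS}


def load_rows(fragments, mode="PERMISSIVE"):
    """Model how each mode treats invalid row fragments."""
    rows = []
    for fragment in fragments:
        parsed = parse_book(fragment)
        if parsed is not None:
            # Valid row: parsed columns, _corrupt_record stays NULL (None).
            rows.append({**parsed, "_corrupt_record": None})
        elif mode == "FAILFAST":
            # Stop at the first invalid row.
            raise ValueError("row failed validation: " + fragment)
        else:
            # PERMISSIVE: keep the raw XML, leave the other columns NULL.
            rows.append({**dict.fromkeys(COLUMNS), "_corrupt_record": fragment})
    return rows


good = "<book><title>A</title><price>1</price></book>"
bad = "<book><title>B</title></book>"

for row in load_rows([good, bad], mode="PERMISSIVE"):
    print(row)
```

Calling load_rows([good, bad], mode="FAILFAST") on the same input raises a ValueError at the second fragment instead of returning any rows.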

Limitations¶

The Snowpark XML RowTag Reader has the following limitations:

Does not infer the schema; all output columns are of type VARIANT.
Only supports files stored in Snowflake stages; local files are not supported.
Is available only in the Snowpark Python library.