snowflake.snowpark.DataFrameReader.xml

DataFrameReader.xml(path: str) → DataFrame

Specify the path of the XML file(s) to load.

Parameters: path – The stage location of an XML file, or a stage location that contains XML files.
Returns: A DataFrame that is set up to load data from the specified XML file(s) in a Snowflake stage.

Notes about reading XML files using a row tag:

XML can be read record by record by using the rowTag option to name the element tag that represents a single record. See Example 13 in DataFrameReader.

Each XML record is flattened into a single row, with each XML element or attribute mapped to a column. All columns are represented with the variant type to accommodate heterogeneous or nested data. As a result, every column value is subject to the size limit of the variant type.

The column names are derived from the XML element names. They are always wrapped in single quotes.

To parse nested XML under a row tag, use dot notation (.) to query the nested fields in a DataFrame. See Example 13 in DataFrameReader.
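The call pattern described in these notes can be sketched as follows. This is a minimal illustration, not an excerpt from the library: the helper name read_xml_records, the stage path @mystage/books.xml, and the row tag book are hypothetical, and session is assumed to be an existing snowflake.snowpark.Session.

```python
def read_xml_records(session, stage_path, row_tag):
    """Return a DataFrame with one row per <row_tag> element.

    `session` is assumed to be a snowflake.snowpark.Session. It is passed in
    rather than created here so the sketch carries no connection details.
    """
    # session.read is a DataFrameReader; option() sets rowTag and xml()
    # points the reader at the staged file(s).
    return session.read.option("rowTag", row_tag).xml(stage_path)
```

With a live session, read_xml_records(session, "@mystage/books.xml", "book") would return a DataFrame whose rows correspond to the book elements in the staged file.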

When rowTag is specified, the following options are supported for reading XML files via option() or options():

mode: Specifies the mode for dealing with corrupt XML records. The default value is PERMISSIVE. The supported values are:

PERMISSIVE: When it encounters a corrupt record, it sets all fields to null and includes the column named by columnNameOfCorruptRecord.

DROPMALFORMED: Ignores the whole record that cannot be parsed correctly.

FAILFAST: When it encounters a corrupt record, it raises an exception immediately.

columnNameOfCorruptRecord: Specifies the name of the column that contains the corrupt record. The default value is ‘_corrupt_record’.
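As a hedged sketch of how mode and columnNameOfCorruptRecord might be passed together, the helper below builds an options dictionary for options(). The helper name xml_error_options is hypothetical, and its exact-match check on the mode string is a choice of this sketch, not documented library behavior.

```python
# The three mode values listed above.
SUPPORTED_MODES = ("PERMISSIVE", "DROPMALFORMED", "FAILFAST")


def xml_error_options(row_tag, mode="PERMISSIVE", corrupt_column="_corrupt_record"):
    """Build the reader options that control corrupt-record handling."""
    # Reject unknown modes early, before anything is sent to Snowflake.
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"mode must be one of {SUPPORTED_MODES}, got {mode!r}")
    return {
        "rowTag": row_tag,
        "mode": mode,
        "columnNameOfCorruptRecord": corrupt_column,
    }
```

Assuming an existing session and a hypothetical stage path, the dictionary can be handed to the reader as session.read.options(xml_error_options("book", mode="FAILFAST")).xml("@mystage/books.xml").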

ignoreNamespace: Whether to remove namespace prefixes from XML element names when constructing result column names. The default value is True. Note that if a given prefix isn’t declared on the row tag element, it cannot be resolved and is left intact (that is, this setting is ignored for that element). For example, for the following XML data with the row tag abc:def: `<abc:def><abc:xyz>0</abc:xyz></abc:def>` the result column name is abc:xyz, where the abc prefix is not stripped.
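The prefix rule can be modeled in a few lines of plain Python. This is only a model of the rule as stated here, not Snowpark's implementation: the function result_column_name is hypothetical, it inspects just the opening row tag for xmlns:prefix declarations, and the namespace URI is a placeholder.

```python
import re


def result_column_name(element_name, row_tag_open, ignore_namespace=True):
    """Model the ignoreNamespace rule for a single element name.

    `row_tag_open` is the opening row tag as text, e.g. '<abc:def>'.
    """
    if not ignore_namespace or ":" not in element_name:
        return element_name
    prefix, local_name = element_name.split(":", 1)
    # Collect the prefixes declared on the row tag element itself.
    declared = set(re.findall(r"xmlns:([\w.-]+)\s*=", row_tag_open))
    # An undeclared prefix cannot be resolved, so the name is left intact.
    return local_name if prefix in declared else element_name


# Prefix not declared on the row tag: the name is kept as-is.
undeclared = result_column_name("abc:xyz", "<abc:def>")
# Prefix declared on the row tag (placeholder URI): the prefix is stripped.
declared = result_column_name("abc:xyz", '<abc:def xmlns:abc="http://example.com/abc">')
```

Under this model, undeclared is "abc:xyz", matching the example in the text, while declared is "xyz".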

attributePrefix: The prefix to add to the attribute names. The default value is _.

excludeAttributes: Whether to exclude attributes from the XML element. The default value is False.

valueTag: The column name used for the value when there are attributes in an element that has no child elements. The default value is _VALUE.

nullValue: The value to treat as a null value. The default value is "".

charset: The character encoding of the XML file. The default value is utf-8.

ignoreSurroundingWhitespace: Whether to skip whitespace surrounding values. The default value is False.
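To make the attributePrefix, excludeAttributes, and valueTag descriptions concrete, the following standard-library sketch models how one record might be flattened. It is an illustration under stated assumptions, not the library's code: the flatten function and the sample book record are hypothetical, repeated sibling element names are not handled, and the nullValue, charset, and ignoreSurroundingWhitespace options are left out.

```python
import xml.etree.ElementTree as ET


def flatten(element, attribute_prefix="_", exclude_attributes=False, value_tag="_VALUE"):
    """Model how an element maps to a value or a nested dictionary."""
    children = list(element)
    attrs = {} if exclude_attributes else dict(element.attrib)
    # A leaf with no attributes is represented by its text alone.
    if not children and not attrs:
        return element.text
    # Attribute names receive the configured prefix.
    result = {attribute_prefix + name: value for name, value in attrs.items()}
    if children:
        for child in children:
            result[child.tag] = flatten(child, attribute_prefix, exclude_attributes, value_tag)
    else:
        # A leaf that has attributes stores its text under valueTag.
        result[value_tag] = element.text
    return result


# Hypothetical record, as if rowTag were "book".
record = ET.fromstring(
    '<book id="1"><title lang="en">XML Basics</title><pages>120</pages></book>'
)
with_attributes = flatten(record)
without_attributes = flatten(record, exclude_attributes=True)
```

In this model, with_attributes holds _id at the top level and a nested title entry with _lang and _VALUE keys, while without_attributes reduces title to its text because no attributes remain.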