Snowpark Migration Accelerator: Issue Codes for Python

SPRKPY1000

Message: The spark-core version of the source project is xx.xx:xx.x.x, the spark-core version supported by Snowpark is 2.12:3.1.2, so there may be functional differences between the existing mappings.

Category: Warning

Description

This issue appears when the PySpark version of your source code is not supported. It means there may be functional differences between the existing mappings.

Additional recommendations

  • The pyspark version the SMA analyzes for compatibility with Snowpark ranges from 2.12 to 3.1.2. If you are using a version outside this range, the tool may produce inconsistent results. You could change the version of the source code you are analyzing.

  • For more support, you can email us at sma-support@snowflake.com or post an issue in the SMA.

SPRKPY1001

Message: This code section has parsing errors

Category: Parsing error.

Description

A parsing error is reported by the Snowpark Migration Accelerator (SMA) tool when it cannot correctly read or understand the code in a file (it cannot "parse" the file correctly). This issue code appears when a file has one or more parsing errors.

Scenario

Input: The EWI message appears when the code has invalid syntax, for example:

def foo():
    x = %%%%%%1###1

Output: SMA detects a parsing error, comments out the invalid code, and adds the corresponding EWI message:

def foo():
    x
## EWI: SPRKPY1101 => Unrecognized or invalid CODE STATEMENT @(2, 7). Last valid token was 'x' @(2, 5), failed token '=' @(2, 7)
##      = %%%%%%1###1

Additional recommendations

  • Check that the file contains valid Python code. (You can use the issues.csv file to find all files with this EWI code to determine which file(s) were not processed by the tool due to parsing error(s).) Many parsing errors occur because only part of the code is input into the tool, so it's best to ensure that the code will run in the source. If it is valid, report that you encountered a parsing error using the Report an Issue option in the SMA. Include the line of code that was causing the parsing error in the description when you file this issue.

  • For more support, you can email us at sma-support@snowflake.com or post an issue in the SMA.
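The recommendation above can be scripted. The issues.csv layout below is an assumption (column names may differ in your SMA version); this is only a sketch of filtering the report for files carrying a given issue code:

```python
import csv
import io

# Hypothetical issues.csv content; the real column names may differ
# in your SMA version -- adjust the keys below accordingly.
ISSUES_CSV = """Code,FileName,Line
SPRKPY1101,jobs/etl_job.py,2
SPRKPY1009,jobs/stats.py,14
SPRKPY1101,notebooks/load.py,7
"""

def files_with_issue(csv_text, issue_code):
    """Return the sorted set of files that contain the given issue code."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sorted({row["FileName"] for row in reader if row["Code"] == issue_code})

print(files_with_issue(ISSUES_CSV, "SPRKPY1101"))
# ['jobs/etl_job.py', 'notebooks/load.py']
```

In practice you would pass `open("issues.csv")` instead of the inline sample.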

SPRKPY1002

Message: <element> is not supported, Spark element is not supported.

Category: Conversion error.

Description

This issue appears when the tool detects the usage of an element that is not supported by Snowpark and that does not have its own associated error code. This is the generic error code the SMA uses for an unsupported element.

Additional recommendations

  • Even though the option or element in the message is not supported, that does not mean a solution cannot be found. It only means that the tool itself cannot find the solution.

  • If you encountered an unsupported element from a pyspark.ml library, consider an alternative approach. There are other guides available to resolve ML-related issues, such as Snowflake's.

  • Check that the source code syntax is correct. (You can use the issues.csv file to determine where the conversion errors occur.) If the syntax is correct, report that you encountered a conversion error on a particular element using the Report an Issue option in the SMA. Include the line of code that was causing the error in the description when you file this issue.

  • For more support, you can email us at sma-support@snowflake.com or post an issue in the SMA.

SPRKPY1003

Message: An error occurred when loading the symbol table.

Category: Conversion error.

Description

This issue appears when there is an error processing the symbols in the symbol table. The symbol table is part of the underlying architecture of the SMA tool that enables more complex conversions. This error may be caused by an unexpected statement in the source code.

Additional recommendations

  • This is unlikely to be an error in the source code itself, but rather is an error in how the tool processes the source code. The best resolution would be to post an issue in the SMA.

  • For more support, you can email us at sma-support@snowflake.com or post an issue in the SMA.

SPRKPY1004

Message: The symbol table could not be loaded.

Category: Parsing error.

Description

This issue appears when there is an unexpected error in the tool's execution process. Since the symbol table cannot be loaded, the tool cannot start the assessment or conversion process.

Additional recommendations

SPRKPY1005

Warning

This issue code has been deprecated since Spark Conversion Core Version 4.8.0

Message: pyspark.conf.SparkConf is not required

Category: Warning.

Description

This issue appears when the tool detects the usage of pyspark.conf.SparkConf which is not required.

Scenario

Input

SparkConf can be called without parameters or with loadDefaults.

from pyspark import SparkConf

my_conf = SparkConf(loadDefaults=True)

Output

For both cases (with or without parameters) SMA creates a Snowpark Session.builder object:

#EWI: SPRKPY1005 => pyspark.conf.SparkConf is not required
#from pyspark import SparkConf
pass

#EWI: SPRKPY1005 => pyspark.conf.SparkConf is not required
my_conf = Session.builder.configs({"user" : "my_user", "password" : "my_password", "account" : "my_account", "role" : "my_role", "warehouse" : "my_warehouse", "database" : "my_database", "schema" : "my_schema"}).create()

Additional recommendations

  • This is the removal of an unnecessary parameter and the insertion of a warning comment. No further action is required from the user.

  • For more support, you can email us at sma-support@snowflake.com or post an issue in the SMA.

SPRKPY1006

Warning

This issue code has been deprecated since Spark Conversion Core Version 4.8.0

Message: pyspark.context.SparkContext is not required

Category: Warning.

Description

This issue appears when the tool detects the usage of pyspark.context.SparkContext, which is not required in Snowflake.

Scenario

Input

In this example there are two contexts used to create a connection to a Spark cluster.

from pyspark import SparkContext

sql_context1 = SparkContext(my_sc1)
sql_context2 = SparkContext(sparkContext=my_sc2)

Output

Since there is no cluster in Snowflake, the context is not required. Note that the variables my_sc1 and my_sc2, which contain the Spark properties, may not be needed or will have to be adapted to fix the code.

from snowflake.snowpark import Session
#EWI: SPRKPY1006 => pyspark.sql.context.SparkContext is not required
sql_context1 = my_sc1
#EWI: SPRKPY1006 => pyspark.sql.context.SparkContext is not required

sql_context2 = my_sc2

Additional recommendations

  • This is the removal of an unnecessary parameter and the insertion of a warning comment. No action is required from the user.

  • For more support, you can email us at sma-support@snowflake.com or post an issue in the SMA.

SPRKPY1007

Warning

This issue code has been deprecated since Spark Conversion Core Version 4.8.0

Message: pyspark.sql.context.SQLContext is not required

Category: Warning.

Description

This issue appears when the tool detects the usage of pyspark.sql.context.SQLContext, which is not required.

Scenario

Input

Here is an example with different SQLContext overloads.

from pyspark import SQLContext

my_sc1 = SQLContext(myMaster, myAppName, mySparkHome, myPyFiles, myEnvironment, myBatctSize, mySerializer, my_conf1)
my_sc2 = SQLContext(conf=my_conf2)
my_sc3 = SQLContext()

Output

The output code comments out the line for pyspark.SQLContext and replaces the scenarios with a reference to a configuration. Note that the variables my_sc1 and my_sc2, which contain the Spark properties, may not be needed or will have to be adapted to fix the code.

#EWI: SPRKPY1007 => pyspark.sql.context.SQLContext is not required
#from pyspark import SQLContext
pass

#EWI: SPRKPY1007 => pyspark.sql.context.SQLContext is not required
sql_context1 = my_sc1
#EWI: SPRKPY1007 => pyspark.sql.context.SQLContext is not required
sql_context2 = my_sc2

Additional recommendations

  • This is an unnecessary parameter that is removed, with a warning comment inserted into the source code. No action is required from the user.

  • For more support, you can email us at sma-support@snowflake.com or post an issue in the SMA.

SPRKPY1008

Message: pyspark.sql.context.HiveContext is not required

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.context.HiveContext, which is not required.

Scenario

Input

In this example, a connection to a Hive store is created.

from pyspark.sql import HiveContext
hive_context = HiveContext(sc)
df = hive_context.table("myTable")
df.show()

Output

In Snowflake there are no Hive stores, so the HiveContext is not required. You can still use Parquet files in Snowflake; please check this tutorial to learn how.

#EWI: SPRKPY1008 => pyspark.sql.context.HiveContext is not required
hive_context = sc
df = hive_context.table("myTable")
df.show()

The sc variable refers to a Snowpark Session object.

Recommended fix

For the output code in the example, you should add the Snowpark Session object, similar to this code:

## Here we can manually add the Snowpark Session object via a JSON config file called connection.json
import json
from snowflake.snowpark import Session
jsonFile = open("connection.json")
connection_parameter = json.load(jsonFile)
jsonFile.close()
sc = Session.builder.configs(connection_parameter).getOrCreate()

hive_context = sc
df = hive_context.table("myTable")
df.show()
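The connection.json file read above could look like the following (all values are placeholders; the exact set of keys depends on your account setup):

```json
{
  "account": "my_account",
  "user": "my_user",
  "password": "my_password",
  "role": "my_role",
  "warehouse": "my_warehouse",
  "database": "my_database",
  "schema": "my_schema"
}
```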

Additional recommendations

SPRKPY1009

Message: pyspark.sql.dataframe.DataFrame.approxQuantile has a workaround

Category: Warning.

Description

This issue appears when the tool detects the usage of pyspark.sql.dataframe.DataFrame.approxQuantile which has a workaround.

Scenario

Input

It's important to understand that PySpark uses two different approxQuantile functions; here we use the DataFrame approxQuantile version.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [['Sun', 10],
        ['Mon', 64],
        ['Thr', 12],
        ['Wen', 15],
        ['Thu', 68],
        ['Fri', 14],
        ['Sat', 13]]

columns = ['Day', 'Ammount']
df = spark.createDataFrame(data, columns)
df.approxQuantile('Ammount', [0.25, 0.5, 0.75], 0)

Output

SMA returns the EWI SPRKPY1009 on the line where approxQuantile is used, so you can identify where to fix it.

from snowflake.snowpark import Session
spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['Sun', 10],
        ['Mon', 64],
        ['Thr', 12],
        ['Wen', 15],
        ['Thu', 68],
        ['Fri', 14],
        ['Sat', 13]]

columns = ['Day', 'Ammount']
df = spark.createDataFrame(data, columns)
#EWI: SPRKPY1009 => pyspark.sql.dataframe.DataFrame.approxQuantile has a workaround, see documentation for more info
df.approxQuantile('Ammount', [0.25, 0.5, 0.75], 0)

Recommended fix

Use the Snowpark approx_quantile method. Some parameters don't match, so they require some manual adjustments. For the output code example, a recommended fix could be:

from snowflake.snowpark import Session
...
df = spark.createDataFrame(data, columns)

df.stat.approx_quantile('Ammount', [0.25, 0.5, 0.75])

The error parameter of pyspark.sql.dataframe.DataFrame.approxQuantile does not exist in Snowpark.

Additional recommendations

SPRKPY1010

Message: pyspark.sql.dataframe.DataFrame.checkpoint has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.dataframe.DataFrame.checkpoint which has a workaround.

Scenario

Input

In PySpark, checkpoints are used to truncate the logical plan of a DataFrame, in order to keep the logical plan from growing.

import tempfile
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [['Q1', 300000],
        ['Q2', 60000],
        ['Q3', 500002],
        ['Q4', 130000]]

columns = ['Quarter', 'Score']
df = spark.createDataFrame(data, columns)
with tempfile.TemporaryDirectory() as d:
    spark.sparkContext.setCheckpointDir("/tmp/bb")
    df.checkpoint(False)

Output

SMA returns the EWI SPRKPY1010 on the line where checkpoint is used, so you can identify where to fix it. Note that it also marks setCheckpointDir as unsupported, but a checkpoint directory is not required for the fix.

import tempfile
from snowflake.snowpark import Session
spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['Q1', 300000],
        ['Q2', 60000],
        ['Q3', 500002],
        ['Q4', 130000]]

columns = ['Quarter', 'Score']
df = spark.createDataFrame(data, columns)
with tempfile.TemporaryDirectory() as d:
    #EWI: SPRKPY1002 => pyspark.context.SparkContext.setCheckpointDir is not supported
    spark.setCheckpointDir("/tmp/bb")
    #EWI: SPRKPY1010 => pyspark.sql.dataframe.DataFrame.checkpoint has a workaround, see documentation for more info
    df.checkpoint(False)

Recommended fix

Snowpark eliminates the need for explicit checkpoints: Snowpark works with SQL-based operations that are optimized by Snowflake's query optimization engine, which removes the need for unneeded computations or for logical plans that grow out of control.

However, there could be scenarios where you need to persist the result of a computation on a DataFrame. In these scenarios you can materialize the results by writing the DataFrame to a Snowflake table or to a Snowflake temporary table.

  • Using a permanent table, the result of the computation is accessible at any time, even after the session ends.

from snowflake.snowpark import Session
spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['Q1', 300000],
        ['Q2', 60000],
        ['Q3', 500002],
        ['Q4', 130000]]

columns = ['Quarter', 'Score']
df = spark.createDataFrame(data, columns)
df.write.save_as_table("my_table") # Save the dataframe into the Snowflake table "my_table".
df2 = spark.table("my_table") # Now I can access the stored result by querying the table "my_table"

  • An alternative fix is to use a temporary table, which has the advantage that the table is deleted after the session ends:

from snowflake.snowpark import Session
spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['Q1', 300000],
        ['Q2', 60000],
        ['Q3', 500002],
        ['Q4', 130000]]

columns = ['Quarter', 'Score']
df = spark.createDataFrame(data, columns)
df.write.save_as_table("my_temp_table", table_type="temporary") # Save the dataframe into the Snowflake temporary table "my_temp_table".
df2 = spark.table("my_temp_table") # Now I can access the stored result by querying the table "my_temp_table"

Additional recommendations

SPRKPY1011

Message: pyspark.sql.dataframe.DataFrameStatFunctions.approxQuantile has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.dataframe.DataFrameStatFunctions.approxQuantile which has a workaround.

Scenario

Input

It's important to understand that PySpark uses two different approxQuantile functions; here we use the DataFrameStatFunctions approxQuantile version.

import tempfile
from pyspark.sql import SparkSession, DataFrameStatFunctions
spark = SparkSession.builder.getOrCreate()
data = [['Q1', 300000],
        ['Q2', 60000],
        ['Q3', 500002],
        ['Q4', 130000]]

columns = ['Quarter', 'Gain']
df = spark.createDataFrame(data, columns)
aprox_quantille = DataFrameStatFunctions(df).approxQuantile('Gain', [0.25, 0.5, 0.75], 0)
print(aprox_quantille)

Output

SMA returns the EWI SPRKPY1011 on the line where approxQuantile is used, so you can identify where to fix it.

import tempfile
from snowflake.snowpark import Session, DataFrameStatFunctions
spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['Q1', 300000],
        ['Q2', 60000],
        ['Q3', 500002],
        ['Q4', 130000]]

columns = ['Quarter', 'Gain']
df = spark.createDataFrame(data, columns)
#EWI: SPRKPY1011 => pyspark.sql.dataframe.DataFrameStatFunctions.approxQuantile has a workaround, see documentation for more info
aprox_quantille = DataFrameStatFunctions(df).approxQuantile('Gain', [0.25, 0.5, 0.75], 0)

Recommended fix

You can use the Snowpark approx_quantile method. Some parameters don't match, so they require some manual adjustments. For the output code example, a recommended fix could be:

from snowflake.snowpark import Session # remove DataFrameStatFunctions because it is not required
...
df = spark.createDataFrame(data, columns)

aprox_quantille = df.stat.approx_quantile('Gain', [0.25, 0.5, 0.75])

The error parameter of pyspark.sql.dataframe.DataFrameStatFunctions.approxQuantile does not exist in Snowpark.

Additional recommendations

SPRKPY1012

Warning

This issue code is now deprecated.

Message: pyspark.sql.dataframe.DataFrameStatFunctions.writeTo has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.dataframe.DataFrameStatFunctions.writeTo which has a workaround.

Scenario

Input

In this example, the DataFrame df is written to the Spark table "table".

writer = df.writeTo("table")

Output

SMA returns the EWI SPRKPY1012 on the line where DataFrameStatFunctions.writeTo is used, so you can identify where to fix it.

#EWI: SPRKPY1012 => pyspark.sql.dataframe.DataFrameStatFunctions.writeTo has a workaround, see documentation for more info
writer = df.writeTo("table")

Recommended fix

Use df.write.save_as_table() instead.

writer = df.write.save_as_table("table")

Additional recommendations

SPRKPY1013

Message: pyspark.sql.functions.acosh has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.acosh which has a workaround.

Scenario

Input

In this example, PySpark calculates the acosh for a DataFrame by using pyspark.sql.functions.acosh.

from pyspark.sql import SparkSession
from pyspark.sql.functions import acosh
spark = SparkSession.builder.getOrCreate()
data = [['V1', 30],
        ['V2', 60],
        ['V3', 50],
        ['V4', 13]]

columns = ['Paremeter', 'value']
df = spark.createDataFrame(data, columns)
df_with_acosh = df.withColumn("acosh_value", acosh(df["value"]))

Output

SMA returns the EWI SPRKPY1013 on the line where acosh is used, so you can identify where to fix it.

from snowflake.snowpark import Session

spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['V1', 30],
        ['V2', 60],
        ['V3', 50],
        ['V4', 13]]

columns = ['Paremeter', 'value']
df = spark.createDataFrame(data, columns)
#EWI: SPRKPY1013 => pyspark.sql.functions.acosh has a workaround, see documentation for more info
df_with_acosh = df.withColumn("acosh_value", acosh(df["value"]))

Recommended fix

There is no direct acosh implementation, but call_function can be used instead, using "ACOSH" as the first parameter and the column name as the second one.

import snowflake.snowpark as snowpark
from snowflake.snowpark import Session
from snowflake.snowpark.functions import call_function, col

spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['V1', 30],
        ['V2', 60],
        ['V3', 50],
        ['V4', 13]]

columns = ['Paremeter', 'value']
df = spark.createDataFrame(data, columns)
df_with_acosh = df.select(call_function('ACOSH', col('value')))
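As a local sanity check of the values you should expect from ACOSH (no Snowflake connection needed), the inverse hyperbolic cosine satisfies the identity acosh(x) = ln(x + sqrt(x² - 1)) for x ≥ 1, which Python's math module lets you verify directly:

```python
import math

def acosh_by_identity(x):
    # acosh(x) = ln(x + sqrt(x^2 - 1)), defined for x >= 1
    return math.log(x + math.sqrt(x * x - 1))

# Values from the example dataframe's "value" column
for v in [30, 60, 50, 13]:
    assert math.isclose(acosh_by_identity(v), math.acosh(v))
    print(v, math.acosh(v))
```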

Additional recommendations

SPRKPY1014

Message: pyspark.sql.functions.asinh has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.asinh which has a workaround.

Scenario

Input

In this example, PySpark calculates the asinh for a DataFrame by using pyspark.sql.functions.asinh.

from pyspark.sql import SparkSession
from pyspark.sql.functions import asinh
spark = SparkSession.builder.getOrCreate()
data = [['V1', 3.0],
        ['V2', 60.0],
        ['V3', 14.0],
        ['V4', 3.1]]

columns = ['Paremeter', 'value']
df = spark.createDataFrame(data, columns)
df_result = df.withColumn("asinh_value", asinh(df["value"]))

Output

SMA returns the EWI SPRKPY1014 on the line where asinh is used, so you can identify where to fix it.

from snowflake.snowpark import Session

spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['V1', 3.0],
        ['V2', 60.0],
        ['V3', 14.0],
        ['V4', 3.1]]

columns = ['Paremeter', 'value']
df = spark.createDataFrame(data, columns)
#EWI: SPRKPY1014 => pyspark.sql.functions.asinh has a workaround, see documentation for more info
df_result = df.withColumn("asinh_value", asinh(df["value"]))

Recommended fix

There is no direct asinh implementation, but call_function can be used instead, using "asinh" as the first parameter and the column name as the second one.

import snowflake.snowpark as snowpark
from snowflake.snowpark import Session
from snowflake.snowpark.functions import call_function, col

spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['V1', 3.0],
        ['V2', 60.0],
        ['V3', 14.0],
        ['V4', 3.1]]

columns = ['Paremeter', 'value']
df = spark.createDataFrame(data, columns)
df_result = df.select(call_function('asinh', col('value')))

Additional recommendations

SPRKPY1015

Message: pyspark.sql.functions.atanh has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.atanh which has a workaround.

Scenario

Input

In this example, PySpark calculates the atanh for a DataFrame by using pyspark.sql.functions.atanh.

from pyspark.sql import SparkSession
from pyspark.sql.functions import atanh
spark = SparkSession.builder.getOrCreate()
data = [['V1', 0.14],
        ['V2', 0.32],
        ['V3', 0.4],
        ['V4', -0.36]]

columns = ['Paremeter', 'value']
df = spark.createDataFrame(data, columns)
df_result = df.withColumn("atanh_value", atanh(df["value"]))

Output

SMA returns the EWI SPRKPY1015 on the line where atanh is used, so you can identify where to fix it.

from snowflake.snowpark import Session

spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['V1', 0.14],
        ['V2', 0.32],
        ['V3', 0.4],
        ['V4', -0.36]]

columns = ['Paremeter', 'value']
df = spark.createDataFrame(data, columns)
#EWI: SPRKPY1015 => pyspark.sql.functions.atanh has a workaround, see documentation for more info
df_result = df.withColumn("atanh_value", atanh(df["value"]))

Recommended fix

There is no direct atanh implementation, but call_function can be used instead, using "atanh" as the first parameter and the column name as the second one.

import snowflake.snowpark as snowpark
from snowflake.snowpark import Session
from snowflake.snowpark.functions import call_function, col

spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['V1', 0.14],
        ['V2', 0.32],
        ['V3', 0.4],
        ['V4', -0.36]]

columns = ['Paremeter', 'value']
df = spark.createDataFrame(data, columns)
df_result = df.select(call_function('atanh', col('value')))

Additional recommendations

SPRKPY1016

Warning

This issue code has been deprecated since Spark Conversion Core Version 0.11.7

Message: pyspark.sql.functions.collect_set has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.collect_set which has a workaround.

Scenario

Input

Using collect_set to get the elements of colName without duplicates:

col = collect_set(colName)

Output

SMA returns the EWI SPRKPY1016 on the line where collect_set is used, so you can identify where to fix it.

#EWI: SPRKPY1016 => pyspark.sql.functions.collect_set has a workaround, see documentation for more info
col = collect_set(colName)

Recommended fix

Use the array_agg function and add a second argument with the value True.

col = array_agg(col, True)
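Conceptually, the second True argument makes array_agg aggregate only distinct values, mirroring collect_set's deduplication. The following plain-Python model is only an illustration of that semantics (array_agg itself is a Snowpark function, and neither function guarantees element order):

```python
def aggregate(values, is_distinct=False):
    # Model of array_agg semantics: keep all values, or only the
    # first occurrence of each value when is_distinct is True.
    if is_distinct:
        return list(dict.fromkeys(values))
    return list(values)

col = ["a", "b", "a", "c", "b"]
print(aggregate(col))        # ['a', 'b', 'a', 'c', 'b']
print(aggregate(col, True))  # ['a', 'b', 'c']
```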

Additional recommendations

SPRKPY1017

Warning

This issue code has been deprecated since Spark Conversion Core Version 4.8.0

Message: pyspark.sql.functions.date_add has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.date_add which has a workaround.

Scenario

Input

In this example, we use date_add to compute the date 5 days after the current date for the DataFrame df.

col = df.select(date_add(df.colName, 5))

Output

SMA returns the EWI SPRKPY1017 on the line where date_add is used, so you can identify where to fix it.

#EWI: SPRKPY1017 => pyspark.sql.functions.date_add has a workaround, see documentation for more info
col = df.select(date_add(df.colName, 5))

Recommended fix

Import snowflake.snowpark.functions, which contains an implementation of the date_add function (and its alias dateAdd).

from snowflake.snowpark.functions import date_add

col = df.select(date_add(df.dt, 1))

Additional recommendations

SPRKPY1018

Warning

This issue code has been deprecated since Spark Conversion Core Version 4.8.0

Message: pyspark.sql.functions.date_sub has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.date_sub which has a workaround.

Scenario

Input

In this example, we use date_sub to compute the date 5 days before the current date for the DataFrame df.

col = df.select(date_sub(df.colName, 5))

Output

SMA returns the EWI SPRKPY1018 on the line where date_sub is used, so you can identify where to fix it.

#EWI: SPRKPY1018 => pyspark.sql.functions.date_sub has a workaround, see documentation for more info
col = df.select(date_sub(df.colName, 5))

Recommended fix

Import snowflake.snowpark.functions, which contains an implementation of the date_sub function.

from snowflake.snowpark.functions import date_sub
df.withColumn("date", date_sub(df.colName, 5))

Additional recommendations

SPRKPY1019

Warning

This issue code has been deprecated since Spark Conversion Core Version 4.8.0

Message: pyspark.sql.functions.datediff has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.datediff which has a workaround.

Scenario

Input

In this example, we use datediff to compute the day difference between "today" and other dates.

contacts = (contacts
            #days since last event
            .withColumn('daysSinceLastEvent', datediff(lit(today),'lastEvent'))
            #days since deployment
            .withColumn('daysSinceLastDeployment', datediff(lit(today),'lastDeploymentEnd'))
            #days since online training
            .withColumn('daysSinceLastTraining', datediff(lit(today),'lastTraining'))
            #days since last RC login
            .withColumn('daysSinceLastRollCallLogin', datediff(lit(today),'adx_identity_lastsuccessfullogin'))
            #days since last EMS login
            .withColumn('daysSinceLastEMSLogin', datediff(lit(today),'vms_lastuserlogin'))
           )

Output

SMA returns the EWI SPRKPY1019 on the line where datediff is used, so you can identify where to fix it.

from pyspark.sql.functions import datediff
#EWI: SPRKPY1019 => pyspark.sql.functions.datediff has a workaround, see documentation for more info
contacts = (contacts
            #days since last event
            .withColumn('daysSinceLastEvent', datediff(lit(today),'lastEvent'))
            #days since deployment
            .withColumn('daysSinceLastDeployment', datediff(lit(today),'lastDeploymentEnd'))
            #days since online training
            .withColumn('daysSinceLastTraining', datediff(lit(today),'lastTraining'))
            #days since last RC login
            .withColumn('daysSinceLastRollCallLogin', datediff(lit(today),'adx_identity_lastsuccessfullogin'))
            #days since last EMS login
            .withColumn('daysSinceLastEMSLogin', datediff(lit(today),'vms_lastuserlogin'))
           )

SMA converts pyspark.sql.functions.datediff into snowflake.snowpark.functions.daydiff, which also calculates the difference in days between two dates.

Recommended fix

datediff(part: string, end: ColumnOrName, start: ColumnOrName)

Action: Import snowflake.snowpark.functions, which contains an implementation of the datediff function that requires an extra parameter for the date-time part and allows more versatility in calculating differences between dates.

from snowflake.snowpark import Session
from snowflake.snowpark.functions import datediff
contacts = (contacts
            #days since last event
            .withColumn('daysSinceLastEvent', datediff('day', lit(today),'lastEvent'))
            #days since deployment
            .withColumn('daysSinceLastDeployment', datediff('day',lit(today),'lastDeploymentEnd'))
            #days since online training
            .withColumn('daysSinceLastTraining', datediff('day', lit(today),'lastTraining'))
            #days since last RC login
            .withColumn('daysSinceLastRollCallLogin', datediff('day', lit(today),'adx_identity_lastsuccessfullogin'))
            #days since last EMS login
            .withColumn('daysSinceLastEMSLogin', datediff('day', lit(today),'vms_lastuserlogin'))
           )
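The 'day' part asks Snowflake's DATEDIFF to count calendar days between the two column values. As a plain-Python sketch of that quantity for a single pair of dates (note that the argument order determines the sign):

```python
from datetime import date

today = date(2024, 3, 15)
last_event = date(2024, 3, 1)

# Number of calendar days between the two dates
days_since_last_event = (today - last_event).days
print(days_since_last_event)  # 14
```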

Additional recommendations

SPRKPY1020

Message: pyspark.sql.functions.instr has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.instr which has a workaround.

Scenario

Input

Below is an example of the use of pyspark instr:

from pyspark.sql import SparkSession
from pyspark.sql.functions import instr
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([('abcd',)], ['test',])
df.select(instr(df.test, 'cd').alias('result')).collect()

Output:

The SMA returns the EWI SPRKPY1020 on the line where instr is used, so you can identify the place that needs to be fixed.

from snowflake.snowpark import Session

spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
df = spark.createDataFrame([('abcd',)], ['test',])
#EWI: SPRKPY1020 => pyspark.sql.functions.instr has a workaround, see documentation for more info
df.select(instr(df.test, 'cd').alias('result')).collect()

Recommended fix

This requires a manual change: use the charindex function and swap the order of the first two parameters.

import snowflake.snowpark as snowpark
from snowflake.snowpark import Session
from snowflake.snowpark.functions import charindex, lit

spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
df = spark.createDataFrame([('abcd',)], ['test',])
df.select(charindex(lit('cd'), df.test).as_('result')).show()
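
Both instr and charindex return the 1-based position of the first match, or 0 when the substring is not found; only the argument order differs (charindex takes the substring first). This plain-Python sketch (a hypothetical helper, not part of either API) makes the semantics concrete:

```python
def charindex(substring: str, value: str) -> int:
    """CHARINDEX-style lookup: 1-based position of substring, 0 if absent."""
    return value.find(substring) + 1

print(charindex('cd', 'abcd'))  # 3 (1-based)
print(charindex('xy', 'abcd'))  # 0 (not found)
```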

Additional recommendation

SPRKPY1021

Warning

This issue code is now deprecated.

Message: pyspark.sql.functions.last has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.functions.last function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.functions.last function that generates this EWI. In this example, the last function is used to get the last value for each name.

df = spark.createDataFrame([("Alice", 1), ("Bob", 2), ("Charlie", 3), ("Alice", 4), ("Bob", 5)], ["name", "value"])
df_grouped = df.groupBy("name").agg(last("value").alias("last_value"))

Output

The SMA adds the EWI SPRKPY1021 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

df = spark.createDataFrame([("Alice", 1), ("Bob", 2), ("Charlie", 3), ("Alice", 4), ("Bob", 5)], ["name", "value"])
#EWI: SPRKPY1021 => pyspark.sql.functions.last has a workaround, see documentation for more info
df_grouped = df.groupBy("name").agg(last("value").alias("last_value"))

Recommended fix

As a workaround, you can use the Snowflake LAST_VALUE function. To invoke this function from Snowpark, use the snowflake.snowpark.functions.call_builtin function and pass the string last_value as the first argument and the corresponding column as the second argument. If you were using the name of the column in the last function, you should convert it into a column when calling the call_builtin function.

from snowflake.snowpark.functions import call_builtin, col

df = spark.createDataFrame([("Alice", 1), ("Bob", 2), ("Charlie", 3), ("Alice", 4), ("Bob", 5)], ["name", "value"])
df_grouped = df.groupBy("name").agg(call_builtin("last_value", col("value")).alias("last_value"))

Additional recommendations




SPRKPY1022

Message: pyspark.sql.functions.log10 has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.functions.log10 function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.functions.log10 function that generates this EWI. In this example, the log10 function is used to calculate the base-10 logarithm of the value column.

df = spark.createDataFrame([(1,), (10,), (100,), (1000,), (10000,)], ["value"])
df_with_log10 = df.withColumn("log10_value", log10(df["value"]))

Output

The SMA adds the EWI SPRKPY1022 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

df = spark.createDataFrame([(1,), (10,), (100,), (1000,), (10000,)], ["value"])
#EWI: SPRKPY1022 => pyspark.sql.functions.log10 has a workaround, see documentation for more info
df_with_log10 = df.withColumn("log10_value", log10(df["value"]))

Recommended fix

As a workaround, you can use the snowflake.snowpark.functions.log function by passing the literal value 10 as the base.

df = spark.createDataFrame([(1,), (10,), (100,), (1000,), (10000,)], ["value"])
df_with_log10 = df.withColumn("log10_value", log(10, df["value"]))
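
The workaround relies on the identity log10(x) = log(x, base 10), which you can verify locally with Python's math module (a pure-Python check, not Snowpark code):

```python
import math

# The base-10 logarithm equals the general logarithm with base 10
for x in (1, 10, 100, 1000, 10000):
    assert math.isclose(math.log10(x), math.log(x, 10))
```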

Additional recommendations

SPRKPY1023

Message: pyspark.sql.functions.log1p has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.functions.log1p function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.functions.log1p function that generates this EWI. In this example, the log1p function is used to calculate the natural logarithm of the value column.

df = spark.createDataFrame([(0,), (1,), (10,), (100,)], ["value"])
df_with_log1p = df.withColumn("log1p_value", log1p(df["value"]))

Output

The SMA adds the EWI SPRKPY1023 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

df = spark.createDataFrame([(0,), (1,), (10,), (100,)], ["value"])
#EWI: SPRKPY1023 => pyspark.sql.functions.log1p has a workaround, see documentation for more info
df_with_log1p = df.withColumn("log1p_value", log1p(df["value"]))

Recommended fix

As a workaround, you can use the call_function function by passing the string ln as the first argument and by adding 1 to the second argument.

from snowflake.snowpark.functions import call_function, lit

df = spark.createDataFrame([(0,), (1,), (10,), (100,)], ["value"])
df_with_log1p = df.withColumn("log1p_value", call_function("ln", lit(1) + df["value"]))
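
The workaround relies on the identity log1p(x) = ln(1 + x), which you can verify locally with Python's math module (a pure-Python check, not Snowpark code):

```python
import math

# log1p adds 1 to its argument before taking the natural logarithm
for x in (0, 1, 10, 100):
    assert math.isclose(math.log1p(x), math.log(1 + x))
```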

Additional recommendations

SPRKPY1024

Message: pyspark.sql.functions.log2 has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.functions.log2 function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.functions.log2 function that generates this EWI. In this example, the log2 function is used to calculate the base-2 logarithm of the value column.

df = spark.createDataFrame([(1,), (2,), (4,), (8,), (16,)], ["value"])
df_with_log2 = df.withColumn("log2_value", log2(df["value"]))

Output

The SMA adds the EWI SPRKPY1024 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

df = spark.createDataFrame([(1,), (2,), (4,), (8,), (16,)], ["value"])
#EWI: SPRKPY1024 => pyspark.sql.functions.log2 has a workaround, see documentation for more info
df_with_log2 = df.withColumn("log2_value", log2(df["value"]))

Recommended fix

As a workaround, you can use the snowflake.snowpark.functions.log function by passing the literal value 2 as the base.

df = session.createDataFrame([(1,), (2,), (4,), (8,), (16,)], ["value"])
df_with_log2 = df.withColumn("log2_value", log(2, df["value"]))
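
As with log10, the workaround relies on the identity log2(x) = log(x, base 2), which you can verify locally with Python's math module (a pure-Python check, not Snowpark code):

```python
import math

# The base-2 logarithm equals the general logarithm with base 2
for x in (1, 2, 4, 8, 16):
    assert math.isclose(math.log2(x), math.log(x, 2))
```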

Additional recommendations

SPRKPY1025

Warning

This issue code is now deprecated.

Message: pyspark.sql.functions.ntile has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.functions.ntile function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.functions.ntile function that generates this EWI. In this example, the ntile function is used to divide the rows into 3 buckets.

df = spark.createDataFrame([("Alice", 50), ("Bob", 30), ("Charlie", 60), ("David", 90), ("Eve", 70), ("Frank", 40)], ["name", "score"])
windowSpec = Window.orderBy("score")
df_with_ntile = df.withColumn("bucket", ntile(3).over(windowSpec))

Output

The SMA adds the EWI SPRKPY1025 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

df = spark.createDataFrame([("Alice", 50), ("Bob", 30), ("Charlie", 60), ("David", 90), ("Eve", 70), ("Frank", 40)], ["name", "score"])
windowSpec = Window.orderBy("score")
#EWI: SPRKPY1025 => pyspark.sql.functions.ntile has a workaround, see documentation for more info
df_with_ntile = df.withColumn("bucket", ntile(3).over(windowSpec))

Recommended fix

Snowpark has an equivalent ntile function; however, the argument passed to it must be a column. As a workaround, you can convert the literal argument into a column using the snowflake.snowpark.functions.lit function.

df = spark.createDataFrame([("Alice", 50), ("Bob", 30), ("Charlie", 60), ("David", 90), ("Eve", 70), ("Frank", 40)], ["name", "score"])
windowSpec = Window.orderBy("score")
df_with_ntile = df.withColumn("bucket", ntile(lit(3)).over(windowSpec))
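
Independently of the lit wrapper, it may help to recall what NTILE(3) computes for the six rows above. This pure-Python sketch (an illustration of the window-function semantics, not Snowpark code) reproduces the bucket assignment over the scores ordered ascending:

```python
# NTILE(n) over rows ordered by score assigns bucket (i * n) // count + 1
# to the row at 0-based position i; earlier buckets absorb any extra rows.
scores = sorted([50, 30, 60, 90, 70, 40])
n = 3
buckets = [(i * n) // len(scores) + 1 for i in range(len(scores))]
print(list(zip(scores, buckets)))  # [(30, 1), (40, 1), (50, 2), (60, 2), (70, 3), (90, 3)]
```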

Additional recommendations

SPRKPY1026

Avertissement

This issue code has been deprecated since Spark Conversion Core 4.3.2

Message: pyspark.sql.readwriter.DataFrameReader.csv has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.readwriter.DataFrameReader.csv function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.readwriter.DataFrameReader.csv function that generates this EWI. In this example, the csv function is used to read multiple .csv files with a given schema and uses some extra options such as encoding, header and sep to fine-tune the behavior of reading the files.

file_paths = [
  "path/to/your/file1.csv",
  "path/to/your/file2.csv",
  "path/to/your/file3.csv",
]

df = session.read.csv(
  file_paths,
  schema=my_schema,
  encoding="UTF-8",
  header=True,
  sep=","
)

Output

The SMA adds the EWI SPRKPY1026 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

file_paths = [
  "path/to/your/file1.csv",
  "path/to/your/file2.csv",
  "path/to/your/file3.csv",
]

#EWI: SPRKPY1026 => pyspark.sql.readwriter.DataFrameReader.csv has a workaround, see documentation for more info
df = session.read.csv(
  file_paths,
  schema=my_schema,
  encoding="UTF-8",
  header=True,
  sep=","
)

Recommended fix

In this section, we explain how to configure the path parameter, the schema parameter and some options to make them work in Snowpark.

1. path parameter

Snowpark requires the path parameter to be a stage location so, as a workaround, you can create a temporary stage and add each .csv file to that stage using the prefix file://.

2. schema parameter

Snowpark does not allow defining the schema as a parameter of the csv function. As a workaround, you can use the snowflake.snowpark.DataFrameReader.schema function.

3. options parameters

Snowpark does not allow defining the extra options as parameters of the csv function. As a workaround, for many of them you can use the snowflake.snowpark.DataFrameReader.option function to specify those parameters as options of the DataFrameReader.

Note

The following options are not supported by Snowpark:

  • columnNameOfCorruptRecord

  • emptyValue

  • enforceSchema

  • header

  • ignoreLeadingWhiteSpace

  • ignoreTrailingWhiteSpace

  • inferSchema

  • locale

  • maxCharsPerColumn

  • maxColumns

  • mode

  • multiLine

  • nanValue

  • negativeInf

  • nullValue

  • positiveInf

  • quoteAll

  • samplingRatio

  • timestampNTZFormat

  • unescapedQuoteHandling

Below is the full example of what the input code should look like after applying the suggestions mentioned above so that it works in Snowpark:

stage = f'{session.get_fully_qualified_current_schema()}.{_generate_prefix("TEMP_STAGE")}'
session.sql(f'CREATE TEMPORARY STAGE IF NOT EXISTS {stage}')

session.file.put(f"file:///path/to/your/file1.csv", f"@{stage}")
session.file.put(f"file:///path/to/your/file2.csv", f"@{stage}")
session.file.put(f"file:///path/to/your/file3.csv", f"@{stage}")

df = session.read.schema(my_schema).option("encoding", "UTF-8").option("sep", ",").csv(stage)

Additional recommendations

SPRKPY1027

Warning

This issue code has been deprecated since Spark Conversion Core 4.5.2

Message: pyspark.sql.readwriter.DataFrameReader.json has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.readwriter.DataFrameReader.json function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.readwriter.DataFrameReader.json function that generates this EWI. In this example, the json function is used to read multiple .json files with a given schema and uses some extra options such as primitiveAsString and dateFormat to fine-tune the behavior of reading the files.

file_paths = [
  "path/to/your/file1.json",
  "path/to/your/file2.json",
  "path/to/your/file3.json",
]

df = session.read.json(
  file_paths,
  schema=my_schema,
  primitiveAsString=True,
  dateFormat="2023-06-20"
)

Output

The SMA adds the EWI SPRKPY1027 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

file_paths = [
  "path/to/your/file1.json",
  "path/to/your/file2.json",
  "path/to/your/file3.json",
]

#EWI: SPRKPY1027 => pyspark.sql.readwriter.DataFrameReader.json has a workaround, see documentation for more info
df = session.read.json(
  file_paths,
  schema=my_schema,
  primitiveAsString=True,
  dateFormat="2023-06-20"
)

Recommended fix

In this section, we explain how to configure the path parameter, the schema parameter and some options to make them work in Snowpark.

1. path parameter

Snowpark requires the path parameter to be a stage location so, as a workaround, you can create a temporary stage and add each .json file to that stage using the prefix file://.

2. schema parameter

Snowpark does not allow defining the schema as a parameter of the json function. As a workaround, you can use the snowflake.snowpark.DataFrameReader.schema function.

3. options parameters

Snowpark does not allow defining the extra options as parameters of the json function. As a workaround, for many of them you can use the snowflake.snowpark.DataFrameReader.option function to specify those parameters as options of the DataFrameReader.

Note

The following options are not supported by Snowpark:

  • allowBackslashEscapingAnyCharacter

  • allowComments

  • allowNonNumericNumbers

  • allowNumericLeadingZero

  • allowSingleQuotes

  • allowUnquotedControlChars

  • allowUnquotedFieldNames

  • columnNameOfCorruptRecord

  • dropFieldIfAllNull

  • encoding

  • ignoreNullFields

  • lineSep

  • locale

  • mode

  • multiline

  • prefersDecimal

  • primitiveAsString

  • samplingRatio

  • timestampNTZFormat

  • timeZone

Below is the full example of what the input code should look like after applying the suggestions mentioned above so that it works in Snowpark:

stage = f'{session.get_fully_qualified_current_schema()}.{_generate_prefix("TEMP_STAGE")}'
session.sql(f'CREATE TEMPORARY STAGE IF NOT EXISTS {stage}')

session.file.put(f"file:///path/to/your/file1.json", f"@{stage}")
session.file.put(f"file:///path/to/your/file2.json", f"@{stage}")
session.file.put(f"file:///path/to/your/file3.json", f"@{stage}")

df = session.read.schema(my_schema).option("dateFormat", "2023-06-20").json(stage)

Additional recommendations

SPRKPY1028

Message: pyspark.sql.readwriter.DataFrameReader.orc has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.readwriter.DataFrameReader.orc function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.readwriter.DataFrameReader.orc function that generates this EWI. In this example, the orc function is used to read multiple .orc files and uses some extra options such as mergeSchema and recursiveFileLookup to fine-tune the behavior of reading the files.

file_paths = [
  "path/to/your/file1.orc",
  "path/to/your/file2.orc",
  "path/to/your/file3.orc",
]

df = session.read.orc(
  file_paths,
  mergeSchema="True",
  recursiveFileLookup="True"
)

Output

The SMA adds the EWI SPRKPY1028 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

file_paths = [
  "path/to/your/file1.orc",
  "path/to/your/file2.orc",
  "path/to/your/file3.orc",
]

#EWI: SPRKPY1028 => pyspark.sql.readwriter.DataFrameReader.orc has a workaround, see documentation for more info
df = session.read.orc(
  file_paths,
  mergeSchema="True",
  recursiveFileLookup="True"
)

Recommended fix

In this section, we explain how to configure the path parameter and the extra options to make them work in Snowpark.

1. path parameter

Snowpark requires the path parameter to be a stage location so, as a workaround, you can create a temporary stage and add each .orc file to that stage using the prefix file://.

2. options parameters

Snowpark does not allow defining the extra options as parameters of the orc function. As a workaround, for many of them you can use the snowflake.snowpark.DataFrameReader.option function to specify those parameters as options of the DataFrameReader.

Note

The following options are not supported by Snowpark:

  • compression

  • mergeSchema

Below is the full example of what the input code should look like after applying the suggestions mentioned above so that it works in Snowpark:

stage = f'{session.get_fully_qualified_current_schema()}.{_generate_prefix("TEMP_STAGE")}'
session.sql(f'CREATE TEMPORARY STAGE IF NOT EXISTS {stage}')

session.file.put(f"file:///path/to/your/file1.orc", f"@{stage}")
session.file.put(f"file:///path/to/your/file2.orc", f"@{stage}")
session.file.put(f"file:///path/to/your/file3.orc", f"@{stage}")

df = session.read.option("recursiveFileLookup", "True").orc(stage)

Additional recommendations

SPRKPY1029

Message: This issue appears when the tool detects the usage of pyspark.sql.readwriter.DataFrameReader.parquet. This function is supported, but some of the differences between Snowpark and the Spark API might require making some manual changes.

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.readwriter.DataFrameReader.parquet function. This function is supported by Snowpark, however, there are some differences that would require some manual changes.

Scenario

Input

Below is an example of a use of the pyspark.sql.readwriter.DataFrameReader.parquet function that generates this EWI.

file_paths = [
  "path/to/your/file1.parquet",
  "path/to/your/file2.parquet",
  "path/to/your/file3.parquet",
]

df = session.read.parquet(
  *file_paths,
  mergeSchema="true",
  pathGlobFilter="*file*",
  recursiveFileLookup="true",
  modifiedBefore="2024-12-31T00:00:00",
  modifiedAfter="2023-12-31T00:00:00"
)

Output

The SMA adds the EWI SPRKPY1029 to the output code to let you know that this function is supported by Snowpark, but it requires some manual adjustments. Please note that the options supported by Snowpark are transformed into option function calls and those that are not supported are removed. This is explained in more detail in the next sections.

file_paths = [
  "path/to/your/file1.parquet",
  "path/to/your/file2.parquet",
  "path/to/your/file3.parquet"
]

#EWI: SPRKPY1076 => Some of the included parameters are not supported in the parquet function, the supported ones will be added into a option method.
#EWI: SPRKPY1029 => This issue appears when the tool detects the usage of pyspark.sql.readwriter.DataFrameReader.parquet. This function is supported, but some of the differences between Snowpark and the Spark API might require making some manual changes.
df = session.read.option("PATTERN", "*file*").parquet(
  *file_paths
)

Recommended fix

In this section, we explain how to configure the paths and options parameters to make them work in Snowpark.

1. paths parameter

In Spark, this parameter can be a local or cloud location. Snowpark only accepts cloud locations using a Snowflake stage. As a workaround, you can create a temporary stage and add each file to it using the prefix file://.

2. options parameters

Snowpark does not allow defining the different options as parameters of the parquet function. As a workaround, you can use the option or options functions to specify those parameters as extra options of the DataFrameReader.

Please note that the Snowpark options are not exactly the same as the PySpark options, so manual changes might be needed. Below is a more detailed explanation of how to configure the most common PySpark options in Snowpark.

2.1 mergeSchema option

Parquet supports schema evolution, allowing users to start with a simple schema and gradually add more columns as needed. This can result in multiple parquet files with different but compatible schemas. In Snowflake, thanks to the infer_schema capabilities you don’t need to do that and therefore the mergeSchema option can just be removed.

2.2 pathGlobFilter option

If you want to load only a subset of files from the stage, you can use the pattern option to specify a regular expression that matches the files you want to load. The SMA already automates this as you can see in the output of this scenario.

2.3 recursiveFileLookup option

This option is not supported by Snowpark. The best recommendation is to use a regular expression, as with the pathGlobFilter option, to achieve something similar.

2.4 modifiedBefore / modifiedAfter options

You can achieve the same result in Snowflake by using the metadata columns.

Note

The following options are not supported by Snowpark:

  • compression

  • datetimeRebaseMode

  • int96RebaseMode

  • mergeSchema

Below is the full example of how the input code should be transformed so that it works in Snowpark:

from snowflake.snowpark.column import METADATA_FILE_LAST_MODIFIED, METADATA_FILENAME

temp_stage = f'{session.get_fully_qualified_current_schema()}.{_generate_prefix("TEMP_STAGE")}'
session.sql(f'CREATE TEMPORARY STAGE IF NOT EXISTS {temp_stage}')

session.file.put(f"file:///path/to/your/file1.parquet", f"@{temp_stage}")
session.file.put(f"file:///path/to/your/file2.parquet", f"@{temp_stage}")
session.file.put(f"file:///path/to/your/file3.parquet", f"@{temp_stage}")

df = session.read \
  .option("PATTERN", ".*file.*") \
  .with_metadata(METADATA_FILENAME, METADATA_FILE_LAST_MODIFIED) \
  .parquet(temp_stage) \
  .where(METADATA_FILE_LAST_MODIFIED < '2024-12-31T00:00:00') \
  .where(METADATA_FILE_LAST_MODIFIED > '2023-12-31T00:00:00')

Additional recommendations

  • In Snowflake, you can leverage other approaches for parquet data ingestion, such as the following:

  • During a migration, it is advisable to use the SMA reports to try to build an inventory of the files and to determine, after modernization, which stages/tables the data will be mapped to.

  • For more support, you can email us at sma-support@snowflake.com or post an issue in the SMA.

SPRKPY1030

Warning

This issue code is now deprecated.

Message: pyspark.sql.session.SparkSession.Builder.appName has a workaround, see documentation for more info.

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.session.SparkSession.Builder.appName function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.session.SparkSession.Builder.appName function that generates this EWI. In this example, the appName function is used to set MyApp as the name of the application.

session = SparkSession.builder.appName("MyApp").getOrCreate()

Output

The SMA adds the EWI SPRKPY1030 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

#EWI: SPRKPY1030 => pyspark.sql.session.SparkSession.Builder.appName has a workaround, see documentation for more info
session = Session.builder.appName("MyApp").getOrCreate()

Recommended fix

As a workaround, you can import the snowpark_extensions package which provides an extension for the appName function.

import snowpark_extensions
session = SessionBuilder.appName("MyApp").getOrCreate()

Additional recommendations

SPRKPY1031

Warning

This issue code has been deprecated since Spark Conversion Core 2.7.0

Message: pyspark.sql.column.Column.contains has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.column.Column.contains function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.column.Column.contains function that generates this EWI. In this example, the contains function is used to filter the rows where the “City” column contains the substring “New”.

df = spark.createDataFrame([("Alice", "New York"), ("Bob", "Los Angeles"), ("Charlie", "Chicago")], ["Name", "City"])
df_filtered = df.filter(col("City").contains("New"))

Output

The SMA adds the EWI SPRKPY1031 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

df = spark.createDataFrame([("Alice", "New York"), ("Bob", "Los Angeles"), ("Charlie", "Chicago")], ["Name", "City"])
#EWI: SPRKPY1031 => pyspark.sql.column.Column.contains has a workaround, see documentation for more info
df_filtered = df.filter(col("City").contains("New"))

Recommended fix

As a workaround, you can use the snowflake.snowpark.functions.contains function by passing the column as the first argument and the element to search as the second argument. If the element to search is a literal value then it should be converted into a column expression using the lit function.

from snowflake.snowpark import functions as f
df = spark.createDataFrame([("Alice", "New York"), ("Bob", "Los Angeles"), ("Charlie", "Chicago")], ["Name", "City"])
df_filtered = df.filter(f.contains(col("City"), f.lit("New")))

Additional recommendations

SPRKPY1032

Message: *spark element* is not defined

Category: Conversion error

Description

This issue appears when the SMA could not determine an appropriate mapping status for the given element. This means that the SMA does not yet know whether this element is supported by Snowpark. Please note that this is a generic error code used by the SMA for any element that is not defined.

Scenario

Input

Below is an example of a function for which the SMA could not determine an appropriate mapping status. In this case, you should assume that not_defined_function() is a valid PySpark function and the code runs.

sc.parallelize(["a", "b", "c", "d", "e"], 3).not_defined_function().collect()

Output

The SMA adds the EWI SPRKPY1032 to the output code to let you know that this element is not defined.

#EWI: SPRKPY1032 => pyspark.rdd.RDD.not_defined_function is not defined
sc.parallelize(["a", "b", "c", "d", "e"], 3).not_defined_function().collect()

Recommended fix

To try to identify the problem, you can perform the following validations:

  • Check that the source code syntax is correct and that it is spelled correctly.

  • Check if you are using a PySpark version supported by the SMA. To know which PySpark version is supported by the SMA at the moment of running the SMA, you can review the first page of the DetailedReport.docx file.

If this is a valid PySpark element, please report that you encountered a conversion error on that particular element using the Report an Issue option of the SMA and include any additional information that you think may be helpful.

Please note that if an element is not defined, it does not mean that it is not supported by Snowpark. You should check the Snowpark documentation to verify whether an equivalent element exists.

Additional recommendations

SPRKPY1033

Warning

This issue code is now deprecated.

Message: pyspark.sql.functions.asc has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.functions.asc function, which has a workaround.

Scenarios

The pyspark.sql.functions.asc function takes either a column object or the name of the column as a string as its parameter. Neither scenario is supported by Snowpark, so this EWI is generated.

Scenario 1

Input

Below is an example of a use of the pyspark.sql.functions.asc function that takes a column object as parameter.

df.orderBy(asc(col))

Output

The SMA adds the EWI SPRKPY1033 to the output code to let you know that the asc function with a column object parameter is not directly supported by Snowpark, but it has a workaround.

#EWI: SPRKPY1033 => pyspark.sql.functions.asc has a workaround, see documentation for more info
df.orderBy(asc(col))

Recommended fix

As a workaround, you can call the snowflake.snowpark.Column.asc function from the column parameter.

df.orderBy(col.asc())

Scenario 2

Input

Below is an example of a use of the pyspark.sql.functions.asc function that takes the name of the column as parameter.

df.orderBy(asc("colName"))

Output

The SMA adds the EWI SPRKPY1033 to the output code to let you know that the asc function with a column name parameter is not directly supported by Snowpark, but it has a workaround.

#EWI: SPRKPY1033 => pyspark.sql.functions.asc has a workaround, see documentation for more info
df.orderBy(asc("colName"))

Recommended fix

As a workaround, you can convert the string parameter into a column object using the snowflake.snowpark.functions.col function and then call the snowflake.snowpark.Column.asc function.

df.orderBy(col("colName").asc())

Additional recommendations

SPRKPY1034

Warning

This issue code is now deprecated.

Message: pyspark.sql.functions.desc has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.functions.desc function, which has a workaround.

Scenarios

The pyspark.sql.functions.desc function takes either a column object or the name of the column as a string as its parameter. Neither scenario is supported by Snowpark, so this EWI is generated.

Scenario 1

Input

Below is an example of a use of the pyspark.sql.functions.desc function that takes a column object as parameter.

df.orderBy(desc(col))

Output

The SMA adds the EWI SPRKPY1034 to the output code to let you know that the desc function with a column object parameter is not directly supported by Snowpark, but it has a workaround.

#EWI: SPRKPY1034 => pyspark.sql.functions.desc has a workaround, see documentation for more info
df.orderBy(desc(col))

Correction recommandée

As a workaround, you can call the snowflake.snowpark.Column.desc function from the column parameter.

df.orderBy(col.desc())
Scénario 2

Entrée

Below is an example of a use of the pyspark.sql.functions.desc function that takes the name of the column as parameter.

df.orderBy(desc("colName"))

Sortie

The SMA adds the EWI SPRKPY1034 to the output code to let you know that the desc function with a column name parameter is not directly supported by Snowpark, but it has a workaround.

#EWI: SPRKPY1034 => pyspark.sql.functions.desc has a workaround, see documentation for more info
df.orderBy(desc("colName"))

Correction recommandée

As a workaround, you can convert the string parameter into a column object using the snowflake.snowpark.functions.col function and then call the snowflake.snowpark.Column.desc function.

df.orderBy(col("colName").desc())

Recommandations supplémentaires

SPRKPY1035

Warning

This issue code is now deprecated.

Message: pyspark.sql.functions.reverse has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.functions.reverse function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.functions.reverse function that generates this EWI. In this example, the reverse function is used to reverse each string of the word column.

df = spark.createDataFrame([("hello",), ("world",)], ["word"])
df_reversed = df.withColumn("reversed_word", reverse(df["word"]))
df_reversed = df.withColumn("reversed_word", reverse("word"))

Output

The SMA adds the EWI SPRKPY1035 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

df = spark.createDataFrame([("hello",), ("world",)], ["word"])
#EWI: SPRKPY1035 => pyspark.sql.functions.reverse has a workaround, see documentation for more info
df_reversed = df.withColumn("reversed_word", reverse(df["word"]))
#EWI: SPRKPY1035 => pyspark.sql.functions.reverse has a workaround, see documentation for more info
df_reversed = df.withColumn("reversed_word", reverse("word"))

Recommended fix

As a workaround, you can import the snowpark_extensions package, which provides an extension for the reverse function.

import snowpark_extensions

df = spark.createDataFrame([("hello",), ("world",)], ["word"])
df_reversed = df.withColumn("reversed_word", reverse(df["word"]))
df_reversed = df.withColumn("reversed_word", reverse("word"))

Additional recommendations

SPRKPY1036

Warning

This issue code is now deprecated.

Message: pyspark.sql.column.Column.getField has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.column.Column.getField function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.column.Column.getField function that generates this EWI. In this example, the getField function is used to extract the name from the info column.

df = spark.createDataFrame([(1, {"name": "John", "age": 30}), (2, {"name": "Jane", "age": 25})], ["id", "info"])
df_with_name = df.withColumn("name", col("info").getField("name"))

Output

The SMA adds the EWI SPRKPY1036 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

df = spark.createDataFrame([(1, {"name": "John", "age": 30}), (2, {"name": "Jane", "age": 25})], ["id", "info"])
#EWI: SPRKPY1036 => pyspark.sql.column.Column.getField has a workaround, see documentation for more info
df_with_name = df.withColumn("name", col("info").getField("name"))

Recommended fix

As a workaround, you can use the Snowpark column indexer operator with the name of the field as the index.

df = spark.createDataFrame([(1, {"name": "John", "age": 30}), (2, {"name": "Jane", "age": 25})], ["id", "info"])
df_with_name = df.withColumn("name", col("info")["name"])

Additional recommendations

SPRKPY1037

Warning

This issue code is now deprecated.

Message: pyspark.sql.functions.sort_array has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.functions.sort_array function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.functions.sort_array function that generates this EWI. In this example, the sort_array function is used to sort the numbers array in ascending and descending order.

df = spark.createDataFrame([(1, [3, 1, 2]), (2, [10, 5, 8]), (3, [6, 4, 7])], ["id", "numbers"])
df_sorted_asc = df.withColumn("sorted_numbers_asc", sort_array("numbers", asc=True))
df_sorted_desc = df.withColumn("sorted_numbers_desc", sort_array("numbers", asc=False))

Output

The SMA adds the EWI SPRKPY1037 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

df = spark.createDataFrame([(1, [3, 1, 2]), (2, [10, 5, 8]), (3, [6, 4, 7])], ["id", "numbers"])
#EWI: SPRKPY1037 => pyspark.sql.functions.sort_array has a workaround, see documentation for more info
df_sorted_asc = df.withColumn("sorted_numbers_asc", sort_array("numbers", asc=True))
#EWI: SPRKPY1037 => pyspark.sql.functions.sort_array has a workaround, see documentation for more info
df_sorted_desc = df.withColumn("sorted_numbers_desc", sort_array("numbers", asc=False))

Recommended fix

As a workaround, you can import the snowpark_extensions package, which provides an extension for the sort_array function.

import snowpark_extensions

df = spark.createDataFrame([(1, [3, 1, 2]), (2, [10, 5, 8]), (3, [6, 4, 7])], ["id", "numbers"])
df_sorted_asc = df.withColumn("sorted_numbers_asc", sort_array("numbers", asc=True))
df_sorted_desc = df.withColumn("sorted_numbers_desc", sort_array("numbers", asc=False))

Additional recommendations

SPRKPY1038

Message: *spark element* is not yet recognized

Category: Conversion error

Description

This issue appears when your source code contains a PySpark element that was not recognized by the SMA. This can happen for different reasons, such as:

  • An element that does not exist in PySpark.

  • An element that was added in a version of PySpark that the SMA does not yet support.

  • An internal SMA error while processing the element.

This is the generic error code used by the SMA for any unrecognized element.

Scenario

Input

Below is an example of the usage of a function that could not be recognized by the SMA because it does not exist in PySpark.

from pyspark.sql import functions as F
F.unrecognized_function()

Output

The SMA adds the EWI SPRKPY1038 to the output code to let you know that this element could not be recognized.

from snowflake.snowpark import functions as F
#EWI: SPRKPY1038 => pyspark.sql.functions.unrecognized_function is not yet recognized
F.unrecognized_function()

Recommended fix

To try to identify the problem, you can perform the following validations:

  • Check whether the element exists in PySpark.

  • Check that the element is spelled correctly.

  • Check whether you are using a PySpark version supported by the SMA. To know which PySpark version is supported at the moment of running the SMA, you can review the first page of the DetailedReport.docx file.

If it is a valid PySpark element, please report that you encountered a conversion error on that particular element using the Report an Issue option of the SMA and include any additional information that you think may be helpful.

Please note that if an element could not be recognized by the SMA, it does not mean that it is not supported by Snowpark. You should check the Snowpark documentation to verify whether an equivalent element exists.

Additional recommendations

SPRKPY1039

Warning

This issue code is now deprecated.

Message: pyspark.sql.column.Column.getItem has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.column.Column.getItem function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.column.Column.getItem function that generates this EWI. In this example, the getItem function is used to get an item by position and by key.

df = spark.createDataFrame([(1, ["apple", "banana", "orange"]), (2, ["carrot", "avocado", "banana"])], ["id", "fruits"])
df.withColumn("first_fruit", col("fruits").getItem(0))

df = spark.createDataFrame([(1, {"apple": 10, "banana": 20}), (2, {"carrot": 15, "grape": 25}), (3, {"pear": 30, "apple": 35})], ["id", "fruit_quantities"])
df.withColumn("apple_quantity", col("fruit_quantities").getItem("apple"))

Output

The SMA adds the EWI SPRKPY1039 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

df = spark.createDataFrame([(1, ["apple", "banana", "orange"]), (2, ["carrot", "avocado", "banana"])], ["id", "fruits"])
#EWI: SPRKPY1039 => pyspark.sql.column.Column.getItem has a workaround, see documentation for more info
df.withColumn("first_fruit", col("fruits").getItem(0))

df = spark.createDataFrame([(1, {"apple": 10, "banana": 20}), (2, {"carrot": 15, "grape": 25}), (3, {"pear": 30, "apple": 35})], ["id", "fruit_quantities"])
#EWI: SPRKPY1039 => pyspark.sql.column.Column.getItem has a workaround, see documentation for more info
df.withColumn("apple_quantity", col("fruit_quantities").getItem("apple"))

Recommended fix

As a workaround, you can use the Snowpark column indexer operator with the name or position of the field as the index.

df = spark.createDataFrame([(1, ["apple", "banana", "orange"]), (2, ["carrot", "avocado", "banana"])], ["id", "fruits"])
df.withColumn("first_fruit", col("fruits")[0])

df = spark.createDataFrame([(1, {"apple": 10, "banana": 20}), (2, {"carrot": 15, "grape": 25}), (3, {"pear": 30, "apple": 35})], ["id", "fruit_quantities"])
df.withColumn("apple_quantity", col("fruit_quantities")["apple"])

Additional recommendations

SPRKPY1040

Warning

This issue code is now deprecated.

Message: pyspark.sql.functions.explode has a workaround, see documentation for more info

Category: Warning

Description

This issue appears when the SMA detects a use of the pyspark.sql.functions.explode function, which has a workaround.

Scenario

Input

Below is an example of a use of the pyspark.sql.functions.explode function that generates this EWI. In this example, the explode function is used to generate one row per array item for the numbers column.

df = spark.createDataFrame([("Alice", [1, 2, 3]), ("Bob", [4, 5]), ("Charlie", [6, 7, 8, 9])], ["name", "numbers"])
exploded_df = df.select("name", explode(df.numbers).alias("number"))

Output

The SMA adds the EWI SPRKPY1040 to the output code to let you know that this function is not directly supported by Snowpark, but it has a workaround.

df = spark.createDataFrame([("Alice", [1, 2, 3]), ("Bob", [4, 5]), ("Charlie", [6, 7, 8, 9])], ["name", "numbers"])
#EWI: SPRKPY1040 => pyspark.sql.functions.explode has a workaround, see documentation for more info
exploded_df = df.select("name", explode(df.numbers).alias("number"))

Recommended fix

As a workaround, you can import the snowpark_extensions package, which provides an extension for the explode function.

import snowpark_extensions

df = spark.createDataFrame([("Alice", [1, 2, 3]), ("Bob", [4, 5]), ("Charlie", [6, 7, 8, 9])], ["name", "numbers"])
exploded_df = df.select("name", explode(df.numbers).alias("number"))

Additional recommendations

SPRKPY1041

Warning

This issue code has been deprecated since Spark Conversion Core Version 2.9.0

Message: pyspark.sql.functions.explode_outer has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.explode_outer, which has a workaround.

Scenario

Input

The example shows the usage of the explode_outer method in a select call.

df = spark.createDataFrame(
    [(1, ["foo", "bar"], {"x": 1.0}),
     (2, [], {}),
     (3, None, None)],
    ("id", "an_array", "a_map")
)

df.select("id", "an_array", explode_outer("a_map")).show()

Output

The tool adds the EWI SPRKPY1041 indicating that a workaround can be implemented.

df = spark.createDataFrame(
    [(1, ["foo", "bar"], {"x": 1.0}),
     (2, [], {}),
     (3, None, None)],
    ("id", "an_array", "a_map")
)

#EWI: SPRKPY1041 => pyspark.sql.functions.explode_outer has a workaround, see documentation for more info
df.select("id", "an_array", explode_outer("a_map")).show()

Recommended fix

As a workaround, you can import the snowpark_extensions package, which contains a helper for the explode_outer function.

import snowpark_extensions

df = spark.createDataFrame(
    [(1, ["foo", "bar"], {"x": 1.0}),
     (2, [], {}),
     (3, None, None)],
    ("id", "an_array", "a_map")
)

df.select("id", "an_array", explode_outer("a_map")).show()

Additional recommendations

SPRKPY1042

Message: pyspark.sql.functions.posexplode has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.posexplode, which has a workaround.

Scenarios

There are two scenarios this method can handle, depending on the type of column passed as a parameter: it can be a list of values or a map/dictionary (keys/values).

Scenario 1

Input

Below is an example of the usage of posexplode passing a list of values as a parameter.

df = spark.createDataFrame(
    [Row(a=1,
         intlist=[1, 2, 3])])

df.select(posexplode(df.intlist)).collect()

Output

The tool adds the EWI SPRKPY1042 indicating that a workaround can be implemented.

df = spark.createDataFrame(
    [Row(a=1,
         intlist=[1, 2, 3])])

#EWI: SPRKPY1042 => pyspark.sql.functions.posexplode has a workaround, see documentation for more info
df.select(posexplode(df.intlist)).collect()

Recommended fix

To get the same behavior, use the functions.flatten method, drop the extra columns, and rename the index and value columns.

df = spark.createDataFrame(
  [Row(a=1,
       intlist=[1, 2, 3])])

df.select(
    flatten(df.intlist))\
    .drop("DATA", "SEQ", "KEY", "PATH", "THIS")\
    .rename({"INDEX": "pos", "VALUE": "col"}).show()

Scenario 2

Input

Below is another example of the usage of posexplode passing a map/dictionary (keys/values) as a parameter.

df = spark.createDataFrame([
    [1, [1, 2, 3], {"Ashi Garami": "Single Leg X"}, "Kimura"],
    [2, [11, 22], {"Sankaku": "Triangle"}, "Coffee"]
],
schema=["idx", "lists", "maps", "strs"])

df.select(posexplode(df.maps)).show()

Output

The tool adds the EWI SPRKPY1042 indicating that a workaround can be implemented.

df = spark.createDataFrame([
    [1, [1, 2, 3], {"Ashi Garami": "Single Leg X"}, "Kimura"],
    [2, [11, 22], {"Sankaku": "Triangle"}, "Coffee"]
],
schema=["idx", "lists", "maps", "strs"])

#EWI: SPRKPY1042 => pyspark.sql.functions.posexplode has a workaround, see documentation for more info
df.select(posexplode(df.maps)).show()

Recommended fix

As a workaround, you can use functions.row_number to get the position and functions.explode with the name of the field to get the key/value for dictionaries.

df = spark.createDataFrame([
    [1, [1, 2, 3], {"Ashi Garami": "Single Leg X"}, "Kimura"],
    [2, [11, 22], {"Sankaku": "Triangle"}, "Coffee"]
],
    schema=["idx", "lists", "maps", "strs"])

window = Window.orderBy(col("idx").asc())

df.select(
    row_number().over(window).alias("pos"),
    explode(df.maps).alias("key", "value")).show()

Note: the use of row_number is not fully equivalent, because it starts at 1 (not at zero like the Spark method).
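If the zero-based positions matter, you can align the two by subtracting 1 from the row number in the workaround above (for example, (row_number().over(window) - 1).alias("pos")). The plain-Python sketch below only illustrates the offset; it is not Snowpark code:

```python
# Illustration only: posexplode positions are zero-based, while
# row_number() is one-based, so subtracting 1 aligns them.
values = ["a", "b", "c"]

spark_posexplode = list(enumerate(values))                     # zero-based, like posexplode
row_number_based = [(i + 1, v) for i, v in enumerate(values)]  # one-based, like row_number()
aligned = [(pos - 1, v) for pos, v in row_number_based]

print(aligned == spark_posexplode)  # True
```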

Additional recommendations

SPRKPY1043

Message: pyspark.sql.functions.posexplode_outer has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.posexplode_outer, which has a workaround.

Scenarios

There are two scenarios this method can handle, depending on the type of column passed as a parameter: it can be a list of values or a map/dictionary (keys/values).

Scenario 1

Input

Below is an example that shows the usage of posexplode_outer passing a list of values.

df = spark.createDataFrame(
    [
        (1, ["foo", "bar"]),
        (2, []),
        (3, None)],
    ("id", "an_array"))

df.select("id", "an_array", posexplode_outer("an_array")).show()

Output

The tool adds the EWI SPRKPY1043 indicating that a workaround can be implemented.

df = spark.createDataFrame(
    [
        (1, ["foo", "bar"]),
        (2, []),
        (3, None)],
    ("id", "an_array"))

#EWI: SPRKPY1043 => pyspark.sql.functions.posexplode_outer has a workaround, see documentation for more info
df.select("id", "an_array", posexplode_outer("an_array")).show()

Recommended fix

To get the same behavior, use the functions.flatten method with the outer parameter set to True, drop the extra columns, and rename the index and value columns.

df = spark.createDataFrame(
    [
        (1, ["foo", "bar"]),
        (2, []),
        (3, None)],
    ("id", "an_array"))

df.select(
    flatten(df.an_array, outer=True))\
    .drop("DATA", "SEQ", "KEY", "PATH", "THIS")\
    .rename({"INDEX": "pos", "VALUE": "col"}).show()

Scenario 2

Input

Below is another example of the usage of posexplode_outer passing a map/dictionary (keys/values).

df = spark.createDataFrame(
    [
        (1, {"x": 1.0}),
        (2, {}),
        (3, None)],
    ("id", "a_map"))

df.select(posexplode_outer(df.a_map)).show()

Output

The tool adds the EWI SPRKPY1043 indicating that a workaround can be implemented.

df = spark.createDataFrame(
    [
        (1, {"x": 1.0}),
        (2, {}),
        (3, None)],
    ("id", "a_map"))

#EWI: SPRKPY1043 => pyspark.sql.functions.posexplode_outer has a workaround, see documentation for more info
df.select(posexplode_outer(df.a_map)).show()

Recommended fix

As a workaround, you can use functions.row_number to get the position and functions.explode_outer with the name of the field to get the key/value for dictionaries.

df = spark.createDataFrame(
    [
        (1, {"x": 1.0}),
        (2, {}),
        (3, None)],
    ("id", "a_map"))

window = Window.orderBy(col("id").asc())

df.select(
    row_number().over(window).alias("pos"),
    explode_outer(df.a_map)).show()

Note: the use of row_number is not fully equivalent, because it starts at 1 (not at zero like the Spark method).

Additional recommendations

SPRKPY1044

Warning

This issue code has been deprecated since Spark Conversion Core Version 2.4.0

Message: pyspark.sql.functions.split has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.split, which has a workaround.

Scenarios

There are several scenarios depending on the number of parameters passed to the method.

Scenario 1

Input

Below is an example where the split function receives just the str and pattern parameters.

F.split('col', '\\|')

Output

The tool shows the EWI SPRKPY1044 indicating there is a workaround.

#EWI: SPRKPY1044 => pyspark.sql.functions.split has a workaround, see the documentation for more info
F.split('col', '\\|')

Recommended fix

As a workaround, you can wrap the pattern parameter in a call to snowflake.snowpark.functions.lit and pass the result to split.

F.split('col', lit('\\|'))
## the result of lit will be sent to the split function

Scenario 2

Input

Below is another example where the split function receives the str, pattern, and limit parameters.

F.split('col', '\\|', 2)

Output

The tool shows the EWI SPRKPY1044 indicating there is a workaround.

#EWI: SPRKPY1044 => pyspark.sql.functions.split has a workaround, see the documentation for more info
F.split('col', '\\|', 2)

Recommended fix

This specific scenario is not supported.
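If you need the limit behavior, one possible approach (not generated by the SMA) is to reproduce Spark's semantics in your own Python UDF. The pure-Python sketch below shows the semantics for limit > 0; the split_with_limit helper name is illustrative:

```python
import re

def split_with_limit(value: str, pattern: str, limit: int) -> list:
    # Spark's split(str, pattern, limit) with limit > 0 returns at most
    # `limit` elements; the final element contains any remaining input.
    # re.split's maxsplit counts splits, so it is limit - 1.
    return re.split(pattern, value, maxsplit=limit - 1)

print(split_with_limit("a|b|c", r"\|", 2))  # ['a', 'b|c']
```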

Additional recommendations

SPRKPY1045

Message: pyspark.sql.functions.map_values has a workaround

Category: Warning

Description

This function is used to extract the list of values from a column that contains a map/dictionary (keys/values).

The issue appears when the tool detects the usage of pyspark.sql.functions.map_values, which has a workaround.

Scenario

Input

Below is an example of the usage of the method map_values.

df = spark.createDataFrame(
    [(1, {'Apple': 'Fruit', 'Potato': 'Vegetable'})],
    ("id", "a_map"))

df.select(map_values("a_map")).show()

Output

The tool adds the EWI SPRKPY1045 indicating that a workaround can be implemented.

df = spark.createDataFrame(
    [(1, {'Apple': 'Fruit', 'Potato': 'Vegetable'})],
    ("id", "a_map"))

#EWI: SPRKPY1045 => pyspark.sql.functions.map_values has a workaround, see documentation for more info
df.select(map_values("a_map")).show()

Recommended fix

As a workaround, you can create a UDF to get the values of a column. The example below shows how to create the UDF, assign it to F.map_values, and then use it.

from snowflake.snowpark import functions as F
from snowflake.snowpark.types import ArrayType, MapType

map_values_udf = None

def map_values(map):
    global map_values_udf
    if not map_values_udf:
        def _map_values(map: dict) -> list:
            return list(map.values())
        map_values_udf = F.udf(_map_values, return_type=ArrayType(), input_types=[MapType()], name="map_values", is_permanent=False, replace=True)
    return map_values_udf(map)

F.map_values = map_values

df.select(map_values(colDict))

Additional recommendations

SPRKPY1046

Warning

This issue code has been deprecated since Spark Conversion Core Version 2.1.22

Message: pyspark.sql.functions.monotonically_increasing_id has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.monotonically_increasing_id, which has a workaround.

Scenario

Input

Below is an example of the usage of the method monotonically_increasing_id.

from pyspark.sql import functions as F

spark.range(0, 10, 1, 2).select(F.monotonically_increasing_id()).show()

Output

The tool adds the EWI SPRKPY1046 indicating that a workaround can be implemented.

from pyspark.sql import functions as F
#EWI: SPRKPY1046 => pyspark.sql.functions.monotonically_increasing_id has a workaround, see documentation for more info
spark.range(0, 10, 1, 2).select(F.monotonically_increasing_id()).show()

Recommended fix

Update the version of the tool.

Additional recommendations

SPRKPY1047

Warning

This issue code has been deprecated since Spark Conversion Core Version 4.6.0

Description

This issue appears when the tool detects the usage of pyspark.context.SparkContext.setLogLevel, which has a workaround.

Scenario

Input

Below is an example of the usage of the method setLogLevel.

sparkSession.sparkContext.setLogLevel("WARN")

Output

The tool adds the EWI SPRKPY1047 indicating that a workaround can be implemented.

#EWI: SPRKPY1047 => pyspark.context.SparkContext.setLogLevel has a workaround, see documentation for more info
sparkSession.sparkContext.setLogLevel("WARN")

Recommended fix

Replace the setLogLevel call with logging.basicConfig, which provides a set of convenience functions for simple logging. To use it, import the "logging" and "sys" modules, and replace the level constant using the level equivalence table:

import logging
import sys
logging.basicConfig(stream=sys.stdout, level=logging.WARNING)
  • Level equivalence table

    Source level parameter    Target level parameter
    "ALL"                     This has no equivalent
    "DEBUG"                   logging.DEBUG
    "ERROR"                   logging.ERROR
    "FATAL"                   logging.CRITICAL
    "INFO"                    logging.INFO
    "OFF"                     logging.NOTSET
    "TRACE"                   This has no equivalent
    "WARN"                    logging.WARNING
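The level equivalence table above can be captured in a small helper so that existing setLogLevel call sites need only minimal changes. This is a sketch, not part of the SMA output; the mappings chosen for "ALL" and "TRACE" are approximations, since Python logging has no exact equivalent for them:

```python
import logging
import sys

# Spark log-level strings mapped to Python logging levels, following the
# equivalence table above. "ALL" and "TRACE" have no exact equivalent, so
# they are approximated here with the most verbose levels available.
SPARK_TO_LOGGING_LEVEL = {
    "ALL": logging.NOTSET,    # approximation: no exact equivalent
    "DEBUG": logging.DEBUG,
    "ERROR": logging.ERROR,
    "FATAL": logging.CRITICAL,
    "INFO": logging.INFO,
    "OFF": logging.NOTSET,
    "TRACE": logging.DEBUG,   # approximation: no exact equivalent
    "WARN": logging.WARNING,
}

def set_log_level(level_name: str) -> None:
    # Drop-in style replacement for sparkContext.setLogLevel(level_name).
    logging.basicConfig(stream=sys.stdout,
                        level=SPARK_TO_LOGGING_LEVEL[level_name.upper()])

set_log_level("WARN")
```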

Additional recommendations

SPRKPY1048

Warning

This issue code has been deprecated since Spark Conversion Core Version 2.4.0

Message: pyspark.sql.session.SparkSession.conf has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.session.SparkSession.conf, which has a workaround.

Scenario

Input

Below is an example of how to set a configuration in the conf property.

spark.conf.set("spark.sql.crossJoin.enabled", "true")

Output

The tool adds the EWI SPRKPY1048 indicating that a workaround can be implemented.

#EWI: SPRKPY1048 => pyspark.sql.session.SparkSession.conf has a workaround, see documentation for more info
spark.conf.set("spark.sql.crossJoin.enabled", "true")

Recommended fix

SparkSession.conf is used to pass specific settings used only by PySpark and does not apply to Snowpark. You can remove or comment out the code.

#spark.conf.set("spark.sql.crossJoin.enabled", "true")

Additional recommendations

SPRKPY1049

Warning

This issue code has been deprecated since Spark Conversion Core Version 2.1.9

Message: pyspark.sql.session.SparkSession.sparkContext has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.session.SparkSession.sparkContext, which has a workaround.

Scenario

Input

Below is an example that creates a Spark session and then uses the sparkContext property to print the appName.

print("APP Name :"+spark.sparkContext.appName())

Output

The tool adds the EWI SPRKPY1049 indicating that a workaround can be implemented.

#EWI: SPRKPY1049 => pyspark.sql.session.SparkSession.sparkContext has a workaround, see documentation for more info
print("APP Name :"+spark.sparkContext.appName())

Recommended fix

SparkContext is not supported in Snowpark, but you can access the methods and properties of SparkContext directly from the session instance by removing the sparkContext reference.

## Pyspark
print("APP Name :"+spark.sparkContext.appName())

## Manual adjustment in Snowpark: remove the sparkContext reference
print("APP Name :"+spark.appName())

Additional recommendations

SPRKPY1050

Message: pyspark.conf.SparkConf.set has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.conf.SparkConf.set, which has a workaround.

Scenario

Input

Below is an example that sets a variable using conf.set.

conf = SparkConf().setAppName('my_app')

conf.set("spark.storage.memoryFraction", "0.5")

Output

The tool adds the EWI SPRKPY1050 indicating that a workaround can be implemented.

conf = SparkConf().setAppName('my_app')

#EWI: SPRKPY1050 => pyspark.conf.SparkConf.set has a workaround, see documentation for more info
conf.set("spark.storage.memoryFraction", "0.5")

Recommended fix

SparkConf.set is used to set a configuration setting used only by PySpark and does not apply to Snowpark. You can remove or comment out the code.

#conf.set("spark.storage.memoryFraction", "0.5")

Additional recommendations

SPRKPY1051

Warning

This issue code has been deprecated since Spark Conversion Core Version 2.4.0

Message: pyspark.sql.session.SparkSession.Builder.master has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.session.SparkSession.Builder.master, which has a workaround.

Scenario

Input

Below is an example of the usage of the builder.master method to set the Spark master URL to connect locally using 1 core.

spark = SparkSession.builder.master("local[1]")

Output

The tool adds the EWI SPRKPY1051 indicating that a workaround can be implemented.

#EWI: SPRKPY1051 => pyspark.sql.session.SparkSession.Builder.master has a workaround, see documentation for more info
spark = Session.builder.master("local[1]")

Recommended fix

pyspark.sql.session.SparkSession.Builder.master is used to set up a Spark cluster. Snowpark does not use Spark clusters, so you can remove or comment out the code.

## spark = Session.builder.master("local[1]")

Additional recommendations

SPRKPY1052

Warning

This issue code has been deprecated since Spark Conversion Core Version 2.8.0

Message: pyspark.sql.session.SparkSession.Builder.enableHiveSupport has a workaround

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.session.SparkSession.Builder.enableHiveSupport, which has a workaround.

Scenario

Input

Below is an example that configures the SparkSession and enables Hive support using the enableHiveSupport method.

spark = Session.builder.appName("Merge_target_table")\
        .config("spark.port.maxRetries","100") \
        .enableHiveSupport().getOrCreate()

Output

The tool adds the EWI SPRKPY1052 indicating that a workaround can be implemented.

#EWI: SPRKPY1052 => pyspark.sql.session.SparkSession.Builder.enableHiveSupport has a workaround, see documentation for more info
spark = Session.builder.appName("Merge_target_table")\
        .config("spark.port.maxRetries","100") \
        .enableHiveSupport().getOrCreate()

Recommended fix

Remove the use of the enableHiveSupport function, because it is not needed in Snowpark.

spark = Session.builder.appName("Merge_target_table")\
        .config("spark.port.maxRetries","100") \
        .getOrCreate()

Additional recommendations

SPRKPY1053

Message: An error occurred when extracting the dbc files.

Category: Warning

Description

This issue appears when a dbc file cannot be extracted. This warning can be caused by one or more of the following reasons: the file is too large, inaccessible, read-only, etc.

Additional recommendations

  • As a workaround, check the size of the file in case it is too large to be processed. Also verify whether the tool can access it, to avoid any access issues.

  • For more support, you can email us at snowconvert-info@snowflake.com. If you have a support contract with Snowflake, reach out to your sales engineer, who can direct your support needs.

SPRKPY1080

Message : La valeur de SparkContext est remplacée par la variable “session”.

Catégorie : Avertissement

Description

Le contexte de Spark est stocké dans une variable appelée session qui crée une session Snowpark.

Scénario

Entrée

Cet extrait décrit un SparkContext.

## Input Code
from pyspark import SparkContext
from pyspark.sql import SparkSession

def example1():

    sc = SparkContext("local[*]", "TestApp")

    sc.setLogLevel("ALL")
    sc.setLogLevel("DEBUG")

Sortie

Dans ce code de sortie, SMA a remplacé PySpark.SparkContext par SparkSession. Notez que l’outil SMA ajoute également un modèle pour remplacer la connexion dans le fichier « connection.json » et charge ensuite cette configuration dans la variable connection_parameter.

## Output Code
import logging
import sys
import json
from snowflake.snowpark import Session

def example1():
    jsonFile = open("connection.json")
    connection_parameter = json.load(jsonFile)
    jsonFile.close()
    #EWI: SPRKPY1080 => The value of SparkContext is replaced with 'session' variable.
    sc = Session.builder.configs(connection_parameter).getOrCreate()
    sc.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
    logging.basicConfig(stream = sys.stdout, level = logging.NOTSET)
    logging.basicConfig(stream = sys.stdout, level = logging.DEBUG)

Recommended fix

The "connection.json" configuration file must be updated with the required connection information:

{
  "user": "my_user",
  "password": "my_password",
  "account": "my_account",
  "role": "my_role",
  "warehouse": "my_warehouse",
  "database": "my_database",
  "schema": "my_schema"
}

Additional recommendations

SPRKPY1054

Message: pyspark.sql.readwriter.DataFrameReader.format is not supported.

Category: Warning

Description

This issue appears when the pyspark.sql.readwriter.DataFrameReader.format has an argument that is not supported by Snowpark.

Scenarios

There are several scenarios depending on the type of format you are trying to load. It can be a supported or an unsupported format.

Scenario 1

Input

The tool analyzes the type of format you are trying to load. The supported formats are:

  • Csv

  • JSON

  • Parquet

  • Orc

The example below shows how the tool transforms the format method when a csv value is passed.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df1 = spark.read.format('csv').load('/path/to/file')

Output

The tool transforms the format method into a csv method call.

from snowflake.snowpark import Session
spark = Session.builder.getOrCreate()

df1 = spark.read.csv('/path/to/file')

Recommended fix

In this case the tool does not display the EWI, which means no fix is needed.

Scenario 2

Input

The example below shows how the tool transforms the format method when a jdbc value is passed.

from snowflake.snowpark import Session
spark = Session.builder.getOrCreate()

df2 = spark.read.format('jdbc') \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("url", "jdbc:mysql://localhost:3306/emp") \
    .option("dbtable", "employee") \
    .option("user", "root") \
    .option("password", "root") \
    .load()

Output

The tool shows the EWI SPRKPY1054 indicating that the value "jdbc" is not supported.

from snowflake.snowpark import Session
spark = Session.builder.getOrCreate()

#EWI: SPRKPY1054 => pyspark.sql.readwriter.DataFrameReader.format with argument value "jdbc" is not supported.
#EWI: SPRKPY1002 => pyspark.sql.readwriter.DataFrameReader.load is not supported

df2 = spark.read.format('jdbc') \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("url", "jdbc:mysql://localhost:3306/emp") \
    .option("dbtable", "employee") \
    .option("user", "root") \
    .option("password", "root") \
    .load()

Recommended fix

For the unsupported scenarios there is no specific fix, since it depends on the files being read.

Scenario 3

Input

The example below shows how the tool transforms the format method when a csv value is passed through a variable instead of a literal.

from snowflake.snowpark import Session
spark = Session.builder.getOrCreate()

myFormat = 'csv'
df3 = spark.read.format(myFormat).load('/path/to/file')

Output

Since the tool cannot determine the value of the variable at runtime, it shows the EWI SPRKPY1054 indicating that the value "" is not supported.

from snowflake.snowpark import Session
spark = Session.builder.getOrCreate()

myFormat = 'csv'
#EWI: SPRKPY1054 => pyspark.sql.readwriter.DataFrameReader.format with argument value "" is not supported.
#EWI: SPRKPY1002 => pyspark.sql.readwriter.DataFrameReader.load is not supported
df3 = spark.read.format(myFormat).load('/path/to/file')

Recommended fix

As a workaround, you can check the value of the variable and pass it as a string literal to the format call.
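
For instance, a minimal sketch of that workaround in plain Python (the SUPPORTED_FORMATS set and the my_format variable are illustrative names, not part of the SMA):

```python
# Formats that the SMA can map to a direct Snowpark reader call
SUPPORTED_FORMATS = {"csv", "json", "parquet", "orc"}

my_format = "csv"

if my_format in SUPPORTED_FORMATS:
    # Inline the literal instead of the variable so the SMA can resolve it,
    # i.e. rewrite spark.read.format(my_format) as a direct reader call
    reader_call = f"spark.read.{my_format}('/path/to/file')"
```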

Additional recommendations

SPRKPY1055

Message: The pyspark.sql.readwriter.DataFrameReader.option key value is not supported.

Category: Warning

Description

This issue appears when the pyspark.sql.readwriter.DataFrameReader.option key value is not supported by Snowflake.

The tool analyzes the parameters of the option call and, depending on the method (CSV, JSON or PARQUET), the key value may or may not have an equivalent in Snowpark. If all the parameters have an equivalent, the tool does not add the EWI and replaces the key value with its equivalent. Otherwise, the tool adds the EWI.

List of equivalences:

  • Equivalences for CSV:

    Spark option key    Snowpark equivalence
    sep                 FIELD_DELIMITER
    header              PARSE_HEADER
    lineSep             RECORD_DELIMITER
    pathGlobFilter      PATTERN
    quote               FIELD_OPTIONALLY_ENCLOSED_BY
    nullValue           NULL_IF
    dateFormat          DATE_FORMAT
    timestampFormat     TIMESTAMP_FORMAT
    inferSchema         INFER_SCHEMA
    delimiter           FIELD_DELIMITER

  • Equivalences for JSON:

    Spark option key    Snowpark equivalence
    dateFormat          DATE_FORMAT
    timestampFormat     TIMESTAMP_FORMAT
    pathGlobFilter      PATTERN

  • Equivalences for PARQUET:

    Spark option key    Snowpark equivalence
    pathGlobFilter      PATTERN

Any other option key that is not in one of the tables above is not supported or has no equivalent in Snowpark. In that case, the tool adds the EWI with the parameter information and removes the option from the chain.
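
The key translation described above can be sketched as a plain lookup (the dictionary below is an illustration built from the CSV table in this section, not the SMA's internal data structure):

```python
# Spark option keys and their Snowpark equivalents for the csv reader
CSV_OPTION_EQUIVALENCES = {
    "sep": "FIELD_DELIMITER",
    "header": "PARSE_HEADER",
    "lineSep": "RECORD_DELIMITER",
    "pathGlobFilter": "PATTERN",
    "quote": "FIELD_OPTIONALLY_ENCLOSED_BY",
    "nullValue": "NULL_IF",
    "dateFormat": "DATE_FORMAT",
    "timestampFormat": "TIMESTAMP_FORMAT",
    "inferSchema": "INFER_SCHEMA",
    "delimiter": "FIELD_DELIMITER",
}

def translate_csv_option(key):
    # Returns the Snowpark key, or None when the EWI would be emitted
    # and the option removed from the chain
    return CSV_OPTION_EQUIVALENCES.get(key)
```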

Scenarios

The following scenarios apply to CSV, JSON and PARQUET.

There are a couple of scenarios depending on the value of the key used in the option method.

Scenario 1

Input

Below is an example of an option call using an equivalent key.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

## CSV example:
spark.read.option("header", True).csv(csv_file_path)

## Json example:
spark.read.option("dateFormat", "dd-MM-yyyy").json(json_file_path)

## Parquet example:
spark.read.option("pathGlobFilter", "*.parquet").parquet(parquet_file_path)

Output

The tool transforms the key into the correct equivalent.

from snowflake.snowpark import Session

spark = Session.builder.getOrCreate()

## CSV example:
spark.read.option("PARSE_HEADER", True).csv(csv_file_path)

## Json example:
spark.read.option("DATE_FORMAT", "dd-MM-yyyy").json(json_file_path)

## Parquet example:
spark.read.option("PATTERN", "*.parquet").parquet(parquet_file_path)

Recommended fix

Since the tool transforms the key value, no fix is needed.

Scenario 2

Input

Below is an example of an option call using a non-equivalent key.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

## CSV example:
spark.read.option("anotherKeyValue", "myVal").csv(csv_file_path)

## Json example:
spark.read.option("anotherKeyValue", "myVal").json(json_file_path)

## Parquet example:
spark.read.option("anotherKeyValue", "myVal").parquet(parquet_file_path)

Output

The tool adds the EWI SPRKPY1055 indicating that the key is not supported, and removes the option call.

from snowflake.snowpark import Session

spark = Session.builder.getOrCreate()

## CSV example:
#EWI: SPRKPY1055 => pyspark.sql.readwriter.DataFrameReader.option with key value "anotherKeyValue" is not supported.
spark.read.csv(csv_file_path)

## Json example:
#EWI: SPRKPY1055 => pyspark.sql.readwriter.DataFrameReader.option with key value "anotherKeyValue" is not supported.
spark.read.json(json_file_path)

## Parquet example:
#EWI: SPRKPY1055 => pyspark.sql.readwriter.DataFrameReader.option with key value "anotherKeyValue" is not supported.
spark.read.parquet(parquet_file_path)

Recommended fix

It is recommended to check the behavior after the transformation.

Additional recommendations

  • When there are non-equivalent parameters, it is recommended to check the behavior after the transformation.

  • For more support, you can email us at sma-support@snowflake.com or post an issue in the SMA.

SPRKPY1056

Warning

This issue code is now deprecated.

Message: pyspark.sql.readwriter.DataFrameReader.option argument *<argument_name>* is not a literal and cannot be evaluated

Category: Warning

Description

This issue appears when the argument’s key or value of the pyspark.sql.readwriter.DataFrameReader.option function is not a literal value (for example a variable). The SMA does a static analysis of your source code, and therefore it is not possible to evaluate the content of the argument.

Scenario

Input

Below is an example of a use of the pyspark.sql.readwriter.DataFrameReader.option function that generates this EWI.

my_value = ...
my_option = ...

df1 = spark.read.option("dateFormat", my_value).format("csv").load('filename.csv')
df2 = spark.read.option(my_option, "false").format("csv").load('filename.csv')

Output

The SMA adds the EWI SPRKPY1056 to the output code to let you know that the argument of this function is not a literal value, and therefore it could not be evaluated by the SMA.

my_value = ...
my_option = ...

#EWI: SPRKPY1056 => pyspark.sql.readwriter.DataFrameReader.option argument "dateFormat" is not a literal and can't be evaluated
df1 = spark.read.option("dateFormat", my_value).format("csv").load('filename.csv')
#EWI: SPRKPY1056 => pyspark.sql.readwriter.DataFrameReader.option argument key is not a literal and can't be evaluated
df2 = spark.read.option(my_option, "false").format("csv").load('filename.csv')

Recommended fix

Even though the SMA was unable to evaluate the argument, it does not mean that it is not supported by Snowpark. Please make sure that the value of the argument is valid and equivalent in Snowpark by checking the documentation.

Additional recommendations

SPRKPY1057

Warning

This Issue Code has been deprecated since Spark Conversion Core Version 4.8.0

Message: PySpark Dataframe Option argument contains a value that is not a literal, therefore cannot be evaluated

Category: Warning

Description

This issue code is deprecated. If you are using an older version, please upgrade to the latest.

Additional recommendations

SPRKPY1058

Message: The platform specific key <method> with <key> is not supported.

Category: ConversionError

Description

The get and set methods from pyspark.sql.conf.RuntimeConfig are not supported with a Platform specific key.

Scenarios

Not all usages of the get or set methods will have an EWI in the output code. This EWI appears when the tool detects the usage of these methods with a platform specific key, which is not supported.

Scenario 1

Input

Below is an example of the get and set methods with keys that are supported in Snowpark.

session.conf.set("use_constant_subquery_alias", True)
session.conf.set("sql_simplifier_enabled", False)

session.conf.get("use_constant_subquery_alias")
session.conf.get("sql_simplifier_enabled")

Output

Since the keys are supported in Snowpark, the tool does not add the EWI to the output code.

session.conf.set("use_constant_subquery_alias", True)
session.conf.set("sql_simplifier_enabled", False)

session.conf.get("use_constant_subquery_alias")
session.conf.get("sql_simplifier_enabled")

Recommended fix

There is no recommended fix for this scenario.

Scenario 2

Input

Below is an example using unsupported keys.

data = [
    ("John", 30, "New York"),
    ("Jane", 25, "San Francisco")
]

spark.conf.set("spark.sql.shuffle.partitions", "50")
spark.conf.set("spark.yarn.am.memory", "1g")

spark.conf.get("spark.sql.shuffle.partitions")
spark.conf.get("spark.yarn.am.memory")

df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

Output

The tool adds the EWI SPRKPY1058 to the output code to let you know that these methods are not supported with a platform specific key.

data = [
    ("John", 30, "New York"),
    ("Jane", 25, "San Francisco")
]

#EWI: SPRKPY1058 => pyspark.sql.conf.RuntimeConfig.set method with this "spark.sql.shuffle.partitions" Platform specific key is not supported.
spark.conf.set("spark.sql.shuffle.partitions", "50")
#EWI: SPRKPY1058 => pyspark.sql.conf.RuntimeConfig.set method with this "spark.yarn.am.memory" Platform specific key is not supported.
spark.conf.set("spark.yarn.am.memory", "1g")

#EWI: SPRKPY1058 => pyspark.sql.conf.RuntimeConfig.get method with this "spark.sql.shuffle.partitions" Platform specific key is not supported.
spark.conf.get("spark.sql.shuffle.partitions")
#EWI: SPRKPY1058 => pyspark.sql.conf.RuntimeConfig.get method with this "spark.yarn.am.memory" Platform specific key is not supported.
spark.conf.get("spark.yarn.am.memory")

df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

Recommended fix

The recommended fix is to remove these methods.

data = [
    ("John", 30, "New York"),
    ("Jane", 25, "San Francisco")
]

df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

Additional recommendations

SPRKPY1059

Warning

This issue code has been deprecated since Spark Conversion Core Version 2.45.1

Message: pyspark.storagelevel.StorageLevel has a workaround, see documentation.

Category: Warning

Description

Currently, the use of StorageLevel is not required in Snowpark, since Snowflake controls the storage. For more information, you can refer to the EWI SPRKPY1072.

Additional recommendations

SPRKPY1060

Message: The authentication mechanism is connection.json (template provided).

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.conf.SparkConf.

Scenario

Input

Since the authentication mechanism is different in Snowpark, the tool removes the usages and creates a connection configuration file (connection.json) instead.

from pyspark import SparkConf

my_conf = SparkConf(loadDefaults=True)

Output

The tool adds the EWI SPRKPY1060 indicating that the authentication mechanism is different.

#EWI: SPRKPY1002 => pyspark.conf.SparkConf is not supported
#EWI: SPRKPY1060 => The authentication mechanism is connection.json (template provided).
#my_conf = Session.builder.configs(connection_parameter).getOrCreate()

my_conf = None

Recommended fix

To create a connection, you need to fill in the information in the connection.json file.

{
  "user": "<USER>",
  "password": "<PASSWORD>",
  "account": "<ACCOUNT>",
  "role": "<ROLE>",
  "warehouse": "<WAREHOUSE>",
  "database": "<DATABASE>",
  "schema": "<SCHEMA>"
}

Additional recommendations

SPRKPY1061

Message: Snowpark does not support unix_timestamp functions with no parameters.

Category: Warning

Description

In Snowpark the first parameter is mandatory; this issue appears when the tool detects the usage of pyspark.sql.functions.unix_timestamp with no parameters.

Scenario

Input

Below is an example that calls the unix_timestamp method without parameters.

data = [["2015-04-08", "10"],["2015-04-10", "15"]]

df = spark.createDataFrame(data, ['dt', 'val'])
df.select(unix_timestamp()).show()

Output

The Snowpark signature for this function is unix_timestamp(e: ColumnOrName, fmt: Optional["Column"] = None); as you can see, the first parameter is required.

The tool adds the EWI SPRKPY1061 to let you know that the unix_timestamp function with no parameters is not supported in Snowpark.

data = [["2015-04-08", "10"],["2015-04-10", "15"]]

df = spark.createDataFrame(data, ['dt', 'val'])
#EWI: SPRKPY1061 => Snowpark does not support unix_timestamp functions with no parameters. See documentation for more info.
df.select(unix_timestamp()).show()

Recommended fix

As a workaround, you can at least add the name of the timestamp string or the column.

data = [["2015-04-08", "10"],["2015-04-10", "15"]]

df = spark.createDataFrame(data, ["dt", "val"])
df.select(unix_timestamp("dt")).show()

Additional recommendations

SPRKPY1062

Message: Snowpark does not support GroupedData.pivot without the "values" parameter.

Category: Warning

Description

This issue appears when the SMA detects the usage of the pyspark.sql.group.GroupedData.pivot function without the "values" parameter (the list of values to pivot on).

Currently, the Snowpark Python pivot function requires you to explicitly specify the list of distinct values to pivot on.
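
As a plain-Python illustration of the explicit list, the distinct values can be computed once, up front, and then passed to pivot (the rows variable and the column position are hypothetical):

```python
# Hypothetical source rows: (category, date, amount)
rows = [
    ("dotNET", 2012, 10000),
    ("Java", 2012, 20000),
    ("dotNET", 2013, 5000),
]

# Compute the distinct pivot values once, then pass them explicitly,
# e.g. df.groupBy("date").pivot("category", pivot_values).sum("amount")
pivot_values = sorted({r[0] for r in rows})
```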

Scenarios

Scenario 1

Input

The SMA detects an expression that matches the pattern dataFrame.groupBy("columnX").pivot("columnY"), where the pivot does not have the values parameter.

df.groupBy("date").pivot("category").sum("amount")

Output

The SMA adds an EWI message indicating that the pivot function without the "values" parameter is not supported.

In addition, it adds, as the second parameter of the pivot function, a list comprehension that computes the list of values that will be translated into columns. Keep in mind that this operation is not efficient on large datasets, and it is advisable to specify the values explicitly.

#EWI: SPRKPY1062 => pyspark.sql.group.GroupedData.pivot without parameter 'values' is not supported. See documentation for more info.
df.groupBy("date").pivot("category", [v[0] for v in df.select("category").distinct().limit(10000).collect()]).sum("amount")

Recommended fix

For this scenario, the SMA adds, as the second parameter of the pivot function, a list comprehension that computes the list of values that will be translated into columns, but you can also use a list of distinct values to pivot on, as follows:

df = spark.createDataFrame([
      Row(category="Client_ID", date=2012, amount=10000),
      Row(category="Client_name",   date=2012, amount=20000)
  ])

df.groupBy("date").pivot("category", ["dotNET", "Java"]).sum("amount")

Scenario 2

Input

The SMA could not detect an expression that matches the pattern dataFrame.groupBy("columnX").pivot("columnY"), and the pivot does not have the values parameter.

df1.union(df2).groupBy("date").pivot("category").sum("amount")

Output

The SMA adds an EWI message indicating that the pivot function without the "values" parameter is not supported.

#EWI: SPRKPY1062 => pyspark.sql.group.GroupedData.pivot without parameter 'values' is not supported. See documentation for more info.
df1.union(df2).groupBy("date").pivot("category").sum("amount")

Recommended fix

Add a list of distinct values to pivot on, as follows:

df = spark.createDataFrame([
      Row(course="dotNET", year=2012, earnings=10000),
      Row(course="Java",   year=2012, earnings=20000)
  ])

df.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings").show()

Additional recommendations

  • Computing the list of distinct values to pivot on is not an efficient operation on large datasets and could become a blocking call. Please consider specifying the list of distinct values to pivot on explicitly.

  • If you do not want to specify the list of distinct values to pivot on explicitly (which is not advised), you can add the following code as the second argument of the pivot function to infer the values at runtime:

[v[0] for v in <df>.select(<column>).distinct().limit(<count>).collect()]

Replace <df> with the corresponding DataFrame, <column> with the column to pivot on and <count> with the number of rows to select.

SPRKPY1063

Message: pyspark.sql.pandas.functions.pandas_udf has a workaround.

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.pandas.functions.pandas_udf, which has a workaround.

Scenario

Input

The pandas_udf function is used to create user-defined functions that work with large amounts of data.

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def modify_df(pdf):
    return pd.DataFrame({'result': pdf['col1'] + pdf['col2'] + 1})
df = spark.createDataFrame([(1, 2), (3, 4), (1, 1)], ["col1", "col2"])
new_df = df.groupby().apply(modify_df)

Output

The SMA adds an EWI message indicating that the pandas_udf function has a workaround.

#EWI: SPRKPY1063 => pyspark.sql.pandas.functions.pandas_udf has a workaround, see documentation for more info
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)

def modify_df(pdf):
    return pd.DataFrame({'result': pdf['col1'] + pdf['col2'] + 1})

df = spark.createDataFrame([(1, 2), (3, 4), (1, 1)], ["col1", "col2"])

new_df = df.groupby().apply(modify_df)

Recommended fix

Explicitly specify the parameter types as a new input_types parameter, and remove the functionType parameter if applicable. The created function must be called inside a select statement.

@pandas_udf(
    return_type = schema,
    input_types = [PandasDataFrameType([IntegerType(), IntegerType()])]
)

def modify_df(pdf):
    return pd.DataFrame({'result': pdf['col1'] + pdf['col2'] + 1})

df = spark.createDataFrame([(1, 2), (3, 4), (1, 1)], ["col1", "col2"])

new_df = df.groupby().apply(modify_df) # You must modify function call to be a select and not an apply

Additional recommendations

SPRKPY1064

Message: The *Spark element* does not apply since Snowflake uses the Snowpipe mechanism instead.

Category: Warning

Description

This issue appears when the tool detects the usage of an element from the pyspark.streaming library.

Scenario

Input

Below is an example with one of the elements that trigger this EWI.

from pyspark.streaming.listener import StreamingListener

var = StreamingListener.Java
var.mro()

df = spark.createDataFrame([(25, "Alice", "150"), (30, "Bob", "350")], schema=["age", "name", "value"])
df.show()

Output

The SMA adds the EWI SPRKPY1064 to the output code to let you know that this function does not apply.

#EWI: SPRKPY1064 => The element does not apply since snowflake uses snowpipe mechanism instead.

var = StreamingListener.Java
var.mro()

df = spark.createDataFrame([(25, "Alice", "150"), (30, "Bob", "350")], schema=["age", "name", "value"])
df.show()

Recommended fix

The SMA removes the import statement and records the issue in the Issues.csv inventory; remove any usages of the Spark element.

df = spark.createDataFrame([(25, "Alice", "150"), (30, "Bob", "350")], schema=["age", "name", "value"])
df.show()

Additional recommendations

SPRKPY1065

Message: pyspark.context.SparkContext.broadcast does not apply since Snowflake uses a data-clustering mechanism to compute the data.

Category: Warning

Description

This issue appears when the tool detects the usage of the element pyspark.context.SparkContext.broadcast, which is not necessary due to Snowflake's use of data clustering.

Input code

In this example a broadcast variable is created. These variables allow data to be shared more efficiently across all nodes.

sc = SparkContext(conf=conf_spark)

mapping = {1: 10001, 2: 10002}

bc = sc.broadcast(mapping)

Output code

The SMA adds an EWI message indicating that the broadcast is not needed.

sc = conf_spark

mapping = {1: 10001, 2: 10002}
#EWI: SPRKPY1065 => The element does not apply since snowflake use data-clustering mechanism to compute the data.

bc = sc.broadcast(mapping)

Recommended fix

Remove any usages of pyspark.context.SparkContext.broadcast.

sc = conf_spark

mapping = {1: 10001, 2: 10002}

Additional recommendations

SPRKPY1066

Message: The Spark element does not apply since Snowflake uses the micro-partitioning mechanism, which is created automatically.

Category: Warning

Description

This issue appears when the tool detects the usage of elements related to partitions.

Those elements do not apply due to Snowflake's use of micro-partitions.

Input code

In this example sortWithinPartitions is used to create a partition in a DataFrame sorted by the specified column.

df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"])
df.sortWithinPartitions("age", ascending=False)

Output code

The SMA adds an EWI indicating that the Spark element is not required.

df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"])
#EWI: SPRKPY1066 => The element does not apply since snowflake use micro-partitioning mechanism are created automatically.
df.sortWithinPartitions("age", ascending=False)

Recommended fix

Remove the usage of the element.

df = spark.createDataFrame([(2, "Alice"), (5, "Bob")], schema=["age", "name"])

Additional recommendations

SPRKPY1067

Message: The pyspark.sql.functions.split function has parameters that are not supported in Snowpark.

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.split with more than two parameters or with a regex pattern as a parameter; both cases are not supported.

Scenarios

Scenario 1

Input code

In this example, the split function has more than two parameters.

df.select(split(columnName, ",", 5))

Output code

The tool adds this EWI to the output code to let you know that this function is not supported when it has more than two parameters.

#EWI: SPRKPY1067 => Snowpark does not support split functions with more than two parameters or containing regex pattern. See documentation for more info.
df.select(split(columnName, ",", 5))

Recommended fix

Keep the split function with only two parameters.

df.select(split(columnName, ","))

Scenario 2

Input code

In this example, the split function has a regex pattern as a parameter.

df.select(split(columnName, "^([\d]+-[\d]+-[\d])"))

Output code

The tool adds this EWI to the output code to let you know that this function is not supported when it has a regex pattern as a parameter.

#EWI: SPRKPY1067 => Snowpark does not support split functions with more than two parameters or containing regex pattern. See documentation for more info.
df.select(split(columnName, "^([\d]+-[\d]+-[\d])"))

Recommended fix

The Spark signature for this method, functions.split(str: ColumnOrName, pattern: str, limit: int = -1), does not exactly match the Snowpark method functions.split(str: Union[Column, str], pattern: Union[Column, str]), so for now the scenario using a regular expression does not have a recommended fix.

Additional recommendations

SPRKPY1068

Message: toPandas contains columns of type ArrayType, which is not supported and has a workaround.

Category: Warning

Description

pyspark.sql.DataFrame.toPandas does not work properly if there are columns of type ArrayType. The workaround for these cases is to convert those columns into a Python dictionary by using the json.loads method.

Scenario

Input

toPandas returns the data of the original DataFrame as a pandas DataFrame.

sparkDF = spark.createDataFrame([
Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0))
])

pandasDF = sparkDF.toPandas()

Output

The tool adds this EWI to let you know that toPandas is not supported when there are columns of type ArrayType, but that a workaround is available.

sparkDF = spark.createDataFrame([
Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0))
])
#EWI: SPRKPY1068 => toPandas doesn't work properly If there are columns of type ArrayType. The workaround for these cases is converting those columns into a Python Dictionary by using json.loads method. example: df[colName] = json.loads(df[colName]).
pandasDF = sparkDF.toPandas()

Recommended fix

pandas_df = sparkDF.toPandas()

## Check all fields of the source DataFrame after calling toPandas;
## columns of type ArrayType are reassigned by converting them into a
## Python dictionary using the json.loads method

for field in sparkDF.schema.fields:
    if isinstance(field.datatype, ArrayType):
        pandas_df[field.name] = pandas_df[field.name].apply(lambda x: json.loads(x) if x is not None else x)
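
The json.loads conversion itself can be illustrated with plain Python (the col list is a hypothetical ArrayType column whose cells arrive as JSON-encoded strings or None):

```python
import json

# Hypothetical cell values of an ArrayType column after toPandas()
col = ['[1, 2, 3]', None, '["a", "b"]']

# Apply the json.loads workaround from the EWI message, keeping None as-is
decoded = [json.loads(v) if v is not None else v for v in col]
```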

Additional recommendations

SPRKPY1069

Message: If the partitionBy parameter is a list, Snowpark will throw an error.

Category: Warning

Description

When there is a usage of the pyspark.sql.readwriter.DataFrameWriter.parquet method with the partitionBy parameter, the tool shows the EWI.

This is because in Snowpark DataFrameWriter.parquet only supports a ColumnOrSqlExpr as the partitionBy parameter.

Scenarios

Scenario 1

Input code:

In this scenario, the partitionBy parameter is not a list.

df = spark.createDataFrame([(25, "Alice", "150"), (30, "Bob", "350")], schema=["age", "name", "value"])

df.write.parquet(file_path, partitionBy="age")

Output code:

The tool adds the EWI SPRKPY1069 to let you know that Snowpark throws an error if the parameter is a list.

df = spark.createDataFrame([(25, "Alice", "150"), (30, "Bob", "350")], schema=["age", "name", "value"])

#EWI: SPRKPY1069 => If partitionBy parameter is a list, Snowpark will throw and error.
df.write.parquet(file_path, partition_by = "age", format_type_options = dict(compression = "None"))

Recommended fix

There is no recommended fix for this scenario, because the tool always adds this EWI in case the partitionBy parameter is a list. Remember that Snowpark only accepts cloud locations using a Snowflake stage.

df = spark.createDataFrame([(25, "Alice", "150"), (30, "Bob", "350")], schema=["age", "name", "value"])

stage = f'{Session.get_fully_qualified_current_schema()}.{_generate_prefix("TEMP_STAGE")}'
Session.sql(f'CREATE TEMPORARY STAGE IF NOT EXISTS {stage}').show()
Session.file.put(f"file:///path/to/data/file.parquet", f"@{stage}")

df.write.parquet(stage, partition_by = "age", format_type_options = dict(compression = "None"))

Scenario 2

Input code:

In this scenario, the partitionBy parameter is a list.

df = spark.createDataFrame([(25, "Alice", "150"), (30, "Bob", "350")], schema=["age", "name", "value"])

df.write.parquet(file_path, partitionBy=["age", "name"])

Output code:

The tool adds the EWI SPRKPY1069 to let you know that Snowpark throws an error if the parameter is a list.

df = spark.createDataFrame([(25, "Alice", "150"), (30, "Bob", "350")], schema=["age", "name", "value"])

#EWI: SPRKPY1069 => If partitionBy parameter is a list, Snowpark will throw and error.
df.write.parquet(file_path, partition_by = ["age", "name"], format_type_options = dict(compression = "None"))

Recommended fix

If the value of the parameter is a list, replace it with a ColumnOrSqlExpr.

df.write.parquet(file_path, partition_by = sql_expr("age || name"), format_type_options = dict(compression = "None"))

Additional recommendations

SPRKPY1070

Message: The mode argument is transformed to overwrite, check the variable value and set the corresponding bool value.

Category: Warning

Description

This issue appears when the tool detects a usage of the mode parameter in a DataFrameWriter call.

The tool analyzes the mode parameter to determine whether the value is overwrite.
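
The check the tool performs can be sketched in plain Python (translate_mode is a hypothetical helper, not an SMA API):

```python
def translate_mode(mode):
    # A literal "overwrite" maps to overwrite=True; any other literal
    # maps to False. A non-literal (variable) value is passed through
    # and flagged with the EWI so you can set the right bool yourself.
    if isinstance(mode, str):
        return mode == "overwrite"
    return mode
```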

Scenarios

Scenario 1

Input code

In this scenario, the tool detects that the mode parameter can set the corresponding bool value.

df.write.csv(file_path, mode="overwrite")

Output code:

The SMA tool analyzes the mode parameter, determines that the value is overwrite, and sets the corresponding boolean value.

df.write.csv(file_path, format_type_options = dict(compression = "None"), overwrite = True)

Recommended fix

There is no recommended fix for this scenario because the tool performed the corresponding transformation.

Scenario 2:

Input code

In this scenario, the tool cannot validate that the value is overwrite.

df.write.csv(file_path, mode=myVal)

Output code:

The SMA tool adds an EWI message indicating that the mode parameter has been transformed to "overwrite", and also that you should check the variable value and set the correct boolean value.

#EWI: SPRKPY1070 => The 'mode' argument is transformed to 'overwrite', check the variable value and set the corresponding bool value.
df.write.csv(file_path, format_type_options = dict(compression = "None"), overwrite = myVal)

Recommended fix

Check the value of the mode parameter and set the correct value for the overwrite parameter.

df.write.csv(file_path, format_type_options = dict(compression = "None"), overwrite = True)

Additional recommendations

SPRKPY1071

Message: The pyspark.rdd.RDD.getNumPartitions function is not required in Snowpark, so you should remove all references to it.

Category: Warning

Description

This issue appears when the tool finds a use of the pyspark.rdd.RDD.getNumPartitions function. Snowflake uses a micro-partitioning mechanism, so this function is not required.

Scenario

Input

getNumPartitions returns the number of partitions of an RDD.

df = spark.createDataFrame([('2015-04-08',), ('5',), [Row(a=1, b="b")]], ['dt', 'num', 'row'])

print(df.getNumPartitions())

Output

The tool adds this EWI to let you know that getNumPartitions is not required.

df = spark.createDataFrame([('2015-04-08',), ('5',), [Row(a=1, b="b")]], ['dt', 'num', 'row'])
#EWI: SPRKPY1071 => The getNumPartitions are not required in Snowpark. So, you should remove all references.

print(df.getNumPartitions())

Recommended fix

Remove all uses of this function.

df = spark.createDataFrame([('2015-04-08',), ('5',), [Row(a=1, b="b")]], ['dt', 'num', 'row'])

Additional recommendations

SPRKPY1072

Message: The use of StorageLevel is not required in Snowpark.

Category: Warning

Description

This issue appears when the tool finds a use of the StorageLevel class, which works like "flags" to set the storage level. Since Snowflake controls storage, using this class is not required.
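
As an illustration only, the usual migration can be sketched as a one-line rewrite; drop_storage_level is a hypothetical helper (not part of the SMA), and Snowpark's cache_result() is used here as the closest caching analogue:

```python
import re

# Hypothetical migration helper: rewrite persist(StorageLevel.X) calls into
# Snowpark's cache_result(), since Snowflake manages storage levels itself.
def drop_storage_level(line: str) -> str:
    return re.sub(r"\.persist\(\s*StorageLevel\.\w+\s*\)", ".cache_result()", line)

print(drop_storage_level("df.persist(StorageLevel.MEMORY_AND_DISK)"))
# df.cache_result()
```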

Additional recommendations

SPRKPY1073

Message: pyspark.sql.functions.udf without parameters or a return type parameter is not supported.

Category: Warning

Description

This issue appears when the tool detects the usage of pyspark.sql.functions.udf as a function or decorator in two specific unsupported cases: when it has no parameters or no return type parameter.

Scenarios

Scenario 1

Input

In PySpark, you can create a user-defined function without input or return type parameters:

from pyspark.sql import SparkSession, DataFrameStatFunctions
from pyspark.sql.functions import col, udf

spark = SparkSession.builder.getOrCreate()
data = [['Q1', 'Test 1'],
        ['Q2', 'Test 2'],
        ['Q3', 'Test 1'],
        ['Q4', 'Test 1']]

columns = ['Quadrant', 'Value']
df = spark.createDataFrame(data, columns)

my_udf = udf(lambda s: len(s))
df.withColumn('Len Value' ,my_udf(col('Value')) ).show()

Output

Snowpark requires the input and return types for the udf function. Because they are not provided, the SMA cannot set these parameters.

from snowflake.snowpark import Session, DataFrameStatFunctions
from snowflake.snowpark.functions import col, udf

spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['Q1', 'Test 1'],
        ['Q2', 'Test 2'],
        ['Q3', 'Test 1'],
        ['Q4', 'Test 1']]

columns = ['Quadrant', 'Value']
df = spark.createDataFrame(data, columns)
#EWI: SPRKPY1073 => pyspark.sql.functions.udf function without the return type parameter is not supported. See documentation for more info.
my_udf = udf(lambda s: len(s))

df.withColumn('Len Value' ,my_udf(col('Value')) ).show()

Recommended fix

To fix this scenario, add the imports for the input and return types, and then add the return_type and input_types parameters to the udf function my_udf.

from snowflake.snowpark import Session, DataFrameStatFunctions
from snowflake.snowpark.functions import col, udf
from snowflake.snowpark.types import IntegerType, StringType

spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['Q1', 'Test 1'],
        ['Q2', 'Test 2'],
        ['Q3', 'Test 1'],
        ['Q4', 'Test 1']]

columns = ['Quadrant', 'Value']
df = spark.createDataFrame(data, columns)

my_udf = udf(lambda s: len(s), return_type=IntegerType(), input_types=[StringType()])

df.with_column("result", my_udf(df.Value)).show()
Scenario 2

In PySpark, you can use a @udf decorator without parameters.

Input

from pyspark.sql.functions import col, udf

spark = SparkSession.builder.getOrCreate()
data = [['Q1', 'Test 1'],
        ['Q2', 'Test 2'],
        ['Q3', 'Test 1'],
        ['Q4', 'Test 1']]

columns = ['Quadrant', 'Value']
df = spark.createDataFrame(data, columns)

@udf()
def my_udf(str):
    return len(str)


df.withColumn('Len Value' ,my_udf(col('Value')) ).show()

Output

In Snowpark, all the parameters of a udf decorator are required.

from snowflake.snowpark.functions import col, udf

spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['Q1', 'Test 1'],
        ['Q2', 'Test 2'],
        ['Q3', 'Test 1'],
        ['Q4', 'Test 1']]

columns = ['Quadrant', 'Value']
df = spark.createDataFrame(data, columns)

#EWI: SPRKPY1073 => pyspark.sql.functions.udf decorator without parameters is not supported. See documentation for more info.

@udf()
def my_udf(str):
    return len(str)

df.withColumn('Len Value' ,my_udf(col('Value')) ).show()

Recommended fix

To fix this scenario, add the imports for the input and return types, and then add the return_type and input_types parameters to the @udf decorator.

from snowflake.snowpark.functions import col, udf
from snowflake.snowpark.types import IntegerType, StringType

spark = Session.builder.getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})
data = [['Q1', 'Test 1'],
        ['Q2', 'Test 2'],
        ['Q3', 'Test 1'],
        ['Q4', 'Test 1']]

columns = ['Quadrant', 'Value']
df = spark.createDataFrame(data, columns)

@udf(return_type=IntegerType(), input_types=[StringType()])
def my_udf(str):
    return len(str)

df.withColumn('Len Value' ,my_udf(col('Value')) ).show()

Additional recommendations

SPRKPY1074

Message: The file has mixed indentation (spaces and tabs).

Category: Parsing error.

Description

This issue appears when the tool detects that the file has mixed indentation, that is, a combination of spaces and tabs used to indent lines of code.

Scenario

Input

In PySpark, you can mix spaces and tabs for the indentation level.

def foo():
    x = 5 # spaces
    y = 6 # tab

Sortie

SMA ne peut pas gérer les marqueurs d’indentation mixtes. Lorsque cela est détecté sur un fichier de code Python, l’outil SMA ajoute l’EWI SPRKPY1074 sur la première ligne.

## EWI: SPRKPY1074 => File has mixed indentation (spaces and tabs).
## This file was not converted, so it is expected to still have references to the Spark API
def foo():
    x = 5 # spaces
    y = 6 # tabs

Recommended fix

The solution is to make all the indentation symbols the same.

def foo():
  x = 5 # tab
  y = 6 # tab
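
Before running the SMA, you can flag files that mix both styles with a short stand-alone check like the following (a minimal sketch, not part of the SMA):

```python
def indentation_styles(source: str) -> set:
    """Return the set of indentation styles ('spaces', 'tabs') used in the source."""
    styles = set()
    for line in source.splitlines():
        # Take the leading whitespace of the line and record which characters it uses.
        indent = line[:len(line) - len(line.lstrip(" \t"))]
        if "\t" in indent:
            styles.add("tabs")
        if " " in indent:
            styles.add("spaces")
    return styles

mixed = "def foo():\n    x = 5\n\ty = 6\n"
# A result containing both 'tabs' and 'spaces' means the file needs fixing.
print(indentation_styles(mixed))
```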

Additional recommendations

SPRKPY1075

Category

Warning.

Description

The parse_json function does not apply schema validation. If you need to filter or validate based on a schema, you may need to introduce some logic.

Example

Input

df.select(from_json(df.value, Schema))
df.select(from_json(schema=Schema, col=df.value))
df.select(from_json(df.value, Schema, option))

Output

#EWI: SPRKPY1075 => The parse_json does not apply schema validation, if you need to filter/validate based on schema you might need to introduce some logic.
df.select(parse_json(df.value))
#EWI: SPRKPY1075 => The parse_json does not apply schema validation, if you need to filter/validate based on schema you might need to introduce some logic.
df.select(parse_json(df.value))
#EWI: SPRKPY1075 => The parse_json does not apply schema validation, if you need to filter/validate based on schema you might need to introduce some logic.
df.select(parse_json(df.value))

For the from_json function, the schema is not really passed for inference; it is used for validation. See these examples:

data = [
    ('{"name": "John", "age": 30, "city": "New York"}',),
    ('{"name": "Jane", "age": "25", "city": "San Francisco"}',)
]

df = spark.createDataFrame(data, ["json_str"])

Example 1: Enforce the data types and change the column names:

## Parse JSON column with schema
parsed_df = df.withColumn("parsed_json", from_json(col("json_str"), schema))

parsed_df.show(truncate=False)

## +------------------------------------------------------+---------------------------+
## |json_str                                              |parsed_json                |
## +------------------------------------------------------+---------------------------+
## |{"name": "John", "age": 30, "city": "New York"}       |{John, 30, New York}       |
## |{"name": "Jane", "age": "25", "city": "San Francisco"}|{Jane, null, San Francisco}|
## +------------------------------------------------------+---------------------------+
## notice that values outside of the schema were dropped and columns not matched are returned as null

Example 2: Select specific columns:

## Define a schema with only the columns we want to use
partial_schema = StructType([
    StructField("name", StringType(), True),
    StructField("city", StringType(), True)
])

## Parse JSON column with partial schema
partial_df = df.withColumn("parsed_json", from_json(col("json_str"), partial_schema))

partial_df.show(truncate=False)

## +------------------------------------------------------+---------------------+
## |json_str                                              |parsed_json          |
## +------------------------------------------------------+---------------------+
## |{"name": "John", "age": 30, "city": "New York"}       |{John, New York}     |
## |{"name": "Jane", "age": "25", "city": "San Francisco"}|{Jane, San Francisco}|
## +------------------------------------------------------+---------------------+
## there is also an automatic filtering
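
The kind of validation logic you may need to add after parse_json can be sketched in plain Python (illustrative only; expected_types and conform are hypothetical names standing in for the Spark schema):

```python
import json

# Stand-in for the Spark schema: expected field names and Python types.
expected_types = {"name": str, "age": int, "city": str}

def conform(record: dict) -> dict:
    """Keep only schema fields; values with the wrong type become None,
    mimicking from_json's behavior shown above."""
    return {field: (record.get(field) if isinstance(record.get(field), typ) else None)
            for field, typ in expected_types.items()}

row = json.loads('{"name": "Jane", "age": "25", "city": "San Francisco"}')
print(conform(row))  # {'name': 'Jane', 'age': None, 'city': 'San Francisco'}
```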

Recommendations

  • For more support, you can email us at sma-support@snowflake.com. If you have a contract for support with Snowflake, reach out to your sales engineer and they can direct your support needs.

  • Useful tools: PEP-8 and Reindent.

SPRKPY1076

Message: Parameters in pyspark.sql.readwriter.DataFrameReader methods are not supported. This applies to CSV, JSON and PARQUET methods.

Category: Warning

Description

For the CSV, JSON and PARQUET methods on the pyspark.sql.readwriter.DataFrameReader object, the tool analyzes the parameters and adds a transformation according to each case:

  • All parameters match their equivalent name in Snowpark: in this case, the tool transforms the parameter into a .option() call and does not add this EWI.

  • Some parameters do not match an equivalent in Snowpark: in this case, the tool adds this EWI with information about the parameter and removes it from the method call.

List of equivalences:

  • Equivalences for CSV:

Spark key → Snowpark equivalence

sep → FIELD_DELIMITER
header → PARSE_HEADER
lineSep → RECORD_DELIMITER
pathGlobFilter → PATTERN
quote → FIELD_OPTIONALLY_ENCLOSED_BY
nullValue → NULL_IF
dateFormat → DATE_FORMAT
timestampFormat → TIMESTAMP_FORMAT
inferSchema → INFER_SCHEMA
delimiter → FIELD_DELIMITER

  • Equivalences for JSON:

Spark key → Snowpark equivalence

dateFormat → DATE_FORMAT
timestampFormat → TIMESTAMP_FORMAT
pathGlobFilter → PATTERN

  • Equivalences for PARQUET:

Spark key → Snowpark equivalence

pathGlobFilter → PATTERN
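
The equivalence lists above can be captured as plain dictionaries, and the tool's behavior of keeping only the supported options can be sketched as follows (to_snowpark_options is a hypothetical helper, not SMA internals):

```python
# Spark reader option -> Snowpark option, from the equivalence lists above.
CSV_OPTION_MAP = {
    "sep": "FIELD_DELIMITER",
    "header": "PARSE_HEADER",
    "lineSep": "RECORD_DELIMITER",
    "pathGlobFilter": "PATTERN",
    "quote": "FIELD_OPTIONALLY_ENCLOSED_BY",
    "nullValue": "NULL_IF",
    "dateFormat": "DATE_FORMAT",
    "timestampFormat": "TIMESTAMP_FORMAT",
    "inferSchema": "INFER_SCHEMA",
    "delimiter": "FIELD_DELIMITER",
}
JSON_OPTION_MAP = {"dateFormat": "DATE_FORMAT", "timestampFormat": "TIMESTAMP_FORMAT", "pathGlobFilter": "PATTERN"}
PARQUET_OPTION_MAP = {"pathGlobFilter": "PATTERN"}

def to_snowpark_options(spark_options: dict, mapping: dict) -> dict:
    """Keep only the options that have a Snowpark equivalent."""
    return {mapping[key]: value for key, value in spark_options.items() if key in mapping}

print(to_snowpark_options({"dateFormat": "YYYY/MM/DD", "multiLine": True}, JSON_OPTION_MAP))
# {'DATE_FORMAT': 'YYYY/MM/DD'}
```

Unsupported options (like multiLine here) are dropped, which is what this EWI reports.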

Scenarios

Scenario 1

Input

Here are some examples for CSV:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('myapp').getOrCreate()

spark.read.csv("path3", None,None,None,None,None,None,True).show()

Output

In the converted code, the parameters are added as individual options to the csv function.

from snowflake.snowpark import Session

spark = Session.builder.app_name('myapp', True).getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})

#EWI: SPRKPY1076 => Some of the included parameters are not supported in the csv function, the supported ones will be added into a option method.
spark.read.option("FIELD_DELIMITER", None).option("PARSE_HEADER", True).option("FIELD_OPTIONALLY_ENCLOSED_BY", None).csv("path3").show()

Scenario 2

Input

Here are some examples for JSON:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('myapp').getOrCreate()
spark.read.json("/myPath/jsonFile/", dateFormat='YYYY/MM/DD').show()

Output

In the converted code, the parameters are added as individual options to the json function.

from snowflake.snowpark import Session
spark = Session.builder.app_name('myapp', True).getOrCreate()
#EWI: SPRKPY1076 => Some of the included parameters are not supported in the json function, the supported ones will be added into a option method.

spark.read.option("DATE_FORMAT", 'YYYY/MM/DD').json("/myPath/jsonFile/").show()
Scenario 3

Input

Here are some examples for PARQUET:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('myapp').getOrCreate()

spark.read.parquet("/path/to/my/file.parquet", pathGlobFilter="*.parquet").show()

Output

In the converted code, the parameters are added as individual options to the parquet function.

from snowflake.snowpark import Session

spark = Session.builder.app_name('myapp', True).getOrCreate()
spark.update_query_tag({"origin":"sf_sit","name":"sma","version":{"major":0,"minor":0,"patch":0},"attributes":{"language":"Python"}})

#EWI: SPRKPY1076 => Some of the included parameters are not supported in the parquet function, the supported ones will be added into a option method.
#EWI: SPRKPY1029 => The parquet function require adjustments, in Snowpark the parquet files needs to be located in an stage. See the documentation for more info.

spark.read.option("PATTERN", "*.parquet").parquet("/path/to/my/file.parquet")

Additional recommendations

SPRKPY1077

Message: Embedded SQL code cannot be processed.

Category: Warning

Description

This issue appears when the tool detects embedded SQL code that cannot be converted to Snowpark.

For more information, see the SQL embedded code section.

Scenario

Input

In this example, the SQL code is embedded in a variable called query, which is used as a parameter of the PySpark.sql method.

query = f"SELECT * from myTable"
spark.sql(query)

Output

The SMA detects that the PySpark.sql parameter is a variable and not SQL code, so the EWI message SPRKPY1077 is added on the PySpark.sql line.

query = f"SELECT * from myTable"
#EWI: SPRKPY1077 => SQL embedded code cannot be processed.
spark.sql(query)

Additional recommendations

  • For SQL transformation, this code must be embedded directly as a parameter of the method, only as string values and without interpolation. Please check the SQL being sent to the PySpark.sql function to validate its functionality on Snowflake.

  • For more support, you can email us at sma-support@snowflake.com or post an issue in the SMA.

SPRKPY1078

Message: The argument of the pyspark.context.SparkContext.setLogLevel function is not a literal value and therefore could not be evaluated.

Category: Warning

Description

This issue appears when the SMA detects the use of the pyspark.context.SparkContext.setLogLevel function with an argument that is not a literal value, for example, when the argument is a variable.

The SMA performs a static analysis of your source code, so it is not possible to evaluate the content of that argument and determine an equivalent in Snowpark.

Scenario

Input

In this example, the log level is defined in the my_log_level variable, and my_log_level is then used as a parameter by the setLogLevel method.

my_log_level = "WARN"
sparkSession.sparkContext.setLogLevel(my_log_level)

Output

The SMA is unable to evaluate the argument of the log level parameter, so the EWI SPRKPY1078 is added above the line with the transformed logging:

my_log_level = "WARN"
#EWI: SPRKPY1078 => my_log_level is not a literal value and therefore could not be evaluated. Make sure the value of my_log_level is a valid level in Snowpark. Valid log levels are: logging.CRITICAL, logging.DEBUG, logging.ERROR, logging.INFO, logging.NOTSET, logging.WARNING
logging.basicConfig(stream = sys.stdout, level = my_log_level)

Recommended fix

Even though the SMA was unable to evaluate the argument, it will transform the pyspark.context.SparkContext.setLogLevel function into the Snowpark equivalent. Please make sure the value of the level argument in the generated output code is a valid and equivalent log level in Snowpark according to the table below:

PySpark log level → Snowpark log level equivalent

ALL → logging.NOTSET
DEBUG → logging.DEBUG
ERROR → logging.ERROR
FATAL → logging.CRITICAL
INFO → logging.INFO
OFF → logging.WARNING
TRACE → logging.NOTSET
WARN → logging.WARNING
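
The mapping above translates directly into a Python dictionary (a sketch; the dictionary and function names are hypothetical):

```python
import logging

# PySpark log level -> Python logging level used by Snowpark, per the table above.
PYSPARK_TO_LOGGING_LEVEL = {
    "ALL": logging.NOTSET,
    "DEBUG": logging.DEBUG,
    "ERROR": logging.ERROR,
    "FATAL": logging.CRITICAL,
    "INFO": logging.INFO,
    "OFF": logging.WARNING,
    "TRACE": logging.NOTSET,
    "WARN": logging.WARNING,
}

def to_logging_level(pyspark_level: str) -> int:
    return PYSPARK_TO_LOGGING_LEVEL[pyspark_level.upper()]

print(to_logging_level("WARN") == logging.WARNING)  # True
```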

The recommended fix therefore looks like this:

my_log_level = logging.WARNING
logging.basicConfig(stream = sys.stdout, level = my_log_level)

Additional recommendations

SPRKPY1079

Message: The argument of the pyspark.context.SparkContext.setLogLevel function is not a valid PySpark log level.

Category: Warning

Description

This issue appears when the SMA detects the use of the pyspark.context.SparkContext.setLogLevel function with an argument that is not a valid log level in PySpark, and therefore an equivalent could not be determined in Snowpark.

Scenario

Input

Here the log level uses "INVALID_LOG_LEVEL", which is not a valid PySpark log level.

sparkSession.sparkContext.setLogLevel("INVALID_LOG_LEVEL")

Output

The SMA cannot recognize the "INVALID_LOG_LEVEL" log level. Even though the SMA performs the conversion, the EWI SPRKPY1079 is added to indicate a possible problem.

#EWI: SPRKPY1079 => INVALID_LOG_LEVEL is not a valid PySpark log level, therefore an equivalent could not be determined in Snowpark. Valid PySpark log levels are: ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, WARN
logging.basicConfig(stream = sys.stdout, level = logging.INVALID_LOG_LEVEL)

Recommended fix

Make sure that the log level used in the pyspark.context.SparkContext.setLogLevel function is a valid log level in PySpark or in Snowpark and try again.

logging.basicConfig(stream = sys.stdout, level = logging.DEBUG)

Additional recommendations

SPRKPY1081

This issue code has been deprecated since Spark Conversion Core 4.12.0

Message: pyspark.sql.readwriter.DataFrameWriter.partitionBy has a workaround.

Category: Warning

Description

The pyspark.sql.readwriter.DataFrameWriter.partitionBy function is not supported. The workaround is to use Snowpark’s copy_into_location instead. See the documentation for more info.

Scenario

Input

This code creates a separate directory for each unique value in the FIRST_NAME column. The data is the same, but it is stored in different directories based on the column value.

df = session.createDataFrame([["John", "Berry"], ["Rick", "Berry"], ["Anthony", "Davis"]], schema = ["FIRST_NAME", "LAST_NAME"])
df.write.partitionBy("FIRST_NAME").csv("/home/data")

Output code

df = session.createDataFrame([["John", "Berry"], ["Rick", "Berry"], ["Anthony", "Davis"]], schema = ["FIRST_NAME", "LAST_NAME"])
#EWI: SPRKPY1081 => The partitionBy function is not supported, but you can instead use copy_into_location as workaround. See the documentation for more info.
df.write.partitionBy("FIRST_NAME").csv("/home/data", format_type_options = dict(compression = "None"))

Recommended fix

In Snowpark, copy_into_location has a partition_by parameter that you can use instead of the partitionBy function, but it requires some manual adjustments, as shown in the following example:

Spark code:

df = session.createDataFrame([["John", "Berry"], ["Rick", "Berry"], ["Anthony", "Davis"]], schema = ["FIRST_NAME", "LAST_NAME"])
df.write.partitionBy("FIRST_NAME").csv("/home/data")

Manually adjusted Snowpark code:

df = session.createDataFrame([["John", "Berry"], ["Rick", "Berry"], ["Anthony", "Davis"]], schema = ["FIRST_NAME", "LAST_NAME"])
df.write.copy_into_location(location=temp_stage, partition_by=col("FIRST_NAME"), file_format_type="csv", format_type_options={"COMPRESSION": "NONE"}, header=True)

copy_into_location has the following parameters:

  • location: The Snowpark location only accepts cloud locations using a Snowflake stage.

  • partition_by: It can be a column name or a SQL expression, so you need to convert it to a column or SQL expression using col or sql_expr.

Additional recommendations

SPRKPY1082

Message: The pyspark.sql.readwriter.DataFrameReader.load function is not supported. A workaround is to use the format-specific Snowpark DataFrameReader method instead (avro, csv, json, orc, parquet). The path parameter should be a stage location.

Category: Warning

Description

The pyspark.sql.readwriter.DataFrameReader.load function is not supported. The workaround is to use Snowpark DataFrameReader methods instead.

Scenarios

The Spark signature for this method, DataFrameReader.load(path, format, schema, **options), does not exist in Snowpark. Therefore, any usage of the load function will have an EWI in the output code.

Scenario 1

Input

Below is an example that tries to load data from a CSV source.

path_csv_file = "/path/to/file.csv"

schemaParam = StructType([
        StructField("Name", StringType(), True),
        StructField("Superhero", StringType(), True)
    ])

my_session.read.load(path_csv_file, "csv").show()
my_session.read.load(path_csv_file, "csv", schema=schemaParam).show()
my_session.read.load(path_csv_file, "csv", schema=schemaParam, lineSep="\r\n", dateFormat="YYYY/MM/DD").show()

Output

The SMA adds the EWI SPRKPY1082 to let you know that this function is not supported by Snowpark, but it has a workaround.

path_csv_file = "/path/to/file.csv"

schemaParam = StructType([
        StructField("Name", StringType(), True),
        StructField("Superhero", StringType(), True)
    ])
#EWI: SPRKPY1082 => The pyspark.sql.readwriter.DataFrameReader.load function is not supported. A workaround is to use Snowpark DataFrameReader format specific method instead (avro csv, json, orc, parquet). The path parameter should be a stage location.

my_session.read.load(path_csv_file, "csv").show()
#EWI: SPRKPY1082 => The pyspark.sql.readwriter.DataFrameReader.load function is not supported. A workaround is to use Snowpark DataFrameReader format specific method instead (avro csv, json, orc, parquet). The path parameter should be a stage location.
my_session.read.load(path_csv_file, "csv", schema=schemaParam).show()
#EWI: SPRKPY1082 => The pyspark.sql.readwriter.DataFrameReader.load function is not supported. A workaround is to use Snowpark DataFrameReader format specific method instead (avro csv, json, orc, parquet). The path parameter should be a stage location.
my_session.read.load(path_csv_file, "csv", schema=schemaParam, lineSep="\r\n", dateFormat="YYYY/MM/DD").show()

Recommended fix

As a workaround, you can use Snowpark DataFrameReader methods instead.

  • Fixing the path and format parameters:

    • Replace the load method with the csv method.

    • The first parameter, path, must be in a stage to make an equivalence with Snowpark.

Below is an example that creates a temporary stage, puts the file into it, and then calls the csv method.

path_csv_file = "/path/to/file.csv"

## Stage creation

temp_stage = f'{Session.get_fully_qualified_current_schema()}.{_generate_prefix("TEMP_STAGE")}'
my_session.sql(f'CREATE TEMPORARY STAGE IF NOT EXISTS {temp_stage}').show()
my_session.file.put(f"file:///path/to/file.csv", f"@{temp_stage}")
stage_file_path = f"@{temp_stage}/file.csv"

schemaParam = StructType([
        StructField("Name", StringType(), True),
        StructField("Superhero", StringType(), True)
    ])

my_session.read.csv(stage_file_path).show()
  • Fixing schema parameter:

    • The schema can be set by using the schema function as follows:

schemaParam = StructType([
        StructField("name", StringType(), True),
        StructField("city", StringType(), True)
    ])

df = my_session.read.schema(schemaParam).csv(temp_stage)
  • Fixing the options parameter:

The options in Spark and Snowpark are not the same; in this case, lineSep and dateFormat are replaced with RECORD_DELIMITER and DATE_FORMAT. The Additional recommendations section has a table with all the equivalences.

Below is an example that creates a dictionary with RECORD_DELIMITER and DATE_FORMAT, and calls the options method with that dictionary.

optionsParam = {"RECORD_DELIMITER": "\r\n", "DATE_FORMAT": "YYYY/MM/DD"}
df = my_session.read.options(optionsParam).csv(stage_file_path)

Scenario 2

Input

Below is an example that tries to load data from a JSON source.

path_json_file = "/path/to/file.json"

schemaParam = StructType([
        StructField("Name", StringType(), True),
        StructField("Superhero", StringType(), True)
    ])

my_session.read.load(path_json_file, "json").show()
my_session.read.load(path_json_file, "json", schema=schemaParam).show()
my_session.read.load(path_json_file, "json", schema=schemaParam, dateFormat="YYYY/MM/DD", timestampFormat="YYYY-MM-DD HH24:MI:SS.FF3").show()

Output

The SMA adds the EWI SPRKPY1082 to let you know that this function is not supported by Snowpark, but it has a workaround.

path_json_file = "/path/to/file.json"

schemaParam = StructType([
        StructField("Name", StringType(), True),
        StructField("Superhero", StringType(), True)
    ])
#EWI: SPRKPY1082 => The pyspark.sql.readwriter.DataFrameReader.load function is not supported. A workaround is to use Snowpark DataFrameReader format specific method instead (avro csv, json, orc, parquet). The path parameter should be a stage location.

my_session.read.load(path_json_file, "json").show()
#EWI: SPRKPY1082 => The pyspark.sql.readwriter.DataFrameReader.load function is not supported. A workaround is to use Snowpark DataFrameReader format specific method instead (avro csv, json, orc, parquet). The path parameter should be a stage location.
my_session.read.load(path_json_file, "json", schema=schemaParam).show()
#EWI: SPRKPY1082 => The pyspark.sql.readwriter.DataFrameReader.load function is not supported. A workaround is to use Snowpark DataFrameReader format specific method instead (avro csv, json, orc, parquet). The path parameter should be a stage location.
my_session.read.load(path_json_file, "json", schema=schemaParam, dateFormat="YYYY/MM/DD", timestampFormat="YYYY-MM-DD HH24:MI:SS.FF3").show()

Recommended fix

As a workaround, you can use Snowpark DataFrameReader methods instead.

  • Fixing the path and format parameters:

    • Replace the load method with the json method.

    • The first parameter, path, must be in a stage to make an equivalence with Snowpark.

Below is an example that creates a temporary stage, puts the file into it, and then calls the json method.

path_json_file = "/path/to/file.json"

## Stage creation

temp_stage = f'{Session.get_fully_qualified_current_schema()}.{_generate_prefix("TEMP_STAGE")}'
my_session.sql(f'CREATE TEMPORARY STAGE IF NOT EXISTS {temp_stage}').show()
my_session.file.put(f"file:///path/to/file.json", f"@{temp_stage}")
stage_file_path = f"@{temp_stage}/file.json"

schemaParam = StructType([
        StructField("Name", StringType(), True),
        StructField("Superhero", StringType(), True)
    ])

my_session.read.json(stage_file_path).show()
  • Fixing schema parameter:

    • The schema can be set by using the schema function as follows:

schemaParam = StructType([
        StructField("name", StringType(), True),
        StructField("city", StringType(), True)
    ])

df = my_session.read.schema(schemaParam).json(temp_stage)
  • Fixing the options parameter:

The options in Spark and Snowpark are not the same; in this case, dateFormat and timestampFormat are replaced with DATE_FORMAT and TIMESTAMP_FORMAT. The Additional recommendations section has a table with all the equivalences.

Below is an example that creates a dictionary with DATE_FORMAT and TIMESTAMP_FORMAT, and calls the options method with that dictionary.

optionsParam = {"DATE_FORMAT": "YYYY/MM/DD", "TIMESTAMP_FORMAT": "YYYY-MM-DD HH24:MI:SS.FF3"}
df = my_session.read.options(optionsParam).json(stage_file_path)

Scenario 3

Input

Below is an example that tries to load data from a PARQUET source.

path_parquet_file = "/path/to/file.parquet"

schemaParam = StructType([
        StructField("Name", StringType(), True),
        StructField("Superhero", StringType(), True)
    ])

my_session.read.load(path_parquet_file, "parquet").show()
my_session.read.load(path_parquet_file, "parquet", schema=schemaParam).show()
my_session.read.load(path_parquet_file, "parquet", schema=schemaParam, pathGlobFilter="*.parquet").show()

Output

The SMA adds the EWI SPRKPY1082 to let you know that this function is not supported by Snowpark, but it has a workaround.

path_parquet_file = "/path/to/file.parquet"

schemaParam = StructType([
        StructField("Name", StringType(), True),
        StructField("Superhero", StringType(), True)
    ])
#EWI: SPRKPY1082 => The pyspark.sql.readwriter.DataFrameReader.load function is not supported. A workaround is to use Snowpark DataFrameReader format specific method instead (avro csv, json, orc, parquet). The path parameter should be a stage location.

my_session.read.load(path_parquet_file, "parquet").show()
#EWI: SPRKPY1082 => The pyspark.sql.readwriter.DataFrameReader.load function is not supported. A workaround is to use Snowpark DataFrameReader format specific method instead (avro csv, json, orc, parquet). The path parameter should be a stage location.
my_session.read.load(path_parquet_file, "parquet", schema=schemaParam).show()
#EWI: SPRKPY1082 => The pyspark.sql.readwriter.DataFrameReader.load function is not supported. A workaround is to use Snowpark DataFrameReader format specific method instead (avro csv, json, orc, parquet). The path parameter should be a stage location.
my_session.read.load(path_parquet_file, "parquet", schema=schemaParam, pathGlobFilter="*.parquet").show()

Recommended fix

As a workaround, you can use Snowpark DataFrameReader methods instead.

  • Fixing path and format parameters:

    • Replace the load method with the parquet method.

    • The first parameter, path, must point to a stage location to be equivalent in Snowpark.

Below is an example that creates a temporary stage, puts the file into it, and then calls the parquet method.

path_parquet_file = "/path/to/file.parquet"

## Stage creation

temp_stage = f'{Session.get_fully_qualified_current_schema()}.{_generate_prefix("TEMP_STAGE")}'
my_session.sql(f'CREATE TEMPORARY STAGE IF NOT EXISTS {temp_stage}').show()
my_session.file.put(f"file:///path/to/file.parquet", f"@{temp_stage}")
stage_file_path = f"@{temp_stage}/file.parquet"

schemaParam = StructType([
        StructField("Name", StringType(), True),
        StructField("Superhero", StringType(), True)
    ])

my_session.read.parquet(stage_file_path).show()
  • Fixing schema parameter:

    • The schema can be set by using the schema function as follows:

schemaParam = StructType([
        StructField("name", StringType(), True),
        StructField("city", StringType(), True)
    ])

df = my_session.read.schema(schemaParam).parquet(f"@{temp_stage}")
  • Fixing options parameter:

The options in Spark and Snowpark are not the same. In this case, pathGlobFilter is replaced with PATTERN; the Additional recommendations section has a table with all the equivalences.

Below is an example that creates a dictionary with PATTERN, and calls the options method with that dictionary.

optionsParam = {"PATTERN": "*.parquet"}
df = Session.read.options(optionsParam).parquet(stage)

Additional recommendations

  • Note that the options in Spark and Snowpark are not the same, but they can be mapped:

| Spark option | Possible values | Snowpark equivalent | Description |
|---|---|---|---|
| header | True or False | SKIP_HEADER = 1 / SKIP_HEADER = 0 | To use the first line of a file as the column names. |
| delimiter | Any single- or multi-character field separator | FIELD_DELIMITER | To specify one or more characters as the separator for each column/field. |
| sep | Any single-character field separator | FIELD_DELIMITER | To specify a single character as the separator for each column/field. |
| encoding | UTF-8, UTF-16, etc. | ENCODING | To decode the CSV files by the given encoding type. The default encoding is UTF-8. |
| lineSep | Any single-character line separator | RECORD_DELIMITER | To define the line separator that should be used for file parsing. |
| pathGlobFilter | File pattern | PATTERN | To define a pattern to read only files whose names match the pattern. |
| recursiveFileLookup | True or False | N/A | To recursively scan a directory to read files. The default value of this option is False. |
| quote | Single quoting character | FIELD_OPTIONALLY_ENCLOSED_BY | To quote fields/columns containing fields where the delimiter/separator can be part of the value. This character quotes all fields when used with the quoteAll option. The default value of this option is the double quote ("). |
| nullValue | String to replace null | NULL_IF | To replace null values with the string while reading and writing the dataframe. |
| dateFormat | Valid date format | DATE_FORMAT | To define a string that indicates a date format. The default format is yyyy-MM-dd. |
| timestampFormat | Valid timestamp format | TIMESTAMP_FORMAT | To define a string that indicates a timestamp format. The default format is yyyy-MM-dd'T'HH:mm:ss. |
| escape | Any single character | ESCAPE | To set a single character as the escape character, overriding the default escape character (\\). |
| inferSchema | True or False | INFER_SCHEMA | Automatically detects the file schema. |
| mergeSchema | True or False | N/A | Not needed in Snowflake, as this happens whenever infer_schema determines the parquet file structure. |
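As an illustration, the mapping above can be captured in a small helper that translates a Spark-style options dictionary into its Snowflake equivalents. This is a hypothetical sketch, not part of SMA or Snowpark; the names come from the table above.

```python
# Hypothetical helper: translates Spark reader options to their Snowflake
# equivalents, following the mapping table above. Options with no Snowpark
# equivalent (e.g. recursiveFileLookup, mergeSchema) are collected separately.
SPARK_TO_SNOWFLAKE_READ_OPTIONS = {
    "delimiter": "FIELD_DELIMITER",
    "sep": "FIELD_DELIMITER",
    "encoding": "ENCODING",
    "lineSep": "RECORD_DELIMITER",
    "pathGlobFilter": "PATTERN",
    "quote": "FIELD_OPTIONALLY_ENCLOSED_BY",
    "nullValue": "NULL_IF",
    "dateFormat": "DATE_FORMAT",
    "timestampFormat": "TIMESTAMP_FORMAT",
    "escape": "ESCAPE",
    "inferSchema": "INFER_SCHEMA",
}

def translate_read_options(spark_options):
    """Return (snowflake_options, unmapped) for a Spark-style options dict."""
    snowflake_options, unmapped = {}, []
    for name, value in spark_options.items():
        if name == "header":
            # header=True maps to SKIP_HEADER = 1, header=False to SKIP_HEADER = 0.
            snowflake_options["SKIP_HEADER"] = 1 if value else 0
        elif name in SPARK_TO_SNOWFLAKE_READ_OPTIONS:
            snowflake_options[SPARK_TO_SNOWFLAKE_READ_OPTIONS[name]] = value
        else:
            unmapped.append(name)
    return snowflake_options, unmapped
```

The resulting dictionary could then be passed to the Snowpark options method, as in the examples above; unmapped options need a manual review.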

  • For the modifiedBefore / modifiedAfter options you can achieve the same result in Snowflake by using the metadata columns and then adding a filter like: df.filter(METADATA$FILE_LAST_MODIFIED > 'some_date').

  • For more support, you can email us at sma-support@snowflake.com or post an issue in the SMA.

SPRKPY1083

Message: The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. A workaround is to use the Snowpark DataFrameWriter copy_into_location method instead.

Category: Warning

Description

The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. The workaround is to use Snowpark DataFrameWriter methods instead.

Scenarios

The Spark signature for this method, DataFrameWriter.save(path, format, mode, partitionBy, **options), does not exist in Snowpark. Therefore, any usage of the save function will have an EWI in the output code.

Scenario 1

Input code

Below is an example that tries to save data with CSV format.

path_csv_file = "/path/to/file.csv"

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]

df = my_session.createDataFrame(data, schema=["Name", "Age", "City"])

df.write.save(path_csv_file, format="csv")
df.write.save(path_csv_file, format="csv", mode="overwrite")
df.write.save(path_csv_file, format="csv", mode="overwrite", lineSep="\r\n", dateFormat="YYYY/MM/DD")
df.write.save(path_csv_file, format="csv", mode="overwrite", partitionBy="City", lineSep="\r\n", dateFormat="YYYY/MM/DD")

Output code

The tool adds this EWI SPRKPY1083 on the output code to let you know that this function is not supported by Snowpark, but it has a workaround.

path_csv_file = "/path/to/file.csv"

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]

df = my_session.createDataFrame(data, schema=["Name", "Age", "City"])

#EWI: SPRKPY1083 => The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. A workaround is to use Snowpark DataFrameWriter copy_into_location method instead.
df.write.save(path_csv_file, format="csv")
#EWI: SPRKPY1083 => The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. A workaround is to use Snowpark DataFrameWriter copy_into_location method instead.
df.write.save(path_csv_file, format="csv", mode="overwrite")
#EWI: SPRKPY1083 => The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. A workaround is to use Snowpark DataFrameWriter copy_into_location method instead.
df.write.save(path_csv_file, format="csv", mode="overwrite", lineSep="\r\n", dateFormat="YYYY/MM/DD")
#EWI: SPRKPY1083 => The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. A workaround is to use Snowpark DataFrameWriter copy_into_location method instead.
df.write.save(path_csv_file, format="csv", mode="overwrite", partitionBy="City", lineSep="\r\n", dateFormat="YYYY/MM/DD")

Recommended fix

As a workaround, you can use Snowpark DataFrameWriter methods instead.

  • Fixing path and format parameters:

    • Replace the save method with the csv or copy_into_location method.

    • If you are using the copy_into_location method, you need to specify the format with the file_format_type parameter.

    • The first parameter, path, must point to a stage location to be equivalent in Snowpark.

Below is an example that creates a temporary stage, puts the file into it, and then calls one of the methods mentioned above.

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]
df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

## Stage creation

temp_stage = f'{Session.get_fully_qualified_current_schema()}.{_generate_prefix("TEMP_STAGE")}'
my_session.sql(f'CREATE TEMPORARY STAGE IF NOT EXISTS {temp_stage}').show()
my_session.file.put(f"file:///path/to/file.csv", f"@{temp_stage}")
stage_file_path = f"@{temp_stage}/file.csv"

## Using csv method
df.write.csv(stage_file_path)

## Using copy_into_location method
df.write.copy_into_location(stage_file_path, file_format_type="csv")

Below is an example that adds the mode method to the method chain, with overwrite as a parameter.

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]
df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

## Using csv method
df.write.mode("overwrite").csv(f"@{temp_stage}")

## Using copy_into_location method
df.write.mode("overwrite").copy_into_location(f"@{temp_stage}", file_format_type="csv")
  • Fixing partitionBy parameter:

    • Use the partition_by parameter of the csv method, as follows:

Below is an example that uses the partition_by parameter from the csv method.

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]
df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

## Using csv method
df.write.csv(f"@{temp_stage}", partition_by="City")

## Using copy_into_location method
df.write.copy_into_location(f"@{temp_stage}", file_format_type="csv", partition_by="City")
  • Fixing options parameter:

The options in Spark and Snowpark are not the same. In this case, lineSep and dateFormat are replaced with RECORD_DELIMITER and DATE_FORMAT; the Additional recommendations section has a table with all the equivalences.

Below is an example that creates a dictionary with RECORD_DELIMITER and DATE_FORMAT, and calls the options method with that dictionary.

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]
df = spark.createDataFrame(data, schema=["Name", "Age", "City"])
optionsParam = {"RECORD_DELIMITER": "\r\n", "DATE_FORMAT": "YYYY/MM/DD"}

## Using csv method
df.write.csv(stage, format_type_options=optionsParam)

## Using copy_into_location method
df.write.copy_into_location(stage, file_format_type="csv", format_type_options=optionsParam)

Scenario 2

Input code

Below is an example that tries to save data with JSON format.

path_json_file = "/path/to/file.json"

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]

df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

df.write.save(path_json_file, format="json")
df.write.save(path_json_file, format="json", mode="overwrite")
df.write.save(path_json_file, format="json", mode="overwrite", dateFormat="YYYY/MM/DD", timestampFormat="YYYY-MM-DD HH24:MI:SS.FF3")
df.write.save(path_json_file, format="json", mode="overwrite", partitionBy="City", dateFormat="YYYY/MM/DD", timestampFormat="YYYY-MM-DD HH24:MI:SS.FF3")

Output code

The tool adds this EWI SPRKPY1083 on the output code to let you know that this function is not supported by Snowpark, but it has a workaround.

path_json_file = "/path/to/file.json"

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]

df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

#EWI: SPRKPY1083 => The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. A workaround is to use Snowpark DataFrameWriter copy_into_location method instead.
df.write.save(path_json_file, format="json")
#EWI: SPRKPY1083 => The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. A workaround is to use Snowpark DataFrameWriter copy_into_location method instead.
df.write.save(path_json_file, format="json", mode="overwrite")
#EWI: SPRKPY1083 => The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. A workaround is to use Snowpark DataFrameWriter copy_into_location method instead.
df.write.save(path_json_file, format="json", mode="overwrite", dateFormat="YYYY/MM/DD", timestampFormat="YYYY-MM-DD HH24:MI:SS.FF3")
#EWI: SPRKPY1083 => The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. A workaround is to use Snowpark DataFrameWriter copy_into_location method instead.
df.write.save(path_json_file, format="json", mode="overwrite", partitionBy="City", dateFormat="YYYY/MM/DD", timestampFormat="YYYY-MM-DD HH24:MI:SS.FF3")

Recommended fix

As a workaround, you can use Snowpark DataFrameWriter methods instead.

  • Fixing path and format parameters:

    • Replace the save method with the json or copy_into_location method.

    • If you are using the copy_into_location method, you need to specify the format with the file_format_type parameter.

    • The first parameter, path, must point to a stage location to be equivalent in Snowpark.

Below is an example that creates a temporary stage, puts the file into it, and then calls one of the methods mentioned above.

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]
df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

## Stage creation

temp_stage = f'{Session.get_fully_qualified_current_schema()}.{_generate_prefix("TEMP_STAGE")}'
my_session.sql(f'CREATE TEMPORARY STAGE IF NOT EXISTS {temp_stage}').show()
my_session.file.put(f"file:///path/to/file.json", f"@{temp_stage}")
stage_file_path = f"@{temp_stage}/file.json"

## Using json method
df.write.json(stage_file_path)

## Using copy_into_location method
df.write.copy_into_location(stage_file_path, file_format_type="json")

Below is an example that adds the mode method to the method chain, with overwrite as a parameter.

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]
df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

## Using json method
df.write.mode("overwrite").json(f"@{temp_stage}")

## Using copy_into_location method
df.write.mode("overwrite").copy_into_location(f"@{temp_stage}", file_format_type="json")
  • Fixing partitionBy parameter:

    • Use the partition_by parameter of the json method, as follows:

Below is an example that uses the partition_by parameter from the json method.

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]
df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

## Using json method
df.write.json(f"@{temp_stage}", partition_by="City")

## Using copy_into_location method
df.write.copy_into_location(f"@{temp_stage}", file_format_type="json", partition_by="City")
  • Fixing options parameter:

The options in Spark and Snowpark are not the same. In this case, dateFormat and timestampFormat are replaced with DATE_FORMAT and TIMESTAMP_FORMAT; the Additional recommendations section has a table with all the equivalences.

Below is an example that creates a dictionary with DATE_FORMAT and TIMESTAMP_FORMAT, and calls the options method with that dictionary.

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]
df = spark.createDataFrame(data, schema=["Name", "Age", "City"])
optionsParam = {"DATE_FORMAT": "YYYY/MM/DD", "TIMESTAMP_FORMAT": "YYYY-MM-DD HH24:MI:SS.FF3"}

## Using json method
df.write.json(stage, format_type_options=optionsParam)

## Using copy_into_location method
df.write.copy_into_location(stage, file_format_type="json", format_type_options=optionsParam)

Scenario 3

Input code

Below is an example that tries to save data with PARQUET format.

path_parquet_file = "/path/to/file.parquet"

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]

df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

df.write.save(path_parquet_file, format="parquet")
df.write.save(path_parquet_file, format="parquet", mode="overwrite")
df.write.save(path_parquet_file, format="parquet", mode="overwrite", pathGlobFilter="*.parquet")
df.write.save(path_parquet_file, format="parquet", mode="overwrite", partitionBy="City", pathGlobFilter="*.parquet")

Output code

The tool adds this EWI SPRKPY1083 on the output code to let you know that this function is not supported by Snowpark, but it has a workaround.

path_parquet_file = "/path/to/file.parquet"

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]

df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

#EWI: SPRKPY1083 => The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. A workaround is to use Snowpark DataFrameWriter copy_into_location method instead.
df.write.save(path_parquet_file, format="parquet")
#EWI: SPRKPY1083 => The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. A workaround is to use Snowpark DataFrameWriter copy_into_location method instead.
df.write.save(path_parquet_file, format="parquet", mode="overwrite")
#EWI: SPRKPY1083 => The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. A workaround is to use Snowpark DataFrameWriter copy_into_location method instead.
df.write.save(path_parquet_file, format="parquet", mode="overwrite", pathGlobFilter="*.parquet")
#EWI: SPRKPY1083 => The pyspark.sql.readwriter.DataFrameWriter.save function is not supported. A workaround is to use Snowpark DataFrameWriter copy_into_location method instead.
df.write.save(path_parquet_file, format="parquet", mode="overwrite", partitionBy="City", pathGlobFilter="*.parquet")

Recommended fix

As a workaround, you can use Snowpark DataFrameWriter methods instead.

  • Fixing path and format parameters:

    • Replace the save method with the parquet or copy_into_location method.

    • If you are using the copy_into_location method, you need to specify the format with the file_format_type parameter.

    • The first parameter, path, must point to a stage location to be equivalent in Snowpark.

Below is an example that creates a temporary stage, puts the file into it, and then calls one of the methods mentioned above.

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]
df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

## Stage creation

temp_stage = f'{Session.get_fully_qualified_current_schema()}.{_generate_prefix("TEMP_STAGE")}'
my_session.sql(f'CREATE TEMPORARY STAGE IF NOT EXISTS {temp_stage}').show()
my_session.file.put(f"file:///path/to/file.parquet", f"@{temp_stage}")
stage_file_path = f"@{temp_stage}/file.parquet"

## Using parquet method
df.write.parquet(stage_file_path)

## Using copy_into_location method
df.write.copy_into_location(stage_file_path, file_format_type="parquet")

Below is an example that adds the mode method to the method chain, with overwrite as a parameter.

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]
df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

## Using parquet method
df.write.mode("overwrite").parquet(f"@{temp_stage}")

## Using copy_into_location method
df.write.mode("overwrite").copy_into_location(f"@{temp_stage}", file_format_type="parquet")
  • Fixing partitionBy parameter:

    • Use the partition_by parameter of the parquet method, as follows:

Below is an example that uses the partition_by parameter from the parquet method.

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]
df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

## Using parquet method
df.write.parquet(f"@{temp_stage}", partition_by="City")

## Using copy_into_location method
df.write.copy_into_location(f"@{temp_stage}", file_format_type="parquet", partition_by="City")
  • Fixing options parameter:

The options in Spark and Snowpark are not the same. In this case, pathGlobFilter is replaced with PATTERN; the Additional recommendations section has a table with all the equivalences.

Below is an example that creates a dictionary with PATTERN, and calls the options method with that dictionary.

data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]
df = spark.createDataFrame(data, schema=["Name", "Age", "City"])
optionsParam = {"PATTERN": "*.parquet"}

## Using parquet method
df.write.parquet(stage, format_type_options=optionsParam)

## Using copy_into_location method
df.write.copy_into_location(stage, file_format_type="parquet", format_type_options=optionsParam)

Additional recommendations

  • Note that the options in Spark and Snowpark are not the same, but they can be mapped:

| Spark option | Possible values | Snowpark equivalent | Description |
|---|---|---|---|
| header | True or False | SKIP_HEADER = 1 / SKIP_HEADER = 0 | To use the first line of a file as the column names. |
| delimiter | Any single- or multi-character field separator | FIELD_DELIMITER | To specify one or more characters as the separator for each column/field. |
| sep | Any single-character field separator | FIELD_DELIMITER | To specify a single character as the separator for each column/field. |
| encoding | UTF-8, UTF-16, etc. | ENCODING | To decode the CSV files by the given encoding type. The default encoding is UTF-8. |
| lineSep | Any single-character line separator | RECORD_DELIMITER | To define the line separator that should be used for file parsing. |
| pathGlobFilter | File pattern | PATTERN | To define a pattern to read only files whose names match the pattern. |
| recursiveFileLookup | True or False | N/A | To recursively scan a directory to read files. The default value of this option is False. |
| quote | Single quoting character | FIELD_OPTIONALLY_ENCLOSED_BY | To quote fields/columns containing fields where the delimiter/separator can be part of the value. This character quotes all fields when used with the quoteAll option. The default value of this option is the double quote ("). |
| nullValue | String to replace null | NULL_IF | To replace null values with the string while reading and writing the dataframe. |
| dateFormat | Valid date format | DATE_FORMAT | To define a string that indicates a date format. The default format is yyyy-MM-dd. |
| timestampFormat | Valid timestamp format | TIMESTAMP_FORMAT | To define a string that indicates a timestamp format. The default format is yyyy-MM-dd'T'HH:mm:ss. |
| escape | Any single character | ESCAPE | To set a single character as the escape character, overriding the default escape character (\\). |
| inferSchema | True or False | INFER_SCHEMA | Automatically detects the file schema. |
| mergeSchema | True or False | N/A | Not needed in Snowflake, as this happens whenever infer_schema determines the parquet file structure. |

  • For the modifiedBefore / modifiedAfter options you can achieve the same result in Snowflake by using the metadata columns and then adding a filter like: df.filter(METADATA$FILE_LAST_MODIFIED > 'some_date').

  • For more support, you can email us at sma-support@snowflake.com or post an issue in the SMA.

SPRKPY1084

This issue code has been deprecated since Spark Conversion Core 4.12.0

Message: pyspark.sql.readwriter.DataFrameWriter.option is not supported.

Category: Warning

Description

The pyspark.sql.readwriter.DataFrameWriter.option function is not supported.

Scenario

Input code

Below is an example using the option method, which is used to add additional configurations when writing the data of a DataFrame.

path_csv_file = "/path/to/file.csv"
data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]

df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

df.write.option("header", True).csv(path_csv_file)
df.write.option("sep", ";").option("lineSep","-").csv(path_csv_file)

Output code

The tool adds this EWI SPRKPY1084 on the output code to let you know that this function is not supported by Snowpark.

path_csv_file = "/path/to/file.csv"
data = [
        ("John", 30, "New York"),
        ("Jane", 25, "San Francisco")
    ]

df = spark.createDataFrame(data, schema=["Name", "Age", "City"])

#EWI: SPRKPY1084 => The pyspark.sql.readwriter.DataFrameWriter.option function is not supported.

df.write.option("header", True).csv(path_csv_file)
#EWI: SPRKPY1084 => The pyspark.sql.readwriter.DataFrameWriter.option function is not supported.
df.write.option("sep", ";").option("lineSep","-").csv(path_csv_file)

Recommended fix

The pyspark.sql.readwriter.DataFrameWriter.option method has no recommended fix.

Additional recommendations

SPRKPY1085

Message: pyspark.ml.feature.VectorAssembler is not supported.

Category: Warning

Description

The pyspark.ml.feature.VectorAssembler is not supported.

Scenario

Input code

VectorAssembler is used to combine multiple columns into a single vector.

data = [
        (1, 10.0, 20.0),
        (2, 25.0, 30.0),
        (3, 50.0, 60.0)
    ]

df = spark.createDataFrame(data, schema=["Id", "col1", "col2"])
vector = VectorAssembler(inputCols=["col1", "col2"], outputCol="cols")

Output code

The tool adds this EWI SPRKPY1085 on the output code to let you know that this class is not supported by Snowpark.

data = [
        (1, 10.0, 20.0),
        (2, 25.0, 30.0),
        (3, 50.0, 60.0)
    ]

df = spark.createDataFrame(data, schema=["Id", "col1", "col2"])
#EWI: SPRKPY1085 => The pyspark.ml.feature.VectorAssembler function is not supported.

vector = VectorAssembler(inputCols=["col1", "col2"], outputCol="cols")

Recommended fix

pyspark.ml.feature.VectorAssembler has no recommended fix.

Additional recommendations

SPRKPY1086

Message: pyspark.ml.linalg.VectorUDT is not supported.

Category: Warning

Description

The pyspark.ml.linalg.VectorUDT is not supported.

Scenario

Input code

VectorUDT is a data type used to represent vector columns in a DataFrame.

data = [
        (1, Vectors.dense([10.0, 20.0])),
        (2, Vectors.dense([25.0, 30.0])),
        (3, Vectors.dense([50.0, 60.0]))
    ]

schema = StructType([
        StructField("Id", IntegerType(), True),
        StructField("VectorCol", VectorUDT(), True),
    ])

df = spark.createDataFrame(data, schema=schema)

Output code

The tool adds this EWI SPRKPY1086 on the output code to let you know that this type is not supported by Snowpark.

data = [
        (1, Vectors.dense([10.0, 20.0])),
        (2, Vectors.dense([25.0, 30.0])),
        (3, Vectors.dense([50.0, 60.0]))
    ]

#EWI: SPRKPY1086 => The pyspark.ml.linalg.VectorUDT function is not supported.
schema = StructType([
        StructField("Id", IntegerType(), True),
        StructField("VectorCol", VectorUDT(), True),
    ])

df = spark.createDataFrame(data, schema=schema)

Recommended fix

pyspark.ml.linalg.VectorUDT has no recommended fix.

Additional recommendations

SPRKPY1087

Message: The pyspark.sql.dataframe.DataFrame.writeTo function is not supported, but it has a workaround.

Category: Warning

Description

The pyspark.sql.dataframe.DataFrame.writeTo function is not supported. The workaround is to use Snowpark DataFrameWriter SaveAsTable method instead.

Scenario

Input

Below is an example of a use of the pyspark.sql.dataframe.DataFrame.writeTo function; the dataframe df is written into a table named Personal_info.

df = spark.createDataFrame([["John", "Berry"], ["Rick", "Berry"], ["Anthony", "Davis"]],
                                 schema=["FIRST_NAME", "LAST_NAME"])

df.writeTo("Personal_info")

Output

The SMA adds the EWI SPRKPY1087 to the output code to let you know that this function is not supported, but has a workaround.

df = spark.createDataFrame([["John", "Berry"], ["Rick", "Berry"], ["Anthony", "Davis"]],
                                 schema=["FIRST_NAME", "LAST_NAME"])

#EWI: SPRKPY1087 => pyspark.sql.dataframe.DataFrame.writeTo is not supported, but it has a workaround.
df.writeTo("Personal_info")

Recommended fix

The workaround is to use the Snowpark DataFrameWriter saveAsTable method instead.

df = spark.createDataFrame([["John", "Berry"], ["Rick", "Berry"], ["Anthony", "Davis"]],
                                 schema=["FIRST_NAME", "LAST_NAME"])

df.write.saveAsTable("Personal_info")

Additional recommendations

SPRKPY1088

Message: The pyspark.sql.readwriter.DataFrameWriter.option values in Snowpark may be different, so validation might be needed.

Category: Warning

Description

The pyspark.sql.readwriter.DataFrameWriter.option values in Snowpark may be different, so validation might be needed to ensure that the behavior is correct.

Scenarios

There are several scenarios, depending on whether the options are supported or not, and on the format used to write the file.

Scenario 1

Input

Below is an example of the usage of the option method, adding a sep option, which is currently supported.

df = spark.createDataFrame([(100, "myVal")], ["ID", "Value"])

df.write.option("sep", ",").csv("some_path")

Output

The tool adds the EWI SPRKPY1088 indicating that validation is required.

df = spark.createDataFrame([(100, "myVal")], ["ID", "Value"])
#EWI: SPRKPY1088 => The pyspark.sql.readwriter.DataFrameWriter.option values in Snowpark may be different, so required validation might be needed.
df.write.option("sep", ",").csv("some_path")

Recommended fix

The Snowpark API supports this parameter, so the only action needed is to check the behavior after the migration. Please refer to the equivalence table to see the supported parameters.

df = spark.createDataFrame([(100, "myVal")], ["ID", "Value"])
#EWI: SPRKPY1088 => The pyspark.sql.readwriter.DataFrameWriter.option values in Snowpark may be different, so required validation might be needed.
df.write.option("sep", ",").csv("some_path")
Scenario 2

Input

This scenario shows the usage of option, but adds a header option, which is not supported.

df = spark.createDataFrame([(100, "myVal")], ["ID", "Value"])

df.write.option("header", True).csv("some_path")

Output

The tool adds the EWI SPRKPY1088 indicating that validation is required.

df = spark.createDataFrame([(100, "myVal")], ["ID", "Value"])
#EWI: SPRKPY1088 => The pyspark.sql.readwriter.DataFrameWriter.option values in Snowpark may be different, so required validation might be needed.
df.write.option("header", True).csv("some_path")

Recommended fix

For this scenario it is recommended to evaluate the Snowpark format type options to see if it is possible to change it according to your needs. Also, check the behavior after the change.

df = spark.createDataFrame([(100, "myVal")], ["ID", "Value"])
#EWI: SPRKPY1088 => The pyspark.sql.readwriter.DataFrameWriter.option values in Snowpark may be different, so required validation might be needed.
df.write.csv("some_path")
Scenario 3

Input

This scenario adds a sep option, but uses the JSON method, which does not support it.

  • Note: this scenario also applies to PARQUET.

df = spark.createDataFrame([(100, "myVal")], ["ID", "Value"])

df.write.option("sep", ",").json("some_path")

Output

The tool adds the EWI SPRKPY1088 indicating that validation is required.

df = spark.createDataFrame([(100, "myVal")], ["ID", "Value"])
#EWI: SPRKPY1088 => The pyspark.sql.readwriter.DataFrameWriter.option values in Snowpark may be different, so required validation might be needed.
df.write.option("sep", ",").json("some_path")

Recommended fix

The JSON file format does not support the parameter sep, so it is recommended to evaluate the Snowpark format type options to see whether you can adjust them to your needs, and to check the behavior after the change.

df = spark.createDataFrame([(100, "myVal")], ["ID", "Value"])
#EWI: SPRKPY1088 => The pyspark.sql.readwriter.DataFrameWriter.option values in Snowpark may be different, so required validation might be needed.
df.write.json("some_path")
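
The three scenarios above boil down to a per-format allow-list: an option that has a Snowflake equivalent for CSV (such as sep) may have no equivalent at all for JSON or PARQUET. The sketch below encodes that idea in plain Python as a pre-migration check; `SUPPORTED_WRITE_OPTIONS` and `unsupported_options` are hypothetical helpers, not part of Snowpark or SMA:

```python
# Hypothetical allow-list derived from the scenarios above: sep maps to
# FIELD_DELIMITER for CSV, but has no equivalent for JSON or PARQUET.
SUPPORTED_WRITE_OPTIONS = {
    "csv": {"sep", "lineSep", "quote", "nullValue", "dateFormat", "timestampFormat"},
    "json": set(),
    "parquet": set(),
}

def unsupported_options(file_format, options):
    """Return the option names with no Snowflake equivalent for this format."""
    allowed = SUPPORTED_WRITE_OPTIONS.get(file_format.lower(), set())
    return sorted(set(options) - allowed)

print(unsupported_options("csv", {"sep": ","}))   # []
print(unsupported_options("json", {"sep": ","}))  # ['sep']
```

Running such a check over the options used at each write call site makes it easy to see which calls need a manual review after the migration.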

Additional recommendations

  • Since some parameters are not supported, it is recommended to check the table of equivalences and verify the behavior after the transformation.

  • Table of equivalences:

| PySpark Option  | Snowflake Option             | Supported File Formats | Description |
|-----------------|------------------------------|------------------------|-------------|
| SEP             | FIELD_DELIMITER              | CSV                    | One or more single-byte or multibyte characters that separate fields in an input file. |
| LINESEP         | RECORD_DELIMITER             | CSV                    | One or more characters that separate records in an input file. |
| QUOTE           | FIELD_OPTIONALLY_ENCLOSED_BY | CSV                    | Character used to enclose strings. |
| NULLVALUE       | NULL_IF                      | CSV                    | String used to convert to and from SQL NULL. |
| DATEFORMAT      | DATE_FORMAT                  | CSV                    | String that defines the format of date values in the data files to be loaded. |
| TIMESTAMPFORMAT | TIMESTAMP_FORMAT             | CSV                    | String that defines the format of timestamp values in the data files to be loaded. |

If the parameter used is not in the list, the API throws an error.
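
The equivalence table above can also be read as a simple rename map. The sketch below is a plain-Python illustration of that mapping; `translate_csv_options` is a hypothetical helper, not an SMA or Snowpark API:

```python
# Hypothetical rename map taken directly from the equivalence table above (CSV only).
PYSPARK_TO_SNOWFLAKE_CSV = {
    "SEP": "FIELD_DELIMITER",
    "LINESEP": "RECORD_DELIMITER",
    "QUOTE": "FIELD_OPTIONALLY_ENCLOSED_BY",
    "NULLVALUE": "NULL_IF",
    "DATEFORMAT": "DATE_FORMAT",
    "TIMESTAMPFORMAT": "TIMESTAMP_FORMAT",
}

def translate_csv_options(options):
    """Translate PySpark writer options to their Snowflake names.

    Raises KeyError for options without an equivalent, mirroring the note
    above that parameters outside the list cause the API to raise an error.
    """
    translated = {}
    for key, value in options.items():
        snowflake_key = PYSPARK_TO_SNOWFLAKE_CSV.get(key.upper())
        if snowflake_key is None:
            raise KeyError(f"No Snowflake equivalent for option '{key}'")
        translated[snowflake_key] = value
    return translated

print(translate_csv_options({"sep": ",", "nullValue": "myVal"}))
# {'FIELD_DELIMITER': ',', 'NULL_IF': 'myVal'}
```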

SPRKPY1089

Message: The pyspark.sql.readwriter.DataFrameWriter.options values in Snowpark may be different, so required validation might be needed.

Category: Warning

Description

The pyspark.sql.readwriter.DataFrameWriter.options values in Snowpark may be different, so validation might be needed to ensure that the behavior is correct.

Scenarios

There are several scenarios, depending on whether the options are supported and on the format used to write the file.

Scenario 1

Input

Below is an example of the usage of the method options, adding the options sep and nullValue, which are currently supported.

df = spark.createDataFrame([(1, "myVal"), (2, "myVal2"), (None, "myVal3")], ["ID", "Value"])

df.write.options(nullValue="myVal", sep=",").csv("some_path")

Output

The tool adds the EWI SPRKPY1089 indicating that validation is required.

df = spark.createDataFrame([(1, "myVal"), (2, "myVal2"), (None, "myVal3")], ["ID", "Value"])
#EWI: SPRKPY1089 => The pyspark.sql.readwriter.DataFrameWriter.options values in Snowpark may be different, so required validation might be needed.
df.write.options(nullValue="myVal", sep=",").csv("some_path")

Recommended fix

The Snowpark API supports these parameters, so the only action needed is to check the behavior after migration. Please refer to the equivalence table for the supported parameters.

df = spark.createDataFrame([(1, "myVal"), (2, "myVal2"), (None, "myVal3")], ["ID", "Value"])
#EWI: SPRKPY1089 => The pyspark.sql.readwriter.DataFrameWriter.options values in Snowpark may be different, so required validation might be needed.
df.write.options(nullValue="myVal", sep=",").csv("some_path")
Scenario 2

Input

This scenario shows a usage of options that adds a header option, which is not supported.

df = spark.createDataFrame([(1, "myVal"), (2, "myVal2"), (None, "myVal3")], ["ID", "Value"])

df.write.options(header=True, sep=",").csv("some_path")

Output

The tool adds the EWI SPRKPY1089 indicating that validation is required.

df = spark.createDataFrame([(1, "myVal"), (2, "myVal2"), (None, "myVal3")], ["ID", "Value"])
#EWI: SPRKPY1089 => The pyspark.sql.readwriter.DataFrameWriter.options values in Snowpark may be different, so required validation might be needed.
df.write.options(header=True, sep=",").csv("some_path")

Recommended fix

For this scenario, it is recommended to evaluate the Snowpark format type options to see whether you can adjust them to your needs, and to check the behavior after the change.

df = spark.createDataFrame([(1, "myVal"), (2, "myVal2"), (None, "myVal3")], ["ID", "Value"])
#EWI: SPRKPY1089 => The pyspark.sql.readwriter.DataFrameWriter.options values in Snowpark may be different, so required validation might be needed.
df.write.csv("some_path")
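
The fix above simply drops the unsupported header option. When many call sites need the same cleanup, the filtering can be factored out; `split_options` below is a hypothetical helper based on the CSV equivalence table in this section, not part of Snowpark:

```python
# Hypothetical split of writer kwargs into supported vs. unsupported options,
# based on the CSV equivalence table in this section.
SUPPORTED_CSV_OPTIONS = {"sep", "lineSep", "quote", "nullValue", "dateFormat", "timestampFormat"}

def split_options(**options):
    """Split kwargs into (supported, unsupported) dicts for a CSV write."""
    supported = {k: v for k, v in options.items() if k in SUPPORTED_CSV_OPTIONS}
    unsupported = {k: v for k, v in options.items() if k not in SUPPORTED_CSV_OPTIONS}
    return supported, unsupported

keep, drop = split_options(header=True, sep=",")
print(keep)  # {'sep': ','}
print(drop)  # {'header': True}  -> review these manually after migration
```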
Scenario 3

Input

This scenario adds a sep option, which is supported, but uses the JSON method.

df = spark.createDataFrame([(1, "myVal"), (2, "myVal2"), (None, "myVal3")], ["ID", "Value"])

df.write.options(nullValue="myVal", sep=",").json("some_path")

Output

The tool adds the EWI SPRKPY1089 indicating that validation is required.

  • Note: this scenario also applies to PARQUET.

df = spark.createDataFrame([(1, "myVal"), (2, "myVal2"), (None, "myVal3")], ["ID", "Value"])
#EWI: SPRKPY1089 => The pyspark.sql.readwriter.DataFrameWriter.options values in Snowpark may be different, so required validation might be needed.
df.write.options(nullValue="myVal", sep=",").json("some_path")

Recommended fix

The JSON file format does not support the parameter sep, so it is recommended to evaluate the Snowpark format type options to see whether you can adjust them to your needs, and to check the behavior after the change.

df = spark.createDataFrame([(1, "myVal"), (2, "myVal2"), (None, "myVal3")], ["ID", "Value"])
#EWI: SPRKPY1089 => The pyspark.sql.readwriter.DataFrameWriter.options values in Snowpark may be different, so required validation might be needed.
df.write.json("some_path")

Additional recommendations

  • Since some parameters are not supported, it is recommended to check the table of equivalences and verify the behavior after the transformation.

  • Table of equivalences:

Snowpark supports a list of equivalences for some parameters:

| PySpark Option  | Snowflake Option             | Supported File Formats | Description |
|-----------------|------------------------------|------------------------|-------------|
| SEP             | FIELD_DELIMITER              | CSV                    | One or more single-byte or multibyte characters that separate fields in an input file. |
| LINESEP         | RECORD_DELIMITER             | CSV                    | One or more characters that separate records in an input file. |
| QUOTE           | FIELD_OPTIONALLY_ENCLOSED_BY | CSV                    | Character used to enclose strings. |
| NULLVALUE       | NULL_IF                      | CSV                    | String used to convert to and from SQL NULL. |
| DATEFORMAT      | DATE_FORMAT                  | CSV                    | String that defines the format of date values in the data files to be loaded. |
| TIMESTAMPFORMAT | TIMESTAMP_FORMAT             | CSV                    | String that defines the format of timestamp values in the data files to be loaded. |

If the parameter used is not in the list, the API throws an error.

SPRKPY1101

Category

Parsing error.

Description

When the tool detects a parsing error, it tries to recover and continues processing on the next line. In those cases, it reports the error and comments out the offending line.

This example shows how a mismatch between spaces and tabs in indentation is handled.

Input code

def foo():
    x = 5 # Spaces
     y = 6 # Tab

def foo2():
    x=6
    y=7

Output code

def foo():
    x = 5 # Spaces
## EWI: SPRKPY1101 => Unrecognized or invalid CODE STATEMENT @(3, 2). Last valid token was '5' @(2, 9), failed token 'y' @(3, 2)
## y = 6 # Tab

def foo2():
    x=6
    y=7
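
The tab/space mismatch shown above is also rejected by the standard Python compiler, which can be reproduced with the built-in compile() function:

```python
# Reproducing the indentation error above with the built-in compiler:
# line 2 is indented with spaces, line 3 with a tab.
source = (
    "def foo():\n"
    "    x = 5  # spaces\n"
    "\ty = 6  # tab\n"
)

try:
    compile(source, "<sample>", "exec")
    result = "compiled"
except TabError as exc:
    result = f"TabError: {exc.msg}"

print(result)  # TabError: inconsistent use of tabs and spaces in indentation
```

Running the source through the plain Python interpreter first is a quick way to surface this class of parsing error before handing the code to SMA.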

Recommendations

  • Try to fix the commented-out line.

  • For more support, email us at sma-support@snowflake.com. If you have a support contract with Snowflake, reach out to your sales engineer, who can direct your support needs.