modin.pandas.DataFrame.apply

DataFrame.apply(func: AggFuncType | UserDefinedFunction, axis: Axis = 0, raw: bool = False, result_type: Literal['expand', 'reduce', 'broadcast'] | None = None, args=(), **kwargs)

Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.

Snowpark pandas currently only supports apply with axis=1 and callable func.

Parameters:

func (function) – A Python function object to apply to each column or row, or a Python function decorated with @udf.
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Axis along which the function is applied:
- 0 or ‘index’: apply function to each column.
- 1 or ‘columns’: apply function to each row.
Snowpark pandas does not yet support axis=0.
raw (bool, default False) –
Determines if row or column is passed as a Series or ndarray object:
- False : passes each row or column as a Series to the function.
- True : the passed function will receive ndarray objects instead.
result_type ({'expand', 'reduce', 'broadcast', None}, default None) –
These only act when axis=1 (columns):
- ’expand’ : list-like results will be turned into columns.
- ’reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
- ’broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
Snowpark pandas does not yet support the result_type parameter.
args (tuple) – Positional arguments to pass to func in addition to the array/series.
**kwargs – Additional keyword arguments to pass to func.
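
As a rough illustration of how args and **kwargs reach func, the sketch below uses native pandas as a local stand-in (an assumption, made so that it runs without a Snowflake session). The names scaled_sum, factor, and offset are hypothetical and not part of the API.

```python
import pandas as pd  # native pandas, used here only as a local stand-in

df = pd.DataFrame([[2, 0], [3, 7], [4, 9]], columns=["A", "B"])


def scaled_sum(row, factor, offset=0) -> int:
    # `row` is the Series for one row; `factor` arrives through args,
    # `offset` arrives through **kwargs.
    return int(row.sum() * factor + offset)


# args supplies extra positional arguments, keyword arguments are forwarded as-is.
result = df.apply(scaled_sum, axis=1, args=(10,), offset=1)
print(result.tolist())
```

The extra arguments are appended after the row itself, so func always receives the Series first.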

Returns:

Result of applying func along the given axis of the DataFrame.

Return type:

Series or DataFrame

See also

Series.apply : For applying more complex functions on a Series.

DataFrame.applymap : Apply a function elementwise on a whole DataFrame.

Notes

1. When func has a type annotation for its return value, the result is cast to the corresponding dtype. When no type annotation is provided, data is converted to the VARIANT type in Snowflake, and the result has dtype=object. In this case, the return value must be JSON-serializable, that is, a valid input to json.dumps (e.g., dict and list objects are JSON-serializable, but bytes and datetime.datetime objects are not). The return type hint is used only when func is a series-to-scalar function.

2. Under the hood, the apply() method with axis=1 is implemented with Snowflake Vectorized Python UDFs. The type mappings from Snowflake SQL types to pandas dtypes are listed in the Snowflake documentation.

3. Snowflake supports two types of NULL values in VARIANT data: JSON NULL and SQL NULL. When no type annotation is provided and VARIANT data is returned, Python None is translated to JSON NULL, and all other pandas missing values (np.nan, pd.NA, pd.NaT) are translated to SQL NULL.

4. If func is a series-to-series function that can also be used as a scalar-to-scalar function (e.g., np.sqrt, lambda x: x+1), using df.applymap() to apply the function element-wise may give better performance.

5. When func can return a series with different indices, e.g., lambda x: pd.Series([1, 2], index=["a", "b"] if x.sum() > 2 else ["b", "c"]), the values with the same label will be merged together.

6. The index values of the series returned from func must be JSON-serializable. For example, lambda x: pd.Series([1], index=[bytes(1)]) will raise a SQL exception because Python bytes objects are not JSON-serializable.

7. When func uses any first-party modules or third-party packages inside the function, you need to add these dependencies via session.add_import() and session.add_packages(). Alternatively, specify third-party packages with the @udf decorator. When using the @udf decorator, annotations using PandasSeriesType or PandasDataFrameType are not supported.

8. The Snowpark pandas module cannot currently be referenced inside the definition of func. If you need to call a general pandas API like pd.Timestamp inside func, please use the original pandas module (with import pandas) as a workaround.
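
Notes 1 and 6 both hinge on JSON-serializability. One way to check a candidate return value or index label locally, before passing func to apply, is to try json.dumps on it. The helper name is_json_serializable below is hypothetical; the sketch uses only the Python standard library.

```python
import datetime
import json


def is_json_serializable(value) -> bool:
    """Return True if json.dumps accepts `value`, False otherwise."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


# dict and list objects pass; bytes and datetime.datetime objects do not.
print(is_json_serializable({"a": [1, 2]}))
print(is_json_serializable(bytes(1)))
print(is_json_serializable(datetime.datetime(2024, 1, 1)))
```

A value that fails this check would need converting (for example, to a string) before func returns it without a type annotation.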

Examples

>>> df = pd.DataFrame([[2, 0], [3, 7], [4, 9]], columns=['A', 'B'])
>>> df
   A  B
0  2  0
1  3  7
2  4  9


Using a reducing function on axis=1:

>>> df.apply(np.sum, axis=1)
0     2
1    10
2    13
dtype: int64
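
A sketch of a row-wise function that follows notes 1 and 8: it declares a return type hint and imports the original pandas module inside the function body. Native pandas also plays the role of the DataFrame library here so that the snippet runs locally, which is an assumption; shifted_date and the start date are illustrative only.

```python
import pandas as pd  # local stand-in for the DataFrame library in this sketch

df = pd.DataFrame([[2, 0], [3, 7], [4, 9]], columns=["A", "B"])


def shifted_date(row) -> str:
    # Import the original pandas module inside the function, as note 8 suggests.
    import pandas as native_pd

    start = native_pd.Timestamp("2024-01-01")
    shifted = start + native_pd.Timedelta(days=int(row["A"]))
    # Return a plain string so the value matches the `str` return hint.
    return shifted.date().isoformat()


dates = df.apply(shifted_date, axis=1)
print(dates.tolist())
```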


Returning a list-like object will result in a Series:

>>> df.apply(lambda x: [1, 2], axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object
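
Returning a Series instead of a list produces columns, and note 5 says that values with the same label are merged when the returned indices differ between rows. The sketch below shows that label alignment with native pandas as a local stand-in; this is an assumption, and details such as column order or missing-value representation may differ in Snowpark pandas. The function name labelled is hypothetical.

```python
import pandas as pd  # native pandas, used here only as a local stand-in

df = pd.DataFrame([[2, 0], [3, 7], [4, 9]], columns=["A", "B"])


def labelled(row):
    # Rows whose sum exceeds 2 return labels a/b, the others return b/c.
    index = ["a", "b"] if row.sum() > 2 else ["b", "c"]
    return pd.Series([1, 2], index=index)


merged = df.apply(labelled, axis=1)
# Each label becomes one column; rows that did not return a label get a missing value.
print(sorted(merged.columns))
print(merged)
```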


To work with third-party packages, add them to the current session:

>>> import scipy.stats
>>> pd.session.custom_package_usage_config['enabled'] = True
>>> pd.session.add_packages(['numpy', scipy])
>>> df.apply(lambda x: np.dot(x * scipy.stats.norm.cdf(0), x * scipy.stats.norm.cdf(0)), axis=1)
0     1.00
1    14.50
2    24.25
dtype: float64


Alternatively, annotate the function with the @udf decorator from Snowpark (https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/api/snowflake.snowpark.functions.udf):

>>> from snowflake.snowpark.functions import udf
>>> from snowflake.snowpark.types import DoubleType
>>> @udf(packages=['statsmodels>0.12'], return_type=DoubleType())
... def autocorr(column):
...    import pandas as pd
...    import statsmodels.tsa.stattools
...    return pd.Series(statsmodels.tsa.stattools.pacf_ols(column.values)).mean()
...
>>> df.apply(autocorr, axis=0)  
A    0.857143
B    0.428571
dtype: float64
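
Note 4 points out that a function such as np.sqrt works both on a whole row and on a single value, so it can be applied element-wise instead. The sketch below compares the two call styles with native pandas as a local stand-in (an assumption). In native pandas 2.1 and later the element-wise method is named DataFrame.map, with applymap as the older name, so the snippet picks whichever exists.

```python
import numpy as np
import pandas as pd  # native pandas, used here only as a local stand-in

df = pd.DataFrame([[4.0, 0.0], [9.0, 16.0]], columns=["A", "B"])

# Series-to-series use: np.sqrt receives each row.
row_wise = df.apply(np.sqrt, axis=1)

# Scalar-to-scalar use: np.sqrt receives each element.
# Prefer DataFrame.map when available, otherwise fall back to applymap.
elementwise_method = getattr(df, "map", None) or df.applymap
element_wise = elementwise_method(np.sqrt)

print(row_wise.values.tolist())
print(element_wise.values.tolist())
```

Both calls give the same values; the note above suggests the element-wise form may perform better in Snowpark pandas.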
