modin.pandas.DataFrame.apply
- DataFrame.apply(func: AggFuncType | UserDefinedFunction, axis: Axis = 0, raw: bool = False, result_type: Literal['expand', 'reduce', 'broadcast'] | None = None, args=(), **kwargs)
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
Snowpark pandas currently only supports apply with axis=1 and callable func.
- Parameters:
func (function) – A Python function object to apply to each column or row, or a Python function decorated with @udf.
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Axis along which the function is applied:
0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.
Snowpark pandas does not yet support axis=0.
raw (bool, default False) –
Determines if row or column is passed as a Series or ndarray object:
False: passes each row or column as a Series to the function.
True: the passed function will receive ndarray objects instead.
result_type ({'expand', 'reduce', 'broadcast', None}, default None) –
These only act when axis=1 (columns):
'expand': list-like results will be turned into columns.
'reduce': returns a Series if possible rather than expanding list-like results. This is the opposite of 'expand'.
'broadcast': results will be broadcast to the original shape of the DataFrame; the original index and columns will be retained.
Snowpark pandas does not yet support the result_type parameter.
args (tuple) – Positional arguments to pass to func in addition to the array/series.
**kwargs – Additional keyword arguments to pass to func.
- Returns:
Result of applying func along the given axis of the DataFrame.
- Return type:
See also
Series.apply: For applying more complex functions on a Series.
DataFrame.applymap: Apply a function elementwise on a whole DataFrame.
Notes
1. When func has a type annotation for its return value, the result will be cast to the corresponding dtype. When no type annotation is provided, data will be converted to VARIANT type in Snowflake, and the result will have dtype=object. In this case, the return value must be JSON-serializable, that is, a valid input to json.dumps (e.g., dict and list objects are JSON-serializable, but bytes and datetime.datetime objects are not). The return type hint is used only when func is a series-to-scalar function.
2. Under the hood, we use Snowflake Vectorized Python UDFs to implement the apply() method with axis=1. You can find type mappings from Snowflake SQL types to pandas dtypes here.
3. Snowflake supports two types of NULL values in VARIANT data: JSON NULL and SQL NULL. When no type annotation is provided and VARIANT data is returned, Python None is translated to JSON NULL, and all other pandas missing values (np.nan, pd.NA, pd.NaT) are translated to SQL NULL.
4. If func is a series-to-series function that can also be used as a scalar-to-scalar function (e.g., np.sqrt, lambda x: x+1), using df.applymap() to apply the function element-wise may give better performance.
5. When func can return a series with different indices, e.g., lambda x: pd.Series([1, 2], index=["a", "b"] if x.sum() > 2 else ["b", "c"]), the values with the same label will be merged together.
6. The index values of the series returned from func must be JSON-serializable. For example, lambda x: pd.Series([1], index=[bytes(1)]) will raise a SQL exception because Python bytes objects are not JSON-serializable.
7. When func uses any first-party modules or third-party packages inside the function, you need to add these dependencies via session.add_import() and session.add_packages(). Alternatively, specify third-party packages with the @udf decorator. When using the @udf decorator, annotations using PandasSeriesType or PandasDataFrameType are not supported.
8. The Snowpark pandas module cannot currently be referenced inside the definition of func. If you need to call a general pandas API like pd.Timestamp inside func, please use the original pandas module (with import pandas) as a workaround.
Examples
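Notes 1 and 6 require return values and index labels to be JSON-serializable when no return annotation is given. A quick way to check a candidate value locally is to try json.dumps on it. The sketch below uses only the Python standard library; the helper name is_json_serializable is hypothetical and not part of Snowpark pandas.

```python
import datetime
import json


def is_json_serializable(value):
    """Hypothetical helper: True if json.dumps accepts the value."""
    try:
        json.dumps(value)
    except (TypeError, ValueError):
        return False
    return True


# dict and list values pass; bytes and datetime.datetime values do not.
checks = {
    "dict": is_json_serializable({"a": 1}),
    "list": is_json_serializable([1, 2]),
    "bytes": is_json_serializable(bytes(1)),
    "datetime": is_json_serializable(datetime.datetime(2024, 1, 1)),
}
print(checks)
```

A value that fails this check would need to be converted (for example, to a string) before being returned from func.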
>>> df = pd.DataFrame([[2, 0], [3, 7], [4, 9]], columns=['A', 'B'])
>>> df
   A  B
0  2  0
1  3  7
2  4  9
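To make the axis and raw parameters concrete, the following sketch inspects what func receives for each row. It uses stock pandas as a local stand-in (an assumption, so that it runs without a Snowflake session); Snowpark pandas may differ in details.

```python
import numpy as np
import pandas as pd  # stock pandas, used only as a local stand-in

df = pd.DataFrame([[2, 0], [3, 7], [4, 9]], columns=["A", "B"])

# With axis=1, func is called once per row and receives a Series
# whose index is the DataFrame's column labels.
got_series = df.apply(lambda row: isinstance(row, pd.Series), axis=1)
row_labels = df.apply(lambda row: list(row.index), axis=1)

# With raw=True, func receives a NumPy ndarray instead of a Series.
got_ndarray = df.apply(lambda row: isinstance(row, np.ndarray), axis=1, raw=True)

print(got_series.tolist())
print(row_labels.tolist())
print(got_ndarray.tolist())
```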
Using a reducing function on axis=1:
>>> df.apply(np.sum, axis=1)
0     2
1    10
2    13
dtype: int64
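Note 1 states that a return type annotation on a series-to-scalar func determines the result dtype. A minimal sketch of such an annotated function follows. It runs against stock pandas (an assumption for local execution), which ignores the annotation, so it only shows the shape of the function, not the Snowflake-side cast. The name row_total is illustrative.

```python
from typing import get_type_hints

import pandas as pd  # stock pandas, used only as a local stand-in

df = pd.DataFrame([[2, 0], [3, 7], [4, 9]], columns=["A", "B"])


def row_total(row: pd.Series) -> int:
    """Series-to-scalar function with an explicit return annotation."""
    return int(row["A"]) + int(row["B"])


# The annotation is available for inspection by whatever calls the function.
return_hint = get_type_hints(row_total)["return"]
totals = df.apply(row_total, axis=1)

print(return_hint)
print(totals.tolist())
```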
Returning a list-like object will result in a Series:
>>> df.apply(lambda x: [1, 2], axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object
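When func returns a Series rather than a list, Note 5 says that values with the same label are merged. The sketch below runs the lambda from that note against stock pandas (an assumption for local execution), where the returned Series are aligned on their labels and missing positions become NaN. Snowpark pandas output may differ in dtype or column order.

```python
import pandas as pd  # stock pandas, used only as a local stand-in

df = pd.DataFrame([[2, 0], [3, 7], [4, 9]], columns=["A", "B"])


def labelled(row):
    # Rows whose sum exceeds 2 use labels a/b; the others use b/c.
    labels = ["a", "b"] if row.sum() > 2 else ["b", "c"]
    return pd.Series([1, 2], index=labels)


result = df.apply(labelled, axis=1)

# The result has one column per distinct label across all returned Series.
print(result)
```

Only the first row sums to 2, so it is the only row with a value under label c and no value under label a.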
To work with 3rd party packages, add them to the current session:
>>> import scipy.stats
>>> pd.session.custom_package_usage_config['enabled'] = True
>>> pd.session.add_packages(['numpy', scipy])
>>> df.apply(lambda x: np.dot(x * scipy.stats.norm.cdf(0), x * scipy.stats.norm.cdf(0)), axis=1)
0     1.00
1    14.50
2    24.25
dtype: float64
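The output above can be checked by hand: the standard normal CDF at 0 is 0.5, so each row value is the dot product of the half-scaled row with itself. The sketch below reproduces the arithmetic with stock pandas and NumPy only, hard-coding 0.5 in place of scipy.stats.norm.cdf(0) (an assumption made to avoid the scipy dependency).

```python
import numpy as np
import pandas as pd  # stock pandas, used only as a local stand-in

df = pd.DataFrame([[2, 0], [3, 7], [4, 9]], columns=["A", "B"])

HALF = 0.5  # value of the standard normal CDF at 0


def scaled_dot(row):
    # Scale the row by 0.5 and take its dot product with itself.
    scaled = (row * HALF).to_numpy()
    return float(np.dot(scaled, scaled))


result = df.apply(scaled_dot, axis=1)
print(result.tolist())  # [1.0, 14.5, 24.25]
```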
Alternatively, annotate the function with the @udf decorator from Snowpark (https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/api/snowflake.snowpark.functions.udf):
>>> from snowflake.snowpark.functions import udf
>>> from snowflake.snowpark.types import DoubleType
>>> @udf(packages=['statsmodels>0.12'], return_type=DoubleType())
... def autocorr(column):
...     import pandas as pd
...     import statsmodels.tsa.stattools
...     return pd.Series(statsmodels.tsa.stattools.pacf_ols(column.values)).mean()
...
>>> df.apply(autocorr, axis=0)
A    0.857143
B    0.428571
dtype: float64
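The function above imports pandas inside its own body, which is the workaround Note 8 describes for calling general pandas APIs such as pd.Timestamp from func. A minimal sketch of the same pattern without the @udf decorator follows. It runs against stock pandas as a local stand-in (an assumption), and the names start_year and native_pd are illustrative.

```python
import pandas as pd  # stand-in; a Snowpark pandas script would import modin.pandas here

df = pd.DataFrame([[2, 0], [3, 7], [4, 9]], columns=["A", "B"])


def start_year(row) -> int:
    # Import the original pandas inside the function body, as Note 8 suggests,
    # instead of referring to the outer module.
    import pandas as native_pd

    ts = native_pd.Timestamp(year=2000 + int(row["A"]), month=1, day=1)
    return ts.year  # plain int, which is JSON-serializable


years = df.apply(start_year, axis=1)
print(years.tolist())  # [2002, 2003, 2004]
```

Returning ts.year rather than the Timestamp itself keeps the result compatible with the JSON-serializability requirement in Note 1.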