Snowpark pandas API

The Snowpark pandas API lets you run your pandas code directly on your data in Snowflake. Just by changing the import statement and a few lines of code, you can get the same pandas-native experience you know and love at the speed and scale of Snowflake. With this API, you can work with much larger datasets so you can avoid the time and expense of porting your pandas pipelines to other big data frameworks or using larger and more expensive VMs. It runs workloads natively in Snowflake through transpilation to SQL, enabling it to take advantage of parallelization and the data governance and security benefits of Snowflake.

Use the Snowpark pandas API if you prioritize familiarity and ease of migration from pandas. Otherwise, use the Snowpark DataFrame API if you prefer a Spark-like workflow and are comfortable with PySpark conventions.

Benefits of using the Snowpark pandas API

  • Meeting Python developers where they are – This API offers a familiar interface to Python developers by providing a pandas-compatible layer that can run natively in Snowflake.

  • Scalable pandas – This API bridges the convenience of pandas with the scalability of mature data infrastructure. pandas can now run at Snowflake speed and scale by leveraging pre-existing query optimization techniques within Snowflake. No code rewrites or complex tuning are required, so you can move from prototype to production seamlessly.

  • Security and governance – Data does not leave Snowflake's secure platform. The Snowpark pandas API gives data organizations a uniform way to access their data and makes auditing and governance easier.

  • No additional compute infrastructure to manage and tune – This feature leverages the Snowflake engine, and you do not need to set up or manage any additional compute infrastructure.

How Snowpark pandas compares to Snowpark DataFrames

DataFrames in Snowpark and pandas are semantically different. Snowpark DataFrames are modeled after PySpark, which operates on the original data source, gets the most recently updated data, and does not maintain order for operations. Snowpark pandas DataFrames are modeled after pandas, which operates on a snapshot of the data, maintains order during operations, and allows order-based positional indexing.

The Snowpark pandas DataFrame API is intended to extend the Snowpark functionality and provide a familiar interface to pandas users to facilitate easy migration and adoption, and is not a replacement for Snowpark.

You can use the following operations to do conversions between Snowpark DataFrames and Snowpark pandas DataFrames:

  • to_snowpark_pandas – Converts an existing Snowpark DataFrame to a Snowpark pandas DataFrame. This operation assigns an implicit order to each row, and maintains this row order during the lifetime of the DataFrame.

  • to_snowpark – Converts a Snowpark pandas DataFrame to a Snowpark DataFrame. This operation does not maintain the row ordering, and the resulting Snowpark DataFrame operates on a data snapshot of the source Snowpark pandas DataFrame. This behavior means that changes to the underlying table will not be reflected during the evaluation of the Snowpark operations.
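
For example, a minimal round-trip between the two might look like the following sketch. The table name SAMPLE_TABLE and the connection setup are assumptions for illustration; the conversion calls are the to_snowpark_pandas and to_snowpark operations described above.

import modin.pandas as pd
import snowflake.snowpark.modin.plugin
from snowflake.snowpark import Session

# Assumes connection parameters are configured (for example, in ~/.snowflake/connections.toml).
session = Session.builder.create()

# Snowpark DataFrame -> Snowpark pandas DataFrame (assigns an implicit row order).
snowpark_df = session.table("SAMPLE_TABLE")            # SAMPLE_TABLE is a hypothetical table
snowpark_pandas_df = snowpark_df.to_snowpark_pandas()

# Snowpark pandas DataFrame -> Snowpark DataFrame (row order is not preserved).
snowpark_df_again = snowpark_pandas_df.to_snowpark()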

How Snowpark pandas differs from native pandas

Snowpark pandas and native pandas have similar DataFrame APIs with matching signatures and similar semantics. Snowpark pandas targets the API signature of native pandas (pandas 2.2.1) and provides scalable computation with Snowflake. It respects the semantics described in the native pandas documentation as much as possible, but it uses the Snowflake computation and type system, whereas native pandas executes on a client machine and uses the Python computation and type system. For information about the type mapping between Snowpark pandas and Snowflake, see Data types.

Like native pandas, Snowpark pandas also has the notion of an index and maintains row ordering. However, their distinct execution environments cause certain nuanced differences in their behavior. This section calls out the key differences to be aware of.

Execution environment

  • pandas: Operates on a single machine and processes data in memory.

  • modin.pandas + snowflake.snowpark.modin.plugin: Integrates with Snowflake, which allows for distributed computing across a cluster of machines. This integration enables handling of much larger datasets that exceed the memory capacity of a single machine. Note that using the Snowpark pandas API requires a connection to Snowflake.

Lazy vs. eager evaluation

  • pandas: Executes operations immediately and materializes results fully in memory after each operation. This eager evaluation of operations might lead to increased memory pressure as data needs to be moved extensively within a machine.

  • modin.pandas + snowflake.snowpark.modin.plugin: Provides the same API experience as pandas. It mimics the eager evaluation model of pandas, but internally builds a lazily-evaluated query graph to enable optimization across operations.

    Fusing and transpiling operations through a query graph enables additional optimization opportunities for the underlying distributed Snowflake compute engine, which decreases both cost and end-to-end pipeline runtime compared to running pandas directly within Snowflake.

    Note

    The following modin.pandas operations always evaluate eagerly:

    • read_snowflake

    • to_snowflake

    • Any APIs whose return value is not a Snowpark pandas object (i.e. DataFrame or Series). For example, df.index and series.index.

Data source and storage

  • pandas: Supports the various readers and writers listed in the pandas documentation in IO tools (text, CSV, HDF5, …).

  • modin.pandas + snowflake.snowpark.modin.plugin: Can read from and write to Snowflake tables, and can read local or staged CSV, JSON, or Parquet files.

Data types

  • pandas: Has a rich set of data types, such as integers, floats, strings, datetime types, and categorical types. It also supports user-defined data types. Data types in pandas are typically derived from the underlying data and are enforced strictly.

  • modin.pandas + snowflake.snowpark.modin.plugin: Maps pandas objects to SQL by translating the pandas data types to SQL types in Snowflake. A majority of pandas types have a natural equivalent in Snowflake, but the mapping is not always one to one. In some cases, multiple pandas types map to the same SQL type.

The following table lists the type mappings between pandas and Snowflake SQL:

pandas type                                                                  Snowflake type
All signed/unsigned integer types, including pandas extended integer types  NUMBER(38, 0)
All float types, including pandas extended float data types                 FLOAT
bool, BooleanDtype                                                           BOOLEAN
str, StringDtype                                                             STRING
datetime.time                                                                TIME
datetime.date                                                                DATE
All timezone-naive datetime types                                            TIMESTAMP_NTZ
All timezone-aware datetime types                                            TIMESTAMP_TZ
list, tuple, array                                                           ARRAY
dict, json                                                                   MAP
Object column with mixed data types                                          VARIANT

Note

Categorical, period, interval, sparse, and user-defined data types are not supported.

The following table provides the mapping of Snowflake SQL types back to Snowpark pandas types, as shown by df.dtypes:

Snowflake type                                          Snowpark pandas type (df.dtypes)
NUMBER (scale = 0)                                      int64
NUMBER (scale > 0), REAL                                float64
BOOLEAN                                                 bool
STRING, TEXT                                            object (str)
VARIANT, BINARY, GEOMETRY, GEOGRAPHY                    object
ARRAY                                                   object (list)
OBJECT                                                  object (dict)
TIME                                                    object (datetime.time)
TIMESTAMP, TIMESTAMP_NTZ, TIMESTAMP_LTZ, TIMESTAMP_TZ   datetime64[ns]
DATE                                                    object (datetime.date)

When you convert a Snowpark pandas DataFrame to a native pandas DataFrame with to_pandas(), the native pandas DataFrame has more refined data types than the Snowpark pandas types; these refined types are compatible with the SQL-Python Data Type Mappings for functions and procedures.
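
For example, a quick way to inspect the types on both sides might look like this sketch (the table TYPED_TABLE and its NUMBER(38, 0) column ID are assumptions for illustration):

import modin.pandas as pd
import snowflake.snowpark.modin.plugin

# Assumes an active Snowpark session and a table TYPED_TABLE with a NUMBER(38, 0) column named ID.
df = pd.read_snowflake("TYPED_TABLE")
print(df.dtypes)            # Snowpark pandas types per the table above (for example, ID -> int64)

native_df = df.to_pandas()  # convert to a native pandas DataFrame
print(native_df.dtypes)     # refined native pandas types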

Casting and type inference

  • pandas: Relies on NumPy and by default follows the NumPy and Python type system for implicit type casting and inference. For example, it treats booleans as integer types, so 1 + True returns 2.

  • modin.pandas + snowflake.snowpark.modin.plugin: Maps NumPy and Python types to Snowflake types according to the preceding table, and uses the underlying Snowflake type system for implicit type casting and inference. For example, in accordance with the Logical Data Types, it does not implicitly convert booleans to integer types, so 1 + True results in a type conversion error.
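
A minimal sketch of the difference, assuming an active Snowpark session:

import modin.pandas as pd
import snowflake.snowpark.modin.plugin

s = pd.Series([True, False, True])

# Native pandas would evaluate s + 1 by treating the booleans as integers.
# Snowpark pandas does not implicitly cast BOOLEAN to a numeric type,
# so cast explicitly before doing arithmetic:
result = s.astype("int64") + 1
print(result)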

Null value handling

  • pandas: Aims to be flexible when handling missing data, so it treats all of Python None, np.nan, pd.NaN, pd.NA, and pd.NaT as missing values.

  • modin.pandas + snowflake.snowpark.modin.plugin: Adopts a similar approach that treats all of the preceding values listed as missing values. These missing values are treated interchangeably and stored as SQL NULL in the Snowflake table.
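
For example, a small sketch (assuming an active Snowpark session) showing that None and np.nan are treated the same way:

import numpy as np
import modin.pandas as pd
import snowflake.snowpark.modin.plugin

# Both None and np.nan are treated as missing values and stored as SQL NULL.
df = pd.DataFrame({"A": [1.0, None, np.nan]})
print(df.isna())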

Offset/frequency aliases

  • pandas: Date offsets in pandas changed in version 2.2.1. The single-letter aliases 'M', 'Q', 'Y', and others have been deprecated in favor of two-letter offsets.

  • modin.pandas + snowflake.snowpark.modin.plugin: Exclusively uses the new offsets described in the pandas time series documentation.
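
For example, use the two-letter alias "ME" (month end) instead of the deprecated single-letter "M" (a small sketch, assuming an active Snowpark session):

import modin.pandas as pd
import snowflake.snowpark.modin.plugin

# "ME" (month end) is the new two-letter alias; the single-letter "M" is deprecated in pandas 2.2.
idx = pd.date_range(start="2024-01-31", periods=3, freq="ME")
print(idx)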

API coverage

modin.pandas + snowflake.snowpark.modin.plugin currently supports about 45 percent of the total native pandas API, and support for additional API operations is actively in development.

For a current list of supported operations, see Snowpark pandas Supported APIs in the Snowpark pandas API reference.

Getting started with Snowpark pandas

To get started with the Snowpark pandas API, see the instructions and examples in the sections that follow.

Limitations

Snowpark pandas has the following limitations:

  • Snowpark pandas provides no guarantee of compatibility with OSS third-party libraries. Starting with version 1.14.0a1, however, Snowpark pandas introduces limited compatibility for NumPy, specifically for np.where usage. For more information, see NumPy Interoperability.

    When calling third-party library APIs with a Snowpark pandas dataframe, Snowflake recommends that you convert the Snowpark pandas dataframe to a pandas dataframe by calling to_pandas() before passing the dataframe to the third-party library call.

  • Snowpark pandas is not integrated with Snowpark ML. When using Snowpark ML, we recommend that you convert the Snowpark pandas dataframe to a Snowpark dataframe using to_snowpark() before calling Snowpark ML.

  • Snowpark pandas does not integrate with stored procedures. For example, it is not yet possible to call the Snowpark pandas API from within a stored procedure.

  • The library is not available to use out of the box as part of Python worksheets or Snowsight notebooks.

  • Lazy Index objects are not supported. When dataframe.index is called, it returns a native pandas Index object, which requires pulling all data to the client side.

  • Not all pandas APIs have a distributed implementation yet in Snowpark pandas. For unsupported APIs, NotImplementedError is thrown. Operations that have no distributed implementation fall back to a stored procedure. For information about supported APIs, refer to the API reference documentation.

  • Snowpark pandas requires, and is only compatible with, pandas 2.2.1.

Prerequisites

Python 3.9, 3.10 or 3.11, modin version 0.28.1, and pandas version 2.2.1 are required.

Installing the Snowpark pandas API

To install the Snowpark pandas API:

  1. Download the wheel (.whl) file using your web browser. Refer to the preview feature onboarding email for the wheel file download location. Use the latest version of the wheel file unless you have a reason to use an older one.

  2. Change to your project directory and activate your Python virtual environment before installing the wheel.

    Note

    The API is under active development, so we recommend installing it in a Python virtual environment rather than system-wide. This practice allows each project you create to use a specific version, insulating you from changes in future versions.

    You can create a Python virtual environment for a particular Python version using tools like Anaconda, Miniconda, or virtualenv.

    For example, to use conda to create a Python 3.9 virtual environment, type:

    conda create --name snowpark_pandas python=3.9
    conda activate snowpark_pandas
    

    Note

    If you previously installed an older version of Snowpark pandas using Python 3.8 and pandas 1.5.3, you will need to upgrade your Python and pandas versions as described above. Follow the steps to create a new environment with Python 3.9, 3.10, or 3.11.

  3. Use pip to install the wheel file you downloaded, specifying the full path to the .whl file. For example:

    pip install "<path>/<snowpark_pandas_wheel_file>[modin]"
    

    Note

    The Snowpark pandas library is an alpha release version of the Snowpark Python library, so if the Snowpark Python library is already installed in your local environment, it will be replaced by the Snowpark pandas library.

Authenticating to Snowflake

Before using the Snowpark pandas API, you must establish a session with the Snowflake database. You can use a config file to choose the connection parameters for your session or you can enumerate them in your code. For more information, see Creating a Session for Snowpark Python. If a unique active Snowpark Python session exists, the Snowpark pandas API will automatically use it. For example:

import modin.pandas as pd
import snowflake.snowpark.modin.plugin
from snowflake.snowpark import Session

CONNECTION_PARAMETERS = {
    'account': '<myaccount>',
    'user': '<myuser>',
    'password': '<mypassword>',
    'role': '<myrole>',
    'database': '<mydatabase>',
    'schema': '<myschema>',
    'warehouse': '<mywarehouse>',
}
session = Session.builder.configs(CONNECTION_PARAMETERS).create()

# Snowpark pandas will automatically pick up the Snowpark session created above.
# It will use that session to create new dataframes.
df = pd.DataFrame([1, 2])
df2 = pd.read_snowflake('CUSTOMER')

# pd.session is the session that Snowpark pandas is using for new dataframes.
# In this case it is the same as the Snowpark session that we've created.
assert pd.session is session

# pd.session is a Snowpark session, so you can do anything with it that you
# can do with any other Snowpark session. For example, you can use it to execute an
# arbitrary SQL query, which results in a Snowpark DataFrame as per the Session API
# https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/latest/api/snowflake.snowpark.Session.sql
# but take note that the results of this will be a Snowpark DataFrame, not a Snowpark pandas DataFrame:
snowpark_df = pd.session.sql('select * from customer')
snowpark_df.show()

Alternatively, you can configure your Snowpark connection parameters in a configuration file. This eliminates the need to enumerate connection parameters in your code, allowing you to write your Snowpark pandas code almost as you would normally write pandas code. To achieve this, create a configuration file located at ~/.snowflake/connections.toml that looks something like this:

default_connection_name = "default"

[default]
account = "<myaccount>"
user = "<myuser>"
password = "<mypassword>"
role = "<myrole>"
database = "<mydatabase>"
schema = "<myschema>"
warehouse = "<mywarehouse>"

Then in the code, you only need to use snowflake.snowpark.Session.builder.create() to create a session using these credentials.

import modin.pandas as pd
import snowflake.snowpark.modin.plugin
from snowflake.snowpark import Session

# Session.builder.create() will create a default Snowflake connection.
Session.builder.create()
# create a DataFrame.
df = pd.DataFrame([[1, 2], [3, 4]])

You can also create multiple Snowpark sessions, then assign one of them to Snowpark pandas. Snowpark pandas only uses one session, so you have to explicitly assign one of the sessions to Snowpark pandas with pd.session = pandas_session.

import modin.pandas as pd
import snowflake.snowpark.modin.plugin
from snowflake.snowpark import Session

pandas_session = Session.builder.configs({"user": "<user>", "password": "<password>", "account": "<account1>"}).create()
other_session = Session.builder.configs({"user": "<user>", "password": "<password>", "account": "<account2>"}).create()
pd.session = pandas_session
df = pd.DataFrame([1, 2, 3])

Trying to use Snowpark pandas when there is no active Snowpark session raises a SnowparkSessionException with an error like “Snowpark pandas requires an active snowpark session, but there is none.” Once you create a session, you can use Snowpark pandas. For example, the following code raises that exception if no session has been created:

import modin.pandas as pd
import snowflake.snowpark.modin.plugin

df = pd.DataFrame([1, 2, 3])

The following example shows that trying to use Snowpark pandas when there are multiple active Snowpark sessions will cause a SnowparkSessionException with a message like, “There are multiple active snowpark sessions, but you need to choose one for Snowpark pandas.”

import modin.pandas as pd
import snowflake.snowpark.modin.plugin
from snowflake.snowpark import Session

pandas_session = Session.builder.configs({"user": "<user>", "password": "<password>", "account": "<account1>"}).create()
other_session = Session.builder.configs({"user": "<user>", "password": "<password>", "account": "<account2>"}).create()
df = pd.DataFrame([1, 2, 3])

Note

You must set the session used for a new Snowpark pandas DataFrame or Series via modin.pandas.session. Joining or merging DataFrames created with different sessions is not supported, so avoid repeatedly switching sessions and creating DataFrames with different sessions in a workflow.

API Reference

https://docs.snowflake.com/en/LIMITEDACCESS/snowpark-pandas-api/reference/index.html

Best practices

This section describes best practices to follow when using Snowpark pandas.

  • Avoid using iterative code patterns, such as for loops, iterrows, and iteritems. Iterative code patterns quickly increase the generated query complexity. Let Snowpark pandas perform the data distribution and computation parallelization rather than the client code. Look for operations that can be performed on the whole dataframe and use those instead, as in the example below.

import numpy as np
import modin.pandas as pd
import snowflake.snowpark.modin.plugin

# Inefficient: builds one single-row dataframe per iteration and concatenates them.
data = pd.DataFrame()
for i in np.arange(0, 50):
  if i % 2 == 0:
    data = pd.concat([data, pd.DataFrame({'A': i, 'B': i + 1}, index=[0])], ignore_index=True)
  else:
    data = pd.concat([data, pd.DataFrame({'A': i}, index=[0])], ignore_index=True)

# Instead of creating one dataframe per row and concatenating them,
# try to directly create the dataframe out of the data, like this:

data = pd.DataFrame(
      {
          "A": range(0, 50),
          "B": [i + 1 if i % 2 == 0 else None for i in range(50)],
      },
)
  • Avoid calling apply, applymap, and transform, which are ultimately implemented with UDFs or UDTFs and might not be as performant as regular SQL queries. If the applied function has an equivalent dataframe or series operation, use that operation instead. For example, instead of df.groupby('col1').apply('sum'), directly call df.groupby('col1').sum().
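
    For example, a sketch with made-up column names, assuming an active Snowpark session:

    import modin.pandas as pd
    import snowflake.snowpark.modin.plugin

    df = pd.DataFrame({"col1": ["a", "a", "b"], "col2": [1, 2, 3]})

    # Avoid: apply is routed through a UDF/UDTF-based implementation.
    # df.groupby("col1").apply("sum")

    # Prefer: the equivalent built-in aggregation translates to a plain SQL aggregation.
    print(df.groupby("col1").sum())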

  • Call to_pandas() before passing the dataframe or series to a third-party library call. Snowpark pandas does not provide a compatibility guarantee with third-party libraries.
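
    For example, a sketch that converts to native pandas before a third-party call (the table name SALES and the use of matplotlib are assumptions for illustration):

    import modin.pandas as pd
    import snowflake.snowpark.modin.plugin
    import matplotlib.pyplot as plt   # matplotlib is just an illustrative third-party library

    # Assumes an active Snowpark session and a hypothetical table named SALES.
    df = pd.read_snowflake("SALES")

    # Convert to a native pandas dataframe before handing it to the third-party call.
    native_df = df.to_pandas()
    native_df.plot(kind="bar")
    plt.show()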

  • Use a materialized regular Snowflake table to avoid extra I/O overhead. Snowpark pandas works on top of a data snapshot, which is only supported for regular tables. For other types, including external tables, views, and Iceberg tables, a temporary table is created before taking the snapshot, which introduces extra materialization overhead.

  • Double check the result type before proceeding to other operations, and do explicit type casting with astype if needed.

    Due to limited type inference capability, df.apply will return results of object (variant) type even if the result contains all integer values. If other operations require the dtype to be int, you can do an explicit type casting by calling the astype method to correct the column type before you continue.
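
    A sketch of the explicit cast, with a made-up column name and assuming an active Snowpark session:

    import modin.pandas as pd
    import snowflake.snowpark.modin.plugin

    df = pd.DataFrame({"COL_INT": [1, 2, 3]})

    # apply returns an object (variant) column even though every value is an integer...
    result = df["COL_INT"].apply(lambda x: x + 1)

    # ...so cast explicitly before operations that need an integer dtype.
    result = result.astype("int64")
    print(result.dtype)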

  • Avoid calling APIs that require evaluation and materialization if not necessary.

    APIs that don’t return a Series or DataFrame require eager evaluation and materialization to produce a result of the correct type. Reduce calls to those APIs to minimize unnecessary evaluations and materializations.
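
    For example, the following sketch (the table name SALES is an assumption) computes such values once instead of repeatedly:

    import modin.pandas as pd
    import snowflake.snowpark.modin.plugin

    # Assumes an active Snowpark session and a hypothetical table named SALES.
    df = pd.read_snowflake("SALES")

    # These calls do not return a Series or DataFrame, so each one triggers
    # evaluation and materialization. Call them once and reuse the results
    # rather than recomputing them repeatedly (for example, inside a loop).
    num_rows, num_cols = df.shape
    index_values = df.index   # returns a native pandas Index, pulling data to the client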

  • When working with iteratively built queries, use df.to_snowflake and pd.read_snowflake to materialize intermediate results in order to improve the latency of the overall query. For example:

    df = pd.read_snowflake('pandas_test')
    df2 = pd.pivot_table(df, index='index_col', columns='pivot_col') # expensive operation
    df3 = df.merge(df2)
    df4 = df3.where(df2 == True)
    

    In the above example, the query to produce df2 is expensive to compute, and is reused in the creation of both df3 and df4. Materializing df2 into a temporary table (making subsequent operations involving df2 a table scan instead of a pivot) can reduce the overall latency of the code block:

    df = pd.read_snowflake('pandas_test')
    df2 = pd.pivot_table(df, index='index_col', columns='pivot_col') # expensive operation
    df2.to_snowflake('TEMP_TABLE_SNOWPARK_PANDAS', index_label='_index', table_type="temp")
    df2 = pd.read_snowflake("TEMP_TABLE_SNOWPARK_PANDAS", index_col='_index')
    df3 = df.merge(df2)
    df4 = df3.where(df2 == True)
    

    Make sure to give the index a unique label in the call to to_snowflake so it can be used to specify the index column in read_snowflake. This both preserves the index in case it is not the default RangeIndex and preserves the order of df2.

Examples

Here is a code example with pandas operations. We start with a Snowflake table named pandas_test, which contains three columns: COL_STR, COL_FLOAT, and COL_INT.

COL_STR  COL_FLOAT  COL_INT
a        2.0        1
b        4.0        2
c        6.0        NULL

Next, we create a dataframe from the Snowflake table, drop the rows where COL_INT is missing, and save the result back to Snowflake with a column named row_pos.

import modin.pandas as pd
import snowflake.snowpark.modin.plugin

# Create a dataframe out of a Snowflake table.
df = pd.read_snowflake('pandas_test')

df
COL_STR  COL_FLOAT  COL_INT
0       a        2.0      1.0
1       b        4.0      2.0
2       c        6.0      NaN

df.shape
(3, 3)

df.head(2)
COL_STR  COL_FLOAT  COL_INT
0       a        2.0        1
1       b        4.0        2

df.dropna(subset=["COL_INT"], inplace=True)

df
COL_STR  COL_FLOAT  COL_INT
0       a        2.0        1
1       b        4.0        2

df.shape
(2, 3)

df.head(2)
COL_STR  COL_FLOAT  COL_INT
0       a        2.0        1
1       b        4.0        2

# Save the result back to Snowflake with a row_pos column.
df.reset_index(drop=True).to_snowflake('pandas_test2', index=True, index_label=['row_pos'])

You end up with a new table, pandas_test2, which looks like this:

row_pos  COL_STR  COL_FLOAT  COL_INT
0          a          2.0          1
1          b          4.0          2

Here are some more examples.

df = pd.DataFrame({"a": [1,2,3], "b": ["x", "y", "z"]})

df
a  b
0  1  x
1  2  y
2  3  z

df.ndim
2

df.shape
(3, 2)

df.head(1)
a  b
0  1  x

df.tail(1)
a  b
2  3  z

Indexing

df.columns
Index(['a', 'b'], dtype='object')

df.index
Index([0, 1, 2], dtype='int8')

df["a"]
0    1
1    2
2    3
Name: a, dtype: int8

df["b"]
0    x
1    y
2    z
Name: b, dtype: object

df.iloc[0,1]
'x'

df.loc[df["a"] > 2]
a  b
2  3  z

df.columns = ["c", "d"]
df
c  d
0  1  x
1  2  y
2  3  z

df = df.set_index("c")
df
   d
c
1  x
2  y
3  z

renamed = df.rename(columns={"d": "renamed"})
renamed
  renamed
c
1       x
2       y
3       z

MultiIndex

import numpy as np

arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]

tuples = list(zip(*arrays))

index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

index
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
        names=['first', 'second'])

pd.Series(np.random.randn(8), index=index)
first  second
bar    one      -0.915158
       two      -0.063246
baz    one       0.454371
       two      -0.411334
foo    one      -1.345568
       two      -0.862540
qux    one       1.757797
       two      -1.525293
dtype: float64

IO

test_table_name = "test_table"
df.to_snowflake(test_table_name, if_exists="replace", index=False)

df_read = pd.read_snowflake(test_table_name)
df_read
c  d
0  1  x
1  2  y
2  3  z

Missing values

df = pd.DataFrame([[np.nan, 2, np.nan, 0],
                [3, 4, np.nan, 1],
                [np.nan, np.nan, np.nan, np.nan],
                [np.nan, 3, np.nan, 4]],
                columns=list("ABCD"))
df
    A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0

df.isna()
    A      B     C      D
0   True  False  True  False
1  False  False  True  False
2   True   True  True   True
3   True  False  True  False

df.fillna(0)
    A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  0.0
3  0.0  3.0  0.0  4.0

df.dropna(how="all")
    A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
3  NaN  3.0 NaN  4.0

Type conversion

df = pd.DataFrame({"int": [1,2,3], "str": ["4", "5", "6"]})
df
int str
0    1   4
1    2   5
2    3   6

df_float = df.astype(float)
df_float
int  str
0  1.0  4.0
1  2.0  5.0
2  3.0  6.0

df_float.dtypes
int    float64
str    float64
dtype: object

pd.to_numeric(df.str)
0    4.0
1    5.0
2    6.0
Name: str, dtype: float64

df = pd.DataFrame({'year': [2015, 2016],
                'month': [2, 3],
                'day': [4, 5]})
pd.to_datetime(df)
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]

Binary operations

df_1 = pd.DataFrame([[1,2,3],[4,5,6]])
df_2 = pd.DataFrame([[6,7,8]])
df_1.add(df_2)
    0    1     2
0  7.0  9.0  11.0
1  NaN  NaN   NaN

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([2, 2, 2])
s1 + s2
0    3
1    4
2    5
dtype: int8


df = pd.DataFrame({"A": [1,2,3], "B": [4,5,6]})
df["A+B"] = df["A"] + df["B"]
df
  A  B  A+B
0  1  4    5
1  2  5    7
2  3  6    9

Aggregation

df = pd.DataFrame([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9],
                [np.nan, np.nan, np.nan]],
                columns=['A', 'B', 'C'])
df.agg(['sum', 'min'])
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0

Merge

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df1
lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5

df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})
df2
rkey  value
0  foo      5
1  bar      6
2  baz      7
3  foo      8

df1.merge(df2, left_on='lkey', right_on='rkey')
lkey  value_x rkey  value_y
0  foo        1  foo        5
1  foo        1  foo        8
2  bar        2  bar        6
3  baz        3  baz        7
4  foo        5  foo        5
5  foo        5  foo        8

df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
                'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
df
key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5

other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                    'B': ['B0', 'B1', 'B2']})
df.join(other, lsuffix='_caller', rsuffix='_other')
key_caller   A key_other     B
0         K0  A0        K0    B0
1         K1  A1        K1    B1
2         K2  A2        K2    B2
3         K3  A3      None  None
4         K4  A4      None  None
5         K5  A5      None  None

Groupby

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
                            'Parrot', 'Parrot'],
                'Max Speed': [380., 370., 24., 26.]})
df
Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0

df.groupby(['Animal']).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0

Pivot

df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
                        "bar", "bar", "bar", "bar"],
                "B": ["one", "one", "one", "two", "two",
                        "one", "one", "two", "two"],
                "C": ["small", "large", "large", "small",
                        "small", "large", "small", "small",
                        "large"],
                "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
                "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
df
    A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9

pd.pivot_table(df, values='D', index=['A', 'B'],
                    columns=['C'], aggfunc="sum")
C        large  small
A   B
bar one    4.0      5
    two    7.0      6
foo one    4.0      1
    two    NaN      6

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
                        'two'],
                'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                'baz': [1, 2, 3, 4, 5, 6],
                'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
df
foo bar  baz zoo
0  one   A    1   x
1  one   B    2   y
2  one   C    3   z
3  two   A    4   q
4  two   B    5   w
5  two   C    6   t