snowflake.snowpark.DataFrame.join¶

DataFrame.join(right: DataFrame, on: Optional[Union[Column, str, Iterable[str]]] = None, how: Optional[str] = None, *, lsuffix: str = '', rsuffix: str = '', match_condition: Optional[Column] = None, **kwargs) → DataFrame[source]¶

Performs a join of the specified type (how) with the current DataFrame and another DataFrame (right) on a list of columns (on).

Parameters:

right – The other DataFrame to join.
on – A column name or a Column object or a list of them to be used for the join. When a list of column names are specified, this method assumes the named columns are present in both dataframes. You can use keyword using_columns to specify this condition. Note that to avoid breaking changes, when using_columns` is specified, it overrides on.
how –
We support the following join types:
- Inner join: “inner” (the default value)
- Left outer join: “left”, “leftouter”
- Right outer join: “right”, “rightouter”
- Full outer join: “full”, “outer”, “fullouter”
- Left semi join: “semi”, “leftsemi”
- Left anti join: “anti”, “leftanti”
- Cross join: “cross”
- [Preview Feature] Asof join: “asof”
You can also use join_type keyword to specify this condition. Note that to avoid breaking changes, currently when join_type is specified, it overrides how.
lsuffix – Suffix to add to the overlapping columns of the left DataFrame.
rsuffix – Suffix to add to the overlapping columns of the right DataFrame.
match_condition – The match condition for asof join.

Note

When both lsuffix and rsuffix are empty, the overlapping columns will have random column names in the resulting DataFrame. You can reference to these randomly named columns using Column.alias() (See the first usage in Examples).

See also

Usage notes for asof join: https://docs.snowflake.com/LIMITEDACCESS/asof-join#usage-notes

Examples::

>>> from snowflake.snowpark.functions import col
>>> df1 = session.create_dataframe([[1, 2], [3, 4], [5, 6]], schema=["a", "b"])
>>> df2 = session.create_dataframe([[1, 7], [3, 8]], schema=["a", "c"])
>>> df1.join(df2, df1.a == df2.a).select(df1.a.alias("a_1"), df2.a.alias("a_2"), df1.b, df2.c).show()
-----------------------------
|"A_1"  |"A_2"  |"B"  |"C"  |
-----------------------------
|1      |1      |2    |7    |
|3      |3      |4    |8    |
-----------------------------

>>> # refer a single column "a"
>>> df1.join(df2, "a").select(df1.a.alias("a"), df1.b, df2.c).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |2    |7    |
|3    |4    |8    |
-------------------

>>> # rename the ambiguous columns
>>> df3 = df1.to_df("df1_a", "b")
>>> df4 = df2.to_df("df2_a", "c")
>>> df3.join(df4, col("df1_a") == col("df2_a")).select(col("df1_a").alias("a"), "b", "c").show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |2    |7    |
|3    |4    |8    |
-------------------

Copy

>>> # join multiple columns
>>> mdf1 = session.create_dataframe([[1, 2], [3, 4], [5, 6]], schema=["a", "b"])
>>> mdf2 = session.create_dataframe([[1, 2], [3, 4], [7, 6]], schema=["a", "b"])
>>> mdf1.join(mdf2, ["a", "b"]).show()
-------------
|"A"  |"B"  |
-------------
|1    |2    |
|3    |4    |
-------------

>>> mdf1.join(mdf2, (mdf1["a"] < mdf2["a"]) & (mdf1["b"] == mdf2["b"])).select(mdf1["a"].as_("new_a"), mdf1["b"].as_("new_b")).show()
---------------------
|"NEW_A"  |"NEW_B"  |
---------------------
|5        |6        |
---------------------

>>> # use lsuffix and rsuffix to resolve duplicating column names
>>> mdf1.join(mdf2, (mdf1["a"] < mdf2["a"]) & (mdf1["b"] == mdf2["b"]), lsuffix="_left", rsuffix="_right").show()
-----------------------------------------------
|"A_LEFT"  |"B_LEFT"  |"A_RIGHT"  |"B_RIGHT"  |
-----------------------------------------------
|5         |6         |7          |6          |
-----------------------------------------------

>>> mdf1.join(mdf2, (mdf1["a"] < mdf2["a"]) & (mdf1["b"] == mdf2["b"]), rsuffix="_right").show()
-------------------------------------
|"A"  |"B"  |"A_RIGHT"  |"B_RIGHT"  |
-------------------------------------
|5    |6    |7          |6          |
-------------------------------------

>>> # examples of different joins
>>> df5 = session.create_dataframe([3, 4, 5, 5, 6, 7], schema=["id"])
>>> df6 = session.create_dataframe([5, 6, 7, 7, 8, 9], schema=["id"])
>>> # inner join
>>> df5.join(df6, "id", "inner").sort("id").show()
--------
|"ID"  |
--------
|5     |
|5     |
|6     |
|7     |
|7     |
--------

>>> # left/leftouter join
>>> df5.join(df6, "id", "left").sort("id").show()
--------
|"ID"  |
--------
|3     |
|4     |
|5     |
|5     |
|6     |
|7     |
|7     |
--------

>>> # right/rightouter join
>>> df5.join(df6, "id", "right").sort("id").show()
--------
|"ID"  |
--------
|5     |
|5     |
|6     |
|7     |
|7     |
|8     |
|9     |
--------

>>> # full/outer/fullouter join
>>> df5.join(df6, "id", "full").sort("id").show()
--------
|"ID"  |
--------
|3     |
|4     |
|5     |
|5     |
|6     |
|7     |
|7     |
|8     |
|9     |
--------

>>> # semi/leftsemi join
>>> df5.join(df6, "id", "semi").sort("id").show()
--------
|"ID"  |
--------
|5     |
|5     |
|6     |
|7     |
--------

>>> # anti/leftanti join
>>> df5.join(df6, "id", "anti").sort("id").show()
--------
|"ID"  |
--------
|3     |
|4     |
--------

Copy

Note

When performing chained operations, this method will not work if there are ambiguous column names. For example,

>>> df1.filter(df1.a == 1).join(df2, df1.a == df2.a).select(df1.a.alias("a"), df1.b, df2.c) 

Copy

will not work because df1.filter(df1.a == 1) has produced a new dataframe and you cannot refer to df1.a anymore. Instead, you can do either

>>> df1.join(df2, (df1.a == 1) & (df1.a == df2.a)).select(df1.a.alias("a"), df1.b, df2.c).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |2    |7    |
-------------------

Copy

>>> df3 = df1.filter(df1.a == 1)
>>> df3.join(df2, df3.a == df2.a).select(df3.a.alias("a"), df3.b, df2.c).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |2    |7    |
-------------------

Copy

Examples::

>>> # asof join examples
>>> df1 = session.create_dataframe([['A', 1, 15, 3.21],
...                                 ['A', 2, 16, 3.22],
...                                 ['B', 1, 17, 3.23],
...                                 ['B', 2, 18, 4.23]],
...                                schema=["c1", "c2", "c3", "c4"])
>>> df2 = session.create_dataframe([['A', 1, 14, 3.19],
...                                 ['B', 2, 16, 3.04]],
...                                schema=["c1", "c2", "c3", "c4"])
>>> df1.join(df2, on=["c1", "c2"], how="asof", match_condition=(df1.c3 >= df2.c3)) \
...     .select(df1.c1, df1.c2, df1.c3.alias("C3_1"), df1.c4.alias("C4_1"), df2.c3.alias("C3_2"), df2.c4.alias("C4_2")) \
...     .order_by("c1", "c2").show()
---------------------------------------------------
|"C1"  |"C2"  |"C3_1"  |"C4_1"  |"C3_2"  |"C4_2"  |
---------------------------------------------------
|A     |1     |15      |3.21    |14      |3.19    |
|A     |2     |16      |3.22    |NULL    |NULL    |
|B     |1     |17      |3.23    |NULL    |NULL    |
|B     |2     |18      |4.23    |16      |3.04    |
---------------------------------------------------

>>> df1.join(df2, on=(df1.c1 == df2.c1) & (df1.c2 == df2.c2), how="asof",
...     match_condition=(df1.c3 >= df2.c3), lsuffix="_L", rsuffix="_R") \
...     .order_by("C1_L", "C2_L").show()
-------------------------------------------------------------------------
|"C1_L"  |"C2_L"  |"C3_L"  |"C4_L"  |"C1_R"  |"C2_R"  |"C3_R"  |"C4_R"  |
-------------------------------------------------------------------------
|A       |1       |15      |3.21    |A       |1       |14      |3.19    |
|A       |2       |16      |3.22    |NULL    |NULL    |NULL    |NULL    |
|B       |1       |17      |3.23    |NULL    |NULL    |NULL    |NULL    |
|B       |2       |18      |4.23    |B       |2       |16      |3.04    |
-------------------------------------------------------------------------

>>> df1 = df1.alias("L")
>>> df2 = df2.alias("R")
>>> df1.join(df2, using_columns=["c1", "c2"], how="asof",
...         match_condition=(df1.c3 >= df2.c3)).order_by("C1", "C2").show()
-----------------------------------------------
|"C1"  |"C2"  |"C3L"  |"C4L"  |"C3R"  |"C4R"  |
-----------------------------------------------
|A     |1     |15     |3.21   |14     |3.19   |
|A     |2     |16     |3.22   |NULL   |NULL   |
|B     |1     |17     |3.23   |NULL   |NULL   |
|B     |2     |18     |4.23   |16     |3.04   |
-----------------------------------------------

Copy