You are viewing documentation about an older version (1.2.0). View latest version

snowflake.snowpark.DataFrame.join

DataFrame.join(right: DataFrame, on: ColumnOrName | Iterable[ColumnOrName] | None = None, how: str | None = None, *, lsuffix: str = '', rsuffix: str = '', **kwargs) DataFrame[source]

Performs a join of the specified type (how) with the current DataFrame and another DataFrame (right) on a list of columns (on).

Parameters:
  • right – The other DataFrame to join.

  • on – A column name or a Column object or a list of them to be used for the join. When a list of column names are specified, this method assumes the named columns are present in both dataframes. You can use keyword using_columns to specify this condition. Note that to avoid breaking changes, when using_columns` is specified, it overrides on.

  • how

    We support the following join types:

    • Inner join: “inner” (the default value)

    • Left outer join: “left”, “leftouter”

    • Right outer join: “right”, “rightouter”

    • Full outer join: “full”, “outer”, “fullouter”

    • Left semi join: “semi”, “leftsemi”

    • Left anti join: “anti”, “leftanti”

    • Cross join: “cross”

    You can also use join_type keyword to specify this condition. Note that to avoid breaking changes, currently when join_type is specified, it overrides how.

  • lsuffix – Suffix to add to the overlapping columns of the left DataFrame.

  • rsuffix – Suffix to add to the overlapping columns of the right DataFrame.

Note

When both lsuffix and rsuffix are empty, the overlapping columns will have random column names in the resulting DataFrame. You can reference to these randomly named columns using Column.alias() (See the first usage in Examples).

Examples::
>>> from snowflake.snowpark.functions import col
>>> df1 = session.create_dataframe([[1, 2], [3, 4], [5, 6]], schema=["a", "b"])
>>> df2 = session.create_dataframe([[1, 7], [3, 8]], schema=["a", "c"])
>>> df1.join(df2, df1.a == df2.a).select(df1.a.alias("a_1"), df2.a.alias("a_2"), df1.b, df2.c).show()
-----------------------------
|"A_1"  |"A_2"  |"B"  |"C"  |
-----------------------------
|1      |1      |2    |7    |
|3      |3      |4    |8    |
-----------------------------

>>> # refer a single column "a"
>>> df1.join(df2, "a").select(df1.a.alias("a"), df1.b, df2.c).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |2    |7    |
|3    |4    |8    |
-------------------

>>> # rename the ambiguous columns
>>> df3 = df1.to_df("df1_a", "b")
>>> df4 = df2.to_df("df2_a", "c")
>>> df3.join(df4, col("df1_a") == col("df2_a")).select(col("df1_a").alias("a"), "b", "c").show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |2    |7    |
|3    |4    |8    |
-------------------
Copy
>>> # join multiple columns
>>> mdf1 = session.create_dataframe([[1, 2], [3, 4], [5, 6]], schema=["a", "b"])
>>> mdf2 = session.create_dataframe([[1, 2], [3, 4], [7, 6]], schema=["a", "b"])
>>> mdf1.join(mdf2, ["a", "b"]).show()
-------------
|"A"  |"B"  |
-------------
|1    |2    |
|3    |4    |
-------------

>>> mdf1.join(mdf2, (mdf1["a"] < mdf2["a"]) & (mdf1["b"] == mdf2["b"])).select(mdf1["a"].as_("new_a"), mdf1["b"].as_("new_b")).show()
---------------------
|"NEW_A"  |"NEW_B"  |
---------------------
|5        |6        |
---------------------

>>> # use lsuffix and rsuffix to resolve duplicating column names
>>> mdf1.join(mdf2, (mdf1["a"] < mdf2["a"]) & (mdf1["b"] == mdf2["b"]), lsuffix="_left", rsuffix="_right").show()
-----------------------------------------------
|"A_LEFT"  |"B_LEFT"  |"A_RIGHT"  |"B_RIGHT"  |
-----------------------------------------------
|5         |6         |7          |6          |
-----------------------------------------------

>>> mdf1.join(mdf2, (mdf1["a"] < mdf2["a"]) & (mdf1["b"] == mdf2["b"]), rsuffix="_right").show()
-------------------------------------
|"A"  |"B"  |"A_RIGHT"  |"B_RIGHT"  |
-------------------------------------
|5    |6    |7          |6          |
-------------------------------------
Copy

Note

When performing chained operations, this method will not work if there are ambiguous column names. For example,

>>> df1.filter(df1.a == 1).join(df2, df1.a == df2.a).select(df1.a.alias("a"), df1.b, df2.c) 
Copy

will not work because df1.filter(df1.a == 1) has produced a new dataframe and you cannot refer to df1.a anymore. Instead, you can do either

>>> df1.join(df2, (df1.a == 1) & (df1.a == df2.a)).select(df1.a.alias("a"), df1.b, df2.c).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |2    |7    |
-------------------
Copy

or

>>> df3 = df1.filter(df1.a == 1)
>>> df3.join(df2, df3.a == df2.a).select(df3.a.alias("a"), df3.b, df2.c).show()
-------------------
|"A"  |"B"  |"C"  |
-------------------
|1    |2    |7    |
-------------------
Copy