snowflake.snowpark.DataFrame.join¶
- DataFrame.join(right: DataFrame, on: Optional[Union[Column, str, Iterable[str]]] = None, how: Optional[str] = None, *, lsuffix: str = '', rsuffix: str = '', match_condition: Optional[Column] = None, **kwargs) DataFrame[source]¶
- Performs a join of the specified type ( - how) with the current DataFrame and another DataFrame (- right) on a list of columns (- on).- Parameters:
- right – The other - DataFrameto join.
- on – A column name or a - Columnobject or a list of them to be used for the join. When a list of column names are specified, this method assumes the named columns are present in both dataframes. You can use keyword- using_columnsto specify this condition. Note that to avoid breaking changes, when using_columns` is specified, it overrides- on.
- how – - We support the following join types: - Inner join: “inner” (the default value) 
- Left outer join: “left”, “leftouter” 
- Right outer join: “right”, “rightouter” 
- Full outer join: “full”, “outer”, “fullouter” 
- Left semi join: “semi”, “leftsemi” 
- Left anti join: “anti”, “leftanti” 
- Cross join: “cross” 
- Asof join: “asof” 
 - You can also use - join_typekeyword to specify this condition. Note that to avoid breaking changes, currently when- join_typeis specified, it overrides- how.
- lsuffix – Suffix to add to the overlapping columns of the left DataFrame. 
- rsuffix – Suffix to add to the overlapping columns of the right DataFrame. 
- match_condition – The match condition for asof join. 
 
 - Note - When both - lsuffixand- rsuffixare empty, the overlapping columns will have random column names in the resulting DataFrame. You can reference to these randomly named columns using- Column.alias()(See the first usage in Examples).- See also - Usage notes for asof join: https://docs.snowflake.com/sql-reference/constructs/asof-join#usage-notes 
 - Examples::
- >>> from snowflake.snowpark.functions import col >>> df1 = session.create_dataframe([[1, 2], [3, 4], [5, 6]], schema=["a", "b"]) >>> df2 = session.create_dataframe([[1, 7], [3, 8]], schema=["a", "c"]) >>> df1.join(df2, df1.a == df2.a).select(df1.a.alias("a_1"), df2.a.alias("a_2"), df1.b, df2.c).show() ----------------------------- |"A_1" |"A_2" |"B" |"C" | ----------------------------- |1 |1 |2 |7 | |3 |3 |4 |8 | ----------------------------- >>> # refer a single column "a" >>> df1.join(df2, "a").select(df1.a.alias("a"), df1.b, df2.c).show() ------------------- |"A" |"B" |"C" | ------------------- |1 |2 |7 | |3 |4 |8 | ------------------- >>> # rename the ambiguous columns >>> df3 = df1.to_df("df1_a", "b") >>> df4 = df2.to_df("df2_a", "c") >>> df3.join(df4, col("df1_a") == col("df2_a")).select(col("df1_a").alias("a"), "b", "c").show() ------------------- |"A" |"B" |"C" | ------------------- |1 |2 |7 | |3 |4 |8 | ------------------- - >>> # join multiple columns >>> mdf1 = session.create_dataframe([[1, 2], [3, 4], [5, 6]], schema=["a", "b"]) >>> mdf2 = session.create_dataframe([[1, 2], [3, 4], [7, 6]], schema=["a", "b"]) >>> mdf1.join(mdf2, ["a", "b"]).show() ------------- |"A" |"B" | ------------- |1 |2 | |3 |4 | ------------- >>> mdf1.join(mdf2, (mdf1["a"] < mdf2["a"]) & (mdf1["b"] == mdf2["b"])).select(mdf1["a"].as_("new_a"), mdf1["b"].as_("new_b")).show() --------------------- |"NEW_A" |"NEW_B" | --------------------- |5 |6 | --------------------- >>> # use lsuffix and rsuffix to resolve duplicating column names >>> mdf1.join(mdf2, (mdf1["a"] < mdf2["a"]) & (mdf1["b"] == mdf2["b"]), lsuffix="_left", rsuffix="_right").show() ----------------------------------------------- |"A_LEFT" |"B_LEFT" |"A_RIGHT" |"B_RIGHT" | ----------------------------------------------- |5 |6 |7 |6 | ----------------------------------------------- >>> mdf1.join(mdf2, (mdf1["a"] < mdf2["a"]) & (mdf1["b"] == mdf2["b"]), rsuffix="_right").show() ------------------------------------- |"A" |"B" |"A_RIGHT" |"B_RIGHT" | ------------------------------------- |5 |6 |7 |6 | ------------------------------------- >>> # examples of different joins >>> df5 = session.create_dataframe([3, 4, 5, 5, 6, 7], schema=["id"]) >>> df6 = session.create_dataframe([5, 6, 7, 7, 8, 9], schema=["id"]) >>> # inner join >>> df5.join(df6, "id", "inner").sort("id").show() -------- |"ID" | -------- |5 | |5 | |6 | |7 | |7 | -------- >>> # left/leftouter join >>> df5.join(df6, "id", "left").sort("id").show() -------- |"ID" | -------- |3 | |4 | |5 | |5 | |6 | |7 | |7 | -------- >>> # right/rightouter join >>> df5.join(df6, "id", "right").sort("id").show() -------- |"ID" | -------- |5 | |5 | |6 | |7 | |7 | |8 | |9 | -------- >>> # full/outer/fullouter join >>> df5.join(df6, "id", "full").sort("id").show() -------- |"ID" | -------- |3 | |4 | |5 | |5 | |6 | |7 | |7 | |8 | |9 | -------- >>> # semi/leftsemi join >>> df5.join(df6, "id", "semi").sort("id").show() -------- |"ID" | -------- |5 | |5 | |6 | |7 | -------- >>> # anti/leftanti join >>> df5.join(df6, "id", "anti").sort("id").show() -------- |"ID" | -------- |3 | |4 | -------- 
 - Note - When performing chained operations, this method will not work if there are ambiguous column names. For example, - >>> df1.filter(df1.a == 1).join(df2, df1.a == df2.a).select(df1.a.alias("a"), df1.b, df2.c) - will not work because - df1.filter(df1.a == 1)has produced a new dataframe and you cannot refer to- df1.aanymore. Instead, you can do either- >>> df1.join(df2, (df1.a == 1) & (df1.a == df2.a)).select(df1.a.alias("a"), df1.b, df2.c).show() ------------------- |"A" |"B" |"C" | ------------------- |1 |2 |7 | ------------------- - or - >>> df3 = df1.filter(df1.a == 1) >>> df3.join(df2, df3.a == df2.a).select(df3.a.alias("a"), df3.b, df2.c).show() ------------------- |"A" |"B" |"C" | ------------------- |1 |2 |7 | ------------------- - Examples::
- >>> # asof join examples >>> df1 = session.create_dataframe([['A', 1, 15, 3.21], ... ['A', 2, 16, 3.22], ... ['B', 1, 17, 3.23], ... ['B', 2, 18, 4.23]], ... schema=["c1", "c2", "c3", "c4"]) >>> df2 = session.create_dataframe([['A', 1, 14, 3.19], ... ['B', 2, 16, 3.04]], ... schema=["c1", "c2", "c3", "c4"]) >>> df1.join(df2, on=["c1", "c2"], how="asof", match_condition=(df1.c3 >= df2.c3)) \ ... .select(df1.c1, df1.c2, df1.c3.alias("C3_1"), df1.c4.alias("C4_1"), df2.c3.alias("C3_2"), df2.c4.alias("C4_2")) \ ... .order_by("c1", "c2").show() --------------------------------------------------- |"C1" |"C2" |"C3_1" |"C4_1" |"C3_2" |"C4_2" | --------------------------------------------------- |A |1 |15 |3.21 |14 |3.19 | |A |2 |16 |3.22 |NULL |NULL | |B |1 |17 |3.23 |NULL |NULL | |B |2 |18 |4.23 |16 |3.04 | --------------------------------------------------- >>> df1.join(df2, on=(df1.c1 == df2.c1) & (df1.c2 == df2.c2), how="asof", ... match_condition=(df1.c3 >= df2.c3), lsuffix="_L", rsuffix="_R") \ ... .order_by("C1_L", "C2_L").show() ------------------------------------------------------------------------- |"C1_L" |"C2_L" |"C3_L" |"C4_L" |"C1_R" |"C2_R" |"C3_R" |"C4_R" | ------------------------------------------------------------------------- |A |1 |15 |3.21 |A |1 |14 |3.19 | |A |2 |16 |3.22 |NULL |NULL |NULL |NULL | |B |1 |17 |3.23 |NULL |NULL |NULL |NULL | |B |2 |18 |4.23 |B |2 |16 |3.04 | ------------------------------------------------------------------------- >>> df1 = df1.alias("L") >>> df2 = df2.alias("R") >>> df1.join(df2, using_columns=["c1", "c2"], how="asof", ... match_condition=(df1.c3 >= df2.c3)).order_by("C1", "C2").show() ----------------------------------------------- |"C1" |"C2" |"C3L" |"C4L" |"C3R" |"C4R" | ----------------------------------------------- |A |1 |15 |3.21 |14 |3.19 | |A |2 |16 |3.22 |NULL |NULL | |B |1 |17 |3.23 |NULL |NULL | |B |2 |18 |4.23 |16 |3.04 | -----------------------------------------------