snowflake.snowpark.DataFrame.join¶
- DataFrame.join(right: DataFrame, on: Optional[Union[Column, str, Iterable[str]]] = None, how: Optional[str] = None, *, lsuffix: str = '', rsuffix: str = '', match_condition: Optional[Column] = None, **kwargs) DataFrame [source]¶
Performs a join of the specified type (
how
) with the current DataFrame and another DataFrame (right
) on a list of columns (on
).- Parameters:
right – The other
DataFrame
to join.on – A column name or a
Column
object or a list of them to be used for the join. When a list of column names are specified, this method assumes the named columns are present in both dataframes. You can use keywordusing_columns
to specify this condition. Note that to avoid breaking changes, when using_columns` is specified, it overrideson
.how –
We support the following join types:
Inner join: “inner” (the default value)
Left outer join: “left”, “leftouter”
Right outer join: “right”, “rightouter”
Full outer join: “full”, “outer”, “fullouter”
Left semi join: “semi”, “leftsemi”
Left anti join: “anti”, “leftanti”
Cross join: “cross”
[Preview Feature] Asof join: “asof”
You can also use
join_type
keyword to specify this condition. Note that to avoid breaking changes, currently whenjoin_type
is specified, it overrideshow
.lsuffix – Suffix to add to the overlapping columns of the left DataFrame.
rsuffix – Suffix to add to the overlapping columns of the right DataFrame.
match_condition – The match condition for asof join.
Note
When both
lsuffix
andrsuffix
are empty, the overlapping columns will have random column names in the resulting DataFrame. You can reference to these randomly named columns usingColumn.alias()
(See the first usage in Examples).See also
Usage notes for asof join: https://docs.snowflake.com/LIMITEDACCESS/asof-join#usage-notes
- Examples::
>>> from snowflake.snowpark.functions import col >>> df1 = session.create_dataframe([[1, 2], [3, 4], [5, 6]], schema=["a", "b"]) >>> df2 = session.create_dataframe([[1, 7], [3, 8]], schema=["a", "c"]) >>> df1.join(df2, df1.a == df2.a).select(df1.a.alias("a_1"), df2.a.alias("a_2"), df1.b, df2.c).show() ----------------------------- |"A_1" |"A_2" |"B" |"C" | ----------------------------- |1 |1 |2 |7 | |3 |3 |4 |8 | ----------------------------- >>> # refer a single column "a" >>> df1.join(df2, "a").select(df1.a.alias("a"), df1.b, df2.c).show() ------------------- |"A" |"B" |"C" | ------------------- |1 |2 |7 | |3 |4 |8 | ------------------- >>> # rename the ambiguous columns >>> df3 = df1.to_df("df1_a", "b") >>> df4 = df2.to_df("df2_a", "c") >>> df3.join(df4, col("df1_a") == col("df2_a")).select(col("df1_a").alias("a"), "b", "c").show() ------------------- |"A" |"B" |"C" | ------------------- |1 |2 |7 | |3 |4 |8 | -------------------
>>> # join multiple columns >>> mdf1 = session.create_dataframe([[1, 2], [3, 4], [5, 6]], schema=["a", "b"]) >>> mdf2 = session.create_dataframe([[1, 2], [3, 4], [7, 6]], schema=["a", "b"]) >>> mdf1.join(mdf2, ["a", "b"]).show() ------------- |"A" |"B" | ------------- |1 |2 | |3 |4 | ------------- >>> mdf1.join(mdf2, (mdf1["a"] < mdf2["a"]) & (mdf1["b"] == mdf2["b"])).select(mdf1["a"].as_("new_a"), mdf1["b"].as_("new_b")).show() --------------------- |"NEW_A" |"NEW_B" | --------------------- |5 |6 | --------------------- >>> # use lsuffix and rsuffix to resolve duplicating column names >>> mdf1.join(mdf2, (mdf1["a"] < mdf2["a"]) & (mdf1["b"] == mdf2["b"]), lsuffix="_left", rsuffix="_right").show() ----------------------------------------------- |"A_LEFT" |"B_LEFT" |"A_RIGHT" |"B_RIGHT" | ----------------------------------------------- |5 |6 |7 |6 | ----------------------------------------------- >>> mdf1.join(mdf2, (mdf1["a"] < mdf2["a"]) & (mdf1["b"] == mdf2["b"]), rsuffix="_right").show() ------------------------------------- |"A" |"B" |"A_RIGHT" |"B_RIGHT" | ------------------------------------- |5 |6 |7 |6 | ------------------------------------- >>> # examples of different joins >>> df5 = session.create_dataframe([3, 4, 5, 5, 6, 7], schema=["id"]) >>> df6 = session.create_dataframe([5, 6, 7, 7, 8, 9], schema=["id"]) >>> # inner join >>> df5.join(df6, "id", "inner").sort("id").show() -------- |"ID" | -------- |5 | |5 | |6 | |7 | |7 | -------- >>> # left/leftouter join >>> df5.join(df6, "id", "left").sort("id").show() -------- |"ID" | -------- |3 | |4 | |5 | |5 | |6 | |7 | |7 | -------- >>> # right/rightouter join >>> df5.join(df6, "id", "right").sort("id").show() -------- |"ID" | -------- |5 | |5 | |6 | |7 | |7 | |8 | |9 | -------- >>> # full/outer/fullouter join >>> df5.join(df6, "id", "full").sort("id").show() -------- |"ID" | -------- |3 | |4 | |5 | |5 | |6 | |7 | |7 | |8 | |9 | -------- >>> # semi/leftsemi join >>> df5.join(df6, "id", "semi").sort("id").show() -------- |"ID" | -------- |5 | |5 | |6 | |7 | -------- >>> # anti/leftanti join >>> df5.join(df6, "id", "anti").sort("id").show() -------- |"ID" | -------- |3 | |4 | --------
Note
When performing chained operations, this method will not work if there are ambiguous column names. For example,
>>> df1.filter(df1.a == 1).join(df2, df1.a == df2.a).select(df1.a.alias("a"), df1.b, df2.c)
will not work because
df1.filter(df1.a == 1)
has produced a new dataframe and you cannot refer todf1.a
anymore. Instead, you can do either>>> df1.join(df2, (df1.a == 1) & (df1.a == df2.a)).select(df1.a.alias("a"), df1.b, df2.c).show() ------------------- |"A" |"B" |"C" | ------------------- |1 |2 |7 | -------------------
or
>>> df3 = df1.filter(df1.a == 1) >>> df3.join(df2, df3.a == df2.a).select(df3.a.alias("a"), df3.b, df2.c).show() ------------------- |"A" |"B" |"C" | ------------------- |1 |2 |7 | -------------------
>>> # asof join examples >>> df1 = session.create_dataframe([['A', 1, 15, 3.21], ... ['A', 2, 16, 3.22], ... ['B', 1, 17, 3.23], ... ['B', 2, 18, 4.23]], ... schema=["c1", "c2", "c3", "c4"]) >>> df2 = session.create_dataframe([['A', 1, 14, 3.19], ... ['B', 2, 16, 3.04]], ... schema=["c1", "c2", "c3", "c4"]) >>> df1.join(df2, on=["c1", "c2"], how="asof", match_condition=(df1.c3 >= df2.c3)) \ ... .select(df1.c1, df1.c2, df1.c3.alias("C3_1"), df1.c4.alias("C4_1"), df2.c3.alias("C3_2"), df2.c4.alias("C4_2")) \ ... .order_by("c1", "c2").show() --------------------------------------------------- |"C1" |"C2" |"C3_1" |"C4_1" |"C3_2" |"C4_2" | --------------------------------------------------- |A |1 |15 |3.21 |14 |3.19 | |A |2 |16 |3.22 |NULL |NULL | |B |1 |17 |3.23 |NULL |NULL | |B |2 |18 |4.23 |16 |3.04 | ---------------------------------------------------
>>> df1.join(df2, on=(df1.c1 == df2.c1) & (df1.c2 == df2.c2), how="asof", ... match_condition=(df1.c3 >= df2.c3), lsuffix="_L", rsuffix="_R") \ ... .order_by("C1_L", "C2_L").show() ------------------------------------------------------------------------- |"C1_L" |"C2_L" |"C3_L" |"C4_L" |"C1_R" |"C2_R" |"C3_R" |"C4_R" | ------------------------------------------------------------------------- |A |1 |15 |3.21 |A |1 |14 |3.19 | |A |2 |16 |3.22 |NULL |NULL |NULL |NULL | |B |1 |17 |3.23 |NULL |NULL |NULL |NULL | |B |2 |18 |4.23 |B |2 |16 |3.04 | -------------------------------------------------------------------------
>>> df1 = df1.alias("L") >>> df2 = df2.alias("R") >>> df1.join(df2, using_columns=["c1", "c2"], how="asof", ... match_condition=(df1.c3 >= df2.c3)).order_by("C1", "C2").show() ----------------------------------------------- |"C1" |"C2" |"C3L" |"C4L" |"C3R" |"C4R" | ----------------------------------------------- |A |1 |15 |3.21 |14 |3.19 | |A |2 |16 |3.22 |NULL |NULL | |B |1 |17 |3.23 |NULL |NULL | |B |2 |18 |4.23 |16 |3.04 | -----------------------------------------------