DataFrame.drop_duplicates(*subset: Union[str, Iterable[str]]) → DataFrame

Creates a new DataFrame by removing duplicated rows on the given subset of columns.

If no subset of columns is specified, this function is the same as the distinct() function. When the subset does not include all columns, the result is non-deterministic: for rows that match on the subset but differ in the other columns, which row is kept is not specified.

For example, if a DataFrame df has columns (“a”, “b”, “c”) and contains three rows (1, 1, 1), (1, 1, 2), (1, 2, 3), the result of df.dropDuplicates("a", "b") can be either (1, 1, 1), (1, 2, 3) or (1, 1, 2), (1, 2, 3).
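The non-determinism in this example can be modeled in plain Python. The sketch below is only an illustration of the semantics, not the library's implementation: the helper possible_results is hypothetical, and it assumes that a valid result keeps exactly one row for each distinct combination of values in the subset columns.

```python
from itertools import product

# Rows of the example DataFrame, with columns ("a", "b", "c").
columns = ("a", "b", "c")
rows = [(1, 1, 1), (1, 1, 2), (1, 2, 3)]


def possible_results(rows, columns, subset):
    """Hypothetical helper: list every result that keeps one row per subset key."""
    idx = [columns.index(name) for name in subset]
    groups = {}
    for row in rows:
        # Rows that agree on the subset columns fall into the same group.
        key = tuple(row[i] for i in idx)
        groups.setdefault(key, []).append(row)
    # Any choice of one row per group is a valid deduplicated result.
    return [list(choice) for choice in product(*groups.values())]


results = possible_results(rows, columns, ("a", "b"))
for result in results:
    print(result)
```

With the subset ("a", "b"), the rows (1, 1, 1) and (1, 1, 2) share a key, so either one may be kept, which gives the two outcomes listed above. With all three columns as the subset, every row in this example has its own key and only one outcome remains.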


subset – The column names on which duplicates are dropped.

dropDuplicates() is an alias of drop_duplicates().