DataFrame

Classes


`DataFrame`([session, plan, is_cached])	Represents a lazily-evaluated relational dataset that contains a collection of `Row` objects with columns defined by a schema (column name and type).
`DataFrameNaFunctions`(df)	Provides functions for handling missing values in a `DataFrame`.
`DataFrameStatFunctions`(df)	Provides computed statistical functions for DataFrames.
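
The "lazily-evaluated" wording above can be pictured with a toy model. The sketch below is plain Python and does not use or reproduce the library's implementation; `LazyFrame`, its methods, and `is_even` are hypothetical names chosen for illustration. It only shows the general idea that transformations record a plan and that an action such as `collect` is what runs it.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Row = Tuple  # hypothetical stand-in for a Row object


class LazyFrame:
    """Toy model of a lazily evaluated dataset (illustration only)."""

    def __init__(self, source: Iterable[Row], plan: Optional[List[Callable]] = None):
        self._source = list(source)
        self._plan = list(plan or [])  # recorded transformations, not yet run

    def filter(self, predicate: Callable[[Row], bool]) -> "LazyFrame":
        # Returns a new frame with a longer plan; no rows are inspected here.
        step = lambda rows: [r for r in rows if predicate(r)]
        return LazyFrame(self._source, self._plan + [step])

    def limit(self, n: int) -> "LazyFrame":
        # Also only extends the plan.
        step = lambda rows: rows[:n]
        return LazyFrame(self._source, self._plan + [step])

    def collect(self) -> List[Row]:
        # The action: only now is the recorded plan executed, step by step.
        rows = list(self._source)
        for step in self._plan:
            rows = step(rows)
        return rows


evaluated = []  # records which rows the predicate actually inspected


def is_even(row: Row) -> bool:
    evaluated.append(row)
    return row[0] % 2 == 0


frame = LazyFrame([(1,), (2,), (3,), (4,)]).filter(is_even).limit(1)
assert evaluated == []    # building the frame ran nothing
result = frame.collect()  # evaluation happens here
```

In this toy, every transformation returns a new object and leaves the original untouched, which mirrors the "Returns a new DataFrame" phrasing used throughout the method list.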

Methods


`DataFrame.agg`(*exprs)	Aggregate the data in the DataFrame.
`DataFrame.approxQuantile`(col, percentile, *)	For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.
`DataFrame.approx_quantile`(col, percentile, *)	For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.
`DataFrame.cache_result`(*[, statement_params])	Caches the content of this DataFrame to create a new cached Table DataFrame.
`DataFrame.col`(col_name)	Returns a reference to a column in the DataFrame.
`DataFrame.collect`()	Executes the query representing this DataFrame and returns the result as a list of `Row` objects.
`DataFrame.collect_nowait`(*[, statement_params])	Executes the query representing this DataFrame asynchronously and returns an `AsyncJob` object.
`DataFrame.copy_into_table`(table_name, *[, ...])	Executes a COPY INTO <table> command to load data from files in a stage location into a specified table.
`DataFrame.corr`(col1, col2, *[, statement_params])	Calculates the correlation coefficient for non-null pairs in two numeric columns.
`DataFrame.count`()	Executes the query representing this DataFrame and returns the number of rows in the result (similar to the COUNT function in SQL).
`DataFrame.cov`(col1, col2, *[, statement_params])	Calculates the sample covariance for non-null pairs in two numeric columns.
`DataFrame.createOrReplaceTempView`(name, *[, ...])	Creates a temporary view that returns the same results as this DataFrame.
`DataFrame.createOrReplaceView`(name, *[, ...])	Creates a view that captures the computation expressed by this DataFrame.
`DataFrame.create_or_replace_temp_view`(name, *)	Creates a temporary view that returns the same results as this DataFrame.
`DataFrame.create_or_replace_view`(name, *[, ...])	Creates a view that captures the computation expressed by this DataFrame.
`DataFrame.crossJoin`(right, *[, lsuffix, rsuffix])	Performs a cross join, which returns the Cartesian product of the current `DataFrame` and another `DataFrame` (`right`).
`DataFrame.cross_join`(right, *[, lsuffix, ...])	Performs a cross join, which returns the Cartesian product of the current `DataFrame` and another `DataFrame` (`right`).
`DataFrame.crosstab`(col1, col2, *[, ...])	Computes a pair-wise frequency table (a `contingency table`) for the specified columns.
`DataFrame.cube`(*cols)	Performs a SQL GROUP BY CUBE.
`DataFrame.describe`(*cols)	Computes basic statistics for numeric columns, which includes `count`, `mean`, `stddev`, `min`, and `max`.
`DataFrame.distinct`()	Returns a new DataFrame that contains only the rows with distinct values from the current DataFrame.
`DataFrame.drop`(*cols)	Returns a new DataFrame that excludes the columns with the specified names from the output.
`DataFrame.dropDuplicates`(*subset)	Creates a new DataFrame by removing duplicate rows, compared on the given subset of columns.
`DataFrame.drop_duplicates`(*subset)	Creates a new DataFrame by removing duplicate rows, compared on the given subset of columns.
`DataFrame.dropna`([how, thresh, subset])	Returns a new DataFrame that excludes all rows containing fewer than a specified number of non-null and non-NaN values in the specified columns.
`DataFrame.except_`(other)	Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the `other` DataFrame.
`DataFrame.explain`()	Prints the list of queries that will be executed to evaluate this DataFrame.
`DataFrame.fillna`(value[, subset])	Returns a new DataFrame that replaces all null and NaN values in the specified columns with the values provided.
`DataFrame.filter`(expr)	Filters rows based on the specified conditional expression (similar to WHERE in SQL).
`DataFrame.first`()	Executes the query representing this DataFrame and returns the first `n` rows of the results.
`DataFrame.flatten`(input[, path, outer, ...])	Flattens (explodes) compound values into multiple rows.
`DataFrame.groupBy`(*cols)	Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).
`DataFrame.group_by`(*cols)	Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).
`DataFrame.group_by_grouping_sets`(*grouping_sets)	Performs a SQL GROUP BY GROUPING SETS.
`DataFrame.intersect`(other)	Returns a new DataFrame that contains the intersection of rows from the current DataFrame and another DataFrame (`other`).
`DataFrame.join`(right[, on, how, lsuffix, ...])	Performs a join of the specified type (`how`) with the current DataFrame and another DataFrame (`right`) on a list of columns (`on`).
`DataFrame.join_table_function`(func, ...)	Lateral joins the current DataFrame with the output of the specified table function.
`DataFrame.limit`(n[, offset])	Returns a new DataFrame that contains at most `n` rows from the current DataFrame, skipping `offset` rows from the beginning (similar to LIMIT and OFFSET in SQL).
`DataFrame.minus`(other)	Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the `other` DataFrame.
`DataFrame.natural_join`(right[, how])	Performs a natural join of the specified type (`how`) with the current DataFrame and another DataFrame (`right`).
`DataFrame.orderBy`(*cols[, ascending])	Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
`DataFrame.order_by`(*cols[, ascending])	Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
`DataFrame.pivot`(pivot_col, values)	Rotates this DataFrame by turning the unique values from one column in the input expression into multiple columns and aggregating results where required on any remaining column values.
`DataFrame.randomSplit`(weights[, seed, ...])	Randomly splits the current DataFrame into separate DataFrames, using the specified weights.
`DataFrame.random_split`(weights[, seed, ...])	Randomly splits the current DataFrame into separate DataFrames, using the specified weights.
`DataFrame.rename`(existing, new)	Returns a DataFrame with the specified column `existing` renamed as `new`.
`DataFrame.replace`(to_replace[, value, subset])	Returns a new DataFrame that replaces values in the specified columns.
`DataFrame.rollup`(*cols)	Performs a SQL GROUP BY ROLLUP.
`DataFrame.sample`([frac, n])	Samples rows based on either the number of rows to be returned or a percentage of rows to be returned.
`DataFrame.sampleBy`(col, fractions)	Returns a DataFrame containing a stratified sample without replacement, based on a `dict` that specifies the fraction for each stratum.
`DataFrame.sample_by`(col, fractions)	Returns a DataFrame containing a stratified sample without replacement, based on a `dict` that specifies the fraction for each stratum.
`DataFrame.select`(*cols)	Returns a new DataFrame with the specified Column expressions as output (similar to SELECT in SQL).
`DataFrame.selectExpr`(*exprs)	Projects a set of SQL expressions and returns a new `DataFrame`.
`DataFrame.select_expr`(*exprs)	Projects a set of SQL expressions and returns a new `DataFrame`.
`DataFrame.show`([n, max_width, statement_params])	Evaluates this DataFrame and prints out the first `n` rows with the specified maximum number of characters per column.
`DataFrame.sort`(*cols[, ascending])	Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).
`DataFrame.subtract`(other)	Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the `other` DataFrame.
`DataFrame.take`([n, statement_params, block])	Executes the query representing this DataFrame and returns the first `n` rows of the results.
`DataFrame.toDF`(*names)	Creates a new DataFrame containing columns with the specified names.
`DataFrame.toLocalIterator`(*[, ...])	Executes the query representing this DataFrame and returns an iterator of `Row` objects that you can use to retrieve the results.
`DataFrame.toPandas`(*[, statement_params, block])	Executes the query representing this DataFrame and returns the result as a pandas DataFrame.
`DataFrame.to_df`(*names)	Creates a new DataFrame containing columns with the specified names.
`DataFrame.to_local_iterator`()	Executes the query representing this DataFrame and returns an iterator of `Row` objects that you can use to retrieve the results.
`DataFrame.to_pandas`()	Executes the query representing this DataFrame and returns the result as a pandas DataFrame.
`DataFrame.to_pandas_batches`()	Executes the query representing this DataFrame and returns an iterator of pandas DataFrames (each containing a subset of rows) that you can use to retrieve the results.
`DataFrame.union`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), excluding any duplicate rows.
`DataFrame.unionAll`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), including any duplicate rows.
`DataFrame.unionAllByName`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), including any duplicate rows. Columns are matched by name rather than by position.
`DataFrame.unionByName`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), excluding any duplicate rows. Columns are matched by name rather than by position.
`DataFrame.union_all`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), including any duplicate rows.
`DataFrame.union_all_by_name`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), including any duplicate rows. Columns are matched by name rather than by position.
`DataFrame.union_by_name`(other)	Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (`other`), excluding any duplicate rows. Columns are matched by name rather than by position.
`DataFrame.unpivot`(value_column, name_column, ...)	Rotates a table by transforming columns into rows.
`DataFrame.where`(expr)	Filters rows based on the specified conditional expression (similar to WHERE in SQL).
`DataFrame.withColumn`(col_name, col)	Returns a DataFrame with an additional column with the specified name `col_name`.
`DataFrame.withColumnRenamed`(existing, new)	Returns a DataFrame with the specified column `existing` renamed as `new`.
`DataFrame.with_column`(col_name, col)	Returns a DataFrame with an additional column with the specified name `col_name`.
`DataFrame.with_column_renamed`(existing, new)	Returns a DataFrame with the specified column `existing` renamed as `new`.
`DataFrame.with_columns`(col_names, values)	Returns a DataFrame with additional columns with the specified names `col_names`.
`DataFrameNaFunctions.drop`([how, thresh, subset])	Returns a new DataFrame that excludes all rows containing fewer than a specified number of non-null and non-NaN values in the specified columns.
`DataFrameNaFunctions.fill`(value[, subset])	Returns a new DataFrame that replaces all null and NaN values in the specified columns with the values provided.
`DataFrameNaFunctions.replace`(to_replace[, ...])	Returns a new DataFrame that replaces values in the specified columns.
`DataFrameStatFunctions.approxQuantile`(col, ...)	For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.
`DataFrameStatFunctions.approx_quantile`(col, ...)	For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.
`DataFrameStatFunctions.corr`(col1, col2, *[, ...])	Calculates the correlation coefficient for non-null pairs in two numeric columns.
`DataFrameStatFunctions.cov`(col1, col2, *[, ...])	Calculates the sample covariance for non-null pairs in two numeric columns.
`DataFrameStatFunctions.crosstab`(col1, col2, *)	Computes a pair-wise frequency table (a `contingency table`) for the specified columns.
`DataFrameStatFunctions.sampleBy`(col, fractions)	Returns a DataFrame containing a stratified sample without replacement, based on a `dict` that specifies the fraction for each stratum.
`DataFrameStatFunctions.sample_by`(col, fractions)	Returns a DataFrame containing a stratified sample without replacement, based on a `dict` that specifies the fraction for each stratum.
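
The `union` family in the list above differs along two axes: whether duplicate rows are kept, and whether columns are lined up by position or by name. The following plain-Python sketch illustrates those two axes on lists of tuples. It is not the library's code; the helper functions and sample data are hypothetical, and the row order shown is an artifact of this sketch (a SQL engine gives no ordering guarantee without an explicit sort).

```python
from typing import List, Sequence, Tuple

Row = Tuple


def union_all(left: List[Row], right: List[Row]) -> List[Row]:
    """Keep every row from both inputs, duplicates included."""
    return left + right


def union(left: List[Row], right: List[Row]) -> List[Row]:
    """Keep one copy of each distinct row (first-seen order is this sketch's choice)."""
    return list(dict.fromkeys(left + right))


def union_all_by_name(
    left_cols: Sequence[str],
    left_rows: List[Row],
    right_cols: Sequence[str],
    right_rows: List[Row],
) -> List[Row]:
    """Reorder right-hand values so they line up with the left-hand column names."""
    index = [right_cols.index(c) for c in left_cols]
    return left_rows + [tuple(row[i] for i in index) for row in right_rows]


a = [(1, "x"), (2, "y")]
b = [(2, "y"), (3, "z")]

all_rows = union_all(a, b)   # (2, "y") appears twice
distinct_rows = union(a, b)  # (2, "y") appears once

# Same data as (3, "z"), but stored with the columns in the opposite order.
by_name = union_all_by_name(["id", "tag"], a, ["tag", "id"], [("z", 3)])
```

A positional union of the last pair would place `"z"` under `id`, which is the situation the by-name variants are meant to avoid.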

Attributes


`DataFrame.columns`	Returns all column names as a list.
`DataFrame.na`	Returns a `DataFrameNaFunctions` object that provides functions for handling missing values in the DataFrame.
`DataFrame.queries`	Returns a `dict` that contains a list of queries that will be executed to evaluate this DataFrame under the key `queries`, and a list of post-execution actions (e.g., queries to clean up temporary objects) under the key `post_actions`.
`DataFrame.schema`	The definition of the columns in this DataFrame (the "relational schema" for the DataFrame).
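`DataFrame.stat`	Returns a `DataFrameStatFunctions` object that provides computed statistical functions for the DataFrame.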
`DataFrame.write`	Returns a new `DataFrameWriter` object that you can use to write the data in the `DataFrame` to a Snowflake database or a stage location.
`DataFrame.is_cached`	Whether the DataFrame is cached.
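
The `na` attribute leads to `DataFrameNaFunctions.drop`, which is described as excluding rows that contain fewer than a specified number of non-null and non-NaN values in the specified columns. The plain-Python sketch below models only that threshold rule on a list of dicts. The function names are hypothetical, and the sketch makes no claim about how the library's `how`, `thresh`, and `subset` parameters interact beyond what the description states.

```python
import math
from typing import Any, Dict, List, Optional, Sequence


def is_missing(value: Any) -> bool:
    """Treat None (null) and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def drop_rows_below_threshold(
    rows: List[Dict[str, Any]],
    thresh: int,
    subset: Optional[Sequence[str]] = None,
) -> List[Dict[str, Any]]:
    """Keep a row only if it has at least `thresh` present values in `subset`."""
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row.keys())
        present = sum(1 for c in cols if not is_missing(row[c]))
        if present >= thresh:
            kept.append(row)
    return kept


rows = [
    {"a": 1.0, "b": 2.0},            # two present values
    {"a": None, "b": 3.0},           # one present value
    {"a": float("nan"), "b": None},  # no present values
]

kept_two = drop_rows_below_threshold(rows, thresh=2)
kept_one = drop_rows_below_threshold(rows, thresh=1)
kept_subset = drop_rows_below_threshold(rows, thresh=1, subset=["a"])
```

Restricting the check to a subset changes the outcome: the second row survives `thresh=1` over all columns but is dropped when only column `a` is considered.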