You are viewing documentation about an older version (1.3.0).

DataFrame

Classes

DataFrame([session, plan, is_cached])

Represents a lazily-evaluated relational dataset that contains a collection of Row objects with columns defined by a schema (column name and type).

DataFrameNaFunctions(df)

Provides functions for handling missing values in a DataFrame.

DataFrameStatFunctions(df)

Provides computed statistical functions for DataFrames.

Methods

DataFrame.agg(*exprs)

Aggregate the data in the DataFrame.

DataFrame.approxQuantile(col, percentile, *)

For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.

DataFrame.approx_quantile(col, percentile, *)

For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.

DataFrame.cache_result(*[, statement_params])

Caches the content of this DataFrame to create a new cached Table DataFrame.

DataFrame.col(col_name)

Returns a reference to a column in the DataFrame.

DataFrame.collect()

Executes the query representing this DataFrame and returns the result as a list of Row objects.

DataFrame.collect_nowait(*[, ...])

Executes the query representing this DataFrame asynchronously and returns an AsyncJob.

DataFrame.copy_into_table(table_name, *[, ...])

Executes a COPY INTO <table> command to load data from files in a stage location into a specified table.

DataFrame.corr(col1, col2, *[, statement_params])

Calculates the correlation coefficient for non-null pairs in two numeric columns.

DataFrame.count()

Executes the query representing this DataFrame and returns the number of rows in the result (similar to the COUNT function in SQL).

DataFrame.cov(col1, col2, *[, statement_params])

Calculates the sample covariance for non-null pairs in two numeric columns.
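To illustrate what "non-null pairs" means for corr and cov, here is a plain-Python sketch. It is not Snowpark code, and the helper names (non_null_pairs, sample_cov, pearson_corr) are hypothetical. It assumes the correlation coefficient is Pearson's, which the summary above does not state. Any position where either value is None is dropped before the statistic is computed.

```python
import math

def non_null_pairs(xs, ys):
    """Keep only the positions where both values are present."""
    return [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]

def sample_cov(xs, ys):
    """Sample covariance (divides by n - 1) over the non-null pairs."""
    pairs = non_null_pairs(xs, ys)
    n = len(pairs)
    if n < 2:
        return None  # undefined for fewer than two pairs
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in pairs) / (n - 1)

def pearson_corr(xs, ys):
    """Pearson correlation over the non-null pairs (assumed definition)."""
    pairs = non_null_pairs(xs, ys)
    if len(pairs) < 2:
        return None
    px = [x for x, _ in pairs]
    py = [y for _, y in pairs]
    var_x = sample_cov(px, px)
    var_y = sample_cov(py, py)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant column
    return sample_cov(px, py) / math.sqrt(var_x * var_y)

col1 = [1.0, 2.0, 3.0, None]   # the last pair is skipped because col1 is null
col2 = [2.0, 4.0, 6.0, 8.0]
cov_value = sample_cov(col1, col2)
corr_value = pearson_corr(col1, col2)
```

Only the first three pairs contribute here, so the trailing 8.0 in col2 has no effect on either result.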

DataFrame.createOrReplaceTempView(name, *[, ...])

Creates a temporary view that returns the same results as this DataFrame.

DataFrame.createOrReplaceView(name, *[, ...])

Creates a view that captures the computation expressed by this DataFrame.

DataFrame.create_or_replace_temp_view(name, *)

Creates a temporary view that returns the same results as this DataFrame.

DataFrame.create_or_replace_view(name, *[, ...])

Creates a view that captures the computation expressed by this DataFrame.

DataFrame.crossJoin(right, *[, lsuffix, rsuffix])

Performs a cross join, which returns the Cartesian product of the current DataFrame and another DataFrame (right).

DataFrame.cross_join(right, *[, lsuffix, ...])

Performs a cross join, which returns the Cartesian product of the current DataFrame and another DataFrame (right).
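The Cartesian product described above can be modeled in plain Python, with rows as tuples. This is a sketch of the semantics only, not Snowpark code, and the sample data is made up for illustration.

```python
from itertools import product

left = [("a", 1), ("b", 2)]          # two rows, two columns
right = [("x",), ("y",), ("z",)]     # three rows, one column

# Every left row is paired with every right row; the columns are concatenated.
cartesian = [l + r for l, r in product(left, right)]
```

The result has len(left) * len(right) rows, which is why cross joins on large inputs grow quickly.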

DataFrame.crosstab(col1, col2, *[, ...])

Computes a pair-wise frequency table (a contingency table) for the specified columns.
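A contingency table counts how often each combination of values from the two columns occurs. The plain-Python sketch below models that idea; it is not Snowpark code, and the layout of the resulting table (a dict of dicts) is an assumption made for illustration.

```python
from collections import Counter

# Each tuple is (value of col1, value of col2) for one row.
rows = [("apple", "red"), ("apple", "green"), ("apple", "red"), ("pear", "green")]

counts = Counter(rows)
col1_values = sorted({a for a, _ in rows})
col2_values = sorted({b for _, b in rows})

# One entry per distinct col1 value, one inner entry per distinct col2 value.
# Combinations that never occur get a count of 0.
table = {a: {b: counts[(a, b)] for b in col2_values} for a in col1_values}
```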

DataFrame.cube(*cols)

Performs a SQL GROUP BY CUBE.

DataFrame.describe(*cols)

Computes basic statistics for numeric columns: count, mean, stddev, min, and max.

DataFrame.distinct()

Returns a new DataFrame that contains only the rows with distinct values from the current DataFrame.

DataFrame.drop(*cols)

Returns a new DataFrame that excludes the columns with the specified names from the output.

DataFrame.dropDuplicates(*subset)

Creates a new DataFrame by removing duplicate rows based on the given subset of columns.

DataFrame.drop_duplicates(*subset)

Creates a new DataFrame by removing duplicate rows based on the given subset of columns.

DataFrame.dropna([how, thresh, subset])

Returns a new DataFrame that excludes all rows containing fewer than a specified number of non-null and non-NaN values in the specified columns.
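The threshold rule in this description can be sketched in plain Python. This is a model of the semantics rather than Snowpark code: rows are dicts, and the helpers is_missing and drop_rows are hypothetical. A row is kept when at least thresh of the listed columns hold a value that is neither None nor NaN.

```python
import math

def is_missing(value):
    """Treat both None (SQL NULL) and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_rows(rows, thresh, subset):
    """Keep rows with at least `thresh` non-missing values among `subset`."""
    kept = []
    for row in rows:
        present = sum(1 for col in subset if not is_missing(row[col]))
        if present >= thresh:
            kept.append(row)
    return kept

rows = [
    {"a": 1.0, "b": 2.0},            # 2 present values
    {"a": None, "b": 3.0},           # 1 present value
    {"a": float("nan"), "b": None},  # 0 present values
]

at_least_one = drop_rows(rows, thresh=1, subset=["a", "b"])
at_least_two = drop_rows(rows, thresh=2, subset=["a", "b"])
```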

DataFrame.except_(other)

Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the other DataFrame.

DataFrame.explain()

Prints the list of queries that will be executed to evaluate this DataFrame.

DataFrame.fillna(value[, subset])

Returns a new DataFrame that replaces all null and NaN values in the specified columns with the values provided.

DataFrame.filter(expr)

Filters rows based on the specified conditional expression (similar to WHERE in SQL).

DataFrame.first()

Executes the query representing this DataFrame and returns the first n rows of the results.

DataFrame.flatten(input[, path, outer, ...])

Flattens (explodes) compound values into multiple rows.

DataFrame.groupBy(*cols)

Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).

DataFrame.group_by(*cols)

Groups rows by the columns specified by expressions (similar to GROUP BY in SQL).

DataFrame.group_by_grouping_sets(*grouping_sets)

Performs a SQL GROUP BY GROUPING SETS.

DataFrame.intersect(other)

Returns a new DataFrame that contains the intersection of rows from the current DataFrame and another DataFrame (other).
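As a plain-Python model of intersect and except_ (not Snowpark code), rows can be treated as tuples. The sketch assumes SQL set-operator semantics, in which duplicate rows are removed from the result. The summaries above do not say this explicitly, so treat it as an assumption.

```python
current = [(1, "a"), (2, "b"), (2, "b"), (3, "c")]
other = [(2, "b"), (4, "d")]

def intersect(left, right):
    """Distinct rows of `left` that also appear in `right`."""
    right_set = set(right)
    return list(dict.fromkeys(row for row in left if row in right_set))

def except_(left, right):
    """Distinct rows of `left` that do not appear in `right`."""
    right_set = set(right)
    return list(dict.fromkeys(row for row in left if row not in right_set))

common = intersect(current, other)
remaining = except_(current, other)
```

The sketch keeps the rows in input order for readability; it makes no claim about the order in which a real query returns them.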

DataFrame.join(right[, on, how, lsuffix, ...])

Performs a join of the specified type (how) with the current DataFrame and another DataFrame (right) on a list of columns (on).

DataFrame.join_table_function(func, ...)

Lateral joins the current DataFrame with the output of the specified table function.

DataFrame.limit(n[, offset])

Returns a new DataFrame that contains at most n rows from the current DataFrame, skipping offset rows from the beginning (similar to LIMIT and OFFSET in SQL).
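The interaction between n and offset can be modeled with a list slice. This is a plain-Python sketch of the semantics, not Snowpark code, and the limit helper is hypothetical.

```python
def limit(rows, n, offset=0):
    """Skip `offset` rows, then return at most `n` of the rest."""
    return rows[offset:offset + n]

rows = list(range(10))

first_three = limit(rows, 3)            # no rows skipped
near_the_end = limit(rows, 3, offset=8) # only two rows remain after skipping
```

The second call shows why the description says "at most n rows": fewer rows come back when the offset leaves less than n rows available.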

DataFrame.minus(other)

Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the other DataFrame.

DataFrame.natural_join(right[, how])

Performs a natural join of the specified type (how) with the current DataFrame and another DataFrame (right).

DataFrame.orderBy(*cols[, ascending])

Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).

DataFrame.order_by(*cols[, ascending])

Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).

DataFrame.pivot(pivot_col, values)

Rotates this DataFrame by turning the unique values from one column in the input expression into multiple columns and aggregating results where required on any remaining column values.

DataFrame.randomSplit(weights[, seed, ...])

Randomly splits the current DataFrame into separate DataFrames, using the specified weights.

DataFrame.random_split(weights[, seed, ...])

Randomly splits the current DataFrame into separate DataFrames, using the specified weights.
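One way to picture a weighted random split is to draw a uniform random number per row and assign the row to the bucket whose cumulative weight range contains it. The plain-Python sketch below follows that idea. It is not Snowpark code and is not presented as the library's implementation. It also assumes that weights are normalized so they need not sum to 1.

```python
import random
from itertools import accumulate

def random_split(rows, weights, seed=None):
    """Split `rows` into len(weights) lists, roughly in proportion to weights."""
    rng = random.Random(seed)
    total = sum(weights)
    # Cumulative upper bounds of each bucket on the [0, 1) interval.
    bounds = list(accumulate(w / total for w in weights))
    bounds[-1] = 1.0  # guard against floating-point drift
    parts = [[] for _ in weights]
    for row in rows:
        draw = rng.random()  # uniform in [0, 1)
        for index, upper in enumerate(bounds):
            if draw < upper:
                parts[index].append(row)
                break
    return parts

parts = random_split(list(range(1000)), [0.8, 0.2], seed=42)
```

Each row lands in exactly one part, so the parts are disjoint and together cover the input. The sizes only approximate the requested proportions.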

DataFrame.rename(existing, new)

Returns a DataFrame with the specified column existing renamed as new.

DataFrame.replace(to_replace[, value, subset])

Returns a new DataFrame that replaces values in the specified columns.

DataFrame.rollup(*cols)

Performs a SQL GROUP BY ROLLUP.

DataFrame.sample([frac, n])

Samples rows based on either the number of rows to be returned or a percentage of rows to be returned.

DataFrame.sampleBy(col, fractions)

Returns a DataFrame containing a stratified sample without replacement, based on a dict that specifies the fraction for each stratum.

DataFrame.sample_by(col, fractions)

Returns a DataFrame containing a stratified sample without replacement, based on a dict that specifies the fraction for each stratum.
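Stratified sampling without replacement can be sketched in plain Python as an independent keep-or-drop decision per row, with the probability taken from the row's stratum. This is a model of the idea, not Snowpark code. Treating strata that are missing from the dict as fraction 0 is an assumption of the sketch.

```python
import random

def sample_by(rows, col, fractions, seed=None):
    """Keep each row with the probability given for its stratum."""
    rng = random.Random(seed)
    sampled = []
    for row in rows:
        fraction = fractions.get(row[col])
        # Assumption: strata absent from `fractions` are never sampled.
        if fraction is not None and rng.random() < fraction:
            sampled.append(row)
    return sampled

rows = [{"group": g, "id": i} for i, g in enumerate("aaaabbbbcccc")]

# Fractions of 1.0 and 0.0 make the outcome deterministic for this demo.
picked = sample_by(rows, "group", {"a": 1.0, "b": 0.0}, seed=0)
```

With fractions strictly between 0 and 1, the number of rows returned per stratum varies from run to run unless a seed is fixed.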

DataFrame.select(*cols)

Returns a new DataFrame with the specified Column expressions as output (similar to SELECT in SQL).

DataFrame.selectExpr(*exprs)

Projects a set of SQL expressions and returns a new DataFrame.

DataFrame.select_expr(*exprs)

Projects a set of SQL expressions and returns a new DataFrame.

DataFrame.show([n, max_width, statement_params])

Evaluates this DataFrame and prints out the first n rows with the specified maximum number of characters per column.

DataFrame.sort(*cols[, ascending])

Sorts a DataFrame by the specified expressions (similar to ORDER BY in SQL).

DataFrame.subtract(other)

Returns a new DataFrame that contains all the rows from the current DataFrame except for the rows that also appear in the other DataFrame.

DataFrame.take([n, statement_params, block])

Executes the query representing this DataFrame and returns the first n rows of the results.

DataFrame.toDF(*names)

Creates a new DataFrame containing columns with the specified names.

DataFrame.toLocalIterator(*[, ...])

Executes the query representing this DataFrame and returns an iterator of Row objects that you can use to retrieve the results.

DataFrame.toPandas(*[, statement_params, block])

Executes the query representing this DataFrame and returns the result as a pandas DataFrame.

DataFrame.to_df(*names)

Creates a new DataFrame containing columns with the specified names.

DataFrame.to_local_iterator()

Executes the query representing this DataFrame and returns an iterator of Row objects that you can use to retrieve the results.

DataFrame.to_pandas()

Executes the query representing this DataFrame and returns the result as a pandas DataFrame.

DataFrame.to_pandas_batches()

Executes the query representing this DataFrame and returns an iterator of pandas DataFrames (each containing a subset of rows) that you can use to retrieve the results.

DataFrame.union(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), excluding any duplicate rows.

DataFrame.unionAll(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows.

DataFrame.unionAllByName(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows.

DataFrame.unionByName(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), excluding any duplicate rows.

DataFrame.union_all(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows.

DataFrame.union_all_by_name(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), including any duplicate rows.

DataFrame.union_by_name(other)

Returns a new DataFrame that contains all the rows in the current DataFrame and another DataFrame (other), excluding any duplicate rows.
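The union methods differ along two axes: whether duplicate rows are kept, and whether columns are matched by position or by name. The summaries for the by-name variants read the same as the positional ones, so the plain-Python sketch below spells out the difference. It is not Snowpark code. The helper reorder_by_name is hypothetical, and the by-name behavior (matching columns by name rather than position) is an assumption drawn from the method names.

```python
def union_all(left, right):
    """Concatenate rows, keeping duplicates."""
    return left + right

def union(left, right):
    """Concatenate rows, dropping duplicates (input order kept for readability)."""
    return list(dict.fromkeys(left + right))

def reorder_by_name(rows, columns, target_columns):
    """Rearrange each row so its values follow the order of `target_columns`."""
    positions = [columns.index(name) for name in target_columns]
    return [tuple(row[p] for p in positions) for row in rows]

left_cols, left = ["id", "name"], [(1, "a"), (2, "b")]
right_cols, right = ["name", "id"], [("b", 2), ("c", 3)]   # same columns, swapped

by_position = union(left, right)
right_aligned = reorder_by_name(right, right_cols, left_cols)
by_name = union(left, right_aligned)
all_by_name = union_all(left, right_aligned)
```

Matching by position treats ("b", 2) and (2, "b") as different rows, so by_position has four rows; matching by name recognizes the shared row and by_name has three.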

DataFrame.unpivot(value_column, name_column, ...)

Rotates a table by transforming columns into rows.
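Turning columns into rows can be sketched in plain Python with rows as dicts. This models the general idea only and is not Snowpark code. The third parameter is assumed to be the list of columns to rotate (the signature above elides it), and the sketch ignores null handling and identifier casing.

```python
def unpivot(rows, value_column, name_column, column_list):
    """Emit one output row per (input row, rotated column) pair."""
    result = []
    for row in rows:
        # Columns that are not rotated are carried over unchanged.
        kept = {k: v for k, v in row.items() if k not in column_list}
        for name in column_list:
            result.append({**kept, name_column: name, value_column: row[name]})
    return result

rows = [{"emp": "ann", "jan": 100, "feb": 200}]
long_rows = unpivot(rows, "sales", "month", ["jan", "feb"])
```

One wide row with two month columns becomes two narrow rows, each carrying the month name and its sales value.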

DataFrame.where(expr)

Filters rows based on the specified conditional expression (similar to WHERE in SQL).

DataFrame.withColumn(col_name, col)

Returns a DataFrame with an additional column with the specified name col_name.

DataFrame.withColumnRenamed(existing, new)

Returns a DataFrame with the specified column existing renamed as new.

DataFrame.with_column(col_name, col)

Returns a DataFrame with an additional column with the specified name col_name.

DataFrame.with_column_renamed(existing, new)

Returns a DataFrame with the specified column existing renamed as new.

DataFrame.with_columns(col_names, values)

Returns a DataFrame with additional columns with the specified names col_names.

DataFrameNaFunctions.drop([how, thresh, subset])

Returns a new DataFrame that excludes all rows containing fewer than a specified number of non-null and non-NaN values in the specified columns.

DataFrameNaFunctions.fill(value[, subset])

Returns a new DataFrame that replaces all null and NaN values in the specified columns with the values provided.

DataFrameNaFunctions.replace(to_replace[, ...])

Returns a new DataFrame that replaces values in the specified columns.

DataFrameStatFunctions.approxQuantile(col, ...)

For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.

DataFrameStatFunctions.approx_quantile(col, ...)

For a specified numeric column and a list of desired quantiles, returns an approximate value for the column at each of the desired quantiles.

DataFrameStatFunctions.corr(col1, col2, *[, ...])

Calculates the correlation coefficient for non-null pairs in two numeric columns.

DataFrameStatFunctions.cov(col1, col2, *[, ...])

Calculates the sample covariance for non-null pairs in two numeric columns.

DataFrameStatFunctions.crosstab(col1, col2, *)

Computes a pair-wise frequency table (a contingency table) for the specified columns.

DataFrameStatFunctions.sampleBy(col, fractions)

Returns a DataFrame containing a stratified sample without replacement, based on a dict that specifies the fraction for each stratum.

DataFrameStatFunctions.sample_by(col, fractions)

Returns a DataFrame containing a stratified sample without replacement, based on a dict that specifies the fraction for each stratum.

Attributes

DataFrame.columns

Returns all column names as a list.

DataFrame.na

Returns a DataFrameNaFunctions object that provides functions for handling missing values in the DataFrame.

DataFrame.queries

Returns a dict that contains a list of queries that will be executed to evaluate this DataFrame with the key queries, and a list of post-execution actions (e.g., queries to clean up temporary objects) with the key post_actions.

DataFrame.schema

The definition of the columns in this DataFrame (the "relational schema" for the DataFrame).

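DataFrame.stat

Returns a DataFrameStatFunctions object that provides computed statistical functions for the DataFrame.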

DataFrame.write

Returns a new DataFrameWriter object that you can use to write the data in the DataFrame to a Snowflake database or a stage location.

DataFrame.is_cached

Whether the DataFrame is cached.