modin.pandas.read_csv

Read csv file(s) into a Snowpark pandas DataFrame. This API can read files stored locally or on a Snowflake stage.

Snowpark pandas stages files (unless they’re already staged) and then reads them using Snowflake’s CSV reader.

Parameters:

filepath_or_buffer (str) – Local file location or staged file location to read from. Staged file locations start with a ‘@’ symbol. To read a local file location with a name starting with @, escape it using a \@. For more information on staged files, see the Snowflake documentation.
sep (str, default ',') – Delimiter to use to separate fields in an input file. Delimiters can be multiple characters in Snowpark pandas.
delimiter (str, default ',') – Alias for sep.
header (int, list of int, None, default 'infer') – Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed, the behavior is identical to header=0 and column names are inferred from the first line of the file; if column names are passed explicitly, the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. If a non-zero integer or a list of integers is passed, a NotImplementedError is raised.
names (array-like, optional) – List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.
index_col (int, str, sequence of int / str, or False, optional, default None) – Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used. Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.
usecols (list-like or callable, optional) –
Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). If names are given, the document header row(s) are not taken into account. For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.

If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD'].
dtype (Type name or dict of column -> type, optional) – Data type for data or columns. E.g. {'a': np.float64, 'b': np.int32, 'c': 'Int64'}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.
engine ({'c', 'python', 'pyarrow', 'snowflake'}, optional) – Changes the parser for reading CSVs. ‘snowflake’ will use the parser from Snowflake itself, which matches the behavior of the COPY INTO command.
converters (dict, optional) – This parameter is only supported on local files.
true_values (list, optional) – This parameter is only supported on local files.
false_values (list, optional) – This parameter is only supported on local files.
skiprows (list-like, int or callable, optional) – Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
skipfooter (int, default 0) – This parameter is only supported on local files.
nrows (int, optional) – This parameter is only supported on local files.
na_values (scalar, str, list-like, or dict, optional) – Additional strings to recognize as NA/NaN.
keep_default_na (bool, default True) – This parameter is only supported on local files.
na_filter (bool, default True) – This parameter is only supported on local files.
verbose (bool, default False) – This parameter is only supported on local files.
skip_blank_lines (bool, default True) – If True, skip over blank lines rather than interpreting as NaN values.
parse_dates (bool or list of int or names or list of lists or dict, default False) – This parameter is only supported on local files.
infer_datetime_format (bool, default False) – This parameter is only supported on local files.
keep_date_col (bool, default False) – This parameter is only supported on local files.
date_parser (function, optional) – This parameter is only supported on local files.
date_format (str or dict of column -> format, optional) – This parameter is only supported on local files.
dayfirst (bool, default False) – This parameter is only supported on local files.
cache_dates (bool, default True) – This parameter is not supported and will be ignored.
iterator (bool, default False) – This parameter is not supported and will raise an error.
chunksize (int, optional) – This parameter is not supported and will be ignored.
compression (str, default 'infer') – String (constant) that specifies the current compression algorithm for the data files to be loaded. Snowflake uses this option to detect how already-compressed data files were compressed so that the compressed data in the files can be extracted for loading. See the list of Snowflake standard compressions.
thousands (str, optional) – This parameter is only supported on local files.
decimal (str, default '.') – This parameter is only supported on local files.
lineterminator (str (length 1), optional) – This parameter is only supported on local files.
quotechar (str (length 1), optional) – The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.
quoting (int or csv.QUOTE_* instance, default 0) – This parameter is only supported on local files.
doublequote (bool, default True) – This parameter is only supported on local files.
escapechar (str (length 1), optional) – This parameter is only supported on local files.
comment (str, optional) – This parameter is only supported on local files.
encoding (str, default 'utf-8') – Encoding to use for UTF when reading/writing (e.g. ‘utf-8’). See the list of Snowflake standard encodings.
encoding_errors (str, optional, default "strict") – This parameter is only supported on local files.
dialect (str or csv.Dialect, optional) – This parameter is only supported on local files.
on_bad_lines ({'error', 'warn', 'skip'} or callable, default 'error') – This parameter is only supported on local files.
delim_whitespace (bool, default False) – This parameter is only supported on local files, not files which have been uploaded to a Snowflake stage.
low_memory (bool, default True) – This parameter is not supported and will be ignored.
memory_map (bool, default False) – This parameter is not supported and will be ignored.
float_precision (str, optional) – This parameter is not supported and will be ignored.
dtype_backend ({'numpy_nullable', 'pyarrow'}, default 'numpy_nullable') – This parameter is not supported and will be ignored.
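
The header and usecols rules described above can be modelled in plain Python. The sketch below is not Snowpark pandas code: resolve_header and select_columns are hypothetical helpers that only restate the documented rules (what header='infer' resolves to, that list-like usecols ignores element order, and that a callable usecols is evaluated against each column name).

```python
from typing import Callable, List, Optional, Sequence, Union


def resolve_header(header: Union[str, int, None] = "infer",
                   names: Optional[Sequence[str]] = None) -> Optional[int]:
    """Hypothetical helper: model the documented meaning of header='infer'."""
    if header == "infer":
        # No names passed: the first line is the header (header=0).
        # Names passed explicitly: the file is treated as having no header row.
        return 0 if names is None else None
    if header not in (0, None):
        # The documentation states that non-zero integers and lists are rejected.
        raise NotImplementedError("only 'infer', 0, or None is documented as supported")
    return header


def select_columns(columns: Sequence[str],
                   usecols: Union[Sequence[Union[int, str]], Callable[[str], bool]]) -> List[str]:
    """Hypothetical helper: model which columns a usecols argument keeps."""
    if callable(usecols):
        # Keep the names for which the callable evaluates to True.
        return [c for c in columns if usecols(c)]
    wanted = set(usecols)
    if all(isinstance(u, int) for u in wanted):
        # Positional indices: file order is kept, element order is ignored.
        return [c for i, c in enumerate(columns) if i in wanted]
    # Column names: file order is kept, element order is ignored.
    return [c for c in columns if c in wanted]


cols = ["foo", "bar", "baz"]
print(select_columns(cols, [1, 0]))          # ['foo', 'bar']
print(select_columns(cols, ["baz", "foo"]))  # ['foo', 'baz']
```

Because element order is ignored, reordering still has to happen after the read, as in the pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] form shown in the usecols description.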

Return type:

Snowpark pandas DataFrame

Raises:

NotImplementedError – If a parameter is not supported.
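
One way to spot problematic arguments before issuing a read is to check them against the parameter list above. The following is a minimal sketch, not part of the library: classify_read_csv_kwargs and the three name sets are hypothetical, transcribed by hand from the parameter descriptions on this page, and may drift from the actual implementation.

```python
from typing import Dict, List

# Transcribed from the parameter list above (assumption: the list is current).
LOCAL_ONLY = frozenset({
    "converters", "true_values", "false_values", "skipfooter", "nrows",
    "keep_default_na", "na_filter", "verbose", "parse_dates",
    "infer_datetime_format", "keep_date_col", "date_parser", "date_format",
    "dayfirst", "thousands", "decimal", "lineterminator", "quoting",
    "doublequote", "escapechar", "comment", "encoding_errors", "dialect",
    "on_bad_lines", "delim_whitespace",
})
IGNORED = frozenset({
    "cache_dates", "chunksize", "low_memory", "memory_map",
    "float_precision", "dtype_backend",
})
RAISES = frozenset({"iterator"})


def classify_read_csv_kwargs(path: str, **kwargs) -> Dict[str, List[str]]:
    """Hypothetical pre-check: report keyword arguments that deserve a second look."""
    staged = path.startswith("@")  # staged locations start with '@'
    report: Dict[str, List[str]] = {"local_only": [], "ignored": [], "error": []}
    for name in sorted(kwargs):
        if name in RAISES:
            # Flagged whenever it is passed, regardless of its value.
            report["error"].append(name)
        elif name in IGNORED:
            report["ignored"].append(name)
        elif staged and name in LOCAL_ONLY:
            report["local_only"].append(name)
    return report


print(classify_read_csv_kwargs("@mystage/data.csv", nrows=5, chunksize=10, sep=","))
# {'local_only': ['nrows'], 'ignored': ['chunksize'], 'error': []}
```

The helper only reports; it does not predict whether a given combination raises NotImplementedError, since that depends on the installed version.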

Notes

Both local files and files staged on Snowflake can be passed into filepath_or_buffer. A single file or a folder that matches a set of files can be passed into filepath_or_buffer. Local files will be processed locally by default using the standard pandas parser before they are uploaded to a staging location as parquet files. This behavior can be overridden by explicitly using the Snowflake engine with engine='snowflake'.
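
The parser selection described in this note can be summarised as a small decision function. This is a sketch of the documented behavior only: is_staged and expected_parser are hypothetical names, and treating every staged path as parsed by Snowflake regardless of engine is an assumption drawn from the description at the top of this page.

```python
from typing import Optional


def is_staged(path: str) -> bool:
    """Staged locations start with '@'; an escaped local name does not."""
    return path.startswith("@")


def expected_parser(path: str, engine: Optional[str] = None) -> str:
    """Hypothetical helper: which parser the notes say will handle `path`."""
    if is_staged(path):
        # Assumption: staged files are always read by Snowflake's CSV reader.
        return "snowflake"
    if engine == "snowflake":
        # Explicit override for local files, i.e. pd.read_csv(path, engine='snowflake').
        return "snowflake"
    # Default for local files: parsed locally with pandas, then uploaded.
    return "pandas"


print(expected_parser("/tmp/data.csv"))                      # pandas
print(expected_parser("/tmp/data.csv", engine="snowflake"))  # snowflake
print(expected_parser("@mytempstage/myprefix/data.csv"))     # snowflake
```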

If parsing the file using Snowflake, certain parameters may not be supported and the order of rows in the DataFrame may differ from the order of records in the input file. When reading multiple files, there is no deterministic order in which the files are read.
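
When a stable row order matters, one option is to sort on a key column after the read. The wrapper below is a minimal sketch under assumptions: read_csv_sorted and its reader parameter are hypothetical (the parameter exists only so the function can be exercised without a Snowflake session), and it presumes the data contains columns that define a meaningful order.

```python
from typing import Any, Callable, List, Optional


def read_csv_sorted(path: str, by: List[str],
                    reader: Optional[Callable[[str], Any]] = None) -> Any:
    """Hypothetical wrapper: read a CSV, then sort on `by` for a stable row order."""
    if reader is None:
        # Deferred imports: these need Snowpark pandas and an active session.
        import modin.pandas as pd
        import snowflake.snowpark.modin.plugin  # noqa: F401
        reader = pd.read_csv
    df = reader(path)
    # Sorting makes the order independent of how files and records were read;
    # resetting the index gives consecutive row labels afterwards.
    return df.sort_values(by=by).reset_index(drop=True)
```

With an active session, read_csv_sorted('@mytempstage/myprefix', by=['c1']) would return the rows ordered by c1 regardless of which staged file was read first.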

Examples

Read local csv file.

>>> import csv
>>> import tempfile
>>> temp_dir = tempfile.TemporaryDirectory()
>>> temp_dir_name = temp_dir.name
>>> with open(f'{temp_dir_name}/data.csv', 'w') as f:
...     writer = csv.writer(f)
...     writer.writerows([['c1','c2','c3'], [1,2,3], [4,5,6], [7,8,9]])
>>> import modin.pandas as pd
>>> import snowflake.snowpark.modin.plugin
>>> df = pd.read_csv(f'{temp_dir_name}/data.csv')
>>> df
   c1  c2  c3
0   1   2   3
1   4   5   6
2   7   8   9

Read staged csv file.

>>> _ = session.sql("create or replace temp stage mytempstage").collect()
>>> _ = session.file.put(f'{temp_dir_name}/data.csv', '@mytempstage/myprefix')
>>> df2 = pd.read_csv('@mytempstage/myprefix/data.csv')
>>> df2
   c1  c2  c3
0   1   2   3
1   4   5   6
2   7   8   9

Write a second csv file to the local folder and read it.

>>> with open(f'{temp_dir_name}/data2.csv', 'w') as f:
...     writer = csv.writer(f)
...     writer.writerows([['c1','c2','c3'], [1,2,3], [4,5,6], [7,8,9]])
>>> df3 = pd.read_csv(f'{temp_dir_name}/data2.csv')
>>> df3
   c1  c2  c3
0   1   2   3
1   4   5   6
2   7   8   9

Read csv files from a staged location.

>>> _ = session.file.put(f'{temp_dir_name}/data2.csv', '@mytempstage/myprefix')
>>> df4 = pd.read_csv('@mytempstage/myprefix')
>>> df4
   c1  c2  c3
0   1   2   3
1   4   5   6
2   7   8   9
3   1   2   3
4   4   5   6
5   7   8   9

>>> temp_dir.cleanup()