modin.pandas.read_html
- modin.pandas.read_html(io, *, match: str | Pattern = '.+', flavor: str | None = None, header: int | Sequence[int] | None = None, index_col: int | Sequence[int] | None = None, skiprows: int | Sequence[int] | slice | None = None, attrs: dict[str, str] | None = None, parse_dates: bool = False, thousands: str | None = ',', encoding: str | None = None, decimal: str = '.', converters: dict | None = None, na_values: Iterable[object] | None = None, keep_default_na: bool = True, displayed_only: bool = True, extract_links: Literal[None, 'header', 'footer', 'body', 'all'] = None, dtype_backend: DtypeBackend | NoDefault = _NoDefault.no_default, storage_options: StorageOptions = None) -> pd.DataFrame
Read HTML tables into a list of DataFrame objects.
- Parameters:
io (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a string read() function. The string can represent a URL. Note that lxml only accepts the http, ftp and file URL protocols. If you have a URL that starts with 'https' you might try removing the 's'.
match (str or compiled regular expression, optional) – The set of tables containing text matching this regex or string will be returned. Unless the HTML is extremely simple you will probably need to pass a non-empty string here. Defaults to '.+' (match any non-empty string). The default value will return all tables contained on a page. This value is converted to a regular expression so that there is consistent behavior between Beautiful Soup and lxml.
flavor ({'lxml', 'html5lib', 'bs4'} or list-like, optional) – The parsing engine (or list of parsing engines) to use. 'bs4' and 'html5lib' are synonymous with each other; they are both there for backwards compatibility. The default of None tries to use lxml to parse and, if that fails, falls back on bs4 + html5lib.
header (int or list-like, optional) – The row (or list of rows for a MultiIndex) to use to make the column headers.
index_col (int or list-like, optional) – The column (or list of columns) to use to create the index.
skiprows (int, list-like or slice, optional) – Number of rows to skip after parsing the column integer. 0-based. If a sequence of integers or a slice is given, will skip the rows indexed by that sequence. Note that a single-element sequence means 'skip the nth row' whereas an integer means 'skip n rows'.
attrs (dict, optional) – This is a dictionary of attributes that you can pass to use to identify the table in the HTML. These are not checked for validity before being passed to lxml or Beautiful Soup. However, these attributes must be valid HTML table attributes to work correctly. For example, attrs = {'id': 'table'} is a valid attribute dictionary because the 'id' HTML tag attribute is a valid HTML attribute for any HTML tag as per this document. attrs = {'asdf': 'table'} is not a valid attribute dictionary because 'asdf' is not a valid HTML attribute even if it is a valid XML attribute. Valid HTML 4.01 table attributes can be found here. A working draft of the HTML 5 spec can be found here. It contains the latest information on table attributes for the modern web.
parse_dates (bool, optional) – See read_csv() for more details.
thousands (str, optional) – Separator to use to parse thousands. Defaults to ','.
encoding (str, optional) – The encoding used to decode the web page. Defaults to None. None preserves the previous encoding behavior, which depends on the underlying parser library (e.g., the parser library will try to use the encoding provided by the document).
decimal (str, default '.') – Character to recognize as decimal point (e.g. use ',' for European data).
converters (dict, default None) – Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the cell (not column) content, and return the transformed content.
na_values (iterable, default None) – Custom NA values.
keep_default_na (bool, default True) – If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they're appended to.
displayed_only (bool, default True) – Whether elements with 'display: none' should be parsed.
extract_links ({None, 'all', 'header', 'body', 'footer'}) – Table elements in the specified section(s) with <a> tags will have their href extracted.
dtype_backend ({'numpy_nullable', 'pyarrow'}) – Back-end data type applied to the resultant DataFrame (still experimental). If not specified, the default behavior is to not use nullable data types. If specified, the behavior is as follows: 'numpy_nullable' returns a nullable-dtype-backed DataFrame; 'pyarrow' returns a pyarrow-backed nullable ArrowDtype DataFrame.
storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with 's3://' or 'gcs://') the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.
- Returns:
A list of DataFrames.
- Return type:
dfs
See also
read_csv
Read a comma-separated values (csv) file into DataFrame.
Notes
Before using this function you should read the gotchas about the HTML parsing libraries.
Expect to do some cleanup after you call this function. For example, you might need to manually assign column names if the column names are converted to NaN when you pass the header=0 argument. We try to assume as little as possible about the structure of the table and push the idiosyncrasies of the HTML contained in the table to the user.
This function searches for <table> elements and only for <tr> and <th> rows and <td> elements within each <tr> or <th> element in the table. <td> stands for 'table data'. This function attempts to properly handle colspan and rowspan attributes. If the table has a <thead> element, it is used to construct the header; otherwise the function attempts to find the header within the body (by putting rows with only <th> elements into the header).
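The header detection and colspan handling described above can be illustrated with a small sketch. The HTML snippet and variable names are made up for illustration, and the import falls back to plain pandas when Modin is not installed (an assumption made so the snippet runs anywhere). A parser backend such as lxml, or bs4 with html5lib, must be available.

```python
from io import StringIO

try:
    import modin.pandas as pd
except ImportError:  # assumption: fall back to pandas so the sketch still runs
    import pandas as pd

# Hypothetical table: no <thead>, but the first row contains only <th> cells,
# and one body cell spans two columns.
html = """
<table>
  <tr><th>k</th><th>v1</th><th>v2</th></tr>
  <tr><td>a</td><td colspan="2">7</td></tr>
</table>
"""

df = pd.read_html(StringIO(html))[0]

# The all-<th> row is promoted to the header, and the colspan cell's
# text is repeated in each column it spans.
print(list(df.columns))
print(df.values.tolist())
```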
Similar to read_csv() the header argument is applied after skiprows is applied.
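A minimal sketch of how skiprows and header interact, and of the difference between an integer and a single-element list for skiprows. The table contents are invented for illustration, and the import falls back to plain pandas when Modin is not installed (an assumption for portability).

```python
from io import StringIO

try:
    import modin.pandas as pd
except ImportError:  # assumption: fall back to pandas so the sketch still runs
    import pandas as pd

# Hypothetical table made only of <td> rows, so no header is inferred.
html = """
<table>
  <tr><td>note</td><td>ignored</td></tr>
  <tr><td>a</td><td>b</td></tr>
  <tr><td>1</td><td>2</td></tr>
  <tr><td>3</td><td>4</td></tr>
</table>
"""

# Integer: skip the first row, then use the first remaining row as the header.
skip_first = pd.read_html(StringIO(html), skiprows=1, header=0)[0]

# Single-element list: skip only the row at index 1; row 0 becomes the header.
skip_index_1 = pd.read_html(StringIO(html), skiprows=[1], header=0)[0]

print(list(skip_first.columns), skip_first.values.tolist())
print(list(skip_index_1.columns), skip_index_1.values.tolist())
```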
This function will always return a list of DataFrame objects or it will fail, i.e., it will not return an empty list, save for some rare cases. It might return an empty list for inputs with a single row whose <td> elements contain only whitespace.
Examples
See the read_html documentation in the IO section of the docs for some examples of reading in HTML tables.
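As a quick illustration of the parameters above, the following sketch reads an in-memory HTML document and uses match to select one table. The HTML content and table ids are hypothetical, and the import falls back to plain pandas when Modin is not installed (an assumption for portability). The input is wrapped in StringIO because io accepts a file-like object.

```python
from io import StringIO

try:
    import modin.pandas as pd
except ImportError:  # assumption: fall back to pandas so the sketch still runs
    import pandas as pd

# Hypothetical page with two tables; only the first contains the text "price".
html = """
<table id="prices">
  <thead><tr><th>item</th><th>price</th></tr></thead>
  <tbody>
    <tr><td>apple</td><td>1,200</td></tr>
    <tr><td>pear</td><td>950</td></tr>
  </tbody>
</table>
<table id="other">
  <tr><th>name</th></tr>
  <tr><td>x</td></tr>
</table>
"""

# match keeps only tables whose text matches the regex; the default
# thousands=',' lets "1,200" parse as a number.
tables = pd.read_html(StringIO(html), match="price")
prices = tables[0]

print(len(tables))
print(prices)
```

Passing attrs={'id': 'prices'} instead of match would be another way to pick out the same table.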