You are viewing documentation about an older version (1.0.9).

snowflake.ml.modeling.cluster.MeanShift

class snowflake.ml.modeling.cluster.MeanShift(*, bandwidth=None, seeds=None, bin_seeding=False, min_bin_freq=1, cluster_all=True, n_jobs=None, max_iter=300, input_cols: Optional[Union[str, Iterable[str]]] = None, output_cols: Optional[Union[str, Iterable[str]]] = None, label_cols: Optional[Union[str, Iterable[str]]] = None, drop_input_cols: Optional[bool] = False, sample_weight_col: Optional[str] = None)

Bases: BaseTransformer

Mean shift clustering using a flat kernel. For more details on this class, see sklearn.cluster.MeanShift.

bandwidth: float, default=None

Bandwidth used in the flat kernel.

If not given, the bandwidth is estimated using sklearn.cluster.estimate_bandwidth; see the documentation for that function for hints on scalability (see also the Notes, below).
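To make the role of the bandwidth concrete, here is a minimal pure-Python sketch of a flat-kernel mean shift. It is an illustration only, not the scikit-learn or Snowflake implementation. The helper names `flat_kernel_shift` and `climb`, the `tol` stopping threshold, and the toy data are assumptions made for this example.

```python
import math


def flat_kernel_shift(point, data, bandwidth):
    """Return the mean of all samples within `bandwidth` of `point`."""
    neighbors = [x for x in data if math.dist(point, x) <= bandwidth]
    if not neighbors:
        # No sample falls inside the kernel, so the point does not move.
        return point
    dims = len(point)
    return tuple(sum(x[d] for x in neighbors) / len(neighbors) for d in range(dims))


def climb(seed, data, bandwidth, max_iter=300, tol=1e-6):
    """Shift a seed repeatedly until it stops moving or max_iter is reached."""
    point = tuple(seed)
    for _ in range(max_iter):
        shifted = flat_kernel_shift(point, data, bandwidth)
        if math.dist(point, shifted) < tol:  # hypothetical convergence test
            break
        point = shifted
    return point


# Two well-separated groups of toy points.
data = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (5.0, 5.0), (5.2, 5.0)]
mode_a = climb((0.0, 0.0), data, bandwidth=1.0)
mode_b = climb((5.0, 5.0), data, bandwidth=1.0)
```

With a bandwidth of 1.0, each seed only averages over its own group, so the two seeds settle on two different modes. A bandwidth large enough to cover both groups would pull both seeds toward a single shared mean.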

seeds: array-like of shape (n_samples, n_features), default=None

Seeds used to initialize kernels. If not set, the seeds are calculated by sklearn.cluster.get_bin_seeds with bandwidth as the grid size and default values for other parameters.

bin_seeding: bool, default=False

If true, initial kernel locations are not locations of all points, but rather the location of the discretized version of points, where points are binned onto a grid whose coarseness corresponds to the bandwidth. Setting this option to True will speed up the algorithm because fewer seeds will be initialized. The default value is False. Ignored if seeds argument is not None.

min_bin_freq: int, default=1

To speed up the algorithm, accept only those bins with at least min_bin_freq points as seeds.
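The interaction between bin_seeding and min_bin_freq can be sketched in plain Python as follows. This is a simplified illustration of grid binning, not the library's code. The function name `bin_seeds` and the sample points are assumptions for this example.

```python
from collections import Counter


def bin_seeds(data, bin_size, min_bin_freq=1):
    """Snap points onto a grid and keep cells with enough points as seeds."""
    # Count how many points round to each grid cell.
    counts = Counter(
        tuple(round(v / bin_size) for v in point) for point in data
    )
    # Convert surviving cell indices back to coordinates.
    return [
        tuple(idx * bin_size for idx in cell)
        for cell, n in counts.items()
        if n >= min_bin_freq
    ]


data = [(0.1, 0.1), (0.2, -0.1), (0.3, 0.2), (4.9, 5.1), (9.8, 0.1)]
seeds_all = bin_seeds(data, bin_size=1.0, min_bin_freq=1)
seeds_dense = bin_seeds(data, bin_size=1.0, min_bin_freq=2)
```

Five points collapse to three seeds when every occupied cell is accepted, and to a single seed when a cell needs at least two points. Fewer seeds mean fewer hill-climbing runs, which is where the speed-up comes from.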

cluster_all: bool, default=True

If true, then all points are clustered, even those orphans that are not within any kernel. Orphans are assigned to the nearest kernel. If false, then orphans are given cluster label -1.
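The effect of cluster_all on orphans can be illustrated with a short sketch. It assumes cluster centers have already been found and that an orphan is a point farther than the bandwidth from its nearest center. The function name `assign_labels` and the data are hypothetical.

```python
import math


def assign_labels(data, centers, bandwidth, cluster_all=True):
    """Label each point with its nearest center, or -1 for orphans."""
    labels = []
    for point in data:
        distances = [math.dist(point, c) for c in centers]
        nearest = min(range(len(centers)), key=distances.__getitem__)
        if cluster_all or distances[nearest] <= bandwidth:
            labels.append(nearest)
        else:
            # The point lies outside every kernel and is left unclustered.
            labels.append(-1)
    return labels


centers = [(0.0, 0.0), (5.0, 5.0)]
data = [(0.1, 0.0), (5.0, 4.9), (20.0, 20.0)]
labels_all = assign_labels(data, centers, bandwidth=1.0, cluster_all=True)
labels_strict = assign_labels(data, centers, bandwidth=1.0, cluster_all=False)
```

The far-away point (20.0, 20.0) is assigned to the nearest center when cluster_all is true and receives the label -1 otherwise.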

n_jobs: int, default=None

The number of jobs to use for the computation. The following tasks benefit from the parallelization:

  • The search of nearest neighbors for bandwidth estimation and label assignments. See the details in the docstring of the NearestNeighbors class.

  • Hill-climbing optimization for all seeds.

None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

max_iter: int, default=300

Maximum number of iterations per seed point. If a seed point has not converged after this many iterations, the clustering operation terminates for that seed point.

input_cols: Optional[Union[str, List[str]]]

A string or list of strings representing column names that contain features. If this parameter is not specified, all columns in the input DataFrame except the columns specified by the label_cols and sample_weight_col parameters are considered input columns.

label_cols: Optional[Union[str, List[str]]]

A string or list of strings representing column names that contain labels. This parameter is required for estimators, as there is no way to infer these columns. If this parameter is not specified, the object is fitted without labels (like a transformer).

output_cols: Optional[Union[str, List[str]]]

A string or list of strings representing column names that will store the output of predict and transform operations. The length of output_cols must match the expected number of output columns from the specific estimator or transformer class used. If this parameter is not specified, output column names are derived by adding an OUTPUT_ prefix to the label column names. These inferred output column names work for the estimator’s predict() method, but output_cols must be set explicitly for transformers.

sample_weight_col: Optional[str]

A string representing the column name containing the examples’ weights. This argument is only required when working with weighted datasets.

drop_input_cols: Optional[bool], default=False

If set, the output of the predict() and transform() methods will not contain the input columns.
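As a rough usage sketch, the parameters above might be combined as follows. It assumes `df` is a Snowpark DataFrame from an existing session and that `feature_cols` names its numeric feature columns. The wrapper name `cluster_with_mean_shift` and the output column name "CLUSTER" are hypothetical choices for this example.

```python
def cluster_with_mean_shift(df, feature_cols, bandwidth=None):
    """Fit MeanShift on a Snowpark DataFrame and return it with cluster labels.

    `df` and `feature_cols` are supplied by the caller; nothing here
    creates a Snowflake session.
    """
    # Imported inside the function so that defining it does not require
    # the snowflake-ml-python package to be installed.
    from snowflake.ml.modeling.cluster import MeanShift

    model = MeanShift(
        bandwidth=bandwidth,      # None lets the bandwidth be estimated
        bin_seeding=True,         # start from binned points to use fewer seeds
        input_cols=feature_cols,
        output_cols=["CLUSTER"],  # hypothetical output column name
    )
    model.fit(df)
    return model.predict(df)
```

Clustering has no label columns, so output_cols is given explicitly here rather than left to be derived from label_cols.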

Methods

fit(dataset)

Perform clustering. For more details on this function, see sklearn.cluster.MeanShift.fit.

predict(dataset)

Predict the closest cluster each sample in the dataset belongs to. For more details on this function, see sklearn.cluster.MeanShift.predict.

score(dataset)

Method not supported for this class.

set_input_cols(input_cols)

Input columns setter.

to_sklearn()

Get sklearn.cluster.MeanShift object.

Attributes

model_signatures

Returns model signature of current class.