tick.preprocessing.FeaturesBinarizer(method='quantile', n_cuts=10, detect_column_type='auto', remove_first=False, bins_boundaries=None)
Transforms continuous data into bucketed binary data.
This is a scikit-learn transformer that transforms an input pandas DataFrame X of shape (n_samples, n_features) into a binary matrix of shape (n_samples, n_new_features). Continuous features are split into bins, either linearly spaced or quantile-based, and each bin becomes a binary column. Discrete features are binary encoded with K columns, where K is the number of modalities. Features of any other type are left unchanged.
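As a rough illustration of the transformation described above, the following pure-Python sketch buckets one continuous column against a set of cut points and one-hot encodes one discrete column. The helper names `bucket_one_hot` and `discrete_one_hot` are hypothetical and not part of tick, and the rule that a value equal to a cut point falls into the lower bin is an assumption made for this sketch only.

```python
from bisect import bisect_left


def bucket_one_hot(values, cuts):
    """Hypothetical helper, not part of tick.

    One-hot encode each value by the bin it falls into. `cuts` holds the
    sorted interior cut points, so there are len(cuts) + 1 bins.
    """
    rows = []
    for v in values:
        row = [0.0] * (len(cuts) + 1)
        # Assumption: a value equal to a cut point goes to the lower bin.
        row[bisect_left(cuts, v)] = 1.0
        rows.append(row)
    return rows


def discrete_one_hot(values):
    """Hypothetical helper: one column per modality (K columns in total)."""
    modalities = sorted(set(values))
    index = {m: i for i, m in enumerate(modalities)}
    rows = []
    for v in values:
        row = [0.0] * len(modalities)
        row[index[v]] = 1.0
        rows.append(row)
    return rows


continuous = [-0.9, 0.1, 0.4, 2.0]
discrete = ["a", "b", "a", "c"]
print(bucket_one_hot(continuous, cuts=[0.0, 0.3, 0.5]))  # 4 columns per row
print(discrete_one_hot(discrete))                        # 3 columns per row
```

Three cut points yield four bins, which is why the continuous column expands to four binary columns here.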
n_cuts : int, default=10
Number of cut points for continuous features.
method : "quantile", "linspace" or "given", default="quantile"
If "quantile", quantile-based cuts are used.
If "linspace", linearly spaced cuts are used.
If "given", bins_boundaries needs to be provided.
detect_column_type : "auto" or "column_names", default="auto"
If "auto", feature types are detected automatically.
If "column_names", feature types are read from the column names: a name ending in ":continuous" marks a continuous feature and a name ending in ":discrete" marks a discrete feature.
remove_first : bool, default=False
If True, the first column of each binarized continuous feature block is removed.
bins_boundaries : list, default=None
Bin boundaries for continuous features. Needs to be provided when method="given".
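To make the difference between linearly spaced and quantile-based cuts concrete, here is a small sketch using only the Python standard library. The functions `linspace_cuts` and `quantile_cuts` are hypothetical names, and the exact interpolation and endpoint handling used by tick are not specified on this page, so the numbers may differ from what the library computes.

```python
from statistics import quantiles


def linspace_cuts(values, n_cuts):
    """Hypothetical helper: n_cuts interior points evenly spaced
    between the minimum and the maximum of the sample."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (n_cuts + 1)
    return [lo + step * (i + 1) for i in range(n_cuts)]


def quantile_cuts(values, n_cuts):
    """Hypothetical helper: n_cuts interior points splitting the sample
    into n_cuts + 1 groups of roughly equal size."""
    return quantiles(values, n=n_cuts + 1, method="inclusive")


# A skewed sample: one large outlier stretches the value range.
data = [0.0, 1.0, 2.0, 3.0, 100.0]
print(linspace_cuts(data, 3))  # evenly spaced over the value range
print(quantile_cuts(data, 3))  # follows where the data actually lies
```

On skewed data, linearly spaced cuts can leave most bins nearly empty, whereas quantile-based cuts keep the bins similarly populated.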
Attributes
one_hot_encoder : OneHotEncoder
OneHotEncoders for continuous and discrete features.
bins_boundaries : list
Bins boundaries for continuous features.
mapper : dict
Map modalities to column indexes for categorical features.
feature_type : dict
Type of each feature, keyed by column name.
blocks_start : list
Index of the first column of each block of binarized features.
blocks_length : list
Number of columns in each block of binarized features.
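The relationship between block starts and block lengths can be sketched as follows. This assumes that blocks are laid out contiguously in the output, in the same order as the input features; the page does not state this explicitly, and the widths used here are illustrative values rather than output read from a fitted binarizer.

```python
from itertools import accumulate

# Illustrative block widths: 4 bins for one continuous feature,
# then 2 and 5 modalities for two discrete features.
blocks_length = [4, 2, 5]

# Assumption: each block starts where the previous one ends.
blocks_start = [0] + list(accumulate(blocks_length))[:-1]


def split_row(row, starts, lengths):
    """Hypothetical helper: cut one binarized row into per-feature blocks."""
    return [row[s:s + n] for s, n in zip(starts, lengths)]


row = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1]
print(blocks_start)
print(split_row(row, blocks_start, blocks_length))
```

Slicing by block is convenient when a downstream model needs to treat all binary columns that come from one original feature as a group.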
References
http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
Examples
>>> import numpy as np
>>> from tick.preprocessing import FeaturesBinarizer
>>> features = np.array([[0.00902084, 0., 'z'],
... [0.46599565, 0., 2.],
... [0.52091721, 1., 2.],
... [0.47315496, 1., 1.],
... [0.08180209, 0., 0.],
... [0.45011727, 0., 0.],
... [2.04347947, 1., 20.],
... [-0.9890938, 0., 0.],
... [-0.3063761, 1., 1.],
... [0.27110903, 0., 0.]])
>>> binarizer = FeaturesBinarizer(n_cuts=3)
>>> binarized_features = binarizer.fit_transform(features)
>>> # output comes as a sparse matrix
>>> binarized_features.__class__
<class 'scipy.sparse.csr.csr_matrix'>
>>> # column type is automatically detected
>>> sorted(binarizer.feature_type.items())
[('0', 'continuous'), ('1', 'discrete'), ('2', 'discrete')]
>>> # features are binarized; all columns are kept because remove_first defaults to False
>>> binarized_features.toarray()
array([[1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.],
[0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0., 1., 0., 0., 1., 0., 0.],
[0., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0.],
[0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
[0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
[1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0.],
[0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0.]])
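The 11 output columns are consistent with 4 bins for the continuous feature followed by 2 and 5 columns for the two discrete features. The sketch below illustrates, in plain Python rather than through tick, what the remove_first description implies for the first output row. The helper `drop_first_of_block` is hypothetical, and the actual output of tick with remove_first=True is not shown on this page.

```python
def drop_first_of_block(row, start):
    """Hypothetical helper: copy of `row` without the column at `start`."""
    return row[:start] + row[start + 1:]


# First row of the example output above.
row = [1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.]

# Assumption: the continuous block starts at column 0.
reduced = drop_first_of_block(row, start=0)
print(len(row), len(reduced))
print(reduced)
```

Dropping one column per continuous block removes the redundancy that arises because the columns of a one-hot block always sum to one.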
__init__(method='quantile', n_cuts=10, detect_column_type='auto', remove_first=False, bins_boundaries=None)
Initialize self. See help(type(self)) for accurate signature.
cast_to_array(X)
Cast input matrix to np.ndarray.
output : np.ndarray, np.ndarray
The input matrix and the corresponding column names.
fit(X)
Fit the binarization using the features matrix.
X : pd.DataFrame or np.ndarray, shape=(n_samples, n_features)
The features matrix.
output : FeaturesBinarizer
The fitted current instance.
fit_transform(X, y=None, **kwargs)
Fit and apply the binarization using the features matrix.
X : pd.DataFrame or np.ndarray, shape=(n_samples, n_features)
The features matrix.
output : pd.DataFrame
The binarized features matrix. The number of columns is larger than n_features and smaller than n_cuts * n_features, depending on the actual number of columns that have been binarized.
get_params(deep=True)
Get parameters for this estimator.
deep : boolean, optional
If True, will return the parameters for this estimator and contained subobjects that are estimators.
params : mapping of string to any
Parameter names mapped to their values.
set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
self :
transform(X)
Apply the binarization to the given features matrix.
X : pd.DataFrame or np.ndarray, shape=(n_samples, n_features)
The features matrix.
output : pd.DataFrame
The binarized features matrix. The number of columns is larger than n_features and smaller than n_cuts * n_features, depending on the actual number of columns that have been binarized.
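The split between fit and transform can be illustrated with a toy single-column class: cut points are learned once from the training data and then reused unchanged on new data. `ToyBinarizer` is a hypothetical stand-in written for this sketch; it is not tick's implementation and handles only one continuous column.

```python
from bisect import bisect_left
from statistics import quantiles


class ToyBinarizer:
    """Hypothetical single-column stand-in, not tick's FeaturesBinarizer."""

    def __init__(self, n_cuts=3):
        self.n_cuts = n_cuts
        self.cuts_ = None

    def fit(self, values):
        # Learn the cut points once, from the training column.
        self.cuts_ = quantiles(values, n=self.n_cuts + 1, method="inclusive")
        return self

    def transform(self, values):
        # Reuse the stored cut points; nothing is re-estimated here.
        rows = []
        for v in values:
            row = [0.0] * (len(self.cuts_) + 1)
            row[bisect_left(self.cuts_, v)] = 1.0
            rows.append(row)
        return rows


train = [0.0, 1.0, 2.0, 3.0, 4.0]
binarizer = ToyBinarizer(n_cuts=3).fit(train)
print(binarizer.cuts_)
# New values, including one outside the training range, use the same bins.
print(binarizer.transform([0.5, 10.0]))
```

Fitting on training data and calling transform on held-out data keeps the bin definitions identical across both sets, which is the usual reason to prefer this over calling fit_transform on each set separately.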