tick.preprocessing.FeaturesBinarizer

class tick.preprocessing.FeaturesBinarizer(method='quantile', n_cuts=10, detect_column_type='auto', remove_first=False, bins_boundaries=None)

Transforms continuous data into bucketed binary data.

This is a scikit-learn transformer that transforms an input pandas DataFrame X of shape (n_samples, n_features) into a binary matrix of shape (n_samples, n_new_features). Continuous features are split into buckets and expanded into binary features, using either linearly spaced or quantile-based bins. Discrete features are binary encoded with K columns, where K is the number of modalities. All other features are left unchanged.

Parameters

n_cuts : int, default=10

Number of cut points for continuous features.

method : "quantile", "linspace" or "given", default="quantile"

  • If "quantile", quantile-based cuts are used.

  • If "linspace", linearly spaced cuts are used.

  • If "given", bins_boundaries must be provided.
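To make the difference between the two automatic strategies concrete, the sketch below computes n_cuts interior cut points both ways with NumPy. It is an illustration under assumptions, not tick's implementation: the helper name sketch_cut_points is hypothetical, and tick's quantile interpolation and handling of the outermost boundaries may differ.

```python
import numpy as np


def sketch_cut_points(values, n_cuts, method="quantile"):
    """Return n_cuts interior cut points for a 1-D array (hypothetical helper)."""
    values = np.asarray(values, dtype=float)
    # n_cuts interior points split the range into n_cuts + 1 buckets.
    fractions = np.linspace(0.0, 1.0, n_cuts + 2)[1:-1]
    if method == "quantile":
        # Cuts at evenly spaced quantiles: buckets hold similar numbers of samples.
        return np.quantile(values, fractions)
    if method == "linspace":
        # Cuts evenly spaced between min and max: buckets have equal width.
        low, high = values.min(), values.max()
        return low + fractions * (high - low)
    raise ValueError("method must be 'quantile' or 'linspace'")


# A skewed sample: four small values and one large outlier.
skewed = np.array([0.0, 1.0, 2.0, 3.0, 100.0])
quantile_cuts = sketch_cut_points(skewed, n_cuts=3, method="quantile")
linspace_cuts = sketch_cut_points(skewed, n_cuts=3, method="linspace")
```

On this skewed sample the quantile cuts follow the bulk of the data, while the linearly spaced cuts put four of the five samples into the first bucket.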

detect_column_type : "auto" or "column_names", default="auto"

  • If "auto", feature types are detected automatically.

  • If "column_names", feature types are read from the column names: a name ending with ":continuous" denotes a continuous feature, and a name ending with ":discrete" denotes a discrete feature.
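The naming convention used by "column_names" can be sketched in a few lines of plain Python. The helper sketch_type_from_name and the column names are hypothetical, and returning None for a name with neither suffix is an assumption made for this illustration, not documented tick behaviour.

```python
def sketch_type_from_name(column_name):
    """Infer a feature type from a column-name suffix (hypothetical helper)."""
    if column_name.endswith(":continuous"):
        return "continuous"
    if column_name.endswith(":discrete"):
        return "discrete"
    # Assumption: a column with neither suffix has no declared type.
    return None


# Hypothetical column names following the suffix convention.
columns = ["age:continuous", "country:discrete", "id"]
detected = {name: sketch_type_from_name(name) for name in columns}
```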

remove_first : bool, default=False

If True, first column of each binarized continuous feature block is removed.
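Dropping the first column is useful because the columns of a full one-hot block sum to one on every row, which makes them collinear with an intercept term. The NumPy sketch below illustrates that effect on a single block; the bucket indices are made up, and the code does not call tick.

```python
import numpy as np

# Bucket index of each sample for one continuous feature (4 buckets).
bucket_index = np.array([0, 2, 3, 2, 1])
n_buckets = 4

# Full one-hot block: one column per bucket, exactly one 1 per row.
full_block = np.eye(n_buckets)[bucket_index]

# Dropping the first column loses no information: a row of zeros
# now stands for the first bucket.
reduced_block = full_block[:, 1:]

# Every row of the full block sums to 1, hence the collinearity.
row_sums = full_block.sum(axis=1)
```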

bins_boundaries : list, default=None

Bins boundaries for continuous features.
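As a rough sketch of how a set of boundaries turns one continuous feature into binary columns, the NumPy example below assigns each value to a bucket and one-hot encodes the result. The boundary values are hypothetical, and the exact structure tick expects for bins_boundaries (for instance, how boundaries are associated with each feature) is not specified here.

```python
import numpy as np

# Hypothetical boundaries for one continuous feature; the outer edges are
# infinite so that every value falls into some bucket.
boundaries = np.array([-np.inf, 0.0, 0.5, np.inf])
values = np.array([-0.3, 0.1, 0.45, 2.0])

# np.digitize returns i such that boundaries[i-1] <= x < boundaries[i];
# subtracting 1 gives a zero-based bucket index.
bucket_index = np.digitize(values, boundaries) - 1

# One binary column per bucket.
one_hot = np.eye(len(boundaries) - 1)[bucket_index]
```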

Attributes

one_hot_encoder : OneHotEncoder

OneHotEncoders for continuous and discrete features.

bins_boundaries : list

Bins boundaries for continuous features.

mapper : dict

Map modalities to column indexes for categorical features.

feature_type : dict

Features type.

blocks_start : list

Start index of each block of binarized features.

blocks_length : list

Length of each block of binarized features.
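The relation between the two attributes can be sketched as follows, assuming the blocks are laid out contiguously in the output matrix. The lengths are illustrative values, not output from tick.

```python
from itertools import accumulate

# Illustrative block lengths: 4 bucket columns, then 2 and 5 modality columns.
blocks_length = [4, 2, 5]

# Each block starts where the previous one ends.
blocks_start = [0] + list(accumulate(blocks_length))[:-1]

# Column slice covering each block of the binarized matrix.
block_slices = [slice(start, start + length)
                for start, length in zip(blocks_start, blocks_length)]
```

With these values blocks_start is [0, 4, 6], and the last slice covers columns 6 to 10 of an 11-column matrix.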

References

http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

Examples

>>> import numpy as np
>>> from tick.preprocessing import FeaturesBinarizer
>>> features = np.array([[0.00902084, 0., 'z'],
...                      [0.46599565, 0., 2.],
...                      [0.52091721, 1., 2.],
...                      [0.47315496, 1., 1.],
...                      [0.08180209, 0., 0.],
...                      [0.45011727, 0., 0.],
...                      [2.04347947, 1., 20.],
...                      [-0.9890938, 0., 0.],
...                      [-0.3063761, 1., 1.],
...                      [0.27110903, 0., 0.]])
>>> binarizer = FeaturesBinarizer(n_cuts=3)
>>> binarized_features = binarizer.fit_transform(features)
>>> # output comes as a sparse matrix
>>> binarized_features.__class__
<class 'scipy.sparse.csr.csr_matrix'>
>>> # column type is automatically detected
>>> sorted(binarizer.feature_type.items())
[('0', 'continuous'), ('1', 'discrete'), ('2', 'discrete')]
>>> # features are binarized (all columns are kept, since remove_first=False by default)
>>> binarized_features.toarray()
array([[1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0., 1., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0.]])

__init__(method='quantile', n_cuts=10, detect_column_type='auto', remove_first=False, bins_boundaries=None)

Initialize self. See help(type(self)) for accurate signature.

static cast_to_array(X)

Cast input matrix to np.ndarray.

Returns

output : np.ndarray, np.ndarray

The input matrix and the corresponding column names.

fit(X)

Fit the binarization using the features matrix.

Parameters

X : pd.DataFrame or np.ndarray, shape=(n_samples, n_features)

The features matrix.

Returns

output : FeaturesBinarizer

The fitted current instance.

fit_transform(X, y=None, **kwargs)

Fit and apply the binarization using the features matrix.

Parameters

X : pd.DataFrame or np.ndarray, shape=(n_samples, n_features)

The features matrix.

Returns

output : scipy.sparse.csr_matrix

The binarized features matrix. Its number of columns depends on which features are binarized: each continuous feature is expanded into a block of bin indicator columns and each discrete feature into one column per modality, so the matrix usually has more than n_features columns.

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params : mapping of string to any

Parameter names mapped to their values.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Returns

self :
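The <component>__<parameter> naming can be illustrated with a small plain-Python sketch that splits a name at its first double underscore. The helper sketch_split_param and the component names "binarizer" and "pipeline" are hypothetical; this shows the naming convention only, not scikit-learn's actual set_params code.

```python
def sketch_split_param(name):
    """Split '<component>__<parameter>' at the first double underscore."""
    component, delimiter, parameter = name.partition("__")
    if not delimiter:
        # No double underscore: the parameter belongs to the estimator itself.
        return None, name
    return component, parameter


simple = sketch_split_param("n_cuts")
nested = sketch_split_param("binarizer__n_cuts")
# Deeper nesting leaves the remainder to be resolved by the component.
deeper = sketch_split_param("pipeline__binarizer__n_cuts")
```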

transform(X)

Apply the binarization to the given features matrix.

Parameters

X : pd.DataFrame or np.ndarray, shape=(n_samples, n_features)

The features matrix.

Returns

output : scipy.sparse.csr_matrix

The binarized features matrix. Its number of columns depends on which features are binarized: each continuous feature is expanded into a block of bin indicator columns and each discrete feature into one column per modality, so the matrix usually has more than n_features columns.