tick.preprocessing.FeaturesBinarizer

class tick.preprocessing.FeaturesBinarizer(method='quantile', n_cuts=10, detect_column_type='auto', remove_first=False, bins_boundaries=None)

Transforms continuous data into bucketed binary data.

This is a scikit-learn transformer that transforms an input pandas DataFrame X of shape (n_samples, n_features) into a binary matrix of shape (n_samples, n_new_features). Continuous features are split into bins, using either linearly spaced or inter-quantile cut points, and extended into one binary feature per bin. Discrete features are binary encoded into K columns, where K is the number of modalities. Features that are neither continuous nor discrete are left unchanged.
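The discrete case described above can be sketched with plain NumPy. This is an illustration of the idea only, not the library's implementation: the helper `encode_discrete` and the choice to order modalities by sorting are assumptions made for this example.

```python
import numpy as np

def encode_discrete(column):
    """One-hot encode a discrete column: one output column per modality.

    Illustrative helper, not part of tick.
    """
    # Assumption of this sketch: modalities are ordered by sorting.
    modalities = sorted(set(column))
    # Map each modality to the index of its output column.
    mapper = {modality: index for index, modality in enumerate(modalities)}
    encoded = np.zeros((len(column), len(modalities)))
    for row, value in enumerate(column):
        encoded[row, mapper[value]] = 1.0
    return encoded, mapper

encoded, mapper = encode_discrete(["b", "a", "c", "a"])
# K = 3 modalities, so the encoded block has 3 columns
# and exactly one 1 per row.
```

Each row contains a single 1, placed in the column assigned to that row's modality.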

Parameters:

n_cuts : int, default=10

Number of cut points for continuous features.

method : “quantile”, “linspace” or “given”, default=“quantile”

  • If "quantile", quantile-based cuts are used.

  • If "linspace", linearly spaced cuts are used.

  • If "given", bins_boundaries must be provided.
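To make the difference between quantile-based and linearly spaced cuts concrete, here is a minimal NumPy sketch. The helpers `cut_points` and `bucketize` are hypothetical names introduced for this example, and the percentile interpolation used here is NumPy's default, so the cut points the library actually computes may differ.

```python
import numpy as np

def cut_points(values, n_cuts, method="quantile"):
    """Return n_cuts interior cut points for a 1-D array (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    if method == "quantile":
        # Evenly spaced percentiles, excluding the 0th and the 100th.
        probs = np.linspace(0, 100, n_cuts + 2)[1:-1]
        return np.percentile(values, probs)
    if method == "linspace":
        # Evenly spaced values between the minimum and the maximum.
        return np.linspace(values.min(), values.max(), n_cuts + 2)[1:-1]
    raise ValueError(f"unknown method: {method!r}")

def bucketize(values, cuts):
    """One-hot encode values into len(cuts) + 1 buckets (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    bucket = np.searchsorted(cuts, values, side="left")
    one_hot = np.zeros((values.size, len(cuts) + 1))
    one_hot[np.arange(values.size), bucket] = 1.0
    return one_hot

skewed = np.array([0.0, 1.0, 2.0, 3.0, 100.0])
quantile_cuts = cut_points(skewed, 3, "quantile")  # 25th, 50th, 75th percentiles
linspace_cuts = cut_points(skewed, 3, "linspace")  # 25.0, 50.0, 75.0
linspace_counts = bucketize(skewed, linspace_cuts).sum(axis=0)
```

On this skewed sample the linearly spaced cuts put four of the five values in the first bucket and leave the two middle buckets empty, while the quantile cuts follow the data. This is one reason to prefer quantile cuts for heavy-tailed features.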

detect_column_type : “auto” or “column_names”, default=“auto”

  • If "auto", feature types are detected automatically.

  • If "column_names", feature types are read from the column names: a name ending with “:continuous” marks a continuous feature and a name ending with “:discrete” marks a discrete feature.
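The naming convention used by "column_names" can be illustrated as follows. This sketch only parses suffixes. The helper `type_from_column_name`, the sample column names, and the choice to return `None` for names without a recognized suffix are assumptions of this example, not documented library behavior.

```python
def type_from_column_name(name):
    """Read the feature type from a column-name suffix (illustrative helper)."""
    if name.endswith(":continuous"):
        return "continuous"
    if name.endswith(":discrete"):
        return "discrete"
    # Assumption of this sketch: names without a recognized suffix get no type.
    return None

# Hypothetical column names for a pandas DataFrame passed to the transformer.
columns = ["age:continuous", "gender:discrete", "note"]
detected = {name: type_from_column_name(name) for name in columns}
```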

remove_first : bool, default=False

If True, the first column of each binarized continuous feature block is removed.
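Why dropping the first column can help is easiest to see on a small block. In a complete one-hot block every row sums to 1, so the columns are collinear with an intercept column. The sketch below uses a hand-written block and plain NumPy; it does not call the library.

```python
import numpy as np

# A hand-written one-hot block for one continuous feature with 3 buckets.
block = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])

# Dropping the first column: a row of zeros now means "first bucket".
reduced = block[:, 1:]

# Add an intercept column, as a linear model typically would.
intercept = np.ones(block.shape[0])
rank_full = np.linalg.matrix_rank(np.column_stack([intercept, block]))
rank_reduced = np.linalg.matrix_rank(np.column_stack([intercept, reduced]))
```

With the full block, the design matrix has 4 columns but rank 3. After removing the first column it has 3 columns and rank 3, so the redundancy is gone and no information is lost.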

bins_boundaries : list, default=None

Bins boundaries for continuous features.

Attributes:

one_hot_encoder : OneHotEncoder

OneHotEncoders for continuous and discrete features.

bins_boundaries : list

Bins boundaries for continuous features.

mapper : dict

Map modalities to column indexes for categorical features.

feature_type : dict

Type of each feature ("continuous" or "discrete"), keyed by feature name.

blocks_start : list

Start index of each block of binarized features.

blocks_length : list

Length of each block of binarized features.
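The relationship between the two block attributes can be sketched as a cumulative sum. The values below are assumptions inferred from the Examples output on this page (4 columns for the continuous feature, then 2 and 5 columns for the two discrete features); they are not read from a fitted object.

```python
import numpy as np

# Assumed block lengths, inferred from the 11-column example output.
blocks_length = [4, 2, 5]

# Each block starts where the previous one ends.
blocks_start = [0] + np.cumsum(blocks_length)[:-1].tolist()

# First row of the binarized example matrix.
row = np.array([1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.])

# Slice the row back into one one-hot vector per original feature.
blocks = [row[start:start + length]
          for start, length in zip(blocks_start, blocks_length)]
```

Under these assumptions the starts are 0, 4 and 6, and each slice holds exactly one 1, which identifies the bucket or modality of the corresponding original feature.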

References

http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing

Examples

>>> import numpy as np
>>> from tick.preprocessing import FeaturesBinarizer
>>> features = np.array([[0.00902084, 0., 'z'],
...                      [0.46599565, 0., 2.],
...                      [0.52091721, 1., 2.],
...                      [0.47315496, 1., 1.],
...                      [0.08180209, 0., 0.],
...                      [0.45011727, 0., 0.],
...                      [2.04347947, 1., 20.],
...                      [-0.9890938, 0., 0.],
...                      [-0.3063761, 1., 1.],
...                      [0.27110903, 0., 0.]])
>>> binarizer = FeaturesBinarizer(n_cuts=3)
>>> binarized_features = binarizer.fit_transform(features)
>>> # output comes as a sparse matrix
>>> binarized_features.__class__
<class 'scipy.sparse.csr.csr_matrix'>
>>> # column type is automatically detected
>>> sorted(binarizer.feature_type.items())
[('0', 'continuous'), ('1', 'discrete'), ('2', 'discrete')]
>>> # features are binarized (remove_first defaults to False, so no column is dropped)
>>> binarized_features.toarray()
array([[1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0., 1., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0.]])