
class tick.preprocessing.FeaturesBinarizer(method='quantile', n_cuts=10, detect_column_type='auto', remove_first=False, bins_boundaries=None)[source]

Transforms continuous data into bucketed binary data.

This is a scikit-learn transformer that transforms an input pandas DataFrame X of shape (n_samples, n_features) into a binary matrix of shape (n_samples, n_new_features). Continuous features are split into bins, spaced either linearly or by quantiles, and each bin becomes one binary feature. Discrete features are binary encoded with K columns, where K is the number of modalities. Features that are neither continuous nor discrete are left unchanged.


n_cuts : int, default=10

Number of cut points for continuous features.

method : "quantile", "linspace" or "given", default="quantile"

  • If "quantile", quantile-based cuts are used.

  • If "linspace", linearly spaced cuts are used.

  • If "given", bins_boundaries must be provided.
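As a rough illustration of how the first two options differ, the sketch below computes n_cuts interior cut points with plain NumPy. The helper name cut_points is hypothetical and this is not tick's implementation; in particular, the way the outermost bins are bounded is an assumption made here.

```python
import numpy as np


def cut_points(values, n_cuts, method="quantile"):
    """Return n_cuts interior cut points (illustrative helper, not part of tick)."""
    values = np.asarray(values, dtype=float)
    # n_cuts interior points split the range into n_cuts + 1 bins
    fractions = np.linspace(0.0, 1.0, n_cuts + 2)[1:-1]
    if method == "quantile":
        # cuts at evenly spaced quantiles: roughly equal counts per bin
        return np.quantile(values, fractions)
    if method == "linspace":
        # cuts evenly spaced between min and max: equal-width bins
        low, high = values.min(), values.max()
        return low + fractions * (high - low)
    raise ValueError("method must be 'quantile' or 'linspace'")


# A skewed sample makes the difference between the two methods visible.
skewed = [0.0, 1.0, 2.0, 3.0, 100.0]
quantile_cuts = cut_points(skewed, n_cuts=3, method="quantile")
linspace_cuts = cut_points(skewed, n_cuts=3, method="linspace")
```

On skewed data the quantile cuts stay close to where most samples lie, while the linearly spaced cuts are pulled towards the outlier.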

detect_column_type : "auto" or "column_names", default="auto"

  • If "auto", feature types are detected automatically.

  • If "column_names", feature types are read from the column names: a name ending with ":continuous" marks a continuous feature and a name ending with ":discrete" marks a discrete feature.
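The naming convention for "column_names" can be sketched as follows. The helper feature_type_from_name and the column names are hypothetical, and returning None for a name without a recognised suffix is an assumption of this sketch, not documented behaviour.

```python
def feature_type_from_name(column_name):
    """Read the feature type from a column-name suffix (hypothetical helper)."""
    if column_name.endswith(":continuous"):
        return "continuous"
    if column_name.endswith(":discrete"):
        return "discrete"
    # no recognised suffix: this sketch reports None rather than guessing
    return None


# Made-up column names used only for illustration.
columns = ["age:continuous", "gender:discrete", "comment"]
detected = {name: feature_type_from_name(name) for name in columns}
```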

remove_first : bool, default=False

If True, the first column of each binarized continuous feature block is removed.
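To see what dropping the first column means, here is a minimal NumPy sketch of one-hot encoding a single continuous feature against fixed cut points. The helper one_hot_bins is hypothetical and only mirrors the described behaviour; it is not tick's code.

```python
import numpy as np


def one_hot_bins(values, cuts, remove_first=False):
    """One-hot encode the bin of each value (illustrative helper, not part of tick)."""
    values = np.asarray(values, dtype=float)
    cuts = np.asarray(cuts, dtype=float)
    # bin index = number of cut points less than or equal to the value
    bin_index = np.digitize(values, cuts)
    n_bins = len(cuts) + 1
    # pick one row of the identity matrix per sample
    block = np.eye(n_bins)[bin_index]
    # optionally drop the column of the first bin
    return block[:, 1:] if remove_first else block


full = one_hot_bins([0.5, 1.5, 2.5], cuts=[1.0, 2.0])
reduced = one_hot_bins([0.5, 1.5, 2.5], cuts=[1.0, 2.0], remove_first=True)
```

Every row of the full block sums to one, so its columns are linearly dependent on a constant column. In the reduced block, a sample from the first bin is encoded as an all-zero row.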

bins_boundaries : list, default=None

Bins boundaries for continuous features.


one_hot_encoder : OneHotEncoder

OneHotEncoders for continuous and discrete features.

bins_boundaries : list

Bins boundaries for continuous features.

mapper : dict

Map modalities to column indexes for categorical features.

feature_type : dict

Features type.

blocks_start : list

List of the indices at which each block of binarized features begins.

blocks_length : list

Length of each block of binarized features.
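The relation between the two lists can be sketched with NumPy, assuming that blocks_start holds the column offset of each block and blocks_length its width. The widths below are an assumption chosen to add up to the 11 columns of the example output on this page; they are not read from a fitted binarizer.

```python
import numpy as np

# Assumed block widths: 4 bins, then 2 modalities, then 5 modalities.
blocks_length = [4, 2, 5]
# Each block starts where the previous ones end.
blocks_start = np.concatenate(([0], np.cumsum(blocks_length)[:-1])).tolist()

# First row of the example output, split back into one slice per feature.
row = np.array([1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.])
blocks = [row[start:start + length]
          for start, length in zip(blocks_start, blocks_length)]
```

Slicing with these offsets recovers the columns that belong to each original feature.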




>>> import numpy as np
>>> from tick.preprocessing import FeaturesBinarizer
>>> features = np.array([[0.00902084, 0., 'z'],
...                      [0.46599565, 0., 2.],
...                      [0.52091721, 1., 2.],
...                      [0.47315496, 1., 1.],
...                      [0.08180209, 0., 0.],
...                      [0.45011727, 0., 0.],
...                      [2.04347947, 1., 20.],
...                      [-0.9890938, 0., 0.],
...                      [-0.3063761, 1., 1.],
...                      [0.27110903, 0., 0.]])
>>> binarizer = FeaturesBinarizer(n_cuts=3)
>>> binarized_features = binarizer.fit_transform(features)
>>> # output comes as a sparse matrix
>>> binarized_features.__class__
<class 'scipy.sparse.csr.csr_matrix'>
>>> # column type is automatically detected
>>> sorted(binarizer.feature_type.items())
[('0', 'continuous'), ('1', 'discrete'), ('2', 'discrete')]
>>> # features are binarized (no column is dropped, since remove_first=False)
>>> binarized_features.toarray()
array([[1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1.],
       [0., 0., 1., 0., 1., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0., 1., 0., 0., 1., 0., 0.],
       [0., 0., 1., 0., 0., 1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.],
       [1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0.]])
__init__(method='quantile', n_cuts=10, detect_column_type='auto', remove_first=False, bins_boundaries=None)[source]

Initialize self. See help(type(self)) for accurate signature.

static cast_to_array(X)[source]

Cast input matrix to np.ndarray.


output : np.ndarray, np.ndarray

The input matrix and the corresponding column names.


fit(X)

Fit the binarization using the features matrix.


X : pd.DataFrame or np.ndarray, shape=(n_samples, n_features)

The features matrix.


output : FeaturesBinarizer

The fitted current instance.

fit_transform(X, y=None, **kwargs)[source]

Fit and apply the binarization using the features matrix.


X : pd.DataFrame or np.ndarray, shape=(n_samples, n_features)

The features matrix.


output : scipy.sparse.csr_matrix

The binarized features matrix. Its number of columns depends on which features were binarized: each continuous feature contributes one column per bin (minus the first one when remove_first is True) and each discrete feature one column per modality.


get_params(deep=True)

Get parameters for this estimator.


deep : boolean, optional

If True, will return the parameters for this estimator and contained subobjects that are estimators.


params : mapping of string to any

Parameter names mapped to their values.


set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
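The naming scheme can be illustrated with a small sketch that splits such a name at the first double underscore. The helper split_nested_name and the component name "binarizer" are hypothetical; the sketch shows only the convention, not scikit-learn's internal code.

```python
def split_nested_name(name):
    """Split '<component>__<parameter>' into its two parts (illustrative helper)."""
    component, separator, parameter = name.partition("__")
    if not separator:
        # no double underscore: the name refers to the estimator itself
        return None, name
    return component, parameter


# "binarizer" stands for a made-up pipeline step name.
nested = split_nested_name("binarizer__n_cuts")
plain = split_nested_name("n_cuts")
```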


self :


transform(X)

Apply the binarization to the given features matrix.


X : pd.DataFrame or np.ndarray, shape=(n_samples, n_features)

The features matrix.


output : scipy.sparse.csr_matrix

The binarized features matrix. Its number of columns depends on which features were binarized: each continuous feature contributes one column per bin (minus the first one when remove_first is True) and each discrete feature one column per modality.