MAFESE Library

mafese.selector

class mafese.selector.Selector(problem='classification')

Bases: abc.ABC

Defines the abstract base class for feature selectors.

SUPPORTED_CLASSIFICATION_METRICS = ['PS', 'NPV', 'RS', 'AS', 'F1S', 'F2S', 'FBS', 'SS', 'MCC', 'HS', 'CKS', 'JSI', 'GMS', 'ROC-AUC', 'LS', 'GINI', 'CEL', 'HL', 'KLDL', 'BSL']
SUPPORTED_ESTIMATORS = ['knn', 'svm', 'rf', 'adaboost', 'xgb', 'tree', 'ann']
SUPPORTED_PROBLEMS = ['classification', 'regression']
SUPPORTED_REGRESSION_METRICS = ['EVS', 'ME', 'MAE', 'MSE', 'RMSE', 'MSLE', 'MedAE', 'MRE', 'MRB', 'MAPE', 'SMAPE', 'MAAPE', 'MASE', 'NSE', 'NNSE', 'WI', 'R', 'PCC', 'AR', 'APCC', 'R2S', 'RSQ', 'R2', 'COD', 'AR2', 'ACOD', 'CI', 'DRV', 'KGE', 'GINI', 'GINI_WIKI', 'PCD', 'JSD', 'VAF', 'RAE', 'A10', 'A20', 'A30', 'NRMSE', 'RSE', 'COV', 'COR', 'EC', 'OI', 'CRM']

evaluate(estimator=None, estimator_paras=None, data=None, metrics=None)

Evaluate the new (feature-selected) dataset. The estimator is retrained on the training set, and the metrics are returned for both the training and testing sets.

Parameters
  • estimator (str or Estimator instance (from scikit-learn or custom)) –

    If estimator is a str, the following values are currently supported:
    • knn: k-nearest neighbors

    • svm: support vector machine

    • rf: random forest

    • adaboost: AdaBoost

    • xgb: Gradient Boosting

    • tree: Extra Trees

    • ann: Artificial Neural Network (Multi-Layer Perceptron)

    If estimator is an Estimator instance, it must provide fit and predict methods.

  • estimator_paras (None or dict, default = None) – The parameters of the estimator; see the official scikit-learn documentation for the selected estimator. If None, the default parameters of the selected estimator are used.

  • data (Data) – An instance of the Data class. It must contain both a training set and a testing set.

  • metrics (tuple or list, default = None) – The metrics to compute, which depend on whether the problem is regression or classification. The supported metrics are listed at: https://github.com/thieu1995/permetrics

Returns

metrics_results – The metrics for both the training and testing sets.

Return type

dict
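
The contract described for evaluate (an estimator that exposes fit and predict, retrained on the training set and scored on both sets) can be sketched without the library. This is a minimal illustration under stated assumptions, not MAFESE's implementation: the NearestCentroid class, the evaluate_sketch helper, the accuracy metric, and the keys of the returned dict are all hypothetical names chosen for this example.

```python
import numpy as np


class NearestCentroid:
    """Hypothetical custom estimator exposing the required fit/predict pair."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One centroid per class: the mean of that class's training rows.
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance from every sample to every centroid, then pick the nearest.
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[dists.argmin(axis=1)]


def evaluate_sketch(estimator, X_train, y_train, X_test, y_test):
    """Retrain on the training split and score both splits (accuracy only)."""
    estimator.fit(X_train, y_train)
    return {
        "accuracy_train": float(np.mean(estimator.predict(X_train) == y_train)),
        "accuracy_test": float(np.mean(estimator.predict(X_test) == y_test)),
    }


# Two well-separated clusters, so the toy classifier separates them easily.
X_train = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.1, 0.2], [4.8, 5.0]])
y_test = np.array([0, 1])

results = evaluate_sketch(NearestCentroid(), X_train, y_train, X_test, y_test)
print(results)
```

Any object with the same two methods could take the place of NearestCentroid in this sketch, which is the point of the fit/predict requirement.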

fit(X, y=None)

Learn the features to select from X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.

  • y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.

Returns

self – Returns the instance itself.

Return type

object

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray of shape (n_samples, n_features_new)

name = 'Feature Selector'

transform(X)

Reduce X to the selected features.

Parameters

X (array of shape [n_samples, n_features]) – The input samples.

Returns

X_r – The input samples with only the selected features.

Return type

array of shape [n_samples, n_selected_features]

mafese.filter

class mafese.filter.FilterSelector(problem='classification', method='ANOVA', n_features=3, n_neighbors=5, n_bins=10, normalized=True)

Bases: mafese.selector.Selector

Defines a FilterSelector class that holds all filter methods for feature selection problems.

Parameters
  • problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”

  • method (str, default = "ANOVA") –

    If problem = "classification", method can be one of the following values:

    • "CHI": Chi-Squared statistic

    • "ANOVA": ANOVA F-score

    • "MI": Mutual information

    • "KENDALL": Kendall Tau correlation

    • "SPEARMAN": Spearman's Rho correlation

    • "POINT": Point-biserial correlation

    • "RELIEF": Original Relief method

    • "RELIEF-F": Weighted average Relief based on the frequency of each class

    • "VLS-RELIEF-F": Very Large Scale ReliefF

    If problem = "regression", method can be one of the following values:

    • "PEARSON": Pearson correlation

    • "ANOVA": ANOVA F-score

    • "MI": Mutual information

    • "KENDALL": Kendall Tau correlation

    • "SPEARMAN": Spearman's Rho correlation

    • "POINT": Point-biserial correlation

    • "RELIEF": Original Relief method

    • "RELIEF-F": Weighted average Relief based on the frequency of each class

    • "VLS-RELIEF-F": Very Large Scale ReliefF

  • n_features (int or float, default=3) – If integer, the parameter is the absolute number of features to select. If float between 0 and 1, it is the fraction of features to select.

  • n_neighbors (int, default=5, Optional) – Number of neighbors to use for computing feature importance scores of Relief-based family

  • n_bins (int, default=10, Optional) – Number of bins to use for discretizing the target variable of Relief-based family in regression problems.

  • normalized (bool, default=True, Optional) – Normalize feature importance scores by the number of instances in the dataset
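
The two interpretations of n_features (an absolute count for an int, a fraction for a float between 0 and 1) can be sketched as follows. The helper resolve_n_features is hypothetical, and its truncation and clamping rules are assumptions made for illustration; the library may round or validate differently.

```python
def resolve_n_features(n_features, n_total):
    """Turn an int count or a float fraction into a concrete feature count.

    Hypothetical helper: the rounding and clamping rules are assumptions.
    """
    if isinstance(n_features, bool):
        raise TypeError("n_features must be an int or a float, not a bool")
    if isinstance(n_features, int):
        if not 1 <= n_features <= n_total:
            raise ValueError("integer n_features must lie in [1, n_total]")
        return n_features
    if isinstance(n_features, float):
        if not 0.0 < n_features < 1.0:
            raise ValueError("float n_features must lie strictly between 0 and 1")
        # Assumed rule: truncate, but always keep at least one feature.
        return max(1, int(n_features * n_total))
    raise TypeError("n_features must be an int or a float")


print(resolve_n_features(3, 10))    # absolute count
print(resolve_n_features(0.5, 10))  # fraction of the 10 available features
```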

n_features

The number of selected features.

Type

int

supported_methods

Dictionary whose keys are the supported method names and whose values are the corresponding method functions.

Type

dict
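
A name-to-function dictionary of this kind lets a filter selector look up a scoring function by method name, score every feature, and keep the best ones. The sketch below illustrates that general pattern with NumPy only. The SCORERS registry, both scoring functions, and select_top_k are hypothetical and do not reproduce the library's functions; the Spearman variant assumes tie-free data.

```python
import numpy as np


def pearson_scores(X, y):
    """Absolute Pearson correlation between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant columns have zero variance; score them 0 instead of dividing by 0.
    safe = np.where(denom == 0, 1.0, denom)
    return np.where(denom == 0, 0.0, np.abs(Xc.T @ yc) / safe)


def spearman_scores(X, y):
    """Spearman correlation as Pearson on ranks (assumes no tied values)."""
    rank_X = X.argsort(axis=0).argsort(axis=0).astype(float)
    rank_y = y.argsort().argsort().astype(float)
    return pearson_scores(rank_X, rank_y)


# Hypothetical registry: method name -> scoring function.
SCORERS = {"PEARSON": pearson_scores, "SPEARMAN": spearman_scores}


def select_top_k(X, y, method, k):
    """Score every feature with the named method and keep the k best."""
    scores = SCORERS[method](X, y)
    # argsort is ascending, so take the last k and restore column order.
    return np.sort(np.argsort(scores)[-k:])


# Synthetic data in which only columns 1 and 4 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 4] + 0.1 * rng.normal(size=200)
print(select_top_k(X, y, "PEARSON", k=2))
```

Keeping the scoring functions behind a dictionary means a new method only needs a new entry; the selection logic stays unchanged.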

method_name

The method that will be used

Type

str

Examples

The following example shows how to retrieve the most informative features with the FilterSelector method.

>>> import pandas as pd
>>> from mafese.filter import FilterSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]     # Assumes the last column is the label column
>>> # define mafese feature selection method
>>> feat_selector = FilterSelector(problem='classification', method='SPEARMAN', n_features=5)
>>> # find all relevant features
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> feat_selector.selected_feature_masks
array([ True, True, True, False, False, True, False, False, False, True])
>>> feat_selector.selected_feature_solution
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> feat_selector.selected_feature_indexes
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
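
The three attributes shown in the example encode the same selection in different forms. The NumPy-only sketch below reuses the mask values from the example output to show how they relate and what filtering by them does to an array. It does not call the library, and the toy array X is an assumption made for illustration.

```python
import numpy as np

# Mask copied from the example output above.
mask = np.array([True, True, True, False, False, True, False, False, False, True])

solution = mask.astype(int)        # 0/1 encoding of the same selection
indexes = np.flatnonzero(mask)     # positions of the selected columns

# Toy data with 10 columns; column j holds the value j in every row.
X = np.tile(np.arange(10), (4, 1))
X_filtered = X[:, mask]            # equivalent to X[:, indexes]

print(solution)
print(indexes)
print(X_filtered.shape)
```
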
SUPPORT = {'classification': {'ANOVA': 'f_classification_func', 'CHI': 'chi2_func', 'KENDALL': 'kendall_func', 'MI': 'mutual_info_classif', 'POINT': 'point_func', 'RELIEF': 'relief_func', 'RELIEF-F': 'relief_f_func', 'SPEARMAN': 'spearman_func', 'VLS-RELIEF-F': 'vls_relief_f_func'}, 'regression': {'ANOVA': 'f_regression_func', 'KENDALL': 'kendall_func', 'MI': 'mutual_info_regression', 'PEARSON': 'r_regression', 'POINT': 'point_func', 'RELIEF': 'relief_func', 'RELIEF-F': 'relief_f_func', 'SPEARMAN': 'spearman_func', 'VLS-RELIEF-F': 'vls_relief_f_func'}}

fit(X, y=None)

Learn the features to select from X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.

  • y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.

Returns

self – Returns the instance itself.

Return type

object

fit_transform(X, y=None, **fit_params)

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray of shape (n_samples, n_features_new)

transform(X)

Reduce X to the selected features.

Parameters

X (array of shape [n_samples, n_features]) – The input samples.

Returns

X_r – The input samples with only the selected features.

Return type

array of shape [n_samples, n_selected_features]

mafese submodule