Welcome to MAFESE’s documentation!


MAFESE (Metaheuristic Algorithms for FEature SElection) is the largest Python library focused on feature selection using metaheuristic algorithms.

  • Free software: GNU General Public License (GPL) V3 license

  • Total Wrapper-based (Metaheuristic Algorithms): > 200 methods

  • Total Filter-based (Statistical-based): > 15 methods

  • Total Embedded-based (Tree and Lasso): > 10 methods

  • Total Unsupervised-based: >= 4 methods

  • Total datasets: >= 30 (47 classification and 7 regression datasets)

  • Total performance metrics: >= 61 (45 for regression and 16 for classification)

  • Total objective functions (as fitness functions): >= 61 (45 for regression and 16 for classification)

  • Documentation: https://mafese.readthedocs.io/en/latest/

  • Python versions: >= 3.7.x

  • Dependencies: numpy, scipy, scikit-learn, pandas, mealpy, permetrics, plotly, kaleido

Features

  • Our library provides a wide range of state-of-the-art feature selection methods:
    • Filter-based FS

    • Embedded-based FS
      • Regularization (Lasso-based)

      • Tree-based methods

    • Wrapper-based FS
      • Sequential-based: forward and backward

      • Recursive-based

      • MHA-based: Metaheuristic Algorithms

    • Unsupervised-based FS

  • We have implemented all feature selection methods on top of scipy, scikit-learn, and numpy to speed up the algorithms.

Installation

  • Install the current PyPI release:

    $ pip install mafese==0.1.9
    
  • Install directly from source code:

    $ git clone https://github.com/thieu1995/mafese.git
    $ cd mafese
    $ python setup.py install
    
  • In case you want to install the development version from GitHub:

    $ pip install git+https://github.com/thieu1995/mafese
    

After installation, you can import MAFESE as any other Python module:

$ python
>>> import mafese
>>> mafese.__version__

Library structure

The current structure of the library:

docs
examples
mafese
   data/
      cls/
      ...csv
      reg/
      ...csv
   wrapper/
      mha.py
      recursive.py
      sequential.py
   embedded/
      lasso.py
      tree.py
   filter.py
   unsupervised.py
   utils/
      correlation.py
      data_loader.py
      encoder.py
      estimator.py
      mealpy_util.py
      transfer.py
      validator.py
   __init__.py
   selector.py
README.md
setup.py

Examples

Let’s go through some examples.

First, you need to load your dataset, or you can load one of MAFESE's available datasets:

# Load available dataset from MAFESE
from mafese import get_dataset

# Try an unknown dataset name: MAFESE will list the available datasets and prompt you to pick one
get_dataset("unknown")
# Enter: 1

data = get_dataset("Arrhythmia")

Load your own dataset if you want:

import pandas as pd
from mafese import Data

# load X and y
# NOTE mafese accepts numpy arrays only, hence the .values attribute
dataset = pd.read_csv('examples/dataset.csv', index_col=0).values
X, y = dataset[:, 0:-1], dataset[:, -1]
data = Data(X, y)

Next, split dataset into train and test set:

data.split_train_test(test_size=0.2, inplace=True)
print(data.X_train[:2].shape)
print(data.y_train[:2].shape)

Next, how to use the Recursive wrapper-based method:

from mafese.wrapper.recursive import RecursiveSelector

# define mafese feature selection method
feat_selector = RecursiveSelector(problem="classification", estimator="rf", n_features=5)

# find all relevant features - 5 features should be selected
feat_selector.fit(data.X_train, data.y_train)

# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)

# check the index of selected features
print(feat_selector.selected_feature_indexes)

# call transform() on X to filter it down to selected features
X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)
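
Since every selector inherits from the base Selector class, the evaluate() helper shown in the later examples works here as well. A minimal sketch, assuming the data object from above has already been split into training and testing sets:

from sklearn.svm import SVC

# Re-train a fresh estimator on the selected features and get metrics for both sets
results = feat_selector.evaluate(estimator=SVC(), data=data, metrics=["AS", "PS", "RS"])
print(results)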

Or, how to use the Sequential (backward or forward) wrapper-based method:

from mafese.wrapper.sequential import SequentialSelector

# define mafese feature selection method
feat_selector = SequentialSelector(problem="classification", estimator="knn", n_features=3, direction="forward")

# find all relevant features - 3 features should be selected
feat_selector.fit(data.X_train, data.y_train)

# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)

# check the index of selected features
print(feat_selector.selected_feature_indexes)

# call transform() on X to filter it down to selected features
X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)

Or, how to use Filter-based feature selection with different correlation methods:

from mafese.filter import FilterSelector

# define mafese feature selection method
feat_selector = FilterSelector(problem='classification', method='SPEARMAN', n_features=5)

# find all relevant features - 5 features should be selected
feat_selector.fit(data.X_train, data.y_train)

# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)

# check the index of selected features
print(feat_selector.selected_feature_indexes)

# call transform() on X to filter it down to selected features
X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)

Or, use Metaheuristic-based feature selection with different metaheuristic algorithms:

from mafese.wrapper.mha import MhaSelector
from mafese import get_dataset
from mafese import evaluator
from sklearn.svm import SVC

data = get_dataset("Arrhythmia")
data.split_train_test(test_size=0.2)
print(data.X_train.shape, data.X_test.shape)            # (361, 279) (91, 279)

# define mafese feature selection method
feat_selector = MhaSelector(problem="classification", estimator="knn",
                            optimizer="BaseGA", optimizer_paras=None,
                            transfer_func="vstf_01", obj_name="AS")
# find all relevant features
feat_selector.fit(data.X_train, data.y_train, fit_weights=(0.9, 0.1), verbose=True)

# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)

# check the index of selected features
print(feat_selector.selected_feature_indexes)

# call transform() on X to filter it down to selected features
X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)

# Evaluate final dataset with different estimator with multiple performance metrics
results = feat_selector.evaluate(estimator=SVC(), data=data, metrics=["AS", "PS", "RS"])
print(results)
# {'AS_train': 0.77176, 'PS_train': 0.54177, 'RS_train': 0.6205, 'AS_test': 0.72636, 'PS_test': 0.34628, 'RS_test': 0.52747}
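
You can also tune the optimizer itself. The sketch below passes custom optimizer_paras (the MhaSelector documentation notes that at least epoch and pop_size are required) and a termination dictionary; the max_epoch key used here follows the Mealpy convention and is an assumption, not something fixed by MAFESE:

# Sketch: a tuned optimizer. "OriginalPSO", "sstf_02" and "F1S" are all listed in MhaSelector.SUPPORT.
feat_selector = MhaSelector(problem="classification", estimator="knn",
                            optimizer="OriginalPSO",
                            optimizer_paras={"epoch": 50, "pop_size": 30},
                            transfer_func="sstf_02", obj_name="F1S")
# The termination dict is passed through to Mealpy; the key below is an assumption
feat_selector.fit(data.X_train, data.y_train, fit_weights=(0.9, 0.1),
                  verbose=False, mode="single", termination={"max_epoch": 50})
X_train_selected = feat_selector.transform(data.X_train)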

Or, use Lasso-based feature selection with different estimators:

from mafese.embedded.lasso import LassoSelector
from mafese import get_dataset
from mafese import evaluator
from sklearn.svm import SVC


data = get_dataset("Arrhythmia")
data.split_train_test(test_size=0.2)
print(data.X_train.shape, data.X_test.shape)            # (361, 279) (91, 279)

# define mafese feature selection method
feat_selector = LassoSelector(problem="classification", estimator="lasso", estimator_paras={"alpha": 0.1})
# find all relevant features
feat_selector.fit(data.X_train, data.y_train)

# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)

# check the index of selected features
print(feat_selector.selected_feature_indexes)

# call transform() on X to filter it down to selected features
X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)

# Evaluate final dataset with different estimator with multiple performance metrics
results = feat_selector.evaluate(estimator=SVC(), data=data, metrics=["AS", "PS", "RS"])
print(results)
# {'AS_train': 0.77176, 'PS_train': 0.54177, 'RS_train': 0.6205, 'AS_test': 0.72636, 'PS_test': 0.34628, 'RS_test': 0.52747}

Or, use Tree-based feature selection with different estimators:

from mafese.embedded.tree import TreeSelector
from mafese import get_dataset
from mafese import evaluator
from sklearn.svm import SVC


data = get_dataset("Arrhythmia")
data.split_train_test(test_size=0.2)
print(data.X_train.shape, data.X_test.shape)            # (361, 279) (91, 279)

# define mafese feature selection method
feat_selector = TreeSelector(problem="classification", estimator="tree")
# find all relevant features
feat_selector.fit(data.X_train, data.y_train)

# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)

# check the index of selected features
print(feat_selector.selected_feature_indexes)

# call transform() on X to filter it down to selected features
X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)

# Evaluate final dataset with different estimator with multiple performance metrics
results = feat_selector.evaluate(estimator=SVC(), data=data, metrics=["AS", "PS", "RS"])
print(results)
# {'AS_train': 0.77176, 'PS_train': 0.54177, 'RS_train': 0.6205, 'AS_test': 0.72636, 'PS_test': 0.34628, 'RS_test': 0.52747}

For more usage examples, please look at the examples folder of the repository.

MAFESE Library

mafese.selector

class mafese.selector.Selector(problem='classification')[source]

Bases: abc.ABC

Defines an abstract class for Feature Selector.

SUPPORTED_CLASSIFICATION_METRICS = ['PS', 'NPV', 'RS', 'AS', 'F1S', 'F2S', 'FBS', 'SS', 'MCC', 'HS', 'CKS', 'JSI', 'GMS', 'ROC-AUC', 'LS', 'GINI', 'CEL', 'HL', 'KLDL', 'BSL']
SUPPORTED_ESTIMATORS = ['knn', 'svm', 'rf', 'adaboost', 'xgb', 'tree', 'ann']
SUPPORTED_PROBLEMS = ['classification', 'regression']
SUPPORTED_REGRESSION_METRICS = ['EVS', 'ME', 'MAE', 'MSE', 'RMSE', 'MSLE', 'MedAE', 'MRE', 'MRB', 'MAPE', 'SMAPE', 'MAAPE', 'MASE', 'NSE', 'NNSE', 'WI', 'R', 'PCC', 'AR', 'APCC', 'R2S', 'RSQ', 'R2', 'COD', 'AR2', 'ACOD', 'CI', 'DRV', 'KGE', 'GINI', 'GINI_WIKI', 'PCD', 'JSD', 'VAF', 'RAE', 'A10', 'A20', 'A30', 'NRMSE', 'RSE', 'COV', 'COR', 'EC', 'OI', 'CRM']
evaluate(estimator=None, estimator_paras=None, data=None, metrics=None)[source]

Evaluate the new dataset. We re-train the estimator on the training set and return the metrics for both the training and testing sets.

Parameters
  • estimator (str or Estimator instance (from scikit-learn or custom)) –

    If estimator is a str, we currently support:
    • knn: k-nearest neighbors

    • svm: support vector machine

    • rf: random forest

    • adaboost: AdaBoost

    • xgb: Gradient Boosting

    • tree: Extra Trees

    • ann: Artificial Neural Network (Multi-Layer Perceptron)

    If estimator is an Estimator instance, make sure it has fit and predict methods.

  • estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, the default parameters of the selected estimator are used.

  • data (Data, an instance of the Data class. It must contain training and testing sets) –

  • metrics (tuple, list, default = None) – Depends on whether you are tackling a regression or classification problem. The supported metrics can be found at: https://github.com/thieu1995/permetrics

Returns

metrics_results – The metrics for both training and testing set.

Return type

dict.

fit(X, y=None)[source]

Learn the features to select from X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.

  • y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.

Returns

self – Returns the instance itself.

Return type

object

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

name = 'Feature Selector'
transform(X)[source]

Reduce X to the selected features.

Parameters

X (array of shape [n_samples, n_features]) – The input samples.

Returns

X_r – The input samples with only the selected features.

Return type

array of shape [n_samples, n_selected_features]

mafese.filter

class mafese.filter.FilterSelector(problem='classification', method='ANOVA', n_features=3, n_neighbors=5, n_bins=10, normalized=True)[source]

Bases: mafese.selector.Selector

Defines a FilterSelector class that holds all filter methods for feature selection problems.

Parameters
  • problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”

  • method (str, default = "ANOVA") –

    If the problem = “classification”, the supported method can be one of the following values:

    • ”CHI”: Chi-Squared statistic

    • ”ANOVA”: ANOVA F-score

    • ”MI”: Mutual information

    • ”KENDALL”: Kendall Tau correlation

    • ”SPEARMAN”: Spearman’s Rho correlation

    • ”POINT”: Point-biserial correlation

    • ”RELIEF”: Original Relief method

    • ”RELIEF-F”: Weighted average Relief based on the frequency of each class

    • ”VLS-RELIEF-F”: Very Large Scale ReliefF

    If the problem = “regression”, the supported method can be one of the following values:

    • ”PEARSON”: Pearson correlation

    • ”ANOVA”: ANOVA F-score

    • ”MI”: Mutual information

    • ”KENDALL”: Kendall Tau correlation

    • ”SPEARMAN”: Spearman’s Rho correlation

    • ”POINT”: Point-biserial correlation

    • ”RELIEF”: Original Relief method

    • ”RELIEF-F”: Weighted average Relief based on the frequency of each class

    • ”VLS-RELIEF-F”: Very Large Scale ReliefF

  • n_features (int or float, default=3) – If integer, the parameter is the absolute number of features to select. If float between 0 and 1, it is the fraction of features to select.

  • n_neighbors (int, default=5, Optional) – Number of neighbors to use for computing feature importance scores of Relief-based family

  • n_bins (int, default=10, Optional) – Number of bins to use for discretizing the target variable of Relief-based family in regression problems.

  • normalized (bool, default=True, Optional) – Normalize feature importance scores by the number of instances in the dataset

n_features

The number of selected features.

Type

int

supported_methods

Key: the supported method name. Value: the corresponding method function.

Type

dict

method_name

The method that will be used

Type

str

Examples

The following example shows how to retrieve the most informative features in the FilterSelector FS method

>>> import pandas as pd
>>> from mafese.filter import FilterSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]     # Assumption that the last column is label column
>>> # define mafese feature selection method
>>> feat_selector = FilterSelector(problem='classification', method='SPEARMAN', n_features=5)
>>> # find all relevant features
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> print(feat_selector.selected_feature_masks)
array([ True, True, True, False, False, True, False, False, False, True])
>>> print(feat_selector.selected_feature_solution)
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> print(feat_selector.selected_feature_indexes)
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
SUPPORT = {'classification': {'ANOVA': 'f_classification_func', 'CHI': 'chi2_func', 'KENDALL': 'kendall_func', 'MI': 'mutual_info_classif', 'POINT': 'point_func', 'RELIEF': 'relief_func', 'RELIEF-F': 'relief_f_func', 'SPEARMAN': 'spearman_func', 'VLS-RELIEF-F': 'vls_relief_f_func'}, 'regression': {'ANOVA': 'f_regression_func', 'KENDALL': 'kendall_func', 'MI': 'mutual_info_regression', 'PEARSON': 'r_regression', 'POINT': 'point_func', 'RELIEF': 'relief_func', 'RELIEF-F': 'relief_f_func', 'SPEARMAN': 'spearman_func', 'VLS-RELIEF-F': 'vls_relief_f_func'}}
fit(X, y=None)[source]

Learn the features to select from X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.

  • y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.

Returns

self – Returns the instance itself.

Return type

object

fit_transform(X, y=None, **fit_params)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

transform(X)[source]

Reduce X to the selected features.

Parameters

X (array of shape [n_samples, n_features]) – The input samples.

Returns

X_r – The input samples with only the selected features.

Return type

array of shape [n_samples, n_selected_features]

mafese submodule

mafese.utils package
mafese.utils.correlation module

Refs: 1. https://docs.scipy.org/doc/scipy/reference/stats.html#correlation-functions 2. https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection

mafese.utils.correlation.chi2_func(X, y)[source]
mafese.utils.correlation.f_classification_func(X, y)[source]
mafese.utils.correlation.f_regression_func(X, y, center=True, force_finite=True)[source]
mafese.utils.correlation.kendall_func(X, y)[source]
mafese.utils.correlation.point_func(X, y)[source]
mafese.utils.correlation.relief_f_func(X, y, n_neighbors=5, n_bins=10, problem='classification', normalized=True, **kwargs)[source]

Performs Relief-F feature selection on the input dataset X and target variable y. Returns a vector of feature importance scores

Parameters
  • X (numpy array) – Input dataset of shape (n_samples, n_features).

  • y (numpy array) – Target variable of shape (n_samples,).

  • n_neighbors (int, default=5) – Number of neighbors to use for computing feature importance scores.

  • n_bins (int, default=10) – Number of bins to use for discretizing the target variable in regression problems.

  • problem (str) – The problem of the dataset, either regression or classification. If regression, the target variable is discretized into n_bins classes.

  • normalized (bool, default=True) – Normalize feature importance scores by the number of instances in the dataset

Returns

importance score – Vector of feature importance scores, with shape (n_features,).

Return type

np.ndarray

mafese.utils.correlation.relief_func(X, y, n_neighbors=5, n_bins=10, problem='classification', normalized=True, **kwargs)[source]

Performs Relief feature selection on the input dataset X and target variable y. Returns a vector of feature importance scores.

Parameters
  • X (numpy array) – Input dataset of shape (n_samples, n_features).

  • y (numpy array) – Target variable of shape (n_samples,).

  • n_neighbors (int, default=5) – Number of neighbors to use for computing feature importance scores.

  • n_bins (int, default=10) – Number of bins to use for discretizing the target variable in regression problems.

  • problem (str) – The problem of the dataset, either regression or classification. If regression, the target variable is discretized into n_bins classes.

  • normalized (bool, default=True) – Normalize feature importance scores by the number of instances in the dataset

Returns

importance score – Vector of feature importance scores, with shape (n_features,).

Return type

np.ndarray

mafese.utils.correlation.select_bests(importance_scores=None, n_features=3)[source]

Select features according to the k highest scores or percentile of the highest scores.

Parameters
  • importance_scores (array-like of shape (n_features,)) – Scores of features.

  • n_features (int, float. default=3) –

    Number of selected features.

    • If float, it should be in the range (0, 1), representing the fraction (percentile) of the highest-scoring features to keep.

    • If int, it should be in the range (1, N-1), where N is the total number of features in your dataset.

Returns

mask – The mask of the selected features.

Return type

np.ndarray
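
A minimal sketch of using select_bests directly; the importance scores below are synthetic, purely for illustration:

>>> import numpy as np
>>> from mafese.utils.correlation import select_bests
>>> scores = np.array([0.10, 0.80, 0.05, 0.60, 0.30])             # synthetic importance scores
>>> mask = select_bests(importance_scores=scores, n_features=2)   # keep the 2 highest scores
>>> print(np.flatnonzero(mask))    # indexes of the selected features, here features 1 and 3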

mafese.utils.correlation.spearman_func(X, y)[source]
mafese.utils.correlation.vls_relief_f_func(X, y, n_neighbors=5, n_bins=10, problem='classification', normalized=True, **kwargs)[source]

Performs Very Large Scale ReliefF feature selection on the input dataset X and target variable y. Returns a vector of feature importance scores

Parameters
  • X (numpy array) – Input dataset of shape (n_samples, n_features).

  • y (numpy array) – Target variable of shape (n_samples,).

  • n_neighbors (int, default=5) – Number of neighbors to use for computing feature importance scores.

  • n_bins (int, default=10) – Number of bins to use for discretizing the target variable in regression problems.

  • problem (str) – The problem of the dataset, either regression or classification. If regression, the target variable is discretized into n_bins classes.

  • normalized (bool, default=True) – Normalize feature importance scores by the number of instances in the dataset

Returns

importance score – Vector of feature importance scores, with shape (n_features,).

Return type

np.ndarray

mafese.utils.data_loader module
class mafese.utils.data_loader.Data(X, y)[source]

Bases: object

The structure of our supported Data class

Parameters
  • X (np.ndarray) – The features of your data

  • y (np.ndarray) – The labels of your data

set_train_test(X_train=None, y_train=None, X_test=None, y_test=None)[source]

Function used to set your own X_train, y_train, X_test, y_test in case you don’t want to use our split function.

Parameters
  • X_train (np.ndarray) –

  • y_train (np.ndarray) –

  • X_test (np.ndarray) –

  • y_test (np.ndarray) –

split_train_test(test_size=0.2, train_size=None, random_state=41, shuffle=True, stratify=None, inplace=True)[source]

A wrapper around the train_test_split function from the scikit-learn library.
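
A minimal sketch of both splitting options; the random data is purely for illustration, and train_test_split comes from scikit-learn:

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> from mafese import Data
>>> X = np.random.rand(100, 10)
>>> y = np.random.randint(0, 2, 100)
>>> data = Data(X, y)
>>> # Option 1: the built-in wrapper
>>> data.split_train_test(test_size=0.2, inplace=True)
>>> # Option 2: provide your own split
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=41)
>>> data.set_train_test(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test)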

mafese.utils.data_loader.get_dataset(dataset_name)[source]

Helper function to retrieve the data

Parameters

dataset_name (str) – Name of the dataset

Returns

data – An instance of the Data class that holds the X and y variables.

Return type

Data

mafese.utils.encoder module
class mafese.utils.encoder.LabelEncoder[source]

Bases: object

Encode categorical features as integer labels.

fit(y)[source]

Fit label encoder to a given set of labels.

Parameters

y (array-like) – Labels to encode.

fit_transform(y)[source]

Fit label encoder and return encoded labels.

Parameters

y (array-like of shape (n_samples,)) – Target values.

Returns

y – Encoded labels.

Return type

array-like of shape (n_samples,)

inverse_transform(y)[source]

Transform integer labels to original labels.

Parameters

y (array-like) – Encoded integer labels.

Returns

original_labels – Original labels.

Return type

array-like

transform(y)[source]

Transform labels to encoded integer labels.

Parameters

y (array-like) – Labels to encode.

Returns

encoded_labels – Encoded integer labels.

Return type

array-like
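
A minimal usage sketch, based only on the methods documented above:

>>> from mafese.utils.encoder import LabelEncoder
>>> le = LabelEncoder()
>>> encoded = le.fit_transform(["cat", "dog", "cat", "bird"])   # integer labels
>>> original = le.inverse_transform(encoded)                    # back to the original labels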

mafese.utils.estimator module
mafese.utils.estimator.get_general_estimator(problem, name, paras=None)[source]
mafese.utils.estimator.get_lasso_based_estimator(problem, name, paras=None)[source]
mafese.utils.estimator.get_recursive_estimator(problem, name, paras=None)[source]
mafese.utils.estimator.get_tree_based_estimator(problem, name, paras=None)[source]
mafese.utils.mealpy_util module
class mafese.utils.mealpy_util.FeatureSelectionProblem(lb, ub, minmax, data=None, estimator=None, transfer_func=None, obj_name=None, metric_class=None, fit_weights=(0.9, 0.1), fit_sign=1, obj_paras=None, name='Feature Selection Problem', **kwargs)[source]

Bases: mealpy.utils.problem.Problem

amend_position(position=None, lb=None, ub=None)[source]

The goal is to transform the solution into the right format corresponding to the problem. For example, with discrete problems, floating-point numbers must be converted to integers to ensure the solution is in the correct format.

Parameters
  • position – vector position (location) of the solution.

  • lb – list of lower bound values

  • ub – list of upper bound values

Returns

Amended position (converted into the right format for the problem)

fit_func(solution)[source]

Fitness function

Parameters

solution (numpy.ndarray) – The candidate solution to evaluate.

Returns

The fitness value of the solution.

Return type

float

mafese.utils.transfer module
mafese.utils.transfer.sstf_01(x)[source]
mafese.utils.transfer.sstf_02(x)[source]
mafese.utils.transfer.sstf_03(x)[source]
mafese.utils.transfer.sstf_04(x)[source]
mafese.utils.transfer.vstf_01(x)[source]
mafese.utils.transfer.vstf_02(x)[source]
mafese.utils.transfer.vstf_03(x)[source]
mafese.utils.transfer.vstf_04(x)[source]
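
These transfer functions squash a real-valued optimizer position into the [0, 1] range so it can be turned into a binary keep/drop decision per feature. A minimal sketch, assuming the functions operate element-wise on numpy arrays; the 0.5 threshold is only illustrative and not necessarily what MhaSelector uses internally:

>>> import numpy as np
>>> from mafese.utils.transfer import vstf_01
>>> position = np.array([-2.0, -0.3, 0.1, 1.5])   # continuous solution from an optimizer
>>> probs = vstf_01(position)                      # mapped into [0, 1]
>>> mask = (probs > 0.5).astype(int)               # illustrative thresholding into a feature mask
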
mafese.utils.validator module
mafese.utils.validator.check_bool(name: str, value: bool, bound=(True, False))[source]
mafese.utils.validator.check_float(name: str, value: int, bound=None)[source]
mafese.utils.validator.check_int(name: str, value: int, bound=None)[source]
mafese.utils.validator.check_str(name: str, value: str, bound=None)[source]
mafese.utils.validator.check_tuple_float(name: str, values: tuple, bounds=None)[source]
mafese.utils.validator.check_tuple_int(name: str, values: tuple, bounds=None)[source]
mafese.utils.validator.is_in_bound(value, bound)[source]
mafese.utils.validator.is_str_in_list(value: str, my_list: list)[source]
mafese.wrapper package
mafese.wrapper.recursive module
class mafese.wrapper.recursive.RecursiveSelector(problem='classification', estimator='knn', estimator_paras=None, n_features=3, step=1, verbose=0, importance_getter='auto')[source]

Bases: mafese.selector.Selector

Defines a RecursiveSelector class that holds all recursive feature selection methods for feature selection problems.

Parameters
  • problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”

  • estimator (str or Estimator instance (from scikit-learn or custom)) –

    If estimator is a str, we currently support:
    • svm: support vector machine with kernel = ‘linear’

    • rf: random forest

    • adaboost: AdaBoost

    • xgb: Gradient Boosting

    • tree: Extra Trees

    If estimator is an Estimator instance, make sure it has a fit method that provides information about feature importance (e.g. coef_, feature_importances_).

  • estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, the default parameters of the selected estimator are used.

  • n_features (int or float, default=3) – The number of features to select. If None, half of the features are selected. If integer, the parameter is the absolute number of features to select. If float between 0 and 1, it is the fraction of features to select.

  • step (int or float, default=1) – If greater than or equal to 1, then step corresponds to the (integer) number of features to remove at each iteration. If within (0.0, 1.0), then step corresponds to the percentage (rounded down) of features to remove at each iteration.

  • verbose (int, default=0) – Controls verbosity of output.

  • importance_getter (str or callable, default='auto') –

    If ‘auto’, uses the feature importance either through a coef_ or feature_importances_ attributes of estimator.

    Also accepts a string that specifies an attribute name/path for extracting feature importance (implemented with attrgetter). For example, give regressor_.coef_ in case of TransformedTargetRegressor, or named_steps.clf.feature_importances_ in case of sklearn.pipeline.Pipeline with its last step named clf.

    If callable, overrides the default feature importance getter. The callable is passed with the fitted estimator and it should return importance for each feature.

Examples

The following example shows how to retrieve the most informative features in the RecursiveSelector FS method

>>> import pandas as pd
>>> from mafese.wrapper.recursive import RecursiveSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]     # Assumption that the last column is label column
>>> # define mafese feature selection method
>>> feat_selector = RecursiveSelector(problem="classification", estimator="rf", n_features=5)
>>> # find all relevant features
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> print(feat_selector.selected_feature_masks)
array([ True, True, True, False, False, True, False, False, False, True])
>>> print(feat_selector.selected_feature_solution)
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> print(feat_selector.selected_feature_indexes)
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
SUPPORT = ['svm', 'rf', 'adaboost', 'xgb', 'tree']
fit(X, y=None)[source]

Learn the features to select from X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.

  • y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.

Returns

self – Returns the instance itself.

Return type

object

mafese.wrapper.sequential module
class mafese.wrapper.sequential.SequentialSelector(problem='classification', estimator='knn', estimator_paras=None, n_features=3, direction='forward', tol=None, scoring=None, cv=5, n_jobs=None)[source]

Bases: mafese.selector.Selector

Defines a SequentialSelector class that holds all forward and backward feature selection methods for feature selection problems.

Parameters
  • problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”

  • estimator (str or Estimator instance (from scikit-learn or custom)) –

    If estimator is a str, we currently support:
    • knn: k-nearest neighbors

    • svm: support vector machine

    • rf: random forest

    • adaboost: AdaBoost

    • xgb: Gradient Boosting

    • tree: Extra Trees

    • ann: Artificial Neural Network (Multi-Layer Perceptron)

    If estimator is an Estimator instance, make sure it has a fit method that provides information about feature importance (e.g. coef_, feature_importances_).

  • estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, the default parameters of the selected estimator are used.

  • n_features (int or float, default=3) – The number of features to select. If None, half of the features are selected. If integer, the parameter is the absolute number of features to select. If float between 0 and 1, it is the fraction of features to select.

  • direction ({'forward', 'backward'}, default='forward') – Whether to perform forward selection or backward selection.

  • tol (float, default=None) – If the score is not incremented by at least tol between two consecutive feature additions or removals, stop adding or removing. tol can be negative when removing features using direction=”backward”. It can be useful to reduce the number of features at the cost of a small decrease in the score. tol is enabled only when n_features is “auto”.

  • scoring (str or callable, default=None) – A single str (see scoring_parameter) or a callable to evaluate the predictions on the test set. NOTE that when using a custom scorer, it should return a single value. If None, the estimator’s score method is used.

  • cv (int, cross-validation generator or an iterable, default=None) –

    Determines the cross-validation splitting strategy. Possible inputs for cv are:

    • None, to use the default 5-fold cross validation,

    • integer, to specify the number of folds in a (Stratified)KFold,

    • CV splitter,

    • An iterable yielding (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.

  • n_jobs (int, default=None) – Number of jobs to run in parallel. When evaluating a new feature to add or remove, the cross-validation procedure is parallel over the folds. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.

Examples

The following example shows how to retrieve the most informative features in the Sequential-based (forward, backward) FS method

>>> import pandas as pd
>>> from mafese.wrapper.sequential import SequentialSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]     # Assumption that the last column is label column
>>> # define mafese feature selection method
>>> feat_selector = SequentialSelector(problem="classification", estimator="knn", n_features=5, direction="forward")
>>> # find all relevant features
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> print(feat_selector.selected_feature_masks)
array([ True, True, True, False, False, True, False, False, False, True])
>>> print(feat_selector.selected_feature_solution)
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> print(feat_selector.selected_feature_indexes)
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
SUPPORT = ['knn', 'svm', 'rf', 'adaboost', 'xgb', 'tree', 'ann']
fit(X, y=None)[source]

Learn the features to select from X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.

  • y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.

Returns

self – Returns the instance itself.

Return type

object

mafese.wrapper.mha module
class mafese.wrapper.mha.MhaSelector(problem='classification', estimator='knn', estimator_paras=None, optimizer='BaseGA', optimizer_paras=None, transfer_func='vstf_01', obj_name=None)[source]

Bases: mafese.selector.Selector

Defines a MhaSelector class that holds all metaheuristic-based feature selection methods for feature selection problems.

Parameters
  • problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”

  • estimator (str or Estimator instance (from scikit-learn or custom)) –

    If estimator is a str, we currently support:
    • knn: k-nearest neighbors

    • svm: support vector machine

    • rf: random forest

    • adaboost: AdaBoost

    • xgb: Gradient Boosting

    • tree: Extra Trees

    • ann: Artificial Neural Network (Multi-Layer Perceptron)

    If estimator is an Estimator instance, make sure it has fit and predict methods.

  • estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, the default parameters of the selected estimator are used.

  • optimizer (str or instance of Optimizer class (from the Mealpy library), default = "BaseGA") – The metaheuristic algorithm used to solve the feature selection problem. For the currently supported list, please check: https://github.com/thieu1995/mealpy. If a custom optimizer is passed, make sure it is an instance of the Optimizer class.

  • optimizer_paras (None or dict of parameters, default=None) – The parameters for the optimizer object. If None, the default parameters of the optimizer are used (defined in https://github.com/thieu1995/mealpy). If a dict is passed, make sure it has at least the epoch and pop_size parameters.

  • transfer_func (str or callable function, default="vstf_01") –

    The transfer function used to convert the solution from float to integer. Currently supported values:
    • v-shape transfer function: “vstf_01”, “vstf_02”, “vstf_03”, “vstf_04”

    • s-shape transfer function: “sstf_01”, “sstf_02”, “sstf_03”, “sstf_04”

    If a callable function is passed, make sure it returns a list/tuple/np.ndarray of values.

  • obj_name (None or str, default=None) –

    The name of the objective for the problem; it depends on whether the problem is classification or regression.

    • If problem is classification, None will be replaced by AS (Accuracy score).

    • If problem is regression, None will be replaced by MSE (Mean squared error).

Examples

The following example shows how to retrieve the most informative features in the MhaSelector FS method

>>> import pandas as pd
>>> from mafese.wrapper.mha import MhaSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]     # Assumption that the last column is label column
>>> # define mafese feature selection method
>>> feat_selector = MhaSelector(problem="classification", estimator="rf", optimizer="BaseGA")
>>> # find all relevant features - 5 features should be selected
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> print(feat_selector.selected_feature_masks)
array([ True, True, True, False, False, True, False, False, False, True])
>>> print(feat_selector.selected_feature_solution)
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> print(feat_selector.selected_feature_indexes)
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
SUPPORT = {'classification_objective': {'AS': 'max', 'BSL': 'min', 'CEL': 'min', 'CKS': 'max', 'F1S': 'max', 'F2S': 'max', 'FBS': 'max', 'GINI': 'min', 'GMS': 'max', 'HL': 'min', 'HS': 'max', 'JSI': 'max', 'KLDL': 'min', 'LS': 'max', 'MCC': 'max', 'NPV': 'max', 'PS': 'max', 'ROC-AUC': 'max', 'RS': 'max', 'SS': 'max'}, 'estimator': ['knn', 'svm', 'rf', 'adaboost', 'xgb', 'tree', 'ann'], 'optimizer': ['OriginalABC', 'OriginalACOR', 'AugmentedAEO', 'EnhancedAEO', 'ImprovedAEO', 'ModifiedAEO', 'OriginalAEO', 'MGTO', 'OriginalAGTO', 'BaseALO', 'OriginalALO', 'OriginalAO', 'OriginalAOA', 'IARO', 'LARO', 'OriginalARO', 'OriginalASO', 'OriginalAVOA', 'OriginalArchOA', 'AdaptiveBA', 'ModifiedBA', 'OriginalBA', 'BaseBBO', 'OriginalBBO', 'OriginalBBOA', 'OriginalBES', 'ABFO', 'OriginalBFO', 'OriginalBMO', 'BaseBRO', 'OriginalBRO', 'OriginalBSA', 'ImprovedBSO', 'OriginalBSO', 'CleverBookBeesA', 'OriginalBeesA', 'ProbBeesA', 'OriginalCA', 'OriginalCDO', 'OriginalCEM', 'OriginalCGO', 'BaseCHIO', 'OriginalCHIO', 'OriginalCOA', 'OCRO', 'OriginalCRO', 'OriginalCSA', 'OriginalCSO', 'OriginalCircleSA', 'OriginalCoatiOA', 'BaseDE', 'JADE', 'SADE', 'SAP_DE', 'DevDMOA', 'OriginalDMOA', 'OriginalDO', 'BaseEFO', 'OriginalEFO', 'OriginalEHO', 'AdaptiveEO', 'ModifiedEO', 'OriginalEO', 'OriginalEOA', 'LevyEP', 'OriginalEP', 'CMA_ES', 'LevyES', 'OriginalES', 'Simple_CMA_ES', 'OriginalESOA', 'OriginalEVO', 'OriginalFA', 'BaseFBIO', 'OriginalFBIO', 'OriginalFFA', 'OriginalFFO', 'OriginalFLA', 'BaseFOA', 'OriginalFOA', 'WhaleFOA', 'OriginalFOX', 'OriginalFPA', 'BaseGA', 'EliteMultiGA', 'EliteSingleGA', 'MultiGA', 'SingleGA', 'OriginalGBO', 'BaseGCO', 'OriginalGCO', 'OriginalGJO', 'OriginalGOA', 'BaseGSKA', 'OriginalGSKA', 'Matlab101GTO', 'Matlab102GTO', 'OriginalGTO', 'GWO_WOA', 'IGWO', 'OriginalGWO', 'RW_GWO', 'OriginalHBA', 'OriginalHBO', 'OriginalHC', 'SwarmHC', 'OriginalHCO', 'OriginalHGS', 'OriginalHGSO', 'OriginalHHO', 'BaseHS', 'OriginalHS', 'OriginalICA', 'OriginalINFO', 'OriginalIWO', 'BaseJA', 'LevyJA', 'OriginalJA', 'BaseLCO', 'ImprovedLCO', 'OriginalLCO', 'OriginalMA', 'BaseMFO', 'OriginalMFO', 'OriginalMGO', 'OriginalMPA', 'OriginalMRFO', 'WMQIMRFO', 'OriginalMSA', 'BaseMVO', 'OriginalMVO', 'OriginalNGO', 'ImprovedNMRA', 'OriginalNMRA', 'OriginalNRO', 'OriginalOOA', 'OriginalPFA', 'OriginalPOA', 'CL_PSO', 'C_PSO', 'HPSO_TVAC', 'OriginalPSO', 'PPSO', 'OriginalPSS', 'BaseQSA', 'ImprovedQSA', 'LevyQSA', 'OppoQSA', 'OriginalQSA', 'OriginalRIME', 'OriginalRUN', 'GaussianSA', 'OriginalSA', 'SwarmSA', 'BaseSARO', 'OriginalSARO', 'BaseSBO', 'OriginalSBO', 'BaseSCA', 'OriginalSCA', 'QleSCA', 'OriginalSCSO', 'ImprovedSFO', 'OriginalSFO', 'L_SHADE', 'OriginalSHADE', 'OriginalSHIO', 'OriginalSHO', 'ImprovedSLO', 'ModifiedSLO', 'OriginalSLO', 'BaseSMA', 'OriginalSMA', 'DevSOA', 'OriginalSOA', 'OriginalSOS', 'DevSPBO', 'OriginalSPBO', 'OriginalSRSR', 'BaseSSA', 'OriginalSSA', 'OriginalSSDO', 'OriginalSSO', 'OriginalSSpiderA', 'OriginalSSpiderO', 'OriginalSTO', 'OriginalSeaHO', 'OriginalServalOA', 'OriginalTDO', 'BaseTLO', 'ImprovedTLO', 'OriginalTLO', 'OriginalTOA', 'OriginalTPO', 'OriginalTS', 'OriginalTSA', 'OriginalTSO', 'EnhancedTWO', 'LevyTWO', 'OppoTWO', 'OriginalTWO', 'BaseVCS', 'OriginalVCS', 'OriginalWCA', 'OriginalWDO', 'OriginalWHO', 'HI_WOA', 'OriginalWOA', 'OriginalWaOA', 'OriginalWarSO', 'OriginalZOA'], 'regression_objective': {'A10': 'max', 'A20': 'max', 'A30': 'max', 'ACOD': 'max', 'APCC': 'max', 'AR': 'max', 'AR2': 'max', 'CI': 'max', 'COD': 'max', 'COR': 'max', 'COV': 'max', 'CRM': 'min', 'DRV': 'min', 
'EC': 'max', 'EVS': 'max', 'GINI': 'min', 'GINI_WIKI': 'min', 'JSD': 'min', 'KGE': 'max', 'MAAPE': 'min', 'MAE': 'min', 'MAPE': 'min', 'MASE': 'min', 'ME': 'min', 'MRB': 'min', 'MRE': 'min', 'MSE': 'min', 'MSLE': 'min', 'MedAE': 'min', 'NNSE': 'max', 'NRMSE': 'min', 'NSE': 'max', 'OI': 'max', 'PCC': 'max', 'PCD': 'max', 'R': 'max', 'R2': 'max', 'R2S': 'max', 'RAE': 'min', 'RMSE': 'min', 'RSE': 'min', 'RSQ': 'max', 'SMAPE': 'min', 'VAF': 'max', 'WI': 'max'}, 'transfer_func': ['vstf_01', 'vstf_02', 'vstf_03', 'vstf_04', 'sstf_01', 'sstf_02', 'sstf_03', 'sstf_04']}
fit(X, y=None, fit_weights=(0.9, 0.1), verbose=True, mode='single', n_workers=None, termination=None)[source]
Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples.

  • y (array-like of shape (n_samples,)) – The target values.

  • fit_weights (list, tuple or np.ndarray, default = (0.9, 0.1)) – The first weight is for objective value and the second weight is for the number of features

  • verbose (int, default = True) – Controls verbosity of output.

  • mode (str, default = 'single') –

    The mode used by the Optimizer from the Mealpy library. Parallel: ‘process’, ‘thread’; Sequential: ‘swarm’, ‘single’.

    • ’process’: parallel mode that runs the tasks on multiple cores

    • ’thread’: parallel mode that runs the tasks on multiple threads

    • ’swarm’: sequential mode that has no effect on the updating phase of other agents

    • ’single’: sequential mode that affects the updating phase of other agents (default)

  • n_workers (int or None, default = None) – The number of workers (cores or threads) used for the tasks (only effective in parallel mode)

  • termination (dict or None, default = None) – The termination dictionary or an instance of the Termination class, passed to the Optimizer from the Mealpy library.

fit_transform(X, y=None, fit_weights=(0.9, 0.1), verbose=True, mode='single', n_workers=None, termination=None)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

get_best_obj_and_fit()[source]
transform(X)[source]

Reduce X to the selected features.

Parameters

X (array of shape [n_samples, n_features]) – The input samples.

Returns

X_r – The input samples with only the selected features.

Return type

array of shape [n_samples, n_selected_features]

class mafese.wrapper.mha.MultiMhaSelector(problem='classification', estimator='knn', estimator_paras=None, list_optimizers=('BaseGA',), list_optimizer_paras=None, transfer_func='vstf_01', obj_name=None)[source]

Bases: mafese.selector.Selector

SUPPORT = {'classification_objective': {'AS': 'max', 'BSL': 'min', 'CEL': 'min', 'CKS': 'max', 'F1S': 'max', 'F2S': 'max', 'FBS': 'max', 'GINI': 'min', 'GMS': 'max', 'HL': 'min', 'HS': 'max', 'JSI': 'max', 'KLDL': 'min', 'LS': 'max', 'MCC': 'max', 'NPV': 'max', 'PS': 'max', 'ROC-AUC': 'max', 'RS': 'max', 'SS': 'max'}, 'estimator': ['knn', 'svm', 'rf', 'adaboost', 'xgb', 'tree', 'ann'], 'optimizer': ['OriginalABC', 'OriginalACOR', 'AugmentedAEO', 'EnhancedAEO', 'ImprovedAEO', 'ModifiedAEO', 'OriginalAEO', 'MGTO', 'OriginalAGTO', 'BaseALO', 'OriginalALO', 'OriginalAO', 'OriginalAOA', 'IARO', 'LARO', 'OriginalARO', 'OriginalASO', 'OriginalAVOA', 'OriginalArchOA', 'AdaptiveBA', 'ModifiedBA', 'OriginalBA', 'BaseBBO', 'OriginalBBO', 'OriginalBBOA', 'OriginalBES', 'ABFO', 'OriginalBFO', 'OriginalBMO', 'BaseBRO', 'OriginalBRO', 'OriginalBSA', 'ImprovedBSO', 'OriginalBSO', 'CleverBookBeesA', 'OriginalBeesA', 'ProbBeesA', 'OriginalCA', 'OriginalCDO', 'OriginalCEM', 'OriginalCGO', 'BaseCHIO', 'OriginalCHIO', 'OriginalCOA', 'OCRO', 'OriginalCRO', 'OriginalCSA', 'OriginalCSO', 'OriginalCircleSA', 'OriginalCoatiOA', 'BaseDE', 'JADE', 'SADE', 'SAP_DE', 'DevDMOA', 'OriginalDMOA', 'OriginalDO', 'BaseEFO', 'OriginalEFO', 'OriginalEHO', 'AdaptiveEO', 'ModifiedEO', 'OriginalEO', 'OriginalEOA', 'LevyEP', 'OriginalEP', 'CMA_ES', 'LevyES', 'OriginalES', 'Simple_CMA_ES', 'OriginalESOA', 'OriginalEVO', 'OriginalFA', 'BaseFBIO', 'OriginalFBIO', 'OriginalFFA', 'OriginalFFO', 'OriginalFLA', 'BaseFOA', 'OriginalFOA', 'WhaleFOA', 'OriginalFOX', 'OriginalFPA', 'BaseGA', 'EliteMultiGA', 'EliteSingleGA', 'MultiGA', 'SingleGA', 'OriginalGBO', 'BaseGCO', 'OriginalGCO', 'OriginalGJO', 'OriginalGOA', 'BaseGSKA', 'OriginalGSKA', 'Matlab101GTO', 'Matlab102GTO', 'OriginalGTO', 'GWO_WOA', 'IGWO', 'OriginalGWO', 'RW_GWO', 'OriginalHBA', 'OriginalHBO', 'OriginalHC', 'SwarmHC', 'OriginalHCO', 'OriginalHGS', 'OriginalHGSO', 'OriginalHHO', 'BaseHS', 'OriginalHS', 'OriginalICA', 'OriginalINFO', 'OriginalIWO', 'BaseJA', 'LevyJA', 'OriginalJA', 'BaseLCO', 'ImprovedLCO', 'OriginalLCO', 'OriginalMA', 'BaseMFO', 'OriginalMFO', 'OriginalMGO', 'OriginalMPA', 'OriginalMRFO', 'WMQIMRFO', 'OriginalMSA', 'BaseMVO', 'OriginalMVO', 'OriginalNGO', 'ImprovedNMRA', 'OriginalNMRA', 'OriginalNRO', 'OriginalOOA', 'OriginalPFA', 'OriginalPOA', 'CL_PSO', 'C_PSO', 'HPSO_TVAC', 'OriginalPSO', 'PPSO', 'OriginalPSS', 'BaseQSA', 'ImprovedQSA', 'LevyQSA', 'OppoQSA', 'OriginalQSA', 'OriginalRIME', 'OriginalRUN', 'GaussianSA', 'OriginalSA', 'SwarmSA', 'BaseSARO', 'OriginalSARO', 'BaseSBO', 'OriginalSBO', 'BaseSCA', 'OriginalSCA', 'QleSCA', 'OriginalSCSO', 'ImprovedSFO', 'OriginalSFO', 'L_SHADE', 'OriginalSHADE', 'OriginalSHIO', 'OriginalSHO', 'ImprovedSLO', 'ModifiedSLO', 'OriginalSLO', 'BaseSMA', 'OriginalSMA', 'DevSOA', 'OriginalSOA', 'OriginalSOS', 'DevSPBO', 'OriginalSPBO', 'OriginalSRSR', 'BaseSSA', 'OriginalSSA', 'OriginalSSDO', 'OriginalSSO', 'OriginalSSpiderA', 'OriginalSSpiderO', 'OriginalSTO', 'OriginalSeaHO', 'OriginalServalOA', 'OriginalTDO', 'BaseTLO', 'ImprovedTLO', 'OriginalTLO', 'OriginalTOA', 'OriginalTPO', 'OriginalTS', 'OriginalTSA', 'OriginalTSO', 'EnhancedTWO', 'LevyTWO', 'OppoTWO', 'OriginalTWO', 'BaseVCS', 'OriginalVCS', 'OriginalWCA', 'OriginalWDO', 'OriginalWHO', 'HI_WOA', 'OriginalWOA', 'OriginalWaOA', 'OriginalWarSO', 'OriginalZOA'], 'regression_objective': {'A10': 'max', 'A20': 'max', 'A30': 'max', 'ACOD': 'max', 'APCC': 'max', 'AR': 'max', 'AR2': 'max', 'CI': 'max', 'COD': 'max', 'COR': 'max', 'COV': 'max', 'CRM': 'min', 'DRV': 'min', 
'EC': 'max', 'EVS': 'max', 'GINI': 'min', 'GINI_WIKI': 'min', 'JSD': 'min', 'KGE': 'max', 'MAAPE': 'min', 'MAE': 'min', 'MAPE': 'min', 'MASE': 'min', 'ME': 'min', 'MRB': 'min', 'MRE': 'min', 'MSE': 'min', 'MSLE': 'min', 'MedAE': 'min', 'NNSE': 'max', 'NRMSE': 'min', 'NSE': 'max', 'OI': 'max', 'PCC': 'max', 'PCD': 'max', 'R': 'max', 'R2': 'max', 'R2S': 'max', 'RAE': 'min', 'RMSE': 'min', 'RSE': 'min', 'RSQ': 'max', 'SMAPE': 'min', 'VAF': 'max', 'WI': 'max'}, 'transfer_func': ['vstf_01', 'vstf_02', 'vstf_03', 'vstf_04', 'sstf_01', 'sstf_02', 'sstf_03', 'sstf_04']}
evaluate(estimator=None, estimator_paras=None, data=None, metrics=None, save_path='history', verbose=False)[source]

Evaluate the new dataset. We re-train the estimator on the training set and return the metrics for both the training and testing sets.

Parameters
  • estimator (str or Estimator instance (from scikit-learn or custom)) –

    If estimator is a str, we currently support:
    • knn: k-nearest neighbors

    • svm: support vector machine

    • rf: random forest

    • adaboost: AdaBoost

    • xgb: Gradient Boosting

    • tree: Extra Trees

    • ann: Artificial Neural Network (Multi-Layer Perceptron)

    If estimator is an Estimator instance, make sure it has fit and predict methods.

  • estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, the default parameters of the selected estimator are used.

  • data (Data, an instance of the Data class. It must contain training and testing sets) –

  • metrics (tuple, list, default = None) – Depends on whether you are tackling a regression or classification problem. The supported metrics can be found at: https://github.com/thieu1995/permetrics

  • save_path (str, default="history") – The path to save the file

  • verbose (bool, default=False) – Print the results to console or not.

Returns

metrics_results – The metrics for both training and testing set.

Return type

dict.

export_boxplot_figures(xlabel='Model', ylabel='Global best fitness value', title='Boxplot of comparison models', show_legend=True, show_mean_only=False, exts=('.png', '.pdf'))[source]
export_convergence_figures(xlabel='Epoch', ylabel='Fitness value', title='Convergence chart of comparison models', exts=('.png', '.pdf'))[source]
fit(X, y=None, n_trials=2, n_jobs=2, save_path='history', save_results=True, verbose=True, fit_weights=(0.9, 0.1), mode='single', n_workers=None, termination=None)[source]
Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples.

  • y (array-like of shape (n_samples,)) – The target values.

  • n_trials (int) – Number of repetitions.

  • n_jobs (int or None) – Number of processes used to speed up the computation (<=1 or None: sequential, >=2: parallel).

  • save_path (str) – The path to the folder that holds the results.

  • save_results (bool, default=True) – Save the global best fitness and the loss (convergence/fitness) over generations to a CSV file.

  • fit_weights (list, tuple or np.ndarray, default = (0.9, 0.1)) – The first weight is for objective value and the second weight is for the number of features

  • verbose (int, default = True) – Controls verbosity of output.

  • mode (str, default = 'single') –

    The mode used by the Optimizer from the Mealpy library. Parallel: ‘process’, ‘thread’; Sequential: ‘swarm’, ‘single’.

    • ’process’: parallel mode that runs the tasks on multiple cores

    • ’thread’: parallel mode that runs the tasks on multiple threads

    • ’swarm’: sequential mode that has no effect on the updating phase of other agents

    • ’single’: sequential mode that affects the updating phase of other agents (default)

  • n_workers (int or None, default = None) – The number of workers (cores or threads) used by the Optimizer (only effective in parallel mode)

  • termination (dict or None, default = None) – The termination dictionary or an instance of the Termination class, passed to the Optimizer from the Mealpy library.

fit_transform(X, y=None, n_trials=2, n_jobs=2, save_path='history', save_results=True, verbose=True, fit_weights=(0.9, 0.1), mode='single', n_workers=None, termination=None)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

transform(X, trial=1, model='BaseGA', all_models=False)[source]

Reduce X to the selected features.

Parameters

X (array of shape [n_samples, n_features]) – The input samples.

Returns

X_r – The input samples with only the selected features.

Return type

array of shape [n_samples, n_selected_features]
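
MultiMhaSelector has no dedicated example above; the sketch below is assembled purely from the parameters documented in this section (the dataset and optimizer choices are illustrative):

>>> from mafese.wrapper.mha import MultiMhaSelector
>>> from mafese import get_dataset
>>> data = get_dataset("Arrhythmia")
>>> data.split_train_test(test_size=0.2)
>>> selector = MultiMhaSelector(problem="classification", estimator="knn",
...                             list_optimizers=("BaseGA", "OriginalPSO"),
...                             transfer_func="vstf_01", obj_name="AS")
>>> selector.fit(data.X_train, data.y_train, n_trials=2, n_jobs=2, save_path="history", verbose=False)
>>> # Reduce X using the result of one trial of a specific optimizer
>>> X_test_selected = selector.transform(data.X_test, trial=1, model="BaseGA")
>>> # Export comparison charts for the tested optimizers
>>> selector.export_convergence_figures()
>>> selector.export_boxplot_figures()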

mafese.embedded package
mafese.embedded.lasso module
class mafese.embedded.lasso.LassoSelector(problem='classification', estimator='lasso', estimator_paras=None, threshold=None, norm_order=1, max_features=None)[source]

Bases: mafese.selector.Selector

Defines a LassoSelector class that holds all Lasso-based feature selection methods for feature selection problems.

Parameters
  • problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”

  • estimator (str, default = 'lasso') –

    We currently support:
    • lasso: lasso estimator (both regression and classification)

    • lr: Logistic Regression (classification)

    • svm: LinearSVC, support vector machine (classification)

  • estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, the default parameters of the selected estimator are used.

  • threshold (str or float, default=None) – The threshold value to use for feature selection. Features whose absolute importance value is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g, Lasso), the threshold used is 1e-5. Otherwise, “mean” is used by default.

  • norm_order (non-zero int, inf, -inf, default=1) – Order of the norm used to filter the vectors of coefficients below threshold in the case where the coef_ attribute of the estimator is of dimension 2.

  • max_features (int, callable, default=None) –

    The maximum number of features to select.

    • If an integer, then it specifies the maximum number of features to allow.

    • If a callable, then it specifies how to calculate the maximum number of features allowed by using the output of max_features(X).

    • If None, then all features are kept.

    To only select based on max_features, set threshold=-np.inf.

Examples

The following example shows how to retrieve the most informative features in the Lasso-based FS method

>>> import pandas as pd
>>> from mafese.embedded.lasso import LassoSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]     # Assumption that the last column is label column
>>> # define mafese feature selection method
>>> feat_selector = LassoSelector(problem="classification", estimator="lasso", estimator_paras={"alpha": 0.1})
>>> # find all relevant features
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> print(feat_selector.selected_feature_masks)
array([ True, True, True, False, False, True, False, False, False, True])
>>> print(feat_selector.selected_feature_solution)
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> print(feat_selector.selected_feature_indexes)
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
SUPPORT = {'classification': ['lasso', 'lr', 'svm'], 'regression': ['lasso']}
fit(X, y=None)[source]

Learn the features to select from X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.

  • y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.

Returns

self – Returns the instance itself.

Return type

object

mafese.embedded.tree module
class mafese.embedded.tree.TreeSelector(problem='classification', estimator='tree', estimator_paras=None, threshold=None, norm_order=1, max_features=None)[source]

Bases: mafese.selector.Selector

Defines a TreeSelector class that holds all tree-based feature selection methods for feature selection problems.

Parameters
  • problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”

  • estimator (str, default = 'tree') –

    We currently support:
    • rf: random forest

    • adaboost: AdaBoost

    • xgb: Gradient Boosting

    • tree: Extra Trees

  • estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, the default parameters of the selected estimator are used.

  • threshold (str or float, default=None) – The threshold value to use for feature selection. Features whose absolute importance value is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g, Lasso), the threshold used is 1e-5. Otherwise, “mean” is used by default.

  • norm_order (non-zero int, inf, -inf, default=1) – Order of the norm used to filter the vectors of coefficients below threshold in the case where the coef_ attribute of the estimator is of dimension 2.

  • max_features (int, callable, default=None) –

    The maximum number of features to select.

    • If an integer, then it specifies the maximum number of features to allow.

    • If a callable, then it specifies how to calculate the maximum number of features allowed by using the output of max_features(X).

    • If None, then all features are kept.

    To only select based on max_features, set threshold=-np.inf.

Examples

The following example shows how to retrieve the most informative features in the Tree-based FS method

>>> import pandas as pd
>>> from mafese.embedded.tree import TreeSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]     # Assumption that the last column is label column
>>> # define mafese feature selection method
>>> feat_selector = TreeSelector(problem="classification", estimator="tree")
>>> # find all relevant features
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> print(feat_selector.selected_feature_masks)
array([ True, True, True, False, False, True, False, False, False, True])
>>> print(feat_selector.selected_feature_solution)
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> print(feat_selector.selected_feature_indexes)
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
SUPPORTED = ['rf', 'adaboost', 'xgb', 'tree']
fit(X, y=None)[source]

Learn the features to select from X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.

  • y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.

Returns

self – Returns the instance itself.

Return type

object

Citation Request

Please include these citations if you plan to use this library:

@software{nguyen_van_thieu_2023_7969043,
  author       = {Nguyen Van Thieu and Ngoc Hung Nguyen and Ali Asghar Heidari},
  title        = {Feature Selection using Metaheuristics Made Easy: Open Source MAFESE Library in Python},
  month        = may,
  year         = 2023,
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.7969042},
  url          = {https://github.com/thieu1995/mafese}
}

@article{van2023mealpy,
  title={MEALPY: An open-source library for latest meta-heuristic algorithms in Python},
  author={Van Thieu, Nguyen and Mirjalili, Seyedali},
  journal={Journal of Systems Architecture},
  year={2023},
  publisher={Elsevier},
  doi={10.1016/j.sysarc.2023.102871}
}

If you have an open-ended or a research question, you can contact me via nguyenthieu2102@gmail.com

License

The project is licensed under GNU General Public License (GPL) V3 license.
