Welcome to MAFESE’s documentation!¶
MAFESE (Metaheuristic Algorithms for FEature SElection) is the largest Python library focused on feature selection using meta-heuristic algorithms.
Free software: GNU General Public License (GPL) V3
Total Wrapper-based (Metaheuristic Algorithms): > 200 methods
Total Filter-based (Statistical-based): > 15 methods
Total Embedded-based (Tree and Lasso): > 10 methods
Total Unsupervised-based: >= 4 methods
Total datasets: >= 30 (47 classification and 7 regression datasets)
Total performance metrics: >= 61 (45 regression and 16 classification metrics)
Total objective functions (as fitness functions): >= 61 (45 regression and 16 classification objectives)
Documentation: https://mafese.readthedocs.io/en/latest/
Python versions: >= 3.7.x
Dependencies: numpy, scipy, scikit-learn, pandas, mealpy, permetrics, plotly, kaleido
Features¶
- Our library provides all state-of-the-art feature selection methods:
  - Filter-based FS
  - Embedded-based FS
    - Regularization (Lasso-based)
    - Tree-based methods
  - Wrapper-based FS
    - Sequential-based: forward and backward
    - Recursive-based
    - MHA-based: Metaheuristic Algorithms
  - Unsupervised-based FS
We have implemented all feature selection methods on top of scipy, scikit-learn and numpy to increase the speed of the algorithms.
Installation¶
Install the current PyPI release:
$ pip install mafese==0.1.9
Install directly from source code:
$ git clone https://github.com/thieu1995/mafese.git
$ cd mafese
$ python setup.py install
In case you want to install the development version from GitHub:
$ pip install git+https://github.com/thieu1995/mafese
After installation, you can import MAFESE as any other Python module:
$ python
>>> import mafese
>>> mafese.__version__
Lib’s structure¶
Current structure:
docs
examples
mafese
    data/
        cls/
            ...csv
        reg/
            ...csv
    wrapper/
        mha.py
        recursive.py
        sequential.py
    embedded/
        lasso.py
        tree.py
    filter.py
    unsupervised.py
    utils/
        correlation.py
        data_loader.py
        encoder.py
        estimator.py
        mealpy_util.py
        transfer.py
        validator.py
    __init__.py
    selector.py
README.md
setup.py
Examples¶
Let’s go through some examples.
First, you need to load your dataset, or you can load one of the datasets available in MAFESE:
# Load available dataset from MAFESE
from mafese import get_dataset
# Try unknown data
get_dataset("unknown")
# Enter: 1
data = get_dataset("Arrhythmia")
Load your own dataset if you want:
import pandas as pd
from mafese import Data
# load X and y
# NOTE mafese accepts numpy arrays only, hence the .values attribute
dataset = pd.read_csv('examples/dataset.csv', index_col=0).values
X, y = dataset[:, 0:-1], dataset[:, -1]
data = Data(X, y)
Next, split dataset into train and test set:
data.split_train_test(test_size=0.2, inplace=True)
print(data.X_train[:2].shape)
print(data.y_train[:2].shape)
Next, here is how to use the Recursive wrapper-based method:
from mafese.wrapper.recursive import RecursiveSelector
# define mafese feature selection method
feat_selector = RecursiveSelector(problem="classification", estimator="rf", n_features=5)
# find all relevant features - 5 features should be selected
feat_selector.fit(data.X_train, data.y_train)
# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)
# check the index of selected features
print(feat_selector.selected_feature_indexes)
# call transform() on X to filter it down to selected features
X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)
Or, here is how to use the Sequential (backward or forward) wrapper-based method:
from mafese.wrapper.sequential import SequentialSelector
# define mafese feature selection method
feat_selector = SequentialSelector(problem="classification", estimator="knn", n_features=3, direction="forward")
# find all relevant features - 3 features should be selected
feat_selector.fit(data.X_train, data.y_train)
# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)
# check the index of selected features
print(feat_selector.selected_feature_indexes)
# call transform() on X to filter it down to selected features
X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)
Or, how to use Filter-based feature selection with different correlation methods:
from mafese.filter import FilterSelector
# define mafese feature selection method
feat_selector = FilterSelector(problem='classification', method='SPEARMAN', n_features=5)
# find all relevant features - 5 features should be selected
feat_selector.fit(data.X_train, data.y_train)
# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)
# check the index of selected features
print(feat_selector.selected_feature_indexes)
# call transform() on X to filter it down to selected features
X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)
Or, use Metaheuristic-based feature selection with different metaheuristic algorithms:
from mafese.wrapper.mha import MhaSelector
from mafese import get_dataset
from mafese import evaluator
from sklearn.svm import SVC
data = get_dataset("Arrhythmia")
data.split_train_test(test_size=0.2)
print(data.X_train.shape, data.X_test.shape) # (361, 279) (91, 279)
# define mafese feature selection method
feat_selector = MhaSelector(problem="classification", estimator="knn",
optimizer="BaseGA", optimizer_paras=None,
transfer_func="vstf_01", obj_name="AS")
# find all relevant features
feat_selector.fit(data.X_train, data.y_train, fit_weights=(0.9, 0.1), verbose=True)
# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)
# check the index of selected features
print(feat_selector.selected_feature_indexes)
# call transform() on X to filter it down to selected features
X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)
# Evaluate final dataset with different estimator with multiple performance metrics
results = feat_selector.evaluate(estimator=SVC(), data=data, metrics=["AS", "PS", "RS"])
print(results)
# {'AS_train': 0.77176, 'PS_train': 0.54177, 'RS_train': 0.6205, 'AS_test': 0.72636, 'PS_test': 0.34628, 'RS_test': 0.52747}
Or, use Lasso-based feature selection with a different estimator:
from mafese.embedded.lasso import LassoSelector
from mafese import get_dataset
from mafese import evaluator
from sklearn.svm import SVC
data = get_dataset("Arrhythmia")
data.split_train_test(test_size=0.2)
print(data.X_train.shape, data.X_test.shape) # (361, 279) (91, 279)
# define mafese feature selection method
feat_selector = LassoSelector(problem="classification", estimator="lasso", estimator_paras={"alpha": 0.1})
# find all relevant features
feat_selector.fit(data.X_train, data.y_train)
# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)
# check the index of selected features
print(feat_selector.selected_feature_indexes)
# call transform() on X to filter it down to selected features
X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)
# Evaluate final dataset with different estimator with multiple performance metrics
results = feat_selector.evaluate(estimator=SVC(), data=data, metrics=["AS", "PS", "RS"])
print(results)
# {'AS_train': 0.77176, 'PS_train': 0.54177, 'RS_train': 0.6205, 'AS_test': 0.72636, 'PS_test': 0.34628, 'RS_test': 0.52747}
Or, use Tree-based feature selection with a different estimator:
from mafese.embedded.tree import TreeSelector
from mafese import get_dataset
from mafese import evaluator
from sklearn.svm import SVC
data = get_dataset("Arrhythmia")
data.split_train_test(test_size=0.2)
print(data.X_train.shape, data.X_test.shape) # (361, 279) (91, 279)
# define mafese feature selection method
feat_selector = TreeSelector(problem="classification", estimator="tree")
# find all relevant features
feat_selector.fit(data.X_train, data.y_train)
# check selected features - True (or 1) is selected, False (or 0) is not selected
print(feat_selector.selected_feature_masks)
print(feat_selector.selected_feature_solution)
# check the index of selected features
print(feat_selector.selected_feature_indexes)
# call transform() on X to filter it down to selected features
X_train_selected = feat_selector.transform(data.X_train)
X_test_selected = feat_selector.transform(data.X_test)
# Evaluate final dataset with different estimator with multiple performance metrics
results = feat_selector.evaluate(estimator=SVC(), data=data, metrics=["AS", "PS", "RS"])
print(results)
# {'AS_train': 0.77176, 'PS_train': 0.54177, 'RS_train': 0.6205, 'AS_test': 0.72636, 'PS_test': 0.34628, 'RS_test': 0.52747}
For more usage examples, please look at the examples folder in the repository.
MAFESE Library¶
mafese.selector¶
- class mafese.selector.Selector(problem='classification')[source]¶
Bases:
abc.ABC
Defines an abstract class for Feature Selector.
- SUPPORTED_CLASSIFICATION_METRICS = ['PS', 'NPV', 'RS', 'AS', 'F1S', 'F2S', 'FBS', 'SS', 'MCC', 'HS', 'CKS', 'JSI', 'GMS', 'ROC-AUC', 'LS', 'GINI', 'CEL', 'HL', 'KLDL', 'BSL']¶
- SUPPORTED_ESTIMATORS = ['knn', 'svm', 'rf', 'adaboost', 'xgb', 'tree', 'ann']¶
- SUPPORTED_PROBLEMS = ['classification', 'regression']¶
- SUPPORTED_REGRESSION_METRICS = ['EVS', 'ME', 'MAE', 'MSE', 'RMSE', 'MSLE', 'MedAE', 'MRE', 'MRB', 'MAPE', 'SMAPE', 'MAAPE', 'MASE', 'NSE', 'NNSE', 'WI', 'R', 'PCC', 'AR', 'APCC', 'R2S', 'RSQ', 'R2', 'COD', 'AR2', 'ACOD', 'CI', 'DRV', 'KGE', 'GINI', 'GINI_WIKI', 'PCD', 'JSD', 'VAF', 'RAE', 'A10', 'A20', 'A30', 'NRMSE', 'RSE', 'COV', 'COR', 'EC', 'OI', 'CRM']¶
- evaluate(estimator=None, estimator_paras=None, data=None, metrics=None)[source]¶
Evaluate the new dataset. We will re-train the estimator with the training set and return the metrics of both the training and testing sets.
- Parameters
estimator (str or Estimator instance (from scikit-learn or custom)) –
- If estimator is a str, we currently support:
knn: k-nearest neighbors
svm: support vector machine
rf: random forest
adaboost: AdaBoost
xgb: Gradient Boosting
tree: Extra Trees
ann: Artificial Neural Network (Multi-Layer Perceptron)
If estimator is an Estimator instance: you need to make sure that it has fit and predict methods
estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, we use the default parameters for the selected estimator
data (Data, an instance of Data class. It must have training and testing set) –
metrics (tuple, list, default = None) – Depends on whether you are tackling a regression or classification problem. The supported metrics can be found at: https://github.com/thieu1995/permetrics
- Returns
metrics_results – The metrics for both training and testing set.
- Return type
dict.
- fit(X, y=None)[source]¶
Learn the features to select from X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.
y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.
- Returns
self – Returns the instance itself.
- Return type
object
- fit_transform(X, y=None, **fit_params)[source]¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray array of shape (n_samples, n_features_new)
- name = 'Feature Selector'¶
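Every concrete selector inherits this fit / transform / evaluate workflow. Below is a minimal sketch of that shared interface, using FilterSelector as the concrete class; the "Arrhythmia" dataset, the n_features value, and the estimator_paras dict are illustrative choices only, not fixed defaults of the library:

from mafese import get_dataset
from mafese.filter import FilterSelector

# load a bundled dataset and split it into training and testing sets
data = get_dataset("Arrhythmia")
data.split_train_test(test_size=0.2, inplace=True)

# any concrete Selector exposes the same interface inherited from Selector
feat_selector = FilterSelector(problem="classification", method="ANOVA", n_features=10)
feat_selector.fit(data.X_train, data.y_train)

# reduce both sets to the selected features
X_train_new = feat_selector.transform(data.X_train)
X_test_new = feat_selector.transform(data.X_test)

# re-train a string-named estimator on the reduced data and get train/test metrics;
# the {"n_neighbors": 7} parameters are illustrative scikit-learn KNN settings
results = feat_selector.evaluate(estimator="knn", estimator_paras={"n_neighbors": 7},
                                 data=data, metrics=["AS", "PS", "RS"])
print(results)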
mafese.filter¶
- class mafese.filter.FilterSelector(problem='classification', method='ANOVA', n_features=3, n_neighbors=5, n_bins=10, normalized=True)[source]¶
Bases:
mafese.selector.Selector
Defines a FilterSelector class that holds all filter methods for feature selection problems
- Parameters
problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”
method (str, default = "ANOVA") –
If problem = “classification”, FilterSelector’s supported method can be one of these values:
”CHI”: Chi-Squared statistic
”ANOVA”: ANOVA F-score
”MI”: Mutual information
”KENDALL”: Kendall Tau correlation
”SPEARMAN”: Spearman’s Rho correlation
”POINT”: Point-biserial correlation
”RELIEF”: Original Relief method
”RELIEF-F”: Weighted average Relief based on the frequency of each class
”VLS-RELIEF-F”: Very Large Scale ReliefF
If problem = “regression”, FilterSelector’s supported method can be one of these values:
”PEARSON”: Pearson correlation
”ANOVA”: ANOVA F-score
”MI”: Mutual information
”KENDALL”: Kendall Tau correlation
”SPEARMAN”: Spearman’s Rho correlation
”POINT”: Point-biserial correlation
”RELIEF”: Original Relief method
”RELIEF-F”: Weighted average Relief based on the frequency of each class
”VLS-RELIEF-F”: Very Large Scale ReliefF
n_features (int or float, default=3) – If integer, the parameter is the absolute number of features to select. If float between 0 and 1, it is the fraction of features to select.
n_neighbors (int, default=5, Optional) – Number of neighbors to use for computing feature importance scores of Relief-based family
n_bins (int, default=10, Optional) – Number of bins to use for discretizing the target variable of Relief-based family in regression problems.
normalized (bool, default=True, Optional) – Normalize feature importance scores by the number of instances in the dataset
- n_features¶
The number of selected features.
- Type
int
- supported_methods¶
Key: the supported method name. Value: the supported method function.
- Type
dict
- method_name¶
The method that will be used
- Type
str
Examples
The following example shows how to retrieve the most informative features in the FilterSelector FS method
>>> import pandas as pd
>>> from mafese.filter import FilterSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]     # Assumption that the last column is label column
>>> # define mafese feature selection method
>>> feat_selector = FilterSelector(problem='classification', method='SPEARMAN', n_features=5)
>>> # find all relevant features
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> print(feat_selector.selected_feature_masks)
array([ True, True, True, False, False, True, False, False, False, True])
>>> print(feat_selector.selected_feature_solution)
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> print(feat_selector.selected_feature_indexes)
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
- SUPPORT = {'classification': {'ANOVA': 'f_classification_func', 'CHI': 'chi2_func', 'KENDALL': 'kendall_func', 'MI': 'mutual_info_classif', 'POINT': 'point_func', 'RELIEF': 'relief_func', 'RELIEF-F': 'relief_f_func', 'SPEARMAN': 'spearman_func', 'VLS-RELIEF-F': 'vls_relief_f_func'}, 'regression': {'ANOVA': 'f_regression_func', 'KENDALL': 'kendall_func', 'MI': 'mutual_info_regression', 'PEARSON': 'r_regression', 'POINT': 'point_func', 'RELIEF': 'relief_func', 'RELIEF-F': 'relief_f_func', 'SPEARMAN': 'spearman_func', 'VLS-RELIEF-F': 'vls_relief_f_func'}}¶
- fit(X, y=None)[source]¶
Learn the features to select from X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.
y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.
- Returns
self – Returns the instance itself.
- Return type
object
- fit_transform(X, y=None, **fit_params)[source]¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray array of shape (n_samples, n_features_new)
mafese submodule¶
mafese.utils package¶
mafese.utils.correlation module¶
Refs:
1. https://docs.scipy.org/doc/scipy/reference/stats.html#correlation-functions
2. https://scikit-learn.org/stable/modules/classes.html#module-sklearn.feature_selection
- mafese.utils.correlation.relief_f_func(X, y, n_neighbors=5, n_bins=10, problem='classification', normalized=True, **kwargs)[source]¶
Performs Relief-F feature selection on the input dataset X and target variable y. Returns a vector of feature importance scores
- Parameters
X (numpy array) – Input dataset of shape (n_samples, n_features).
y (numpy array) – Target variable of shape (n_samples,).
n_neighbors (int, default=5) – Number of neighbors to use for computing feature importance scores.
n_bins (int, default=10) – Number of bins to use for discretizing the target variable in regression problems.
problem (str) – The problem of the dataset, either regression or classification. If regression, the target variable is discretized into n_bins classes
normalized (bool, default=True) – Normalize feature importance scores by the number of instances in the dataset
- Returns
importance score – Vector of feature importance scores, with shape (n_features,).
- Return type
np.ndarray
- mafese.utils.correlation.relief_func(X, y, n_neighbors=5, n_bins=10, problem='classification', normalized=True, **kwargs)[source]¶
Performs Relief feature selection on the input dataset X and target variable y. Returns a vector of feature importance scores.
- Parameters
X (numpy array) – Input dataset of shape (n_samples, n_features).
y (numpy array) – Target variable of shape (n_samples,).
n_neighbors (int, default=5) – Number of neighbors to use for computing feature importance scores.
n_bins (int, default=10) – Number of bins to use for discretizing the target variable in regression problems.
problem (str) – The problem of the dataset, either regression or classification. If regression, the target variable is discretized into n_bins classes
normalized (bool, default=True) – Normalize feature importance scores by the number of instances in the dataset
- Returns
importance score – Vector of feature importance scores, with shape (n_features,).
- Return type
np.ndarray
- mafese.utils.correlation.select_bests(importance_scores=None, n_features=3)[source]¶
Select features according to the k highest scores or percentile of the highest scores.
- Parameters
importance_scores (array-like of shape (n_features,)) – Scores of features.
n_features (int, float. default=3) –
Number of selected features.
If float, it should be in the range (0, 1). That represents the percentile of the highest scores.
If int, it should be in the range (1, N-1), where N is the total number of features in your dataset.
- Returns
mask – Boolean mask of the selected features.
- Return type
np.ndarray
- mafese.utils.correlation.vls_relief_f_func(X, y, n_neighbors=5, n_bins=10, problem='classification', normalized=True, **kwargs)[source]¶
Performs Very Large Scale ReliefF feature selection on the input dataset X and target variable y. Returns a vector of feature importance scores
- Parameters
X (numpy array) – Input dataset of shape (n_samples, n_features).
y (numpy array) – Target variable of shape (n_samples,).
n_neighbors (int, default=5) – Number of neighbors to use for computing feature importance scores.
n_bins (int, default=10) – Number of bins to use for discretizing the target variable in regression problems.
problem (str) – The problem of the dataset, either regression or classification. If regression, the target variable is discretized into n_bins classes
normalized (bool, default=True) – Normalize feature importance scores by the number of instances in the dataset
- Returns
importance score – Vector of feature importance scores, with shape (n_features,).
- Return type
np.ndarray
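A minimal sketch of how these scoring functions can be combined with select_bests. The random toy data is purely illustrative, and the returned mask is assumed to be a boolean array usable for column indexing (as suggested by the docs above):

import numpy as np
from mafese.utils.correlation import relief_f_func, select_bests

# toy data: 100 samples, 8 features, binary labels (illustrative only)
rng = np.random.default_rng(42)
X = rng.random((100, 8))
y = rng.integers(0, 2, size=100)

# score every feature with Relief-F
scores = relief_f_func(X, y, n_neighbors=5, problem="classification")

# keep the 3 highest-scoring features; a float in (0, 1) would select a fraction instead
mask = select_bests(importance_scores=scores, n_features=3)
X_selected = X[:, mask]
print(scores.shape, mask, X_selected.shape)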
mafese.utils.data_loader module¶
- class mafese.utils.data_loader.Data(X, y)[source]¶
Bases:
object
The structure of our supported Data class
- Parameters
X (np.ndarray) – The features of your data
y (np.ndarray) – The labels of your data
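A short sketch of the Data workflow described above; the CSV path is a placeholder for your own file:

import pandas as pd
from mafese import Data

# X and y must be numpy arrays, hence .values
dataset = pd.read_csv("examples/dataset.csv", index_col=0).values
X, y = dataset[:, 0:-1], dataset[:, -1]

data = Data(X, y)
data.split_train_test(test_size=0.2, inplace=True)   # creates X_train, X_test, y_train, y_test
print(data.X_train.shape, data.X_test.shape)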
mafese.utils.encoder module¶
- class mafese.utils.encoder.LabelEncoder[source]¶
Bases:
object
Encode categorical features as integer labels.
- fit_transform(y)[source]¶
Fit label encoder and return encoded labels.
- Parameters
y (array-like of shape (n_samples,)) – Target values.
- Returns
y – Encoded labels.
- Return type
array-like of shape (n_samples,)
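A minimal sketch of encoding string class labels with this encoder, assuming only the fit_transform behaviour documented above; the exact integer mapping is decided by the encoder:

import numpy as np
from mafese.utils.encoder import LabelEncoder

y = np.array(["cat", "dog", "cat", "bird"])
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
print(y_encoded)   # integer labels, e.g. one integer per distinct class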
mafese.utils.estimator module¶
mafese.utils.mealpy_util module¶
- class mafese.utils.mealpy_util.FeatureSelectionProblem(lb, ub, minmax, data=None, estimator=None, transfer_func=None, obj_name=None, metric_class=None, fit_weights=(0.9, 0.1), fit_sign=1, obj_paras=None, name='Feature Selection Problem', **kwargs)[source]¶
Bases:
mealpy.utils.problem.Problem
- amend_position(position=None, lb=None, ub=None)[source]¶
The goal is to transform the solution into the right format corresponding to the problem. For example, with discrete problems, floating-point numbers must be converted to integers to ensure the solution is in the correct format.
- Parameters
position – vector position (location) of the solution.
lb – list of lower bound values
ub – list of upper bound values
- Returns
Amended position (make the right format of the solution)
mafese.utils.transfer module¶
mafese.utils.validator module¶
mafese.wrapper package¶
mafese.wrapper.recursive module¶
- class mafese.wrapper.recursive.RecursiveSelector(problem='classification', estimator='knn', estimator_paras=None, n_features=3, step=1, verbose=0, importance_getter='auto')[source]¶
Bases:
mafese.selector.Selector
Defines a RecursiveSelector class that holds all Recursive Feature Selection methods for feature selection problems
- Parameters
problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”
estimator (str or Estimator instance (from scikit-learn or custom)) –
- If estimator is a str, we currently support:
svm: support vector machine with kernel = ‘linear’
rf: random forest
adaboost: AdaBoost
xgb: Gradient Boosting
tree: Extra Trees
If estimator is an Estimator instance: you need to make sure it has a fit method that provides information about feature importance (e.g. coef_, feature_importances_).
estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, we use the best parameters for the selected estimator
n_features (int or float, default=3) – The number of features to select. If None, half of the features are selected. If integer, the parameter is the absolute number of features to select. If float between 0 and 1, it is the fraction of features to select.
step (int or float, default=1) – If greater than or equal to 1, then step corresponds to the (integer) number of features to remove at each iteration. If within (0.0, 1.0), then step corresponds to the percentage (rounded down) of features to remove at each iteration.
verbose (int, default=0) – Controls verbosity of output.
importance_getter (str or callable, default='auto') –
If 'auto', uses the feature importance either through a coef_ or feature_importances_ attribute of the estimator.
Also accepts a string that specifies an attribute name/path for extracting feature importance (implemented with attrgetter). For example, give regressor_.coef_ in case of TransformedTargetRegressor, or named_steps.clf.feature_importances_ in case of sklearn.pipeline.Pipeline with its last step named clf.
If callable, overrides the default feature importance getter. The callable is passed the fitted estimator and should return the importance for each feature.
Examples
The following example shows how to retrieve the most informative features in the RecursiveSelector FS method
>>> import pandas as pd
>>> from mafese.wrapper.recursive import RecursiveSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]     # Assumption that the last column is label column
>>> # define mafese feature selection method
>>> feat_selector = RecursiveSelector(problem="classification", estimator="rf", n_features=5)
>>> # find all relevant features
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> print(feat_selector.selected_feature_masks)
array([ True, True, True, False, False, True, False, False, False, True])
>>> print(feat_selector.selected_feature_solution)
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> print(feat_selector.selected_feature_indexes)
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
- SUPPORT = ['svm', 'rf', 'adaboost', 'xgb', 'tree']¶
- fit(X, y=None)[source]¶
Learn the features to select from X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.
y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.
- Returns
self – Returns the instance itself.
- Return type
object
mafese.wrapper.sequential module¶
- class mafese.wrapper.sequential.SequentialSelector(problem='classification', estimator='knn', estimator_paras=None, n_features=3, direction='forward', tol=None, scoring=None, cv=5, n_jobs=None)[source]¶
Bases:
mafese.selector.Selector
Defines a SequentialSelector class that holds all Forward or Backward Feature Selection methods for feature selection problems
- Parameters
problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”
estimator (str or Estimator instance (from scikit-learn or custom)) –
- If estimator is a str, we currently support:
knn: k-nearest neighbors
svm: support vector machine
rf: random forest
adaboost: AdaBoost
xgb: Gradient Boosting
tree: Extra Trees
ann: Artificial Neural Network (Multi-Layer Perceptron)
If estimator is an Estimator instance: you need to make sure it has a fit method that provides information about feature importance (e.g. coef_, feature_importances_).
estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, we use the default parameters for the selected estimator
n_features (int or float, default=3) – The number of features to select. If None, half of the features are selected. If integer, the parameter is the absolute number of features to select. If float between 0 and 1, it is the fraction of features to select.
direction ({'forward', 'backward'}, default='forward') – Whether to perform forward selection or backward selection.
tol (float, default=None) – If the score is not incremented by at least tol between two consecutive feature additions or removals, stop adding or removing. tol can be negative when removing features using direction=”backward”. It can be useful to reduce the number of features at the cost of a small decrease in the score. tol is enabled only when n_features is “auto”.
scoring (str or callable, default=None) – A single str (see scoring_parameter) or a callable to evaluate the predictions on the test set. NOTE that when using a custom scorer, it should return a single value. If None, the estimator’s score method is used.
cv (int, cross-validation generator or an iterable, default=None) –
Determines the cross-validation splitting strategy. Possible inputs for cv are:
None, to use the default 5-fold cross validation,
integer, to specify the number of folds in a (Stratified)KFold,
CV splitter,
An iterable yielding (train, test) splits as arrays of indices.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used. These splitters are instantiated with shuffle=False so the splits will be the same across calls.
n_jobs (int, default=None) – Number of jobs to run in parallel. When evaluating a new feature to add or remove, the cross-validation procedure is parallel over the folds. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
Examples
The following example shows how to retrieve the most informative features in the Sequential-based (forward, backward) FS method
>>> import pandas as pd
>>> from mafese.wrapper.sequential import SequentialSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]     # Assumption that the last column is label column
>>> # define mafese feature selection method
>>> feat_selector = SequentialSelector(problem="classification", estimator="knn", n_features=5, direction="forward")
>>> # find all relevant features
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> print(feat_selector.selected_feature_masks)
array([ True, True, True, False, False, True, False, False, False, True])
>>> print(feat_selector.selected_feature_solution)
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> print(feat_selector.selected_feature_indexes)
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
- SUPPORT = ['knn', 'svm', 'rf', 'adaboost', 'xgb', 'tree', 'ann']¶
- fit(X, y=None)[source]¶
Learn the features to select from X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.
y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.
- Returns
self – Returns the instance itself.
- Return type
object
mafese.wrapper.mha module¶
- class mafese.wrapper.mha.MhaSelector(problem='classification', estimator='knn', estimator_paras=None, optimizer='BaseGA', optimizer_paras=None, transfer_func='vstf_01', obj_name=None)[source]¶
Bases:
mafese.selector.Selector
Defines a MhaSelector class that holds all Metaheuristic-based Feature Selection methods for feature selection problems
- Parameters
problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”
estimator (str or Estimator instance (from scikit-learn or custom)) –
- If estimator is a str, we currently support:
knn: k-nearest neighbors
svm: support vector machine
rf: random forest
adaboost: AdaBoost
xgb: Gradient Boosting
tree: Extra Trees
ann: Artificial Neural Network (Multi-Layer Perceptron)
If estimator is an Estimator instance: you need to make sure that it has fit and predict methods
estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, we use the default parameters for the selected estimator
optimizer (str or instance of Optimizer class (from Mealpy library), default = "BaseGA") – The Metaheuristic Algorithm that is used to solve the feature selection problem. For the currently supported list, please check: https://github.com/thieu1995/mealpy. If a custom optimizer is passed, make sure it is an instance of the Optimizer class.
optimizer_paras (None or dict of parameters, default=None) – The parameters for the optimizer object. If None, the default parameters of the optimizer are used (defined at https://github.com/thieu1995/mealpy). If a dict is passed, make sure it has at least the epoch and pop_size parameters.
transfer_func (str or callable function, default="vstf_01") –
- The transfer function used to convert solutions from float to integer. Currently supported list:
v-shape transfer function: “vstf_01”, “vstf_02”, “vstf_03”, “vstf_04”
s-shape transfer function: “sstf_01”, “sstf_02”, “sstf_03”, “sstf_04”
If a callable function, make sure it returns a list/tuple/np.ndarray of values.
obj_name (None or str, default=None) –
The name of the objective for the problem; it also depends on whether the problem is classification or regression.
If problem is classification, None will be replaced by AS (Accuracy score).
If problem is regression, None will be replaced by MSE (Mean squared error).
Examples
The following example shows how to retrieve the most informative features in the MhaSelector FS method
>>> import pandas as pd
>>> from mafese.wrapper.mha import MhaSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]     # Assumption that the last column is label column
>>> # define mafese feature selection method
>>> feat_selector = MhaSelector(problem="classification", estimator="rf", optimizer="BaseGA")
>>> # find all relevant features - 5 features should be selected
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> print(feat_selector.selected_feature_masks)
array([ True, True, True, False, False, True, False, False, False, True])
>>> print(feat_selector.selected_feature_solution)
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> print(feat_selector.selected_feature_indexes)
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
- SUPPORT = {'classification_objective': {'AS': 'max', 'BSL': 'min', 'CEL': 'min', 'CKS': 'max', 'F1S': 'max', 'F2S': 'max', 'FBS': 'max', 'GINI': 'min', 'GMS': 'max', 'HL': 'min', 'HS': 'max', 'JSI': 'max', 'KLDL': 'min', 'LS': 'max', 'MCC': 'max', 'NPV': 'max', 'PS': 'max', 'ROC-AUC': 'max', 'RS': 'max', 'SS': 'max'}, 'estimator': ['knn', 'svm', 'rf', 'adaboost', 'xgb', 'tree', 'ann'], 'optimizer': ['OriginalABC', 'OriginalACOR', 'AugmentedAEO', 'EnhancedAEO', 'ImprovedAEO', 'ModifiedAEO', 'OriginalAEO', 'MGTO', 'OriginalAGTO', 'BaseALO', 'OriginalALO', 'OriginalAO', 'OriginalAOA', 'IARO', 'LARO', 'OriginalARO', 'OriginalASO', 'OriginalAVOA', 'OriginalArchOA', 'AdaptiveBA', 'ModifiedBA', 'OriginalBA', 'BaseBBO', 'OriginalBBO', 'OriginalBBOA', 'OriginalBES', 'ABFO', 'OriginalBFO', 'OriginalBMO', 'BaseBRO', 'OriginalBRO', 'OriginalBSA', 'ImprovedBSO', 'OriginalBSO', 'CleverBookBeesA', 'OriginalBeesA', 'ProbBeesA', 'OriginalCA', 'OriginalCDO', 'OriginalCEM', 'OriginalCGO', 'BaseCHIO', 'OriginalCHIO', 'OriginalCOA', 'OCRO', 'OriginalCRO', 'OriginalCSA', 'OriginalCSO', 'OriginalCircleSA', 'OriginalCoatiOA', 'BaseDE', 'JADE', 'SADE', 'SAP_DE', 'DevDMOA', 'OriginalDMOA', 'OriginalDO', 'BaseEFO', 'OriginalEFO', 'OriginalEHO', 'AdaptiveEO', 'ModifiedEO', 'OriginalEO', 'OriginalEOA', 'LevyEP', 'OriginalEP', 'CMA_ES', 'LevyES', 'OriginalES', 'Simple_CMA_ES', 'OriginalESOA', 'OriginalEVO', 'OriginalFA', 'BaseFBIO', 'OriginalFBIO', 'OriginalFFA', 'OriginalFFO', 'OriginalFLA', 'BaseFOA', 'OriginalFOA', 'WhaleFOA', 'OriginalFOX', 'OriginalFPA', 'BaseGA', 'EliteMultiGA', 'EliteSingleGA', 'MultiGA', 'SingleGA', 'OriginalGBO', 'BaseGCO', 'OriginalGCO', 'OriginalGJO', 'OriginalGOA', 'BaseGSKA', 'OriginalGSKA', 'Matlab101GTO', 'Matlab102GTO', 'OriginalGTO', 'GWO_WOA', 'IGWO', 'OriginalGWO', 'RW_GWO', 'OriginalHBA', 'OriginalHBO', 'OriginalHC', 'SwarmHC', 'OriginalHCO', 'OriginalHGS', 'OriginalHGSO', 'OriginalHHO', 'BaseHS', 'OriginalHS', 'OriginalICA', 'OriginalINFO', 'OriginalIWO', 'BaseJA', 'LevyJA', 'OriginalJA', 'BaseLCO', 'ImprovedLCO', 'OriginalLCO', 'OriginalMA', 'BaseMFO', 'OriginalMFO', 'OriginalMGO', 'OriginalMPA', 'OriginalMRFO', 'WMQIMRFO', 'OriginalMSA', 'BaseMVO', 'OriginalMVO', 'OriginalNGO', 'ImprovedNMRA', 'OriginalNMRA', 'OriginalNRO', 'OriginalOOA', 'OriginalPFA', 'OriginalPOA', 'CL_PSO', 'C_PSO', 'HPSO_TVAC', 'OriginalPSO', 'PPSO', 'OriginalPSS', 'BaseQSA', 'ImprovedQSA', 'LevyQSA', 'OppoQSA', 'OriginalQSA', 'OriginalRIME', 'OriginalRUN', 'GaussianSA', 'OriginalSA', 'SwarmSA', 'BaseSARO', 'OriginalSARO', 'BaseSBO', 'OriginalSBO', 'BaseSCA', 'OriginalSCA', 'QleSCA', 'OriginalSCSO', 'ImprovedSFO', 'OriginalSFO', 'L_SHADE', 'OriginalSHADE', 'OriginalSHIO', 'OriginalSHO', 'ImprovedSLO', 'ModifiedSLO', 'OriginalSLO', 'BaseSMA', 'OriginalSMA', 'DevSOA', 'OriginalSOA', 'OriginalSOS', 'DevSPBO', 'OriginalSPBO', 'OriginalSRSR', 'BaseSSA', 'OriginalSSA', 'OriginalSSDO', 'OriginalSSO', 'OriginalSSpiderA', 'OriginalSSpiderO', 'OriginalSTO', 'OriginalSeaHO', 'OriginalServalOA', 'OriginalTDO', 'BaseTLO', 'ImprovedTLO', 'OriginalTLO', 'OriginalTOA', 'OriginalTPO', 'OriginalTS', 'OriginalTSA', 'OriginalTSO', 'EnhancedTWO', 'LevyTWO', 'OppoTWO', 'OriginalTWO', 'BaseVCS', 'OriginalVCS', 'OriginalWCA', 'OriginalWDO', 'OriginalWHO', 'HI_WOA', 'OriginalWOA', 'OriginalWaOA', 'OriginalWarSO', 'OriginalZOA'], 'regression_objective': {'A10': 'max', 'A20': 'max', 'A30': 'max', 'ACOD': 'max', 'APCC': 'max', 'AR': 'max', 'AR2': 'max', 'CI': 'max', 'COD': 'max', 'COR': 'max', 'COV': 'max', 'CRM': 'min', 'DRV': 
'min', 'EC': 'max', 'EVS': 'max', 'GINI': 'min', 'GINI_WIKI': 'min', 'JSD': 'min', 'KGE': 'max', 'MAAPE': 'min', 'MAE': 'min', 'MAPE': 'min', 'MASE': 'min', 'ME': 'min', 'MRB': 'min', 'MRE': 'min', 'MSE': 'min', 'MSLE': 'min', 'MedAE': 'min', 'NNSE': 'max', 'NRMSE': 'min', 'NSE': 'max', 'OI': 'max', 'PCC': 'max', 'PCD': 'max', 'R': 'max', 'R2': 'max', 'R2S': 'max', 'RAE': 'min', 'RMSE': 'min', 'RSE': 'min', 'RSQ': 'max', 'SMAPE': 'min', 'VAF': 'max', 'WI': 'max'}, 'transfer_func': ['vstf_01', 'vstf_02', 'vstf_03', 'vstf_04', 'sstf_01', 'sstf_02', 'sstf_03', 'sstf_04']}¶
- fit(X, y=None, fit_weights=(0.9, 0.1), verbose=True, mode='single', n_workers=None, termination=None)[source]¶
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples.
y (array-like of shape (n_samples,)) – The target values.
fit_weights (list, tuple or np.ndarray, default = (0.9, 0.1)) – The first weight is for objective value and the second weight is for the number of features
verbose (bool, default = True) – Controls verbosity of output.
mode (str, default = 'single') –
The mode used in the Optimizer from the Mealpy library. Parallel: ‘process’, ‘thread’; Sequential: ‘swarm’, ‘single’.
’process’: The parallel mode, multiple cores run the tasks
’thread’: The parallel mode, multiple threads run the tasks
’swarm’: The sequential mode that has no effect on the updating phase of other agents
’single’: The sequential mode that affects the updating phase of other agents (default)
n_workers (int or None, default = None) – The number of workers (cores or threads) to do the tasks (effective only in parallel mode)
termination (dict or None, default = None) – The termination dictionary or an instance of the Termination class. It is for the Optimizer from the Mealpy library.
- fit_transform(X, y=None, fit_weights=(0.9, 0.1), verbose=True, mode='single', n_workers=None, termination=None)[source]¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray array of shape (n_samples, n_features_new)
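A minimal sketch of passing custom optimizer parameters and fit options to MhaSelector; the choice of OriginalWOA and the epoch/pop_size values are illustrative only (any optimizer name from the SUPPORT list above can be used):

from mafese import get_dataset
from mafese.wrapper.mha import MhaSelector

data = get_dataset("Arrhythmia")
data.split_train_test(test_size=0.2, inplace=True)

# optimizer_paras needs at least epoch and pop_size (values here are illustrative)
feat_selector = MhaSelector(problem="classification", estimator="knn",
                            optimizer="OriginalWOA",
                            optimizer_paras={"epoch": 50, "pop_size": 20},
                            transfer_func="vstf_01", obj_name="AS")

# fit_weights balances the objective value against the number of selected features
feat_selector.fit(data.X_train, data.y_train,
                  fit_weights=(0.9, 0.1), verbose=True, mode="single")

X_train_new = feat_selector.transform(data.X_train)
print(feat_selector.selected_feature_indexes)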
- class mafese.wrapper.mha.MultiMhaSelector(problem='classification', estimator='knn', estimator_paras=None, list_optimizers=('BaseGA',), list_optimizer_paras=None, transfer_func='vstf_01', obj_name=None)[source]¶
Bases:
mafese.selector.Selector
- SUPPORT = {'classification_objective': {'AS': 'max', 'BSL': 'min', 'CEL': 'min', 'CKS': 'max', 'F1S': 'max', 'F2S': 'max', 'FBS': 'max', 'GINI': 'min', 'GMS': 'max', 'HL': 'min', 'HS': 'max', 'JSI': 'max', 'KLDL': 'min', 'LS': 'max', 'MCC': 'max', 'NPV': 'max', 'PS': 'max', 'ROC-AUC': 'max', 'RS': 'max', 'SS': 'max'}, 'estimator': ['knn', 'svm', 'rf', 'adaboost', 'xgb', 'tree', 'ann'], 'optimizer': ['OriginalABC', 'OriginalACOR', 'AugmentedAEO', 'EnhancedAEO', 'ImprovedAEO', 'ModifiedAEO', 'OriginalAEO', 'MGTO', 'OriginalAGTO', 'BaseALO', 'OriginalALO', 'OriginalAO', 'OriginalAOA', 'IARO', 'LARO', 'OriginalARO', 'OriginalASO', 'OriginalAVOA', 'OriginalArchOA', 'AdaptiveBA', 'ModifiedBA', 'OriginalBA', 'BaseBBO', 'OriginalBBO', 'OriginalBBOA', 'OriginalBES', 'ABFO', 'OriginalBFO', 'OriginalBMO', 'BaseBRO', 'OriginalBRO', 'OriginalBSA', 'ImprovedBSO', 'OriginalBSO', 'CleverBookBeesA', 'OriginalBeesA', 'ProbBeesA', 'OriginalCA', 'OriginalCDO', 'OriginalCEM', 'OriginalCGO', 'BaseCHIO', 'OriginalCHIO', 'OriginalCOA', 'OCRO', 'OriginalCRO', 'OriginalCSA', 'OriginalCSO', 'OriginalCircleSA', 'OriginalCoatiOA', 'BaseDE', 'JADE', 'SADE', 'SAP_DE', 'DevDMOA', 'OriginalDMOA', 'OriginalDO', 'BaseEFO', 'OriginalEFO', 'OriginalEHO', 'AdaptiveEO', 'ModifiedEO', 'OriginalEO', 'OriginalEOA', 'LevyEP', 'OriginalEP', 'CMA_ES', 'LevyES', 'OriginalES', 'Simple_CMA_ES', 'OriginalESOA', 'OriginalEVO', 'OriginalFA', 'BaseFBIO', 'OriginalFBIO', 'OriginalFFA', 'OriginalFFO', 'OriginalFLA', 'BaseFOA', 'OriginalFOA', 'WhaleFOA', 'OriginalFOX', 'OriginalFPA', 'BaseGA', 'EliteMultiGA', 'EliteSingleGA', 'MultiGA', 'SingleGA', 'OriginalGBO', 'BaseGCO', 'OriginalGCO', 'OriginalGJO', 'OriginalGOA', 'BaseGSKA', 'OriginalGSKA', 'Matlab101GTO', 'Matlab102GTO', 'OriginalGTO', 'GWO_WOA', 'IGWO', 'OriginalGWO', 'RW_GWO', 'OriginalHBA', 'OriginalHBO', 'OriginalHC', 'SwarmHC', 'OriginalHCO', 'OriginalHGS', 'OriginalHGSO', 'OriginalHHO', 'BaseHS', 'OriginalHS', 'OriginalICA', 'OriginalINFO', 'OriginalIWO', 'BaseJA', 'LevyJA', 'OriginalJA', 'BaseLCO', 'ImprovedLCO', 'OriginalLCO', 'OriginalMA', 'BaseMFO', 'OriginalMFO', 'OriginalMGO', 'OriginalMPA', 'OriginalMRFO', 'WMQIMRFO', 'OriginalMSA', 'BaseMVO', 'OriginalMVO', 'OriginalNGO', 'ImprovedNMRA', 'OriginalNMRA', 'OriginalNRO', 'OriginalOOA', 'OriginalPFA', 'OriginalPOA', 'CL_PSO', 'C_PSO', 'HPSO_TVAC', 'OriginalPSO', 'PPSO', 'OriginalPSS', 'BaseQSA', 'ImprovedQSA', 'LevyQSA', 'OppoQSA', 'OriginalQSA', 'OriginalRIME', 'OriginalRUN', 'GaussianSA', 'OriginalSA', 'SwarmSA', 'BaseSARO', 'OriginalSARO', 'BaseSBO', 'OriginalSBO', 'BaseSCA', 'OriginalSCA', 'QleSCA', 'OriginalSCSO', 'ImprovedSFO', 'OriginalSFO', 'L_SHADE', 'OriginalSHADE', 'OriginalSHIO', 'OriginalSHO', 'ImprovedSLO', 'ModifiedSLO', 'OriginalSLO', 'BaseSMA', 'OriginalSMA', 'DevSOA', 'OriginalSOA', 'OriginalSOS', 'DevSPBO', 'OriginalSPBO', 'OriginalSRSR', 'BaseSSA', 'OriginalSSA', 'OriginalSSDO', 'OriginalSSO', 'OriginalSSpiderA', 'OriginalSSpiderO', 'OriginalSTO', 'OriginalSeaHO', 'OriginalServalOA', 'OriginalTDO', 'BaseTLO', 'ImprovedTLO', 'OriginalTLO', 'OriginalTOA', 'OriginalTPO', 'OriginalTS', 'OriginalTSA', 'OriginalTSO', 'EnhancedTWO', 'LevyTWO', 'OppoTWO', 'OriginalTWO', 'BaseVCS', 'OriginalVCS', 'OriginalWCA', 'OriginalWDO', 'OriginalWHO', 'HI_WOA', 'OriginalWOA', 'OriginalWaOA', 'OriginalWarSO', 'OriginalZOA'], 'regression_objective': {'A10': 'max', 'A20': 'max', 'A30': 'max', 'ACOD': 'max', 'APCC': 'max', 'AR': 'max', 'AR2': 'max', 'CI': 'max', 'COD': 'max', 'COR': 'max', 'COV': 'max', 'CRM': 'min', 'DRV': 
'min', 'EC': 'max', 'EVS': 'max', 'GINI': 'min', 'GINI_WIKI': 'min', 'JSD': 'min', 'KGE': 'max', 'MAAPE': 'min', 'MAE': 'min', 'MAPE': 'min', 'MASE': 'min', 'ME': 'min', 'MRB': 'min', 'MRE': 'min', 'MSE': 'min', 'MSLE': 'min', 'MedAE': 'min', 'NNSE': 'max', 'NRMSE': 'min', 'NSE': 'max', 'OI': 'max', 'PCC': 'max', 'PCD': 'max', 'R': 'max', 'R2': 'max', 'R2S': 'max', 'RAE': 'min', 'RMSE': 'min', 'RSE': 'min', 'RSQ': 'max', 'SMAPE': 'min', 'VAF': 'max', 'WI': 'max'}, 'transfer_func': ['vstf_01', 'vstf_02', 'vstf_03', 'vstf_04', 'sstf_01', 'sstf_02', 'sstf_03', 'sstf_04']}¶
- evaluate(estimator=None, estimator_paras=None, data=None, metrics=None, save_path='history', verbose=False)[source]¶
Evaluate the new dataset. We will re-train the estimator with the training set and return the metrics of both the training and testing sets.
- Parameters
estimator (str or Estimator instance (from scikit-learn or custom)) –
- If estimator is a str, we currently support:
knn: k-nearest neighbors
svm: support vector machine
rf: random forest
adaboost: AdaBoost
xgb: Gradient Boosting
tree: Extra Trees
ann: Artificial Neural Network (Multi-Layer Perceptron)
If estimator is an Estimator instance: you need to make sure that it has fit and predict methods
estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, we use the default parameters for the selected estimator
data (Data, an instance of Data class. It must have training and testing set) –
metrics (tuple, list, default = None) – Depends on whether you are tackling a regression or classification problem. The supported metrics can be found at: https://github.com/thieu1995/permetrics
save_path (str, default="history") – The path to save the file
verbose (bool, default=False) – Print the results to console or not.
- Returns
metrics_results – The metrics for both training and testing set.
- Return type
dict.
- export_boxplot_figures(xlabel='Model', ylabel='Global best fitness value', title='Boxplot of comparison models', show_legend=True, show_mean_only=False, exts=('.png', '.pdf'))[source]¶
- export_convergence_figures(xlabel='Epoch', ylabel='Fitness value', title='Convergence chart of comparison models', exts=('.png', '.pdf'))[source]¶
- fit(X, y=None, n_trials=2, n_jobs=2, save_path='history', save_results=True, verbose=True, fit_weights=(0.9, 0.1), mode='single', n_workers=None, termination=None)[source]¶
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The training input samples.
y (array-like of shape (n_samples,)) – The target values.
n_trials (int) – Number of repetitions
n_jobs (int, None) – Number of processes used to speed up the computation (<=1 or None: sequential, >=2: parallel)
save_path (str) – The path to the folder that holds the results
save_results (bool) – Save the global best fitness and loss (convergence/fitness) during generations to a csv file (default: True)
fit_weights (list, tuple or np.ndarray, default = (0.9, 0.1)) – The first weight is for the objective value and the second weight is for the number of features
verbose (bool, default = True) – Controls verbosity of output.
mode (str, default = 'single') –
The mode used in the Optimizer from the Mealpy library. Parallel: ‘process’, ‘thread’; Sequential: ‘swarm’, ‘single’.
’process’: The parallel mode, multiple cores run the tasks
’thread’: The parallel mode, multiple threads run the tasks
’swarm’: The sequential mode that has no effect on the updating phase of other agents
’single’: The sequential mode that affects the updating phase of other agents (default)
n_workers (int or None, default = None) – The number of workers (cores or threads) used in the Optimizer (effective only in parallel mode)
termination (dict or None, default = None) – The termination dictionary or an instance of the Termination class. It is for the Optimizer from the Mealpy library.
- fit_transform(X, y=None, n_trials=2, n_jobs=2, save_path='history', save_results=True, verbose=True, fit_weights=(0.9, 0.1), mode='single', n_workers=None, termination=None)[source]¶
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray array of shape (n_samples, n_features_new)
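MultiMhaSelector has no usage example above, so here is a minimal sketch based on the signatures documented in this section. The optimizer names, trial counts, and save_path are illustrative, and list_optimizer_paras is assumed to take one parameter dict per optimizer:

from mafese import get_dataset
from mafese.wrapper.mha import MultiMhaSelector

data = get_dataset("Arrhythmia")
data.split_train_test(test_size=0.2, inplace=True)

# compare several metaheuristics on the same feature selection problem
feat_selector = MultiMhaSelector(problem="classification", estimator="knn",
                                 list_optimizers=("OriginalWOA", "OriginalGWO", "BaseGA"),
                                 list_optimizer_paras=[{"epoch": 30, "pop_size": 20}] * 3,
                                 transfer_func="vstf_01", obj_name="AS")

# repeat each optimizer n_trials times and save convergence results to save_path
feat_selector.fit(data.X_train, data.y_train, n_trials=3, n_jobs=2,
                  save_path="history", save_results=True, verbose=True)

# evaluate the reduced dataset and export comparison charts
results = feat_selector.evaluate(estimator="svm", data=data, metrics=["AS", "PS", "RS"],
                                 save_path="history", verbose=True)
feat_selector.export_convergence_figures()
feat_selector.export_boxplot_figures()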
mafese.embedded package¶
mafese.embedded.lasso module¶
- class mafese.embedded.lasso.LassoSelector(problem='classification', estimator='lasso', estimator_paras=None, threshold=None, norm_order=1, max_features=None)[source]¶
Bases:
mafese.selector.Selector
Defines a LassoSelector class that holds all Lasso-based Feature Selection methods for feature selection problems
- Parameters
problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”
estimator (str, default = 'lasso') –
- We currently support:
lasso: lasso estimator (both regression and classification)
lr: Logistic Regression (classification)
svm: LinearSVC, support vector machine (classification)
estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, we use the best parameters for the selected estimator
threshold (str or float, default=None) – The threshold value to use for feature selection. Features whose absolute importance value is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g., Lasso), the threshold used is 1e-5. Otherwise, “mean” is used by default.
norm_order (non-zero int, inf, -inf, default=1) – Order of the norm used to filter the vectors of coefficients below threshold in the case where the coef_ attribute of the estimator is of dimension 2.
max_features (int, callable, default=None) –
The maximum number of features to select.
If an integer, then it specifies the maximum number of features to allow.
If a callable, then it specifies how to calculate the maximum number of features allowed by using the output of max_features(X).
If None, then all features are kept.
To only select based on max_features, set threshold=-np.inf.
Examples
The following example shows how to retrieve the most informative features in the Lasso-based FS method
>>> import pandas as pd
>>> from mafese.embedded.lasso import LassoSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]     # Assumption that the last column is label column
>>> # define mafese feature selection method
>>> feat_selector = LassoSelector(problem="classification", estimator="lasso", estimator_paras={"alpha": 0.1})
>>> # find all relevant features
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> print(feat_selector.selected_feature_masks)
array([ True, True, True, False, False, True, False, False, False, True])
>>> print(feat_selector.selected_feature_solution)
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> print(feat_selector.selected_feature_indexes)
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
- SUPPORT = {'classification': ['lasso', 'lr', 'svm'], 'regression': ['lasso']}¶
- fit(X, y=None)[source]¶
Learn the features to select from X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.
y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.
- Returns
self – Returns the instance itself.
- Return type
object
mafese.embedded.tree module¶
- class mafese.embedded.tree.TreeSelector(problem='classification', estimator='tree', estimator_paras=None, threshold=None, norm_order=1, max_features=None)[source]¶
Bases:
mafese.selector.Selector
Defines a TreeSelector class that holds all Tree-based Feature Selection methods for feature selection problems
- Parameters
problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”
estimator (str, default = 'tree') –
- We currently support:
rf: random forest
adaboost: AdaBoost
xgb: Gradient Boosting
tree: Extra Trees
estimator_paras (None or dict, default = None) – The parameters of the estimator; please see the official scikit-learn documentation for the selected estimator. If None, we use the best parameters for the selected estimator
threshold (str or float, default=None) – The threshold value to use for feature selection. Features whose absolute importance value is greater or equal are kept while the others are discarded. If “median” (resp. “mean”), then the threshold value is the median (resp. the mean) of the feature importances. A scaling factor (e.g., “1.25*mean”) may also be used. If None and if the estimator has a parameter penalty set to l1, either explicitly or implicitly (e.g., Lasso), the threshold used is 1e-5. Otherwise, “mean” is used by default.
norm_order (non-zero int, inf, -inf, default=1) – Order of the norm used to filter the vectors of coefficients below threshold in the case where the coef_ attribute of the estimator is of dimension 2.
max_features (int, callable, default=None) –
The maximum number of features to select.
If an integer, then it specifies the maximum number of features to allow.
If a callable, then it specifies how to calculate the maximum number of features allowed by using the output of max_features(X).
If None, then all features are kept.
To only select based on max_features, set threshold=-np.inf.
Examples
The following example shows how to retrieve the most informative features in the Tree-based FS method
>>> import pandas as pd
>>> from mafese.embedded.tree import TreeSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]     # Assumption that the last column is label column
>>> # define mafese feature selection method
>>> feat_selector = TreeSelector(problem="classification", estimator="tree")
>>> # find all relevant features
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> print(feat_selector.selected_feature_masks)
array([ True, True, True, False, False, True, False, False, False, True])
>>> print(feat_selector.selected_feature_solution)
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> print(feat_selector.selected_feature_indexes)
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
- SUPPORTED = ['rf', 'adaboost', 'xgb', 'tree']¶
- fit(X, y=None)[source]¶
Learn the features to select from X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.
y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.
- Returns
self – Returns the instance itself.
- Return type
object
Citation Request¶
Please include these citations if you plan to use this library:
@software{nguyen_van_thieu_2023_7969043,
author = {Nguyen Van Thieu and Ngoc Hung Nguyen and Ali Asghar Heidari},
title = {Feature Selection using Metaheuristics Made Easy: Open Source MAFESE Library in Python},
month = may,
year = 2023,
publisher = {Zenodo},
doi = {10.5281/zenodo.7969042},
url = {https://github.com/thieu1995/mafese}
}
@article{van2023mealpy,
title={MEALPY: An open-source library for latest meta-heuristic algorithms in Python},
author={Van Thieu, Nguyen and Mirjalili, Seyedali},
journal={Journal of Systems Architecture},
year={2023},
publisher={Elsevier},
doi={10.1016/j.sysarc.2023.102871}
}
If you have an open-ended or a research question, you can contact me via nguyenthieu2102@gmail.com
Important links¶
Official source code repo: https://github.com/thieu1995/mafese
Official document: https://mafese.readthedocs.io/
Download releases: https://pypi.org/project/mafese/
Issue tracker: https://github.com/thieu1995/mafese/issues
Notable changes log: https://github.com/thieu1995/mafese/blob/master/ChangeLog.md
Examples with different mealpy versions: https://github.com/thieu1995/mafese/blob/master/examples.md
- This project is also related to our other projects on “optimization” and “machine learning”; check them out here:
License¶
The project is licensed under GNU General Public License (GPL) V3 license.