MAFESE Library
mafese.selector
- class mafese.selector.Selector(problem='classification')
Bases: abc.ABC
Defines an abstract class for Feature Selector.
- SUPPORTED_CLASSIFICATION_METRICS = ['PS', 'NPV', 'RS', 'AS', 'F1S', 'F2S', 'FBS', 'SS', 'MCC', 'HS', 'CKS', 'JSI', 'GMS', 'ROC-AUC', 'LS', 'GINI', 'CEL', 'HL', 'KLDL', 'BSL']
- SUPPORTED_ESTIMATORS = ['knn', 'svm', 'rf', 'adaboost', 'xgb', 'tree', 'ann']
- SUPPORTED_PROBLEMS = ['classification', 'regression']
- SUPPORTED_REGRESSION_METRICS = ['EVS', 'ME', 'MAE', 'MSE', 'RMSE', 'MSLE', 'MedAE', 'MRE', 'MRB', 'MAPE', 'SMAPE', 'MAAPE', 'MASE', 'NSE', 'NNSE', 'WI', 'R', 'PCC', 'AR', 'APCC', 'R2S', 'RSQ', 'R2', 'COD', 'AR2', 'ACOD', 'CI', 'DRV', 'KGE', 'GINI', 'GINI_WIKI', 'PCD', 'JSD', 'VAF', 'RAE', 'A10', 'A20', 'A30', 'NRMSE', 'RSE', 'COV', 'COR', 'EC', 'OI', 'CRM']
- evaluate(estimator=None, estimator_paras=None, data=None, metrics=None)
Evaluate the new dataset. The estimator is re-trained on the training set, and the metrics for both the training and testing sets are returned.
- Parameters
estimator (str or Estimator instance (from scikit-learn or custom)) –
- If estimator is a str, the currently supported values are:
knn: k-nearest neighbors
svm: support vector machine
rf: random forest
adaboost: AdaBoost
xgb: Gradient Boosting
tree: Extra Trees
ann: Artificial Neural Network (Multi-Layer Perceptron)
If estimator is an Estimator instance, it must provide fit and predict methods.
estimator_paras (None or dict, default = None) – The parameters of the estimator; see the official scikit-learn documentation for the selected estimator. If None, the default parameters of the selected estimator are used.
data (Data) – An instance of the Data class. It must contain both a training and a testing set.
metrics (tuple or list, default = None) – The metrics to compute, depending on whether the problem is regression or classification. The supported metrics are listed at: https://github.com/thieu1995/permetrics
- Returns
metrics_results – The metrics for both training and testing set.
- Return type
dict
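The requirement that a custom estimator only needs fit and predict methods, and the idea of re-training before scoring both splits, can be illustrated with a minimal sketch. It does not call MAFESE: `MajorityClassifier`, `accuracy`, and `evaluate_sketch` are hypothetical names, and the `"train"`/`"test"` layout of the returned dict is an assumption rather than the library's documented format.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical custom estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate_sketch(estimator, X_train, y_train, X_test, y_test):
    """Re-train on the training set, then score both splits (assumed result layout)."""
    estimator.fit(X_train, y_train)
    return {
        "train": {"accuracy": accuracy(y_train, estimator.predict(X_train))},
        "test": {"accuracy": accuracy(y_test, estimator.predict(X_test))},
    }


X_train, y_train = [[0.1], [0.4], [0.9], [0.7]], [1, 1, 0, 1]
X_test, y_test = [[0.2], [0.8]], [1, 0]
results = evaluate_sketch(MajorityClassifier(), X_train, y_train, X_test, y_test)
print(results)
```

Any object exposing the same two methods could be passed in place of `MajorityClassifier`, which is the point of the duck-typed interface.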
- fit(X, y=None)
Learn the features to select from X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.
y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.
- Returns
self – Returns the instance itself.
- Return type
object
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray of shape (n_samples, n_features_new)
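The fit / fit_transform contract described above can be sketched with a small abstract base class. This is an illustration in plain Python, not the MAFESE source: `BaseSelector`, `NonConstantSelector`, and the `selected_indexes_` attribute are hypothetical names chosen for the example.

```python
from abc import ABC, abstractmethod


class BaseSelector(ABC):
    """Hypothetical stand-in for an abstract feature selector."""

    @abstractmethod
    def fit(self, X, y=None):
        """Learn which columns to keep and return self."""

    def transform(self, X):
        # Keep only the columns chosen during fit.
        return [[row[j] for j in self.selected_indexes_] for row in X]

    def fit_transform(self, X, y=None, **fit_params):
        # Equivalent to calling fit and then transform on the same data.
        return self.fit(X, y, **fit_params).transform(X)


class NonConstantSelector(BaseSelector):
    """Toy selector: keeps every column that takes more than one value."""

    def fit(self, X, y=None):
        n_features = len(X[0])
        self.selected_indexes_ = [
            j for j in range(n_features) if len({row[j] for row in X}) > 1
        ]
        return self


X = [[1, 5, 0], [2, 5, 0], [3, 5, 1]]
X_new = NonConstantSelector().fit_transform(X)
print(X_new)  # [[1, 0], [2, 0], [3, 1]]
```

Because fit returns the instance itself, the two calls can be chained, and the output keeps n_samples rows while reducing the column count to n_features_new.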
- name = 'Feature Selector'
mafese.filter
- class mafese.filter.FilterSelector(problem='classification', method='ANOVA', n_features=3, n_neighbors=5, n_bins=10, normalized=True)
Bases: mafese.selector.Selector
Defines a FilterSelector class that holds all filter methods for feature selection problems.
- Parameters
problem (str, default = "classification") – The problem you are trying to solve (or type of dataset), “classification” or “regression”
method (str, default = "ANOVA") –
If problem = “classification”, method can be one of the following values:
“CHI”: Chi-Squared statistic
“ANOVA”: ANOVA F-score
“MI”: Mutual information
“KENDALL”: Kendall Tau correlation
“SPEARMAN”: Spearman’s Rho correlation
“POINT”: Point-biserial correlation
“RELIEF”: Original Relief method
“RELIEF-F”: Weighted average Relief based on the frequency of each class
“VLS-RELIEF-F”: Very Large Scale ReliefF
If problem = “regression”, method can be one of the following values:
“PEARSON”: Pearson correlation
“ANOVA”: ANOVA F-score
“MI”: Mutual information
“KENDALL”: Kendall Tau correlation
“SPEARMAN”: Spearman’s Rho correlation
“POINT”: Point-biserial correlation
“RELIEF”: Original Relief method
“RELIEF-F”: Weighted average Relief based on the frequency of each class
“VLS-RELIEF-F”: Very Large Scale ReliefF
n_features (int or float, default=3) – If integer, the parameter is the absolute number of features to select. If float between 0 and 1, it is the fraction of features to select.
n_neighbors (int, default=5, Optional) – Number of neighbors to use for computing feature importance scores of Relief-based family
n_bins (int, default=10, Optional) – Number of bins to use for discretizing the target variable of Relief-based family in regression problems.
normalized (bool, default=True, Optional) – Normalize feature importance scores by the number of instances in the dataset
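The two interpretations of n_features (absolute count versus fraction) can be made concrete with a short helper. This is a sketch only: `resolve_n_features` is a hypothetical name, and both the truncation of fractional results and the "keep at least one feature" floor are assumptions, since the description above does not say how a fraction is rounded.

```python
def resolve_n_features(n_features, n_total):
    """Turn an int count or a float fraction into an absolute feature count."""
    if isinstance(n_features, bool):
        raise TypeError("n_features must be an int or a float, not a bool")
    if isinstance(n_features, int):
        if not 1 <= n_features <= n_total:
            raise ValueError(f"n_features must be in [1, {n_total}]")
        return n_features
    if isinstance(n_features, float):
        if not 0.0 < n_features < 1.0:
            raise ValueError("a float n_features must lie strictly between 0 and 1")
        # Assumption: truncate the product, but always keep at least one feature.
        return max(1, int(n_features * n_total))
    raise TypeError("n_features must be an int or a float")


print(resolve_n_features(3, 10))     # 3
print(resolve_n_features(0.5, 10))   # 5
print(resolve_n_features(0.05, 10))  # 1
```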
- n_features
The number of selected features.
- Type
int
- supported_methods
Maps each supported method name (key) to the function that implements it (value).
- Type
dict
- method_name
The name of the method that will be used.
- Type
str
Examples
The following example shows how to retrieve the most informative features with FilterSelector.
>>> import pandas as pd
>>> from mafese.filter import FilterSelector
>>> # load dataset
>>> dataset = pd.read_csv('your_path/dataset.csv', index_col=0).values
>>> X, y = dataset[:, 0:-1], dataset[:, -1]  # assume the last column is the label column
>>> # define mafese feature selection method
>>> feat_selector = FilterSelector(problem='classification', method='SPEARMAN', n_features=5)
>>> # find all relevant features
>>> feat_selector.fit(X, y)
>>> # check selected features - True (or 1) is selected, False (or 0) is not selected
>>> print(feat_selector.selected_feature_masks)
array([ True, True, True, False, False, True, False, False, False, True])
>>> print(feat_selector.selected_feature_solution)
array([ 1, 1, 1, 0, 0, 1, 0, 0, 0, 1])
>>> # check the index of selected features
>>> print(feat_selector.selected_feature_indexes)
array([ 0, 1, 2, 5, 9])
>>> # call transform() on X to filter it down to selected features
>>> X_filtered = feat_selector.transform(X)
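The three attributes printed in the example encode the same selection in different forms. The sketch below reproduces that relationship in plain Python using the values shown above; it does not call MAFESE, the relationship is inferred from the printed outputs, and `select_columns` is a hypothetical helper standing in for the column filtering that transform() performs.

```python
# Boolean mask taken from the example output above.
selected_feature_masks = [True, True, True, False, False, True, False, False, False, True]

# The 0/1 solution vector is the mask cast to integers.
selected_feature_solution = [int(flag) for flag in selected_feature_masks]

# The indexes are the positions where the mask is True.
selected_feature_indexes = [j for j, flag in enumerate(selected_feature_masks) if flag]


def select_columns(X, indexes):
    """Keep only the listed columns of a row-major 2D list."""
    return [[row[j] for j in indexes] for row in X]


X = [list(range(100, 110))]  # one sample with ten features
X_filtered = select_columns(X, selected_feature_indexes)

print(selected_feature_solution)  # [1, 1, 1, 0, 0, 1, 0, 0, 0, 1]
print(selected_feature_indexes)   # [0, 1, 2, 5, 9]
print(X_filtered)                 # [[100, 101, 102, 105, 109]]
```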
- SUPPORT = {'classification': {'ANOVA': 'f_classification_func', 'CHI': 'chi2_func', 'KENDALL': 'kendall_func', 'MI': 'mutual_info_classif', 'POINT': 'point_func', 'RELIEF': 'relief_func', 'RELIEF-F': 'relief_f_func', 'SPEARMAN': 'spearman_func', 'VLS-RELIEF-F': 'vls_relief_f_func'}, 'regression': {'ANOVA': 'f_regression_func', 'KENDALL': 'kendall_func', 'MI': 'mutual_info_regression', 'PEARSON': 'r_regression', 'POINT': 'point_func', 'RELIEF': 'relief_func', 'RELIEF-F': 'relief_f_func', 'SPEARMAN': 'spearman_func', 'VLS-RELIEF-F': 'vls_relief_f_func'}}
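A nested table like SUPPORT lends itself to a two-step lookup: first by problem, then by method name. The sketch below uses a subset of the entries listed above; `lookup_method` and its error messages are hypothetical, and how the library itself validates these arguments is not shown in this reference.

```python
# Subset of the SUPPORT table above, copied for illustration.
SUPPORT = {
    "classification": {"ANOVA": "f_classification_func", "CHI": "chi2_func"},
    "regression": {"ANOVA": "f_regression_func", "PEARSON": "r_regression"},
}


def lookup_method(problem, method):
    """Return the function name registered for (problem, method), or raise ValueError."""
    if problem not in SUPPORT:
        raise ValueError(f"Unknown problem {problem!r}; expected one of {sorted(SUPPORT)}")
    methods = SUPPORT[problem]
    if method not in methods:
        raise ValueError(
            f"Method {method!r} is not available for {problem}; "
            f"expected one of {sorted(methods)}"
        )
    return methods[method]


print(lookup_method("regression", "PEARSON"))  # r_regression

try:
    lookup_method("regression", "CHI")  # CHI is listed for classification only
except ValueError as err:
    print(err)
```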
- fit(X, y=None)
Learn the features to select from X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Training vectors, where n_samples is the number of samples and n_features is the number of predictors.
y (array-like of shape (n_samples,), default=None) – Target values. This parameter may be ignored for unsupervised learning.
- Returns
self – Returns the instance itself.
- Return type
object
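As a rough picture of what fitting a filter method involves, the sketch below scores each feature against the target with a Pearson correlation and keeps the highest-scoring columns. It is written in plain Python and is not the MAFESE implementation: `pearson` and `filter_fit_sketch` are hypothetical names, and ranking by the absolute value of the correlation is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def filter_fit_sketch(X, y, n_features):
    """Score each column against y, then keep the n_features best ones."""
    n_total = len(X[0])
    # Assumption: rank features by the absolute value of the correlation.
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_total)]
    ranked = sorted(range(n_total), key=lambda j: scores[j], reverse=True)
    indexes = sorted(ranked[:n_features])
    masks = [j in indexes for j in range(n_total)]
    return scores, masks, indexes


X = [[1.0, 7.0, 3.0], [2.0, 7.0, 1.0], [3.0, 7.0, 4.0], [4.0, 7.0, 1.0]]
y = [2.0, 4.0, 6.0, 8.0]
scores, masks, indexes = filter_fit_sketch(X, y, n_features=2)
print(indexes)  # [0, 2]
print(masks)    # [True, False, True]
```

The first column is perfectly correlated with y and the second is constant, so the constant column is the one that gets dropped.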
- fit_transform(X, y=None, **fit_params)
Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray of shape (n_samples, n_features_new)