Minimum redundancy feature selection

Minimum redundancy feature selection is an algorithm frequently used to identify characteristics of genes and phenotypes accurately and to narrow down their relevance. It is usually described in its pairing with relevant feature selection as Minimum Redundancy Maximum Relevance (mRMR). The method was first proposed in 2003 by Hanchuan Peng and Chris Ding,^[1] followed by a theoretical formulation based on mutual information, along with the first definition of multivariate mutual information, published in IEEE Transactions on Pattern Analysis and Machine Intelligence in 2005.^[2]

Feature selection, one of the basic problems in pattern recognition and machine learning, identifies subsets of data that are relevant to the parameters used; this criterion is normally called Maximum Relevance. Such subsets often contain material that is relevant but redundant, and mRMR attempts to address this problem by removing the redundant features. mRMR has applications in many areas, such as cancer diagnosis and speech recognition.

Features can be selected in many different ways. One scheme is to select the features that correlate most strongly with the classification variable. This is called maximum-relevance selection. Many heuristic algorithms can be used, such as sequential forward, backward, or floating selection.
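
As an illustration of maximum-relevance selection, the following minimal sketch ranks features by the absolute value of their Pearson correlation with the classification variable and keeps the top k. The function names (pearson, max_relevance), the column-wise data layout, and the choice of Pearson correlation as the relevance measure are assumptions made for this example, not part of any published implementation.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def max_relevance(features, target, k):
    """Return the indices of the k features most strongly correlated with target.

    `features` is a list of columns, one sequence of values per feature
    (a hypothetical layout chosen for this sketch).
    """
    scores = [abs(pearson(col, target)) for col in features]
    # sorted() is stable, so ties keep their original feature order.
    ranked = sorted(range(len(features)), key=lambda i: -scores[i])
    return ranked[:k]
```

Because each feature is scored independently, two nearly identical features can both be selected under this scheme.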

Alternatively, features can be selected to be mutually far apart from each other while still having high correlation with the classification variable. This scheme, termed Minimum Redundancy Maximum Relevance (mRMR) selection, has been found to be more powerful than maximum-relevance selection.
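
The idea can be sketched as a greedy forward search: at each step, pick the feature whose relevance to the classification variable, minus its average redundancy with the features already chosen, is largest. The sketch below is only one possible instance. Using absolute Pearson correlation for both relevance and redundancy, combining them by subtraction, and the names pearson and mrmr_correlation are all assumptions made for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def mrmr_correlation(features, target, k):
    """Greedily select up to k feature indices by relevance minus mean redundancy."""
    relevance = [abs(pearson(col, target)) for col in features]
    selected = []
    remaining = list(range(len(features)))
    while remaining and len(selected) < k:
        best, best_score = None, -math.inf
        for i in remaining:
            if selected:
                # Mean |correlation| with the features chosen so far.
                redundancy = sum(
                    abs(pearson(features[i], features[j])) for j in selected
                ) / len(selected)
            else:
                redundancy = 0.0  # the first pick is by relevance alone
            score = relevance[i] - redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data: feature 1 is a near-copy of feature 0; feature 2 is less
# relevant on its own but uncorrelated with feature 0.
a = [1, 1, -1, -1, 1, 1, -1, -1]
near_copy = [1.1, 1.1, -0.9, -0.9, 0.9, 0.9, -1.1, -1.1]
b = [1, -1, 1, -1, 1, -1, 1, -1]
target = [2 * x + y for x, y in zip(a, b)]
print(mrmr_correlation([a, near_copy, b], target, 2))  # prints [0, 2]
```

In this toy example, ranking by relevance alone would choose features 0 and 1, whereas the redundancy penalty makes the search skip the near-copy in favour of feature 2.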

As a special case, the correlation can be replaced by the statistical dependency between variables, which can be quantified with mutual information. In this case, mRMR has been shown to approximate maximizing the dependency between the joint distribution of the selected features and the classification variable.^[2]
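
For reference, the mutual-information criteria can be written out as follows, in the notation used in the 2005 formulation.^[2] The symbols are introduced here and do not appear in the text above: S is the selected feature set, x_i a feature, c the classification variable, D the relevance, R the redundancy, and Φ the combined objective. The difference form D − R shown here is one way of combining the two terms.

```latex
% Mutual information between two random variables x and y
I(x; y) = \iint p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \, dx \, dy

% Max-Relevance: mean mutual information between each selected feature and c
\max_{S} D(S, c), \qquad D = \frac{1}{|S|} \sum_{x_i \in S} I(x_i; c)

% Min-Redundancy: mean pairwise mutual information among the selected features
\min_{S} R(S), \qquad R = \frac{1}{|S|^{2}} \sum_{x_i, x_j \in S} I(x_i; x_j)

% mRMR: optimise both criteria together
\max_{S} \Phi(D, R), \qquad \Phi = D - R

% Max-Dependency: the joint criterion that mRMR approximates
\max_{S} I\bigl(\{x_i : x_i \in S\};\, c\bigr)
```

The Max-Dependency criterion requires estimating a joint distribution over all selected features, which becomes difficult as |S| grows; the relevance and redundancy terms only need pairwise estimates.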

Studies have tried different measures of redundancy and relevance. A 2010 study compared several such measures in the context of biomedical images.^[3]

References

[1] Chris Ding and Hanchuan Peng, "Minimum Redundancy Feature Selection from Microarray Gene Expression Data". 2nd IEEE Computer Society Bioinformatics Conference (CSB 2003), 11–14 August 2003, Stanford, CA, USA. Pages 523–529.
[2] Peng, H.C., Long, F., and Ding, C., "Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No. 8, pp. 1226–1238, 2005.
[3] Auffarth, B., Lopez, M., Cerquides, J. (2010). "Comparison of redundancy and relevance measures for feature selection in tissue classification of CT images". Advances in Data Mining. Applications and Theoretical Aspects. pp. 248–262. Springer. http://www.csc.kth.se/~auffarth/publications/redrel.pdf

External links

Penglab mRMR
