Heuristic similarity- and distance-based supervised feature selection methods
Lohrmann, Christoph (2019-12-16)
Doctoral dissertation
Lappeenranta-Lahti University of Technology LUT
Acta Universitatis Lappeenrantaensis
School of Engineering Science
School of Engineering Science, Computational Engineering
All rights reserved.
The permanent address of the publication is
https://urn.fi/URN:ISBN:978-952-335-473-9
Abstract
In the field of machine learning, the available data often contain many features to describe phenomena. This can pose a problem since only those features that are relevant to characterizing the target concept are needed, whereas additional features can make it even harder to determine the underlying association between the features and the phenomenon. Therefore, an essential task in data analysis is feature selection, which means reducing the features in the data to a subset of relevant ones. The focus of this thesis is on supervised feature selection methods used in the context of classification tasks. In particular, the emphasis is on heuristic filter methods, which do not guarantee an optimal solution but are considerably faster and are deployed as a preprocessing step for the data before a classification algorithm is applied.
The first approach presented is the ‘fuzzy similarity and entropy’ (FSAE) feature selection method, which is a modification of the approach by Luukka (2011). It is demonstrated that this approach, which evaluates each feature by itself (a univariate approach), accomplishes at least comparable classification results to the original approach, often with a considerably smaller feature subset. The results were competitive with those of several other distance- and information-based filter methods. In addition to several artificial examples and real-world medical datasets, the FSAE was deployed together with a random forest to construct a classification model for the prediction of the S&P 500 intraday return. Several trading strategies derived from the classification model demonstrated the ability to outperform a buy-and-hold strategy under small to moderate transaction costs. In the context of classification, the similarity classifier, which, like the FSAE feature selection method, works with a single representative point (ideal vector) for each class, was modified to allow for multiple ideal vectors per class using clustering. This classifier was able to outperform all single-classifier models it was compared to in terms of classification accuracy, often by a significant margin. The same idea of using multiple class representatives was successfully applied to feature selection with the proposed ‘clustering one less dimension’ (COLD) algorithm. In addition, the distance-based COLD filter algorithm is capable of accounting for dependencies among features (a multivariate approach). This ability was highlighted on several artificial examples. Lastly, it achieved at least competitive results compared to several other heuristic filter methods on real-world datasets.
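The core idea behind similarity- and entropy-based filters of this kind can be sketched as follows: each class is summarized by an ideal vector, every sample's fuzzy similarity to those ideal vectors is computed per feature, and features whose similarity values carry high fuzzy entropy (i.e., are ambiguous) are considered less relevant. The snippet below is a minimal sketch in the spirit of Luukka (2011), not the thesis's exact FSAE formulation; the function name, the use of the class mean as the ideal vector, and the parameter `p` of the Łukasiewicz-type similarity are simplifying assumptions.

```python
import numpy as np

def fuzzy_entropy_feature_ranking(X, y, p=1):
    """Rank features from most to least relevant by the fuzzy entropy of
    sample similarities to class ideal vectors (a simplified sketch in
    the spirit of Luukka, 2011). Lower total entropy = more relevant."""
    # Scale each feature to [0, 1] so similarity values are well defined.
    X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-12)
    entropy = np.zeros(X.shape[1])
    for c in np.unique(y):
        # Ideal vector for class c: per-feature mean of its samples
        # (one common choice; other representatives are possible).
        ideal = X[y == c].mean(axis=0)
        # Lukasiewicz-type similarity of every sample to the ideal vector.
        sim = (1.0 - np.abs(X - ideal) ** p) ** (1.0 / p)
        sim = np.clip(sim, 1e-12, 1.0 - 1e-12)  # avoid log(0)
        # De Luca-Termini fuzzy entropy, accumulated per feature.
        entropy += -(sim * np.log(sim) + (1 - sim) * np.log(1 - sim)).sum(axis=0)
    # Indices of features sorted from lowest (most relevant) entropy up.
    return np.argsort(entropy)
```

A discriminative feature places same-class samples near the ideal vector (similarity close to 1) and other-class samples far from it (similarity close to 0), both of which yield low fuzzy entropy; a noisy feature produces mid-range similarities and therefore high entropy.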
Collections
- Doctoral dissertations [1092]