Using Statistical Learning Methods for Better Spectrum Classification
Chen, Yaoyi
2014-07-16
Abstract
Shotgun proteomics has become a widely used technology for identifying large numbers of peptides and proteins in complex biological samples. However, a single score function, as used by most search algorithms to evaluate the quality of peptide-spectrum matches (PSMs), is not adequate to discriminate between correct and incorrect spectrum identifications. Here, we used and compared multiple logistic regression models of differing flexibility, support vector machines with various kernel functions, and random forests to combine multiple scores from search engines. New features, such as retention time differences and a number of other modifications, were also incorporated to build a better binary classifier. We validated these methods through bootstrapping and compared their performance. This study shows that these methods, each with its own strengths, improve the classification of correct versus incorrect PSMs, specifically through a higher area under the ROC curve and better discrimination indices.
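As a rough illustration of the evaluation idea mentioned above, the area under the ROC curve for a PSM classifier can be computed from its scores and then resampled by bootstrapping to gauge its stability. The sketch below is not the procedure used in the study: the function names `roc_auc` and `bootstrap_auc`, the percentile interval, and the toy labels and scores are all assumptions made for illustration.

```python
import random


def roc_auc(labels, scores):
    """AUC via the Mann-Whitney pairwise-comparison identity.

    labels: 1 for a correct PSM, 0 for an incorrect one.
    scores: classifier outputs; higher means more likely correct.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect PSM")
    # Count (correct, incorrect) pairs ranked in the right order;
    # tied scores contribute one half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc(labels, scores, n_boot=1000, seed=0):
    """Percentile bootstrap interval (2.5%, 97.5%) for the AUC."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    aucs = []
    while len(aucs) < n_boot:
        # Resample PSMs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacked one class; draw again
        aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    return aucs[int(0.025 * n_boot)], aucs[int(0.975 * n_boot) - 1]


# Toy, made-up values purely for illustration (not data from the study).
labels = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

auc = roc_auc(labels, scores)
low, high = bootstrap_auc(labels, scores, n_boot=500)
print(f"AUC = {auc:.4f}, bootstrap 95% interval = ({low:.3f}, {high:.3f})")
```

In practice the scores would come from a fitted model such as a logistic regression, support vector machine, or random forest, and the same AUC computation lets those models be compared on a common scale.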