Abstract:
Peptide sequencing is an important step in characterizing proteins. Typical analyses of mass spectrometry data identify only amino acid sequences that exist in reference databases. This restricts the possibility of discovering new peptides, such as those that contain uncharacterized mutations or originate from unexpected proteins. De novo peptide sequencing approaches address this limitation by deriving peptides directly from MS/MS spectra using knowledge of the ion fragmentation process, but they often suffer from low accuracy and require extensive validation by experts. In this thesis, we develop SMSNet, a deep learning-based hybrid de novo peptide sequencing model that achieves >95% amino acid accuracy while retaining good identification coverage. We propose a sequence-mask-search framework, which allows the model to recover full-sequence peptide predictions from a known database when the predictions contain ambiguous amino acid positions. Additionally, because the confidence score of each amino acid is often affected by the predictions at previous positions, we propose the use of an external rescorer to adjust the scores, which leads to better separation between correct and incorrect amino acids. Using the techniques described and proposed in this thesis, we recover a large number of peptides that agree with identifications obtained by database search techniques, suggesting the potential of SMSNet for other real-life proteomics studies.
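The sequence-mask-search idea can be illustrated with a minimal Python sketch. It is not the SMSNet implementation: the function names, the `?` mask symbol, the 0.9 threshold, and the fixed-length wildcard matching are all assumptions made for illustration. In particular, the sketch treats each masked position as exactly one unknown residue, which is a simplification of how an ambiguous region might be resolved in practice.

```python
import re

MASK = "?"  # hypothetical placeholder for a low-confidence position


def mask_low_confidence(residues, scores, threshold=0.9):
    """Replace residues whose confidence score is below the threshold with MASK."""
    return "".join(aa if s >= threshold else MASK for aa, s in zip(residues, scores))


def search_database(masked, proteins):
    """Return the distinct protein substrings that agree with every unmasked residue."""
    body = "".join("." if c == MASK else re.escape(c) for c in masked)
    # The lookahead wrapper lets overlapping occurrences be found as well.
    pattern = re.compile("(?=(" + body + "))")
    matches = set()
    for protein in proteins:
        for m in pattern.finditer(protein):
            matches.add(m.group(1))
    return matches


def recover(residues, scores, proteins, threshold=0.9):
    """Fill masked positions only when the database gives one unambiguous candidate."""
    masked = mask_low_confidence(residues, scores, threshold)
    if MASK not in masked:
        return masked
    candidates = search_database(masked, proteins)
    return candidates.pop() if len(candidates) == 1 else masked
```

In this sketch, a prediction is completed only when exactly one database sequence is consistent with the confident positions; otherwise the masked prediction is returned unchanged, so that ambiguous positions are never filled by guesswork.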
Chulalongkorn University. Office of Academic Resources