Ph.D. Thesis:

Change-of-Representation in Machine Learning, and an Application to Protein Structure Prediction

Thomas R. Ioerger, University of Illinois, 1996

Abstract:

While many excellent induction algorithms are known for making predictions from databases in well-studied domains, learning systems still perform poorly in many difficult real-world domains, such as weather prediction or financial risk analysis. Two characteristics of real-world domains are inadequately addressed by current machine learning research. First, the difficulty in these domains is often caused by a low-level representation, which necessitates shifting to a higher-level representation. But the space of possible representations is very large, so we need intelligent methods for finding higher-level representations. Second, background knowledge is almost always available in real-world domains, and we would like to take advantage of it to increase predictive accuracy. However, known roles for domain knowledge in machine learning are often inflexible: they may require the use of a specific induction algorithm, or they may be sensitive to incorrectness or incompleteness in the knowledge.

We propose a general framework for change-of-representation based on searching for alternative representations to improve the accuracy of an underlying induction algorithm. Representations are selected as candidates by querying a strategy component, which relies on domain knowledge to suggest which alternatives to search. An evaluation component then compares these representations by applying each one to a set of examples and running the induction algorithm on the transformed examples to empirically determine the effect of the change on accuracy. This approach provides solutions to the two characteristic problems of learning in real-world domains. First, domain knowledge is used as a heuristic to guide the search for alternative representations, enabling more intelligent decisions during change-of-representation. Second, the framework provides a flexible role for knowledge that can be used with any learning algorithm and is tolerant of uncertainty.
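
The interaction between the strategy and evaluation components described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the names `strategy`, `evaluate`, and `search`, the toy `table_learner` standing in for "an underlying induction algorithm", the parity data set, and the use of a single held-out test split to estimate accuracy.

```python
from collections import Counter, defaultdict

def table_learner(train):
    """Toy induction algorithm (hypothetical): memorize the majority label
    seen for each transformed input; fall back to the overall majority."""
    votes = defaultdict(Counter)
    for x, y in train:
        votes[x][y] += 1
    default = Counter(y for _, y in train).most_common(1)[0][0]

    def predict(x):
        return votes[x].most_common(1)[0][0] if x in votes else default

    return predict

def strategy():
    """Strategy component (hypothetical): domain knowledge proposes
    candidate representations as (name, transformation) pairs."""
    return [
        ("identity", lambda x: x),      # keep the low-level representation
        ("parity", lambda x: x % 2),    # knowledge-suggested higher-level feature
    ]

def evaluate(rep, train, test, learner=table_learner):
    """Evaluation component: transform the examples, run the induction
    algorithm on them, and measure accuracy empirically."""
    model = learner([(rep(x), y) for x, y in train])
    return sum(model(rep(x)) == y for x, y in test) / len(test)

def search(candidates, train, test):
    """Return (accuracy, name) of the best-scoring candidate."""
    return max((evaluate(rep, train, test), name) for name, rep in candidates)

def label(n):
    return "odd" if n % 2 else "even"

# Illustrative data: the class depends only on parity, which the raw
# integer representation does not expose to the table learner.
train = [(n, label(n)) for n in range(1, 8)]
test = [(n, label(n)) for n in range(8, 12)]
best = search(strategy(), train, test)
```

Because `evaluate` treats the learner as a black box, any induction algorithm with the same call shape could be substituted, and a poor suggestion from `strategy` simply scores badly rather than breaking the learner. This mirrors, in a simplified way, the flexibility and tolerance of uncertainty claimed for the framework.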

We apply our framework for change-of-representation to the difficult, real-world domain of protein structure prediction. The best computational method to date for determining the structure of a protein from its amino acid sequence is homology modeling, which is based on sequence alignments with a protein database. Homology modeling can fail in cases where the sequence similarity is low between proteins with similar structures. However, the physical and chemical properties of amino acids are believed to be relevant to protein structure. Using an implementation of our framework, we incorporate this domain knowledge to suggest ways to change the representation of amino acid sequences. Efficient search procedures are derived that lead to the discovery of representations that improve the ability to predict protein structures by homology modeling.
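
To make the idea of re-representing amino acid sequences concrete, the following sketch recodes residues by a coarse physico-chemical class and compares sequence identity before and after. The four-class grouping, the names `PROPERTY_CLASS`, `recode`, and `identity`, and the two short example sequences are all illustrative assumptions. They are not the representations discovered in the thesis, and the sequences are invented rather than drawn from a protein database.

```python
# Illustrative grouping of the 20 standard amino acids (an assumption;
# other groupings are equally plausible).
GROUPS = {
    "h": "AVLIMFW",    # hydrophobic
    "p": "STNQCYGP",   # polar / other
    "+": "KRH",        # positively charged
    "-": "DE",         # negatively charged
}
PROPERTY_CLASS = {res: cls for cls, residues in GROUPS.items() for res in residues}

def recode(seq):
    """Change of representation: replace each residue by its class symbol."""
    return "".join(PROPERTY_CLASS[res] for res in seq)

def identity(a, b):
    """Fraction of matching positions in two pre-aligned, equal-length sequences."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)

# Two made-up, already-aligned fragments with no residues in common.
seq_a, seq_b = "LKVE", "IRAD"
raw_identity = identity(seq_a, seq_b)                      # 0.0
recoded_identity = identity(recode(seq_a), recode(seq_b))  # 1.0
```

In this toy case the two fragments share no residues, yet they are identical under the property-based alphabet. That is the kind of similarity a sequence alignment could miss at the residue level.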