Automated Text Categorization in Bibliographic Search

Principal Investigator: Lukas C. Faulstich

Co-Investigator: Peter F. Stadler

Co-Workers: Christina Witwer, Caroline Thurner


In bioinformatics there are research topics that cannot be uniquely characterized by a set of key words, because the relevant key words are (i) also heavily used in other contexts and (ii) often omitted in relevant documents because the context is clear to the target audience. For such topics, information retrieval interfaces such as Entrez/PubMed produce either low precision (with a broad query) or low recall (with a narrow one). To achieve high recall at reasonable precision, the results of a broad information retrieval search have to be filtered to remove irrelevant documents. We use automated text categorization for this purpose.
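The trade-off described above can be made concrete with a small sketch. The document IDs and the three result sets below are hypothetical and serve only to illustrate how precision and recall are computed; they are not data from this project.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, for illustration only.
relevant = {1, 2, 3, 4}
broad_query = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}  # finds everything, plus noise
narrow_query = {1, 2}                          # clean, but misses half
filtered = {1, 2, 3, 5}                        # broad query after an imperfect filter

print(precision_recall(broad_query, relevant))   # (0.4, 1.0)
print(precision_recall(narrow_query, relevant))  # (1.0, 0.5)
print(precision_recall(filtered, relevant))      # (0.75, 0.75)
```

In this toy setting the filtered result set trades a little recall for a large gain in precision, which is the effect the filtering step is meant to achieve.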

In this project we use the topic of conserved secondary RNA structures in viral genomes as a running example. We are investigating how well automated classifiers trained on a manually labeled reference corpus can be applied to similar unlabeled corpora. Further research goals are to validate and enhance existing feature selection methods and to experiment with classification techniques that take unlabeled instances into account.

We are working on a bibliographic search tool, litsift, that sends a user query to a bibliographic database such as PubMed, retrieves the search results and the articles cited therein, and ranks the results according to the predictions of a classifier previously trained on a labeled reference corpus using the same tool. The user may choose to re-label some of the results manually and retrain the classifier in order to enhance its performance.
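The rank, re-label and retrain loop described above might be sketched as follows. This is not the litsift implementation: the multinomial naive Bayes classifier, the function names, and the example texts are all assumptions made for illustration, and the retrieval step against a bibliographic database is left out entirely.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train(labeled_docs):
    """Train on a list of (text, is_relevant) pairs; returns a simple model tuple."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab


def relevance_score(model, text):
    """Log-odds that the text is relevant (higher means more relevant)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    score = 0.0
    for label, sign in ((True, 1), (False, -1)):
        # Laplace-smoothed class prior and word likelihoods.
        log_p = math.log((doc_counts[label] + 1) / (total_docs + 2))
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocab:  # words unseen in training are ignored
                log_p += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        score += sign * log_p
    return score


def rank(model, docs):
    """Sort search results from most to least likely relevant."""
    return sorted(docs, key=lambda d: relevance_score(model, d), reverse=True)


# Hypothetical labeled reference corpus and search results.
reference = [
    ("conserved rna secondary structure in picornavirus genome", True),
    ("rna structure elements in flavivirus genome", True),
    ("clinical trial of hepatitis vaccine", False),
    ("epidemiology of viral infection in patients", False),
]
results = [
    "vaccine trial in patients",
    "conserved secondary structure of viral rna genome",
]

model = train(reference)
ranked = rank(model, results)

# The user re-labels one result by hand; the classifier is retrained on the
# extended corpus.
reference.append(("vaccine trial in patients", False))
model = train(reference)
```

Retraining here simply rebuilds the model from the extended corpus; an incremental update would be an equally plausible design.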

A prototype for the core functionality of litsift has been used to assess the transferability of classifiers trained on corpora for virus groups such as Picornaviridae, Flaviviridae and Hepadnaviridae.
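One way to assess transferability is to train a classifier on each corpus and evaluate it on every other corpus. The sketch below shows such a cross-corpus evaluation under strong simplifying assumptions: the keyword-based classifier is a toy stand-in rather than the method used in the prototype, accuracy is an arbitrary choice of metric, and the two tiny corpora are invented for illustration.

```python
from collections import Counter


def train_keyword_model(labeled_docs):
    """Toy classifier: keep words found in more relevant than irrelevant docs."""
    pos, neg = Counter(), Counter()
    for text, is_relevant in labeled_docs:
        (pos if is_relevant else neg).update(set(text.lower().split()))
    return {word for word in pos if pos[word] > neg[word]}


def predict(model, text):
    """Predict 'relevant' if the document contains any positive keyword."""
    return any(word in model for word in text.lower().split())


def accuracy(model, labeled_docs):
    correct = sum(predict(model, text) == label for text, label in labeled_docs)
    return correct / len(labeled_docs)


def transfer_matrix(corpora):
    """Train on each corpus and test on every corpus.

    Returns a dict mapping (train_name, test_name) to accuracy.
    """
    models = {name: train_keyword_model(docs) for name, docs in corpora.items()}
    return {
        (train_name, test_name): accuracy(models[train_name], corpora[test_name])
        for train_name in corpora
        for test_name in corpora
    }


# Invented miniature corpora, for illustration only.
corpora = {
    "picornaviridae": [
        ("ires rna structure in poliovirus", True),
        ("poliovirus vaccine campaign", False),
    ],
    "flaviviridae": [
        ("conserved rna structure in dengue virus", True),
        ("dengue virus outbreak report", False),
    ],
}

matrix = transfer_matrix(corpora)
for (train_name, test_name), acc in sorted(matrix.items()):
    print(f"train={train_name:15s} test={test_name:15s} accuracy={acc:.2f}")
```

The off-diagonal entries of the matrix (training and test corpus differ) are the ones that measure transfer; the diagonal entries are measured on the training data itself and are only a sanity check.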




This work is supported by the Austrian Fonds zur Förderung der Wissenschaftlichen Forschung, Project Nos. P-13545-MAT and P-15893, and by the German DFG Bioinformatics Initiative. We use the ConceptComposer Software, courtesy of TextTech, the English dictionary of Projekt Deutscher Wortschatz, and the Weka 3 Machine Learning Software.
Last modified on: Wed Jul 16 11:05:53 CEST 2003 (faulstic)