http://www.bioinf.uni-leipzig.de/~faulstic/litsift/
litsift: Automated Text Categorization in Bibliographic Search
Abstract
In bioinformatics there exist research topics that cannot be
uniquely characterized by a set of key words because relevant key
words are (i) also heavily used in other contexts and (ii) often
omitted in relevant documents because the context is clear to the
target audience. Information retrieval interfaces such as Entrez/PubMed
produce either low precision or low recall in this case. To achieve
high recall at reasonable precision, the results of a broad
information retrieval search have to be filtered to remove
irrelevant documents. We use automated text categorization for this
purpose.
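The trade-off described above can be illustrated with a minimal sketch.
The document IDs and the three result sets are hypothetical toy data, not
results from this project; the function only restates the standard
definitions of precision and recall.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: documents 1-4 are the relevant ones.
relevant = {1, 2, 3, 4}
narrow_query = {1, 2}            # specific key words: little noise, misses documents
broad_query = set(range(1, 11))  # general key words: finds everything, much noise
filtered = {1, 2, 3, 5}          # broad result after an imperfect classifier filter

for name, result in [("narrow", narrow_query),
                     ("broad", broad_query),
                     ("filtered", filtered)]:
    p, r = precision_recall(result, relevant)
    print(f"{name:8s} precision={p:.2f} recall={r:.2f}")
```

In this toy setting the narrow query has perfect precision but misses half
of the relevant documents, the broad query finds all of them among much
noise, and the filtered result sits in between on both measures.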
In this project we use the topic of conserved secondary RNA
structures in viral genomes as a running example. We are investigating
how well automated classifiers trained on a manually labeled
reference corpus can be applied to similar unlabeled
corpora. Further research goals are to validate and enhance existing
feature selection methods and to experiment with classification
techniques that take unlabeled instances into account.
We are working on a bibliographic search tool, litsift, that
sends a user query to a bibliographic database such as PubMed,
retrieves the search results and the articles cited therein, and
ranks the results according to the predictions of a classifier
previously trained on a labeled reference corpus using the same
tool. The user may choose to re-label some of the results manually
and retrain the classifier in order to enhance its performance.
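The rank, re-label and retrain loop described above might look roughly
like the following sketch. It is not the litsift implementation: the
small naive Bayes classifier merely stands in for whatever classifier the
tool uses, the database query step is omitted, and all names
(NaiveBayes, rank) and example texts are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative only).

    Labels are booleans (True = relevant); both classes must occur in training.
    """

    def fit(self, texts, labels):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        return self

    def log_odds(self, text):
        """Log-odds that a text is relevant; higher means more likely relevant."""
        score = math.log(self.doc_counts[True] / self.doc_counts[False])
        for label, sign in ((True, 1), (False, -1)):
            total = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                if word in self.vocab:  # words unseen in training are ignored
                    count = self.word_counts[label][word] + 1
                    score += sign * math.log(count / total)
        return score


def rank(classifier, documents):
    """Order documents from most to least likely relevant."""
    return sorted(documents, key=classifier.log_odds, reverse=True)


# Hypothetical labeled reference corpus (True = relevant to the topic).
texts = ["conserved rna secondary structure in viral genome",
         "rna structure elements in picornavirus genome",
         "protein secondary structure prediction",
         "clinical trial of vaccine"]
labels = [True, True, False, False]

# Hypothetical search results, as they might come back from a broad query.
results = ["vaccine trial outcome",
           "conserved rna structure in flavivirus genome",
           "protein structure database"]

classifier = NaiveBayes().fit(texts, labels)
ranked = rank(classifier, results)
for doc in ranked:
    print(f"{classifier.log_odds(doc):+.2f}  {doc}")

# The user re-labels one result as irrelevant; the classifier is retrained.
texts.append("protein structure database")
labels.append(False)
classifier = NaiveBayes().fit(texts, labels)
reranked = rank(classifier, results)
```

Using the log-odds as a ranking score rather than a hard yes/no decision
keeps borderline documents visible to the user, who can then re-label them.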
A prototype of the core functionality of litsift has been
used to assess the transferability of classifiers trained on corpora
on virus groups such as Picornaviridae, Flaviviridae and
Hepadnaviridae.
Publications
Open Positions
- Diploma Thesis: We are looking
for a computer science student willing to participate in this
project. This involves the implementation of the litsift tool
based on the existing prototype as well as research on the
above-mentioned topics. Requirements are experience with Java, SQL, and
Web interfaces, and some knowledge of machine learning and information
retrieval.
Acknowledgments:
This work is supported by the Austrian Fonds zur Förderung der
Wissenschaftlichen Forschung, Project Nos. P-13545-MAT and P-15893, and
the German DFG Bioinformatics Initiative. We use the ConceptComposer
software, courtesy of TextTech, the English dictionary of Projekt
Deutscher Wortschatz, and the Weka 3 machine learning software.
Last modified on: Wed Jul 16 11:05:53 CEST 2003 (faulstic)