1871 Making the World's Scientific Information (More) Organized, Accessible, and Usable

Friday, February 19, 2010: 3:50 PM
Room 2 (San Diego Convention Center)
Edward Briscoe, University of Cambridge, Cambridge, United Kingdom
Web portals like Google Scholar and ScienceDirect have revolutionized access to scientific information by making it possible to identify relevant papers via keyword search, and then to browse them on-line. However, as scientific information continues to grow exponentially, and as (e-)science embraces automation, keeping abreast of the information in these papers and exploiting it effectively is becoming impossible. I'll describe two scientific literature search and information extraction systems, developed in collaboration with the FlyBase (Fruit Fly Genomics) curation team and based on similar underlying image and text processing techniques: one designed to enhance curation of individual papers, and the other to support more fine-grained querying of and access to information in a collection. FlyBrowse redisplays a paper, highlighting mentions of genes and anaphorically associated entities selected by the curator, and supports automated thematic navigation. It improves curator productivity by 20%. FlySearch indexes a collection of annotated papers and supports integrated search over individual sentences and images, aggregating information across the collection. For example, one can search captions describing a specific gene regulating a biological process and restrict the associated images to a specific body part. Both systems rest on a similar processing pipeline in which a Portable Document Format paper is first converted to Scientific eXtensible Mark-up Language, preserving its logical structure but, for example, separating images, tables, and references from running text; specialized text and image processing tools are then applied to the different components of the paper. These tools can compute image similarity and recognize gene names, facts about genes, and their relationships to other biological entities, among other things. They have been designed to be as generic as possible to facilitate application to different areas of science. Where they require domain-specific tuning, they have been developed using semi-supervised machine learning methods to minimize the cost of that tuning. Our results and those of others suggest that scientific literature search and information extraction can already deliver useful results. Nevertheless, challenges remain. Perhaps the most pressing is handling more of the contextually mediated, variant ways of expressing the same meaning. We would also like to go beyond finding and extracting relations between biological entities and, for example, support temporal reasoning about biological events.
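
To make the shape of the processing pipeline described above more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the FlyBrowse or FlySearch implementation: the element names (`paper`, `p`, `figure`, `caption`, `ref`) are a hypothetical, simplified markup and not the real Scientific eXtensible Mark-up Language schema, the PDF conversion step is skipped entirely, and the gene recognizer is a toy lexicon lookup standing in for the semi-supervised machine learning methods the abstract mentions.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for illustration only.
# This is NOT the real SciXML schema.
SAMPLE = """<paper>
  <body>
    <p>The gene hedgehog regulates wing development. It interacts with patched.</p>
    <figure id="f1"><caption>Expression of hedgehog in the wing disc.</caption></figure>
    <p>No effect was observed in the eye.</p>
  </body>
  <references><ref>Author A. 2009.</ref></references>
</paper>"""

# Toy lexicon (an assumption); a real system would learn a recognizer.
GENE_LEXICON = {"hedgehog", "patched"}


def split_components(xml_text):
    """Separate running text, figure captions, and references."""
    root = ET.fromstring(xml_text)
    paragraphs = [p.text.strip() for p in root.iter("p") if p.text]
    captions = {
        fig.get("id"): fig.findtext("caption", default="").strip()
        for fig in root.iter("figure")
    }
    references = [r.text.strip() for r in root.iter("ref") if r.text]
    return {"text": paragraphs, "captions": captions, "references": references}


def find_gene_mentions(sentence):
    """Return tokens of the sentence that appear in the toy gene lexicon."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    return [t for t in tokens if t.lower() in GENE_LEXICON]


def annotate(xml_text):
    """Split the paper into components, then tag gene mentions per sentence."""
    parts = split_components(xml_text)
    annotated = []
    for para in parts["text"]:
        # Naive sentence splitter: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", para):
            annotated.append((sentence, find_gene_mentions(sentence)))
    return parts, annotated


if __name__ == "__main__":
    parts, annotated = annotate(SAMPLE)
    for sentence, genes in annotated:
        print(genes, "|", sentence)
    print(parts["captions"])
```

Keeping the component split separate from the per-sentence annotation mirrors the design point made in the abstract: once images, tables, and references are separated from running text, different specialized tools can be applied to each component independently.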