The Linguistic Context of Citations: a Cartography of the Structure of Scientific Papers

Sunday, 15 February 2015
Exhibit Hall (San Jose Convention Center)
Marc Bertin, Centre interuniversitaire de recherche sur la science et la technologie, Montreal, QC, Canada
The IMRaD structure of scientific papers (Introduction, Methods, Results, and Discussion) has been adopted by most journals and became standard in the 1970s. It was introduced to facilitate the reading of publications and to provide faster access to information by standardizing the argumentative structure of articles. The IMRaD sequence provides an outline for scientific writing, dividing articles into four sections, each with a specific rhetorical function. Previous studies on the density of citations in the four main section types have shown that the distribution of references along the text progression is essentially invariant across the seven PLOS journals. In this study, we use Natural Language Processing tools to examine how authors use citations in the different sections, in order to build visualizations of the inner structure of scientific articles. We have processed the dataset of articles published in the seven PLOS journals over the period 2009-2012. This dataset contains about 50,000 scientific papers, mainly in the fields of Biology and Medicine. We consider that verbs found in sentences containing citations are an important indicator of the purpose of citations and of the reasons behind citing a given document. Thus, the linguistic context of citations indicates the relations that exist between authors. By processing the section titles, we have identified the different sections of the papers as well as the positions of citations in these sections. Citations were then linked with the verb occurrences in citation contexts using the Stanford POS-tagger. The data on verb frequencies in the four section types and their positions in the text were visualized using the Circos tool to produce a circular layout showing the verbs along the text progression. Our results demonstrate a strong relation between the verbs used around citations and the rhetorical structure of scientific papers.
Furthermore, we provide context tree visualizations for the set of verbs and multi-layer author network maps, generated using Gephi, where relations between authors are considered in terms of the position of citations in the IMRaD structure. By processing the full text of scientific publications, we achieve a detailed representation of the phenomenon of citations. This approach provides a novel framework for the study of relations between authors in the field of Big Humanities. Our ultimate goal is to design interactive domain-specific maps that could be exploited as tools for improved information extraction and analysis of citation networks.
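
The processing steps described in this abstract (mapping section titles to IMRaD types, locating sentences that contain citations, and counting the verbs in those sentences per section) can be illustrated with a minimal sketch. It is not the pipeline used in the study: the title keywords, the bracketed numeric citation pattern, and the small verb lexicon are all assumptions made for illustration, and the lexicon is only a toy stand-in for the Stanford POS-tagger.

```python
import re
from collections import Counter

# Assumed keyword patterns for mapping a section title to an IMRaD type.
# The first matching pattern wins, so "Results and Discussion" maps to Results.
IMRAD_PATTERNS = [
    ("Introduction", re.compile(r"\b(introduction|background)\b", re.I)),
    ("Methods", re.compile(r"\b(methods?|materials)\b", re.I)),
    ("Results", re.compile(r"\bresults?\b", re.I)),
    ("Discussion", re.compile(r"\b(discussion|conclusions?)\b", re.I)),
]

# Assumed citation format: bracketed numbers such as [3], [2, 5] or [7-9].
CITATION = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")

# Toy stand-in for a POS tagger: maps a few verb forms to a base form.
VERB_FORMS = {
    "show": "show", "showed": "show", "shown": "show",
    "describe": "describe", "described": "describe",
    "report": "report", "reported": "report",
    "use": "use", "used": "use",
}


def classify_section(title):
    """Return the IMRaD type for a section title, or None if no pattern matches."""
    for label, pattern in IMRAD_PATTERNS:
        if pattern.search(title):
            return label
    return None


def split_sentences(text):
    """Naive sentence splitter on terminal punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def verb_counts_by_section(sections):
    """Count lexicon verbs in citation sentences, grouped by IMRaD type.

    `sections` is a list of (title, text) pairs for one article.
    """
    counts = {}
    for title, text in sections:
        label = classify_section(title)
        if label is None:
            continue  # skip sections outside the IMRaD scheme
        counter = counts.setdefault(label, Counter())
        for sentence in split_sentences(text):
            if not CITATION.search(sentence):
                continue  # keep only sentences that contain a citation
            for token in re.findall(r"[a-z]+", sentence.lower()):
                if token in VERB_FORMS:
                    counter[VERB_FORMS[token]] += 1
    return counts
```

Per-section counts of this kind are the sort of table that could then be passed to a layout tool such as Circos; a real implementation would replace the lexicon lookup with tagger output and handle other citation styles.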