Project Information

Start date: 1 April 2009
End date: 31 September 2010

Project Minute Summary

SAPIENT Automation aims to help researchers process scientific papers faster and extract the information they are interested in. The project will achieve this by automating the recognition of core scientific concepts, such as Motivation, Method, Result and Conclusion, in papers and using them to generate automatic use-based summaries.

Motivation

Research output in the form of scientific papers is being generated at a faster pace than ever before, especially in the life sciences. This makes it a challenge for researchers and resource curators to extract and evaluate the knowledge the papers contain. Automated text mining methods currently operate mainly on abstracts, but scientists have highlighted the need for automatic processing of the full text of scientific papers. Researchers in information extraction and information retrieval need to be able to recognise areas of interest in the papers, and scientists have spoken of the need for machine-readable summaries of papers. However, the manual production of semantic markup in papers is very time-consuming and cannot cater for the millions of papers already published.

Goal

The ART project (completed in March 2009) produced a tool (SAPIENT) and a set of meta-data (CISP) for the annotation of core scientific concepts (Motivation, Background, Hypothesis, Goal, Object, Method, Experiment, Observation, Result, Conclusion, Model) in scientific papers. A selection of 225 papers (more than 1 million words; the ART Corpus) covering topics in physical chemistry and biochemistry was annotated at the sentence level by 16 experts using SAPIENT and a set of annotation guidelines. The goal of the SAPIENT Automation project is to assess the added benefit of marking up core scientific concepts in papers, as well as the sustainability and reproducibility of the resulting meta-data. SAPIENT Automation will automate the SAPIENT tool by employing machine learning and using the ART corpus as training data. The automatically generated concepts will be used to produce digital summaries of the papers.

Methodology

We aim to show that the automated generation of CISP meta-data would be of significant benefit to the research community, facilitating information retrieval and information extraction. We plan to train machine learning (ML) algorithms on the ART corpus, in order to learn to predict core scientific concepts automatically. The ML predictors will be evaluated on a subset of the ART corpus. We also aim to employ ontology engineering and natural language generation methods to combine the automated CISP meta-data for the generation of digital abstracts in a standard structured form. Such digital abstracts can improve literature mining, support web-services, and increase the sharing and re-use of scientific knowledge. The automatically generated CISP meta-data will be compared against other similar meta-data both in terms of reproducibility and usefulness in generating digital abstracts.
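The train-then-evaluate step described above can be pictured with a minimal sketch. It assumes a multinomial naive Bayes classifier over bag-of-words features, chosen here only because it fits in a few lines; the project text does not say which ML algorithms are used. The sentences, labels and function names below are invented for illustration and are not taken from the ART corpus.

```python
import math
from collections import Counter, defaultdict


def tokenize(sentence):
    """Lowercase, split on whitespace and strip surrounding punctuation."""
    tokens = (word.strip(".,;:()[]") for word in sentence.lower().split())
    return [t for t in tokens if t]


def train(labelled_sentences):
    """Collect per-concept sentence counts and per-concept word counts."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for sentence, label in labelled_sentences:
        label_counts[label] += 1
        for token in tokenize(sentence):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, word_counts, vocabulary


def predict(model, sentence):
    """Return the concept with the highest add-one-smoothed log probability."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    known = [t for t in tokenize(sentence) if t in vocabulary]  # skip unseen words
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for token in known:
            score += math.log((word_counts[label][token] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy stand-in for sentence-level annotations (hypothetical data).
training = [
    ("We measured the reaction rate at 300 K", "Method"),
    ("Samples were prepared using standard protocols", "Method"),
    ("The rate increased by a factor of two", "Result"),
    ("The yield increased significantly", "Result"),
    ("We conclude that the catalyst is effective", "Conclusion"),
    ("These findings suggest that the model is valid", "Conclusion"),
]
# Held-out sentences play the role of the evaluation subset.
held_out = [
    ("Samples were measured using a spectrometer", "Method"),
    ("The yield increased by ten percent", "Result"),
    ("We conclude that the findings are valid", "Conclusion"),
]

model = train(training)
correct = sum(predict(model, s) == label for s, label in held_out)
accuracy = correct / len(held_out)
print(f"held-out accuracy: {accuracy:.2f}")
```

Keeping the held-out sentences separate from the training sentences mirrors the plan to evaluate the predictors on a subset of the corpus; a real experiment would use far more data and richer features than single words.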

Objectives

Our objectives within the SAPIENTA project include the evaluation of:

  • the annotation guidelines (developed to implement the CISP meta-data as an annotation scheme)
  • the annotator agreement in the ART corpus
  • the sustainability of CISP meta-data
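
Annotator agreement, the second item above, is commonly quantified with a chance-corrected statistic. The sketch below assumes Cohen's kappa for a pair of annotators; the project text does not name the measure it uses, and the label sequences are invented for illustration (M, R and C stand for Method, Result and Conclusion).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same sentences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of sentences on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical sentence-level labels from two annotators for ten sentences.
annotator_1 = ["M", "M", "M", "M", "R", "R", "R", "C", "C", "C"]
annotator_2 = ["M", "M", "M", "R", "R", "R", "R", "C", "C", "M"]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on 8 of 10 sentences (observed agreement 0.8) while chance agreement is 0.34, so kappa is (0.8 - 0.34) / (1 - 0.34), roughly 0.70. With more than two annotators per paper, a pairwise average or a multi-rater statistic would be needed instead.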

We aim to show that the automated generation of CISP meta-data would be of significant benefit to the research community, facilitating information retrieval and information extraction. We also aim to demonstrate that the generation of digital abstracts in a standard structured form will improve literature mining, support web-services, and increase the sharing and re-use of scientific knowledge. Even though the current papers in the ART corpus are from the domains of physical chemistry and biochemistry, the CISP meta-data and the SAPIENT tool are domain independent, and the approach could be applied to a wide range of scientific papers. More specifically, the SAPIENT Automation project has the following agenda:

  1. Evaluation of existing meta-data in the ART corpus, in terms of statistical measures. This will result in further improvement of guidelines and possibly a new round of annotation for corpus enlargement.
  2. Use of existing machine learning algorithms, trained on the ART corpus, for the automatic recognition of content meta-data.
  3. Evaluation of the automated methods and the meta-data obtained on a testing set from the ART corpus and on a set of new papers.
  4. Comparison and evaluation of our own automated meta-data against other types of meta-data such as the categories in Argumentative Zoning [1] and the ones in [2].
  5. Further evaluation of the CISP meta-data by using it for the automatic generation of text for digital abstracts and summaries, from content meta-data, for papers in the ART corpus and the new set of papers.
  6. Smooth integration of the automated methods into the SAPIENTA (SAPIENT Automated) software.
  7. Investigation into new types of queries over the meta-data, which can add to the functionality of SAPIENTA.
  8. Dissemination of SAPIENTA and the automatic meta-data to the research community, targeting particular users (reviewers, editors, authors).
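
Item 5 above, generating structured digital abstracts from content meta-data, can be illustrated with a deliberately simple stand-in: group already-labelled sentences by concept and emit them in a fixed order. This is an extractive template, not the ontology engineering and natural language generation methods the project plans to use. The section order, the function name and the sample sentences are all assumptions made for illustration.

```python
# Concepts included in the summary and their order (an assumed subset of
# the eleven concepts listed under "Goal").
SECTION_ORDER = ["Motivation", "Goal", "Method", "Result", "Conclusion"]


def structured_abstract(labelled_sentences, max_per_concept=1):
    """Group sentences by concept and emit one 'Concept: text' line each."""
    grouped = {concept: [] for concept in SECTION_ORDER}
    for sentence, concept in labelled_sentences:
        # Ignore concepts outside the template and cap sentences per concept.
        if concept in grouped and len(grouped[concept]) < max_per_concept:
            grouped[concept].append(sentence)
    lines = [
        f"{concept}: {' '.join(grouped[concept])}"
        for concept in SECTION_ORDER
        if grouped[concept]
    ]
    return "\n".join(lines)


# Hypothetical classifier output for one paper: (sentence, concept) pairs.
paper = [
    ("Full-text mining of papers remains difficult.", "Motivation"),
    ("We used a sentence classifier.", "Method"),
    ("We also used a second classifier.", "Method"),
    ("Earlier work focused on abstracts.", "Background"),
    ("Accuracy improved.", "Result"),
]

summary = structured_abstract(paper)
print(summary)
```

Because every paper is rendered with the same concept order, summaries produced this way are directly comparable across papers, which is the property a standard structured form is meant to provide.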

References

[1] Teufel, S. and Moens, M. (1999) Argumentative classification of extracted sentences as a first step towards flexible abstracting. In I. Mani and M. Maybury (Eds.), Advances in automatic text summarization. MIT Press.
[2] Shatkay, H., Pan, F., Rzhetsky, A. and Wilbur, W.J. (2008) Multi-Dimensional Classification of Biomedical Text: Toward Automated, Practical Provision of High-Utility Text to Diverse Users. Bioinformatics, 24(18), pp. 2086-2093.