Overview

The NASA Astrophysics Data System (ADS) is a digital library and search engine that provides access to a vast collection of scientific literature covering fields such as astrophysics, earth science, planetary science, and heliophysics. Our corpus of over 20 million records, more than 7 million of which are indexed along with their fulltext, is a rich resource for exploring new AI-assisted methods for information retrieval and data enrichment. Such methods are increasingly necessary: the ever-growing volume of scientific literature makes relevant sources harder to find.

In this poster, we summarize the ADS team's work in this area over the last year, including curating machine learning datasets, experimenting with large language models (LLMs), and incorporating AI-enabled data enrichment techniques into our data ingestion pipelines.

Providing data for language models

An example of a manually labeled citation context text snippet from the FOCAL dataset.

ADS has been building and curating datasets to train deep learning models, both to provide a richer user experience and to enable automated data enrichment in our internal pipelines. The datasets described here are publicly available to researchers. The models are released under the MIT license and the datasets under the CC-BY 4.0 license. Briefly, these licenses allow researchers to use, share, modify, or build upon these works as long as appropriate attribution is given. We discuss two of these datasets here.

The Detecting Entities in the Astrophysical Literature (DEAL) dataset is a curated dataset for Named Entity Recognition (NER), the task of identifying entities of predetermined types, such as Organization or Location, in text. The dataset consists of text fragments taken from the fulltext or acknowledgements sections of articles in the astrophysical literature (Astrophysical Journal, Astronomy & Astrophysics, and Monthly Notices of the Royal Astronomical Society). A domain expert manually labeled roughly 6,000 text snippets, which contain over 147,000 entities of 33 different types. The DEAL dataset was used in a shared task at the First Workshop on Information Extraction from the Scientific Literature (WIESP 2022), held as part of the AACL-IJCNLP 2022 conference. The proceedings of this workshop are part of the ACL Anthology.

The Function Of Citation in Astrophysics Literature (FOCAL) dataset is a curated dataset for citation context analysis, which “facilitates the syntactic and semantic analysis of the contents of the citation context to understand how and why authors discuss others research work” (Kunnath et al. 2021), for example whether a work is cited as background or as motivation. The snippets containing the citations come from over 25,000 astronomy articles, drawn from the same journals and publication years as the DEAL dataset. From these articles, over 2 million citations and their contexts were harvested. Of these, only citations with context sizes between 2,000 and 10,000 characters were selected, so that the portions of the context most relevant to understanding the citation’s function could be determined. A domain expert manually examined these text snippets to determine the citation function and to label the relevant context. In total there are 6,023 annotated citations. The FOCAL dataset will be used for the Second Workshop on Information Extraction from the Scientific Literature (WIESP 2023), part of IJCNLP-AACL 2023.
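The context-size selection described above can be sketched in a few lines of Python. This is only an illustration: the `CitationContext` record and the `select_by_context_size` helper are hypothetical names, and treating the 2,000 and 10,000 character bounds as inclusive is an assumption, since the text does not say how the boundaries were handled.

```python
from dataclasses import dataclass

# Bounds quoted in the text; inclusive handling is an assumption.
MIN_CONTEXT_CHARS = 2_000
MAX_CONTEXT_CHARS = 10_000


@dataclass
class CitationContext:
    """Hypothetical record: one citation plus the text surrounding it."""
    citing_paper: str
    cited_paper: str
    context: str


def select_by_context_size(records, lo=MIN_CONTEXT_CHARS, hi=MAX_CONTEXT_CHARS):
    """Keep only records whose context length is within [lo, hi] characters."""
    return [r for r in records if lo <= len(r.context) <= hi]


if __name__ == "__main__":
    sample = [
        CitationContext("paper-A", "paper-B", "x" * 500),     # too short
        CitationContext("paper-A", "paper-C", "x" * 5_000),   # within bounds
        CitationContext("paper-D", "paper-B", "x" * 20_000),  # too long
    ]
    kept = select_by_context_size(sample)
    print([r.cited_paper for r in kept])
```

Filtering on length before manual annotation keeps each snippet long enough to contain surrounding discussion, yet short enough for an expert to read in full.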

To learn more about our datasets and astroBERT, our astrophysics-specific language model, see our blog post.

SciX Brain: an LLM experiment

An example conversation with the SciX Brain chatbot. The question is "what is OSIRIS" and provided is the ADS query "abs:OSIRIS abs:Keck". The chatbot's answer is "OSIRIS (Optical System for Imaging and Low Resolution Integrated Spectroscopy) is an integral field spectrograph, which means it combines imaging capabilities with the ability to collect detailed information about light in a specific region. In other words, it lets you look at the universe as a whole while simultaneously analyzing what's happening at very tiny scales within it. OSIRIS was primarily designed for use on the Keck II Telescope and is part of the NASA-funded Keck Adaptive Optics system, enabling researchers to study exoplanets, galaxy evolution, star formation, and other astrophysical phenomena in unprecedented detail." with cited sources.
Example conversation with the SciX Brain chatbot, with input from the ADS search engine.

 

Open-source large language models (LLMs) offer the opportunity to think creatively and explore alternative methods for information retrieval and data augmentation while protecting data copyright and users’ privacy. However, when these models are asked questions directly, without supporting context, they are prone to generating inaccurate or fictional responses (hallucinations). To address this issue, we have been experimenting internally with open-source LLMs (7-13 billion parameters) and our large corpus of scientific articles. This experimentation led us to build SciX Brain, the code name for a highly customizable internal web interface and RESTful API for interacting with LLMs. The web interface allows users to hold quick conversations with the deployed LLMs and rapidly assess the quality of their responses. The API enables the ADS team to develop pipelines that automatically use LLM capabilities for data enrichment and information extraction tasks.

Our experiments have included:

  • retrieval augmented generation (RAG): a strategy that supplies the LLM with additional context when it answers a question. In the screenshot, we've passed in context retrieved via an ADS search. We've also experimented with an embeddings (i.e. semantic vector) database.
  • comparison of various open-source LLMs
  • grammars
  • natural language to structured Solr queries
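
As a rough illustration of the retrieval augmented generation idea in the list above, the sketch below assembles a prompt from a question and a handful of retrieved records. It is not the SciX Brain implementation: `build_rag_prompt`, the prompt wording, and the `title`/`bibcode`/`abstract` dictionary keys are all assumptions made for the example, and the records shown are placeholders rather than real search results.

```python
def build_rag_prompt(question, documents, max_chars_per_doc=1500):
    """Assemble a prompt asking the model to answer only from numbered sources.

    `documents` is a list of dicts with hypothetical keys
    'title', 'bibcode', and 'abstract'.
    """
    lines = [
        "Answer the question using only the numbered sources below.",
        "Cite sources by number. If they are insufficient, say so.",
        "",
    ]
    for i, doc in enumerate(documents, start=1):
        # Truncate long abstracts so the prompt fits a small model's context.
        snippet = doc["abstract"][:max_chars_per_doc]
        lines.append(f"[{i}] {doc['title']} ({doc['bibcode']}): {snippet}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder records standing in for results of a search such as
    # "abs:OSIRIS abs:Keck"; they are not real ADS entries.
    retrieved = [
        {"title": "Placeholder title 1", "bibcode": "bibcode-1",
         "abstract": "Placeholder abstract text about the instrument."},
        {"title": "Placeholder title 2", "bibcode": "bibcode-2",
         "abstract": "Another placeholder abstract."},
    ]
    print(build_rag_prompt("what is OSIRIS", retrieved))
```

The resulting string would then be sent to whichever LLM is deployed. Numbering the sources is one simple way to let the model's answer point back to the records it relied on.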

While our chatbot is not available to users outside of the ADS team, these experiments will inform our work going forward and may lead to user-facing features in the future.

To read more about our experimental LLM setup, please see our proceedings paper from the 33rd annual Astronomical Data Analysis Software & Systems conference (ADASS XXXIII).

Data enrichment: planetary features

The automatic identification of planetary features, such as craters or maria, in astronomy and planetary science publications presents numerous challenges.

Screenshot of the new planetary features filter, available in the new SciX user interface.

Many feature names coincide with the names of the places or people they are named after, such as Tempe or Sagan. Some names are used in many other contexts, e.g. Apollo, or can appear in text as adjectives, like the lunar craters Black, Green, and White. Additionally, some features share identical names across different celestial bodies and require disambiguation, such as the Adams crater, which exists on both the Moon and Mars.

We have developed a multi-step pipeline combining rule-based filtering, statistical relevance analysis, part-of-speech (POS) tagging, a named entity recognition (NER) model, hybrid keyword harvesting, knowledge graph (KG) matching, and inference with a locally installed large language model (LLM) to reliably identify planetary names despite these challenges. When evaluated on a dataset of astronomy papers from the Astrophysics Data System (ADS), this methodology achieves an F1-score over 0.97 in disambiguating planetary feature names.
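
For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below shows the standard calculation; the counts in the usage example are hypothetical and are not taken from our evaluation.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Return the harmonic mean of precision and recall (0.0 if undefined)."""
    predicted = true_pos + false_pos   # everything the pipeline tagged
    actual = true_pos + false_neg      # everything it should have tagged
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    print(round(f1_score(true_pos=97, false_pos=3, false_neg=3), 2))
```

Because F1 penalizes both spurious tags and missed ones, a high score means a pipeline is neither over-tagging common words like Apollo nor overlooking genuine feature mentions.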

With this pipeline, we have tagged over 6000 papers with features from over three dozen solar system bodies. Searching on these tags is now available in ADS using the gpn search tag, e.g. gpn:Mars. Browsing and filtering on these search tags is available in the new SciX interface.
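
A query such as gpn:Mars can also be issued programmatically. The sketch below only builds the request and does not send it. The endpoint URL and the `q`, `fl`, and `rows` parameters follow the public ADS search API as we understand it, but should be checked against the current API documentation; `build_gpn_request` is a hypothetical helper and `YOUR_API_TOKEN` is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint of the public ADS search API; verify before use.
ADS_SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"


def build_gpn_request(feature, token, rows=10):
    """Build (but do not send) a search request for a planetary feature tag."""
    params = {
        "q": f"gpn:{feature}",    # the search tag described in the text
        "fl": "bibcode,title",    # fields to return
        "rows": rows,             # number of results
    }
    url = f"{ADS_SEARCH_URL}?{urlencode(params)}"
    return Request(url, headers={"Authorization": f"Bearer {token}"})


if __name__ == "__main__":
    req = build_gpn_request("Mars", token="YOUR_API_TOKEN")
    print(req.full_url)
```

Passing the returned object to `urllib.request.urlopen` would execute the search. Feature names containing spaces or special characters may need additional quoting, which this sketch does not attempt.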

To read more about the work behind the planetary features function, see our pre-print.

Data enrichment:

In addition to the planetary features project, currently in production, we are developing tools that utilize AI to enrich our holdings in other ways.

Segment of the Unified Astronomy Thesaurus: the observational astronomy branch, expanded to show infrared observatories.

Currently in development is a machine learning-based project to tag astronomy papers with keywords from the Unified Astronomy Thesaurus (UAT). While it is currently possible to search ADS by keywords provided by publishers, no single vocabulary has been used consistently throughout the literature indexed in ADS. The Astronomical Subject Keywords, in use by leading astronomy journals since the 1970s, have not been updated since 2013 and may not cover the latest topics in the field. These keywords also do not include definitions or relationships between concepts. For this reason, the American Astronomical Society (AAS) journals and the Publications of the Astronomical Society of the Pacific (PASP) adopted the Unified Astronomy Thesaurus as their keyword system in 2019 and 2020, respectively. Once the project is complete, ADS users will be able to browse results using the left-side facets on the query results screen, or conduct an initial search using UAT terms.

We are also developing a robust machine learning-based methodology to classify records upon ingestion so that we can automatically assign new papers to a particular collection (i.e. astronomy, physics, and earth science; eventually this will include planetary science and heliophysics). Currently, collection assignment is mostly based on journal, which fails for interdisciplinary journals. This new technique will make it easier for researchers to limit their searches to the relevant literature and for ADS to develop a discipline-specific view over its data holdings.

 

To read more about the UAT project, see our blog post.
