
Overview

The NASA Astrophysics Data System (ADS) is a digital library and search engine that provides access to a vast collection of scientific literature covering fields such as astrophysics, earth science, planetary science, and heliophysics. Our corpus of over 20 million records, more than 7 million of which are indexed with their full text, is a rich source for exploring new AI-aided methods for information retrieval and data enrichment. Such methods are increasingly necessary: the ever-growing volume of scientific literature makes relevant sources harder to find, so new techniques are needed to improve discovery.

In this poster, we summarize the ADS team's work in this area over the last year, including curating machine learning datasets, experimenting with large language models (LLMs), and incorporating AI-enabled data enrichment techniques into our data ingestion pipelines.


Providing data for language models

An example of a manually labeled citation context text snippet from the FOCAL dataset.

ADS has been building and curating datasets to train deep learning models, both to provide a richer user experience and to enable automated data enrichment in our internal pipelines. The datasets described here are publicly available to researchers. The models are released under an MIT license and the datasets under a CC-BY 4.0 license. Briefly, these licenses allow researchers to use, share, modify, or build upon these works as long as appropriate attribution is given. We discuss two of these datasets here.

The Detecting Entities in the Astrophysical Literature (DEAL) dataset is a curated dataset for Named Entity Recognition (NER). This task involves identifying predetermined types of entities in text, such as Organization or Location. The dataset consists of text fragments taken from the full text or acknowledgements sections of the astrophysical literature (Astrophysical Journal, Astronomy & Astrophysics, and the Monthly Notices of the Royal Astronomical Society). A domain expert manually labeled roughly 6,000 text snippets, containing over 147,000 entities, with 33 different entity types. The DEAL dataset was used in a shared task at the First Workshop on Information Extraction from the Scientific Literature (WIESP 2022), held as part of the AACL-IJCNLP 2022 conference. The proceedings of this workshop are part of the ACL Anthology.

The Function Of Citation in Astrophysics Literature (FOCAL) dataset is a curated dataset for citation context analysis, which “facilitates the syntactic and semantic analysis of the contents of the citation context to understand how and why authors discuss others research work” (Kunnath et al. 2021), for example as background or as motivation. The snippets that contain the citations were obtained from over 25,000 astronomy articles, drawn from the same journals and publication years as the DEAL dataset. From this set of articles, over 2 million citations and their contexts were harvested. Of these, only citations with context sizes between 2,000 and 10,000 characters were selected, so that annotators could determine which portions of the context are most relevant to understanding the citation’s function. A domain expert manually examined these text snippets to determine the citation function and to label the relevant context. In total there are 6,023 annotated citation instances. The FOCAL dataset will be used for the Second Workshop on Information Extraction from the Scientific Literature (WIESP 2023), part of IJCNLP-AACL 2023.
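The context-size selection described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to build FOCAL: the `CitationContext` record, its field names, and the choice to treat both bounds as inclusive are assumptions made for the example.

```python
from dataclasses import dataclass

# Bounds taken from the text; treating them as inclusive is an assumption.
MIN_CONTEXT_CHARS = 2_000
MAX_CONTEXT_CHARS = 10_000


@dataclass
class CitationContext:
    """Hypothetical record for one harvested citation and its surrounding text."""
    citing_id: str
    cited_id: str
    context: str


def select_by_context_size(citations, min_chars=MIN_CONTEXT_CHARS, max_chars=MAX_CONTEXT_CHARS):
    """Keep only citations whose context length falls within the given bounds."""
    return [c for c in citations if min_chars <= len(c.context) <= max_chars]
```

A filter of this kind would run before annotation, so that every snippet given to the annotator is long enough to carry useful context and short enough to label by hand.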

To learn more about our datasets and astroBERT, our astrophysics-specific language model, see our blog post.


SciX Brain: an LLM experiment

Example conversation with the SciX Brain chatbot, with input from the ADS search engine. The question is "what is OSIRIS" and the ADS query provided is "abs:OSIRIS abs:Keck". The chatbot's answer, given with cited sources, is "OSIRIS (Optical System for Imaging and Low Resolution Integrated Spectroscopy) is an integral field spectrograph, which means it combines imaging capabilities with the ability to collect detailed information about light in a specific region. In other words, it lets you look at the universe as a whole while simultaneously analyzing what's happening at very tiny scales within it. OSIRIS was primarily designed for use on the Keck II Telescope and is part of the NASA-funded Keck Adaptive Optics system, enabling researchers to study exoplanets, galaxy evolution, star formation, and other astrophysical phenomena in unprecedented detail."

Open-source large language models (LLMs) offer the opportunity to think creatively and explore alternative methods for information retrieval and data augmentation while protecting data copyright and users’ privacy. However, when these models are asked questions directly, without supporting context, they are prone to generating inaccurate or fictional responses (hallucinations). To address this issue, we have been experimenting internally with open-source LLMs (7-13 billion parameters) and our large corpus of scientific articles. This experimentation led us to build SciX Brain, a highly customizable internal web interface and RESTful API for interacting with LLMs. The web interface lets users hold quick conversations with the deployed LLMs and rapidly assess the quality of their responses. The API enables the ADS team to develop pipelines that automatically use LLM capabilities for data enrichment and information extraction tasks.

Our experiments have included:

retrieval-augmented generation (RAG): a strategy of providing additional context to the LLM when it answers a question. In the screenshot, we passed in context retrieved via an ADS search. We have also experimented with an embeddings (i.e. semantic vector) database.
comparison of various open-source LLMs
grammars
natural language to structured Solr queries
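
The retrieval-augmented generation item in the list above can be illustrated with a small prompt-assembly sketch. It is not the SciX Brain implementation: the function name, the `(source_id, text)` snippet format, the character budget, and the wording of the instructions are all assumptions chosen for the example, and the retrieval step itself (an ADS search or an embeddings lookup) is left out.

```python
def build_rag_prompt(question, snippets, max_chars=4000):
    """Assemble a prompt that asks the model to answer only from the given snippets.

    `snippets` is a list of (source_id, text) pairs already retrieved elsewhere;
    both the pair format and the character budget are illustrative assumptions.
    """
    context_parts = []
    used = 0
    for i, (source_id, text) in enumerate(snippets, start=1):
        entry = f"[{i}] ({source_id}) {text.strip()}"
        if used + len(entry) > max_chars:
            break  # stop before exceeding the context budget
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources by number, and say so if they do not contain the answer.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Numbering the sources in the prompt is one simple way to let the model cite what it used, which makes it easier to check an answer against the retrieved text.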

While our chatbot is not available to users outside of the ADS team, these experiments will inform our work going forward and may lead to user-facing features in the future.

To read more about our experimental LLM setup, please see our proceedings paper from the 33rd annual Astronomical Data Analysis Software & Systems conference (ADASS XXXIII).


Data enrichment: planetary features

The automatic identification of planetary features, such as craters or maria, in astronomy and planetary science publications presents numerous challenges.

Screenshot of the new planetary features filter, available in the new SciX user interface.

Many feature names are identical to the names of the places or people they are named after, such as Tempe or Sagan. Some feature names are used in many other contexts, e.g. Apollo, or can appear in text as ordinary adjectives, like the lunar craters Black, Green, and White. Additionally, some features share identical names across different celestial bodies and require disambiguation, such as the Adams crater, which exists on both the Moon and Mars.

We have developed a multi-step pipeline combining rule-based filtering, statistical relevance analysis, part-of-speech (POS) tagging, a named entity recognition (NER) model, hybrid keyword harvesting, knowledge graph (KG) matching, and inference with a locally installed large language model (LLM) to reliably identify planetary names despite these challenges. When evaluated on a dataset of astronomy papers from the Astrophysics Data System (ADS), this methodology achieves an F1-score over 0.97 in disambiguating planetary feature names.
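
For readers unfamiliar with the metric quoted above, the F1-score is the harmonic mean of precision and recall. The sketch below shows the standard calculation from raw counts; the function name and its arguments are illustrative, and no counts from the actual evaluation are reproduced here.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return the F1-score, the harmonic mean of precision and recall."""
    if true_positives == 0:
        return 0.0  # no correct identifications, so precision and recall are both zero
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, the score is high only when both precision and recall are high, so a pipeline cannot reach it by tagging very cautiously or very liberally.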

With this pipeline, we have tagged over 6000 papers with features from over three dozen solar system bodies. Searching on these tags is now available in ADS using the gpn search tag, e.g. gpn:Mars. Browsing and filtering on these search tags is available in the new SciX interface.
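
As an illustration of how a `gpn` search might be issued programmatically, the sketch below builds (but does not send) a request to the ADS search API. The endpoint URL, the bearer-token header, and the `fl` and `rows` parameters are not described in this poster; they are assumptions based on the public ADS API and should be checked against its documentation. The helper name is hypothetical.

```python
import urllib.parse
import urllib.request

# Assumed public ADS search endpoint; not stated in the poster.
ADS_SEARCH_URL = "https://api.adsabs.harvard.edu/v1/search/query"


def build_gpn_request(feature, token, rows=10):
    """Build (but do not send) a search request for papers tagged with a planetary feature."""
    params = urllib.parse.urlencode({
        "q": f"gpn:{feature}",   # e.g. gpn:Mars, as in the text above
        "fl": "bibcode,title",   # fields to return (assumed field names)
        "rows": rows,
    })
    return urllib.request.Request(
        f"{ADS_SEARCH_URL}?{params}",
        headers={"Authorization": f"Bearer {token}"},
    )
```

To run the search, the returned request could be passed to `urllib.request.urlopen` with a valid personal API token and the JSON response parsed from the result.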

To read more about the work behind the planetary features function, see our pre-print.


Data enrichment:

In addition to the planetary features project, currently in production, we are developing tools that utilize AI to enrich our holdings in other ways.

Segment of the Unified Astronomy Thesaurus: the observational astronomy branch, expanded to show infrared observatories.

Currently in development is a machine learning-based project to tag astronomy papers with keywords from the Unified Astronomy Thesaurus (UAT). While it is currently possible to search ADS by publisher-provided keywords, no single vocabulary has been used consistently throughout the literature indexed in ADS. The Astronomical Subject Keywords, in use by leading astronomy journals since the 1970s, have not been updated since 2013 and may not cover the latest topics in the field. These keywords also do not include definitions or relationships between concepts. For these reasons, the American Astronomical Society (AAS) journals and the Publications of the Astronomical Society of the Pacific (PASP) adopted the Unified Astronomy Thesaurus as their keyword system in 2019 and 2020, respectively. Once this project is complete, ADS users will be able to browse results using the left-side facets on the query results screen, or conduct an initial search using UAT terms.

We are also developing a robust machine learning-based methodology to classify records upon ingestion so that we can automatically assign new papers to a particular collection (i.e. astronomy, physics, and earth science; eventually this will include planetary science and heliophysics). Currently, collection assignment is mostly based on journal, which fails for interdisciplinary journals. This new technique will make it easier for researchers to limit their searches to the relevant literature and for ADS to develop a discipline-specific view over its data holdings.

To read more about the UAT project, see our blog post.


Abstract

The NASA Astrophysics Data System (ADS) is a digital repository and search platform offering access to a comprehensive range of scientific papers in astrophysics and related fields. The ever-increasing volume of scientific literature ingested in the system has made finding relevant sources more difficult, and thus new techniques are needed to improve discovery. ADS has been exploring new methods, aided by artificial intelligence, to assist with information retrieval and data enrichment. One information retrieval experiment we’ve undertaken involves open-source large language models (LLMs). We calculated semantic vectors for our vast array of abstracts and full-text documents and paired this database with an LLM-powered chatbot, creating a query mechanism that uses contextual text snippets from our database to answer inquiries more factually than an LLM alone can. We’ve also partnered with external collaborators working on LLMs, providing curated datasets of open-access sources to assist with training AI models; we will discuss one such dataset, the Open Corpus, in more detail. Finally, we’ve undertaken an array of AI-informed data enrichment tasks for individual papers. These include automated classification of papers into broad categories, such as astronomy or planetary science, as well as assignment of keywords from curated taxonomies such as the Unified Astronomy Thesaurus and the Gazetteer of Planetary Nomenclature.


Utilizing Artificial Intelligence Techniques for Information Retrieval and Data Enrichment in NASA ADS

Kelly Lockhart, Sergi Blanco-Cuaresma, Alberto Accomazzi, Michael Kurtz, Edwin Henneken, Golnaz Shapurian, Felix Grezes, Thomas Allen, and the NASA ADS team

NASA Astrophysics Data System, Center for Astrophysics | Harvard & Smithsonian
