News & Events / Meetings / Joint ARM/ASR Meeting / Posters / Abstract

Machine Learning for Keyword Tagging of ARM Publications

Authors

Cromwell, Erol — Pacific Northwest National Laboratory
Levin, Maxwell — Pacific Northwest National Laboratory
Sivaraman, Chitra — Pacific Northwest National Laboratory
Munikoti, Sai — Pacific Northwest National Laboratory
Agarwal, Khushbu — Pacific Northwest National Laboratory

Description

The ARM program regularly collects metrics on the number of papers published using ARM data. The collected publications are tagged manually with relevant keywords to tie them to ARM campaigns, sites, or data products. These keywords power search, making it easier for online visitors to find publications relevant to their work. Internally, the keywords are used to better understand ARM’s impact and can help ARM leadership make informed decisions about the direction of the ARM program. Currently, ARM relies heavily on internal communications experts to manually tag publications when they are uploaded through the website. With more than 200 papers on average entering ARM’s databases each year and thousands of keywords to choose from, this process can be time-intensive and tedious. Additionally, ARM leadership occasionally adds new keywords as priorities change or new sites come online. Re-tagging historical publications with the newly added keywords would require a review of the 4500+ historical papers in the ARM database, a monumental task for an individual.

We investigate applying natural language processing methods to automate keyword tagging. Our initial work focuses on sets of keywords that represent science areas ARM leadership is interested in tracking. We compare two methods: KeyBERT, a minimal and easy-to-use keyword extraction technique that leverages BERT embeddings, and Llama 3.1 70B, a large language model (LLM). Neither approach requires model fine-tuning or pre-training, which allows us to apply these methods out of the box and extend our keyword set with minimal overhead.
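As a rough illustration of the embedding-based idea behind the KeyBERT approach, the sketch below scores each keyword in a fixed set by the cosine similarity between a publication abstract and a short keyword description. It is a minimal sketch under stated assumptions, not the pipeline used in this work: the bag-of-words `embed` function is a toy stand-in for BERT embeddings, and the `tag` function, the keyword names, and their descriptions are hypothetical and are not ARM's actual keyword list.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a BERT embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tag(abstract, keyword_descriptions, top_n=2):
    """Return up to top_n keywords whose descriptions best match the abstract."""
    doc_vec = embed(abstract)
    scored = [
        (cosine(doc_vec, embed(description)), keyword)
        for keyword, description in keyword_descriptions.items()
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Drop keywords that share nothing with the abstract.
    return [keyword for score, keyword in scored[:top_n] if score > 0]


# Hypothetical science-area keywords with short descriptions (assumed for illustration).
KEYWORDS = {
    "aerosols": "aerosol particles size distribution composition",
    "clouds": "cloud properties liquid water droplets",
    "radiation": "shortwave longwave radiation fluxes surface",
}

abstract = "We analyze cloud liquid water and droplet number from ground-based radar."
print(tag(abstract, KEYWORDS, top_n=1))
```

Because the keyword set is just a dictionary, adding a new keyword in this sketch means adding one entry and re-running the scoring over stored abstracts, with no retraining step. A real system would swap `embed` for a pretrained sentence-embedding model so that related terms match even when they share no exact words.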

Preliminary results from both methods are encouraging. In this poster, we will present the results of both approaches and discuss the impact of automating this task and improving its accuracy.