Fine-tuning LLMs for Chandra X-ray Observatory Data

Overview

This project fine-tunes large language models to understand and analyze event data from the Chandra X-ray Observatory. The goal is to let LLMs process and interpret astronomical time-series data, which lives in a completely different input space from ordinary text.

Challenge

Chandra event files record individual photon detections, each with an arrival time, an energy measurement, and spatial coordinates. This is a fundamentally different modality from text, and standard LLMs cannot process it directly. The challenge was bridging the gap between continuous time-series data and the discrete token space that LLMs operate on.
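To make the modality gap concrete, here is a minimal sketch of one common way to turn a photon event list into a fixed-length sequence: binning arrival times into a light curve. The preprocessing actually used in this project is not described here, so the function name `bin_events`, the `(time, energy)` tuple layout, and the choice of equal-width bins are all illustrative assumptions.

```python
def bin_events(events, t_start, t_stop, n_bins):
    """Bin (time, energy) photon events into per-bin counts and mean energies.

    Hypothetical helper for illustration only; not the project's actual pipeline.
    """
    width = (t_stop - t_start) / n_bins
    counts = [0] * n_bins
    energy_sums = [0.0] * n_bins
    for t, energy in events:
        # Ignore photons that fall outside the observation window.
        if not (t_start <= t < t_stop):
            continue
        idx = min(int((t - t_start) / width), n_bins - 1)
        counts[idx] += 1
        energy_sums[idx] += energy
    # Empty bins get a mean energy of 0.0 as a simple placeholder.
    mean_energy = [s / c if c else 0.0 for s, c in zip(energy_sums, counts)]
    return counts, mean_energy


# Toy event list: (arrival time in seconds, energy in keV); values are made up.
events = [(0.5, 1.2), (1.1, 0.8), (1.7, 2.0), (3.9, 4.5)]
counts, mean_energy = bin_events(events, 0.0, 4.0, 4)
print(counts)  # [1, 2, 0, 1]
```

The result is still a sequence of real numbers rather than tokens, which is why an encoder is needed to project it into the space the LLM works in.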

Approach

Developed custom encoders based on the Flamingo architecture to map astronomical time-series data into the embedding spaces of pretrained LLMs. Fine-tuned several models, including GPT-Neo, Llama, and Qwen 7B, using curriculum learning and multi-stage training.
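One mechanism Flamingo-style models use to feed non-text features into a pretrained LLM is gated cross-attention: text hidden states attend to the encoded features, and the result is added back through a tanh gate that starts at zero, so the LLM's behavior is unchanged at the start of training. The sketch below illustrates that idea in plain Python. It is a simplification under stated assumptions (single head, no learned query/key/value projections, hypothetical function names) and is not the project's actual encoder.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(text_states, event_feats):
    """Single-head cross-attention: each text state queries the event features.

    Learned projections are omitted for brevity (an assumption of this sketch).
    """
    dim = len(text_states[0])
    out = []
    for q in text_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in event_feats]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, event_feats))
                    for i in range(dim)])
    return out


def gated_cross_attention(text_states, event_feats, alpha):
    """Residual update scaled by tanh(alpha); alpha = 0 leaves the text untouched."""
    attended = cross_attend(text_states, event_feats)
    gate = math.tanh(alpha)
    return [[t + gate * a for t, a in zip(ts, at)]
            for ts, at in zip(text_states, attended)]


# Toy inputs: two text hidden states and two encoded event features (made up).
text = [[1.0, 0.0], [0.0, 1.0]]
feats = [[0.5, 0.5], [2.0, -1.0]]
at_init = gated_cross_attention(text, feats, alpha=0.0)     # gate closed
after_some_training = gated_cross_attention(text, feats, alpha=1.0)  # gate partly open
```

With `alpha=0.0` the output equals the input text states, which is what lets training begin from the pretrained model's behavior and gradually open the gate as the encoder learns.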

Technologies

PyTorch & Transformers: Model development and fine-tuning
Python: Core implementation and data processing
MongoDB: Training data storage
RunPod: GPU infrastructure for training

Impact

Enabled LLMs to interpret astronomical event data, with high accuracy in analyzing source characteristics, variability, and spectral properties across several classes of X-ray source: active galactic nuclei (AGN), stars, supernova remnants (SNR), and galaxies.
