
Fine-tuning large language models to analyze astronomical event data from the Chandra X-ray Observatory. The goal is to let LLMs interpret time-series astronomical data, whose input space is entirely different from that of ordinary text.
Chandra event files record individual photon detections as a time series, each with an energy measurement and spatial coordinates. Standard LLMs cannot process this modality directly, so the challenge was to bridge the gap between continuous time-series data and the discrete token space these models operate on.
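To make the shape of this data concrete, here is a minimal sketch of how an event list might be turned into per-event numeric features before it reaches an encoder. The `Event` class, the `events_to_features` helper, and the specific feature choices (inter-arrival time, log energy, centered coordinates) are illustrative assumptions, not the preprocessing actually used in this project.

```python
import math
from dataclasses import dataclass


@dataclass
class Event:
    """One detected photon (hypothetical container, not a Chandra API)."""
    time: float    # arrival time in seconds
    energy: float  # photon energy in eV
    x: float       # sky x coordinate in pixels
    y: float       # sky y coordinate in pixels


def events_to_features(events):
    """Return one feature vector per event: [dt, log10(E), x_rel, y_rel].

    dt is the time since the previous event (0 for the first one), and the
    coordinates are centered on the mean position of the event list.
    """
    if not events:
        return []
    ordered = sorted(events, key=lambda e: e.time)
    mean_x = sum(e.x for e in ordered) / len(ordered)
    mean_y = sum(e.y for e in ordered) / len(ordered)

    features = []
    prev_time = ordered[0].time
    for e in ordered:
        features.append([
            e.time - prev_time,    # inter-arrival time
            math.log10(e.energy),  # compress the wide energy range
            e.x - mean_x,          # position relative to the centroid
            e.y - mean_y,
        ])
        prev_time = e.time
    return features


# Three made-up events, deliberately out of time order.
events = [
    Event(time=12.0, energy=1000.0, x=4100.0, y=4090.0),
    Event(time=10.0, energy=100.0, x=4096.0, y=4094.0),
    Event(time=15.5, energy=10000.0, x=4098.0, y=4092.0),
]
features = events_to_features(events)
```

Each observation yields a variable-length list of continuous vectors rather than a sequence of vocabulary tokens, which is the mismatch an encoder has to resolve.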
Developed custom encoders based on the Flamingo architecture to map astronomical time-series data into the embedding spaces of pretrained LLMs. Fine-tuned multiple models, including GPT-Neo, Llama, and Qwen 7B, using curriculum learning and multi-stage training.
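The encoder internals are not described here, so the following is only a rough sketch of two ideas commonly associated with Flamingo-style designs: a resampler that compresses a variable number of inputs into a fixed number of tokens, and a tanh-gated cross-attention residual that leaves the pretrained model untouched when the gate is zero. It uses plain Python lists to stay dependency-free (a real implementation would use a tensor library), omits learned query/key/value projections, and all names and sizes (`resample`, `gated_cross_attention`, `D_MODEL`, and so on) are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention with no learned projections."""
    scale = math.sqrt(len(queries[0]))
    weights = [
        softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys_values])
        for query in queries
    ]
    return matmul(weights, keys_values)  # (n_queries, d_model)


def resample(event_features, w_in, latents):
    """Project events to model width, then summarize them with fixed latent queries."""
    projected = matmul(event_features, w_in)    # (n_events, d_model)
    return cross_attention(latents, projected)  # (n_latents, d_model)


def gated_cross_attention(text_hidden, event_tokens, gate):
    """Residual update: text_hidden + tanh(gate) * attention(text_hidden, event_tokens)."""
    attended = cross_attention(text_hidden, event_tokens)
    g = math.tanh(gate)
    return [
        [h + g * a for h, a in zip(h_row, a_row)]
        for h_row, a_row in zip(text_hidden, attended)
    ]


def random_matrix(rows, cols, rng):
    """Small random matrix standing in for learned weights or activations."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
D_FEAT, D_MODEL, N_LATENTS = 4, 8, 3  # illustrative sizes only

w_in = random_matrix(D_FEAT, D_MODEL, rng)
latents = random_matrix(N_LATENTS, D_MODEL, rng)
text_hidden = random_matrix(5, D_MODEL, rng)  # stand-in for LLM hidden states

# Observations with very different event counts map to the same token count.
tokens_short = resample(random_matrix(10, D_FEAT, rng), w_in, latents)
tokens_long = resample(random_matrix(200, D_FEAT, rng), w_in, latents)

# With the gate at zero, the text hidden states pass through unchanged.
unchanged = gated_cross_attention(text_hidden, tokens_short, gate=0.0)
mixed = gated_cross_attention(text_hidden, tokens_short, gate=1.0)
```

The zero-initialized gate is what makes this kind of design attractive for already trained models: training can start from the LLM's original behavior and gradually mix in the new modality, which fits naturally with a staged training schedule.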
Enabled LLMs to interpret astronomical event data, with high accuracy in analyzing source characteristics, variability, and spectral properties across several classes of X-ray source: active galactic nuclei (AGN), stars, supernova remnants (SNR), and galaxies.
