PLM: Physical Language Model for Chandra X-ray Source Analysis

Astromind · 2024–2025 · ML Engineer & Platform Architect

Key Highlights

  • Led end-to-end development: fine-tuned models, FastAPI backend, multi-agent orchestration, Next.js frontend with interactive visualizations
  • Architected multi-agent system with six specialized agents using LangGraph, processing 50,000+ X-ray sources
  • Built UMAP-based source explorer with real-time chat interface and streaming agent responses
  • Implemented MongoDB vector search for neighbor analysis using 64D PCA embeddings with cosine similarity
  • Developed comprehensive spectral analysis pipeline: energy quantiles, hardness ratios, variability metrics, light curves

Overview

PLM (Physical Language Model) is a production chat-based application enabling astrophysicists to analyze raw X-ray event data from the Chandra Source Catalog through natural language. I led the end-to-end development of the entire system: fine-tuning models on 50,000+ sources, building the FastAPI backend with multi-agent orchestration, implementing MongoDB vector search, and creating an interactive Next.js frontend with UMAP visualization and real-time streaming chat.

System Architecture

Backend: FastAPI (Python) with multi-agent workflow orchestration
Frontend: Next.js 14 with React, TypeScript, TailwindCSS, D3.js visualizations
Database: MongoDB with vector search indices for 64D PCA embeddings
ML Infrastructure: RunPod GPUs for fine-tuning Qwen-7B on X-ray event data

Data Pipeline

Each X-ray source contains photon arrival times and energies. I built a comprehensive processing pipeline that:

  • Computes 64D PCA embeddings from event data for similarity search
  • Generates 2D UMAP projections for interactive visualization
  • Calculates spectral features: energy quantiles, hardness ratios in standard bands (soft, medium, hard)
  • Analyzes variability: K-S tests, Fano factors, Gregory-Loredo light curves at multiple cadences
  • Detects emission lines and performs spectral model fits (power-law, blackbody, APEC)
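The spectral-feature step above can be sketched as follows. This is a minimal illustration, not the production pipeline: the band edges use CSC-style soft/medium/hard cuts, and the exact quantiles and hardness-ratio conventions are assumptions.

```python
import numpy as np

# CSC-style energy bands in keV; the exact cuts are an assumption here.
BANDS = {"soft": (0.5, 1.2), "medium": (1.2, 2.0), "hard": (2.0, 7.0)}

def spectral_features(energies: np.ndarray) -> dict:
    """Energy quantiles and hardness ratios from photon energies (keV)."""
    counts = {name: int(((energies >= lo) & (energies < hi)).sum())
              for name, (lo, hi) in BANDS.items()}
    s, m, h = counts["soft"], counts["medium"], counts["hard"]

    def hr(a, b):  # hardness ratio (a - b) / (a + b), guarded against empty bands
        return (a - b) / (a + b) if (a + b) > 0 else 0.0

    return {
        "quantiles": {f"E{q}": float(np.quantile(energies, q / 100))
                      for q in (25, 50, 75)},
        "hr_hm": hr(h, m),   # hard vs. medium
        "hr_ms": hr(m, s),   # medium vs. soft
        **counts,
    }
```

Per-source features like these feed both the metadata analyst and the PCA embedding stage.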

Multi-Agent Architecture

Built with LangGraph, the system orchestrates six specialized agents with streaming progress updates:

1. Event Analyst (Fine-tuned Qwen-7B)

I fine-tuned Qwen-7B on 50,000 X-ray sources to directly interpret raw photon event data. The model learned to classify sources (AGN, stars, SNR, galaxies) and identify variability patterns from time-energy sequences without traditional text inputs.
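To make the idea of "raw photon events as model input" concrete, here is a toy serialization of (arrival time, energy) pairs into a compact text sequence. The token format is hypothetical, purely illustrative: the production model bridges events to the LLM through a learned encoder rather than text tokens.

```python
def events_to_sequence(times, energies, t_bin=100.0, e_decimals=2, max_events=256):
    """Serialize (arrival time, energy) photon events into a compact text
    sequence. Hypothetical format: time is binned into coarse t_bin-second
    buckets relative to the first event; energy is rounded in keV."""
    events = sorted(zip(times, energies))[:max_events]
    t0 = events[0][0] if events else 0.0
    tokens = [f"t{int((t - t0) // t_bin)}:e{round(e, e_decimals)}"
              for t, e in events]
    return " ".join(tokens)
```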

2. Metadata Analyst (GPT-4/GPT-5)

Analyzes computed spectral features and astrophysical metadata. Rather than fine-tuning on raw data, I translate event data into standardized metrics (hardness ratios, light curves, model fit parameters)—the same approach a professional astrophysicist would use—enabling enterprise LLMs to provide expert-level physical reasoning.

3. Neighbor Analyst (GPT-4/GPT-5 + MongoDB Vector Search)

I implemented MongoDB vector search with cosine similarity on 64D PCA embeddings to find the 10 most similar sources. The analyst compares spectral properties of neighbors, leverages known classifications, and provides comparative context. This "wisdom of the crowd" approach significantly improves classification confidence.
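The math behind the neighbor lookup is plain cosine similarity over the 64-D PCA embeddings; a minimal in-memory equivalent of what the vector index computes server-side looks like this:

```python
import numpy as np

def top_k_neighbors(query: np.ndarray, embeddings: np.ndarray, k: int = 10):
    """Cosine-similarity nearest neighbors over 64-D PCA embeddings.
    Mirrors the scoring MongoDB's vector index performs server-side."""
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q                      # cosine similarity to every source
    idx = np.argsort(-sims)[:k]      # indices of the k most similar sources
    return idx, sims[idx]
```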

4. Tool Analyst (Multi-wavelength Imaging)

Provides LLM-accessible tools, primarily hips2fits for querying multi-wavelength imagery (infrared, optical, X-ray) at varying fields of view. The LLM autonomously queries images of the surrounding region to understand spatial and spectral context.

5. Critic Agent (Cross-validation)

Synthesizes all agent outputs, identifies agreements and disagreements, evaluates evidence strength, and assesses overall confidence. Provides critical review ensuring physically plausible conclusions.

6. Conversation Moderator (Final Synthesis)

Produces the final response in either "Normal" (conversational) or "Advanced" (structured with explicit reasoning sections) modes, synthesizing all analyses into a coherent answer.
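The agent flow above can be sketched as a shared state passed through the agents in order. This is a deliberately simplified stand-in (plain Python, hypothetical stub agents) for the real LangGraph state graph, which also emits SSE progress events after each step.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class AnalysisState:
    source_id: str
    findings: dict = field(default_factory=dict)

def run_workflow(state: AnalysisState, agents: dict[str, Callable]) -> AnalysisState:
    """Each agent reads the shared state (including all prior findings)
    and appends its own; the real system streams progress between steps."""
    for name, agent in agents.items():
        state.findings[name] = agent(state)
    return state

# Hypothetical stub agents illustrating the ordering; real agents call LLMs.
agents = {
    "event_analyst": lambda s: f"classified {s.source_id}",
    "metadata_analyst": lambda s: "hardness ratios consistent with AGN",
    "critic": lambda s: f"reviewed {len(s.findings)} upstream findings",
}
```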

Technical Implementation

Backend Architecture (FastAPI + LangGraph)

  • Built FastAPI service handling multi-agent workflow orchestration
  • Implemented streaming responses with Server-Sent Events (SSE) for real-time agent progress updates
  • Integrated fine-tuned Qwen-7B model serving via HTTP endpoints
  • Developed comprehensive spectral analysis module: energy quantiles, hardness ratios, variability metrics (K-S test, Fano factor, excess variance), periodicity analysis (Rayleigh test, FFT PSD)
  • Created Gregory-Loredo Bayesian Blocks implementation for piecewise-constant light curve analysis
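Two of the variability metrics listed above can be sketched directly: a K-S test of photon arrival times against a constant-rate (uniform) model, and the Fano factor of binned counts (variance/mean, which is ~1 for a steady Poisson source). The bin count here is an arbitrary choice for illustration.

```python
import numpy as np
from scipy import stats

def variability_metrics(times: np.ndarray, n_bins: int = 20) -> dict:
    """K-S test vs. uniform arrivals (constant source) plus the Fano
    factor of binned counts (~1 for Poisson, >1 indicates variability)."""
    t0, t1 = times.min(), times.max()
    ks = stats.kstest(times, "uniform", args=(t0, t1 - t0))
    counts, _ = np.histogram(times, bins=n_bins, range=(t0, t1))
    fano = counts.var() / counts.mean() if counts.mean() > 0 else 0.0
    return {"ks_stat": ks.statistic, "ks_pvalue": ks.pvalue, "fano": float(fano)}
```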

Frontend Implementation (Next.js + React)

  • UMAP Visualization: Interactive SVG-based scatter plot with zoom/pan, displaying all 50,000+ sources color-coded by type, with nearest neighbor connections
  • Chat Interface: Real-time streaming chat with markdown rendering, agent progress indicators, message history persistence
  • Light Curve Charts: D3.js visualizations for fixed-cadence (100s, 500s, 2000s) and Gregory-Loredo light curves with error bars and credible intervals
  • Dataset Explorer: Browse and filter sources, upload custom event data, search by properties
  • Responsive Design: Three-panel layout (dataset explorer, UMAP, chat) with dark mode support

MongoDB Integration

  • Designed schema for 50,000+ X-ray sources with event data, embeddings, and metadata
  • Implemented vector search index on pca_64d field for cosine similarity queries
  • Built upload pipeline with automatic embedding generation and validation
  • Handled sources with <8h observation windows gracefully (metadata-only analysis)
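A neighbor query against the vector index can be expressed as an Atlas `$vectorSearch` aggregation stage. The index name and projected fields below are assumptions; the `pca_64d` path and k=10 follow the description above.

```python
def neighbor_search_pipeline(query_vec: list[float], k: int = 10) -> list[dict]:
    """Atlas $vectorSearch aggregation for the k nearest sources by
    similarity on the pca_64d field. Index and field names are assumed."""
    return [
        {"$vectorSearch": {
            "index": "pca_64d_cosine",      # assumed index name
            "path": "pca_64d",
            "queryVector": query_vec,
            "numCandidates": 20 * k,        # oversample for ANN recall
            "limit": k,
        }},
        {"$project": {
            "source_name": 1, "classification": 1,
            "score": {"$meta": "vectorSearchScore"},
        }},
    ]
```

Run with PyMongo as `db.sources.aggregate(neighbor_search_pipeline(vec))`, assuming a `sources` collection.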

Model Fine-tuning (PyTorch + Transformers)

  • Fine-tuned Qwen-7B on 50,000 sources using custom encoder architecture (Flamingo-style) to bridge time-series event data to LLM embedding space
  • Trained on Q&A pairs about source properties, classifications, and variability
  • Deployed model on RunPod GPU infrastructure with HTTP serving for inference
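The Flamingo-style bridge can be illustrated with a toy NumPy sketch: pool the raw (time, energy) event stream into a fixed number of vectors and project them into the LLM's embedding space, where they act as "soft prompt" tokens. The real bridge is a trained attention/resampler module, not this mean-pooling; shapes and the projection matrix here are illustrative.

```python
import numpy as np

def encode_events(events: np.ndarray, W: np.ndarray, n_prefix: int = 8) -> np.ndarray:
    """Toy stand-in for the event-to-LLM bridge: split the (N, 2) array of
    (time, energy) events into n_prefix chunks, mean-pool each chunk, and
    project into the model dimension via W with shape (2, d_model)."""
    chunks = np.array_split(events, n_prefix)             # along event axis
    pooled = np.stack([c.mean(axis=0) for c in chunks])   # (n_prefix, 2)
    return pooled @ W                                     # (n_prefix, d_model)
```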

Technologies

Frontend: Next.js 14, React 18, TypeScript, TailwindCSS, D3.js, SVG visualizations
Backend: FastAPI (Python), LangChain, LangGraph, NumPy/SciPy
ML: PyTorch, Transformers, Qwen-7B fine-tuning, OpenAI GPT-4/GPT-5
Data: MongoDB with vector search, Pinecone (alternative vector DB)
Infrastructure: RunPod (GPU), Docker, HTTP microservices

Impact

Enables astrophysicists to rapidly analyze X-ray sources through conversational AI, significantly reducing analysis time from hours to minutes. The system combines strengths of specialized fine-tuned models (pattern recognition from 50,000+ sources) with general-purpose LLMs (physical reasoning) and automated computational pipelines (standardized metrics), providing comprehensive multi-perspective analysis with confidence assessments.

Key Contributions:

  • End-to-end implementation from model training to production deployment
  • Novel approach translating raw astronomical data into LLM-understandable metrics
  • Interactive visualization system enabling exploration of 50,000+ sources
  • Real-time multi-agent orchestration with streaming progress updates