Contrastive Learning: Aligning Chandra Event Data with Research Papers

Astromind (in collaboration with CfA Harvard) | December 2024 - Present | ML Engineer & Researcher

Key Highlights

  • Collaborating with the Center for Astrophysics (CfA) at Harvard on contrastive learning research
  • Creating shared embedding spaces between Chandra X-ray Observatory event data and astronomical research papers
  • Developing a novel encoder architecture that bridges observational data with scientific literature
  • Enabling direct retrieval and semantic search across both data and text modalities

Overview

Collaborative research project with the Center for Astrophysics at Harvard to create shared embedding spaces between Chandra X-ray Observatory event data and astronomical research papers using contrastive learning.

Challenge

Astronomical research involves two distinct modalities that traditionally exist in separate silos:

  • Observational Data: Raw event files with time, energy, and position measurements
  • Scientific Literature: Research papers describing and interpreting observations
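
To make the observational modality concrete, here is a minimal sketch of how an event list with time, energy, and position fields might be turned into a fixed-size grid that an encoder can consume. The `Event` record, the `bin_events` helper, the units, and the choice of a time-by-energy histogram are all illustrative assumptions, not the project's actual preprocessing.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """One detected photon (hypothetical record layout)."""
    time: float    # arrival time, assumed to be in seconds
    energy: float  # photon energy, assumed to be in keV
    x: float       # detector or sky position
    y: float


def bin_events(events, time_range, energy_range, n_time_bins, n_energy_bins):
    """Count events in a (time, energy) grid; events outside the ranges are dropped."""
    t0, t1 = time_range
    e0, e1 = energy_range
    grid = [[0] * n_energy_bins for _ in range(n_time_bins)]
    for ev in events:
        if not (t0 <= ev.time < t1 and e0 <= ev.energy < e1):
            continue
        # Clamp to the last bin to guard against floating-point rounding at the edge.
        ti = min(int((ev.time - t0) / (t1 - t0) * n_time_bins), n_time_bins - 1)
        ei = min(int((ev.energy - e0) / (e1 - e0) * n_energy_bins), n_energy_bins - 1)
        grid[ti][ei] += 1
    return grid


events = [
    Event(0.5, 1.0, 0.0, 0.0),
    Event(1.5, 5.0, 0.0, 0.0),
    Event(1.6, 5.5, 0.0, 0.0),
    Event(9.0, 1.0, 0.0, 0.0),  # outside the time range, ignored
]
grid = bin_events(events, time_range=(0.0, 2.0), energy_range=(0.0, 8.0),
                  n_time_bins=2, n_energy_bins=2)
```

A variable-length event list becomes a fixed-shape array this way, at the cost of discarding position and sub-bin timing; a sequence model could instead read the raw events directly.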

The goal was to create a unified embedding space enabling cross-modal retrieval between data and text.

Approach

Developed a contrastive learning framework with dual encoders:

  • Event Encoder: Processes Chandra event lists using a CNN/Transformer hybrid
  • Text Encoder: Processes research papers using pre-trained language models (SciBERT)
  • Contrastive Loss: NT-Xent loss to align embeddings in shared space
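
As an illustration of the loss, the sketch below computes a temperature-scaled cross-entropy over a batch of paired event and text embeddings. It uses the symmetric cross-modal form, in which only items from the other modality act as negatives; this differs slightly from the original single-modality NT-Xent formulation and may not match the project's exact variant. It is written in plain Python for readability rather than PyTorch, and the function names and the default temperature of 0.07 are assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target_index):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]


def nt_xent_loss(event_embeddings, text_embeddings, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to be a matched pair."""
    events = [l2_normalize(v) for v in event_embeddings]
    texts = [l2_normalize(v) for v in text_embeddings]
    n = len(events)
    # sim[i][j] = cosine similarity between event i and text j, divided by temperature.
    sim = [[sum(a * b for a, b in zip(e, t)) / temperature for t in texts]
           for e in events]
    # Event-to-text direction: each row should pick out its own column.
    event_to_text = sum(cross_entropy_row(sim[i], i) for i in range(n)) / n
    # Text-to-event direction: each column should pick out its own row.
    text_to_event = sum(
        cross_entropy_row([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (event_to_text + text_to_event)


aligned = nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = nt_xent_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

Correctly paired embeddings yield a loss near zero, while mismatched pairs are penalized heavily. A lower temperature sharpens the softmax and increases that penalty.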

Used hard negative mining and temperature scaling for efficient training on paired observation-paper datasets.
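
Hard negative mining can take several forms; one common in-batch version is sketched here under stated assumptions. Given a similarity matrix whose diagonal holds the true observation-paper pairs, it selects for each anchor the non-matching items that score highest, since these are the ones the model currently confuses with the positive. The helper name `hardest_negatives` and the matrix layout are hypothetical.

```python
def hardest_negatives(similarity, k=1):
    """For each anchor row i, return the k most similar non-matching column indices.

    Column i is treated as the positive for row i and is excluded.
    """
    mined = []
    for i, row in enumerate(similarity):
        candidates = [j for j in range(len(row)) if j != i]
        candidates.sort(key=lambda j: row[j], reverse=True)
        mined.append(candidates[:k])
    return mined


similarity = [
    [0.90, 0.80, 0.10],
    [0.20, 0.95, 0.70],
    [0.60, 0.30, 0.90],
]
mined = hardest_negatives(similarity, k=1)
```

The mined indices can then be used to up-weight or oversample those pairs in the loss, rather than treating all negatives in the batch equally.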

Technologies

  • PyTorch: Model development and training
  • Python: Core implementation and data processing
  • RunPod: GPU infrastructure
  • Pinecone: Vector storage for similarity matching

Impact

Achieved >85% top-5 retrieval accuracy for paper-to-data matching, enabling intelligent data retrieval, literature-guided analysis, and automated annotation of astronomical observations. Embeddings cluster meaningfully by astronomical object type and generalize to unseen source types.
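
For reference, a top-k retrieval metric of this kind can be computed as in the sketch below. It assumes one ground-truth observation per paper, with rows as paper queries, columns as candidate observations, and the correct match for query i at column i. The evaluation protocol behind the reported figure may differ, and the example matrix is made up for illustration.

```python
def top_k_accuracy(similarity, k=5):
    """Fraction of queries whose true match (column i for row i) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


toy_similarity = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],  # the true match (column 1) ranks second here
    [0.1, 0.2, 0.7],
]
top1 = top_k_accuracy(toy_similarity, k=1)
top2 = top_k_accuracy(toy_similarity, k=2)
```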
