⭐ Join us!! December 3, 2025 · GHC 4400 / 4300 · 12 pm–4 pm
Language Technologies Institute
Master of Computational Data Science
Capstone Project Showcase
Project Categories
LLM Systems & Agents
Retrieval & Knowledge-Grounded Generation
Benchmarks, Datasets & Evaluation
Code, Programming & Software Engineering
Systems, Infrastructure & Optimization
Vision, Robotics & Embodied AI
Applied AI for Science, Engineering & Materials
Applied AI for Society, Finance & Education
Data, Privacy, Attribution & Curation
Safety, Alignment & Theory of Mind

AIDEN: AI-based Interactive TA for Educational Needs

Emily Guo, Helen Wang, Ken Ye

Teaching Assistants in large technical courses often struggle to keep pace with the volume and complexity of Piazza questions. To address this challenge, we developed AIDEN, an in-browser AI assistant that generates draft responses directly inside Piazza. We redesigned the system from a Slack-based prototype into a fully deployed Chrome Extension backed by a secure cloud server. The updated system incorporates course documents, starter code, and follow-up thread context through an expanded retrieval-augmented generation pipeline, and uses proactive pre-generation to reduce response delays. AIDEN was deployed in a live course, enabling us to collect real usage data—including TA interaction logs, ratings, and paired model–TA answers. Analysis of this data shows that TAs frequently engaged with the system, that pre-generation substantially reduced wait time, and that integrating starter code improved technical accuracy. Comparative experiments further revealed that Claude outperformed GPT-4o-mini within our enhanced pipeline.

These findings demonstrate that tightly integrating AI assistance into TA workflows can reduce friction and enhance support for high-enrollment courses. The deployment infrastructure and usage dataset built this semester provide a strong foundation for future data-driven improvements and continued refinement of the generation pipeline.
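
A minimal sketch of the proactive pre-generation idea described above, with `generate_draft` standing in for the full retrieval-augmented pipeline; the names and threading scheme are illustrative, not AIDEN's actual implementation:

```python
import threading

# Hypothetical stand-in for the RAG pipeline (retrieval + LLM call).
def generate_draft(question: str) -> str:
    return f"[draft answer for: {question}]"

class DraftCache:
    """Pre-generates draft answers in the background so TAs rarely wait."""

    def __init__(self):
        self._drafts: dict[str, str] = {}
        self._lock = threading.Lock()

    def on_new_question(self, post_id: str, question: str) -> None:
        # Fired when a Piazza post arrives; generation starts immediately,
        # before any TA opens the thread.
        def worker():
            draft = generate_draft(question)
            with self._lock:
                self._drafts[post_id] = draft
        threading.Thread(target=worker, daemon=True).start()

    def get_draft(self, post_id: str, question: str) -> str:
        # Served when a TA opens the thread: a cache hit is instant,
        # a miss falls back to synchronous generation.
        with self._lock:
            if post_id in self._drafts:
                return self._drafts.pop(post_id)
        return generate_draft(question)

cache = DraftCache()
cache.on_new_question("p42", "Why does my HW2 test time out?")
```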

Tags: Retrieval-Augmented Generation (RAG); Large Language Models; Semantic Search; Context-Aware Automation; Chrome Extension; React Frontend; Backend API Integration; Intelligent Response Generation; User Feedback Loop; AI-Assisted Workflows

ORBIT - Open Recommendation Benchmark for Reproducible Research with Hidden Tests

Vishan Vishesh Oberoi, Bolin Wu, Mahima Jagadeesh Patel, Kangrui Mao, Chuning Shi

Recommender systems are among the most impactful AI applications, interacting with billions of users every day, guiding them to relevant products, services, or information tailored to their preferences. However, the research and development of recommender systems are hindered by existing datasets that fail to capture realistic user behaviors and inconsistent evaluation settings that lead to ambiguous conclusions. This project introduces the Open Recommendation Benchmark for Reproducible Research with HIdden Tests (ORBIT), a unified benchmark for consistent and realistic evaluation of recommendation models. ORBIT offers a standardized evaluation framework of public datasets with reproducible splits and transparent settings for its public leaderboard. Additionally, ORBIT introduces a new webpage recommendation task, ClueWeb-Reco, featuring web browsing sequences from 87 million public, high-quality webpages. ClueWeb-Reco is a synthetic dataset derived from real, user-consented, and privacy-guaranteed browsing data. It aligns with modern recommendation scenarios and is reserved as the hidden test part of our leaderboard to challenge recommendation models’ generalization ability. ORBIT measures 12 representative recommendation models on its public benchmark and introduces a prompted LLM baseline on the ClueWeb-Reco hidden test. Our benchmark results reflect general improvements of recommender systems on the public datasets, with variable individual performances. The results on the hidden test reveal the limitations of existing approaches in large-scale webpage recommendation and highlight the potential for improvements with LLM integrations. This work is a collaborative effort with Meta AI and has been accepted to NeurIPS 2025, available at https://arxiv.org/abs/2510.26095. The ORBIT benchmark, leaderboard, and codebase are available at https://www.open-reco-bench.ai.
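
For concreteness, a standard next-item metric of the kind such leaderboards aggregate; the data and the choice of Recall@K here are illustrative, not ORBIT's exact protocol:

```python
def recall_at_k(ranked_items, gold_item, k=10):
    """Next-item Recall@K: 1 if the held-out item appears
    in the top-K recommendations, else 0."""
    return 1.0 if gold_item in ranked_items[:k] else 0.0

# Averaged over users to produce a single leaderboard number.
preds = {"u1": ["a", "b", "c"], "u2": ["d", "e", "f"]}
gold = {"u1": "b", "u2": "z"}
score = sum(recall_at_k(preds[u], gold[u], k=3) for u in preds) / len(preds)
print(score)  # 0.5
```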

Tags: Recommendation System; Benchmarking and Evaluation; Large-Scale Retrieval; LLM-based Query Generation

Theory of Mind for Explainable AI

Aditi Saini, Akshita Gupta, Krishnaprasad Vijayshankar

Large Language Models (LLMs) have been gaining traction for performing complex reasoning tasks, including those involving ethical and rationale-based decision-making. This raises important questions about whether a model’s beliefs and reasoning can be influenced by external factors, and whether such influence can be measured in terms of LLM faithfulness. This project focuses on evaluating LLM faithfulness through theory of mind concepts. We define simulatability as a measure of how faithfully a model adheres to its own reasoning: specifically, whether an LLM remains consistent with its beliefs and ideological stance when confronted with counterfactual questions. We hypothesize that factors such as toxicity and the presence or absence of explanations can induce variations in a model’s reasoning, potentially revealing adversarial effects or reinforcing faithfulness. The objective of this project is to systematically examine whether these factors affirm or undermine faithfulness in LLMs and to benchmark their impact on ethical and non-objective questions derived from ALMANACS, a language model explainability benchmark.
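
One way to operationalize simulatability as counterfactual consistency; `ask_model` is a hypothetical stub for the LLM under study, and the perturbation scheme is illustrative:

```python
def ask_model(prompt: str) -> str:
    """Hypothetical stub for querying the LLM under study."""
    return "yes"  # placeholder answer

def simulatability(question: str, perturbations: list[str]) -> float:
    """Fraction of perturbed framings (e.g., added toxicity, removed
    explanation) under which the model's answer matches its original one."""
    original = ask_model(question)
    same = sum(ask_model(p + "\n" + question) == original
               for p in perturbations)
    return same / len(perturbations)

probes = ["Suppose the opposite were true.", "Ignore your earlier reasoning."]
print(simulatability("Is it ethical to lie to protect someone?", probes))
```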

Tags: Large Language Model (LLM) Faithfulness; Theory of Mind; Ethical Reasoning; Counterfactual Analysis; Explainable AI

RAG Modeling and Agent Evaluation

Abhijay Sai Paladugu, Andy Tang, Pranav Setlur

Current evaluation of Retrieval-Augmented Generation (RAG) systems relies heavily on automated metrics that fail to capture human preferences or assess performance on complex, multi-step tasks. This project first addresses this gap by developing a multimodal RAG (mRAG) pipeline and RAG Arena, a scalable interface for human-in-the-loop evaluation. Our analysis using this framework revealed a fundamental limitation: RAG’s single-step retrieval is insufficient for long-horizon workflows. We therefore extended our work to create a comprehensive diagnostic benchmark for LLM agents across code, search, and reasoning tasks. By analyzing full trajectories, we successfully distinguish between agent failures caused by poor planning versus those from long-context memory limitations. Our key finding is that while retrieval relevance is critical for RAG, failures in complex tasks stem systematically from flawed planning, establishing a robust evaluation paradigm for next-generation autonomous agents.

Tags: Retrieval-Augmented Generation (RAG); Multimodal RAG (mRAG); Agent Evaluation; Long-Context Agents; RAG Arena; Human-in-the-Loop Evaluation; Agentic Benchmarking; Trajectory Analysis; Diagnostic Framework; Autonomous Agents; Large Language Models (LLMs); Planning vs. Memory Limitations

Harmful Algae Bloom Detection

Yi Qun Heng, Madison Teague, Sarvesh Navare

Harmful algal blooms (HABs) pose a serious threat to coastal communities in Madagascar, whose livelihoods depend on aquatic resources. Their erratic occurrence makes blanket seasonal fishing advisories impractical and ineffective. Existing machine learning approaches for HAB detection achieve high accuracy within their training regions but fail to transfer across geographic boundaries, dropping from 90 percent to 57 percent accuracy when applied to Madagascar’s waters. This performance degradation is linked to Madagascar’s limited historical HAB data and underscores the need for models trained on global data that generalize well. In this project we develop a transferable HAB detection system using empirical findings and comprehensive, publicly available oceanographic and atmospheric data, deploy it on the cloud, and expose it through a dashboard that takes a location and date and returns a classification (HAB or no HAB) for that location and time. This system enables targeted alerts that protect public health and coastal livelihoods.
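
A minimal sketch of the dashboard's prediction path; the feature fetcher, feature set, and model are placeholders, not the deployed system:

```python
def fetch_features(lat: float, lon: float, date: str) -> list[float]:
    """Stand-in for pulling oceanographic/atmospheric covariates
    (e.g., sea-surface temperature, chlorophyll-a) for a point and day."""
    return [27.4, 0.82, 3.1]  # illustrative values

def classify_hab(lat, lon, date, model) -> str:
    x = fetch_features(lat, lon, date)
    return "HAB" if model.predict([x])[0] == 1 else "no HAB"

class DummyModel:
    def predict(self, X):
        return [1 for _ in X]  # placeholder for the trained classifier

print(classify_hab(-20.3, 44.3, "2025-12-03", DummyModel()))  # HAB
```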

Tags: Harmful Algal Bloom (HAB); Remote Sensing; Machine Learning; Big Data; Copernicus Climate Data Store; Harmful Algal Event Database (HAEDAT); Transformer; Mamba; SVM; AWS; kNN; Hierarchical Clustering

Transforming Textbooks into Nonlinear Interactive Study Guides

Mahita Kandala, Aijia Lei, Jacob Scriffiny

This project presents an end-to-end system that transforms traditional linear textbooks into interactive study guides by automatically identifying and visualizing semantic relationships between paragraphs. Our team combines embedding-based similarity filtering with large language model inference to detect background and elaboration dependencies across full-length textbooks, constructing a directed graph that reflects each textbook’s underlying conceptual structure. The backend stores these relationships in a Neo4j database and serves them through a FastAPI service, while a React-based PDF reader overlays paragraph anchors on the source PDF and enables nonlinear navigation with on-demand AI-generated explanations. Applied to five textbooks across diverse domains, the system demonstrates that long-range discourse dependencies can be inferred at scale and used to support more connected and exploratory modes of learning.
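
A compact sketch of the two-stage edge construction, with toy embeddings in place of a sentence-embedding model and a stubbed LLM labeling step:

```python
import numpy as np

# Toy paragraph embeddings; the real system uses an embedding model.
embeddings = {"p1": np.array([1.0, 0.0]),
              "p2": np.array([0.9, 0.1]),
              "p3": np.array([0.0, 1.0])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stage 1: cheap similarity filter proposes candidate paragraph pairs.
candidates = [(u, v) for u in embeddings for v in embeddings
              if u < v and cosine(embeddings[u], embeddings[v]) > 0.8]

# Stage 2 (stubbed): an LLM labels each candidate as a "background" or
# "elaboration" dependency, yielding directed edges for the Neo4j graph.
def llm_label(u, v):
    return (u, v, "elaboration")  # placeholder for the LLM call

edges = [llm_label(u, v) for u, v in candidates]
print(edges)  # [('p1', 'p2', 'elaboration')]
```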

Tags: NLP; Semantic Similarity; Discourse Modeling; Knowledge Graph; Educational Technology; Text Analysis; Interactive Visualization

Agent4Molecule: LLM Agent for Discovery

Dhruv Garg, Emily Shen, Kaavya Subramanian

Several artificial intelligence (AI) tools aimed at accelerating the molecular discovery pipeline have emerged recently. However, with the exception of AlphaFold, most tools in this domain have yet to see wide adoption. In parallel, Large Language Model (LLM) agents have gained significant traction for their scalable reasoning and problem-solving capabilities. Motivated by these developments, we propose an LLM agent that automates molecular discovery pipelines by orchestrating state-of-the-art AI tools, stitching them together with structured evaluation of their outputs, and using strong reasoning capabilities to reliably produce high-quality molecules. To showcase the capabilities of our agent, we automate two representative pipelines, heme binder generation and EnzyGen, from natural language prompts, and successfully reproduce key results reported in the original works.

Tags: Large Language Models (LLMs); Agent; MCP Server; AI for Science; Molecule Generation; Protein Design

LLM Data Attribution Benchmark

Hanzhang Zhao, Niket Jain, Ishita Dasgupta

This work studies how training data quality and data attribution jointly shape large language model (LLM) behavior. We first introduce DATE-LM, a benchmark for evaluating data attribution methods across three practical tasks: training-data selection, toxicity and bias filtering, and factual influence tracing. DATE-LM provides standardized evaluation protocols and scalable infrastructure for consistent comparison of attribution techniques. Motivated by the dependence of attribution reliability on underlying data quality, we further investigate an end-to-end LLM-based web rewriting pipeline as an alternative to traditional heuristic cleaning in pretraining datasets. Through experiments involving multiple extraction methods, filtering pipelines, and LLM rewriters, we observe meaningful variation in document quality, repetition patterns, and scoring metrics. Our findings show that cleaner, more coherent corpora not only benefit pretraining but also improve the stability and interpretability of data attribution. Together, these results highlight a unified perspective on building attribution-aware data curation pipelines for modern LLMs.

Tags: Large Language Models (LLMs); Data Attribution; Training Data Quality; Data Curation; Benchmarking; Web Corpus Construction; End-to-End Web Rewriting; CommonCrawl; Heuristic Filtering; LLM Rewriting; Influence Tracing; Toxicity and Bias Mitigation; Pretraining Data Selection; Quality Scoring; Dataman; fastText; Corpus Refinement; Attribution-Aware Data Pipelines

Automated SQL Hinting for Postgres Query Optimization

Wenda Fu, Xueqi Li, Bobby Norwood

We introduce pg_hint_engine, an extensible framework for automatically generating SQL hint representations of arbitrary query plans. We demonstrate the engine’s functionality by forcing DataFusion plans to run on Postgres. We further demonstrate the ability of pg_hint_engine to improve query performance through query plan management, yielding a fourteen percent speedup on the Join Order Benchmark.
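
For illustration, a toy serialization of a join plan into a pg_hint_plan-style comment, the kind of SQL hint representation of a plan the abstract describes; the plan schema and mapping rules are made up, not pg_hint_engine's internals:

```python
# A toy plan: hash-join t1 (outer) with t2 (inner), with fixed scan methods.
plan = {"op": "HashJoin", "outer": "t1", "inner": "t2",
        "scans": {"t1": "SeqScan", "t2": "IndexScan"}}

def to_hints(p) -> str:
    """Render a plan as a pg_hint_plan-style hint comment."""
    hints = [f"Leading(({p['outer']} {p['inner']}))",
             f"{p['op']}({p['outer']} {p['inner']})"]
    hints += [f"{method}({table})" for table, method in p["scans"].items()]
    return "/*+ " + " ".join(hints) + " */"

sql = to_hints(plan) + " SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;"
print(sql)
# /*+ Leading((t1 t2)) HashJoin(t1 t2) SeqScan(t1) IndexScan(t2) */ SELECT ...
```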

Tags: Query Optimization; SQL Hints; Query Plan Management

Optimizing Hybrid Cloud Partition and Placement of Data and Compute

Adarsh Nandanwar, Ananya Angadi, Chenghui Yu, Junwei Chen

The hybrid cloud deployment model is becoming increasingly popular with large organizations. In such a hybrid environment, data may need to move frequently between the on-premises and cloud sides, leading to thousands of dollars per year in data movement or replication costs. Moirai is a new framework for optimizing data and job placement in hybrid cloud deployments. It extracts relevant details about jobs from job scheduler logs and uses them to guide data placement, job placement, and job scheduling decisions. The goal is to minimize total cost and peak resource utilization. The framework is designed to run periodically for maximal efficiency and is capable of scaling to large data analytics infrastructure.
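
A toy LP relaxation of the placement problem, sketched with scipy; the sizes, costs, and single capacity constraint are illustrative, not Moirai's actual formulation:

```python
from scipy.optimize import linprog

# x[i] = fraction of dataset i placed in the cloud (LP relaxation).
sizes = [40, 25, 60]          # GB per dataset
move_cost = [5.0, 2.0, 9.0]   # $/GB to move each dataset to the cloud
onprem_capacity = 80          # GB that can remain on-premises

# Minimize total movement cost subject to on-prem capacity:
# sum(sizes[i] * (1 - x[i])) <= capacity
# <=> -sum(sizes[i] * x[i]) <= capacity - sum(sizes)
c = [move_cost[i] * sizes[i] for i in range(3)]
A_ub = [[-s for s in sizes]]
b_ub = [onprem_capacity - sum(sizes)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * 3)
print(res.x)  # cheapest data moves first: [0.5, 1.0, 0.0]
```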

Tags: Cloud Infrastructure; Hybrid Cloud; Data Placement; Job Scheduling; Optimization; Linear Programming

NL2SQL: Natural Language to SQL

Venu Arvind Arangarajan, Tim Han, Ziming Wang

Text-to-SQL systems have streamlined data analysis by allowing natural language queries to replace manu- ally written SQL, enabling analysts to focus on insights rather than query construction. However, current approaches, such as fine-tuning a single strong model, struggles to scale to complex, real-world databases. These methods fail when handling large-scale relational structures and dynamic query requirements in industrial settings. To address this gap, we propose a multi-agent Text-to-SQL generation framework that automates schema linking, error detection and iterative refinement through reasoning plan generation and multi-agent collaboration. By moving beyond fixed schema retrieval methods and refinement strategies, our approach improves robustness, offering a scalable solution for complex database environments.

Tags: Text-to-SQL; Multi-Agent Framework; Large Language Models; Database Schema Analysis; SQL Generation; Spider 2.0; Generative AI

WayBuddy

Rithvik Senthil, Gunavardhan Akiti, Naveen Shenoy, Rupsa Dhar

Mobile vision applications often rely on compact object detection models that must operate under tight computational constraints and adapt to continuously changing environments. However, maintaining these small models typically requires costly manual annotation and frequent retraining. We introduce WayBuddy, an automated fine-tuning framework that combines active learning with supervision from a large vision-language model (VLM). The system identifies uncertain on-device detections, routes them to a server, and uses the VLM as a drop-in replacement for human annotators to produce high-quality pseudo-labels. These labels are integrated with historical data and periodically distilled back into the on-device detector, enabling continuous improvement without human involvement. We evaluate this architecture on real-world mobile video data, demonstrating that VLM-guided active learning reduces annotation cost to zero while improving detection accuracy across multiple metrics. Our results highlight a practical pathway for maintaining small models through selective, large-model supervision in resource-constrained environments.
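
A minimal sketch of the uncertainty gate that decides which on-device detections are routed to the VLM; the confidence band and detection schema are assumptions:

```python
# Detections in an ambiguous confidence band go to the server-side VLM
# for pseudo-labeling; confident ones are kept as-is.
LOW, HIGH = 0.3, 0.7  # illustrative thresholds

def route(detections):
    uncertain, confident = [], []
    for det in detections:
        (uncertain if LOW <= det["score"] <= HIGH else confident).append(det)
    return uncertain, confident

dets = [{"box": (10, 10, 50, 50), "score": 0.55, "label": "sign"},
        {"box": (0, 0, 20, 20), "score": 0.95, "label": "car"}]
to_vlm, keep = route(dets)
print(len(to_vlm), "detection(s) sent to the VLM for pseudo-labels")  # 1
```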

Tags: WayBuddy; On-Device Object Detection; Automated Active Learning; Fine-Tuning

Less LLM, More Documents: Searching for Improved RAG

Jingjie Ning, Yibo Kong, Yunfan Long

Retrieval-Augmented Generation (RAG) couples document retrieval with large language models (LLMs). While scaling generators improves accuracy, it also raises cost and limits deployability. We explore an orthogonal axis: enlarging the retriever’s corpus to reduce reliance on large LLMs. Experimental results show that corpus scaling consistently strengthens RAG and can often serve as a substitute for increasing model size, though with diminishing returns at larger scales. Small- and mid-sized generators paired with larger corpora often rival much larger models with smaller corpora; mid-sized models tend to gain the most, while tiny and large models benefit less. Our analysis shows that improvements arise primarily from increased coverage of answer-bearing passages, while utilization efficiency remains largely unchanged. These findings establish a principled corpus–generator trade-off: investing in larger corpora offers an effective path to stronger RAG, often comparable to enlarging the LLM itself.

Tags: Retrieval-Augmented Generation; Passage Retrieval; Large Language Models; Corpus Scaling; Resource-Constrained Inference

FAIRMUNI-2

Lucy Sun, Nachaun Zhao, Bolin Zhang, Sayak Banerjee

Municipal bonds are a crucial mechanism for local governments to finance public projects, yet the factors that influence their borrowing costs remain complex and often opaque. This project aims to identify the key determinants of the True Interest Cost (TIC) of municipal bond issuers. To enable large-scale and accurate TIC analysis, we develop a large language model (LLM)-based extraction pipeline that automatically retrieves bond CUSIP schedules, bond information, and cost tables from unstructured offering statement PDFs. Using the extracted data, we compute TIC at scale and build predictive models that highlight the key features driving borrowing costs. The results indicate that years to maturity is the dominant factor associated with higher borrowing costs. Detailed results are presented in the project report and through an interactive dashboard.
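
For reference, TIC is the annual rate at which the present value of all debt-service payments equals the bond proceeds; a self-contained bisection solver under that standard definition (the toy cash flows are illustrative):

```python
def tic(proceeds: float, payments: list[tuple[float, float]],
        lo: float = 0.0, hi: float = 1.0, tol: float = 1e-10) -> float:
    """True Interest Cost: the rate r solving
    proceeds = sum(amount / (1 + r) ** t) over all payments,
    where payments is a list of (years_from_issuance, amount)."""
    def pv_gap(r):
        return sum(amt / (1 + r) ** t for t, amt in payments) - proceeds

    while hi - lo > tol:
        mid = (lo + hi) / 2
        # Present value decreases in r: a positive gap means r is too low.
        if pv_gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Toy issue: $1,000,000 of proceeds, level $120,000 payments for 10 years.
print(tic(1_000_000, [(t, 120_000) for t in range(1, 11)]))  # ~0.0346
```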

Tags: Large Language Model; Financial Analysis; Municipal Bond

GPU Program Partitioning for Superoptimization

Man Kit Ao, Jiaying Li, Ayush Kumar

As Large Language Models (LLMs) become increasingly complex and ubiquitous, the demand for running them efficiently is also increasing. Graphics Processing Units (GPUs) are often the hardware of choice for training and running LLMs. However, programming GPU kernels requires extensive domain knowledge and engineering effort. Tensor optimizers now exist that can automatically generate GPU kernels. Mirage is one such superoptimizer, capable of searching through multiple levels of the GPU hierarchy. However, this large search space means Mirage can only effectively optimize small subgraphs consisting of a few operators. Applying Mirage to larger models like GPT or Llama requires users to manually partition the model, which is engineering-intensive.

In our project, we propose and implement a cost-model-based automatic program partitioner that automates the partitioning and Mirage superoptimization of large models. We built an end-to-end pipeline that extracts and partitions the computational graph of arbitrary PyTorch programs into Mirage-compatible and incompatible partitions. Afterwards, we perform cost-model-informed dynamic programming to automatically partition these programs and apply Mirage superoptimization. Our goal is to enable users to run Mirage on their own PyTorch programs to improve computational efficiency.
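
A toy version of the cost-model-informed dynamic program over a linear operator sequence; the compatibility rule and cost function are stand-ins for Mirage's, and real computational graphs are DAGs rather than chains:

```python
import functools

ops = ["matmul", "add", "relu", "matmul", "softmax"]

def mirage_compatible(segment):
    # Made-up rule standing in for the supported-subgraph check.
    return "softmax" not in segment or len(segment) == 1

def cost(segment):
    # Stand-in cost model: superoptimizing longer compatible segments
    # is assumed to yield cheaper runtime.
    return 1.0 / len(segment) if mirage_compatible(segment) else float(len(segment))

@functools.lru_cache(None)
def best(i):
    """Minimum total cost (and chosen cuts) to partition ops[i:]."""
    if i == len(ops):
        return 0.0, ()
    options = []
    for j in range(i + 1, len(ops) + 1):
        seg = tuple(ops[i:j])
        sub_cost, sub_cuts = best(j)
        options.append((cost(seg) + sub_cost, (seg,) + sub_cuts))
    return min(options)

total, partition = best(0)
print(total, partition)  # compatible prefix grouped, softmax isolated
```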

Tags: GPU; CUDA Kernels; Program Partitioning; Tensor Parallelism; Model Parallelism; Graphs; Cost Modeling

CyberLife AI

Webber Wu, Eagle Lo, Brad Chen

This work presents CyberLife AI, a no-code platform for creating personalized, topic-driven conversational agents with persistent memory and knowledge grounding. We address limitations in existing platforms through a memory-augmented dual-retrieval RAG pipeline combining conversation history recall with document-based knowledge retrieval. The system enables non-technical users to create agents by uploading domain-specific documents, defining personality traits, and customizing voice and appearance through integrated speech recognition, text-to-speech synthesis, and animated avatar generation. The architecture comprises a TypeScript/React frontend, Python Flask backend, and a separate GPU model server for avatar generation.
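
A minimal sketch of the dual-retrieval step, merging conversation-memory and document hits into one ranked context; the stores, scoring, and fusion rule are illustrative:

```python
class Store:
    """Toy retrieval store; the real system uses vector similarity."""
    def __init__(self, items):
        self.items = items
    def search(self, query):
        q = set(query.split())
        # Placeholder lexical-overlap scoring.
        return [(len(q & set(t.split())), t) for t in self.items]

def dual_retrieve(query, memory_store, doc_store, k=4):
    merged = sorted(memory_store.search(query) + doc_store.search(query),
                    key=lambda s: s[0], reverse=True)
    return [text for _, text in merged[:k]]

memory = Store(["user said they want to buy a telescope"])
docs = Store(["guide: how to buy a telescope", "galaxy formation basics"])
print(dual_retrieve("which telescope should I buy", memory, docs, k=2))
```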

Tags: No-Code Conversational Agent Generation Platform; RAG; Long-Term Memory; Dual-Database Retrieval; Multimodal Ingestion; Large Language Model (LLM) Systems; Knowledge Indexing; Applied AI Engineering

Reproducible Determination of 3D Model and Porosity of Fuel Cell Cathode Layers from pFIB-SEM Data

Nicole Wang, Aryan Mehta, Enora Petry

This project presents a reproducible pipeline for 3D reconstruction and porosity analysis of fuel cell cathode layers using pFIB-SEM image data. Existing approaches rely heavily on manual parameter tuning, which introduces inconsistencies across datasets and limits reproducibility (Ferner et al., 2024a). To overcome the scarcity of labeled data, we also generate synthetic 3D porous structures using Porous Microstructure Analysis (PuMA) (Ferguson et al., 2018) and Blender (Blender Foundation, 2023), allowing us to create diverse training datasets with known ground truth. Our approach improves reproducibility and sets the foundation for future integration with scalable machine learning models and automated analysis of real and synthetic materials.
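
Once a volume is segmented, porosity reduces to a voxel fraction; a minimal sketch, assuming a 0 = pore / 1 = solid label convention:

```python
import numpy as np

def porosity(volume: np.ndarray, pore_label: int = 0) -> float:
    """Porosity of a segmented 3D volume: fraction of voxels labeled pore."""
    return float((volume == pore_label).mean())

rng = np.random.default_rng(0)
seg = (rng.random((64, 64, 64)) > 0.4).astype(np.uint8)  # toy segmentation
print(porosity(seg))  # ~0.4
```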

Tags: Mechanical Engineering; Machine Learning; Vision Model; Image Segmentation; 3D Software; AI for Science; AI for Materials Science

MetaSafetyReasoner

Shrey Jain, Nishoak Kosaraju, Sathwik Acharya, James Ding

Large Reasoning Models (LRMs), despite their excellence in mathematical and analytical reasoning, pose serious safety risks. Their thinking chains often go unmonitored, and under outcome-level rewards, a chain that contains unsafe content but ends in a refusal can be inadvertently reinforced by outcome-level post-alignment methods. Fine-grained annotation within reasoning chains is gaining popularity, with recent benchmarks such as SafeRBench proposing novel schemes. We take a step further toward leveraging fine-grained annotation for fine-grained rewards and propose MetaSafetyReasoner, a scalable framework that can differentially reinforce safety-inducing, safety-destabilizing, and neutral behaviors in a model’s reasoning chains. Our approach centers on an LM trained to annotate reasoning chunks with SafeRBench labels. After chunking the reasoning chain and labeling each chunk, we assign each chunk a reward based on its risk profile. We then leverage a GSPO-inspired training methodology to derive the loss at the subsequence level. We expect gains both on explicit output-safety classification benchmarks and on SafeRBench.
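
A schematic of the chunk-level reward construction; the label set, reward values, and stubbed annotator are illustrative, not the trained SafeRBench annotator:

```python
# Chunk-level labels map to dense rewards that the RL objective can
# credit at the subsequence level.
REWARD = {"safety_inducing": +1.0, "neutral": 0.0, "safety_destabilizing": -1.0}

def annotate(chunk: str) -> str:
    """Placeholder for the SafeRBench-trained chunk annotator."""
    return "safety_destabilizing" if "bypass" in chunk else "neutral"

def chunk_rewards(reasoning: str, delim: str = "\n\n") -> list[float]:
    return [REWARD[annotate(c)] for c in reasoning.split(delim)]

chain = "First, consider the request.\n\nOne could bypass the filter by..."
print(chunk_rewards(chain))  # [0.0, -1.0]
```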

Tags: Large Reasoning Models; Safety Alignment; Dense Reward Model; Reasoning Supervision; Jailbreak Attacks; Reinforcement Learning

Multi-Agent LLM Systems for Code Migration

Shanru Lin, Xinyu Li, Yogesh Adhi Narayan

Code migration between libraries is critical for security and compatibility but remains tedious and error-prone in large, real-world codebases. We introduce an end-to-end multi-agent LLM system for Python library migration that orchestrates three specialized agents: (1) an environment-setup agent that automatically builds Dockerized runtimes for each repository; (2) a code-migration agent, extending SWE-Agent with web documentation retrieval and a new LSP-RepoGraph module for symbol-centric structural context; and (3) a testing agent, based on Qodo-Cover, that synthesizes Helper Tests for regression safety and Evaluation Tests for post-migration validation. Our pipeline targets real repositories from PyMigBench and evaluates migration quality via AST-level patch similarity and the pass rate of generated tests inside project-specific containers. Early results suggest that structured retrieval and documentation-aware editing reduce hallucinated API usages and missed call sites. This system shows the promise of multi-agent, tool-augmented LLM workflows for realistic Python library migration.

Tags: Code Migration; Large Language Models (LLMs) Agent; Multi-Agent; Python Library Migration; SWE-Agent; Qodo-Cover; RepoGraph

LLM Secure Code Generation via Reasoning and Reinforcement Learning

Yanlin Fei, Arihant Sheth

Large language models for code generation frequently produce code containing security vulnerabilities, yet existing mitigation approaches rely solely on unreliable reward signals from either high-false-positive static analyzers or non-specialized general-purpose LLMs. We propose a two-stage framework combining Chain-of-Thought Supervised Fine-Tuning with multi-reward Reinforcement Learning to enhance security while preserving functional correctness. Our approach first employs CoT-SFT on security-related coding tasks to instill explicit security reasoning, then applies RL with two complementary rewards: functional and security-focused unit tests for dynamic execution validation, combined with static analysis using CodeQL for vulnerability detection. Evaluation on the Python subset of CWEval demonstrates competitive performance with state-of-the-art models, achieving secure code generation rates matching GPT-4o despite using a 20× smaller model. Future evaluation on the complete CWEval set and CodeGuardPlus benchmarks covering over 40 CWE categories will assess generalization to unseen vulnerability patterns across multiple programming languages, demonstrating whether principled multi-reward optimization can achieve robust security without sacrificing code quality.
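
A sketch of how the two reward signals might combine into a scalar for RL; the weights and the zero-findings bonus are illustrative, not the project's exact reward shaping:

```python
def reward(tests_passed: int, tests_total: int, codeql_findings: int,
           w_func: float = 0.6, w_sec: float = 0.4) -> float:
    """Combine dynamic unit-test signal with a static-analysis penalty."""
    functional = tests_passed / tests_total
    security = 1.0 if codeql_findings == 0 else 1.0 / (1 + codeql_findings)
    return w_func * functional + w_sec * security

print(reward(tests_passed=9, tests_total=10, codeql_findings=0))   # 0.94
print(reward(tests_passed=10, tests_total=10, codeql_findings=2))  # ~0.733
```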

Tags: Secure Code Generation; Large Language Models (LLMs) Code Generation; Common Weakness Enumerations

Interactive Vision-Language Navigation

Wei Bin Au Yeong, Calvin Qin, Anubhav Sharma, Yaqi Wang

We present an Interactive Vision-Language Navigation (VLN) system that enhances robot navigation in real-world environments through multi-turn dialogue. Unlike existing VLN agents that assume ideal, unambiguous instructions, our approach addresses the ambiguity and inconsistency common in natural human communication. Our system detects uncertainty in user instructions arising from under-specification or mismatches with the environment and actively resolves it through dialogue. The pipeline integrates a large language model (LLM) with environmental perception modules to reason about semantic and contextual cues. To evaluate ambiguity detection and resolution methods, we augment current datasets with ambiguous instruction scenarios and generate additional natural tasks using LLMs. Experiments show that our dialogue-enabled agent achieves higher task completion rates in ambiguous settings, demonstrating the promise of conversational VLN for more adaptable and user-friendly human-robot interaction.

Tags: Vision-Language Navigation; Ambiguity Detection; Dialogue-based Resolution; Human-Robot Interaction; Embodied AI

Abstraction and Reasoning Challenge for LLMs

Akhil Dua, Jake Bentley, Naman Tuli

This work presents a data pipeline, modular system, and hardware strategy designed to improve the abstraction and reasoning capabilities of large language models (LLMs) on the ARC-AGI-2 benchmark, a challenging suite of grid-based visual reasoning tasks requiring generalization to entirely novel test problems. We combine staged fine-tuning on curated ARC-style datasets of increasing difficulty with an optimized test-time training (TTT) pipeline, enabling dynamic model adaptation using few-shot supervision at inference. Our framework integrates a 4-bit quantized Mistral NeMo 8B model optimized for memory efficiency and deployed across 4 GPUs under strict compute constraints. While experiments demonstrated substantial gains over baseline models, our staged data pipeline failed to produce comparable accuracy gains on the actual leaderboard. With the help of inference-level optimizations and parameter averaging, however, the team’s best submission scored 6.67% on the public leaderboard, placing 85th out of more than 1,400 teams.
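
For context, loading a 4-bit NF4-quantized model with the standard transformers/bitsandbytes API looks roughly like this; the checkpoint id is a placeholder, not necessarily the one the team used:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization for memory efficiency.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-Nemo-Instruct-2407",  # placeholder checkpoint id
    quantization_config=bnb_config,
    device_map="auto",  # shard the model across the available GPUs
)
```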

Tags: ARC-AGI-2; Visual Reasoning; Test-Time Training; Few-Shot Adaptation; Model Quantization; Mistral NeMo; Multi-GPU Inference; Curriculum Fine-Tuning; Generalization Benchmarking

Sotopia-ToM: Evaluating and Advancing Information Management in Multi-Agent Interaction with Theory of Mind (ToM)

Ruichen Wang, Shihua Zeng, Yashwanth Yerabudala Surendra

Large language model (LLM) based agents in multi-agent systems (MAS) often fail to account for information asymmetry, leading to over-disclosure or the omission of critical details. We introduce Sotopia-ToM, a framework built on Sotopia V2 that evaluates an agent’s ability to share the right information while protecting sensitive data during multi-party interactions. We collected and created 1,360 multi-party scenarios exhibiting varying degrees of information asymmetry, forming an extensive benchmark for multi-agent conversation with a focus on privacy. We implemented and tested algorithmic improvements such as chain-of-thought prompting and machine Theory of Mind (ToM), enabling agents to infer and predict other agents’ knowledge states.

Tags: Multi-Agent Systems (MAS); Information Sharing; Privacy; Chain-of-Thought (CoT); Theory of Mind (ToM)

Cross Lingual Natural Language Inference

Jared Cochrane, Clint Zhu, Declan Tan

Natural Language Inference (NLI) examines whether a hypothesis sentence logically follows from, contradicts, or is unrelated to a premise. Cross-Lingual NLI (XNLI) extends this task across languages, where the premise and hypothesis appear in different languages. While widely useful for downstream tasks such as question-answering, summarization, and translation evaluation, existing XNLI datasets contain only about four hundred thousand examples per language pair, limiting the effectiveness of modern large-scale models.

In this project, we leverage large parallel corpora and sentence perturbation to construct a greatly expanded XNLI dataset with roughly one million examples per language pair. We fine-tune two XNLI classifiers on this expanded dataset and integrate them into a reference-free machine translation evaluation system. The XNLI-based MT evaluator shows strong correlation with state-of-the-art metrics, demonstrating that entailment-driven evaluation can reliably assess translation quality without requiring manually translated gold references.
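
A sketch of reference-free scoring via bidirectional cross-lingual entailment; `nli_probs` is a stub for the fine-tuned XNLI classifier, and the min-fusion rule is one plausible choice rather than the system's confirmed design:

```python
def nli_probs(premise: str, hypothesis: str) -> dict:
    """Placeholder returning P(entail/neutral/contradict) from the
    fine-tuned cross-lingual NLI classifier."""
    return {"entail": 0.9, "neutral": 0.08, "contradict": 0.02}

def mt_score(source: str, translation: str) -> float:
    forward = nli_probs(source, translation)["entail"]
    backward = nli_probs(translation, source)["entail"]
    # Require entailment both ways so omissions and additions are penalized.
    return min(forward, backward)

print(mt_score("Der Hund schläft.", "The dog is sleeping."))  # 0.9
```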

Tags: Natural Language Inference; Machine Translation Evaluation; Reference-Free Evaluation; XNLI; Entailment Classification; Large Language Model (LLM) Data Generation

Beyond H-Index: New Metrics of Scientific Impact

Divyan Goyal, Wang Xiang, Wenhao Xu

Evaluating scientific impact remains challenging because widely used citation-based metrics, such as the H-Index, fail to reflect how researchers actually assess influence, novelty, or interdisciplinary reach. This project introduces a human-centered framework for modeling academic impact by integrating large-scale researcher metadata with empirical evidence from controlled pairwise comparisons.

We construct a comprehensive NLP researcher dataset by combining ACL Anthology and Semantic Scholar metadata, along with a cross-disciplinary database to support broader analysis. An IRB-approved survey administered at CMU and major NLP conferences presents both anonymous and named researcher comparisons, allowing us to measure evaluation criteria as well as the influence of name visibility.

Using ranking models, we estimate how different metadata features contribute to human judgments of impact and analyze discrepancies between participants’ stated values and their revealed preferences. The resulting composite metric captures community judgment more faithfully than traditional citation-based indices, offering a scalable and more holistic method for assessing interdisciplinary scientific influence.
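
The ranking models above could be instantiated in several ways; one standard choice for pairwise comparison data is Bradley-Terry, fit here with the classic MM (Zermelo) updates on illustrative data:

```python
from collections import defaultdict

# Each tuple is (winner, loser) from one survey judgment (illustrative).
comparisons = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]

items = {i for pair in comparisons for i in pair}
wins = defaultdict(int)
pair_counts = defaultdict(int)
for w, l in comparisons:
    wins[w] += 1
    pair_counts[frozenset((w, l))] += 1

# MM updates: w_i <- W_i / sum_j n_ij / (w_i + w_j), then renormalize.
strength = {i: 1.0 for i in items}
for _ in range(100):
    new = {}
    for i in items:
        denom = sum(
            pair_counts[frozenset((i, j))] / (strength[i] + strength[j])
            for j in items if j != i and frozenset((i, j)) in pair_counts
        )
        new[i] = wins[i] / denom if denom > 0 else strength[i]
    total = sum(new.values())
    strength = {i: s / total for i, s in new.items()}

print(sorted(strength.items(), key=lambda kv: -kv[1]))  # A ranked first
```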

Tags: Citation Context Analysis; Scientific Impact Measurement; Metric Learning; Researcher Profiling; Human-Guided Ranking