Research Intern @ Stanford AI Lab
Contact: abhaygupta1266@gmail.com
I am a high school researcher with a deep passion for natural language processing, large language models (LLMs), and AI safety. Currently, I am a research intern at the Stanford Artificial Intelligence Laboratory (SAIL), working directly under Prof. Yejin Choi and Liwei Jiang.
My research focuses on evaluating biases in LLMs, improving fairness, and diagnosing reasoning failures in complex contexts. My work has been published in top venues including EMNLP, NeurIPS, and AACL.
In collaboration with Meta and UC Berkeley
Proceedings of the Association for Computational Linguistics: EMNLP 2025
Current LLMs struggle to answer questions that span tens of thousands of tokens. We introduce NovelHopQA, a benchmark for evaluating 1-4 hop QA over 64k-128k-token excerpts from 83 full-length public-domain novels. Evaluating six SOTA models, we find that scale alone does not guarantee robust multi-hop reasoning.
Findings of the Association for Computational Linguistics: EMNLP 2025
EnDive (English Diversity) addresses the lack of intra-language evaluation in standard benchmarks. By translating Standard American English (SAE) datasets into five underrepresented dialects via few-shot prompting, we built a challenging diagnostic that uncovers persistent model biases against speakers of non-standard dialects across reasoning, logic, and math tasks.
Proceedings of the Third Workshop on NLP for Positive Impact, 2024
To support more inclusive NLP systems, we introduce AAVENUE, a benchmark for evaluating LLMs on NLU tasks in African American Vernacular English (AAVE). The benchmark uses human-verified LLM translation to reliably port GLUE and SuperGLUE tasks into AAVE.
Proceedings of SciProdLLM, IJCNLP-AACL 2025
We evaluate how demographic cues influence clinical reasoning in frontier LLMs by holding critical symptoms constant while perturbing patient pronouns across 69,000 parallel test items, exposing localized divergences in downstream medical rationales.
Accepted @ NeurIPS 2025 LLM Evaluation Workshop
We introduce a two-stage NLP extraction and classification pipeline that structures raw, free-form online discussions into a coherent deliberation map of core issues, barriers, and solutions. The pipeline outperforms prompt-only baselines and establishes a standardized task for modeling collective civic intelligence.
Research Intern
Aug 2025 - Present · Remote
Research Intern
Mar 2025 - Jan 2026 · Remote
Research Intern
Nov 2024 - Dec 2025 · Remote
Machine Learning Intern
Sep 2025 - Nov 2025 · Remote
LLM Researcher
Jan 2024 - Sep 2025 · Remote
Machine Learning Intern
Jun 2025 - Aug 2025 · Remote
Schmidt Sciences (Jan 2026)
Davidson Institute (Jul 2025)
My journey in AI began with a simple curiosity about how machines process human language, but it was Kevin Zhu who truly opened the door to the world of research for me. He was the very first person to guide my early steps, patiently teaching me the fundamentals and becoming my most impactful mentor. That initial curiosity quickly evolved into a passion for ensuring these systems are fair, inclusive, and safe for everyone. As I continue to grow, I am also incredibly grateful to Prof. Yejin Choi and Liwei Jiang for continuously inspiring me to tackle the difficult sociotechnical problems in AI alignment.
Outside of running evaluations and writing papers, I enjoy exploring the intersection of technology and linguistics, keeping up with the rapid pace of open-source AI, and finding new ways to make complex machine learning concepts accessible to my peers.