
#322 Add EXP-Bench #325


Merged 5 commits on Jun 16, 2025.
Changes from 4 commits
43 changes: 43 additions & 0 deletions source/_data/SymbioticLab.bib
@@ -2083,3 +2083,46 @@ @article{mlenergy-benchmark:arxiv25
As the adoption of Generative AI in real-world services grows explosively, \emph{energy} has emerged as a critical bottleneck resource. However, energy remains a metric that is often overlooked, under-explored, or poorly understood in the context of building ML systems. We present the ML.ENERGY Benchmark, a benchmark suite and tool for measuring inference energy consumption under realistic service environments, and the corresponding ML.ENERGY Leaderboard, which have served as a valuable resource for those hoping to understand and optimize the energy consumption of their generative AI services. In this paper, we explain four key design principles for benchmarking ML energy we have acquired over time, and then describe how they are implemented in the ML.ENERGY Benchmark. We then highlight results from the latest iteration of the benchmark, including energy measurements of 40 widely used model architectures across 6 different tasks, case studies of how ML design choices impact energy consumption, and how automated optimization recommendations can lead to significant (sometimes more than 40\%) energy savings without changing what is being computed by the model. The ML.ENERGY Benchmark is open-source and can be easily extended to various customized models and application scenarios.
}
}

@Article{expbench:arxiv25,
author = {Patrick Tser Jern Kon and Jiachen Liu and Xinyi Zhu and Qiuyi Ding and Jingjia Peng and Jiarong Xing and Yibo Huang and Yiming Qiu and Jayanth Srinivasa and Myungjin Lee and Mosharaf Chowdhury and Matei Zaharia and Ang Chen},
title = {{EXP-Bench}: Can {AI} Conduct {AI} Research Experiments?},
year = {2025},
month = {Jun},
volume = {abs/2505.24785},
archivePrefix = {arXiv},
eprint = {2505.24785},
url = {https://arxiv.org/abs/2505.24785},
publist_confkey = {arXiv:2502.16069},
Review comment (Member): Mismatch from copy pasting. (A corrected field is sketched after this entry.)

publist_link = {paper || https://arxiv.org/abs/2505.24785},
publist_link = {code || https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench},
publist_link = {blog || https://www.just-curieous.com/machine-learning/research/2025-06-11-exp-bench-can-ai-conduct-ai-research-experiments.html},
publist_topic = {Systems + AI},
publist_abstract = {
Automating AI research holds immense potential for accelerating scientific progress, yet current AI agents struggle with the complexities of rigorous, end-to-end experimentation. We introduce EXP-Bench, a novel benchmark designed to systematically evaluate AI agents on complete research experiments sourced from influential AI publications. Given a research question and incomplete starter code, EXP-Bench challenges AI agents to formulate hypotheses, design and implement experimental procedures, execute them, and analyze results. To enable the creation of such intricate and authentic tasks with high fidelity, we design a semi-autonomous pipeline to extract and structure crucial experimental details from these research papers and their associated open-source code. With the pipeline, EXP-Bench curated 461 AI research tasks from 51 top-tier AI research papers. Evaluations of leading LLM-based agents, such as OpenHands and IterativeAgent, on EXP-Bench demonstrate partial capabilities: while scores on individual experimental aspects such as design or implementation correctness occasionally reach 20-35%, the success rate for complete, executable experiments was a mere 0.5%. By identifying these bottlenecks and providing realistic step-by-step experiment procedures, EXP-Bench serves as a vital tool for future AI agents to improve their ability to conduct AI research experiments. EXP-Bench is open-sourced at https://github.com/Just-Curieous/Curie/tree/main/benchmark/exp_bench.
}
}
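The copy-paste note above appears to concern the publist_confkey field, which carries a different arXiv identifier (2502.16069) than this entry's own eprint. A minimal sketch of the presumably intended field, assuming the confkey should simply mirror the eprint:

% assumed fix: confkey mirrors the eprint field of this entry
publist_confkey = {arXiv:2505.24785},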

@PhDThesis{jiachen:dissertation,
Review comment (Member): amberljc

author = {Jiachen Liu},
title = {User-Centric Machine Learning Systems},
year = {2025},
month = {June},
institution = {University of Michigan},
publist_link = {paper || jiachen-dissertation.pdf},
Review comment (Member): Filename should be amberljc. (A renamed link is sketched after this entry.)

publist_confkey = {dissertation},
publist_topic = {Systems + AI},
publist_abstract = {
Over the past five years, artificial intelligence (AI) has evolved from a specialized technology confined to large corporations and research labs into a ubiquitous tool integrated into everyday life. While AI extends its reach beyond niche domains to individual users across diverse contexts, the widespread adoption has given rise to new needs for machine learning (ML) systems to balance user-centric experiences—such as real-time responsiveness, accessibility and personalization—with system efficiency, including operational cost and resource utilization.
However, designing such systems is complex due to diverse AI workloads—spanning conversational services, collaborative learning, and large-scale training—as well as the heterogeneous resources, ranging from cloud data centers to resource-constrained edge devices. My research addresses these challenges to achieve these dual objectives through a set of design principles centered on a sophisticated resource scheduler with a server-client co-design paradigm.

Our contributions are threefold. First, we propose Andes to address the critical need for real-time responsiveness in LLM-backed conversational AI by introducing the concept of QoE tailored for such text streaming service. Our server-side token-level scheduling algorithm dynamically prioritizes token generation based on user-centric metrics, while a co-designed client-side token buffer smooths the streaming experience. This approach significantly improves user experience during peak demand and achieves substantial GPU resource savings.

Second, we propose Auxo to deliver personalized AI services to a diverse set of end users through scalable collaborative learning. We propose a novel client-clustering mechanism that adapts to statistical data heterogeneity and resource constraints, complemented by a cohort affinity mechanism that empowers clients to join preferred groups while preserving privacy. This approach improves the personalized model performance, adapting to varying needs and contexts of end users.

Third, we propose Venn, to handle escalating demand for efficient resource sharing in multi-job collaborative learning environments. Our resource scheduler resolves complex resource contention proactively and introduces a novel job offer abstraction that allows client resources to identify eligible jobs based on their local resources. This significantly reduces job completion times and improves resource efficiency.
}
}
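The filename note above targets the paper link, which currently points at jiachen-dissertation.pdf. A minimal sketch of the renamed link, assuming the requested name keeps the existing "-dissertation.pdf" suffix (the exact target filename is not shown in this diff):

% assumed rename: swaps in the requested name, keeps the existing suffix
publist_link = {paper || amberljc-dissertation.pdf},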



Review comment (Member): Rename filename as per convention.

Binary file not shown.