Xiangyi Li: The Young Founder Bringing Transparency to AI

Matthew Kayser

Contributor

Dec. 18, 2025, 3:11 p.m. ET

Companies are constantly looking to incorporate AI to grow their businesses, whether it’s to provide customer service agents or keep track of their productivity. However, running evaluations for the systems they build can be a slow, inconsistent process that often means piecing together disjointed scripts and datasets to accommodate their particular infrastructures.

Without standardized methods, comparing results across different models becomes nearly impossible, leaving critical questions about safety, reliability, and performance unanswered.

Xiangyi Li saw this gap during his work at Tesla and in research projects across universities. Rather than accept the inefficiency, he founded BenchFlow, a platform designed to make AI model evaluations transparent and accessible, with the aim of holding the technology accountable without sacrificing its practicality or advanced capabilities.

His First Encounters with AI

Li grew up in Shandong, China, where his interest in computing began when he stumbled upon a copy of the 2018 book Artificial Unintelligence, which introduced him to neural networks and to Kaggle, the data science competition platform. Soon, he was experimenting with Python scripts and entering online competitions, learning as he went.

His first real exposure to AI came when GPT-3 launched. While many in his academic circles had not yet heard of it, Li was already using the model to help write his university lab reports. The experience led to a realization: AI was advancing far faster than the tools available to test it.

Li later earned a bachelor’s degree in computer science from the Chinese University of Hong Kong and went on to pursue a master’s in the same field at San José State University, interning at several companies along the way. At Tesla, for example, he worked on evaluation frameworks for code-generation models and saw firsthand how difficult it was to measure these systems reliably.

That frustration planted the earliest seeds for what would become BenchFlow.

Automating Benchmark Testing With BenchFlow

Li founded BenchFlow in 2024 after noticing a pattern: whether for academic research or industry projects, teams struggled to run AI benchmarks efficiently. Each test often required manual setup and custom scripts and had to run in an isolated environment, a time-consuming process that slowed progress. In fact, recent studies show that 17 out of 24 major benchmarks don’t provide easy-to-run scripts for replicating results, and several provide scripts only for partial replication, greatly limiting reproducibility.

BenchFlow provides a platform where evaluation can be handled through a simple interface that Li describes as akin to a “smart kitchen,” automating the prep work of AI testing. Instead of writing new code for every experiment, researchers can submit a model through BenchFlow’s API and get standardized, reproducible evaluations spanning a variety of benchmarks.

The idea gained momentum when Li released an open-source benchmark modeled on a viral project trending on Twitter. The tool gave researchers the chance to test models in a game-like environment and measure how well they handled reasoning or sequential decision-making tasks.

Within weeks, researchers at major labs were using this benchmark for internal experiments, helping BenchFlow secure more than $1 million in seed funding from backers including Jeff Dean, Google’s chief scientist; Arash Ferdowsi, co-founder of Dropbox; and Founders, Inc.

By the end of its first year, the platform hosted over 60 benchmarks, with adoption from teams at a variety of multinational companies as well as many research labs worldwide. For Li, the real validation came when researchers began mentioning they had used BenchFlow internally, proof that the tool solved a genuine problem.

Building An Open-Source Community

Much of BenchFlow’s growth has come from its open-source philosophy. Li spent weekends at hackathons across San Francisco, speaking directly to developers to accurately understand their needs and releasing tools for free to encourage them to experiment within the platform.

That openness helped BenchFlow establish itself as a valuable community resource. Researchers could build on each other’s work, share results, and avoid rerunning benchmarks that others had already run. The platform’s standardized benchmarks also made it easier to compare models from different institutions, lowering the barrier to entry for smaller labs without access to massive computing infrastructure.

By prioritizing shared standards and open participation, BenchFlow aims to help researchers better navigate the growing complexity of modern machine learning systems while fostering more transparent and responsible development practices.

The Next Frontier For Performance Benchmarks

Li remains aware of the risks that come with incorporating AI at its current level of capability. He believes standardized evaluation will be essential to ensuring that the technology is deployed safely. BenchFlow, he argues, can serve as the infrastructure for that greater transparency, acting as the equivalent of unit tests for automated systems before they reach the public.

He also sees a future where this technology moves closer to individuals instead of being restricted to corporate servers. Personalized agents, he predicts, will reshape how people present themselves online, letting them extend their reach in ways that were previously out of the question. But for that future to be safe and reliable, rigorous testing will be non-negotiable.

“Evaluation is how we make sure these systems work as intended,” he has explained. “It’s the step that turns rapid progress into something people can actually trust.”

BenchFlow’s journey from a side project to a widely used platform reflects that mission. As AI becomes more integrated into everyday workflows, tools that bring accountability to its outputs will only become more essential, and Xiangyi Li intends to keep building them.