
Automating Intellectual Toil: Agent-Driven Development at Copilot Applied Science

Published: 2026-05-13 15:48:26 | Category: Programming

An AI researcher on the Copilot Applied Science team recently automated their own intellectual toil—analyzing thousands of lines of agent trajectory data. This led to a tool called eval-agents, which now empowers the whole team to build and share custom coding agents. Here are answers to common questions about this work.

What is agent-driven development in the context of Copilot Applied Science?

Agent-driven development refers to using AI-powered coding agents to automate repetitive, intellectual tasks—like analyzing benchmark results—so engineers can focus on creative problem-solving. In this case, the researcher built agents that parse trajectory files (JSON logs of agent actions) from evaluation benchmarks like TerminalBench2 or SWEBench-Pro. Instead of manually sifting through hundreds of thousands of lines of code, team members now run agents that surface patterns and reduce the reading load to just a few hundred lines. This approach turns the developer into a maintainer of automation, enabling faster iterations and deeper insights.
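To make the idea concrete, here is a minimal sketch of the kind of analysis such an agent automates: tallying action types across a trajectory log to surface patterns. The JSON field names (`step`, `action`, `target`) are illustrative assumptions—the article does not document the team's actual trajectory schema.

```python
import json
from collections import Counter

# Hypothetical trajectory format: a JSON list of agent steps, each with an
# "action" field. The real file layout used by the team is not documented here.
sample_trajectory = json.dumps([
    {"step": 1, "action": "read_file", "target": "src/main.py"},
    {"step": 2, "action": "run_tests", "result": "2 failed"},
    {"step": 3, "action": "edit_file", "target": "src/main.py"},
    {"step": 4, "action": "run_tests", "result": "passed"},
])

def summarize_actions(trajectory_json: str) -> Counter:
    """Count how often each action type appears in one trajectory."""
    steps = json.loads(trajectory_json)
    return Counter(step["action"] for step in steps)

summary = summarize_actions(sample_trajectory)
print(summary.most_common())
```

Even this crude tally shows why automation pays off: aggregated over dozens of tasks and multiple daily runs, such summaries replace hundreds of thousands of lines of manual reading with a few hundred lines of output.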

Source: github.blog

Why did the researcher decide to automate their own work?

The researcher regularly analyzed coding agent performance against standardized benchmarks. Each task produces a trajectory file, and with dozens of tasks across multiple daily runs, the total code to review reached hundreds of thousands of lines—impossible to do manually. They initially used GitHub Copilot to help surface patterns, but the process remained repetitive. The engineer in them said, “I want to automate that.” Agents provided the means to do so, leading to the creation of eval-agents. This freed them from toil and allowed them to focus on more creative aspects of research.

What is eval-agents and how does it work?

eval-agents is a tool that automates the analysis of agent trajectories from benchmark runs. It uses GitHub Copilot to understand the context of each trajectory and generate insights, patterns, or summaries automatically. The researcher designed it with three goals: make agents easy to share and use, make new agents easy to author, and make coding agents the primary vehicle for contributions. By leveraging GitHub’s collaborative features, team members can quickly create, test, and share their own agents without deep expertise. The result is a fast development loop where everyone can build solutions tailored to their needs.
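The article does not show eval-agents' internals, but the "easy to share, easy to author" design might look something like a simple registry of named agents, each pairing a prompt with declared inputs. The names here (`AGENTS`, `run_agent`) and the prompt format are illustrative assumptions, not the tool's real API.

```python
# A minimal sketch of a shareable agent registry, assuming each agent is
# just a prompt plus the inputs it requires. Not eval-agents' actual design.
AGENTS = {
    "failure-summarizer": {
        "prompt": "Summarize why each failed task's trajectory went wrong.",
        "inputs": ["trajectory_dir"],
    },
    "tool-usage-report": {
        "prompt": "Report which tools the agent invoked most often.",
        "inputs": ["trajectory_dir"],
    },
}

def run_agent(name: str, **inputs) -> str:
    """Assemble the request a coding agent (e.g. Copilot) would receive."""
    spec = AGENTS[name]
    missing = [k for k in spec["inputs"] if k not in inputs]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    return f"{spec['prompt']}\nInputs: {inputs}"

print(run_agent("failure-summarizer", trajectory_dir="runs/2026-05-12"))
```

Keeping agents as plain data like this is one way to let teammates author and share new ones through ordinary GitHub pull requests, without learning a framework.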

What were the main design goals for the eval-agents project?

The project was guided by three core goals:

  • Easy to share and use: Agents should be accessible to all team members through familiar GitHub workflows.
  • Easy to author new agents: Anyone should be able to create a custom agent without steep learning curves.
  • Make coding agents the primary vehicle for contributions: Encourage the team to think in terms of agents, not just scripts or manual processes.

These goals reflect the researcher’s experience as an open-source maintainer on the GitHub CLI, where collaboration and simplicity were key.


How did the researcher use GitHub Copilot to build these agents?

The researcher relied heavily on GitHub Copilot’s ability to provide context-aware code suggestions and natural language understanding. They iteratively refined prompts to help Copilot generate accurate analysis code for trajectory files. Over time, they learned to structure agent prompts so that Copilot could surface the most relevant patterns. This collaboration with Copilot sped up development dramatically—turning what used to take hours into minutes. The researcher also shared these prompt patterns with teammates, enabling them to craft their own agents quickly.
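One reusable pattern from this kind of prompt iteration is giving the model a fixed structure: context first, then the data excerpt, then the ask. The template below is a generic sketch of that pattern, not the researcher's actual prompt.

```python
def build_analysis_prompt(task_id: str, trajectory_excerpt: str, question: str) -> str:
    """Assemble a structured analysis prompt: context, data, then the ask.

    A generic prompt-structuring pattern, assumed for illustration — the
    article does not publish the team's real prompts.
    """
    return "\n".join([
        f"Task: {task_id}",
        "Trajectory excerpt:",
        trajectory_excerpt,
        f"Question: {question}",
        "Answer with the most relevant patterns only.",
    ])

prompt = build_analysis_prompt(
    task_id="swebench-pro/task-17",
    trajectory_excerpt='{"step": 3, "action": "run_tests", "result": "2 failed"}',
    question="Why did the agent loop on failing tests?",
)
print(prompt)
```

A fixed template like this makes prompts easy to refine incrementally and to hand to teammates, which matches the article's point about sharing prompt patterns across the team.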

What impact did eval-agents have on the team’s workflow?

Before eval-agents, the team spent significant time manually reviewing trajectory files. Now, anyone can run an agent to get instant summaries and anomaly detection. This has unlocked an incredibly fast development loop. Team members report being able to test hypotheses in real-time and collaborate more effectively. The researcher, who once might have automated themselves into a different job, now maintains the tool and supports peers in building their own agents. The overall productivity gain has been substantial, and the team feels more empowered to tackle complex research questions.
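The article mentions anomaly detection but not how it works; a crude stand-in is flagging runs whose trajectory length deviates sharply from the median, since an unusually long trajectory often signals an agent that got stuck. The threshold and the `flag_anomalies` helper are assumptions for illustration.

```python
from statistics import median

def flag_anomalies(step_counts: dict[str, int], factor: float = 2.0) -> list[str]:
    """Flag runs whose step count exceeds factor x the median step count.

    A simple heuristic standing in for the tool's real anomaly detection,
    which the article does not describe.
    """
    m = median(step_counts.values())
    return sorted(run for run, n in step_counts.items() if n > factor * m)

# Steps taken per benchmark task in one run (illustrative numbers).
counts = {"task-01": 40, "task-02": 38, "task-03": 120, "task-04": 42}
print(flag_anomalies(counts))  # → ['task-03']
```

Surfacing outliers like this is exactly the kind of instant summary that lets team members test hypotheses in real time instead of reading every trajectory.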

What lessons can other teams learn from this agent-driven approach?

The key lesson is that automation doesn’t have to be limited to physical toil—it can also handle intellectual, repetitive analysis. By using tools like GitHub Copilot to build specialized agents, teams can dramatically reduce the time spent on data parsing and pattern recognition. Another lesson is the importance of making agents easy to share and modify; this fosters a culture of collaboration where everyone contributes. Finally, the researcher emphasizes that starting with a clear, repetitive problem (like analyzing thousands of trajectories) and then iterating with AI assistance can yield powerful, team-wide solutions.