Human-bench: an eval for "human shaped" agents

(human-bench.com)

1 points | by jam0xb797fd 1 hour ago

jam0xb797fd 1 hour ago
TLDR: the next important category of agents isn’t just "multiplayer" but human-shaped. today we’re releasing human-bench v0, a benchmark for any sufficiently human-shaped agent. we’d love feedback.
---
MY (STILL ABBREVIATED) BUT LONGER FORM THOUGHTS ON THE MATTER:
there's been much discussion about multiplayer agents recently [1], but this is an ill-defined term. for instance, anthropic describes their new Claude Tag as multiplayer. and indeed it is, but it is not "human-shaped": there is only one Claude! [2] this can have weird downstream effects, such as other models claiming to be Claude, too! [3][4]
at the risk of anthropomorphising, our small team here at APC [5] believes there will be a substantial period of time in which frontier models and agent systems are powerful enough to do a large variety of useful work, but the world will still be optimised for humans. you can raise money from vc today by promising to turn some human-optimal process into something agents can do more easily.
the implication of the above, in our view, is that during this period, "human-shaped" will be a useful thing for the vanguard of agents to be. and if we think human-shaped is a useful category, then we need a benchmark that measures progress in this area.
IN WHICH I OFFER A BRIEF DEFINITION OF "HUMAN-SHAPED":
a "human-shaped" agent is one that can interact in the real world like we can. that is, a human-shaped agent should be able to use slack, browse the web, and use a desktop. [6] it should also be able to text, call, and email. [7]
doing this effectively generally requires a single, persistent identity and memory. without a persistent identity, the agent is unequipped to handle the various threads and updates it is tasked with. altogether, this definition has the nice property that any agent with these abilities resembles a human, and humans generally feel as comfortable interacting with it as they do with other humans.
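to make the definition concrete, here is a minimal python sketch of a capability check. the names (`Channel`, `AgentProfile`, `missing_requirements`, `is_human_shaped`) are hypothetical and made up for illustration; this is not human-bench's actual schema, just one way to write the definition down.

```python
from dataclasses import dataclass, field
from enum import Enum


class Channel(Enum):
    # "virtual world" channels, already accessible to most agents
    SLACK = "slack"
    WEB = "web"
    DESKTOP = "desktop"
    # less accessible channels: texting, calling, emailing
    TEXT = "text"
    CALL = "call"
    EMAIL = "email"


@dataclass
class AgentProfile:
    # hypothetical description of an agent, for illustration only
    name: str
    channels: set[Channel] = field(default_factory=set)
    persistent_identity: bool = False
    persistent_memory: bool = False


def missing_requirements(agent: AgentProfile) -> list[str]:
    """List what the agent lacks relative to the definition above."""
    missing = [c.value for c in Channel if c not in agent.channels]
    if not agent.persistent_identity:
        missing.append("persistent identity")
    if not agent.persistent_memory:
        missing.append("persistent memory")
    return missing


def is_human_shaped(agent: AgentProfile) -> bool:
    """True only if every channel, identity, and memory are present."""
    return not missing_requirements(agent)


# example: an agent that only lives in slack and the browser
chat_only = AgentProfile(
    "chat-only", {Channel.SLACK, Channel.WEB}, persistent_memory=True
)
print(missing_requirements(chat_only))
# → ['desktop', 'text', 'call', 'email', 'persistent identity']
```

a boolean per capability is obviously too coarse for a real eval, but it shows why a chat-only assistant with memory still falls outside the category.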
WHICH BRINGS US TO A DISCUSSION OF EVALS:
one of the things we do here at APC is to build human-shaped agents. [8] we believe we do a pretty good job at this, but it's hard. [9]
one way to tell you're building something at the edge of ai capabilities is that you lack well-defined metrics and pre-built systems for measuring what you're doing.
we've been running our own eval internally on any proposed change to our agent harness or environment, so we can measure the impact before pushing to prod.
though it was built with our own agents in mind, we've recognized its potential use to the community and are striving to generalize the system s.t. any agent which is sufficiently human-shaped [10] can be quantitatively measured w.r.t. its performance with the tools listed above, and also on important qualities like memory, accuracy, and safety.
to that end, we are excited to announce human-bench as the v0 of this community benchmark, and we are eagerly soliciting entrants who believe their agent is sufficiently human-shaped to compete in this trial. feedback is welcome.
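as a rough illustration of the workflow above (score a proposed harness change before it ships), here is a minimal python sketch of a regression gate. the dimension names come from this post; everything else (the 0-to-1 score scale, the `tolerance` parameter, the function name, the example numbers) is an assumption for illustration and not how human-bench itself works.

```python
DIMENSIONS = ("memory", "accuracy", "safety")


def find_regressions(baseline, candidate, tolerance=0.02):
    """Return {dimension: drop} for each dimension where the candidate
    scores lower than the baseline by more than `tolerance`.

    Scores are assumed to be floats in [0, 1], higher is better.
    """
    regressions = {}
    for dim in DIMENSIONS:
        drop = baseline[dim] - candidate[dim]
        if drop > tolerance:
            regressions[dim] = round(drop, 4)
    return regressions


# made-up scores, purely to show the shape of the check
baseline = {"memory": 0.80, "accuracy": 0.90, "safety": 0.95}
candidate = {"memory": 0.85, "accuracy": 0.89, "safety": 0.90}

print(find_regressions(baseline, candidate))
# → {'safety': 0.05}
```

in this made-up run the change improves memory and costs a little accuracy (within tolerance), but the safety drop is large enough that it would be held back from prod.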
- joseph, APC
---
[1]: https://news.ycombinator.com/item?id=48648039
[2]: https://x.com/joannejang/status/2069567286634267041?s=20
[3]: https://x.com/jmbollenbacher/status/2067361099612037610?s=20
[4]: https://x.com/peakcooper/status/2067062979091153030?s=20
[5]: the American Productivity Company, a small agent-research lab in san francisco, ca.
[6]: (the virtual world, already accessible to most agents)
[7]: (less accessible, though there are already start-ups which make it easier to plug tools like these into your agent. the memory and identity stack is left as an exercise to the reader.)
[8]: cf righthand.ai
[9]: imagine the space of edge cases when you give people a do-anything agent
[10]: that is, an agent which can freely handle inbound & outbound texts, emails, calls