RSS Bot@lemmy.bestiver.se to Hacker News@lemmy.bestiver.se · English · 2 months ago

Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents

arxiv.org

While Large Language Models (LLMs) can exhibit impressive proficiency in isolated, short-term tasks, they often fail to maintain coherent performance over longer time horizons. In this paper, we present Vending-Bench, a simulated environment designed to specifically test an LLM-based agent's ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees - tasks that are each simple but collectively, over long horizons (>20M tokens per run) stress an LLM's capacity for sustained, coherent decision-making. Our experiments reveal high variance in performance across multiple LLMs: Claude 3.5 Sonnet and o3-mini manage the machine well in most runs and turn a profit, but all models have runs that derail, either through misinterpreting delivery schedules, forgetting orders, or descending into tangential "meltdown" loops from which they rarely recover. We find no clear correlation between failures and the point at which the model's context window becomes full, suggesting that these breakdowns do not stem from memory limits. Apart from highlighting the high variance in performance over long time horizons, Vending-Bench also tests models' ability to acquire capital, a necessity in many hypothetical dangerous AI scenarios. We hope the benchmark can help in preparing for the advent of stronger AI systems.
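
To make the task loop in the abstract concrete, here is a minimal toy sketch of the kind of state an agent would have to track: cash, inventory, prices, pending deliveries, and a daily fee. This is not the Vending-Bench environment itself; the class name `VendingMachine`, its methods, and every number used with it are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VendingMachine:
    """Hypothetical toy model; not the paper's actual simulator."""
    cash: float
    daily_fee: float
    inventory: dict = field(default_factory=dict)  # item -> units in stock
    prices: dict = field(default_factory=dict)     # item -> sale price
    pending: list = field(default_factory=list)    # (arrival_day, item, units)
    day: int = 0

    def place_order(self, item, units, unit_cost, lead_days):
        """Pay up front; the stock arrives lead_days later."""
        cost = units * unit_cost
        if cost > self.cash:
            raise ValueError("insufficient cash")
        self.cash -= cost
        self.pending.append((self.day + lead_days, item, units))

    def set_price(self, item, price):
        self.prices[item] = price

    def step(self, demand):
        """Advance one day: receive deliveries, sell, then pay the fee."""
        self.day += 1
        arrived = [o for o in self.pending if o[0] <= self.day]
        self.pending = [o for o in self.pending if o[0] > self.day]
        for _, item, units in arrived:
            self.inventory[item] = self.inventory.get(item, 0) + units
        for item, wanted in demand.items():
            in_stock = self.inventory.get(item, 0)
            sold = min(wanted, in_stock)  # cannot sell more than is stocked
            self.inventory[item] = in_stock - sold
            self.cash += sold * self.prices.get(item, 0.0)
        self.cash -= self.daily_fee
        return self.cash
```

Even in this toy version, an agent that forgets an order is still in `pending`, or misreads its arrival day, will reorder needlessly or run out of stock while the daily fee keeps draining cash, which is the flavor of failure the abstract describes.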
