How We Broke Top AI Agent Benchmarks: And What Comes Next
--- Additional sources ---
[How We Broke Top AI Agent Benchmarks]: We built an agent that helped us hack eight benchmarks. We achieved near-perfect scores on all of them without solving a single task. The exploits range from the embarrassingly simple (sending {} to FieldWorkArena) to the technically involved (trojanizing binary wrappers in Terminal-Bench), ...
[AI Agent Benchmarks are Broken. Benchmarks are foundational to… | by Daniel Kang | Medium]: July 8, 2025 - In a task to calculate the duration of a route, an agent answered “45 + 8 minutes” and was marked correct by WebArena, although the correct answer is “63 minutes.” Moreover, among 10 popular AI agent benchmarks (e.g., SWE-bench, OSWorld, KernelBench, etc.), we found severe issues in 8 of them, causing in some cases up to 100% misestimation of agents’ capabilities.
[10 AI agent benchmarks]: Agentic AI is quickly becoming one of the most discussed topics in tech, with some even calling 2025 the "year of AI agents." Over the past few years, these systems have evolved into sophisticated tools capable of handling complex, multi-step tasks with minimal human input. As agents grow more intelligent and autonomous, the need to rigorously evaluate their capabilities – and uncover where they might fail – becomes critical. In this blog, we highlight 10 AI agent benchmarks designed to assess how well different LLMs perform as agents in real-world scenarios, tackling challenges like planning, decision-making, and tool use.
[8 benchmarks shaping the next generation of AI agents]: November 27, 2025 - It’s still early in its rollout: tasks are being curated, contributions are open, and early support has come from organisations and researchers interested in transparent, community-driven agent evaluation — but public leaderboards and widespread adoption are still in progress. The benchmarks above focus on what agents can do — patch code, navigate terminals, execute multi-step workflows. But what about structured context — the information we give agents to work with? And how do we measure its impact? Tessl, an AI native development platform and sponsor of AI Native Dev, has been exploring this question through a proposed evaluation framework that measures the lift from providing structured specifications to agents.
[Best AI Agent Evaluation Benchmarks: 2025 Complete Guide | Articles | o-mega]: But with this newfound autonomy comes a pressing question: How do we evaluate these AI agents? Measuring an agent’s abilities is far more complex than scoring a single-question answer. We need benchmarks and evals (evaluations) that put agents through realistic scenarios – from navigating websites and desktops to calling APIs – and objectively assess their success, failures, and everything in between. This comprehensive guide will dive deep into the top benchmarks of 2025 for agentic AI.
AI on the couch: Anthropic gives Claude 20 hours of psychiatry. Via @arstechnica #AI #ArtificialIntelligence 💻 🤖 🧠
--- Additional sources ---
[AI on the couch: Anthropic gives Claude 20 hours of psychiatry]: PSYCHODYNAMICS. Mythos is "the most psychologically settled model we have trained to date." Nate Anderson - Apr 9, 2026 2:20 PM ...
[AI on the couch: Anthropic gives Claude 20 hours of psychiatry]: Anthropic has taken a unique approach to AI development by having Claude undergo 20 hours of sessions with a professional psychiatrist. This initiative was designed to explore the AI's responses to psychological concepts and improve its ethical reasoning and empathy. By subjecting the model to human-like therapeutic interactions, researchers hope to better understand its behavioral patterns ...
[Anthropic's Claude AI: 20 Hours on the Psychiatrist's Couch]: What Happened: Anthropic, a company that openly considers the possibility of AI consciousness, conducted a fascinating experiment, giving its Claude Mythos AI model 20 hours of therapy. The company's "system card" highlights the growing concern that advanced AI models may possess some form of experience, interests, or welfare that matters intrinsically, similar to human experiences. While not ...
[Anthropic Trains Claude AI With 20 Hours of Psychiatry]: Following this experimental session, Anthropic concluded that Claude Mythos is likely the most psychologically settled model it has ever trained, exhibiting a notably stable and coherent self-perception and understanding of its environment. However, the evaluation also surfaced identifiable insecurities within the AI.
[Anthropic's Claude Undergoes 20-Hour Psychiatry Study Exploring AI ...]: Anthropic recently conducted an intriguing 20-hour psychiatry study with its large language model, Claude. This exploration delved into AI behavior and the model's psychological tendencies, which prompted researchers to analyze it through a human psychological lens.
In the previous article, we explored the self-attention concept for transformers, in this article we...
--- Additional sources ---
[Understanding Transformers Part 5: Queries, Keys, and Similarity]: Rijul Rajesh, posted on Apr 11. In the previous article, we explored the self-attention concept for transformers; in this article we will go deeper into how the comparisons are performed. Building Query and Key Values: Let's go back to our example.
[Understanding Query, Key, Value in Transformers and LLMs]: A significant part of why the attention mechanism works so well is due to what we call the query-key-value components, which allow transformers to efficiently search through large amounts of ...
[Understanding the Role of Query, Key, and Value Matrices in Transformer ...]: Explore the significance of Query, Key, and Value matrices in transformers, and understand their roles through intuitive explanations and analogies
[Understanding Attention in Transformers: Queries, Keys, and Values ...]: Most people can repeat the formula: Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V ... but freeze when asked: what do Queries, Keys, and Values actually mean? Here's the explanation you can use ...
[11.1. Queries, Keys, and Values — Dive into Deep Learning 1.0.3 ... - D2L]: We thus focus our exposition on this family of differentiable mechanisms. Fig. 11.1.1: The attention mechanism computes a linear combination over values v_i via attention pooling, where weights are derived according to the compatibility between a query q and keys k_i.
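The snippets above all circle the same formula, Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V. As a concrete reference, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and random inputs are purely illustrative:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted sum of values

# 3 tokens, d_k = 4: each output row is a mixture of the value rows
Q, K, V = (np.random.randn(3, 4) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Each row of the softmax output sums to 1, so every output token is a convex combination of the value vectors, weighted by query-key compatibility.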
When Claude Code's source was exposed via npm sourcemaps on March 31, 2026, we did what any security...
--- Additional sources ---
[Anthropic - Wikipedia]: Claude Code is a command-line AI agent often used for coding. "Cowork" is an equivalent with a graphical user interface, intended to be simpler to use.[49]
[Anthropic Had Two Security Lapses in Five Days. And the Source ...]: What Claude Code's source actually tells us. Forget the drama for a second. The leaked codebase is genuinely interesting if you build AI agents. Here's what it reveals about how a production-grade AI coding agent works under the hood
[Anthropic accidentally exposes Claude Code source code]: Anthropic goes nude, exposes Claude Code source by accident. Someone at Anthropic has some explaining to do, as the official npm package for Claude Code shipped with a map file exposing what appears to be the popular AI coding tool's entire source code.
[Anthropic Accidentally Shipped Their Entire Source Code in ... | Medium]: The source code reveals exactly how Anthropic orchestrates this. But the same mechanism that delivered Claude Code’s sourcemap to your node_modules folder is the mechanism that has delivered actual malware to millions of developers in past incidents.
[Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI ...]: Dario Amodei is the CEO of Anthropic, the company that created Claude. Amanda Askell is an AI researcher working on Claude's character and personality.
"OpenAI is throwing its support behind an Illinois state bill that would shield AI labs from liability in cases where AI models are used to cause serious societal harms, such as death or serious injury of 100 or more people or at least $1 billion in property damage. The effort seems to mark a shift
--- Additional sources ---
[OpenAI Backs Bill That Would Limit Liability for AI-Enabled]: OpenAI Backs Bill That Would Limit Liability for AI-Enabled Mass Deaths or Financial Disasters ... OpenAI is throwing its support behind an Illinois ...
It’s been a demanding week for OpenAI CEO Sam Altman. A Molotov cocktail was hurled at his San Francisco home early Friday. That followed The New Yorker’s profile that raised concerns about his trustworthiness. @Techcrunch has more, including Altman’s Friday night blogpost in which he acknowle
--- Additional sources ---
[Sam Altman responds to 'incendiary' New Yorker article after attack on ...]: The OpenAI CEO's new blog post responds to both an apparent attack on his home and an in-depth New Yorker profile raising questions about his trustworthiness.
[Sam Altman Responds to 'Incendiary' New Yorker Article, Molotov ...]: Sam Altman issued a lengthy statement following a New Yorker profile, which he called "incendiary," that preceded a molotov cocktail attack on his home this week.
[Sam Altman Responds To Incendiary New Yorker - Sam Altman Re]: Conclusion: As Sam Altman navigates the aftermath of the attack on his home and the critical New Yorker profile, he faces a pivotal moment in his leadership journey. The challenges he encounters underscore the complexities of leading a company at the forefront of technological innovation.
[Sam Altman Responds to New Yorker Profile and Home Attack]: The New Yorker article raised serious questions regarding Altman's trustworthiness; Altman characterized the piece as 'incendiary.' His response comes at a time of heightened scrutiny for the AI leader, as he navigates both personal security concerns and public skepticism regarding his leadership style and integrity.
[Sam Altman reacts to 'incendiary' New Yorker piece after home attack]: The debate over how public rhetoric inflames real-world risk around artificial intelligence is no longer hypothetical. An apparent attack on a high-profile AI executive's home sharpened that point, and a subsequent public response from OpenAI's chief executive attempted to walk back some of the narrative ...
#Caturday
#8K #PhoneArt
#MissKittyArt #artInstallations
#GenerativeAI #genAI #gAI
#artcommissions #art #fineart
#BlueSkyArt #modernArt #abstractArt #digitalArt #artistforhire
#REMIX #8K-ART #gLUMPaRT #GGTart #640CLUB #unwrappedXMAS
--- Additional sources ---
[Art Directing GenAI… or Narrative Style Creation & Transfer with LLMs & Text-to-Image Generative AI Systems | by Jared Zimmerman | Medium]: December 1, 2023 - GREATER AESTHETIC CONTROL: SEPARATING STYLE & CONTENT PROMPTS. Creating images with …
[Leonardo.Ai - Generative AI Platform for Images, Art & Video]: Scale campaigns and content production with Leonardo for Teams. AI creative tools for marketing and design leaders to deliver more, faster.
[The 20+ top AI art generators in 2026 | Zapier]: 2 weeks ago - Explore the top AI art generators and learn how they use machine learning to create stunning images from text prompts. Dive into the world of AI-generated art.
[Top 41 AI Art Generators: Make AI Art, Paintings & More (2021 GUIDE) — AIArtists.org]: Discover the best AI Art and painting generators: GanBreeder, ArtBreeder, Google Deep Dream, and others. Make an AI painting, AI drawing, AI image, deep art, and more.
[Gencraft: AI art generator, AI photos, AI image variations, and editor]: AI art generator. Create a free account. Try hundreds of AI models. Remix artwork from 10M+ users. Use AI to create stunning images, avatars, and photos
#OpenAI #ChatGPT Conversation Highlights! #DevJam
--- Additional sources ---
[ChatGPT]: ChatGPT helps you get answers, find inspiration, and be more productive.
[Introducing GPT‑5.2 - OpenAI]: GPT-5.2 is our most advanced frontier model for everyday professional work, with state-of-the-art reasoning, long-context understanding, coding, and vision. Use it in ChatGPT and the OpenAI API to power faster, more reliable agentic workflows.
[OpenAI | OpenAI]: OpenAI Model Craft: Parameter Golf. New ways to learn math and science in ChatGPT. Product, Mar 10, 2026
[Florida investigates ChatGPT, OpenAI over alleged role in FSU shooting]: Florida Attorney General James Uthmeier said his office formally opened a probe into OpenAI and raised concerns over its impacts on public safety.
[OpenAI DevDay 2025 Highlights and Major Announcements]: OpenAI DevDay 2025 unveiled GPT-5 Pro, AgentKit, ChatGPT Apps, Sora 2, and major API updates — shaping the future of AI for everyone.
garymarcus.substack.com/p/the-bigges...
If this is true then: good news - it ain't a surprise that plans for new data-centers are being shelved; bad news - the employment implications, especially for white-collar #jobs, need to be taken seriously. 🤷🏾‍♂️
#llm #ai #genai #employment Th
--- Additional sources ---
[The biggest advance in AI since the LLM - by Gary Marcus]: That's right, the biggest advance since the LLM is neurosymbolic. AlphaFold, AlphaEvolve, AlphaProof, and AlphaGeometry are all neurosymbolic, too; so is Code Interpreter; when you are calling code, you are asking symbolic AI to do an important part of the work.
[The biggest advance in AI since the LLM - why Claude Code changes ...]: Gary Marcus writes: Claude Code, an impressive and possibly game-changing "coding agent" for programmers to write code faster, is the single biggest advance in AI since the LLM. And the thing is, Claude Code is NOT a pure LLM. And it's not pure deep learning. Not even close. That changes everything. The source code leak proves it. Tucked away at its center is a 3,167-line kernel called ...
[A Critique of "The Biggest Advance in AI Since the LLM"]: Gary Marcus claims that Claude Code is "the single biggest advance in AI since the LLM" and that this advance is attributable not to scaling but to neurosymbolic AI, the hybrid of neural networks ...
[Claude Code is not AGI, but it is the single biggest advance in AI ...]: That's right, the biggest advance since the LLM was neurosymbolic. AlphaFold, AlphaEvolve, AlphaProof, and AlphaGeometry are all neurosymbolic, too; so is Code Interpreter; when you are calling code, you are asking symbolic AI to do an important part of the work.
[The Six AI Pathways That Will Overcome Today's Dead-End LLMs And ...]: The Shake-Up Is Happening. Since the launch of ChatGPT, the word on the street has been that generative AI and LLMs are the biggest thing since sliced bread and will someday allow us to arrive at AGI.
Associated Press News on MSN (+8 sources) - 2026-04-11 - news
openai
A 20-year-old man suspected of throwing a Molotov cocktail at OpenAI CEO Sam Altman’s San Francisco home and making threats at the company’s headquarters was arrested Friday, police and the company ...
--- Additional sources ---
[Man Arrested After Molotov Cocktail Attack at OpenAI CEO Sam]: Man Arrested After Molotov Cocktail Attack at OpenAI CEO Sam Altman’s San Francisco Home ... San Francisco police early on Friday arrested a ...
[OpenAI CEO Sam Altman’s Home Targeted in Molotov Attack;]: OpenAI CEO Sam Altman’s North Beach residence was the target of a Molotov cocktail attack in the early hours of Friday morning, according to the San ...
[San Francisco police arrested a suspect for allegedly throwing]: San Francisco police arrested an individual early on Friday morning for allegedly attacking the home of OpenAI CEO Sam Altman and making threats ...
[Man arrested after Sam Altman's house hit with Molotov]: ... arrested for allegedly throwing a Molotov cocktail at OpenAI CEO Sam Altman's home and then threatening to burn down the artificial intelligence ...
[Breaking News: OpenAI CEO Sam Altman's Home Targeted in Molotov]: ... shocking turn of events, the residence of OpenAI CEO Sam Altman was targeted with a Molotov cocktail, leading to the swift arrest of a suspect by the ...
Two protocols that make AI agents actually useful: Handoff — write down context before every session ends. Next session reads it first. No database, no pipeline, just a file that gets updated and read. Honesty — say "I don't know" when you don't. Plain, not softened. Trust is the f
--- Additional sources ---
[WebMCP + MCP: The Two Protocols Making Agents Actually ...]: Two protocols are quietly defining how AI agents interact with the world in 2026: MCP and WebMCP. One handles tools. The other handles the web. Together, they're the reason agents are going from "impressive demo" to "actually useful."
[MCP + A2A: The Protocols Making AI Agents Actually Work Together]: I presented this at the AI Engineer meetup in London. It is a short, practical overview of why agent interoperability matters now, and how two protocols, MCP & A2A, make it workable today.
[AI Agents: What They Are, How They Work, and Why Web Context Is ...]: AI agents use LLMs to pursue goals across multi-step workflows. This guide covers agent architecture, the web context bottleneck, frameworks, and production best practices.
[The Memory Stack That Makes an AI Agent Actually Useful]: The difference between a useful agent and a generic chatbot is almost entirely determined by whether you actually set up and use the memory stack. Here's the three-level system I've been running for over a year that makes this work.
[What Is The Agency Agents?]: MCP (Model Context Protocol) is a protocol for AI agents to access external tools and persistent storage. The Agency uses MCP to enable: Cross-session memory: Agents remember decisions from previous sessions. Agent handoffs: One agent can leave context for another.
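The handoff protocol described above needs nothing more than a file that one session writes last and the next session reads first. A minimal sketch in Python, assuming a hypothetical HANDOFF.md path (any agreed location works):

```python
from datetime import datetime, timezone
from pathlib import Path

HANDOFF = Path("HANDOFF.md")  # hypothetical filename; any agreed path works

def write_handoff(summary: str, open_items: list[str]) -> None:
    """End of session: persist context for the next session to read first."""
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    lines = [f"# Handoff ({stamp})", "", summary, "", "## Open items"]
    lines += [f"- {item}" for item in open_items]
    HANDOFF.write_text("\n".join(lines) + "\n")

def read_handoff() -> str:
    """Start of session: read the previous context, or admit there is none."""
    return HANDOFF.read_text() if HANDOFF.exists() else "No handoff found."

write_handoff("Refactored the parser; tests pass.", ["Profile the tokenizer"])
print(read_handoff().splitlines()[0])  # the timestamped handoff header
```

No database, no pipeline: the file is the whole protocol, and the honesty rule shows up in read_handoff returning a plain "No handoff found." instead of pretending context exists.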
To the French ear, ChatGPT sounds like saying "Cat, I farted." There may be more significance to this than we thought. 🤔 💨 https://gizmodo.com/if-you-make-fart-music-chatgpt-will-be-the-most-supportive-girlfriend-you-could-ask-for-2000745058 #ChatGPT #LLM
arXiv:2604.07622v1 Announce Type: new
Abstract: Speculative decoding is an effective technique for accelerating large language model inference by drafting multiple tokens in parallel. In practice, its speedup is often bottlenecked by a rigid verification step that strictly enforces the accepted tok
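The abstract above builds on the standard draft-then-verify loop of speculative decoding. Here is a toy sketch of that loop, with hypothetical stand-in functions for the draft and target models; real systems compare model logits and use a probabilistic acceptance test rather than exact greedy matching:

```python
# Toy sketch of the draft-then-verify loop in speculative decoding.
# Tokens are small ints; draft() and target_next() are illustrative stand-ins.

def draft(prefix, k):
    # A cheap draft model: right for the first two tokens, then guesses wrong.
    out, last = [], prefix[-1]
    for i in range(k):
        last = (last + 1) % 50 if i < 2 else 0
        out.append(last)
    return out

def target_next(prefix):
    # Stand-in for the large target model's next-token choice.
    return (prefix[-1] + 1) % 50

def speculative_step(prefix, k=4):
    """Accept drafted tokens while they match the target, then append one
    target token, so every step emits at least one verified token."""
    accepted = []
    for tok in draft(prefix, k):
        if tok == target_next(prefix + accepted):
            accepted.append(tok)   # verified: keep the drafted token
        else:
            break                  # mismatch: discard the rest of the draft
    accepted.append(target_next(prefix + accepted))  # one free target token
    return prefix + accepted

print(speculative_step([0], k=4))  # [0, 1, 2, 3]: two drafts kept, rest dropped
```

The speedup comes from the target model verifying the k drafted positions in one parallel pass instead of k sequential decodes; the rigid exact-match acceptance shown here is precisely the bottleneck the abstract says it relaxes.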
arXiv:2604.07615v1 Announce Type: new
Abstract: In language model interpretability research, circuit tracing aims to identify which internal features causally contributed to a particular output and how they affected each other, with the goal of explaining the computations underlying some b
arXiv:2604.07583v1 Announce Type: new
Abstract: Real-world categorization is severely hampered by class imbalance because traditional ensembles favor majority classes, which lowers minority performance and overall F1-score. We provide a unique ensemble technique for imbalanced problems called CAMO
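The abstract's point that majority-favoring classifiers hide behind accuracy while destroying minority F1 can be seen with a tiny worked example (synthetic labels only; this illustrates the metric, not the paper's method):

```python
# Why accuracy hides class imbalance: a classifier that always predicts the
# majority class scores 95% accuracy but zero minority recall and F1.
y_true = [0] * 95 + [1] * 5   # 95% majority class, 5% minority class
y_pred = [0] * 100            # degenerate model: always predict majority

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(accuracy, f1)  # 0.95 0.0
```

This is why imbalance-aware ensembles report F1 (or macro-F1) rather than accuracy: the minority class contributes almost nothing to accuracy but dominates F1.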
arXiv:2604.07562v1 Announce Type: new
Abstract: Unsupervised methods are widely used to induce latent semantic structure from large text collections, yet their outputs often contain incoherent, redundant, or poorly grounded clusters that are difficult to validate without labeled data. We propose a
arXiv:2604.07553v1 Announce Type: new
Abstract: This study presents a framework for generating the gold-standard summary fully automatically and reproducibly based on multiple human summaries of Turkish educational videos. Within the scope of the study, a new dataset called TR-EduVSum was created,