AI News

901

DeepSeek trials sparse attention to cut AI processing costs

Mastodon +13 sources mastodon
deepseek
DeepSeek has announced that it is field‑testing a new “fine‑grained sparse attention” mechanism that, according to the company, halves the cost of its public API for long‑form inputs. The technique, a long‑standing research idea that trims the number of token‑to‑token interactions during inference, has been re‑engineered by DeepSeek to apply dynamically at a much more granular level than earlier sparse‑transformer models. Early benchmarks shared on Hugging Face show a 60‑75% reduction in compute time for sequences longer than 2k tokens, and the firm has already lowered its pricing for the affected endpoint by roughly 50%. The move matters because inference cost remains the biggest barrier to widespread deployment of large language models. Google’s recent KV‑cache compression and TurboQuant algorithms cut memory and compute expenses dramatically, but they still rely on dense attention for full‑length context. DeepSeek’s approach promises comparable savings without sacrificing the quality of long‑range dependencies, potentially democratising access to high‑capacity models for startups, researchers and enterprises that previously could not afford the per‑token fees. As we reported on 25 March, DeepSeek hired 17 specialists to integrate its DeerFlow 2.0 framework, signalling a broader push to optimise both training and serving pipelines. The sparse‑attention trial is the latest step in that strategy. What to watch next: DeepSeek plans to release a production‑ready version of the model by Q3, accompanied by a peer‑reviewed paper detailing the algorithmic innovations. Industry observers will be keen to see independent benchmark suites, how cloud providers price the new endpoint, and whether rivals such as OpenAI or Anthropic accelerate their own sparsity research in response. The outcome could reshape the economics of AI services across the Nordic tech ecosystem and beyond.
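To make the idea of trimming token‑to‑token interactions concrete, here is a minimal sketch of one common form of sparse attention, in which each query attends only to its k highest‑scoring keys. The top‑k selection rule and every function name are assumptions for illustration; the article does not describe how DeepSeek's mechanism chooses tokens.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sparse_attention(query, keys, values, k):
    """Attend to only the k highest-scoring keys instead of all of them.

    Assumption: top-k selection by scaled dot product. This toy version
    still scores every key to pick the top k; a real system would need a
    cheaper selector for the savings to materialise.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, key) * scale for key in keys]
    top = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([scores[i] for i in top])
    out = [0.0] * len(values[0])
    for w, i in zip(weights, top):
        for d, v in enumerate(values[i]):
            out[d] += w * v
    return out, sorted(top)


def pair_counts(n, k):
    """Query-key pairs in the weighted sum: dense (n * n) vs. k keys per query."""
    return n * n, n * min(k, n)
```

Under these assumptions, `pair_counts(4096, 512)` shows the weighted‑sum step touching one eighth as many pairs as dense attention. Setting `k` to the number of keys recovers ordinary dense attention.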
449

GitHub Copilot Revises Interaction Data Use Policy

HN +11 sources hn
copilotprivacy
GitHub has announced that, from 24 April, the interaction data generated by all Copilot users – free, Pro and Pro+ – will be automatically fed into the training pipelines of its AI models unless the user explicitly opts out. The data set includes the prompts users type, the code snippets Copilot returns, and the surrounding context such as file names, language settings and session identifiers. The change follows a March 25 notice that reframed the policy from an opt‑in to an opt‑out model, aligning the service with the data‑driven approach Microsoft has taken for its broader AI portfolio. The shift matters for developers who rely on Copilot for daily coding tasks because it expands the pool of real‑world code that powers future model improvements, potentially accelerating feature rollout and accuracy gains. At the same time, it raises privacy concerns: corporate codebases, proprietary algorithms and confidential logic could be exposed to a training corpus that is later used to generate suggestions for other customers. GitHub’s updated privacy documentation stresses that data is stored temporarily, stripped of personally identifiable information where possible, and protected under Microsoft’s enterprise‑grade security standards. Nevertheless, the default inclusion may prompt enterprises to review internal policies, adjust licensing agreements, or adopt the new opt‑out workflow to safeguard sensitive assets. What to watch next is how the developer community reacts. Early signals include a surge in opt‑out requests and discussions on platforms such as Stack Overflow and the DEV community about compliance implications. GitHub’s product team has pledged to publish a transparency dashboard showing the volume of data ingested and the impact on model performance. 
Follow‑up updates are expected in the Copilot changelog and in Microsoft’s broader AI governance reports, which will reveal whether the policy drives measurable quality gains or triggers regulatory scrutiny in the EU and other jurisdictions.
259

New Plain-Text Cognitive Architecture for Claude Code Debuts on Show HN

HN +7 sources hn
claude
A GitHub repository posted under the name cog has sparked a fresh round of discussion on Hacker News, where the author describes it as “a plain‑text cognitive architecture for Claude Code.” The project bundles a set of Unix‑style tools—grep, find, git diff—and a lightweight folder layout that lets Claude Code treat its own memory as searchable text. By persisting prompts, reflections and execution logs in markdown files, the model can retrieve past reasoning, perform self‑reflection and even project “foresight” steps before writing new code. The author demonstrates the workflow with a typical debugging session: Claude recalls a prior design decision, surfaces related files, and adjusts its plan without a fresh prompt. Why this matters is twofold. First, Claude Code, Anthropic’s answer to GitHub Copilot, has already shown a growing footprint in the open‑source world; as we reported on 25 March, it ranked as the third‑largest contributor across public repositories and a new “auto mode” was unveiled the same day. The plain‑text architecture tackles a lingering limitation of many AI coding assistants: the lack of durable, searchable context that survives across sessions. By leveraging tools developers already know, the approach lowers the barrier to building “second‑brain” knowledge bases that can be version‑controlled, audited and shared. Second, the design aligns with a broader shift toward agentic, self‑organising AI workflows, echoing recent plugins such as Ars Contexta that generate personalized knowledge vaults from conversation. What to watch next includes whether Anthropic adopts or officially supports a similar memory layer, and how the community measures its impact on code quality and developer speed. Benchmarks comparing Claude Code with and without the cog architecture are likely to appear, as are security reviews of persisting AI‑generated artifacts in plain text. 
If the model can reliably reason over its own history, the next wave of AI‑assisted development could move from single‑prompt bursts to continuous, context‑rich collaboration.
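The core mechanism described above, memory kept as plain markdown files and searched with grep‑style tools, can be sketched in a few lines. The folder layout, file names and function names below are hypothetical and are not taken from the cog repository.

```python
import tempfile
from pathlib import Path


def remember(memory_dir, name, text):
    """Persist a note as a markdown file (hypothetical layout, not cog's)."""
    path = Path(memory_dir) / f"{name}.md"
    path.write_text(text, encoding="utf-8")
    return path


def recall(memory_dir, term):
    """Grep-style search: return (file name, line number, line) for each hit."""
    hits = []
    for path in sorted(Path(memory_dir).glob("*.md")):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if term.lower() in line.lower():
                hits.append((path.name, number, line.strip()))
    return hits


# Demo in a throwaway directory: two notes, one search.
with tempfile.TemporaryDirectory() as tmp:
    remember(tmp, "design-notes", "# Design\nDecision: use SQLite for the cache.\n")
    remember(tmp, "debug-log", "# Debug log\nCache misses traced to a stale index.\n")
    found = recall(tmp, "cache")
```

Because the notes are ordinary files, the same directory can be put under git and inspected with `git diff`, which is the auditability benefit the article points to.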
248

Apple Leverages Google's Gemini to Build Smaller On‑Device AI Models

Mastodon +11 sources mastodon
applegeminigoogle
Apple has secured “complete access” to Google’s Gemini large‑language model inside Google’s own data centres, and is using that privilege to distill far smaller, on‑device versions for its products. The process—known as model distillation—feeds Gemini’s outputs and internal reasoning into a training pipeline that yields compact models capable of running on iPhone, iPad and other Apple hardware without a network connection. The move matters because it gives Apple a shortcut to Gemini‑level performance while sidestepping the massive compute and memory footprints that typically accompany such models. On‑device AI can answer queries, translate speech and power context‑aware features with millisecond latency, lower battery drain and, crucially, keep user data out of the cloud. Apple’s ability to create proprietary derivatives also expands its control over the Siri experience, a point hinted at in our March 25 report that Apple may give Siri a “big AI overhaul” in iOS 27. Distilling Gemini could accelerate Apple’s rollout of offline Siri functions, improve privacy‑first features in iOS 27 and bolster the company’s broader AI‑first narrative that pits its custom silicon against Nvidia’s H100‑based solutions highlighted in Google’s TurboQuant announcement earlier this month. It also deepens the strategic partnership between the two rivals, showing that Google is willing to share core model assets in exchange for Apple’s hardware expertise and market reach. What to watch next: Apple has not disclosed a timeline, but integration is likely to appear in a beta of iOS 27 later this year. Developers will be keen to see whether Apple opens the distilled models through its Core ML framework, and regulators may scrutinise the data‑center access arrangement for antitrust implications. Benchmarks comparing the new on‑device models with the original Gemini and with Apple’s own internal models will provide the first concrete gauge of performance and privacy gains.
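For readers unfamiliar with distillation, the sketch below shows the textbook soft‑label objective: the student is trained to match the teacher's temperature‑softened output distribution. It assumes nothing about Apple's actual pipeline, which has not been disclosed; the function names and the temperature value are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with a temperature; higher values flatten the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) over temperature-softened distributions.

    The loss is zero when the student reproduces the teacher exactly and
    grows as the two distributions diverge.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In a real training loop this term would be minimised over many examples, usually alongside a standard loss on ground‑truth labels.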
210

90% of Claude‑generated code lands in GitHub projects with under two stars

HN +11 sources hn
autonomousclaude
Anthropic’s Claude has been churning out code on GitHub at a pace that rivals Copilot, but a fresh analysis reveals that roughly nine in ten of those contributions land in repositories with fewer than two stars. The study, compiled from public commit metadata, cross‑referenced Claude‑tagged pushes with repository popularity metrics and found that the overwhelming majority of Claude‑generated files reside in barely noticed projects. As we reported on March 24, Claude Code logged more than 19 million commits across the platform, positioning the model as a major source of AI‑assisted contributions. The new star‑distribution data, however, suggests that the bulk of that activity is confined to personal experiments, hobby scripts, or early‑stage prototypes rather than widely used libraries. For developers, the finding raises questions about the practical impact of Claude‑driven code: low‑star projects often lack rigorous review, testing, or community vetting, which can amplify the risk of bugs, security flaws, or licensing mismatches when the code is later reused. The pattern also matters for the broader open‑source ecosystem. If AI‑generated code proliferates in obscure repos, it may inflate the apparent volume of contributions without delivering real value, potentially skewing metrics that funders and maintainers rely on. Conversely, the concentration of Claude output in niche spaces could indicate a fertile ground for rapid prototyping, where developers experiment before graduating successful components to higher‑visibility projects. What to watch next: Anthropic has not yet commented, but a response (whether tightening integration guidelines, improving attribution, or offering quality‑scoring tools) could reshape how developers leverage Claude. GitHub’s security and licensing scanners may also adapt to flag AI‑originated code in low‑star repos.
Industry observers will be tracking whether future updates to Claude’s prompting ecosystem, such as the “Claude‑Code” skill set, shift the distribution toward more reputable repositories.
160

PLDR-LLMs Reach Self-Organized Criticality in Reasoning

ArXiv +11 sources arxiv
inferencereasoning
Researchers led by Burc Gökden have unveiled a new class of large language models—dubbed PLDR‑LLMs—that display emergent reasoning when pretrained at the edge of self‑organized criticality (SOC). The findings, posted on arXiv (2603.23539v1) and accompanied by a public PyTorch implementation, show that the models’ deductive outputs at criticality follow statistical patterns akin to second‑order phase transitions, with correlation lengths that diverge in a manner reminiscent of physical systems poised between order and chaos. The breakthrough hinges on the PLDR architecture, which encodes decoder representations as power‑law distributions and augments the standard KV‑cache with a “G‑cache” that preserves long‑range dependencies. By steering the pretraining dynamics toward SOC—a regime where a system naturally tunes itself to a critical point without external parameters—the authors report that the model’s inference trajectories become highly sensitive to input perturbations yet remain globally stable. In practice, this translates to sharper logical chains, fewer hallucinations, and a measurable improvement in tasks that require multi‑step deduction. Why the result matters extends beyond a single performance bump. It offers a concrete bridge between statistical physics and deep learning, suggesting that the elusive “reasoning” exhibited by modern LLMs may emerge from universal critical phenomena rather than bespoke architectural tricks. If SOC‑based training can be scaled, it could provide a principled pathway to more interpretable, energy‑efficient models that retain robustness under distribution shift—attributes prized for high‑stakes applications in finance, healthcare, and autonomous systems. The community will now watch for replication studies and larger‑scale experiments that test PLDR‑LLMs on benchmark suites such as BIG‑Bench and MMLU. 
Follow‑up work is expected to probe how the DAG‑loss regularizer, introduced in earlier PLDR research, interacts with criticality, and whether hybrid training schedules can combine SOC with conventional supervised fine‑tuning. A forthcoming workshop on “Criticality in AI” at NeurIPS 2026 promises to gather physicists and AI engineers to explore whether the edge of chaos can become a standard design principle for the next generation of reasoning machines.
158

Data Centers Used as Acoustic Weapons

Mastodon +9 sources mastodon
A wave of low‑frequency noise is turning modern data farms into unexpected health hazards. Researchers and community activists have documented that the massive banks of servers powering everything from cloud services to AI training emit infrasound—vibrations below 20 Hz that humans cannot hear but can feel. Field recordings from facilities such as Elon Musk’s “Colossus” hub in Memphis show pressure levels comparable to those generated by wind turbines, yet the sound penetrates walls and travels farther because low frequencies are less easily dampened. The phenomenon matters because infrasound can interfere with the vestibular system in the inner ear, causing nausea, headaches, fatigue and, in extreme cases, disorientation or seizures. Residents living within a kilometre of several European and North‑American data centres have reported a rise in such symptoms, prompting local health officials to launch preliminary epidemiological surveys. The issue also raises legal questions: if a data centre’s acoustic output can be classified as a “weapon” under occupational‑health statutes, operators could face liability for “environmental nuisance” or even assault. What to watch next is a rapidly evolving regulatory and technical response. The European Union’s upcoming “Low‑Frequency Emissions Directive” is expected to set exposure limits for industrial sites, while the U.S. Occupational Safety and Health Administration is reviewing whether existing noise standards should be extended to the infrasonic range. Engineers are already testing mitigation strategies, from tuned acoustic panels to active vibration cancellation systems, and several start‑ups are marketing “infrasound monitors” for community use. The next few months will likely see pilot mitigation projects, court filings from affected neighbourhoods, and, if the pressure builds, a broader industry push to redesign cooling and power‑distribution architectures to curb the invisible roar of the digital age.
150

5 Fixes to Make Your API Ready for AI Agents

Dev.to +9 sources dev.to
agents
A new technical guide released this week warns that most public APIs were built for human developers, not for the autonomous AI agents that are now surfacing in enterprise workflows. The paper, titled “Your API Wasn’t Designed for AI Agents. Here Are 5 Fixes,” outlines five concrete patterns—aggressive retries, literal error parsing, unconfirmed chaining, opaque authentication flows, and missing context metadata—that cause agents to stall, generate hallucinations, or even trigger denial‑of‑service loops. The timing is significant. As we reported on March 25, AI agents can be hijacked with just three lines of JSON, and Claude Code now runs code on a user’s machine to complete tasks. Those stories exposed how agents treat APIs as raw contracts, bypassing the safety nets that human developers normally rely on. The new guide flips the script, showing API providers how to retrofit OpenAPI specifications, emit structured error objects, adopt OAuth 2.0 scopes that agents can negotiate, embed hypermedia controls (HATEOAS), and publish version‑aligned context plugins that feed directly into IDEs. Early experiments cited by apimatic.io claim that applying these fixes halves integration time, cuts token usage by almost half, and reduces hallucination rates to near zero. What this means for the Nordic AI ecosystem is twofold. First, companies that expose data or services through REST endpoints must treat AI agents as first‑class consumers or risk losing efficiency and security. Second, developers of AI‑driven automation platforms will gain a clearer checklist for vetting third‑party APIs, potentially accelerating adoption in sectors such as fintech, healthtech, and logistics. Watch for standards bodies to codify “agent‑ready” API profiles in the coming months, and for major cloud providers to roll out validation tools that flag non‑compliant endpoints. The next wave of AI‑augmented services will likely hinge on whether APIs can keep pace with autonomous agents’ expectations.
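Two of the failure modes named above, aggressive retries and literal error parsing, are easier to avoid when an API returns machine‑readable errors. The sketch below shows one possible shape for such an error body and how an agent might decide whether to retry. The field names (`retryable`, `retry_after_s`) are illustrative assumptions and are not taken from the guide.

```python
import json


def error_response(status, code, message, retryable, retry_after_s=None):
    """Build a machine-readable error body (field names are illustrative)."""
    body = {"error": {"code": code, "message": message, "retryable": retryable}}
    if retry_after_s is not None:
        body["error"]["retry_after_s"] = retry_after_s
    return status, json.dumps(body)


def agent_should_retry(raw_body, attempt, max_attempts=3):
    """Return the wait in seconds before retrying, or None to stop.

    The agent reads explicit fields instead of guessing from prose, and
    falls back to exponential backoff when the server gives no hint.
    """
    error = json.loads(raw_body).get("error", {})
    if not error.get("retryable") or attempt >= max_attempts:
        return None
    return error.get("retry_after_s", 2 ** attempt)
```

With an explicit `retryable` flag and a cap on attempts, a validation error stops the agent at once instead of feeding a retry loop.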
142

LLMs Can't Grade Essays Like Humans

ArXiv +11 sources arxiv
A new arXiv pre‑print (2603.23714v1) shows that large language models (LLMs) still fall short of human graders when scoring essays. The authors compared raw LLM scores against human marks across a multilingual test set and found systematic mismatches: short or under‑developed responses that hit the prompt are consistently overrated, while well‑crafted essays are penalised for minor language slips. The models appear to apply a literal, rubric‑free logic rather than the nuanced judgment humans use. The study joins a growing body of work that probes AI’s role in assessment. Earlier research on German student essays reported similar gaps between open‑source and proprietary LLMs and human raters, highlighting both the promise of multidimensional evaluation and the danger of hidden bias. A separate analysis of scoring processes underscored that, unlike human grading which follows explicit rubrics, LLMs generate scores from opaque internal patterns that are difficult to audit. Why it matters now is twofold. First, educational technology firms are courting schools and testing agencies with “AI‑graded” solutions, touting speed and cost savings. If the underlying models misjudge brevity or penalise stylistic variance, students could be unfairly advantaged or disadvantaged, eroding trust in digital assessment. Second, the findings raise regulatory questions: many jurisdictions are drafting standards for algorithmic transparency in education, and this paper provides concrete evidence that current LLMs may not meet those thresholds. What to watch next includes efforts to fine‑tune LLMs on domain‑specific rubrics, the emergence of hybrid human‑AI grading pipelines, and policy debates at upcoming education conferences. Industry players are likely to release updated models that claim rubric alignment, while researchers will test whether those claims hold up under the same rigorous cross‑human comparison. 
The next few months will reveal whether AI can move from “fast but fuzzy” to a reliable partner in essay evaluation.
133

Robust TypeScript Tool Uses LLMs to Extract Website Data

HN +12 sources hn
A new open‑source library for extracting web content with large language models (LLMs) has hit Hacker News, drawing immediate attention from developers building data‑driven AI products. The project, posted by Anirudh Kumar in the GitHub repository *lightfeed/extractor*, delivers a TypeScript‑based “Robust LLM Extractor” that turns raw HTML into clean, LLM‑friendly markdown and can return structured data using either Gemini 2.5 Flash or GPT‑4o mini. The extractor works by coupling browser automation with LLM prompting: a headless Chromium instance loads a page, the tool strips away boilerplate and feeds the main article to the chosen model, and the model then outputs either plain markdown or JSON that conforms to a developer‑defined schema. An optional “main‑content only” mode reduces token usage dramatically, making large‑scale crawls affordable. The repository also bundles utilities for captcha solving, geotargeting and AI‑enriched metadata, echoing Lightfeed’s broader platform for building intelligence databases at scale. Why it matters is twofold. First, it lowers the technical barrier for Nordic startups and research teams that need reliable web‑scraping pipelines without maintaining bespoke parsers for each site. Second, by leveraging the cost‑effective GPT‑4o mini and Gemini Flash, the tool promises near‑real‑time extraction while keeping cloud‑compute bills in check, a critical factor for small‑scale AI ventures operating on tight budgets. The community will be watching how quickly the library integrates with existing TypeScript NLP ecosystems such as Unstract’s no‑code ETL framework or the emerging “document annotation” tools on LibHunt. Early adopters are already testing it on news aggregation, market‑research feeds and compliance monitoring. Upcoming milestones include support for multi‑language models, a plug‑in for Azure Functions, and a public benchmark comparing extraction accuracy against traditional scrapers.
If the project gains traction, it could become a de‑facto standard for LLM‑driven web data pipelines across the Nordic AI landscape.
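The boilerplate‑stripping step is what keeps token usage down, and its general shape can be sketched with the standard library alone. This is not the library's implementation (which is written in TypeScript); the class name, the set of skipped tags and the heading handling are assumptions chosen for brevity.

```python
from html.parser import HTMLParser


class MainContentExtractor(HTMLParser):
    """Collect text outside navigation, header, footer and script regions."""

    SKIP = {"nav", "header", "footer", "aside", "script", "style"}
    HEADINGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped region
        self.prefix = ""      # markdown heading marker for the next text chunk
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.HEADINGS:
            self.prefix = "#" * int(tag[1]) + " "

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.HEADINGS:
            self.prefix = ""

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and self.skip_depth == 0:
            self.lines.append(self.prefix + text)
            self.prefix = ""


def html_to_markdown(html):
    """Return a rough markdown rendering of the page's main text."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines)
```

A production converter would also need to handle inline markup, links, lists and tables, which this sketch splits or ignores.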
114

Malicious LiteLLM variants linked to TeamPCP supply‑chain breach

Mastodon +11 sources mastodon
A malicious update of the open‑source Python library LiteLLM has been traced to the notorious TeamPCP threat group, marking the latest high‑profile supply‑chain breach in the AI tooling ecosystem. On 24 March 2026 the attackers published two compromised versions of LiteLLM – 1.82.7 and 1.82.8 – on the official PyPI repository. Both packages embed a hidden .pth file that executes on every Python interpreter start, installing a three‑stage credential‑stealer capable of harvesting cloud API keys, CI/CD secrets and Kubernetes tokens before exfiltrating them to attacker‑controlled domains. The compromise appears to have originated from a prior breach of the Trivy CI/CD pipeline, a vulnerability TeamPCP exploited to hijack a maintainer account. The same account was later used to suppress the disclosure, deface related repositories and leak roughly 70 private BerriAI projects within minutes. LiteLLM, which routes large‑language‑model requests through a single API and records over 95 million monthly downloads, is now a vector for credential theft across a broad swathe of AI‑driven services. Why the incident matters goes beyond the immediate loss of secrets. It underscores the fragility of the Python package ecosystem, where mutable version tags and unauthenticated uploads can turn a widely trusted library into a stealthy backdoor. The attack also demonstrates TeamPCP’s evolving playbook: after compromising security tools such as Trivy and KICS, the group now targets foundational AI infrastructure, raising the stakes for any organization that builds or deploys LLM‑enabled applications. Enterprises should audit their dependency chains, enforce strict version pinning and adopt signed‑package verification wherever possible. Monitoring for anomalous .pth files or unexpected network traffic from Python processes can catch the payload early. 
In the coming weeks security researchers expect further disclosures about compromised PyPI packages, and PyPI itself has pledged to tighten publishing controls. Keeping an eye on updates from the official LiteLLM maintainers and on any legal actions against TeamPCP will be essential for organisations that rely on AI‑augmented pipelines.
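The advice to monitor for anomalous .pth files rests on a documented behaviour of Python's `site` module: any line in a `.pth` file that begins with `import` followed by a space or tab is executed at interpreter start. The sketch below lists such lines in a directory. The function name is hypothetical, and it does not reproduce any indicator from the actual LiteLLM payload.

```python
import tempfile
from pathlib import Path


def executable_pth_lines(directory):
    """List (file name, line) pairs for .pth lines that Python would execute."""
    findings = []
    for path in sorted(Path(directory).glob("*.pth")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for line in text.splitlines():
            # The site module executes lines starting with "import" plus space or tab;
            # all other non-comment lines are treated as paths to add to sys.path.
            if line.startswith(("import ", "import\t")):
                findings.append((path.name, line))
    return findings


# Demo in a throwaway directory: one benign path entry, one executable line.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "paths.pth").write_text("/opt/extra/lib\n", encoding="utf-8")
    Path(tmp, "odd.pth").write_text("import os; os.getenv('HOME')\n", encoding="utf-8")
    findings = executable_pth_lines(tmp)
```

To audit a real environment, the function can be run over each directory returned by `site.getsitepackages()` and `site.getusersitepackages()`. Hits need manual review, since legitimate tools such as setuptools also ship `.pth` files with import lines.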
94

Google's TurboQuant AI compression slashes LLM memory use sixfold

Mastodon +12 sources mastodon
google
Google Research has unveiled TurboQuant, a training‑free compression algorithm that shrinks the memory footprint of large language models (LLMs) by up to a factor of six. The technique quantises the key‑value (KV) cache – the working memory that stores attention keys and values for previously processed tokens during inference – to just three bits per entry and, according to Google, preserves the model’s original accuracy. A two‑step process that first applies PolarQuant to the cache’s floating‑point values and then refines them with a learned residual mapping enables the extreme reduction without the need for retraining. The breakthrough matters because KV‑cache memory has become the dominant bottleneck for serving LLMs at scale. By cutting that demand, TurboQuant can lower cloud‑infrastructure costs, reduce latency, and shrink the energy budget of inference workloads. The compression also opens a path for on‑device deployment of more capable models, a trend highlighted earlier this month when Apple demonstrated how Google’s Gemini can be distilled into smaller on‑device variants. For hardware vendors, the shift could accelerate demand for specialised accelerators that handle ultra‑low‑bit arithmetic, while cloud providers may see a competitive edge in offering cheaper, faster LLM APIs. What to watch next: Google plans to integrate TurboQuant into its Vertex AI platform later this year, and early benchmark results are expected at the upcoming ICLR conference. Third‑party frameworks such as Hugging Face and PyTorch are already probing support for the three‑bit format, which could speed broader adoption. Industry analysts will be monitoring whether the algorithm’s zero‑loss claim holds across diverse model families and real‑world workloads, and whether rivals release comparable compression schemes. If TurboQuant lives up to its promise, the economics of generative AI could shift dramatically, making powerful language models accessible to a wider range of applications and developers.
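To give a feel for what three bits per entry means, the sketch below applies plain uniform quantisation to a list of floats and computes KV‑cache size at different bit widths. It is deliberately not the TurboQuant or PolarQuant algorithm, whose details the article does not give. The 16‑bit baseline and the model dimensions in the usage note are assumptions.

```python
def quantize_3bit(values):
    """Uniformly quantise floats to 8 levels (3 bits). NOT the TurboQuant algorithm."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / 7 or 1.0  # 2**3 - 1 intervals; fall back to 1.0 for constant input
    codes = [round((v - lo) / step) for v in values]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Map 3-bit codes back to approximate float values."""
    return [lo + c * step for c in codes]


def kv_cache_bytes(layers, heads, head_dim, tokens, bits):
    """Keys plus values for every layer, head and token at a given bit width."""
    return 2 * layers * heads * head_dim * tokens * bits / 8
```

Assuming a 16‑bit baseline, `kv_cache_bytes(32, 32, 128, 8192, 16) / kv_cache_bytes(32, 32, 128, 8192, 3)` is 16/3, or roughly 5.3, from bit width alone. This naive scheme makes no accuracy guarantee: each value can be off by up to half a quantisation step.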
71

FPT Wins Agentic AI Award at 2026 AI Excellence Awards

Las Vegas Sun +10 sources 2026-03-26 news
agents
FPT, Vietnam’s leading IT services group, has taken home the Agentic AI prize at the 2026 Artificial Intelligence Excellence Awards, presented by the Business Intelligence Group. The accolade recognises IvyChat, the company’s enterprise‑grade platform that combines large‑language‑model reasoning with autonomous task execution, positioning it among the world’s most advanced “agentic” AI solutions. IvyChat moves beyond static chatbots by interpreting user intent, orchestrating multiple AI tools, and acting on behalf of users in real time—whether drafting contracts, triaging support tickets or optimizing supply‑chain workflows. The platform’s architecture, built on FPT’s AI Factory and Smart Cloud infrastructure, integrates proprietary data‑privacy safeguards and multilingual support, a critical differentiator for multinational corporations operating in the Asia‑Pacific region. The award matters on several fronts. First, it signals the maturation of Vietnam’s AI ecosystem, which has been bolstered by a string of recognitions for FPT’s cloud and AI initiatives over the past few years. Second, it underscores a shift in enterprise AI adoption: companies are now demanding systems that can not only answer questions but also carry out end‑to‑end processes without human intervention. By clinching the Agentic AI category, FPT demonstrates that it can compete with Western and Chinese rivals in delivering such capabilities at scale. Looking ahead, FPT has pledged to expand IvyChat’s integrations with major ERP and CRM suites and to roll out a developer portal that will let third‑party vendors embed autonomous agents into their own products. Observers will watch how the platform’s pricing model and data‑governance framework evolve, especially as European and U.S. regulators tighten rules around AI autonomy. Success in these areas could cement FPT’s role as a global hub for agentic AI and accelerate the broader adoption of self‑directing digital assistants across industries.
61

AI Agents Overtake Assistants: From Simple Replies to Autonomous Systems

Dev.to +9 sources dev.to
agentsautonomouscopilot
A post by cloud architect Sarvar Nadaf, published on the AWS Community Builders platform on March 25, sparked a fresh debate about the growing divide between AI assistants and AI agents. Nadaf’s piece, titled “AI Assistance vs AI Agents: Understanding the Shift from Responses to Autonomous Systems,” argues that the industry is moving beyond conversational helpers that merely answer questions toward software‑driven agents that can act independently on behalf of users. The distinction matters because it reshapes how enterprises design digital workspaces. AI assistants such as ChatGPT, Microsoft Copilot or Google Bard excel at retrieving information, drafting text or suggesting next steps when prompted. AI agents, by contrast, combine large‑language models with APIs, data stores and workflow engines to pursue goals without continual human input. ServiceNow’s AI Agents, IBM’s autonomous agents and emerging “agentic AI” platforms illustrate this trend, offering end‑to‑end task execution—from ticket routing to supply‑chain optimization—while embedding security and compliance controls native to the cloud provider’s AI platform. Analysts see the shift as a catalyst for productivity gains and cost reductions, but also as a source of new risk. Autonomous agents can make decisions that affect critical systems, raising questions about transparency, auditability and regulatory oversight. Companies that adopt agentic architectures will need robust governance frameworks, model‑level observability and clear escalation paths for human intervention. What to watch next: the rollout of standardized agent APIs by major cloud vendors, the emergence of cross‑vendor orchestration layers, and the first wave of regulations targeting autonomous AI actions in finance, healthcare and public services. 
Early adopters such as ServiceNow and IBM are likely to publish case studies that will set benchmarks for performance, safety and ROI, while startups race to build plug‑and‑play agent frameworks that promise “AI‑first” automation for midsize firms. The coming months will reveal whether the promise of truly autonomous AI agents can be delivered at scale without compromising control.
60

Azure Skills Plugin 2026 lets AI command Claude Code to deploy and auto‑configure cloud infrastructure

Mastodon +7 sources mastodon
claudemicrosoft
Microsoft has unveiled the Azure Skills Plugin 2026, a one‑click extension that lets Claude Code agents spin up full‑stack cloud environments simply by hearing the command “Deploy this app.” The plugin bundles a curated set of Azure services, the Azure MCP Server and the Foundry MCP Server into a single install, giving Claude Code a structured playbook for selecting the right compute SKU, configuring networking, handling permissions and launching the workload across more than 40 Azure services. The move pushes Claude Code beyond its recent auto‑mode rollout, which we covered on 25 March, where the model could generate code but still relied on developers to translate sketches into operational infrastructure. By embedding Azure‑specific expertise directly into the AI’s toolchain, Microsoft removes a major bottleneck in AI‑assisted development: the gap between code generation and production‑grade deployment. Enterprises can now hand off a high‑level request to an AI agent and receive a fully provisioned, monitored, and cost‑optimized environment, accelerating time‑to‑market and reducing the need for specialist cloud engineers. The plugin also opens a path for other coding assistants—OpenAI’s Codex, Gemini CLI, Cursor and the growing open‑source Claude Code skill library—to tap into the same Azure knowledge base, potentially standardising AI‑driven DevOps across platforms. For developers, the immediate benefit is a tighter feedback loop: write, test, and deploy without leaving the AI interface. What to watch next: Microsoft has promised incremental updates that will extend support to Azure Arc, hybrid‑cloud scenarios and tighter integration with GitHub Copilot. Analysts will be monitoring adoption metrics, especially among the 90 percent of Claude‑linked outputs that currently land in low‑star GitHub repos, to see whether the plugin can shift those projects into production‑grade pipelines. 
The next few months will reveal whether Azure Skills Plugin can truly make “just say deploy” a reliable reality for AI‑augmented software delivery.
56

Robust TypeScript LLM Extractor for Websites Released on GitHub

Mastodon +9 sources mastodon
A new open‑source library for web‑data extraction has hit the spotlight on Hacker News. Lightfeed’s “Extractor”, a TypeScript package that couples Playwright’s browser automation with large language models (LLMs), was posted by its creator as a “Show HN” entry on Monday, drawing immediate attention from developers and AI practitioners alike. The library promises to replace the patchwork of custom scrapers that many teams build for each project. By feeding raw HTML into a lightweight conversion step that strips navigation, headers and footers, the tool produces LLM‑ready markdown. Developers then issue natural‑language prompts that guide the model to return validated, structured data—product specs, article bodies, user comments, and more—while keeping token usage low enough for production pipelines. The repository, which went public on GitHub (github.com/lightfeed/extractor), already includes features such as list‑vs‑detail extraction modes, value‑history tracking and optional email notifications, all wrapped in a type‑safe API. Why it matters is twofold. First, the convergence of browser‑level rendering (via Playwright) and LLM reasoning eliminates the brittle, selector‑based code that traditionally breaks whenever a site changes its layout. Second, the emphasis on token efficiency addresses a cost barrier that has limited LLM‑driven scraping to research labs rather than commercial operations. Companies that rely on up‑to‑date product catalogs, market‑intelligence feeds or real‑time news aggregation can now prototype pipelines in hours instead of weeks, potentially reshaping the economics of data‑as‑a‑service. What to watch next are the community’s response and the speed of adoption in enterprise settings. Lightfeed has announced a roadmap that includes deeper integrations with OpenAI, Anthropic and local LLM stacks, as well as a visual debugging console for prompt tuning. 
If the project gains traction, it could spark a wave of similar “LLM‑first” extraction tools, prompting larger players to either contribute to the open‑source effort or roll out competing services. Monitoring GitHub activity, early case studies, and any regulatory commentary on AI‑driven web scraping will be key to gauging the library’s long‑term impact.
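The conversion step described above, dropping navigation, headers and footers before the page reaches the model, can be sketched in a few lines. This is a minimal illustration using only Python's standard library, not Lightfeed's actual API: the `SKIPPED_TAGS` set, the `MainContentExtractor` class and the `html_to_llm_text` function are hypothetical names, and the real library presumably does considerably more.

```python
from html.parser import HTMLParser

# Tags treated as page chrome and dropped. This set is an assumption for the
# sketch; the article does not list the library's actual rules.
SKIPPED_TAGS = {"nav", "header", "footer", "script", "style"}
HEADING_TAGS = {"h1": 1, "h2": 2, "h3": 3}


class MainContentExtractor(HTMLParser):
    """Collects visible text, skipping anything nested inside SKIPPED_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0      # > 0 while inside a skipped element
        self.heading_level = 0   # 1-3 while inside an h1-h3 element
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in HEADING_TAGS:
            self.heading_level = HEADING_TAGS[tag]

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in HEADING_TAGS:
            self.heading_level = 0

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth > 0:
            return
        # Headings become markdown-style "#" lines; other text is kept as-is.
        prefix = "#" * self.heading_level + " " if self.heading_level else ""
        self.lines.append(prefix + text)


def html_to_llm_text(html: str) -> str:
    """Return a compact, markdown-like rendering of the page's main content."""
    parser = MainContentExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


page = (
    "<html><body><nav>Home</nav><header>Site</header>"
    "<h1>Widget</h1><p>Price: 10 EUR</p>"
    "<footer>Copyright</footer></body></html>"
)
print(html_to_llm_text(page))
```

Stripping chrome before prompting is what keeps token usage down: the model only sees the heading and the price line, not the menus or the footer.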
50

Google's TurboQuant speeds AI memory eightfold, halving costs

VentureBeat +9 sources 2026-03-25 news
apple google llama vector-db
Google unveiled an upgraded version of its TurboQuant compression algorithm, promising an eight‑fold speedup in large‑language‑model (LLM) memory handling and a 50 % reduction in operating costs. The announcement comes as LLMs stretch their context windows to ingest multi‑page documents, a move that has strained the key‑value (KV) caches that store intermediate activations during inference. TurboQuant works by squeezing KV pairs down to three‑bit representations, a technique first disclosed in Google’s March 26 research brief that showed a six‑times memory cut. The new release adds a training‑free quantisation step that not only preserves accuracy but also accelerates memory reads, delivering the reported eight‑times throughput gain on Nvidia H100 GPUs. Within 24 hours, developers began porting the code to popular open‑source runtimes such as MLX for Apple Silicon and llama.cpp, signalling rapid community uptake. The upgrade matters because memory bandwidth has become the primary bottleneck for both cloud‑based AI services and on‑device inference. By shrinking the working memory, TurboQuant lowers GPU utilisation, translates into cheaper cloud bills, and makes it feasible to run larger context windows on edge devices. The algorithm also speeds up vector‑search workloads that power semantic retrieval and recommendation engines, potentially reshaping the economics of AI‑driven search. What to watch next: benchmarks from major cloud providers will reveal whether the eight‑fold speed claim holds across diverse model families. Apple’s on‑device AI pipeline, already leveraging Google’s Gemini models, may integrate TurboQuant to push more capable assistants onto iPhones and Macs. Competitors such as Meta and Microsoft are expected to unveil rival compression schemes, setting up a race to dominate the emerging “memory‑first” AI stack. 
As the ecosystem tests TurboQuant at scale, its impact on pricing, model architecture and the feasibility of ultra‑long‑context LLMs will become clearer.
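To make the idea of three‑bit representations concrete, here is a minimal sketch of plain uniform quantisation, where each value is mapped to one of eight integer codes. It is not TurboQuant's algorithm, which the article does not describe in detail; the function names `quantize_3bit` and `dequantize` and the per‑vector min/max scaling are assumptions chosen for illustration.

```python
def quantize_3bit(values):
    """Map floats to integer codes 0..7 using the vector's own min/max range."""
    lo, hi = min(values), max(values)
    # 2**3 = 8 levels means 7 steps between lo and hi; fall back to 1.0 when
    # all values are equal so that we never divide by zero.
    scale = (hi - lo) / 7 or 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Reconstruct approximate floats from the 3-bit codes."""
    return [lo + c * scale for c in codes]


vector = [-1.0, -0.4, 0.0, 0.3, 1.0]
codes, lo, scale = quantize_3bit(vector)
restored = dequantize(codes, lo, scale)
worst_error = max(abs(a - b) for a, b in zip(vector, restored))
print(codes, f"worst error: {worst_error:.3f}")
```

With this naive scheme the reconstruction error is bounded by half a quantisation step, which is why production methods put effort into choosing better code points than a uniform grid.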
48

OpenAI shuts down Sora over risks to emergency response systems.

Mastodon +12 sources mastodon
openai sora
OpenAI announced on March 24 that it is permanently disabling Sora, its text‑to‑video model, and shutting down the accompanying consumer app, API and sora.com portal. The decision follows a wave of warnings from national emergency‑management agencies that realistic AI‑generated footage could be weaponised to spread false information during natural disasters, terrorist attacks or public‑health crises. Government sources said the move aligns with newly issued preparedness guidelines that flag synthetic video as a high‑risk vector for misinformation that could hamper coordination among first‑responders, divert resources and erode public trust. Sora, unveiled six months earlier, built on the same multimodal architecture that powers DALL‑E and GPT‑4, allowing users to input text, images or short clips and receive a full‑length video in seconds. Early demos showcased photorealistic scenes that were difficult to distinguish from genuine footage, prompting concerns that malicious actors could fabricate flood, fire or explosion videos and flood social media feeds at the height of an emergency. The BBC reported that the shutdown also cancels a $1 billion partnership with Disney that had been slated to integrate Sora into the studio’s content pipeline. The closure underscores a broader industry reckoning over generative‑video technology. Regulators in the EU and the United States are already drafting provisions that would require robust watermarking and provenance tracking for synthetic media, and OpenAI’s own safety roadmap has recently shifted toward “autonomous‑system safeguards” rather than pure content moderation. Observers will watch whether OpenAI releases a watered‑down version of Sora with built‑in detection tools, how quickly competitors such as Google or Meta adjust their video‑generation roadmaps, and whether new standards for emergency‑response communications emerge to counter deep‑fake threats. 
The episode may become a benchmark for how AI firms balance innovation with public‑safety obligations.
48

VehicleMemBench Launches Benchmark for Multi‑User Long‑Term Memory in In‑Car Agents

ArXiv +10 sources arxiv
agents benchmarks
A team of researchers from the University of Helsinki and partners in the automotive AI community has released VehicleMemBench, an open‑source, executable benchmark designed to test how well in‑vehicle agents retain and reason over multi‑user preferences over extended periods. The benchmark ships as a self‑contained simulation environment where virtual occupants interact with a car’s AI assistant across dozens of sessions, generating dynamic preference histories that the agent must recall, reconcile, and act upon using the vehicle’s built‑in tools. The accompanying codebase on GitHub includes a suite of scripted scenarios—from seat‑position adjustments to climate‑control preferences—that deliberately introduce conflicting user requests to probe an agent’s ability to resolve disputes and maintain a coherent state of the vehicle. Why it matters is twofold. First, modern cars are evolving from isolated infotainment consoles into shared, AI‑driven cabins where multiple occupants expect personalized, persistent experiences. Current evaluation methods focus on single‑turn dialogue or short‑term task completion, leaving a blind spot in long‑term memory and conflict‑resolution capabilities that are essential for safety‑critical decisions such as driver‑assist handover or emergency routing. Second, the benchmark provides a standardized, reproducible metric that can accelerate research on memory architectures—such as LangMem or the recently unveiled TurboQuant compression technique that slashes LLM memory footprints by up to sixfold—by exposing real‑world constraints of limited on‑board compute and storage. What to watch next is the rapid adoption of VehicleMemBench by major OEMs and platform providers. Early adopters, including a Scandinavian electric‑vehicle startup, have pledged to integrate the suite into their internal validation pipelines, and the benchmark’s GitHub repository already shows forks from several AI labs experimenting with hybrid memory‑retrieval models. 
The next wave of papers is likely to report performance baselines, while industry consortia may formalize the benchmark as part of safety certification standards for autonomous‑driving assistants.
48

Google cuts AI inference costs sixfold with KV cache compression.

Mastodon +11 sources mastodon
google inference
Google’s research team unveiled TurboQuant, a data‑oblivious key‑value (KV) cache compression algorithm that shrinks the memory footprint of large language model (LLM) inference by more than six times while preserving accuracy. The method quantises KV caches to three bits per entry, a drop from the usual 16‑bit representation, and does so without any fine‑tuning or calibration. Benchmarks on Nvidia H100 GPUs show up to an eight‑fold acceleration of the attention kernel, translating into dramatically lower inference costs for models such as Gemini Pro and Llama 3. The breakthrough matters because KV caches have become the dominant bottleneck as LLMs scale and context windows lengthen. Memory consumption directly drives hardware procurement, data‑center power draw and, ultimately, the price charged to developers for running generative AI services. By slashing cache size, TurboQuant lets providers run longer prompts on the same hardware, defer or avoid costly upgrades, and pass savings on to end‑users. Early market reaction confirms the ripple effect: memory‑chip stocks slipped, and several GPU vendors are already revisiting their pricing and roadmap assumptions for next‑generation AI accelerators. What to watch next is how quickly the technique spreads beyond Google’s own Gemini stack. The algorithm is open‑source and hardware‑agnostic, so cloud operators, chip makers and enterprise AI teams are poised to adopt it. Analysts expect a wave of “compression‑first” design choices in upcoming GPU and ASIC releases, with manufacturers advertising built‑in support for 3‑bit KV cache handling. Meanwhile, competitors are racing to publish alternative quantisation schemes that could further tighten the memory‑speed‑cost triangle. The next few quarters will reveal whether TurboQuant becomes the de‑facto standard for cost‑effective LLM inference or sparks a broader arms race in AI memory optimisation.
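A back‑of‑the‑envelope calculation shows why bits per entry matter so much. The sketch below estimates KV cache size for a hypothetical model; the layer count, head count, head dimension and context length are illustrative assumptions, not figures for Gemini Pro or Llama 3, and the helper `kv_cache_bytes` is a made‑up name.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits_per_value):
    """Approximate KV cache size in bytes, ignoring any quantisation metadata."""
    # Two tensors (keys and values) per layer, one vector per head per token.
    values = 2 * layers * kv_heads * head_dim * tokens
    return values * bits_per_value / 8


# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=32, kv_heads=8, head_dim=128, tokens=128_000)

fp16 = kv_cache_bytes(**shape, bits_per_value=16)
q3 = kv_cache_bytes(**shape, bits_per_value=3)

print(f"16-bit cache: {fp16 / 2**30:.2f} GiB")
print(f" 3-bit cache: {q3 / 2**30:.2f} GiB")
print(f"reduction:    {fp16 / q3:.2f}x")
```

The bit‑width ratio alone is 16/3, or roughly 5.3 times, whatever the model shape. The "more than six times" figure quoted above therefore presumably reflects measurement details that this summary does not spell out.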
45

Google unveils Lyria 3 Pro AI music generation model

Mastodon +14 sources mastodon
deepmind google
Google has unveiled Lyria 3 Pro, the latest iteration of its DeepMind‑backed AI music generator, capable of composing full three‑minute tracks with distinct sections such as intros, verses, choruses and bridges. The model, rolled out today across six Google platforms and embedded in the Gemini app, marks a leap from the earlier Lyria 3 release, which was limited to short loops. Paid Gemini subscribers will be the first to access the Pro version, while a free tier will offer preview clips. The upgrade matters because it pushes generative audio closer to the creative flexibility of human composers. By understanding structural cues and rhythmic nuance, Lyria 3 Pro can produce songs that feel arranged rather than merely extended loops, a limitation that has hampered earlier tools like Suno or Udio. For independent musicians, podcasters and advertisers, the model promises rapid prototyping of original soundtracks without licensing hurdles, potentially reshaping content‑creation workflows and lowering production costs. Industry observers will watch how Google monetises the service and whether the Pro tier spurs a subscription surge for Gemini. Competition is already fierce: OpenAI’s generative‑media push, centred on its Sora video model, has stalled, while startups continue to iterate on lightweight LLM‑driven music engines. Key questions include the model’s ability to respect copyright when trained on existing music, the quality of genre‑specific output, and whether Google will open an API for third‑party integration. If Lyria 3 Pro proves reliable at scale, it could become the de‑facto backend for AI‑enhanced audio across streaming, gaming and advertising, prompting a new wave of AI‑first music production tools. Keep an eye on user feedback in the coming weeks and any announced pricing tiers that could signal Google’s broader strategy for generative audio.
43

OpenAI shuts down Sora app, ending its billion‑dollar partnership with Disney

Mastodon +8 sources mastodon
openai sora
OpenAI announced on X that it is shutting down Sora, the AI‑driven video‑generation app it launched last year, and with it the billion‑dollar partnership it had forged with Walt Disney. The notice, posted without further explanation, confirms that the December‑signed deal – which promised Disney a stake of roughly $1 billion and access to Pixar, Marvel and Star Wars characters for AI‑crafted short clips – is now dead. The move caps a turbulent few weeks for the venture. As we reported on March 25, Disney’s pilot of Sora resulted in a high‑profile “disaster” that exposed technical glitches and raised concerns about brand safety. The following day, OpenAI detailed how the tool’s ability to synthesize realistic footage could interfere with emergency‑response communications, prompting a rapid risk‑mitigation effort. Those incidents, combined with escalating production costs and a strategic shift toward productivity‑focused models ahead of the company’s planned IPO, appear to have tipped the balance. Ending Sora matters for several reasons. First, it signals that even well‑funded, high‑profile AI experiments can be aborted when they clash with corporate risk appetites and regulatory scrutiny. Second, Disney’s retreat underscores the entertainment industry’s cautious stance on granting generative AI unrestricted use of iconic IP, a lesson that will reverberate through other studios eyeing similar collaborations. Finally, the shutdown removes a potential source of deep‑fake video content, easing some of the ethical and security worries that have haunted policymakers this year. What to watch next: OpenAI’s upcoming product roadmap, especially any new tools aimed at enterprise productivity rather than consumer media creation. Disney will likely reassess its AI strategy, possibly pivoting to in‑house solutions or partnering with firms that can guarantee tighter control over IP usage. 
Regulators in the EU and US are also expected to issue clearer guidance on AI‑generated visual media, which could shape the next wave of collaborations between tech giants and content creators.
42

OpenAI Developers Launch Official X Account

Mastodon +12 sources mastodon
openai
OpenAI has announced a new credit program aimed at university students in the United States and Canada, granting $100 worth of usage on its Codex code‑generation model. The offer, posted on the official OpenAI Developers X account, is intended to lower the cost barrier for students who want to experiment with AI‑assisted programming, a field that has surged in popularity since Codex powered GitHub Copilot and other developer tools. The move comes as OpenAI expands its developer ecosystem after a series of high‑profile launches, including the general availability of Codex, the AgentKit SDK and the recent Dev Day announcements. By providing a modest but meaningful budget, the company hopes to seed early‑stage adoption among the next generation of engineers, encouraging them to embed AI into coursework, hackathons and personal projects. For many students, $100 translates into several hundred hours of API calls, enough to prototype full‑stack applications or explore advanced prompting techniques without worrying about bill shock. Beyond education, the credit program signals OpenAI’s broader strategy to democratise access to its models. Earlier this year the firm introduced free‑tier access to GPT‑4 and rolled out a suite of tools for building AI agents, positioning itself as the default platform for both hobbyists and enterprises. By courting students, OpenAI not only cultivates a talent pipeline familiar with its stack but also creates a loyal user base that may later migrate to paid plans as their projects scale. Watch for enrollment numbers and feedback from university computer‑science departments, which could influence whether OpenAI extends similar incentives to other regions or to its newer models such as GPT‑5. The next OpenAI developer AMA, slated for December, is likely to address the program’s rollout details and hint at future expansions of the student‑focused ecosystem.
39

AI Multi-Agent Systems Streamline Complex Task Coordination

Dev.to +11 sources dev.to
agents training
A new technical deep‑dive titled “System Design Deep Dive — #5 of 20” has been published as part of a 20‑post series that maps the architecture of multi‑agent systems. The article lays out concrete design patterns for coordinating dozens of AI agents around a shared context, enabling them to request assistance, delegate subtasks and reconcile conflicting decisions in real time. It builds on recent research that treats a group of specialized agents as a single “AI team” overseen by a coordinating node, a model first highlighted in the “AI Agent Teamwork: Multi‑Agent Coordination Playbook” and in academic work on training agents to split complex, multi‑step tasks. The development matters because single‑agent models still stumble on workflows that require long decision chains, such as autonomous logistics planning, real‑time fraud detection or in‑vehicle infotainment management. By formalising shared memory structures and explicit hand‑off protocols, the deep‑dive promises more reliable, scalable deployments where each agent can focus on a narrow competence while the coordinator maintains global coherence. This mirrors the shift we noted on 26 March, when we reported that AI assistance is evolving from reactive chatbots toward autonomous agent ecosystems. What to watch next are the remaining fifteen posts, which will explore fault tolerance, security sandboxing and performance benchmarking, issues that directly affect the rollout of multi‑agent platforms in sectors from banking to automotive. Early adopters are likely to pilot the shared‑context approach in sandbox environments, and industry analysts will be tracking whether the coordination layer can keep latency under the sub‑second thresholds required for safety‑critical applications. The series could become a de‑facto reference for engineers building the next generation of collaborative AI.
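The coordinator‑plus‑shared‑context pattern described above can be sketched in a few lines. This is an illustrative toy, not code from the series: `SharedContext`, `Coordinator`, `route_agent` and `cost_agent` are hypothetical names, agents are plain functions rather than LLM calls, and conflict reconciliation is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SharedContext:
    """Memory visible to every agent; by convention only the coordinator writes."""
    facts: Dict[str, str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)


# An agent takes a task description and the shared context, and returns a result.
Agent = Callable[[str, SharedContext], str]


class Coordinator:
    def __init__(self) -> None:
        self.agents: Dict[str, Agent] = {}
        self.context = SharedContext()

    def register(self, skill: str, agent: Agent) -> None:
        self.agents[skill] = agent

    def delegate(self, skill: str, task: str) -> str:
        """Hand a subtask to the agent for `skill` and record the outcome."""
        if skill not in self.agents:
            raise KeyError(f"no agent registered for skill {skill!r}")
        result = self.agents[skill](task, self.context)
        self.context.facts[task] = result
        self.context.log.append(f"{skill}: {task} -> {result}")
        return result


def route_agent(task: str, ctx: SharedContext) -> str:
    return "route via depot A"


def cost_agent(task: str, ctx: SharedContext) -> str:
    # Reads what an earlier agent contributed to the shared context.
    route = ctx.facts.get("plan route", "unknown route")
    return f"estimate cost for: {route}"


coordinator = Coordinator()
coordinator.register("routing", route_agent)
coordinator.register("costing", cost_agent)
coordinator.delegate("routing", "plan route")
print(coordinator.delegate("costing", "price delivery"))
```

Routing every hand‑off through one node keeps the log and the shared facts consistent. The cost is that the coordinator becomes a latency and reliability bottleneck, which is the trade‑off the sub‑second concern above points at.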
39

Speculation grows over Sora's launch and OpenAI's role

Mastodon +6 sources mastodon
openai sora
OpenAI has officially shut down Sora, its high‑profile AI video‑generation service, and with it the billion‑dollar partnership it had forged with Walt Disney. The move was confirmed in a terse internal memo circulated to staff on Tuesday, and the Sora app vanished from the Apple Store within hours. As we reported on 25 March 2026, Disney’s involvement had been billed as a “game‑changing” validation of generative video for Hollywood; the abrupt termination now raises fresh questions about the viability of the technology. Industry insiders point to a stark lack of a sustainable business model as the primary driver. Sora’s cloud‑based rendering pipeline required massive GPU resources, yet the service never moved beyond a freemium tier that offered only limited output quality. Early adopters—advertisers, indie creators and a handful of studios—were eager, but the pricing structure never covered the operational costs, and OpenAI’s attempts to monetize through per‑minute credits stalled. Compounding the financial strain were mounting legal concerns: leaked documents suggested the model was trained on copyrighted footage scraped from YouTube and other platforms without clear permission, prompting threats of litigation from rights holders and a wave of criticism from artists’ collectives. The shutdown matters because it signals that even the most well‑funded AI firms can stumble when a product’s economics clash with regulatory and ethical pressures. It also underscores the fragility of high‑profile corporate alliances built on speculative technology; Disney now faces a strategic gap in its AI roadmap and may look to rivals such as Runway or Luma for next‑generation video tools. What to watch next: OpenAI is expected to file a detailed post‑mortem with the SEC, which could reveal whether the decision was purely financial or also a pre‑emptive move to avoid further legal exposure. 
Disney’s next AI partnership, likely announced in the coming weeks, will indicate whether the studio will double down on in‑house development or seek a new external collaborator. Competitors are already positioning themselves to capture Sora’s displaced user base, so the race to build a commercially viable generative video platform is far from over.
39

US ISP Avoids File‑Sharing Liability, Finland Rejects US Cloud Services, US Court Rules for Provider

Mastodon +6 sources mastodon
meta
A U.S. district court in New York ruled Thursday that a major American cloud provider cannot be held liable for users’ illegal file‑sharing activities, reinforcing the limited responsibility that service operators enjoy under the Digital Millennium Copyright Act. The decision, handed down in a case brought by a coalition of rights‑holders, hinges on the “safe harbour” provisions that protect platforms so long as they act promptly to remove infringing content once notified. The ruling arrives as Europe grapples with the tension between the U.S. CLOUD Act – which permits American authorities to request data from foreign‑based servers owned by U.S. companies – and the EU’s ambition for digital sovereignty. Finland’s election commission announced on the same day that it will run the September parliamentary vote on a wholly European cloud stack, explicitly excluding U.S. hyperscalers. Officials cited the CLOUD Act and recent court precedents as reasons to avoid any risk that foreign law‑enforcement could access voter data. Why it matters: the U.S. judgment solidifies the legal shield for cloud operators, potentially emboldening them to expand services without fearing copyright suits, while simultaneously sharpening scrutiny of where critical public data is stored. Finland’s move signals a broader shift among Nordic states toward “data localisation” for sensitive functions, a trend that could pressure global providers to offer EU‑jurisdictional alternatives or risk losing public‑sector contracts. What to watch next: the European Commission is expected to issue guidance on CLOUD‑Act compliance later this month, and several other Nordic governments have hinted at similar cloud‑exclusion policies. Legal scholars will be monitoring whether rights‑holder groups appeal the New York decision, which could set a precedent for future infringement cases.
Meanwhile, Meta’s announced AI upgrades and a U.S. court ruling that platforms can be sued for fostering social‑media addiction add to the regulatory maelstrom surrounding tech giants, suggesting that the balance between innovation, liability and sovereignty will remain a hotly contested arena throughout 2026.
36

AI Agents Tested as CFOs in New Resource Allocation Benchmark for Dynamic Enterprises

ArXiv +10 sources arxiv
agents benchmarks
A team of researchers has released EnterpriseArena, the first benchmark that puts large‑language‑model (LLM) agents through a full‑scale CFO simulation. The open‑source framework runs a 132‑month enterprise simulator that blends real‑world firm‑level financial statements, anonymised business documents, macro‑economic indicators and industry trends with expert‑validated operating rules. Agents must allocate capital, hire staff, launch projects and cut costs while coping with hidden information and stochastic market shifts—tasks that mirror the long‑horizon, high‑stakes decisions of a chief financial officer. The launch follows our March 26 coverage of multi‑agent systems for complex tasks, where we noted that LLM‑driven agents excel at short‑term, reactive actions but have not been rigorously tested on strategic resource planning. EnterpriseArena fills that gap by measuring not only raw prediction accuracy but also the ability to maintain fiscal health, meet regulatory constraints and adapt to unforeseen shocks over a decade‑long horizon. Early experiments reported in the arXiv pre‑print (2603.23638v1) show that even state‑of‑the‑art LLMs struggle to keep a balanced budget without explicit guidance, highlighting the need for more sophisticated planning, memory management and risk assessment modules. The benchmark’s release could accelerate a shift from AI assistants that answer queries to autonomous agents that manage business processes end‑to‑end. Enterprises may soon evaluate vendor solutions against EnterpriseArena before deploying LLM‑based finance bots, while researchers will likely use the suite to benchmark memory‑efficient models such as Google’s TurboQuant compression and long‑term memory systems like VehicleMemBench. Watch for the first public leaderboard results, which are expected later this quarter, and for follow‑up studies that integrate multi‑agent coordination techniques to handle cross‑departmental decisions. 
Success in this arena could redefine how companies leverage AI for strategic governance, turning experimental agents into trusted corporate officers.
32

Google Gemini AI launches for all Hong Kong users, offering free access through Gmail login.

Mastodon +6 sources mastodon
gemini google
Google has lifted the final restrictions on its Gemini AI assistant, making the service available to every Gmail‑registered user in Hong Kong without the need for a VPN. The rollout, announced earlier this week, unlocks the web‑based Gemini interface and its mobile companion for the territory’s 7 million internet users, who can now summon the chatbot by voice, generate text, images and short videos, and tap it for everyday tasks such as drafting emails, planning trips or brainstorming ideas. The move follows the phased launch we reported on 26 March, when Google first opened Gemini to a limited pool of Hong Kong accounts. Full access marks the completion of that trial and signals Google’s confidence that its flagship model – the latest Gemini 3.1, billed as “the most powerful and fastest” in the series – can operate reliably under local network conditions and comply with the region’s data‑privacy expectations. Why it matters is twofold. First, Gemini now competes directly with OpenAI’s ChatGPT and Microsoft’s Copilot on a market that has been eager for a home‑grown alternative to Apple’s Siri and local VPN‑dependent services. Second, the free‑tier availability lowers the barrier for small businesses, educators and creators to embed generative AI into workflows, potentially reshaping productivity standards across Hong Kong’s service‑driven economy. Looking ahead, the next questions revolve around pricing and enterprise integration. Google has hinted at a paid “Pro” tier for heavier users, and the company is expected to weave Gemini deeper into Workspace, Maps and YouTube. Regulators will also watch how the model handles personal data under Hong Kong’s evolving AI governance framework. Finally, the industry will keep an eye on whether Gemini 4.0, slated for later this year, will bring multimodal capabilities that could further erode the market share of existing assistants. 
The full opening of Gemini is the latest step in Google’s aggressive push to make its AI the default tool for everyday users in the region.
31

Claw-Eval Benchmark Propels Step 3.5 Flash to Second Place Among Open‑Source Agents

Dev.to +5 sources dev.to
agents benchmarks open-source
A new open‑source evaluation suite called Claw‑Eval has quickly become the talk of the LLM‑agent community. The framework, released on GitHub this week, offers a transparent, human‑verified benchmark that measures how well large language models perform as autonomous agents across 27 multi‑step tasks. In its first public leaderboard, the Step 3.5‑Flash model from StepFun AI claimed the runner‑up spot overall, trailing only the proprietary GLM‑5, while tying for first place on the Pass@3 metric – the standard indicator of an agent’s ability to find a correct solution within three attempts. The launch matters because the field has lacked a common yardstick for “real‑world” agent performance. Earlier benchmarks such as VehicleMemBench, which we covered on 2026‑03‑26, focused on memory persistence in in‑vehicle scenarios, but they did not assess the full tool‑use pipeline that modern agents require. Claw‑Eval fills that gap by demanding tool invocation, context‑window management and error recovery, and by publishing per‑task breakdowns that let developers pinpoint strengths and weaknesses. The open‑source nature of the harness also encourages reproducibility and community‑driven extensions, a contrast to the proprietary leaderboards that dominate commercial LLM rankings. Step 3.5‑Flash’s surge highlights a growing “agentic arms race” among open‑source projects. The model, fine‑tuned on multi‑step tool‑use data, demonstrates that specialized instruction can close the gap with closed‑source powerhouses. Its performance also underscores the importance of the Pass@3 metric, which many researchers now treat as a proxy for practical reliability in deployment settings such as automated customer support, code generation assistants, and even financial decision‑making agents.
What to watch next: the Claw‑Eval maintainers have promised quarterly updates, adding new tasks that simulate emergency‑response coordination and long‑term planning – areas where recent OpenAI safety work, reported on 2026‑03‑26, has raised concerns. Expect other open‑source groups to release “step‑3.5‑plus” variants aimed at the upcoming 5‑million‑token context windows that industry insiders predict will arrive later this year. The leaderboard will likely become a barometer for which models are ready for production‑grade autonomous workflows, and could shape funding decisions for startups racing to build the next generation of AI agents.
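For readers unfamiliar with the Pass@3 metric mentioned above, the simplest reading is the share of tasks solved at least once within three attempts. The sketch below implements that reading. It is an assumption about how such a score can be computed, not Claw‑Eval's published scoring code, and the function name `pass_at_k` and the sample data are made up for illustration.

```python
def pass_at_k(attempts_per_task, k=3):
    """Fraction of tasks with at least one success among the first k attempts.

    `attempts_per_task` is a list with one entry per task; each entry is a
    list of booleans, one per attempt, in the order the attempts were made.
    """
    if not attempts_per_task:
        raise ValueError("need at least one task")
    solved = sum(1 for attempts in attempts_per_task if any(attempts[:k]))
    return solved / len(attempts_per_task)


# Made-up results for four tasks, three attempts each.
results = [
    [False, True, False],   # solved on the second try
    [False, False, False],  # never solved
    [True, True, True],     # solved immediately
    [False, False, True],   # solved on the last try
]
print(f"pass@1 = {pass_at_k(results, k=1):.2f}")
print(f"pass@3 = {pass_at_k(results, k=3):.2f}")
```

On this toy data the gap between pass@1 and pass@3 shows how much a model gains from retries. Some benchmarks instead draw more samples per task and apply a statistical estimator, so scores are only comparable when the method matches.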
31

OpenAI shuts down Sora short‑video AI generator

Vice +10 sources 2026-03-26 news
openai sora
OpenAI announced on Tuesday that it is shutting down Sora, the short‑form video generator that sparked both viral hype and industry alarm after its October 2025 launch. In a brief post on X, the company wrote, “We’re saying goodbye to Sora,” adding that the service will be deactivated within weeks and that user‑generated content will be removed from the platform. The decision comes just three months after OpenAI signed a multiyear partnership with Walt Disney that would have allowed creators to use Disney characters in Sora videos. The deal’s collapse, reported on 26 March, was already seen as a warning sign that the app’s legal and licensing risks outweighed its commercial upside. At the same time, OpenAI has been fielding criticism from Hollywood guilds, advertisers and regulators who warned that AI‑generated clips could flood social feeds with deep‑fakes, undermine copyright, and even interfere with emergency‑response communications, a concern highlighted in our 26 March coverage of OpenAI’s risk‑mitigation efforts. Shutting Sora also reflects OpenAI’s broader cost‑control strategy. The service required substantial GPU capacity to render high‑resolution video in seconds, a line‑item that reportedly strained the company’s balance sheet as it prepares for a new funding round. Analysts see the move as a signal that OpenAI will prioritize more defensible products, such as its text and image models, while watching rivals like Anthropic and Google develop their own video capabilities. What to watch next: OpenAI has hinted at a “next‑generation” visual AI that will be more tightly gated and possibly integrated into its existing ChatGPT interface. Stakeholders will be monitoring whether Disney pursues alternative AI collaborations, and how regulators in the EU and US respond to the rapid rise and fall of AI‑generated media platforms.
The Sora shutdown may become a case study in how quickly hype can turn into policy and profitability constraints in the emerging AI video market.
