Today in AI · 24 stories
Hey there, today's issue is all about the nuts and bolts of AI development, with a focus on the tools and frameworks that researchers are using to build and evaluate their models. From failure detection frameworks for robotic manipulation to benchmarks for text-to-image models, it's clear that the field is moving towards a more nuanced understanding of what works and what doesn't.
One of the common threads throughout these stories is the emphasis on evaluation and testing, with researchers introducing new benchmarks and frameworks to assess the performance of their models. This is a welcome development, as it suggests that the field is becoming more rigorous and self-critical.
At the same time, there are also some interesting launches and announcements, including new tools for team collaboration and content creation. These developments highlight the growing importance of AI in a range of industries and applications, from travel to creative work.
🛠️ Build
Researchers introduce Foresight, a failure detection framework for long-horizon robotic manipulation
Researchers propose Foresight, a failure detection framework that leverages action-conditioned world models and functional conformal prediction to monitor manipulation trajectories in long-horizon robotic tasks. Foresight is trained using only final task-level success or failure labels and provides a unified framework for failure detection across different policies. The framework is evaluated on state-of-the-art vision-language-action policies in simulation on LIBERO-Long, ManiSkill-Long, and BEHAVIOR-1K, and validated on real robots with three long-horizon tasks on a ReactorX-200 arm and one task on a Franka arm. The results suggest that action-conditioned world-model embeddings provide a scalable representation for reliable failure monitoring in long-horizon manipulation.
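For readers who want to see what a conformal threshold looks like in practice, here is a minimal sketch of plain split-conformal calibration applied to per-step failure scores. It is not Foresight's functional variant: the function names, the choice of each rollout's maximum score as the calibration statistic, and the sample numbers are all assumptions made for illustration.

```python
import math


def conformal_threshold(calibration_scores, alpha):
    """Split-conformal quantile over calibration scores.

    calibration_scores: one number per successful calibration rollout
        (assumed here to be that rollout's maximum per-step failure score).
    alpha: tolerated false-alarm rate on successful rollouts.
    """
    n = len(calibration_scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration rollouts for this alpha: never raise an alarm.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


def first_alarm_step(step_scores, threshold):
    """Return the first step whose score exceeds the threshold, else None."""
    for step, score in enumerate(step_scores):
        if score > threshold:
            return step
    return None


# Hypothetical calibration data: max failure score of ten successful rollouts.
calibration = [0.3, 0.1, 0.5, 0.2, 0.4, 0.9, 0.7, 0.6, 1.0, 0.8]
threshold = conformal_threshold(calibration, alpha=0.2)

# Hypothetical scores streamed from a new rollout.
alarm = first_alarm_step([0.2, 0.5, 0.95, 0.3], threshold)
```

If calibration and test rollouts are exchangeable, a successful rollout should trip this kind of threshold with probability at most alpha, which is the guarantee that makes conformal methods attractive for monitoring.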
Researchers introduce DiffusionBench for holistic evaluation of diffusion transformers
Researchers introduce NanoGen, a unified framework for training and evaluating diffusion transformers, demonstrating the need for comprehensive benchmarking beyond ImageNet class-conditional generation. NanoGen matches state-of-the-art diffusion transformer baselines on ImageNet and trains competitive text-to-image models with 12 lines of configuration change. The framework supports four diffusion methods and shows no strong correlation between ImageNet and text-to-image generation method rankings.
Researchers propose FedOT for ownership verification and leakage tracing in federated LDMs
Researchers propose FedOT, a framework for ownership verification and leakage tracing in federated latent diffusion models, introducing chunked watermarking and latent vector transformation to prevent watermark removal attacks. FedOT addresses two challenges in existing VAE-based watermarking techniques: the inability to trace model leakage to a specific client and vulnerability to VAE replacement attacks. The framework uses a chunked watermark for ownership verification and client identification, and latent vector transformation to strengthen the connection between the VAE and U-Net latent spaces. Extensive experiments demonstrate FedOT's superior performance in ownership verification and traceability.
Researchers introduce Tmax, a simple recipe for terminal agents
Researchers present Tmax, an RL training recipe for terminal agents that pairs a simplified training setup with an expanded dataset. With only 9B parameters, the resulting model scores 27% on Terminal-Bench 2.0 and outperforms larger models, bringing open data recipes closer to the frontier. The researchers generate data using a new taxonomy and open-source their terminal dataset, which is over 2.5x larger than previously released terminal-agent datasets.
Onur uses local models to triage OpenClaw repo issues for free
Onur ran local models such as Gemma and Qwen in an agent harness to classify and triage issues in the OpenClaw repo, getting near-instantaneous notifications without using up quota on a ChatGPT Pro plan. The two models tested, gemma-4-26b-a4b and qwen3.6-35b-a3b, were run with performance optimizations and can generate hundreds of tokens per second locally. To keep the agent from changing anything, Onur gave it reposhell, a restricted bash-like shell that allows only read-only operations on the OpenClaw repo. The localpager-agent configuration performs those read-only operations and returns the classification output. With this setup Onur classified issues and pull requests accurately; in one concrete example, qwen3.6-35b-a3b correctly classified an issue titled "Fix Kimi tool-call rewriting stop reason handling".
Researchers introduce Tapered Language Models with improved performance
Researchers introduced Tapered Language Models, which allocate more parameters to earlier layers and fewer to later layers, improving performance without increasing total parameters or compute costs. The models were tested across three scales and four architectures, including Transformer, Gated Attention, Hope-attention, and Titans, with consistent improvements in perplexity and downstream benchmark performance. A smooth cosine schedule was used to taper MLP width, resulting in improved performance at no additional parameter or compute cost. The findings establish depth-aware capacity allocation as a simple, architecture-agnostic axis of language model design. The research was published on arXiv with the identifier 2606.23670.
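A cosine taper of this kind can be sketched in a few lines. The version below is an illustration, not the paper's schedule: the function name, the `taper` strength, and the 12-layer, 4096-wide example are assumptions. It relies on the fact that MLP parameter count is linear in hidden width, so keeping the mean width equal to the baseline keeps the MLP parameter budget roughly unchanged.

```python
import math


def tapered_mlp_widths(num_layers, base_width, taper=0.5):
    """Per-layer MLP widths: wider in early layers, narrower in late ones.

    The scale follows 1 + taper * cos(pi * progress), which runs from
    (1 + taper) at the first layer to (1 - taper) at the last. The cosine
    terms cancel in pairs, so the mean width stays at base_width up to
    integer rounding.
    """
    if num_layers == 1:
        return [base_width]
    widths = []
    for layer in range(num_layers):
        progress = layer / (num_layers - 1)  # 0.0 at first layer, 1.0 at last
        scale = 1.0 + taper * math.cos(math.pi * progress)
        widths.append(round(base_width * scale))
    return widths


# Hypothetical configuration: 12 layers around a uniform width of 4096.
widths = tapered_mlp_widths(num_layers=12, base_width=4096, taper=0.5)
```

Real models would likely round each width to a hardware-friendly multiple, which this sketch leaves out.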
🚀 Launches
Hugging Face releases huggingface_hub every week with AI and human oversight
Hugging Face's team automated the release process for huggingface_hub using open-source tools and open-weights models, with a human reviewing and editing the AI-generated release notes and Slack announcements. The workflow is triggered by a single GitHub Actions workflow and takes one input, the release type, which can be a minor prerelease, minor release, or patch release. The pipeline computes the next version, creates or reuses the release branch, bumps the version, commits, tags, and pushes, then publishes to PyPI and drafts release notes using an open-weights model, currently GLM-5.2 from Z.ai.
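The "compute the next version" step is easy to picture with a small sketch. Everything below is an assumption for illustration, including the `rcN` suffix format, the release-type strings, and the bumping rules; it is not taken from the huggingface_hub workflow.

```python
import re

# Assumed version shape: MAJOR.MINOR.PATCH with an optional rcN suffix.
_VERSION = re.compile(r"(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?")


def next_version(current, release_type):
    """Return the next version string for a hypothetical release type."""
    match = _VERSION.fullmatch(current)
    if match is None:
        raise ValueError(f"unrecognised version: {current!r}")
    major, minor, patch = (int(part) for part in match.group(1, 2, 3))
    rc = match.group(4)

    if release_type == "minor-prerelease":
        if rc is not None:
            # Already on a release candidate: bump the rc counter.
            return f"{major}.{minor}.{patch}rc{int(rc) + 1}"
        return f"{major}.{minor + 1}.0rc0"
    if release_type == "minor-release":
        if rc is not None:
            # Promote the current release candidate to a final release.
            return f"{major}.{minor}.{patch}"
        return f"{major}.{minor + 1}.0"
    if release_type == "patch-release":
        if rc is not None:
            raise ValueError("cannot cut a patch release from a release candidate")
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown release type: {release_type!r}")
```

Keeping this logic in one pure function makes it straightforward to unit-test before the workflow tags and pushes anything.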
IBM releases CUGA, a configurable generalist agent harness with two dozen working examples
IBM released CUGA, a configurable generalist agent harness, which handles planning, execution, and state plumbing for agentic apps, allowing developers to focus on tool lists and prompts. CUGA has been used to build two dozen single-file apps, including a movie recommender and an IBM Cloud architecture advisor, demonstrating its ability to simplify agent development. The harness provides features such as long-horizon planning, variable management, and self-correction, and supports interchangeable tools and providers like OpenAI and watsonx.
Baidu researchers introduce Unlimited OCR with Reference Sliding Window Attention
Baidu researchers propose Unlimited OCR, a model designed to emulate the working memory humans use when parsing documents. It replaces the attention layers in the decoder with Reference Sliding Window Attention, which reduces attention computation cost and keeps the KV cache constant throughout decoding. This allows Unlimited OCR to transcribe dozens of pages of documents in a single forward pass under a standard maximum length of 32K. The model combines the high compression rate of DeepSeek OCR's encoder with the constant KV cache design. Code and model weights are publicly available on GitHub.
Researchers introduce Qwen-AgentWorld language world models for general agents
Researchers from Qwen introduce Qwen-AgentWorld-35B-A3B and Qwen-AgentWorld-397B-A17B, language world models capable of simulating agentic environments across 7 domains via long chain-of-thought reasoning. The models were developed through a three-stage training pipeline using over 10M environment interaction trajectories and outperform existing frontier models on the AgentWorldBench benchmark. Qwen-AgentWorld can be used as a decoupled environment simulator or a unified agent foundation model, improving downstream performance across 7 agentic benchmarks. The code and models are available on GitHub.
Shumai developers release open-source Frame.io alternative for creative work
Shumai developers launched an open-source platform for creative work, offering features such as S3-compatible and local storage, frame-by-frame annotations, and secure sharing. The platform also includes a collaborative AI chat agent and custom skills and tools. Shumai can be installed via Docker Compose or NPM, and requires PostgreSQL with the pgvector extension. The platform is available for use on local machines or remote servers, with configuration options for environment variables and storage backends.
Mistral Releases OCR 4 With Bounding Boxes And Block Classification
Mistral released OCR 4, featuring bounding boxes, block classification, and inline confidence scores alongside extracted text, supporting 170 languages across 10 language groups. The model runs in a single container for self-hosted deployments and serves as an ingestion component for enterprise search and domain-specific retrieval pipelines. OCR 4 is priced at $4 per 1,000 pages, with a 50% Batch-API discount, reducing the cost to $2 per 1,000 pages. Independent annotators prefer OCR 4 over leading OCR and document-AI systems, with win rates averaging 72%. The model achieves the top overall score on OlmOCRBench, with a score of 85.20, and leads the internal Crawl Multilingual evaluation with a score of 0.98.
🛡️ Safety
OpenAI helps found Appia Foundation to develop AI standards
OpenAI helped found the Appia Foundation, hosted by the Linux Foundation, to develop open, modular specifications for evaluating and securing advanced AI systems. Appia will create a shared technical language for national and international institutions to trust each other's work, producing clearer and more reusable evidence when models, infrastructure, and applications are developed by different organizations. The foundation's work complements OpenAI's broader safety infrastructure, including its Preparedness Framework and Frontier Governance Framework.
Stanford HAI finds AI hiring tools yield racial bias and systemic rejection
Stanford HAI researchers discovered that AI hiring tools can yield racial bias and systemic rejection, with 26% of Black and 15% of Asian candidates being rejected due to biased algorithms. The study highlights the need for more diverse and inclusive hiring practices. Researchers analyzed various AI-powered hiring tools and found that they often perpetuate existing biases, leading to unfair rejection of qualified candidates. The findings suggest that AI hiring tools require more rigorous testing and validation to ensure fairness and equity.
🔥 Buzz
Derya Unutmaz uses GPT-5 Pro to solve 3-year-old immune cell mystery
Derya Unutmaz, a professor at The Jackson Laboratory and the University of Connecticut, used GPT-5 Pro to revisit a 3-year-old puzzle centered on how glucose affects T cell development and specialization. Unutmaz had performed an experiment in 2022, but couldn't make sense of the results at the time. GPT-5 Pro suggested that deoxyglucose interfered with the construction of a protein called IL-2, which can prevent T cells from becoming inflammatory-response cells. This insight helped Unutmaz understand the difference in results between T cells exposed to low-glucose environments and those exposed to deoxyglucose. Unutmaz also used GPT-5 Pro to simulate an experiment and correctly predict the outcome, demonstrating the model's ability to understand and generate meaningful insights.
Omio builds conversational travel with OpenAI integration
Omio's CTO Tomas Vocetka discusses the company's vision for conversational travel, powered by real-time transportation data and OpenAI models. Omio launched one of the earliest travel experiences on ChatGPT in 2023, connecting users to live transportation inventory and pricing data. The company has since expanded its conversational travel experience, grounding responses in verified travel data. Internally, Omio has rolled out ChatGPT and Codex to employees, enabling teams to experiment and identify opportunities to improve their work.
Quicklinks
- Google Chrome engineers propose Cross-Origin Storage API for Transformers.js: Developer Relations Engineer Thomas Steiner from Google's Chrome team proposes a Cross-Origin Storage API to mitigate cache duplication issues with AI model resources in Transformers.js.
- FromSoftware uses a stack-based pushdown automaton for Elden Ring NPC AI: FromSoftware implements NPC AI in Elden Ring using a stack-based pushdown automaton with weighted random selection between actions, defined in Havok Script.
- Academia struggles to adapt assessments amid AI-generated content: Researchers note that AI-generated content is making it difficult for academia to assess student work and research quality, with implications for academic success and the validity of academic credentials.
- Researchers propose Self-Compacting Language Model Agents: Researchers propose a scaffolding approach called SelfCompact that enables models to autonomously determine optimal compaction timing and methods for managing long agent traces, achieving up to 18.1 points improvement on math benchmarks at 30-70% lower cost.
- OpenThoughts-Agent project releases open-source data curation pipeline: The OpenThoughts-Agent project presents a data curation pipeline for training agentic language models, yielding 44.8% average accuracy across seven benchmarks.
- Researchers present AOHP, an open-source OS-level agent harness: Researchers from multiple institutions present AOHP, an Android-based OS framework that treats AI agents as first-class entities, enhancing task completion rates by 21.12% and reducing execution costs by 51.55%.
- Researchers propose EDV framework for reliable experience learning in LLM agents: Researchers from an undisclosed institution propose EDV, a three-stage framework that uses multiple heterogeneous agents to collaboratively construct reliable experiences for LLM agents.
- Researchers from Meta and others publish PhoneBuddy for training open models in agentic phone use: Meta researchers with collaborators publish PhoneBuddy, a training recipe and open-model line that combines real and mock app environments to improve task success rates to 45.33% via mixed reinforcement learning.
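The Elden Ring item above mentions a stack-based pushdown automaton with weighted random selection. As a rough illustration of that idea, here is a small sketch in Python rather than Havok Script. The behaviour table, state names, weights, and the `run_npc` helper are all hypothetical and are not drawn from FromSoftware's implementation.

```python
import random

# Hypothetical behaviour table: state -> list of (weight, (operation, argument)).
# "push" enters a sub-state, "act" emits an action and returns to the parent,
# "pop" leaves the current state without acting.
BEHAVIOURS = {
    "combat": [
        (3.0, ("push", "melee")),
        (1.0, ("push", "ranged")),
        (0.5, ("pop", None)),
    ],
    "melee": [(2.0, ("act", "slash")), (1.0, ("act", "thrust"))],
    "ranged": [(1.0, ("act", "throw_knife"))],
}


def run_npc(table, start, rng, max_steps=10):
    """Run the stack machine and return the list of actions it emitted."""
    stack = [start]
    actions = []
    for _ in range(max_steps):
        if not stack:
            break
        options = table[stack[-1]]  # only the top of the stack decides
        weights = [weight for weight, _ in options]
        _, (operation, argument) = rng.choices(options, weights=weights, k=1)[0]
        if operation == "push":
            stack.append(argument)
        elif operation == "act":
            actions.append(argument)
            stack.pop()
        else:  # "pop"
            stack.pop()
    return actions


# Seeded generator so a given run is reproducible.
actions = run_npc(BEHAVIOURS, "combat", random.Random(0))
```

The stack is what separates this from a flat state machine: after a sub-behaviour finishes, control returns to whatever state pushed it.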
I'll be back tomorrow with another issue, covering the latest developments in AI and tech. Until then, I hope you find something interesting to read in today's stories.
End of edition · 2026-06-24