# Confronting AI Risks: Strengthening Policies, Ethical Safeguards, and Technical Defenses in a Rapidly Evolving Landscape
As artificial intelligence (AI) continues its rapid advancement, its influence increasingly shapes critical facets of modern society—from economic stability and national security to societal trust and ethical norms. While AI's transformative potential offers unprecedented benefits, it also presents complex risks that demand urgent, coordinated, and multifaceted responses. Recent developments underscore a pivotal shift: moving beyond isolated experiments toward comprehensive, threat-informed governance and technical safeguards designed to preempt misuse, mitigate emerging threats, and uphold societal values.
## Growing Recognition of Systemic and Cross-Border AI Risks
There is a broadening consensus among policymakers, industry leaders, and researchers that AI's impact extends well beyond technological innovation, touching on **financial stability**, **national security**, and **social cohesion**. The potential for AI-driven disruptions to destabilize economies or erode societal trust has intensified calls for **coordinated, adaptive regulation**.
For example, recent insights from the **Federal Reserve** emphasize the urgency: **"AI-driven disruptions could destabilize economies if left unchecked,"** highlighting the need for **regulatory frameworks** that are **flexible, scalable, and internationally harmonized**. These frameworks must be capable of addressing threats like **malicious exploitation**, **systemic failures**, and **unintended consequences** with **cross-border implications**.
In parallel, the **International AI Safety Report** advocates for **expanded global cooperation**, emphasizing that **standards, monitoring mechanisms**, and **enforceable safeguards** should be harmonized internationally. As AI capabilities outpace existing regulatory structures, **shared responsibility among nations** becomes essential to manage risks effectively and uphold **global safety principles**.
## Ethical and Dual-Use Challenges in an Era of Powerful AI
The increasing sophistication of AI tools heightens concerns over **dual-use applications**, where benign civilian tools can be exploited for **military**, **malicious**, or **disinformation** purposes. Investigations reveal that **consumer chatbots**, initially designed for customer service, are now being repurposed for **disinformation campaigns**, **military simulations**, and **malicious manipulation**.
The proliferation of **deepfake technology** and **embodiment hallucinations** in generative media further complicates these issues. Experts warn that **fabricated content**—such as AI-generated videos or images—can **erode societal trust**, especially when weaponized in **journalism**, **politics**, or **security contexts**. The potential for **misinformation to cause tangible harm** underscores the urgent need for **responsible deployment**, **misuse prevention**, and **clear accountability frameworks**.
To address these challenges, significant efforts are underway to develop **content provenance tools** and **verification protocols**. At the same time, editing technologies such as **EditCtrl**, which enables **real-time, disentangled control** over generative media, illustrate how rapidly manipulation capabilities are advancing. Robust **content verification systems** and **detection methods** are therefore critical to **mitigate misinformation** and **maintain societal confidence** in AI-generated media.
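To make the provenance idea concrete, here is a minimal sketch of how a content-integrity check might work, assuming a publisher distributes an authenticated fingerprint alongside each media file. The manifest format and key handling below are illustrative only, not any specific standard (real provenance systems such as C2PA use public-key signatures and richer metadata):

```python
import hashlib
import hmac

def fingerprint(media_bytes: bytes) -> str:
    """Content fingerprint: SHA-256 over the raw media bytes."""
    return hashlib.sha256(media_bytes).hexdigest()

def sign_manifest(fingerprint_hex: str, key: bytes) -> str:
    """Publisher side: authenticate the fingerprint with an HMAC tag."""
    return hmac.new(key, fingerprint_hex.encode(), hashlib.sha256).hexdigest()

def verify(media_bytes: bytes, claimed_tag: str, key: bytes) -> bool:
    """Consumer side: recompute the fingerprint and check the tag.

    A mismatch means the media was altered after signing, or the
    manifest does not belong to this file.
    """
    expected = sign_manifest(fingerprint(media_bytes), key)
    return hmac.compare_digest(expected, claimed_tag)

# Example: the original clip verifies; a tampered copy does not.
key = b"publisher-secret-key"   # illustrative; real systems use PKI, not shared secrets
clip = b"\x00\x01frame-data"
tag = sign_manifest(fingerprint(clip), key)
```

The point of the sketch is the division of labor: the expensive trust decision happens once at publication time, and any downstream consumer can cheaply detect post-publication tampering.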
**Cybersecurity defenses** are also evolving; **ethically aligned autonomous systems** are being designed to **detect**, **respond to**, and **neutralize threats**, especially within **critical infrastructure** and **national security sectors**. Embedding **ethical safeguards** alongside technological defenses is vital to ensure AI systems operate **transparently**, **responsibly**, and **accountably**.
## Emerging Adversarial Threats and Novel Attack Vectors
The threat landscape continues to grow more sophisticated. Recent **Google AI threat intelligence reports** highlight the emergence of **Visual Memory Injection attacks**, which target **vision-language models** used in conversational AI. These attacks involve **specially crafted images** that subtly influence AI responses **without detection**, posing severe risks to **trustworthiness**—particularly in **healthcare**, **finance**, and **security**.
An expert notes: **"Visual Memory Injection allows adversaries to influence AI outputs covertly, raising critical concerns for trustworthiness in sensitive applications."** This underscores the necessity for **real-time detection mechanisms** capable of **identifying and mitigating adversarial manipulations**, thereby safeguarding **AI integrity** against evolving threats.
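The internals of Visual Memory Injection and its countermeasures are not described in the report, so no specific defense can be shown here. As a generic illustration of the input-screening idea, the toy check below compares an incoming image's summary statistics against a reference estimated from known-clean data and rejects strong outliers; the statistics and threshold are made up for illustration, and a real adversarial input would be far harder to catch:

```python
from statistics import mean

def anomaly_score(pixels, ref_mean, ref_std):
    """Distance (in reference standard deviations) between this image's
    mean brightness and the clean-data reference."""
    return abs(mean(pixels) - ref_mean) / ref_std

def screen(pixels, ref_mean, ref_std, threshold=3.0):
    """Accept the input only if its statistics look like clean data."""
    return anomaly_score(pixels, ref_mean, ref_std) <= threshold

# Reference statistics estimated offline from known-clean images
# (the numbers here are invented for illustration).
REF_MEAN, REF_STD = 128.0, 5.0

clean_image    = [127, 129, 128, 126, 130, 128]   # mean = 128 -> accepted
injected_image = [201, 199, 200, 202, 198, 200]   # mean = 200 -> rejected
```

Real detectors operate on much richer features (frequency spectra, embedding-space distances, attention patterns), but the pipeline shape is the same: fit a model of benign inputs offline, then score and gate inputs at inference time.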
## Cutting-Edge Technical Defenses and Innovations
In response to these emerging risks, the AI research community is making significant progress in developing **advanced defensive technologies**:
- **Hallucination detection** in language models has been enhanced through **attention-graph analysis**, such as **neural message passing on attention graphs**, which helps ground AI outputs in **factual information**, a property vital for **high-stakes applications**.
- **Vision-language model defenses** are evolving to **counter multi-modal adversarial attacks**, ensuring outputs **remain trustworthy** and **resistant to malicious influences**.
- The **NeST (Neuron Selective Tuning)** framework introduces a **lightweight safety alignment technique** that **selectively adapts safety-critical neurons**, leaving the rest of the **large language model (LLM)** untouched. This approach enables **targeted safety interventions** without extensive retraining, offering a **scalable pathway** toward **AI safety**.
- **AlignTune**, a **post-training alignment toolkit**, recently gained prominence. It allows **targeted safety and alignment adjustments** **after** the model's initial training, enabling **fine-grained safety corrections** and **behavioral control** in deployed models. This flexibility is crucial for organizations needing **continuous safety updates** without retraining from scratch.
## Policy, Incentives, and Standardization for AI Safety
The transition from **proof-of-concept prototypes** to **robust safeguards** depends heavily on **comprehensive policy measures** and **incentive structures**:
- The recent acceptance of the **Agent Data Protocol (ADP)** as an **oral presentation at ICLR 2026** signals a **milestone in standardizing responsible data sharing**. Promoted by @simonbatzner, **ADP** aims to foster **transparent, safe, and ethical data practices**, forming a foundational component for **AI safety and alignment** efforts.
- Policymakers are exploring **strategic policy levers** to **align incentives** with safety goals. The paper **"Strategic incentives and policy levers in the economics of AI alignment"** emphasizes that **well-designed policies** can encourage **long-term safety commitments** among AI developers and organizations.
- Governments are actively experimenting with initiatives like **"Enhancing AI Safety in the Public Sector"**, which integrates **safety protocols**, **oversight mechanisms**, and **value-aligned deployment practices** into public AI systems.
## Frontier AI Risks and Open Problems
A recent report by the **Oxford Martin AI Governance Institute (AIGI)** underscores **critical open problems** in **frontier AI risk management**. As AI systems become **more general-purpose and capable of performing diverse tasks**, the challenges of **governance**, **monitoring**, and **international collaboration** intensify.
Key issues include:
- **Lack of comprehensive oversight mechanisms** for **high-capability AI systems**.
- **Insufficient international coordination** to prevent an **AI arms race**.
- **Post-deployment monitoring gaps**, making it difficult to **detect unintended behaviors**.
- **Ethical frameworks** that struggle to keep pace with **technological advancements**.
The report calls for **urgent research, policy development**, and **global cooperation** to mitigate **existential and systemic risks** posed by frontier AI.
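One way to picture the post-deployment monitoring gap is a minimal runtime monitor: each model output is checked against policy rules, and the recent violation rate is tracked so operators can spot behavioral drift. The rule set, window size, and threshold below are invented for illustration; production monitors use far richer signals than keyword matching:

```python
from collections import deque

class DeploymentMonitor:
    """Toy post-deployment monitor: flag per-output rule violations and
    track their recent rate so operators can notice behavioral drift."""

    def __init__(self, banned_phrases, window=100, alert_rate=0.05):
        self.banned = [p.lower() for p in banned_phrases]
        self.recent = deque(maxlen=window)   # 1 = violation, 0 = clean
        self.alert_rate = alert_rate

    def check(self, output_text: str) -> bool:
        """Record one model output; return True if it violates a rule."""
        violation = any(p in output_text.lower() for p in self.banned)
        self.recent.append(1 if violation else 0)
        return violation

    def drifting(self) -> bool:
        """True when the recent violation rate exceeds the alert threshold."""
        if not self.recent:
            return False
        return sum(self.recent) / len(self.recent) > self.alert_rate

monitor = DeploymentMonitor(banned_phrases=["wire transfer to"], window=50)
monitor.check("Here is the weather forecast.")      # clean output
monitor.check("Please make a wire transfer to X.")  # flagged output
```

Even a monitor this crude changes the failure mode: instead of unintended behaviors accumulating silently, violations leave an auditable trail and a rising rate triggers human review.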
## The New Frontier: Video Generative AI and Its Implications
A groundbreaking development announced at **CVPR 2026** involves **tttLRM (Text-to-Video Large Resource Model)**, a collaboration between **Adobe** and **UPenn**. This AI system **pushes the boundaries** of **video generation and control**, allowing users to **generate, edit, and manipulate videos** with remarkable precision.
While **tttLRM** enhances creative potential—enabling high-quality, customizable video content—it also **amplifies misuse concerns**, notably around **deepfake proliferation**, **media manipulation**, and **disinformation campaigns**. As such, it **reinforces the urgent need** for **robust provenance, verification, and detection systems** to **counteract malicious uses** and **safeguard societal trust**.
## Outlook: Towards Resilient, Multi-Layered Safeguards
The collective progress in both **technical defenses** and **policy frameworks** signifies a **paradigm shift**: from isolated experiments to **integrated, multi-layered safeguards** that combine **regulation**, **cutting-edge technology**, **ethical oversight**, and **international collaboration**.
Key elements include:
- Establishing **standards and monitoring protocols** like the **Agent Data Protocol (ADP)**.
- Developing **advanced detection tools**—such as **attention-graph hallucination detectors**, **content provenance architectures**, and **vision-language defenses**.
- Implementing **targeted safety interventions** through frameworks like **NeST** and **AlignTune**, enabling **post-training safety adjustments** without retraining entire models.
- Promoting **global cooperation** to **harmonize standards** and **prevent harmful race dynamics**.
**In sum**, the future of AI safety hinges on a **holistic, resilient ecosystem**—integrating **policy**, **technical innovation**, **ethical principles**, and **international collaboration**. Only through **concerted, sustained efforts** can society harness AI’s transformative potential **responsibly**, while **minimizing risks** and maintaining **societal trust**.
## Current Status and Implications
Recent milestones—such as **robust detection of visual memory injection attacks**, **attention-graph hallucination mitigation**, **media provenance tools**, **NeST safety frameworks**, **AlignTune**, and the **announcement of tttLRM**—demonstrate **significant progress** in AI safety and robustness. However, the **threat landscape continues to evolve rapidly**; malicious actors develop **more sophisticated techniques**, including **semantic manipulations**, **multi-modal attacks**, and **deepfakes**.
This ongoing arms race underscores the **imperative for continuous innovation**, **international cooperation**, and **ethical embedding** throughout AI development pipelines. As **embodiment hallucinations** and **media manipulations** become more convincing and widespread, the importance of **verification systems**, **content provenance architectures**, and **multi-layered safeguards** intensifies.
**In conclusion**, safeguarding AI's transformative potential requires a **comprehensive, resilient approach**—integrating **policy measures**, **advanced technical defenses**, **ethical oversight**, and **global collaboration**. Through **vigilance and collaboration**, society can navigate this complex landscape, ensuring AI serves humanity’s best interests while minimizing inherent risks.