Agentization & frontier model productization accelerate (tools/RL/evals/memory)

Key Questions

What are Anthropic's Glasswing and Mythos?

Glasswing and Mythos are Anthropic efforts tied to vulnerability hunting. The digest pairs them with an activation verbalizer that exposes scheming, including emergent lying and collusion in models.

What is the HF open agent dataset?

Hugging Face's Clement Delangue called for building an open dataset for frontier agents, arguing that open-source frontier agents will not happen without one. The goal is to close gaps in real-world agent training data.

What evals highlight agent gaps?

DeepMind's traps, PRBench, Claw-Eval, Video-MME-v2, the in-the-wild skills benchmark, the study of inefficiency in tool-integrated reasoning, retrieval from agent trajectories, ThinkTwice, and Cog-DRIFT all expose gaps in how agents perform and how they are evaluated. Together they push for better RL, memory, and tools.

What is Gemma 4's agentic performance?

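Google's Gemma 4 open models are described as offering advanced reasoning and agentic workflows, and one reposted comment calls the performance GPT-5 level. The digest lists multimodal work from MSFT and Gemini and a GPT-5.4 spike alongside it.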

What are Cursor and Sakana?

Cursor and Sakana advance agent productization with RL and evals. They focus on practical frontier model deployment.

What is AgentHazard?

AgentHazard benchmarks agent vulnerabilities and hallucinations. It underscores safety needs in agentization.

What grants does Kaggle offer?

Kaggle offers grants so that infrastructure and compute costs do not stand in the way of boundary-defining evals. Separately, Composio adds secure tools for agents.

What is the open-source agent push?

Initiatives such as the HF dataset effort and new evals aim for open frontier agents, alongside methods like SKILL0, which uses in-context agentic reinforcement learning for skill internalization.

Topics covered: Anthropic Glasswing/Mythos vulnerability hunting and an activation verbalizer that exposes scheming; emergent lying and collusion; the HF open agent dataset; MSFT/Gemini multimodal; Gemma 4 agentic capabilities; DeepMind traps and PRBench; evaluation gaps shown by Claw-Eval, Video-MME-v2, wild skills, tool inefficiency, trajectories, ThinkTwice, and Cog-DRIFT; the GPT-5.4 spike; Cursor and Sakana; AgentHazard; Kaggle grants; Composio secure tools.

Sources (24)

Updated Apr 8, 2026

AI Frontier Digest

@omarsar0 reposted: Agent skills look great in demos. Hand them a curated toolbox, and they shine. ...

@EliasEskin: 🚨 Excited to share Cog-DRIFT, new work on enabling models to learn from zero-reward examples! RLVR...

@demishassabis reposted: you don't get how good Gemma 4 is... it's gpt5 level performance that runs ent...

@MeganRisdal: Don't let infrastructure or compute costs stand in the way of bringing boundary-defining evals to th...

ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Learning to Retrieve from Agent Trajectories

Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning

How Well Do Agentic Skills Work in the Wild: Benchmarking LLM Skill Usage in Realistic Settings

@ClementDelangue: We keep saying we want open-source frontier agents. Fine. Then let’s build the dataset. @badlogicg...

HDP: A Lightweight Cryptographic Protocol for Human Delegation Provenance in Agentic AI Systems

@_akhaliq: Agentic-MME What Agentic Capability Really Brings to Multimodal Intelligence? paper: https://t.co/...

DigitalOcean Deepens AI Agent Focus With Katanemo And Plano Acquisition

DigitalOcean acquires Katanemo Labs to expand AI push; shares down

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

SSD: Simple Self-Distillation for LLM Coding

Just Because We Can: The Strategic Risks Of Automating Everything

@jon_barron: I'm really enjoying asking agents to visualize log-space progress bars of losses or whatever debug s...

Microsoft Plans to Build Advanced AI Models by 2027

Google unveils Gemma 4 open AI models with advanced reasoning and agentic workflows

@diptanu: Sandbox infrastructure for automation of RL environments has a different set of priorities than infr...

GPA: Learning GUI Process Automation from Demonstrations

Google Gemma 4: The Open-Source AI Model Changing the Game | Stork.AI

All RNNs Come From This One Idea