Alignment/Safety Probes & Governance

Key Questions

What did METR's recent study reveal about AI agents?

METR completed a comprehensive review of agent risks, including deception capabilities. Commenting on the study, Gary Marcus warned that failing to make AI agents follow rules would pose a major safety problem.

What evaluations is Gary Marcus highlighting?

Gary Marcus shared LLM evals involving 200k participants and warned that biomedical AI may be headed for a replication crisis. A separate paper, reposted by @thegautamkamath, argues that fixing hallucinations means fixing evaluations.

What benchmarks address medical AI robustness?

A new robustness benchmark for medical foundation models evaluates diagnostic reasoning from unstructured clinical narratives, with a focus on epilepsy.

What risks do injection attacks pose to multi-agent systems?

Research examines prompt injection vulnerabilities in multi-agent setups. These attacks can compromise coordinated AI behaviors across agents.

What governance developments involve Tsinghua or policy?

Tsinghua contributes to AI governance discussions. Meanwhile, Trump delayed an executive order on AI oversight hours before its planned signing, as security fears mount.

How are reward hacking and long-horizon agents evaluated?

SpecBench measures reward hacking in long-horizon coding agents. It provides metrics for verifiable, rule-following behavior in complex tasks.

What fairness or shortcut issues are probed in multimodal models?

FairLLaVA introduces fairness-aware fine-tuning for MLLMs. THUD exposes audio shortcuts that multimodal LLMs exploit, highlighting robustness gaps.

What robotics-inspired approaches aid foundation model safety?

Guardrails inspired by robotics are proposed for socially sensitive domains like education and mental health. These aim to constrain foundation model outputs in high-stakes settings.

METR deception + comprehensive agent risk review; Gary Marcus LLM evals (200k participants); medical FM robustness benchmark; injection attacks multi-agents; Tsinghua governance.

Sources (32)

Updated May 23, 2026

@daniel_271828 reposted: Beth is founder and CEO of METR, which just completed the most comprehensive yet...

Manchester Spinout Imperagen Raises £5M in Seed Funding to Deploy Quantum Physics, AI Modelling and Automated Labs for Enzyme Engineering

@GaryMarcus reposted: This is the most interesting paper I have read this week. The authors test a wi...

The Next Frontier of Genomic Foundation Models. AlphaGenome, Evo 2, GSFM, Caduceus, DeepVariant.

Evaluating large language models for diagnostic reasoning from unstructured clinical narratives in epilepsy

Meta’s Llama 4 Exposed the Open-Weights Cyber Problem

FairLLaVA: Fairness-Aware Parameter-Efficient Fine-Tuning for MLLMs. CVPR 2026.

@thegautamkamath reposted: 1/4 Fixing hallucinations means fixing evaluations, as shown in our new paper ht...

Trump delays executive order on AI oversight hours before planned signing

Trump to sign order on AI oversight as security fears mount among ...

@GaryMarcus: Biomedical AI may be headed for a replication crisis. (This work below is not about AI-generated re...

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

@GaryMarcus: ⚠️👇 🚨Breaking ⚠️ If we can’t make AI agents follow rules, we are screwed. New study from METR repo...

THUD: Exposing Audio Shortcuts in Multimodal LLMs

Robotics-Inspired Guardrails for Foundation Models in Socially ...

@gdb: SynthID for checking if an image was generated by OpenAI:

Kin Health raises $9M to build an AI notetaker for patients

@emollick: 🚨Our paper is out in PNAS: we found classic human persuasion techniques worked on AIs in a "parahuma...

Evaluating MLLMs on Detecting and Assessing the Artifacts of AI ...

Researchers who use hallucinated references to face ArXiv ban

@pmarca reposted: 1/ With distributed training, you could violate an AI pause treaty by training a...

Anthropic Is Preparing for IPO and We Should Be Worried

The Musk v. Altman Verdict Leaves the Biggest Questions Unanswered

Alignment pretraining: AI discourse creates self-fulfilling (mis)alignment

Frontier AI models reap rapid discovery of security vulnerabilities

Sequence-level watermarking for large language models

Mistral CEO Warns Europe Could Become US "Vassal State" Within 2 Years

Mistral CEO warns Europe has two years to secure AI ...

Mistral's CEO: Europe has 2 years to stop becoming America's AI 'vassal state'

Agentic Trading with Safe Guardrails

Research repository ArXiv will ban authors for a year if they let AI do all the work

LiSA: Lifelong Safety Adaptation via Conservative Policy Induction