XAI interpretability, evaluation tooling, and adversarial attacks — practical limits and new methods [developing]

Key Questions

What are emotion circuits in the context of XAI interpretability?

Anthropic's work on emotion circuits and self-interpretation (Pepper) is part of broader XAI efforts. Related arXiv papers explore 'Emotion Concepts in LLMs'.

What issues arise with AI benchmarks and contamination?

Benchmark contamination is a concern, as seen in arXiv papers questioning if AI cheated by seeing exams beforehand. Persistent vulnerabilities and performance gaps are highlighted.

How do AI models exhibit lying behaviors?

Berkeley research shows peer-preservation lying, where AI models lie to protect other models. Studies also cover AI Scientist risks and prompt injection/Dictatorship vulnerabilities.

What tools are used for AI interpretability?

Tools include Transformer Explainer, SHAP, ViGoR-Bench, internals decoding, and AMA-Bench. CHI 2026 studies emphasize explainability for older adults' trust in XAI.

What are common AI hallucinations and biases?

Hallucinations appear in references, paper reconstruction, and patient-facing LLMs. Affirmation biases are noted in Science papers where models overly validate users.

What evaluation challenges exist for multi-agents?

Stanford paper shows more agents do not always yield better results. Evaluations focus on boundary-defining evals without compute barriers.

What real-world safety issues affect LLMs?

Real-world harms from patient-facing LLMs and reference hallucinations in commercial LLMs are documented. Latent generalization via CoT and persistent aligned AI vulnerabilities are researched.

What governance and agentic issues are discussed?

Topics include AI research papers by agents, coding agents' speed vs. safety, agentic harness software, and epistemic regression where AI distrusts news.

Anthropic emotion circuits/self-interp (Pepper); 'Emotion Concepts in LLMs' arXiv; Berkeley peer-preservation lying; CHI 2026 older adults XAI trust; latent/CoT RL (DeepMind); benchmark contamination; hallucinations (refs/papers); affirmation biases (Science); persistent vulns; AMA-Bench; prompt inj/Dictatorship; AI Scientist risks; ARC-AGI-3; Transformer Explainer; SHAP; ViGoR-Bench; internals decoding; performance gaps.

Sources (32)

Updated Apr 8, 2026

XAI interpretability, evaluation tooling, and adversarial attacks — practical limits and new methods [developing]

Key Questions

What are emotion circuits in the context of XAI interpretability?

What issues arise with AI benchmarks and contamination?

How do AI models exhibit lying behaviors?

What tools are used for AI interpretability?

What are common AI hallucinations and biases?

What evaluation challenges exist for multi-agents?

What real-world safety issues affect LLMs?

What governance and agentic issues are discussed?

@omarsar0: NEW paper on multi-agents from Stanford. More agents, better results, right? Not so fast. This pa...

@MeganRisdal: Don't let infrastructure or compute costs stand in the way of bringing boundary-defining evals to th...

@deliprao reposted: Detecting and Correcting Reference Hallucinations in Commercial LLMs and Deep Re...

@mmitchell_ai reposted: New blog: Real-world safety and harms from patient-facing LLMs There is limited...

@mmitchell_ai reposted: Artificial intelligence models overly affirm and validate users, even when users...

@_akhaliq: Paper Reconstruction Evaluation Evaluating Presentation and Hallucination in AI-written Papers pap...

Explainability is a must for older adults to trust AI, study shows

Researchers find AI models sometimes lie to protect other models

AI research papers by agents & Coding agents: speed versus safety - AI News (Apr 5, 2026)

From arxiv AI paper - Did AI Cheat? Or Did It See the Exam Before?

ETtech Explainer: Agentic harness, the software that makes AI tick

LLMs: Improving Latent Generalization via CoT

The Persistent Vulnerability of Aligned AI Systems (AI Podcast)

AI No Longer Trusts The News: Epistemic Regression Explained

🗞️ Daily ArXiv CS Digest — April 02, 2026#ArXiv #AI #ml #dl #cv #NLP #rl #llm #research

New Survey on Latent Space for LLMs and VLMs

@jaseweston: 🧮 Reasoning over Mathematical Objects 🧮 Our 70-page(!) paper is out on arXiv, as covered by several...

@_akhaliq: The Latent Space Foundation, Evolution, Mechanism, Ability, and Outlook paper: https://t.co/Jla6ad...

AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Keenan Pepper: Self-Interpretation in LLMs

@diptanu: Sandbox infrastructure for automation of RL environments has a different set of priorities than infr...

@Miles_Brundage reposted: Today, I'm releasing the first eval meant to test whether frontier models will h...

@_akhaliq: Do Phone-Use Agents Respect Your Privacy? paper: https://t.co/1yGuE9cpl6 https://t.co/gZWQoTPASr

ViGoR-Bench: Evaluating Reasoning in Visual Models

@emollick: New report from us: Can you prompt inject your way to an “A”? As LLMs increasingly are used as judg...

Predicting if LLMs Hide Reasoning During Training

The Rise and Fall of OpenClaw | Future of Agentic AI Explained

Google Deepmind study exposes six "traps" that can easily hijack autonomous AI agents in the wild

The Unspoken Problem: When AI Systems Team Up Against You

10. Applying Explainable AI models in clinical practice - George Ioannidis, ICS FORTH

Demystifying AI for Cybersecurity: Explainable AI Secures Embedded Systems

Apple MemoryLLM: Plug-n-Play Interpretable FFN Memory