LLM reasoning & evaluation bottlenecks
Key Questions
What does the HorizonMath benchmark indicate about LLM progress?
The HorizonMath benchmark indicates measurable progress in the mathematical reasoning capabilities of large language models.
What is the Perception or Prejudice study about?
The study introduces the MM-OCEAN dataset and reveals a prejudice gap in MLLMs, ranging from 0% to 33.5%, in the holistic grounding of personality traits.
How does decoupling perception and reasoning help VLMs?
The 'From Seeing to Thinking' work shows that separating perception from reasoning, and using reinforcement learning (RL) instead of supervised fine-tuning (SFT), improves the post-training of vision-language models.
What improvements does ETCHR provide for reasoning tasks?
ETCHR employs an RL-trained image editor to perform reasoning-aware edits, yielding 4-5% gains on relevant benchmarks.
What new datasets or methods address MLLM evaluation bottlenecks?
New resources include the MM-OCEAN dataset and methods such as ETCHR and decoupled perception-reasoning training to better evaluate and improve reasoning.
HorizonMath benchmark shows progress.
New: Perception or Prejudice exposes an MLLM prejudice gap (0-33.5% in holistic grounding) with the MM-OCEAN dataset.
New: 'From Seeing to Thinking' decouples perception and reasoning in VLMs (RL beats SFT for perception).
New: ETCHR uses an RL-trained image editor for reasoning-aware edits (+4-5% gains).