Vision Research Tracker

**Hierarchical & spatial tokenization + 3D structured visuals + depth + 4D** [developing]

Key Questions

How do hierarchical tokenizers improve multimodal large language models (MLLMs)?

Hierarchical vocabulary and layout tokenizers boost MLLM performance on diagram, 3D, and OCR tasks. Confirmed examples are Perceptio and F4Splat; newly tracked models include LoST, Qianfan, and PaddleOCR-VL.

What is PaddleOCR-VL?

PaddleOCR-VL is an OCR-focused vision-language model, listed here at 5M parameters with a score of 93% on OmniDocBench. It targets structured visual content such as documents.

What is AvatarPointillist?

AvatarPointillist performs autoregressive (AR) 4D Gaussian avatarization, advancing 4D avatar generation.

What is TRiGS?

TRiGS uses continuous 4D Gaussian splatting, describing motion with SE(3) transforms and Bézier curves over sequences of 1200 frames. It supports dynamic 4D scenes.
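To make the Bézier part of that description concrete, here is a minimal sketch of how a continuous trajectory can be evaluated at any frame of a 1200-frame sequence. It is an illustration under stated assumptions, not TRiGS's implementation: the function names, the cubic degree, and the control points are all hypothetical, and only the translation component is shown (the rotation part of an SE(3) motion is omitted).

```python
from math import comb


def bezier_point(control_points, t):
    """Evaluate a Bezier curve of any degree at parameter t in [0, 1]."""
    n = len(control_points) - 1
    dim = len(control_points[0])
    point = [0.0] * dim
    for i, cp in enumerate(control_points):
        # Bernstein basis weight for control point i.
        weight = comb(n, i) * (t ** i) * ((1 - t) ** (n - i))
        for d in range(dim):
            point[d] += weight * cp[d]
    return point


def frame_to_t(frame, num_frames):
    """Map a frame index in [0, num_frames - 1] to a curve parameter in [0, 1]."""
    return frame / (num_frames - 1)


# Hypothetical control points for one Gaussian center (not taken from TRiGS).
ctrl = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0), (3.0, 2.0, 1.0), (4.0, 0.0, 1.0)]

num_frames = 1200
first = bezier_point(ctrl, frame_to_t(0, num_frames))
last = bezier_point(ctrl, frame_to_t(num_frames - 1, num_frames))
middle = bezier_point(ctrl, 0.5)
```

Because the curve is defined for every real t in [0, 1], the position can be queried between frames as well, which is what makes such a representation continuous in time rather than per-frame.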

What is FreeScale?

FreeScale is a CVPR 2026 method for certainty-aware free-view scaling of 3D scenes. It handles town-scale generation like Extend3D.

What is Know3D?

Know3D uses VLM-prompted 3D generation with diffusion bridges for back-view synthesis. It leverages agentic frameworks.

What is SeGPruner?

SeGPruner is a semantic-geometric visual token pruner for 3D question answering. It reduces the number of visual tokens to make 3D QA more efficient.
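As a rough illustration of what combining semantic and geometric cues in token pruning can look like, the sketch below greedily keeps tokens that are both relevant to a query and spatially spread out. This is an assumed toy scheme, not SeGPruner's actual algorithm: `prune_tokens`, the `alpha` weighting, and the use of unnormalized Euclidean distance as the geometric term are all hypothetical choices.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune_tokens(features, positions, query, keep, alpha=0.5):
    """Greedily keep `keep` tokens; returns their indices in sorted order.

    Score = alpha * semantic relevance + (1 - alpha) * distance to the
    nearest already-kept token (a simple spatial-coverage term).
    """
    semantic = [cosine(f, query) for f in features]
    kept = []
    remaining = set(range(len(features)))
    while remaining and len(kept) < keep:
        def score(i):
            # Geometric term is 0 until at least one token has been kept.
            geo = min(math.dist(positions[i], positions[j]) for j in kept) if kept else 0.0
            return alpha * semantic[i] + (1 - alpha) * geo

        best = max(sorted(remaining), key=score)
        kept.append(best)
        remaining.remove(best)
    return sorted(kept)


# Hypothetical toy data: four tokens with 2D features and 3D positions.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
positions = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (5.0, 0.0, 0.0), (5.1, 0.0, 0.0)]
query = [1.0, 0.0]

kept = prune_tokens(features, positions, query, keep=2)
```

In this toy setup the most query-relevant token is chosen first, and the second pick favors a token far from it in space, so near-duplicate neighbors are dropped. A real pruner would normalize the two terms and operate on learned features.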

What reproduction priorities exist?

Reproduction priorities are semantic tokens, spatial encodings, and corpora and ablations, for models such as VGGT Transformer 3D and FOSSA (zero-shot defocus).

Hierarchical vocab/layout tokenizers boost MLLM performance on diagrams, 3D, and OCR.

Confirmed: Perceptio/F4Splat.

New: LoST/Qianfan/PaddleOCR-VL (5M params, 93% OmniDocBench); MinerU-Diffusion (diffusion OCR, 3x speed/RealRestorer); FlowScene/Matryoshka/MonoArt (28% ADD-S); InfiniDepth (CVPR 2026); 4DGS360/TRiGS (continuous 4D GS, SE(3)+Bézier, 1200 frames)/AvatarPointillist (AR 4D Gaussians); SeGPruner (3D QA)/LongCat-Next (lexicalization)/Extend3D (town-scale)/Seen2Scene (real partials)/FreeScale (CVPR 2026, certainty-aware free-view scaling); agentic VLM 3D grounding; Gen Models/Loc3R/3DreamBooth; Group3D/UniFunc3D/SpatialBoost/LagerNVS/fly-through; VGGT Transformer 3D (CVPR 2025 best, <1s, OSS); FOSSA zero-shot defocus (ZEDD); Know3D (VLM-prompted 3D gen, diffusion bridge).

Repro: semantic tokens, spatial encodings, corpora/ablations.

Sources (10)
Updated Apr 8, 2026