
Label it, and put it in order

On real web tasks a text-only agent succeeds 7% of the time; add the screenshot and that more than doubles to 15%; label the elements and it climbs again, to 16%.

Two payloads can carry the same pixels and the same words yet land very differently, because models are sensitive to how visual context is presented: whether it's labeled, and what order it arrives in. This is the last layer of Noru's formatting: structure.

Screenshot beats text, and labeling beats raw screenshot

Koh et al.'s VisualWebArena evaluates agents on realistic, visually grounded web tasks. The progression is striking¹:

Context given to the agent	Success rate
Text only (accessibility tree)	7.3%
+ screenshot	15.1%
+ labeled elements (Set-of-Marks)	16.4%
Human	88.7%

Handing the model the actual screenshot more than doubles success over text alone. Labeling the interactive elements with marks lifts it further, and the authors note the labeling gains are largest on visually dense pages, exactly where an unstructured dump fails the model.
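To make "labeling the interactive elements" concrete, here is a minimal sketch of the bookkeeping behind a Set-of-Marks style payload: give each element a numeric mark and hand the model a legend it can refer back to. The `Element` fields, the top-to-bottom numbering rule, and the legend format are illustrative assumptions, not the VisualWebArena implementation, and the step that actually draws the marks onto the screenshot is left out.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """Hypothetical description of one interactive element on the page."""
    role: str  # e.g. "button", "link", "textbox"
    text: str  # visible label, may be empty
    x: int     # left edge in pixels
    y: int     # top edge in pixels


def assign_marks(elements):
    """Number elements top to bottom, then left to right, starting at 1."""
    ordered = sorted(elements, key=lambda e: (e.y, e.x))
    return list(enumerate(ordered, start=1))


def legend(marked):
    """Render one legend line per mark so the model can refer to elements by number."""
    return "\n".join(
        f"[{mark}] {e.role} '{e.text}' at ({e.x}, {e.y})" for mark, e in marked
    )


elements = [
    Element("button", "Buy", 300, 40),
    Element("link", "Home", 10, 10),
    Element("textbox", "Search", 100, 10),
]
print(legend(assign_marks(elements)))
```

With marks in place, the agent can answer "click [3]" instead of describing a location in prose, which is the kind of grounding the labeled condition in the table provides.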

The agents benchmarked here are a generation old, and absolute success rates climb with every new model. The gap between formats is the durable part: a stronger model still does more with a labeled, ordered payload than with a flat dump, because the structure is what tells it where to look.

Order is not neutral

Where you place things in the payload changes whether the model uses them. Tan et al. find that simply reordering multimodal input "can cause the model's performance to fluctuate between advanced performance and random guessing," and that placing key content in the positions models attend to (beginning and end) yields average gains of +14.7% on video-caption matching and +17.8% on visual question answering². Reading order isn't cosmetic; it's accuracy.
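One way to act on the beginning-and-end finding is to rank payload items by importance and push the most important ones to the edges. The sketch below is an illustration of that idea only, not the procedure Tan et al. evaluate; the `place_at_edges` helper and the relevance scores are hypothetical, and how you score relevance is up to you.

```python
def place_at_edges(items, score):
    """Order items so the highest-scoring sit at the start and end of the list.

    Items are ranked by score, then dealt alternately to the front and the
    back, which leaves the lowest-scoring items in the middle.
    """
    ranked = sorted(items, key=score, reverse=True)
    front, back = [], []
    for i, item in enumerate(ranked):
        (front if i % 2 == 0 else back).append(item)
    return front + back[::-1]


# Hypothetical (name, relevance) pairs for the pieces of a multimodal payload.
pieces = [("chart", 5), ("caption", 4), ("sidebar", 3), ("footer", 2), ("ad", 1)]
ordered = place_at_edges(pieces, score=lambda p: p[1])
print([name for name, _ in ordered])
```

Here the chart lands first, the caption last, and the ad in the middle, where a missed item costs the least.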

The bar is low and the ceiling is high

Multi-image and screen understanding remain genuinely hard, which is why presentation matters so much. On MuirBench, top models trail humans badly on multi-image reasoning³. On ScreenSpot-Pro (pixel-precise grounding in professional, high-resolution apps) the best general models barely register (GPT-4o lands under 1%), while a method that structures the search lifts a base model to 48%⁴. Structure is doing real work.

Same screen, same text: when it is ordered and presented cleanly, the model actually uses it. Noru captures the screen, reads its text in reading order, and presents the payload structured, image first with ordered OCR alongside, so the model isn't fighting your formatting to find the answer.
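As a rough picture of what "image first with ordered OCR alongside" can look like, the sketch below groups OCR boxes into lines, sorts each line left to right, and emits the image before the text. It is not Noru's implementation: the `OcrBox` fields, the line tolerance, the single-column assumption, and the payload dictionary shape are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class OcrBox:
    """Hypothetical OCR result: recognized text plus its top-left corner and height."""
    text: str
    x: float
    y: float
    h: float


def reading_order(boxes, line_tol=0.5):
    """Return text lines top to bottom, each read left to right.

    A box joins the current line when its top edge is within line_tol times
    the height of that line's first box. Assumes a single-column layout.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b.y):
        if lines and abs(box.y - lines[-1][0].y) <= line_tol * lines[-1][0].h:
            lines[-1].append(box)
        else:
            lines.append([box])
    return [" ".join(b.text for b in sorted(line, key=lambda b: b.x)) for line in lines]


def build_payload(image_ref, boxes):
    """Put the image first, followed by the OCR text in reading order."""
    return [
        {"type": "image", "ref": image_ref},
        {"type": "text", "text": "\n".join(reading_order(boxes))},
    ]


boxes = [
    OcrBox("world", 60, 10, 12),
    OcrBox("Total:", 0, 40, 12),
    OcrBox("Hello", 0, 11, 12),
    OcrBox("$42", 70, 41, 12),
]
for part in build_payload("screenshot.png", boxes):
    print(part)
```

Multi-column pages, tables, and rotated text need more than a y-then-x sort; the point is only that the text reaches the model in the order a person would read it.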

Sources

VisualWebArena: Evaluating Multimodal Agents on Realistic Visually Grounded Web Tasks, Koh et al., 2024
Order Matters: Exploring Order Sensitivity in Multimodal Large Language Models, Tan et al., 2024
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding, Wang et al., 2024
ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use, Li et al., 2025