Noru Flow Research

Label it, and put it in order

On real web tasks, a text-only agent succeeds about 7% of the time; add the screenshot and that roughly doubles to 15%; label the elements and it climbs again, to about 16%.

Two payloads can carry the same pixels and the same words yet land very differently, because models are sensitive to how visual context is presented: whether it's labeled, and what order it arrives in. This is the last layer of Noru's formatting: structure.

Screenshot beats text, and labeling beats raw screenshot

Koh et al.'s VisualWebArena evaluates agents on realistic, visually grounded web tasks. The progression is striking [1]:

| Context given to the agent        | Success rate |
|-----------------------------------|--------------|
| Text only (accessibility tree)    | 7.3%         |
| + screenshot                      | 15.1%        |
| + labeled elements (Set-of-Marks) | 16.4%        |
| Human                             | 88.7%        |

Handing the model the actual screenshot more than doubles success over text alone. Labeling the interactive elements with marks lifts it further, and the authors note the labeling gains are largest on visually dense pages, exactly where an unstructured dump fails the model.
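
To make the labeling idea concrete, here is a minimal sketch of assigning numbered marks to interactive elements and emitting a matching text legend. It is an illustration only, not the VisualWebArena or Set-of-Marks implementation: the function name `assign_marks` and the element keys (`role`, `name`, `bbox`) are assumptions made for this example, and drawing the marks onto the screenshot is left out.

```python
def assign_marks(elements):
    """Number interactive elements top-to-bottom, then left-to-right.

    Each element is a dict with hypothetical keys: 'role', 'name', and
    'bbox' as (x, y, width, height) in pixels.
    Returns (marks, legend): marks pair each ID with its box so an overlay
    could be drawn; legend is the text the model reads next to the image.
    """
    # Sort by the top edge first, then the left edge, so IDs follow the page.
    ordered = sorted(elements, key=lambda e: (e["bbox"][1], e["bbox"][0]))
    marks, legend = [], []
    for mark_id, element in enumerate(ordered, start=1):
        marks.append({"id": mark_id, "bbox": element["bbox"]})
        legend.append(f'[{mark_id}] {element["role"]} "{element["name"]}"')
    return marks, "\n".join(legend)


# Illustrative input; these elements are made up for the example.
example_elements = [
    {"role": "button", "name": "Search", "bbox": (200, 10, 80, 30)},
    {"role": "link", "name": "Home", "bbox": (10, 10, 50, 30)},
    {"role": "textbox", "name": "Query", "bbox": (10, 60, 270, 30)},
]
example_marks, example_legend = assign_marks(example_elements)
print(example_legend)
```

With the marks in place, the model can answer with an ID such as "click [2]" instead of describing a location in free text.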

The agents benchmarked here are a generation old, and absolute success rates climb with every new model. The gap between formats is the durable part: a stronger model still does more with a labeled, ordered payload than with a flat dump, because the structure is what tells it where to look.
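
The gaps between formats can be read straight off the VisualWebArena figures quoted above. The short calculation below uses only those four numbers; the variable names are made up for this example.

```python
# Success rates (percent) as quoted from VisualWebArena in this article.
rates = {
    "text_only": 7.3,
    "screenshot": 15.1,
    "set_of_marks": 16.4,
    "human": 88.7,
}

# Multiplicative gain from adding the screenshot to text-only input.
screenshot_gain = rates["screenshot"] / rates["text_only"]

# Relative gain from adding Set-of-Marks labels on top of the screenshot.
marks_relative_gain = rates["set_of_marks"] / rates["screenshot"] - 1

# Remaining gap to human performance, in percentage points.
human_gap_points = rates["human"] - rates["set_of_marks"]

print(f"screenshot vs text only: {screenshot_gain:.2f}x")        # 2.07x
print(f"labels vs raw screenshot: +{marks_relative_gain:.1%}")   # +8.6%
print(f"gap to human: {human_gap_points:.1f} points")            # 72.3 points
```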

Order is not neutral

Where you place things in the payload changes whether the model uses them. Tan et al. find that simply reordering multimodal input "can cause the model's performance to fluctuate between advanced performance and random guessing," and that placing key content in the positions models attend to (beginning and end) yields average gains of +14.7% on video-caption matching and +17.8% on visual question answering [2]. Reading order isn't cosmetic; it's accuracy.
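
One simple way to act on this is to arrange items so the most relevant ones sit at the two ends of the payload and the least relevant ones fall in the middle. The sketch below is an illustration of that placement idea, not the procedure from Tan et al.; the function name `order_for_attention` and the relevance scores are assumptions, and how the scores are obtained is out of scope here.

```python
def order_for_attention(items, scores):
    """Place high-scoring items at the beginning and end of the sequence.

    items and scores are parallel lists; a higher score means more relevant.
    Items are ranked by score and dealt alternately to the front and the
    back, so the least relevant ones end up in the middle.
    """
    if len(items) != len(scores):
        raise ValueError("items and scores must have the same length")
    ranked = sorted(zip(items, scores), key=lambda pair: pair[1], reverse=True)
    front, back = [], []
    for rank, (item, _score) in enumerate(ranked):
        # Even ranks fill the front in order; odd ranks fill the back.
        (front if rank % 2 == 0 else back).append(item)
    # Reverse the back half so the second-best item lands in the last slot.
    return front + back[::-1]


print(order_for_attention(["a", "b", "c", "d", "e"], [5, 4, 3, 2, 1]))
# ['a', 'c', 'e', 'd', 'b']
```

Here the top-ranked item "a" opens the sequence, the second-ranked "b" closes it, and the lowest-ranked "e" sits in the middle.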

The bar is low and the ceiling is high

Multi-image and screen understanding remain genuinely hard, which is why presentation matters so much. On MuirBench, top models trail humans badly on multi-image reasoning [3]. On ScreenSpot-Pro (pixel-precise grounding in professional, high-resolution apps) the best general models barely register (GPT-4o lands under 1%), while a method that structures the search lifts a base model to 48% [4]. Structure is doing real work.

Take the same screen and the same text, order them and present them cleanly, and the model actually uses them. Noru captures the screen, reads its text in reading order, and presents the payload structured, image first with ordered OCR alongside, so the model isn't fighting your formatting to find the answer.
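
As a rough sketch of what "reading order" and "image first" can mean in practice, the example below groups OCR boxes into lines, sorts each line left to right, and builds a two-part payload. It is not Noru's implementation: the `OcrBox` type, the `line_tolerance` heuristic, and the payload dictionary shape are all assumptions made for illustration and do not follow any particular model API's schema.

```python
from dataclasses import dataclass


@dataclass
class OcrBox:
    """One recognized word with its top-left corner and height in pixels."""
    text: str
    x: float
    y: float
    height: float


def reading_order(boxes, line_tolerance=0.5):
    """Group boxes into lines (top to bottom), each sorted left to right.

    A box joins the current line when its vertical center is within
    line_tolerance * its own height of that line's first box.
    """
    lines = []
    for box in sorted(boxes, key=lambda b: b.y + b.height / 2):
        center = box.y + box.height / 2
        if lines and abs(center - lines[-1]["center"]) <= line_tolerance * box.height:
            lines[-1]["boxes"].append(box)
        else:
            lines.append({"center": center, "boxes": [box]})
    return [sorted(line["boxes"], key=lambda b: b.x) for line in lines]


def build_payload(image_ref, boxes):
    """Return the image part first, then the OCR text in reading order."""
    text = "\n".join(
        " ".join(box.text for box in line) for line in reading_order(boxes)
    )
    return [
        {"type": "image", "source": image_ref},  # hypothetical part shape
        {"type": "text", "text": text},
    ]


# Illustrative input: the words arrive out of order from the OCR step.
example_boxes = [
    OcrBox("world", x=60, y=11, height=10),
    OcrBox("Next", x=10, y=30, height=10),
    OcrBox("Hello", x=10, y=10, height=10),
]
print(build_payload("screen.png", example_boxes)[1]["text"])
```

A single tolerance like this handles simple single-column layouts; multi-column pages or rotated text would need a more careful layout step.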

Sources

  1. VisualWebArena: Evaluating Multimodal Agents on Realistic Visually Grounded Web Tasks, Koh et al., 2024
  2. Order Matters: Exploring Order Sensitivity in Multimodal Large Language Models, Tan et al., 2024
  3. MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding, Wang et al., 2024
  4. ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use, Li et al., 2025