Learning Alignments of Human Gaze and Fine-grained Task Descriptions
Takumi Nishiyasu, Zhiming Hu, Andreas Bulling, Yoichi Sato
Proc. ACM on Human-Computer Interaction (PACM HCI), 2026.
Abstract
We propose GTANet, a novel approach to learning alignments between human gaze scanpaths and fine-grained task descriptions in vision-language tasks. While tasks are known to influence gaze behaviour, the relationship between gaze scanpaths and fine-grained task descriptions remains largely unexplored. GTANet addresses this gap by aligning encoded spatiotemporal gaze features with text descriptions. We use a Patch-based Gaze Encoder to generate visual-context-aware gaze features and a Multimodal Feature Mixer to fuse the gaze and text features, capturing their cross-modal alignment. To validate the method, we introduce two novel tasks: gaze-based question retrieval and question-based gaze retrieval. Experiments on the AiR and MHUG datasets demonstrate that GTANet consistently and significantly outperforms baseline methods across all Recall@K metrics, achieving substantial improvements in both gaze-to-task and task-to-gaze retrieval. These results confirm the strong link between human gaze and fine-grained task descriptions and validate the effectiveness of our approach.
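To make the alignment idea concrete, below is a minimal PyTorch sketch of one common way to realize gaze-text alignment for retrieval: a gaze encoder trained against text embeddings with a symmetric contrastive (InfoNCE) objective, evaluated with Recall@K. All names (GazeEncoder, contrastive_alignment_loss, recall_at_k), dimensions, the mean pooling, and the loss itself are illustrative assumptions, not the paper's released implementation; in particular, GTANet's Multimodal Feature Mixer fuses the two streams, which this simplified dual-encoder form omits to keep the retrieval objective explicit.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GazeEncoder(nn.Module):
    """Hypothetical patch-based gaze encoder: embeds fixation coordinates
    together with image-patch features cropped around each fixation, then
    models their temporal order with a small Transformer."""
    def __init__(self, patch_dim=512, d_model=256):
        super().__init__()
        self.xy_proj = nn.Linear(2, d_model)          # (x, y) fixation positions
        self.patch_proj = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, fixations, patches):
        # fixations: (B, T, 2) normalized coordinates; patches: (B, T, patch_dim)
        h = self.xy_proj(fixations) + self.patch_proj(patches)
        h = self.temporal(h)                          # (B, T, d_model)
        return F.normalize(h.mean(dim=1), dim=-1)     # pooled scanpath embedding

def contrastive_alignment_loss(gaze_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE: matched gaze/text pairs lie on the diagonal of the
    similarity matrix and are pulled together; mismatched pairs are pushed
    apart, enabling retrieval in both directions."""
    logits = gaze_emb @ text_emb.t() / temperature    # (B, B)
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def recall_at_k(sim, k=5):
    """Fraction of queries whose ground-truth match (the diagonal entry)
    appears among the top-k retrieved items."""
    topk = sim.topk(k, dim=1).indices
    gt = torch.arange(sim.size(0), device=sim.device).unsqueeze(1)
    return (topk == gt).any(dim=1).float().mean().item()

# Toy usage with random tensors standing in for real scanpaths and question
# features (e.g., from a frozen text encoder).
B, T = 8, 12
enc = GazeEncoder()
gaze = enc(torch.rand(B, T, 2), torch.randn(B, T, 512))
text = F.normalize(torch.randn(B, 256), dim=-1)       # stand-in text embeddings
loss = contrastive_alignment_loss(gaze, text)
print(loss.item(), recall_at_k(gaze @ text.t(), k=5))

The symmetric form of the loss is what lets both retrieval directions (gaze-to-task and task-to-gaze) fall out of a single similarity matrix.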
BibTeX
@article{nishiyasu26_etra,
  title = {Learning Alignments of Human Gaze and Fine-grained Task Descriptions},
  author = {Nishiyasu, Takumi and Hu, Zhiming and Bulling, Andreas and Sato, Yoichi},
  year = {2026},
  journal = {Proceedings of the ACM on Human-Computer Interaction (PACM HCI)},
  number = {ETRA}
}