DiffGaze: A Diffusion Model for Modelling Fine-grained Human Gaze Behaviour on 360° Images

Chuhan Jiao; Yao Wang; Guanhua Zhang; Mihai Bâce; Zhiming Hu; Andreas Bulling

doi:10.1145/3772075

journal article Jan 09, 2026


ACM Transactions on Interactive Intelligent Systems Vol. 16 No. 1 pp. 1-23 · Association for Computing Machinery (ACM)


Abstract

Modelling human gaze behaviour on 360° images is important for various human–computer interaction applications. However, existing methods are limited to predicting discrete fixation sequences or aggregated saliency maps, thereby neglecting fine-grained gaze behaviour such as saccadic eye movements that can be captured by commercial eye-trackers. We introduce a more challenging task: fine-grained gaze sequence generation. This task aims to generate eye-tracker-like gaze data for given stimuli. We propose DiffGaze, a diffusion-based method for generating realistic and diverse fine-grained human gaze sequences conditioned on 360° images. We evaluate DiffGaze on two 360° image benchmarks for fine-grained gaze sequence generation as well as two downstream tasks, scanpath prediction and saliency prediction. Our evaluations show that DiffGaze outperforms the fine-grained gaze generation baselines in all tasks on both benchmarks. We also report a 21-participant survey study showing that our method generates gaze sequences that are indistinguishable from real human sequences. Taken together, our evaluations not only demonstrate the effectiveness of DiffGaze but also point towards a new generation of methods that faithfully model the rich spatial and temporal nature of natural human gaze behaviour.
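As a rough illustration of what "diffusion-based" means for gaze data, the sketch below applies the generic forward (noising) process of denoising diffusion probabilistic models to a short gaze sequence. It is not the DiffGaze architecture: the function names, the linear noise schedule, and the representation of gaze samples as (x, y) pairs normalised to [-1, 1] are all assumptions made for this example. A diffusion model is trained to reverse this corruption, here conditioned on the image.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def noise_gaze_sequence(gaze, t, alpha_bars, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I).

    gaze: list of (x, y) gaze samples (hypothetical normalised coordinates).
    t: 0-based diffusion step index.
    """
    signal = math.sqrt(alpha_bars[t])
    sigma = math.sqrt(1.0 - alpha_bars[t])
    return [
        (signal * x + sigma * rng.gauss(0.0, 1.0),
         signal * y + sigma * rng.gauss(0.0, 1.0))
        for x, y in gaze
    ]


# Toy gaze sequence: a fixation followed by a saccade-like jump.
toy_gaze = [(0.00, 0.00), (0.01, 0.00), (0.40, 0.25), (0.41, 0.26)]
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
slightly_noisy = noise_gaze_sequence(toy_gaze, 0, alpha_bars, random.Random(0))
almost_pure_noise = noise_gaze_sequence(toy_gaze, 999, alpha_bars, random.Random(0))
```

At the first step the sequence is almost unchanged, while at the last step it is close to pure Gaussian noise; generation runs the learned reverse process from such noise back to a plausible gaze sequence.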




Details

Published: Jan 09, 2026
Vol/Issue: 16(1)
Pages: 1-23

Authors

Chuhan Jiao, University of Stuttgart, Stuttgart, Germany

Yao Wang, University of Stuttgart, Stuttgart, Germany

Guanhua Zhang, University of Stuttgart, Stuttgart, Germany

Mihai Bâce, KU Leuven, Leuven, Belgium

Zhiming Hu, The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China

Andreas Bulling, University of Stuttgart, Stuttgart, Germany

Funding

Deutsche Forschungsgemeinschaft Award: 251654672—TRR 161

Swiss National Science Foundation Award: 214434

European Union’s Horizon Europe research and innovation funding programme Award: 101072410

Cite This Article

Chuhan Jiao, Yao Wang, Guanhua Zhang, et al. (2026). DiffGaze: A Diffusion Model for Modelling Fine-grained Human Gaze Behaviour on 360° Images. ACM Transactions on Interactive Intelligent Systems, 16(1), 1-23. https://doi.org/10.1145/3772075
