Taylor R. Hayes, PhD
DeepMeaning: Estimating and Interpreting Scene Meaning for Attention Using a Vision-Language Transformer
This study uses a vision-language transformer trained on billions of image-text pairs to investigate how local scene meaning guides human attention. ‘DeepMeaning’ automatically estimates semantic content in real-world scenes, predicts gaze patterns, detects semantic changes, and offers interpretable language-based insights. These results show that multimodal transformers can serve as powerful tools for linking vision and language to test theories of the role of scene semantics in attentional guidance.
Hayes T.R., Henderson J.M.
PDF
Cite
Code
DOI
The role of local meaning in infants’ fixations of natural scenes
Oakes L.M., Hayes T.R., Klotz S.M., Pomaranski K.I., Henderson J.M.
PDF
Cite
DOI
Visual attention during seeing for speaking in healthy aging
Rehrig G., Hayes T.R., Henderson J.M., Ferreira F.
PDF
Cite
DOI
Searching for meaning: Local scene semantics guide attention during natural visual search in scenes
Models of visual search in scenes include image salience as a source of attentional guidance. However, because scene meaning is …
Peacock C.E., Singh P., Hayes T.R., Rehrig G., Henderson J.M.
PDF
Cite
DOI
Scene inversion reveals distinct patterns of attention to semantically interpreted and uninterpreted features
We presented real-world scenes in upright and inverted orientations and used general linear mixed effects models to understand how semantic guidance, image guidance, and observer center bias were associated with fixation location and fixation duration. We observed distinct patterns of change under inversion. Semantic guidance was severely disrupted by scene inversion, while image guidance was mildly impaired and observer center bias was enhanced. In addition, we found that fixation durations for semantically rich regions decreased when viewing inverted scenes, while fixation durations for image salience and center bias were unaffected by inversion. Together these results provide important new constraints on theories and computational models of attention in real-world scenes.
Hayes T.R., Henderson J.M.
PDF
Cite
DOI
Meaning maps detect the removal of local semantic scene content but deep saliency models do not
Meaning mapping uses human raters to estimate different semantic features in scenes, and has been a useful tool in demonstrating the important role semantics play in guiding attention. However, recent work has argued that meaning maps do not capture semantic content, but like deep learning models of scene attention, represent only semantically-neutral image features. In the present study, we directly tested this hypothesis using a diffeomorphic image transformation that is designed to remove the meaning of an image region while preserving its image features. The results were clear: meaning maps generated by human raters showed a large decrease in the diffeomorphed scene regions, while all three deep saliency models showed a moderate increase in the diffeomorphed scene regions. These results demonstrate that meaning maps reflect local semantic content in scenes while deep saliency models do something else.
Hayes T.R., Henderson J.M.
PDF
Cite
DOI
Rapid extraction of the spatial distribution of physical saliency and semantic informativeness from natural scenes in the human brain
Attention may be attracted by physically salient objects, such as flashing lights, but humans must also be able to direct their attention to meaningful parts of scenes. Understanding how we direct attention to meaningful scene regions will be important for developing treatments for disorders of attention and for designing roadways, cockpits, and computer user interfaces. Information about saliency appears to be extracted rapidly by the brain, but little is known about the mechanisms that determine the locations of meaningful information. To address this gap, we showed people photographs of real-world scenes and measured brain activity. We found that information related to the locations of meaningful scene elements was extracted rapidly, shortly after the emergence of saliency-related information.
Kiat J.E., Hayes T.R., Henderson J.M., Luck S.J.
PDF
Cite
DOI
Meaning and expected surfaces combine to guide attention during visual search in scenes
How do spatial constraints and meaningful scene regions interact to control overt attention during visual search for objects in real-world scenes? To answer this question, we combined novel surface maps of the likely locations of target objects with maps of the spatial distribution of scene semantic content.
Peacock C.E., Cronin D.A., Hayes T.R., Henderson J.M.
PDF
Cite
DOI
Deep saliency models learn low-, mid-, and high-level features to predict scene attention
Deep saliency models represent the current state-of-the-art for predicting where humans look in real-world scenes. However, for deep saliency models to inform cognitive theories of attention, we need to know how deep saliency models prioritize different scene features to predict where people look. Here we open the black box of three prominent deep saliency models (MSI-Net, DeepGaze II, and SAM-ResNet) using an approach that models the association between attention, deep saliency model output, and low-, mid-, and high-level scene features.
Hayes T.R., Henderson J.M.
PDF
Cite
DOI
Linking patterns of infant eye movements to a neural network model of the ventral stream using representational similarity analysis
Little is known about the development of higher-level areas of visual cortex during infancy, and even less is known about how the development of visually guided behavior is related to the different levels of the cortical processing hierarchy. As a first step toward filling these gaps, we used representational similarity analysis (RSA) to assess links between gaze patterns and a neural network model that captures key properties of the ventral visual processing stream.
Kiat J.E., Luck S.J., Beckner A.G., Hayes T.R., Pomaranski K.I., Henderson J.M., Oakes L.M.
PDF
Cite
DOI