publications
2024
- Modeling dynamic social vision highlights gaps between deep learning and humans. PsyArXiv, Jun 2024
Deep learning models trained on computer vision tasks are widely considered the most successful models of human vision to date. The majority of work that supports this idea evaluates how accurately these models predict brain and behavioral responses to static images of objects and natural scenes. Real-world vision, however, is highly dynamic, and far less work has focused on evaluating the accuracy of deep learning models in predicting responses to stimuli that move, and that involve more complicated, higher-order phenomena like social interactions. Here, we present a dataset of natural videos and captions involving complex multi-agent interactions, and we benchmark 350+ image, video, and language models on behavioral and neural responses to the videos. As with prior work, we find that many vision models reach the noise ceiling in predicting visual scene features and responses along the ventral visual stream (often considered the primary neural substrate of object and scene recognition). In contrast, image models poorly predict human action and social interaction ratings and neural responses in the lateral stream (a neural pathway increasingly theorized as specializing in dynamic, social vision). Language models (given human sentence captions of the videos) predict action and social ratings better than either image or video models, but they still perform poorly at predicting neural responses in the lateral stream. Together these results identify a major gap in AI’s ability to match human social vision and highlight the importance of studying vision in dynamic, natural contexts.
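For readers curious how such a benchmark is typically set up, the sketch below shows a generic cross-validated encoding analysis in Python: ridge regression from a model's features to behavioral or neural responses, scored against an assumed noise ceiling. All array names, shapes, and the ceiling value are illustrative placeholders, not the paper's actual data or pipeline.

```python
# A minimal sketch of a model-benchmarking pipeline, assuming one feature
# vector per video from a vision (or language) model and one behavioral or
# neural response per target. All data here are random placeholders.
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n_videos, n_features, n_targets = 200, 512, 10

model_features = rng.standard_normal((n_videos, n_features))
human_responses = rng.standard_normal((n_videos, n_targets))

def encoding_score(X, Y, n_splits=5):
    """Mean held-out Pearson r between predicted and observed responses."""
    scores = np.zeros((n_splits, Y.shape[1]))
    folds = KFold(n_splits, shuffle=True, random_state=0).split(X)
    for i, (tr, te) in enumerate(folds):
        reg = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X[tr], Y[tr])
        pred = reg.predict(X[te])
        for j in range(Y.shape[1]):
            scores[i, j] = np.corrcoef(pred[:, j], Y[te][:, j])[0, 1]
    return scores.mean(axis=0)

raw = encoding_score(model_features, human_responses)
noise_ceiling = 0.6  # assumed split-half reliability of the responses
print("fraction of explainable variance:", raw / noise_ceiling)
```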
- The neurodevelopmental origins of seeing social interactions. Emalie McMahon and Leyla Isik. Trends in Cognitive Sciences, Mar 2024
In a recent letter, Grossmann argues that, in young children and non-human primates, third-party social interaction recognition is supported by top-down processing in the medial prefrontal cortex (mPFC). He suggests that top-down signals in the developing brain may be used to train neural systems in the superior temporal sulcus (STS), which, in adults, appears to process social interactions in a visual manner. The hypothesis that the visual computations supporting social interaction recognition are trained using top-down signals from the mentalization network is interesting. However, activation of mPFC when viewing social interactions does not preclude visual processing. As we discuss in our original article, when seeing social interactions, viewers can make rich inferences about the goals and mental states of the interacting agents. Young children and non-human primates may spontaneously engage higher-level cognitive processes when viewing social interactions, but we argue that these processes are separate from the recognition of the interaction itself. Finally, we review evidence suggesting that social interactions are processed visually in both young children and non-human primates and that STS selectivity emerges early in life.
- Abstract social interaction representations along the lateral pathway. Emalie McMahon and Leyla Isik. Trends in Cognitive Sciences, May 2024
Recent work in vision science and visual neuroscience has moved away from focusing on single people or objects to understanding the relations between them. Based on converging behavioral, computational, and neuroscience evidence, we recently argued that the visual system contains rich, abstract representations of social interactions between others [1]. We also outlined a framework for how this may be implemented hierarchically in the human brain [1], beginning with detecting agents, processing their physical relations, and finally recognizing their social interactions. As part of this framework, we argue that mid-level visual features about the physical relations between agents, which we refer to as social primitives, are represented in the extrastriate body area (EBA) and nearby regions of lateral occipitotemporal cortex (LOTC), whereas more abstract information about social interactions is represented in more anterior regions along the superior temporal sulcus (STS). However, Papeo questions our claim that social interaction representations in the STS are, in fact, abstract. Here, we review evidence that strengthens our claims of a posterior-to-anterior gradient of increasingly abstract representations of social interaction features along the recently proposed lateral visual stream.
2023
- Hierarchical organization of social action features along the lateral visual pathway. Emalie McMahon, Michael F. Bonner, and Leyla Isik. Current Biology, May 2023
Recent theoretical work has argued that in addition to the classical ventral (what) and dorsal (where/how) visual streams, there is a third visual stream on the lateral surface of the brain specialized for processing social information. Like visual representations in the ventral and dorsal streams, representations in the lateral stream are thought to be hierarchically organized. However, no prior studies have comprehensively investigated the organization of naturalistic, social visual content in the lateral stream. To address this question, we curated a naturalistic stimulus set of 250 3-s videos of two people engaged in everyday actions. Each clip was richly annotated for its low-level visual features, mid-level scene and object properties, visual social primitives (including the distance between people and the extent to which they were facing), and high-level information about social interactions and affective content. Using a condition-rich fMRI experiment and a within-subject encoding model approach, we found that low-level visual features are represented in early visual cortex (EVC) and middle temporal (MT) area, mid-level visual social features in extrastriate body area (EBA) and lateral occipital complex (LOC), and high-level social interaction information along the superior temporal sulcus (STS). Communicative interactions, in particular, explained unique variance in regions of the STS after accounting for variance explained by all other labeled features. Taken together, these results provide support for representation of increasingly abstract social visual content—consistent with hierarchical organization—along the lateral visual stream and suggest that recognizing communicative actions may be a key computational goal of the lateral visual pathway.
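The "unique variance" result above rests on a variance-partitioning comparison between a full encoding model and a reduced model lacking the feature set of interest. The sketch below illustrates that logic with simulated data; the feature matrices, the simulated voxel response, and the plain least-squares regressor are placeholders for the paper's within-subject voxelwise models, not its actual pipeline.

```python
# Variance-partitioning sketch: unique variance of communicative features
# is the drop in cross-validated R^2 when they are removed from the model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(1)
n_videos = 250
other_feats = rng.standard_normal((n_videos, 20))  # visual, scene, social primitives
comm_feats = rng.standard_normal((n_videos, 3))    # communicative-interaction labels
# Simulated voxel response that genuinely depends on a communicative feature.
voxel = (other_feats @ rng.standard_normal(20)
         + 2.0 * comm_feats[:, 0]
         + rng.standard_normal(n_videos))

def cv_r2(X, y):
    """Cross-validated R^2 from held-out predictions."""
    pred = cross_val_predict(LinearRegression(), X, y, cv=5)
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

full = cv_r2(np.hstack([other_feats, comm_feats]), voxel)
reduced = cv_r2(other_feats, voxel)
print(f"unique variance of communicative features: {full - reduced:.3f}")
```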
- Seeing social interactions. Emalie McMahon and Leyla Isik. Trends in Cognitive Sciences, May 2023
Seeing the interactions between other people is a critical part of our everyday visual experience, but recognizing the social interactions of others is often considered outside the scope of vision and grouped with higher-level social cognition like theory of mind. Recent work, however, has revealed that recognition of social interactions is efficient and automatic, is well modeled by bottom-up computational algorithms, and occurs in visually selective regions of the brain. We review recent evidence from these three methodologies (behavioral, computational, and neural) that converges to suggest that the core of social interaction perception is visual. We propose a computational framework for how this process is carried out in the brain and offer directions for future interdisciplinary investigations of social perception.
2021
- Understanding Mental Representations of Objects Through Verbs Applied to Them. Ka Chun Lam, Francisco Pereira, Maryam Vaziri-Pashkam, and 2 more authors. CogSci, May 2021
In order to interact with objects in our environment, we rely on an understanding of the actions that can be performed on them, and the extent to which those actions rely on, or have an effect on, the properties of the object. This knowledge is called the object's "affordance." We propose an approach for creating an embedding of objects in an affordance space, in which each dimension corresponds to an aspect of meaning shared by many actions, using text corpora. This embedding makes it possible to predict which verbs will be applicable to a given object, as captured in human judgments of affordance, better than a variety of alternative approaches. Furthermore, we show that the dimensions learned are interpretable and that they correspond to typical patterns of interaction with objects. Finally, we show that the dimensions can be used to predict a state-of-the-art mental representation of objects, derived purely from human judgments of object similarity.
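As one way to picture the affordance-space idea, the sketch below factorizes a toy verb-by-object co-occurrence matrix with non-negative matrix factorization, so that each latent dimension groups verbs that apply to similar objects. The tiny counts matrix, the choice of NMF, and the inner-product applicability score are all stand-ins for illustration; they are not the paper's actual method or corpus.

```python
# Toy "affordance space": factor verb-object co-occurrences so verbs and
# objects share a low-dimensional embedding; verb-object applicability is
# then scored by the inner product of their embeddings.
import numpy as np
from sklearn.decomposition import NMF

verbs = ["eat", "cut", "throw", "read", "open"]
objects = ["apple", "knife", "ball", "book", "door"]
# Hypothetical co-occurrence counts standing in for corpus statistics.
counts = np.array([
    [40,  0,  1,  0,  0],   # eat
    [ 5, 30,  0,  2,  0],   # cut
    [ 8,  1, 50,  2,  0],   # throw
    [ 0,  0,  0, 45,  0],   # read
    [ 1,  2,  0, 10, 60],   # open
], dtype=float)

nmf = NMF(n_components=3, random_state=0, max_iter=500)
verb_dims = nmf.fit_transform(counts)   # verbs in the latent space
object_dims = nmf.components_.T         # objects in the same space

applicability = verb_dims @ object_dims.T  # higher = more plausible affordance
for i, v in enumerate(verbs):
    print(f"{v}: most applicable object = {objects[int(np.argmax(applicability[i]))]}")
```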
- The ability to predict actions of others from distributed cues is still developing in 6- to 8-year-old children. Emalie McMahon, Daniel Kim, Samuel A. Mehr, and 3 more authors. Journal of Vision, May 2021
Adults use distributed cues in the bodies of others to predict and counter their actions. To investigate the development of this ability, we had adults and 6- to 8-year-old children play a competitive game with a confederate who reached toward one of two targets. Child and adult participants, who sat across from the confederate, attempted to beat the confederate to the target by touching it before the confederate did. Adults used cues distributed through the head, shoulders, torso, and arms to predict the reaching actions. Children, in contrast, used cues in the arms and torso, but we did not find any evidence that they could use cues in the head or shoulders to predict the actions. These results provide evidence for a change in the ability to respond rapidly to predictive cues to others’ actions from childhood to adulthood. Despite humans’ sensitivity to action goals even in infancy, the ability to read cues from the body for action prediction in rapid interactive settings is still developing in children as old as 6 to 8 years of age.
2019
- Subtle predictive movements reveal actions regardless of social context. Emalie McMahon, Charles Y. Zheng, Francisco Pereira, and 3 more authors. Journal of Vision, Jul 2019
Humans have a remarkable ability to predict the actions of others. To address what information enables this prediction and how that information is modulated by social context, we used videos collected during an interactive reaching game. Two participants (an "initiator" and a "responder") sat on either side of a plexiglass screen on which two targets were affixed. The initiator was directed to tap one of the two targets, and the responder had to either beat the initiator to the target (competition) or arrive at the same time (cooperation). In a psychophysics experiment, new observers predicted the direction of the initiators' reach from brief clips, segmented relative to when the initiator began reaching. A machine learning classifier performed the same task. Both humans and the classifier were able to determine the direction of movement before finger lift-off in both social conditions. Further, using an information mapping technique, we found that the relevant information was distributed throughout the body of the initiator in both social conditions. Our results indicate that we reveal our intentions during cooperation, in which communicating the future course of action is beneficial, and also during competition, despite the social motivation to reveal less information.
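The classifier analysis can be pictured with the sketch below: decode reach direction (left vs. right target) from body-posture features in short windows time-locked to movement onset, with accuracy expected to rise as the window approaches lift-off. The simulated features, window times, and logistic-regression decoder are assumptions for illustration, not the study's actual features or classifier.

```python
# Decoding sketch: classify reach direction from posture features in
# windows at different times relative to finger lift-off. Features are
# simulated so that predictive information grows as lift-off approaches.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n_trials, n_pose_features = 120, 30
direction = rng.integers(0, 2, n_trials)       # 0 = left, 1 = right target

for window_end_ms in (-200, -100, 0):          # relative to finger lift-off
    signal = (window_end_ms + 300) / 300.0     # assumed information growth
    X = rng.standard_normal((n_trials, n_pose_features))
    X[:, 0] += signal * (2 * direction - 1)    # one informative feature
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          X, direction, cv=5).mean()
    print(f"window ending {window_end_ms} ms: accuracy = {acc:.2f}")
```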
2018
- The Embodied Origins of Infant Reaching: Implications for the Emergence of Eye-Hand Coordination. Daniela Corbetta, Rebecca F. Wiener, Sabrina L. Thurman, and 1 more author. Kinesiology Review, Jul 2018
This article reviews the literature on infant reaching, from past to present, to recount how our understanding of the emergence and development of this early goal-directed behavior has changed over the decades. We show that the still widely accepted view, which considers the emergence and development of infant reaching as occurring primarily under the control of vision, is no longer sustainable. Increasing evidence suggests that the developmental origins of infant reaching are embodied. We discuss the implications of this alternative view for the development of eye-hand coordination, and we propose a new scenario stressing the importance of infants' body-centered sensorimotor experiences in the months prior to the emergence of reaching as a possible critical step in the formation of eye-hand coordination.