Recognizing the Gist of a Scene
People can
recognize the meaning, or “gist,” of a scene, for example
that it is a beach, a dining room, or a street, during their first eye
fixation on it. In fact, our own research has shown that viewers can
recognize the gist of a scene at over 80% accuracy after as little as
36 milliseconds of uninterrupted processing time. This raises the questions of how we
are able to recognize images so rapidly, and what information we use to
recognize them. Answering these questions is important for our
understanding of scene perception, because research has shown that the
gist of a scene activates our prior knowledge associated with the
scene’s category (e.g., that beaches have water, sand, and
possibly sunbathers and palm trees). This knowledge strongly guides
where we pay attention, may help us recognize objects in the
scene, and plays a big role in determining what information we
remember from a scene. At its core, research on scene gist recognition
explores the interface between perception and cognition—a problem
that has proved extremely challenging to workers in both artificial
intelligence and cognitive psychology. Such research can be applied in
designing artificial intelligence systems capable of recognizing the
categories of scenes. We have carried out a number of studies on scene gist recognition
over the past several years, which are described below.
The Roles of Central vs. Peripheral Vision in Scene Gist Recognition

An interesting question is which region of the visual field is most useful for recognizing scene gist: central vision (the fovea and parafovea), with its higher visual resolution and importance for object recognition, or the periphery, with its large extent and its ability to resolve the lower spatial frequencies useful for scene gist recognition. A study on this question has been described in a YouTube video and covered in a newspaper article by United Press International.
We have done a number of studies investigating this issue. Scenes were presented in two experimental conditions: a “Window,” a circular region showing the central portion of a scene and blocking peripheral information, or a “Scotoma,” which blocks out the central portion of a scene and shows only the periphery. Results indicated the periphery was more useful than central vision for maximal performance (i.e., equal to seeing the entire image). Nevertheless, central vision was more efficient for scene gist recognition than the periphery on a per-pixel basis. A critical radius of 7.4° was found where the Window and Scotoma performance curves crossed, producing equal performance. This value was compared to predicted critical radii from cortical magnification functions on the assumption that equal V1 activation would produce equal performance. However, these predictions were systematically smaller than the empirical critical radius, suggesting that the utility of central vision for gist recognition is less than predicted by V1 cortical magnification.
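The idea of a critical radius as the crossing point of two performance curves can be illustrated with a minimal sketch. It assumes accuracy was measured at the same set of radii in both conditions and that the curves are treated as piecewise linear between measurements. The function name `crossing_radius` and all of the numbers below are hypothetical and are not data from the studies described here.

```python
def crossing_radius(radii, window_acc, scotoma_acc):
    """Return the radius at which two piecewise-linear accuracy curves cross.

    radii       -- increasing list of radii (degrees of visual angle)
    window_acc  -- accuracy in the Window condition at each radius
    scotoma_acc -- accuracy in the Scotoma condition at each radius
    Returns None if the curves never cross within the sampled range.
    """
    diffs = [w - s for w, s in zip(window_acc, scotoma_acc)]
    for i in range(len(radii) - 1):
        d0, d1 = diffs[i], diffs[i + 1]
        if d0 == 0:
            return radii[i]
        if d0 * d1 < 0:
            # The difference changes sign here; interpolate its zero linearly.
            t = d0 / (d0 - d1)
            return radii[i] + t * (radii[i + 1] - radii[i])
    if diffs[-1] == 0:
        return radii[-1]
    return None


# Hypothetical accuracies, for illustration only.
radii = [1.0, 5.0, 10.0, 13.5]
window_acc = [0.40, 0.60, 0.80, 0.90]   # rises as the window grows
scotoma_acc = [0.90, 0.80, 0.60, 0.40]  # falls as the scotoma grows

r = crossing_radius(radii, window_acc, scotoma_acc)
print(f"estimated critical radius: {r:.2f} degrees")
```

With real data, one would more likely fit smooth functions to each curve before locating the crossing; linear interpolation is used here only to keep the sketch short.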
Other studies in our lab have investigated how the use of central versus peripheral vision to recognize scene gist varies over time. Scene gist is recognized within a single fixation, but we have asked whether gist recognition varies over space (central versus peripheral vision) and over time within that fixation. A related issue is whether attentional focus affects scene gist recognition (Evans & Treisman, 2005; Li et al., 2001).
Our previous research showed that both central and peripheral information can produce equal scene gist recognition, provided there is roughly twice as much area in the periphery. However, those studies did not vary processing time (through masking) or manipulate attention. We therefore presented “Window” or “Scotoma” conditions using a critical radius, such that both Window and Scotoma images produced equal gist accuracy when unmasked (i.e., with unlimited processing time). We briefly presented images for 24 ms each and varied processing time via the target-to-mask stimulus onset asynchrony (SOA). Our results have shown that at very short SOAs, central information produces better gist recognition than peripheral information, whereas with unlimited processing time in a single fixation (i.e., no mask), performance is equal for central and peripheral information, as predicted from the use of the critical radius.
This result is consistent with the hypothesis that attention begins focused at the center of vision and rapidly spreads outward, and that this affects scene gist recognition.
What Categorical Level of Scene Gist is Perceived First?
What level of categorization occurs first in scene gist processing, the basic level or the superordinate “natural” versus “man-made” distinction? The Spatial Envelope model of scene classification and human gist recognition (Oliva & Torralba, 2001) assumes that the superordinate distinction is made prior to basic level distinctions. This assumption contradicts the claim that categorization occurs at the basic level before the superordinate level (Rosch et al., 1976). We carried out a study to test this assumption of the Spatial Envelope model by having viewers categorize briefly flashed and masked scenes after varying amounts of processing time. The results showed that at early stages of processing (SOA < 72 ms), (1) viewers were more sensitive to the superordinate distinction than to basic level distinctions, and (2) basic level distinctions that crossed the superordinate natural/man-made boundary were treated as superordinate distinctions. Both results support the assumption of the Spatial Envelope model and challenge the idea of basic level primacy.
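The passage above compares "sensitivity" across categorization levels without naming the measure. A common choice in signal detection theory is d′, and the sketch below assumes that measure purely for illustration; the function name, the log-linear correction, and the trial counts are all hypothetical and are not taken from the study.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Signal-detection sensitivity: d' = z(hit rate) - z(false-alarm rate).

    A log-linear correction (add 0.5 to each count) keeps both rates
    strictly between 0 and 1 so the inverse normal CDF stays finite.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(fa_rate)


# Hypothetical counts at one short SOA, for illustration only.
superordinate = d_prime(hits=80, misses=20, false_alarms=20, correct_rejections=80)
basic_level = d_prime(hits=60, misses=40, false_alarms=40, correct_rejections=60)
print(f"superordinate d' = {superordinate:.2f}, basic level d' = {basic_level:.2f}")
```

Under this assumed measure, a higher d′ for the natural/man-made judgment than for basic level judgments at the same SOA would correspond to the pattern described in result (1).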
What Information is Used to Recognize the Gist of a Scene?
What information do people use to rapidly categorize a scene as a “beach,” “street,” “mountain,” etc.? Some prominent computational theories of scene gist recognition have proposed the counter-intuitive and provocative hypothesis that the unlocalized amplitude spectrum of images, that is, their spatial frequencies and orientations without regard to their location in the image, provides much of the most important information for categorizing a scene. In simple terms, this suggests that for recognizing a beach scene, it is more important to know that there is a strong horizontal and a strong diagonal than to know that the horizontal (the horizon) is above the diagonal (the water line). However, our studies with human subjects suggest that while the spatial frequencies and orientations of an image certainly play some role in recognizing it, they are not enough by themselves to categorize a scene; localized information is necessary for that. The importance of localization therefore suggests that the layout of a scene (the scene’s global configuration) is probably very important in recognizing its gist.
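To make "unlocalized" concrete, the following toy sketch computes the amplitude of a two-dimensional discrete Fourier transform for a tiny image and shows that moving a bright horizontal band to a different height leaves the amplitude spectrum unchanged, because the position information is carried by the phase. The 8×8 image, the circular shift, and the helper names are assumptions made for illustration; they are not the stimuli or the analysis used in the studies, and a real analysis would use an FFT library on full-size images.

```python
import cmath


def dft2_amplitude(img):
    """Amplitude (magnitude) of the 2D discrete Fourier transform of a small image."""
    rows, cols = len(img), len(img[0])
    amp = []
    for u in range(rows):
        out_row = []
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2 * cmath.pi * (u * y / rows + v * x / cols)
                    total += img[y][x] * cmath.exp(1j * angle)
            out_row.append(abs(total))  # discard phase, keep amplitude
        amp.append(out_row)
    return amp


def shift_rows(img, k):
    """Circularly shift the image content down by k rows."""
    k %= len(img)
    return img[-k:] + img[:-k]


# Toy "scene": a bright horizontal band in the top three rows.
band_on_top = [[1.0] * 8 for _ in range(3)] + [[0.0] * 8 for _ in range(5)]
band_in_middle = shift_rows(band_on_top, 3)  # same band, different location

amp_a = dft2_amplitude(band_on_top)
amp_b = dft2_amplitude(band_in_middle)
max_diff = max(abs(a - b) for ra, rb in zip(amp_a, amp_b) for a, b in zip(ra, rb))

print("images identical:", band_on_top == band_in_middle)
print("amplitude spectra match:", max_diff < 1e-9)
```

The two images differ in layout yet share an amplitude spectrum, which is the sense in which amplitude information alone cannot say whether the horizon lies above or below another structure in the scene.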
[Example mask thumbnails: White Noise Mask, RISE Mask, Recognizable Mask]
A related topic that we have investigated is the masking of scene gist. Visual masking occurs when one stimulus interferes with the processing of another stimulus. Masking is an important tool for studying the time course of visual processing, and it has a history of more than 100 years in the field of psychology. Yet very little is known about the masking of complex stimuli like scene images, or of relatively high level perceptual tasks such as scene gist recognition. We have compared the effects of low level spatial masking (i.e., masking by spatial frequencies and orientations) with the effects of higher level “conceptual masking” (i.e., masking by meaning). Previous research has shown that recognition memory for a scene is more strongly masked by a recognizable scene (i.e., a scene masking another scene) than by meaningless noise, and this has been used to argue for the existence of conceptual masking. A key hypothesis we have tested is that such conceptual masking effects are actually due to the greater visual similarity between (1) any given pair of scenes than between (2) any given scene and random noise. Our results do not rule out the existence of conceptual masking of scene gist, because pure visual similarity, in terms of spatial frequencies and orientations, cannot explain all of the masking produced by a recognizable scene mask. However, our results also show that a good proportion of what has been called conceptual masking (namely, the greater masking produced by a recognizable scene compared to that produced by white noise) can actually be produced by an unrecognizable noise image that shares many statistical properties with a scene. Such research holds the potential to expand our understanding of both scene gist processing and the masking of complex stimuli.
Loschky, L. C., Simons, D. J., Smerchek, S., Matz, E., Bilyeu, B., & Artman, L. (2007). Is unlocalized amplitude information of any use for scene gist recognition? [Abstract]. Journal of Vision, 7(9), 1051a, http://journalofvision.org/7/9/1051/
Loschky, L. C., Sethi, A., Simons, D. J., Pydimarri, T. N., Forristal, N., Corbeille, J., et al. (2006). The roles of amplitude and phase information in scene gist recognition and masking [Abstract]. Journal of Vision, 6(6), 802a, http://www.journalofvision.org/6/6/799/
Loschky, L. C., Sethi, A., Simons, D. J., Ochs, D., Corbeille, J., & Gibb, K. (2005, November). Using visual masking to explore the nature of scene gist. Poster presented at the 46th Annual Meeting of the Psychonomic Society, Toronto, Canada.
Loschky, L. C., & Simons, D. J. (2004). The effects of spatial frequency content and color on scene gist perception [Abstract]. Journal of Vision, 4(8), 881a, http://journalofvision.org/4/8/881/
Past and Present Collaborators on this Work:
Colleagues at Other Institutions:
Bruce Hansen (Colgate University)
Sebastian Pannasch (Aalto University)
Amit Sethi (Indian Institute of Technology, Guwahati)
Daniel J. Simons (University of Illinois at Urbana-Champaign)
Graduate Students at Kansas State University:
Adam Larson
Tyler Freeman
Katrina Ellis
Ryan Ringer
Tejaswi Pydimarri (graduated)
Undergraduate RAs at Kansas State University:
Elise Matz, Dan Ochs, Jeremy Corbeille, Katie Brewton, Laura Artman, Ben Bilyeu, and Nick Forristal (Psychology, Kansas State University), Scott Smerchek (Computer & Information Science, Kansas State University)