Recognizing the Gist of a Scene

People can recognize the meaning of a scene, or the "gist," during their first eye fixation on that scene; for example, they can recognize that it is a beach, a dining room, or a street. Our own research has shown that viewers can recognize the gist of a scene at over 80% accuracy after as little as 36 milliseconds of uninterrupted processing time (click the above image to see an example). This raises a few questions: how are we able to recognize images so rapidly, and what information do we use to recognize them? Answering these questions is important for our understanding of scene perception, because research has shown that the gist of a scene uses our prior knowledge associated with the scene's category (e.g., that beaches have water, sand, palm trees, and possibly sunbathers). This knowledge strongly influences where we pay attention, may help us recognizing objects in the scene, and plays a big role in determining what information we remember from a scene. At its core, research on scene gist recognition explores the interface between perception and cognition—a problem that has proved extremely challenging to workers in both cognitive psychology and artificial intelligence. Such research can be applied to designing artificial intelligence systems capable of recognizing the categories of scenes.

We have carried out a number of studies on scene gist recognition over the past several years, and these are described below.

The Roles of Central vs. Peripheral Vision in Scene Gist Recognition

JOV_animation

An interesting question is: which region of the visual field is most useful for recognizing the gist of a scene, central vision (the fovea and parafovea), based on its higher visual acuity and importance for object recognition, or the periphery, based on its larger size and how lower spatial frequencies are useful for scene gist recognition? (Here are links to a YouTube video describing the results of a study on this topic, and also a newspaper article covering it by United Press International.)

We have done a number of studies investigating this issue. In these studies, scenes were presented in two experimental conditions: a "Window" condition with a circular region showing the central portion of a scene but with peripheral information hidden, or a "Scotoma" condition with the central portion of a scene hidden and only the peripheral information available (Loschky & Larson, 2009). Results indicated that the periphery was more useful than central vision for maximum performance (about equal to seeing the entire image!). Nevertheless, central vision was more efficient for scene gist recognition than the periphery on a per-pixel basis. A critical radius of 7.4º was found where the Window and Scotoma performance curves crossed, producing equal performance. This value was compared to predicted critical radii from cortical magnification functions on the assumption that equal V1 activation would produce equal performance. However, these predictions were systematically smaller than the empirical critical radius, suggesting that the utility of central vision for gist recognition is less predicted by V1 cortical magnification.

In addition to scene gist recognition varying over space according to central versus peripheral vision, other studies in our lab have investigated how scene gist recognition varies over time. Scene gist is recognized within a single fixation. However, we have investigated whether gist recognition varies over time within that one fixation. A related issue is whether attentional focus affects scene gist recognition (Evans & Triesman, 2005; Li, et al., 2001). Our previous research showed that both central and peripheral information can produce equal scene gist recognition, provided there is roughly twice as much area in the periphery. However, those studies did not vary processing time (through masking) or manipulate attention. Therefore, we presented "Window" or "Scotoma" conditions using a critical radius, such that both window and scotoma images produced equal gist accuracy when unmasked (i.e., unlimited processing time). We briefly presented images for 24 ms each and varied processing time via the target-to-mask stimulus onset asynchrony (SOA). Our results have shown that at very short SOAs, central information produces better gist recognition than peripheral information, though with unlimited processing time in a single fixation (i.e., no-mask), performance is equal for central and peripheral information. Other research from our lab has also supported this idea, finding that central vision is better at processing scene category early (during the first 100 ms of viewing a scene), while peripheral vision becomes increasingly useful after that time (Larson, Freeman, Ringer, & Loschky, 2013). This indicates that spatiotemporal dynamics of attention play an important role and affect gist recognition, setting spatiotemporal limits on how quickly real-world scenes can be comprehended.

These results are consistent with a zoom-out hypothesis of covert attention where attention is first focused at the center of vision and then rapidly spreads outward, and this affects scene gist recognition.

What Categorical Level of Scene Gist is Perceived First?

What level of categorization occurs first in scene gist processing, the basic level (a beach versus a city) or the superordinate level (a "natural" scene versus a "man-made" scene)? The Spatial Envelope model of scene classification and human gist recognition (Oliva & Torralba, 2001) assumes that the superordinate distinction is made prior to basic level distinctions. This assumption contradicts the claim that categorization occurs at the basic level before the superordinate level (Rosch et al., 1976). We carried out a study to test this assumption of the Spatial Envelope model by making viewers categorize briefly flashed, masked scenes after varying amounts of processing time. The results showed that early stages of processing (SOA < 72ms) produced greater sensitivity to the superordinate distinction than basic level distinctions, and also that basic level distinctions crossing the superordinate natural/man-made boundary are treated as a superordinate distinction (Loschky & Larson, 2010). Both results support the assumption of the Spatial Envelope model, and challenge the idea of basic level primacy.

What Information is Used to Recognize the Gist of a Scene?

What information do people use to rapidly categorize a scene as a "beach," a "street," a "mountain," etc.? Some prominent computational theories of scene gist recognition have proposed the counter-intuitive and provocative hypothesis that the unlocalized amplitude spectrum of images (their spatial frequencies and orientations) provides most of the important information for categorizing a scene, without regard to their location in the image. In simple terms, this suggests that for recognizing a beach scene, it is more important to know that there is a strong horizontal and a strong diagonal than to know that the horizontal (the horizon) is above the diagonal (the water line). However, our studies with human subjects suggest that while the spatial frequencies and orientations of an image certainly play some role in recognizing it, they are not enough to categorize a scene by themselves — localized information is necessary for that (Loschky et al., 2007; Loschky & Larson, 2008). The importance of localization therefore suggests that the layout of a scene (the scene's global configuration) is probably very important in recognizing its gist.

White Noise Mask	Phase-Randomized Mask	Recognizable Mask

A related topic that we have investigated is the masking of scene gist. Visual masking is when one stimulus interferes with the processing of another stimulus (click the appropriate thumbnail above to see a demonstration). Masking is an important tool for studying the time course of visual processing, and it has a history of over 100 years in the field of Psychology. However, very little is known about the masking of complex stimuli like scene images, or relatively high level perceptual tasks such as scene gist recognition. We have compared the effects of low level spatial masking (i.e., masking by spatial frequencies and orientations) with the effects of higher level "conceptual masking" (i.e., masking by meaning) (Loschky et al., 2010). Previous research has shown that recognition memory for a scene is more strongly masked by a recognizable scene (i.e., a scene masking another scene) than by meaningless noise, and this has been used to argue for the existence of conceptual masking. A key hypothesis we have tested is that conceptual masking effects are actually due to the greater visual similarity between any given pair of scenes versus any given scene compared with random noise. Our results do not rule out the existence of conceptual masking of scene gist, because pure visual similarity, in terms of spatial frequencies and orientations, cannot explain all of the masking produced by a recognizable scene mask. However, our results also show that a good proportion of what has been called conceptual masking (namely, the greater masking produced by a recognizable scene compared to the masking produced by white noise) can actually be produced by an unrecognizable noise image that shares many statistical properties with a scene. Other research has indicated that the effects of masking on rapid scene categorization vary depending on the Fourier spectral properties of the masks (Hansen & Loschky, 2013). Such research holds the potential to expand our understanding of both scene gist processing and the masking of complex stimuli.

How Do Humans and a Highly Divergent Species, Pigeons, Compare?

Some of our recent research has taken a new approach, studying how the scene gist categorization skills of humans compare to those of other evolutionarily highly divergent species – in this case, pigeons (Kirkpatrick, Bilton, Hansen, & Loschky, 2013).

A set of experiments used captive-bred homing pigeons to look at how they generalize and discriminate scenes, and also how quickly those processes occur. Using real-world scenes (beaches, mountains, and streets), the first experiment found that pigeons could discriminate between natural and man-made scenes (beach vs. street) and also between two different natural scenes (beach vs. mountain) relatively quickly. However, the pigeons were slower than humans, needing longer stimulus durations - roughly 10 times longer than humans. When presented with new examples of these categories, they were able to generalize their experience to discriminate between these new scenes as well, indicating that they were able to recognize categories instead of simply remembering specific scenes.

A second experiment looked at how the real-world scenes could be discriminated based on their image statistics, and found that pigeons use image statistics to categorize scenes just as humans do. Performance was worse for scenes with a terrestrial viewpoint, but improved with training. These experiments suggest that both humans and pigeons are able to discriminate between scene categories very quickly (though pigeons require more time to do this) and also that both humans and pigeons use image statistics in order to categorize scenes.

This line of research has interesting implications because the two species are very different, yet show similar rapid scene categorization abilities – though there are some apparent differences that may indicate evolutionary adaptive specializations between the two species in the past (Kirkpatrick, Bilton, Hansen, & Loschky, 2013).

Comparing Rapid Categorization of Aerial and Terrestrial Views of Scenes: A New Perspective on Scene Gist Recognition

We see aerial views of scenes from airplanes and satellite images in web applications like Google Earth fairly frequently, but how is our ability to recognize the gist of these aerial views similar to and different from that of terrestrial views (seen from the ground in everyday life)? Scene gist, a viewer’s holistic representation of a scene from a single eye fixation, has been extensively studied for terrestrial views, but not for aerial views. Over the last several years we have carried out a number of experiments investigating this issue. In a recent paper (Loschky, Ringer, Ellis & Hansen, 2015), we compared rapid scene categorization of both views in three experiments to determine the degree to which diagnostic information is view dependent versus view independent. We found large differences in observers’ ability to rapidly categorize aerial and terrestrial scene views, consistent with the idea that scene gist recognition is viewpoint dependent. In addition, computational modeling showed that training models on one view (aerial or terrestrial) led to poor performance on the other view, thereby providing further evidence of viewpoint dependence as a function of available information. Importantly, we found that rapid categorization of terrestrial views (but not aerial views) was strongly interfered with by image rotation, further suggesting that terrestrial-view scene gist recognition is viewpoint dependent, with aerial-view scene recognition being viewpoint independent. Furthermore, rotationinvariant texture images synthesized from aerial views of scenes were twice as recognizable as those synthesized from terrestrial views of scenes (which were at chance), providing further evidence that diagnostic information for rapid scene categorization of aerial views is viewpoint invariant. We discuss the results within a perceptual-expertise framework that distinguishes between configural and featural processing, where terrestrial views are more effectively processed due to their predictable view-dependent configurations whereas aerial views are processed less effectively due to reliance on view-independent features.