The Moment Camera

Michael F. Cohen and Richard Szeliski
Microsoft Research

Future cameras will let us “capture the moment,” not just the instant when the shutter opens. The moment camera will gather significantly more data than is needed for a single image. This data, coupled with automated and user-assisted algorithms, will provide powerful new paradigms for image making.

Before the advent of the camera, artists were tasked with recording events and providing a visual history of their world. Although a great deal of early art recorded religious or mythical stories, by the 16th century, artists in the Netherlands began depicting scenes of normal life, typified by Pieter Bruegel's paintings (www.ibiblio.org/wm/paint/auth/bruegel/). Although no one believes that all the action depicted in these scenes took place at the same instant, Bruegel successfully captured the moment.

The moment provides a key concept, both in our article title and in the preceding sentence. What might we mean by a moment in this context? To illustrate this concept, we can construct an axis that runs from the objective to the subjective, as Figure 1 shows. At the objective end, a photograph provides some semblance of an event's objective visual record. That same visual event evokes a different internal experience in each of us. At the subjective end of the axis, personal experiences of external stimuli are often referred to as qualia in philosophical discussions.1

Figure 1. A moment. Although subjective, a moment lies close enough to the objective axis to represent a shared experience of a scene. (The axis runs from objective to subjective and from universal to personal, with photograph, moment, and qualia marked along it.)

Somewhere, close to the objective end of the axis but still subjective, lies a point we call a moment. While a quale is by definition both subjective and personal, a moment is subjective but universal. For example, people spend about 10 percent of their waking life with their eyes closed2—a person's normal, resting blink rate being 20 closures per minute, with the average blink lasting one-quarter of a second. Yet, when looking at our friends, we universally do not see them as having their eyes closed unless we consciously concentrate on their blinking.

On the other hand, taking a photograph of a friend often surprises us because the picture reveals closed or partially closed eyes, as Figure 2 shows. The rather awkward expression of half-closed eyes clearly does not capture the moment, because it does not correspond to what we experience when looking at our friend.

With the advent of the camera in the mid-19th century, art began to move away from realistic depiction into the more abstract realms of Impressionism, Cubism, and purer abstraction. The camera, although capable of capturing instants in time, cannot on its own—except in rare instances—truly record moments.

When coupled with computation and a user interface, digital cameras can bring back the ability to capture moments as opposed to just instantaneous snapshots. Such computational cameras or computational photography systems can provide a wealth of opportunities for both professional and casual photographers.
Our hypothetical moment camera contains new light-capture modalities that can leverage several recent research developments in computer graphics, computer vision, and the subfield at their intersection, image-based rendering.

THE MOMENT CAMERA

When turned on, current digital cameras constantly scan the scene they are pointed at, responding to changing lighting conditions by modifying their shutter speed or aperture and setting the focus to adapt to depths in the scene. Meanwhile, the user points the camera, trying to frame a shot, and waits for that elusive instant to push the button to record the light entering the aperture and landing on the sensor. At that instant, the camera might decide to fire the flash, at which time the total light landing on the sensor during a fixed exposure interval is mapped to a raw image. This image typically receives further processing from a demosaicing algorithm before being compressed into a JPEG image for transfer to the permanent memory medium.

Imagine a modification in the camera's underlying functionality that keeps it always recording, somewhat like a DV camera in record mode. Thus, rather than recording only a snapshot, the camera constantly records time slices of imagery. Let's assume one frame every 100th of a second or less, depending on the mode. Let's also assume a finite round-robin buffer of perhaps 500 frames, or 5 seconds, resulting in a spacetime slab in memory at all times. We can think of this most easily as a short video sequence. We refer to this new device as a moment camera.

When coupled with computational photography algorithms and an appropriate user interface, this somewhat unremarkable change in functionality provides many new possibilities. To demonstrate the technology today, we simulate the moment camera either with a still camera taking multiple photographs in succession or with a current DV camera running at 30 frames per second and, unfortunately, at a significantly lower resolution.

MOMENT CAMERA PROCESSING STEPS

Although the moment camera's input is a spacetime slab, its output is typically a single image. Thus, the processing primarily selects the color for each output pixel given the set of input images in the spacetime slab. This processing typically includes the following steps:

1. Align or warp the input images so that, at any single output pixel, all input pixels represent the same point in the world as closely as possible.
2. For each output pixel, select the best of the input pixels that map to it.
3. Adjust the selected pixel's color to blend seamlessly with its neighbors.

The first step, aligning images, is most often done by finding features in the images, then matching features across images to determine transformations for each image and align them in a global space.3 Alternatively, dense correspondence fields can be computed and used to perform the alignment.4

The second step involves an optimization that, for each pixel, tries to locally make the best selection based on predefined or interactively defined criteria while globally trying to maintain smoothness. We refer to the local criterion for selecting any particular pixel as the data cost and to the cost of transitioning from a pixel of one time slice to another as the smoothness cost. In early work, Image Stacks (http://research.microsoft.com/research/pubs/view.aspx?tr_id=666) relied on the user to make most decisions.
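To make the second step concrete, the following sketch (our illustration, not a published implementation; the function and variable names are ours) performs the simplest possible per-pixel selection over an aligned spacetime slab: each output pixel takes its color from the time slice with the lowest data cost. A real system would add the smoothness cost and solve the resulting labeling problem with graph cuts, as discussed below.

    import numpy as np

    def composite_from_slab(slab, data_cost):
        # slab      : (T, H, W, 3) array, the aligned spacetime slab
        # data_cost : (T, H, W) array, cost of choosing time slice t at pixel (y, x)
        # Greedy selection: independently at every pixel, pick the cheapest slice.
        # A full implementation would add a pairwise smoothness cost and solve the
        # labeling with graph cuts instead of a per-pixel argmin.
        labels = np.argmin(data_cost, axis=0)              # (H, W) chosen slice per pixel
        rows = np.arange(labels.shape[0])[:, None]
        cols = np.arange(labels.shape[1])[None, :]
        return slab[labels, rows, cols]                    # (H, W, 3) composite

A data cost as simple as each pixel's distance from its median value over time, for example, favors the static background and tends to remove transient objects.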
More recent applications, including Photomontage1 and Seamless Image Stitching,5 explore the definition of the data and smoothness costs, either by the user or automatically. To achieve the trade-off between optimizing each pixel individually and creating a seamless result, applications often use graph cut techniques6 as the optimization method.

In the third and final step, the pixel value can be modified either to adjust the virtual exposure or to compensate for other differences between images. For example, gradient domain blending modifies pixel values to match across seams while trying to maintain local gradients.7,8

We rely on these three steps in the examples that follow.

STILL CAMERA MODES

The moment camera can be used in a variety of modes. Each mode determines some aspects of the actual capture, but perhaps more importantly, it guides the user interface. We do not describe the details of each UI here because any real-world implementation will require much more thought and experimentation.

Point and shoot

In its simplest mode, from a user's perspective, the moment camera operates much like a current point-and-shoot camera. The user simply frames the shot and presses a button. However, unlike a current camera, the moment camera records images continuously, not just at the instant the user presses the shutter button. As it records, the camera rapidly varies the exposure times, bracketing the neutral setting. The camera tests multiple points in the scene for focus and records images with varying focus settings. If low light is an issue, the flash can fire during a subset of the exposures. Meanwhile, time is inexorably marching forward, so the images vary during the time they are taken. When the user pushes the button, the camera records a slab of spacetime extending from a couple of seconds in the past to perhaps a second or two into the future for further processing.

Figure 2. The blink of an eye. Although these two photographs were taken a fraction of a second apart, only the second one captures the moment.

The point-and-shoot moment camera supports several relatively simple application scenarios, including the following:

• Wind time backward or forward. The instant captured by the button push often misses that fleeting expression. Selecting a better frame, as in Figure 2, more accurately captures the moment.
• Flash/no flash.9,10 Low-light situations often lead to very noisy results, as Figure 3a shows. Using a flash can reduce the noise, but at the cost of ruining the subtle lighting, as Figure 3b shows. Because the spacetime slab contains both flash and no-flash images, the high-frequency details from the flash image can be combined with a smoothed version of the no-flash image to obtain a desired low-noise image while maintaining the original lighting, as Figure 3c shows.
• Expanded depth of field. Particularly when taking close-up shots, getting the whole object in focus simultaneously can be difficult. While the autofocus seeks a consensus depth on which to focus, the moment camera records multiple images with different focus settings. Thus, for every pixel location, the slab contains multiple versions of the same point with varying focus, as Figure 4 shows. Maximizing the focus involves detecting which pixel has the highest local contrast and selecting it, while simultaneously maintaining coherence using a smoothness term in the optimization criterion (see the sketch following this list).
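As one example of a data cost, the expanded-depth-of-field mode can score each time slice with a simple focus measure. The sketch below is our illustration (the focus measure, a smoothed absolute Laplacian of the luminance, and the names are our choices); its output can be fed to the greedy compositor sketched earlier, again omitting the smoothness term.

    import numpy as np
    from scipy.ndimage import laplace, uniform_filter

    def focus_data_cost(slab, window=9):
        # slab : (T, H, W, 3) array of aligned frames taken at different focus settings.
        # A pixel is cheap in the slice where its local contrast (smoothed absolute
        # Laplacian of the luminance) is highest, i.e. where it appears sharpest.
        gray = slab.mean(axis=-1)                                       # (T, H, W)
        sharpness = np.stack(
            [uniform_filter(np.abs(laplace(g)), size=window) for g in gray])
        return -sharpness                                               # lower cost = sharper

    # all_in_focus = composite_from_slab(slab, focus_data_cost(slab))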
High-dynamic-range imagery and tone mapping

Current digital cameras suffer from limited dynamic range: They cannot image both very bright areas and dark areas in the same exposure. To compensate for this, multiple exposures—bracketed shots—can be merged to get a wider dynamic range.11 Inside a moment camera, this kind of bracketing can be performed automatically, taking additional underexposed and overexposed shots when the camera detects that it is not adequately capturing the full dynamic range in a single shot. Global alignment followed by local optic flow can compensate for possible motion in the scene, as Figure 5 shows.4

Once a wide-dynamic-range image has been assembled, the camera can store it either in an extended dynamic-range image format for further processing or tone-map it back to a displayable 8-bit gamut. A more intelligent moment camera not only performs this processing onboard, but also lets the user interactively steer the tone-mapping process by indicating at a high level which regions should be brighter or darker or more or less saturated.12

Figure 3. Flash versus no flash. (a) A noisy, no-flash image and (b) a low-noise flash image combine to produce (c) a low-noise image with good lighting.

Figure 4. Expanded depth of field. The (a) single focal plane image is less detailed than (b) a composite of multiple focal plane images.

Figure 5. High-dynamic-range imagery. The moment camera can merge multiple exposures—bracketed shots—to get a wider dynamic range comparable to nondigital film techniques: (a) exposures merged without motion compensation versus (b) those with motion compensation.

Group Shot

When taking a picture, we often catch a person with their eyes closed. Taking a picture of a group of people exponentially increases the difficulty of avoiding this—it becomes almost impossible to capture an instant when everyone is smiling with their eyes open.

With an application such as Group Shot (http://research.microsoft.com/projects/GroupShot/), a user can assemble an ideal group photograph from multiple shots. The user indicates the best instance of each person, and the system finds the best jigsaw-puzzle-like regions that it can compose to create a seamless final image, as Figure 6 shows.

The moment camera can perform this operation in-camera to help ensure the creation of a successful composite. While viewing the scene, the user points at each person when they smile and look at the camera. Graph cut picks out a region around each selection to cut into the final composite and records a thin spacetime slab for that region. This can be repeated until a successful shot is created. Slight time shifts can be made on each region independently to perfect the result.

Panoramas: Widening the field of view

We are often confronted with a majestic scene—think of the Grand Canyon—that will not fit into the viewfinder. Multiple overlapping images can be stitched into a single panoramic image. Several applications can now do this after the fact. Many problems remain, however, that a moment camera could remedy.

The first problem is coverage. Without careful planning, we often miss parts of the scene. This happens most often in large sky areas or when the interesting parts of the scene lie at different heights in different directions. The results often have gaps or a snakelike shape rather than being a rectangular panorama.
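The pairwise alignment that stitching relies on (step 1 of the processing pipeline) can be sketched with off-the-shelf feature matching. The snippet below uses OpenCV's ORB features and RANSAC homography estimation as a stand-in for the feature-based registration cited earlier; it is our illustration, not the implementation behind the systems described here.

    import cv2
    import numpy as np

    def align_pair(src, dst):
        # src, dst : 8-bit grayscale images (np.uint8 arrays).
        # Estimate the homography mapping src onto dst using ORB features,
        # brute-force descriptor matching, and RANSAC, then warp src into
        # dst's coordinate frame. A panorama composites many such warped
        # frames and then runs seam selection and blending (steps 2 and 3).
        orb = cv2.ORB_create(2000)
        k1, d1 = orb.detectAndCompute(src, None)
        k2, d2 = orb.detectAndCompute(dst, None)
        matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
        pts_src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        pts_dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(pts_src, pts_dst, cv2.RANSAC, 5.0)
        return cv2.warpPerspective(src, H, (dst.shape[1], dst.shape[0]))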
By providing on-the-fly alignment and stitching, the user can literally paint the panorama, examining the coverage to ensure that the complete scene is captured.13 At the same time, allowing the exposure to vary between overlapping frames can create high-dynamic-range panoramas. Using shorter or longer exposures can adjust areas that appear too light or dark.

Finally, the world usually does not stand still during a panorama's capture. Focusing the graph cut criteria on selecting commonly seen, and therefore most likely static, pixels can avoid including ghostlike figures in the panorama, as Figure 7 shows.

DEPICTING MOTION

While the previous examples purposefully remove transient events to create a consistent still, at times a user might want to explicitly depict motion in a single image. This type of representation dates back to the 19th century. Unless taken under careful conditions, stroboscopic imagery often results in ghostlike representations of the dynamic elements.

Stroboscopic-like stills

Leveraging graph cut, however, we can create stroboscopic-like images. By specifying in the objective function that we want to retain dynamic elements, as opposed to removing them as in Figure 7b, the result resembles Figure 8, which shows a girl swinging across a set of monkey bars.

Figure 6. Group Shot. Working with stored images, the user indicates when each person photographed looks best. The system automatically finds the best regions around each selection to compose into a final group shot.

Cliplets

A spacetime slab is, by definition, the same as a short video sequence. Sometimes a very short subsequence, or cliplet, can capture the moment while still allowing the imagination to fill in what happened just before or after the bit of action. Just as a still image forces the viewer's imagination to fill in what is left out, such short cliplets serve a similar purpose. These short sequences are best viewed by, for example, holding on the first frame for 3 to 4 seconds, then playing the short sequence and holding again on the final frame. Figure 9 provides an example that covers less than one-third of a second.

Figure 9. Spacetime slab. About one-third of a second separates these three time slices of the slab. A cliplet that holds on the first frame, plays the intervening 10 frames, then holds on the last, viscerally depicts the moment.

Motion loops

Some types of motion are more stochastic or repetitive. Examples range from flowing or rippling water to a person sitting still, breathing, and blinking. These motion types are amenable to the creation of looping video textures, which stochastically jump from one frame to a matching frame either forward or backward in time.14 This work has also been extended to panoramic video textures constructed with video taken from a slowly panning camera.15 The spacetime slab that the moment camera captures provides the input needed for these kinds of experiences.

ARTISTIC EXPRESSION

Many of our examples use the moment camera to first capture a spacetime slab and then choose portions of time slices from the slab to construct a final output image. The goal has been to create a seamless result that “captures the moment.” However, more artistic tools can easily be created to combine pixels in the slab in interesting ways. In Figure 10, we have modified the selection mechanism to create surprising artistic effects. Very simple criteria can be modified in real time to provide a wide variety of expressive results.
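Returning to the motion loops described above: the heart of a video texture is finding pairs of frames similar enough that playback can jump between them without a visible seam. The sketch below is our simplified illustration of that frame-matching step (the published method additionally filters the distances over small temporal windows and anticipates future costs).

    import numpy as np

    def loop_transitions(slab, top_k=5):
        # slab : (T, H, W, 3) array, the captured spacetime slab (a short video).
        # Returns the top_k frame pairs (i, j), i != j, with the smallest mean
        # squared difference. A small distance means frame j can stand in for
        # frame i, so playback can cut from frame i - 1 to frame j almost invisibly.
        T = slab.shape[0]
        flat = slab.reshape(T, -1).astype(np.float64)
        dist = np.empty((T, T))
        for i in range(T):
            dist[i] = ((flat - flat[i]) ** 2).mean(axis=1)
        dist[np.eye(T, dtype=bool)] = np.inf        # rule out trivial self-matches
        order = np.argsort(dist, axis=None)[:top_k]
        return [tuple(map(int, np.unravel_index(k, dist.shape))) for k in order]

During playback, whenever the current frame has a good match elsewhere in the slab, the player can jump there with some probability, producing an endlessly varying loop.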
Figure 7. Panoramic composite. (a) The overlapping images are aligned and blended together, resulting in ghosted figures; (b) graph cut finds regions in each image to stitch together to create a consistent scene.

Figure 8. Stroboscopic-like images. Dynamic scenes can be represented by optimizing for dynamic elements while also maintaining consistency.

Future cameras might have even more advanced capabilities than those we've described. For example, cameras that notice when someone is smiling are already being developed. Future cameras could suggest better ways to frame a scene and indicate that we should back up or point the camera just a bit higher. Cameras might someday even learn our habits and help develop a style of their own based on how we use them. In our own work, we are building a moment camera prototype to continue our research in this promising new area.

Acknowledgments

This work represents a sampling of years of research at Microsoft Research and the University of Washington. Our colleagues who helped in this work include Aseem Agarwala, Maneesh Agrawala, Matthew Brown, Patrick Baudisch, R. Alex Colburn, Brian Curless, Mira Dontcheva, Steven Drucker, Hugues Hoppe, Daniel Lischinski, Georg Petschnigg, David Salesin, Drew Steedly, Kentaro Toyama, Matt Uyttendaele, Jue Wang, and Simon Winder.

References

1. “Qualia,” The Stanford Encyclopedia of Philosophy, M. Tye and E.N. Zalta, eds.; http://plato.stanford.edu/archives/sum2003/entries/qualia/.
2. A. Agarwala et al., “Interactive Digital Photomontage,” ACM Trans. Graphics, Aug. 2004, pp. 292-300.
3. M. Brown and D. Lowe, “Recognising Panoramas,” Proc. Int'l Conf. Computer Vision (ICCV 03), IEEE CS Press, vol. 2, Oct. 2003, pp. 1218-1225.
4. S.B. Kang et al., “High Dynamic Range Video,” ACM Trans. Graphics, July 2003, pp. 319-325.
5. A. Eden, M. Uyttendaele, and R. Szeliski, “Seamless Image Stitching of Scenes with Large Motions and Exposure Differences,” Proc. IEEE Computer Soc. Conf. Computer Vision and Pattern Recognition (CVPR 2006), IEEE CS Press, 2006, pp. 2498-2505.
6. Y. Boykov, O. Veksler, and R. Zabih, “Fast Approximate Energy Minimization via Graph Cuts,” IEEE Trans. Pattern Analysis and Machine Intelligence, Nov. 2001, pp. 1222-1239.
7. P. Pérez, M. Gangnet, and A. Blake, “Poisson Image Editing,” ACM Trans. Graphics, July 2003, pp. 313-318.
8. A. Levin et al., “Seamless Image Stitching in the Gradient Domain,” Proc. 8th European Conf. Computer Vision (ECCV 2004), vol. 4, Springer-Verlag, 2004, pp. 377-389.
9. E. Eisemann and F. Durand, “Flash Photography Enhancement via Intrinsic Relighting,” ACM Trans. Graphics, vol. 23, no. 3, 2004, pp. 673-678.
10. G. Petschnigg et al., “Digital Photography with Flash and No-Flash Pairs,” ACM Trans. Graphics, Aug. 2004, pp. 664-672.
11. P. Debevec and J. Malik, “Recovering High Dynamic Range Radiance Maps from Photographs,” Proc. Siggraph 97, ACM Press, 1997, pp. 369-378.
12. D. Lischinski et al., “Interactive Local Adjustment of Tonal Values,” ACM Trans. Graphics, to appear Aug. 2006.
13. P. Baudisch et al., “Panoramic Viewfinder: Providing a Real-Time Preview to Help Users Avoid Flaws,” Proc. OZCHI 2005, ACM Int'l Conf. Proc. Series, ACM Press, 2005.
14. A. Schödl et al., “Video Textures,” Computer Graphics, July 2000, pp. 489-498.
15. A. Agarwala et al., “Panoramic Video Textures,” ACM Trans. Graphics, July 2005, pp. 821-827.

Michael F. Cohen is a principal researcher for Microsoft Research.
His research interests include image-based rendering, animation, camera control, artistic nonphotorealistic rendering, linked-figure animation, and computational photography applications. Cohen received a PhD in computer science from the University of Utah. Contact him at mcohen@microsoft.com. Further publications can be found at www.research.microsoft.com/~cohen.

Richard Szeliski, a principal researcher, leads the Interactive Visual Media Group at Microsoft Research. His research interests include digital and computational photography, video scene analysis, 3D computer vision, and image-based rendering. Szeliski received a PhD in computer science from Carnegie Mellon University. Contact him at szeliski@microsoft.com.

Figure 10. Artistic imaging tools. Researchers used a single time-lapse slab of clouds drifting across the sky to create these images. (a) An algorithm picked out, for each pixel location, the time slice with the highest local contrast. (b) A more complex difference function of multiple time slices creates unusual colors when a channel wraps around to indicate colors above 255 or below 0.