A case for gestural interactions in spatial computing

It’s time to reconsider legacy UX patterns and put gestural interactions in the limelight in the age of spatial computing.

Davide Zhang
UX Collective

--

A screenshot of a mixed reality application that uses gestures to manipulate spatial content organized in 3D graphs
Example of a gesture-based AR application (Source: Softspace)

Rethinking Legacy

Interaction has always been an essential part of our relationship with the world, and the most fundamental interactions are the gestures we use all the time. Personal computers introduced a new paradigm for interacting with machines, mediated by the keyboard-and-mouse combo. The smartphone era, with its capacitive multi-touch screens, brought gestures back into the equation. We can tap, swipe, and pinch, but our interaction is still applied to a screen and reduced to the 2D position of a touch target.

As the technology behind spatial computing matures, we are approaching an inflection point after which virtual and augmented reality, in the form of commercial stereoscopic 3D displays, will be desirable enough for mass adoption. Companies with vast resources, like Meta, have a big advantage in defining the way we interact in spatial computing. With the 10 million Meta Quest 2 units already shipped in 2021 and Apple’s upcoming Vision Pro, they seem well on their way. At the center of this emerging spatial medium is the question of interaction: what is the best way of interacting with this medium, and how do we evaluate it?

This question is all the more important now because the answer will directly affect the potential of spatial computing. It is a key factor in determining whether spatial computing brings a giant leap in productivity or becomes just another way of using computers. What we are currently noticing with the maturing hardware and emerging software, however, is the expedient import of legacy interaction paradigms. In much current VR software, 2D user interfaces familiar to computer and smartphone users are simply situated in 3D space. Meta Quest 2’s home menu, shown below, is a case in point: its flat interface is anchored in a 360-degree environment. The social VR app AltspaceVR is another example, where even the background is static, presenting an entirely flat interface in an immersive medium.

A screenshot of the settings panel of the Meta Quest interface
Meta Quest’s home menu settings (Source: Meta Quest Blog)
A screenshot of the home interface of the VR application AltspaceVR
AltspaceVR menu (Source: Cathy Moya)

Interactions in VR inherit from those legacy platforms as well. While VR controllers are a novel input mechanism, they still follow the notion of a cursor in space. In the case of Meta Quest 2’s home menu, a ray shoots from the front of each controller to the interface (known as raycasting in game-development jargon), mapping the controller’s 3D motion to the same output a mouse produces: a 2D position on the interface.
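To make that mapping concrete, here is a minimal sketch of the spatial-pointer model: a controller ray is intersected with a flat UI panel and the hit is reduced to a 2D coordinate. The types and function names are illustrative assumptions, not any engine’s actual API.

```typescript
// Minimal sketch of the "laser pointer" model: a ray from the controller
// is intersected with a flat UI panel, reducing 6-DoF motion to a 2D point.
// All names and types here are illustrative, not any engine's actual API.

type Vec3 = { x: number; y: number; z: number };

const sub = (a: Vec3, b: Vec3): Vec3 => ({ x: a.x - b.x, y: a.y - b.y, z: a.z - b.z });
const add = (a: Vec3, b: Vec3): Vec3 => ({ x: a.x + b.x, y: a.y + b.y, z: a.z + b.z });
const scale = (a: Vec3, s: number): Vec3 => ({ x: a.x * s, y: a.y * s, z: a.z * s });
const dot = (a: Vec3, b: Vec3): number => a.x * b.x + a.y * b.y + a.z * b.z;

interface UiPanel {
  origin: Vec3; // a corner of the panel in world space
  normal: Vec3; // unit normal of the panel
  uAxis: Vec3;  // unit vector along the panel's horizontal edge
  vAxis: Vec3;  // unit vector along the panel's vertical edge
}

// Returns the 2D panel coordinate the controller ray points at, or null if
// the ray is parallel to the panel or the panel is behind the controller.
function raycastToPanel(rayOrigin: Vec3, rayDir: Vec3, panel: UiPanel): { u: number; v: number } | null {
  const denom = dot(rayDir, panel.normal);
  if (Math.abs(denom) < 1e-6) return null;               // ray parallel to panel
  const t = dot(sub(panel.origin, rayOrigin), panel.normal) / denom;
  if (t < 0) return null;                                 // panel is behind the controller
  const hit = add(rayOrigin, scale(rayDir, t));           // 3D hit point on the panel's plane
  const local = sub(hit, panel.origin);
  return { u: dot(local, panel.uAxis), v: dot(local, panel.vAxis) }; // 2D, like a mouse position
}
```

However rich the controller’s motion, the output of this model is still a single 2D point, which is exactly why it feels like a mouse cursor floating in space.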

There is, on the other hand, a trend toward hand tracking as an input method, led by companies like Leap Motion, Microsoft (HoloLens), and now Apple. Meta has also enabled an experimental vision-based hand-tracking feature, which follows the same spatial-pointer model as the controllers. Upcoming mixed-reality headsets from Varjo and Lynx come standard with hand-tracking and even eye-tracking capabilities. Vive, in addition to its headsets, has been leading the full-body-tracking segment with its Vive Trackers.

In light of the current landscape of interactions, on the verge of a major shift in computing platforms, we fear that the dominant paradigms, namely 2D interfaces and interactions, are underselling the value of this medium in the name of user-friendliness or user adoption. There are fundamental merits to gestures that warrant a deeper look into their definitions, classifications, and qualities. This essay hopes to foster the conversation around gestural interactions and lay the groundwork for further exploration and experimentation before patterns are set in stone in this liminal period.

A Starting Point

In fact, the usage of gestures is far more diverse than shooting a laser to select a button. To explore the possibilities of gestural interactions, it is worth getting a better sense of what gestures are and how they are classified. A human gesture, in general, is a movement of the hands, head, or body used to express or emphasize an idea or emotion. In the context of human-machine interaction, gestures can be classified into four major types, following Karam and Schraefel’s taxonomy of gestures: deictic gestures, manipulative gestures, semaphoric gestures, and gesticulation.

Illustrations of different types of gestures according to Karam and Schraefel’s research
Taxonomy of gestures

Deictic gestures refer to pointing in order to establish the identity or spatial location of an object; shooting a laser to serve as a pointer falls into this category. Manipulative gestures are those intended to control an object. Semaphoric gestures are any gesturing system that uses a dictionary of hand or arm gestures as symbols to be communicated to the machine. Lastly, gesticulation is the most natural form of gesturing and often accompanies speech. Current implementations of gestural interaction engage with deictic gestures for pointing and selection (as in Meta Quest 2’s home menu) and manipulative gestures for object manipulation (as in the Leap Motion Interaction Engine), but rarely tap into the potential of the other categories. We set out to do this in the next section of this essay.
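To make the taxonomy, and in particular the “dictionary of symbols” idea behind semaphoric gestures, a bit more concrete, here is a small hypothetical sketch. Every pose name and command mapping in it is illustrative rather than taken from any existing system.

```typescript
// The four gesture classes from Karam and Schraefel's taxonomy, each with an
// illustrative use. The semaphoric class is modeled as a lookup table from
// recognized hand poses to commands, since that is essentially what a
// semaphoric system is. All poses and commands here are hypothetical.

type GestureClass = "deictic" | "manipulative" | "semaphoric" | "gesticulation";

const exampleUses: Record<GestureClass, string> = {
  deictic: "point a ray at a button to select it",
  manipulative: "grab, move, and stack a virtual cube",
  semaphoric: "hold an open palm to open the main menu",
  gesticulation: "gesture freely while talking to a voice assistant",
};

type HandPose = "open-palm" | "thumbs-up" | "fist" | "pinch";

// A semaphoric dictionary: symbolic poses mapped to machine commands.
const semaphoricDictionary: Record<HandPose, string> = {
  "open-palm": "open-main-menu",
  "thumbs-up": "confirm",
  "fist": "cancel",
  "pinch": "select",
};

function onPoseRecognized(pose: HandPose): void {
  console.log(`Pose "${pose}" -> command "${semaphoricDictionary[pose]}"`);
}

onPoseRecognized("open-palm"); // Pose "open-palm" -> command "open-main-menu"
console.log(exampleUses.deictic); // point a ray at a button to select it
```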

Spatial computing is defined by Simon Greenwold as human interaction with a machine in which the machine “retains and manipulates referents to real objects and spaces.” The connection between the real and the virtual is key, and it underpins much of the excitement for the future, whether it is virtual and augmented reality, NFTs, or the Metaverse.

With these definitions in place, we can tackle the core question of interest: why do gestural interactions matter in the context of spatial computing?

Means to an End

To begin with, gestural interactions help us focus on producing rather than on learning to use the tool itself. We are already familiar with gestures because we practice them every day. Professional or creative activities such as 3D modeling, animation, and illustration have a high skill barrier because it takes significant training to learn the interfaces of the relevant software. The learning curve is steep enough that ordinary people with creative ideas face difficulty realizing them. Take the example of stacking objects in a 3D environment, an essential and foundational step in many 3D modeling tasks. If one were to use a piece of professional 3D modeling software like Rhino, they would first need to learn to select the object and drag the gumball. This does not prevent overlapping, however, and for precise results, typing the “Move” command and snapping the vertices of one cube onto another is required. On the other hand, the same action achieved with hand gestures (specifically manipulative gestures), as shown below in a Leap Motion Interaction Engine demo video, is natural to most people, if not trivial.

Virtual hands manipulating and stacking cubes in virtual reality using Leap Motion Interaction Engine
Leap Motion Interaction Engine (Source: Leap Motion)
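As a rough illustration of the manipulative interaction in that demo, here is a hypothetical sketch of a grab-and-stack behavior: a pinch grabs a cube, and releasing it near the top of another cube snaps it into place. The types, thresholds, and event names are assumptions for illustration, not Leap Motion’s actual API.

```typescript
// Hypothetical sketch of a manipulative "grab and stack" interaction:
// a pinch grabs a cube, releasing it near the top of another cube snaps it
// into a stacked position. Names and thresholds are illustrative only.

type Vec3 = { x: number; y: number; z: number };

interface Cube {
  id: string;
  center: Vec3;
  size: number; // edge length in metres
}

const SNAP_DISTANCE = 0.05; // metres; how close counts as "near enough" to snap

function distance(a: Vec3, b: Vec3): number {
  return Math.hypot(a.x - b.x, a.y - b.y, a.z - b.z);
}

// While pinching, the grabbed cube simply follows the hand.
function onPinchMove(grabbed: Cube, handPosition: Vec3): void {
  grabbed.center = { ...handPosition };
}

// On release, snap onto the nearest cube whose top face is within reach.
function onPinchRelease(grabbed: Cube, others: Cube[]): void {
  for (const target of others) {
    const topOfTarget: Vec3 = {
      x: target.center.x,
      y: target.center.y + target.size / 2 + grabbed.size / 2,
      z: target.center.z,
    };
    if (distance(grabbed.center, topOfTarget) < SNAP_DISTANCE) {
      grabbed.center = topOfTarget; // stacked exactly on top, no overlap
      return;
    }
  }
  // Otherwise the cube stays where the hand left it.
}
```

The point of the sketch is not the code itself but the interaction model: the user’s intent is expressed directly through the hand, and the precision work (preventing overlap, aligning faces) is absorbed by the system rather than by a command vocabulary the user must memorize.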

While stacking cubes in Rhino might not take long to learn at all, the gap between the learning curves of professional software interfaces and gesture-based interactions quickly widens for more advanced tasks. The well-known professional 3D sculpting software ZBrush is notorious for its steep learning curve: the user spends a significant amount of time learning the interface instead of practicing concepts inherent to sculpting itself. The creative process is more likely to be limited by familiarity with the interface in the case of ZBrush than with, say, a gesture-based counterpart in VR. While the examples above are domain-specific, they are compelling existence proofs rather than limits on the scope of our argument; gestural interaction’s role in reducing the friction of using software could be examined in other fields as well.

One of the reasons for this steep learning curve is the lack of affordances in much of current user interface design. Affordance generally refers to the quality of an object that suggests its use. The handle of a teapot, for example, suggests that it can be grabbed by a human hand. Affordance reduces cognitive load because we perceive an object’s use automatically, without training; in this sense, designs with good affordances are inherently user-friendly. Gibson, in his “Theory of Affordances,” argues that affordance is embedded not only in the physicality of the object itself but also in the environment it is situated in: it is an invariant property of an ecological object. Yet neither the physicality of an object nor its environment is present in modern interfaces, which have gone through a transition from skeuomorphism (mimicking the physical qualities of real-world counterparts) to flat design (abstraction and 2D elements) and neumorphism (somewhere in between the two).

A diagram illustrating the rough history of UI design from physical affordances (the tea pot handle), to skeuomorphic design (UI buttons look like real buttons), and to flat design (iOS 7 interface)
(Rough) history of UI design

In the process of abstraction, the affordances that come naturally to us through visual perception are lost because interface elements are detached from their 3D physicality and environment. We are at an inflection point where spatial computing could bring back these qualities through immersive environments and 3D interfaces, leveraging our natural ability to perceive affordances. Ultimately, the interface is a means to an end, an intermediary between us, the users, and the task we seek to perform. Reducing the friction of interaction helps us better achieve that end.

Gesturing is Thinking

While it is known to many that gestures, or more specifically gesticulation, help us communicate more effectively, they play a more fundamental role than assisting expression: they also shape our thinking. According to Annie Murphy Paul in her book The Extended Mind, gestures help consolidate our initial, nebulous thoughts, which are then expressed through speech and gesticulation. In this phase, we gesture profusely to form insights, and as our emerging understanding of an idea concretizes, our language becomes more precise and takes on a bigger role in communicating. Gestures also help us learn better. They are useful for understanding concepts that words struggle to capture entirely, concepts that are often visual, relational, and spatial. Paul provides the example of college students learning a geological concept: when asked to visualize and reason about the interior of a three-dimensional object from what is visible on its surface, those who gestured while preparing for the test scored higher.

Another merit of gestures is reinforcing memory. Research has shown that gestures decrease working-memory load by conveying the same information through a visual modality. Retention is also improved, according to Paul, because gestures put a proprioceptive “hook” onto the information, establishing a connection between sensing the hand’s position in space and the information to be memorized. In other words, we could create custom semaphoric gestures to make use of their power of symbolic association. This insight has productive implications for mnemonic techniques such as memory cards. Instead of the text-based, two-dimensional cards we are used to, could there be an opportunity for a gesture-based spatial memory-card system? By relating complex concepts or information to gestures, we leverage the proprioceptive hook to retain them. This mechanism could also be integrated into any application that would benefit from better user memory retention; a new approach to onboarding, for example.

On a more philosophical front, gestural writing and making help us think better and reflect on our work. Every medium contains within it a mode of time. That of paper, for example, is much longer than that of a typewriter because writing on paper takes more time. The time it takes to inscribe onto the surface we call paper is the time in which we think and reflect, argues John May in his book Signal, Image, Architecture. This gestural labor, which we prided ourselves on before the age of computers, creates what May calls a linear historical consciousness: a historical sensibility in which the past was tied to the future. In this mindset, we reflect on the past and contemplate the future by design, through every sequential brushstroke we lay down and every successive word we speak. By contrast, we now live in “real time,” a statistical alternative to the linear model, thanks to computerization and signalization. May cautions us about the implications of this shift in the model of time. Manually reproducing an artwork or transcribing a text, for example, necessarily involves more thinking than pressing Ctrl-C and Ctrl-V on a keyboard. Gestures in the context of spatial computing, therefore, may offer an opportunity to bring back that historical sensibility in a purely signalized world.

Starting the Conversation

To summarize, there are cognitive, communicative, and mnemonic benefits to gestural interactions, and spatial computing could augment this fundamental human trait. In the larger context, raising the case for gestural interactions in spatial computing, as opposed to conveniently importing design patterns from legacy platforms, is all the more timely. There are a few reasons for this. The first is that companies have a business motive to expand their user base quickly and therefore prioritize the traditional UX framework of “user-friendly” design; these motives may not produce the interactions that truly empower us. The second is that efficiency does not equate to effectiveness. Much of the research on human-computer interaction evaluates interactions with quantitative metrics such as the time taken to complete a task. There could be another way of evaluating interactions, one that prioritizes the quality of what we produce over the speed at which we produce it. Of course, we are aware of the risk of pursuing novelty for novelty’s sake in this endeavor, but we believe it is critical that we start the conversation now, understand the stakes involved, and take part in designing our future relationship with technology.

I would like to express my thanks to Yiliu, who is building the aforementioned spatial thinking tool Softspace, for inspiring me to write this essay and for his deep understanding of space, knowledge, and interaction.

Mixed reality designer and prototyper passionate about interactive and experiential paradigms in spatial computing. Microsoft Mixed Reality, Harvard