Researchers at Columbia University’s School of Engineering and Applied Science have designed a robot capable of displaying realistic lip motions for speech and singing.
Past research has shown that most people focus on lip movements during face-to-face conversations. Replicating those movements in a robot, however, remains a persistent challenge, and even the most advanced robots on the market today produce, at best, Muppet-like gestures when communicating.
Now, the Columbia University team, led by Hod Lipson, the James and Sally Scapa Professor of Innovation in the Department of Mechanical Engineering, is building robots that aim to overcome these limitations. At this stage, however, the team’s creations can still appear lifeless, or even unsettling, because their facial expressions don’t match human expectations, invoking a phenomenon known as the “Uncanny Valley.”
The team’s work, detailed in a recent study published in Science Robotics, shows the robot articulating words in a variety of languages and even singing a song from its AI-generated debut album, “Hello World.”
Into the “Uncanny Valley”
So what, exactly, is the “Uncanny Valley”? As Lipson explained to The Debrief in an email, “It’s that creepy feeling you get when you watch a robot trying to look human, but missing something essential.”
“I think that half of the problem is lip motion, because half the time humans engage in face-to-face conversation, they gaze at the speaker’s lips,” Lipson said. “To date, robots do not have lips (most don’t even have a face). Our robot EMO is far from perfect, but I think it’s on the path to crossing the uncanny valley.”
Unlike traditional approaches, which rely on strict programming and predefined rules, the Columbia team’s robot learns by observing humans in action. Initially, the robot was designed to practice in front of a mirror, experimenting with its 26 facial muscles to help it “learn” how its own face moves. Once familiar with its own expressions, it watched hours of videos of humans talking and singing, learning about the exact timing and coordination of lip movements.
“We don’t program the motors directly. Instead, the robot’s AI learns over time how to move the motors by watching humans and then watching itself in the mirror, and comparing,” Lipson said. Following such training, the robot demonstrated the ability to translate audio directly into synchronized lip-motor action.
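To make that two-stage idea more concrete, here is a minimal sketch in Python (using PyTorch) of the kind of pipeline described: a self-model that predicts how motor commands change the robot’s own facial landmarks, learned from its “mirror” sessions, and an audio-to-motor network trained to match human lip movements through that self-model. The architectures, dimensions, and names below are illustrative assumptions, not the study’s actual implementation.

```python
# Hypothetical sketch of the two-stage learning described above.
# Stage 1: a self-model maps motor commands -> predicted facial landmarks
#          (trained on the robot's own "mirror" observations).
# Stage 2: an audio-to-motor network maps speech features -> motor commands,
#          supervised through the frozen self-model against human lip landmarks.
# All dimensions and architectures here are assumptions, not from the paper.
import torch
import torch.nn as nn

NUM_MOTORS = 26        # the robot's 26 facial actuators (per the article)
LANDMARK_DIM = 2 * 68  # e.g. 68 (x, y) facial landmarks: an assumption
AUDIO_DIM = 80         # e.g. 80 mel-spectrogram bins: an assumption
WINDOW = 5             # frames of audio context per prediction: an assumption

class SelfModel(nn.Module):
    """Predicts the robot's own facial landmarks from motor commands."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NUM_MOTORS, 256), nn.ReLU(),
            nn.Linear(256, LANDMARK_DIM),
        )
    def forward(self, motor_cmds):
        return self.net(motor_cmds)

class AudioToMotor(nn.Module):
    """Maps a window of audio features to motor commands for one frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(AUDIO_DIM * WINDOW, 256), nn.ReLU(),
            nn.Linear(256, NUM_MOTORS), nn.Sigmoid(),  # motors in [0, 1]
        )
    def forward(self, audio_window):
        return self.net(audio_window)

# Stage 2 training step (sketch): the frozen self-model lets us compare the
# robot's *predicted* lip shape against lip landmarks tracked in human video.
self_model = SelfModel()            # assume already trained on mirror data
for p in self_model.parameters():
    p.requires_grad_(False)

audio_to_motor = AudioToMotor()
opt = torch.optim.Adam(audio_to_motor.parameters(), lr=1e-4)

def training_step(audio_window, human_landmarks):
    motors = audio_to_motor(audio_window)   # audio -> motor commands
    predicted = self_model(motors)          # motors -> predicted face
    loss = nn.functional.mse_loss(predicted, human_landmarks)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Dummy batch, just to show the shapes involved.
loss = training_step(torch.randn(8, AUDIO_DIM * WINDOW),
                     torch.randn(8, LANDMARK_DIM))
print(f"step loss: {loss:.4f}")
```

In this sketch, freezing the self-model means the audio network learns to hit lip shapes as the robot’s own face would actually realize them, which loosely mirrors the “watch humans, then watch itself in the mirror, and compare” loop Lipson describes.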
“Robots get better the more they interact with humans,” Lipson explained in a statement. “This learning-based approach allows the robot to continually refine its expressions, much like a child learns by observing and imitating adults.”
“The robot’s facial motors are scattered under the robot’s face, and they are designed to enable the robot to make a large variety of facial gestures, including lip motion, smiling, and other motions,” Lipson added.
Achieving this kind of humanlike lip movement poses two challenges. First, it requires flexible facial “skin” and many small motors capable of rapid, silent movement. Second, the intricate patterns of lip motion are determined by vocal sounds and phonemes, a choreography humans perform effortlessly through dozens of facial muscles.
By combining a highly actuated face with a vision-to-action learning model, the Columbia robot overcomes these hurdles. It first explored random facial expressions, then expanded and refined its ability by watching humans, building a model that connects audio cues to precise motor movements. In its current state, the technology still needs refinement, as shown by the difficulty the robot has producing “B” and “W” sounds. Nonetheless, the system represents a significant advance over the speaking capabilities of other robots currently on the market.
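To see why those particular sounds are hard, consider a toy illustration (not from the study) of the phoneme-to-lip-shape mapping described above. Speech animation commonly works in terms of “visemes,” the visual lip shapes associated with phonemes: a bilabial sound like “B” demands a fast, complete lip closure, while “W” demands strong lip rounding, which hints at why those sounds stress the hardware most. Every phoneme label and parameter value below is invented for illustration.

```python
# Toy viseme table: each phoneme maps to target lip parameters the motors
# must reach. Values are invented for illustration, not from the study.
# aperture: 0.0 = lips sealed, 1.0 = fully open
# rounding: 0.0 = lips spread, 1.0 = fully rounded/protruded
VISEMES = {
    "AA": {"aperture": 0.9, "rounding": 0.1},  # as in "father"
    "IY": {"aperture": 0.3, "rounding": 0.0},  # as in "see"
    "UW": {"aperture": 0.3, "rounding": 0.9},  # as in "boot"
    "B":  {"aperture": 0.0, "rounding": 0.2},  # bilabial: full closure
    "W":  {"aperture": 0.2, "rounding": 1.0},  # strong protrusion
    "F":  {"aperture": 0.1, "rounding": 0.0},  # lip-teeth contact
}

def lip_trajectory(phonemes, frames_per_phoneme=3):
    """Linearly interpolate lip targets between consecutive phonemes."""
    frames = []
    for cur, nxt in zip(phonemes, phonemes[1:]):
        a, b = VISEMES[cur], VISEMES[nxt]
        for i in range(frames_per_phoneme):
            t = i / frames_per_phoneme
            frames.append({
                "aperture": a["aperture"] * (1 - t) + b["aperture"] * t,
                "rounding": a["rounding"] * (1 - t) + b["rounding"] * t,
            })
    return frames

# A "B" between open vowels forces a fast swing between a sealed and a
# wide-open mouth, exactly the motion small, quiet motors struggle with.
for frame in lip_trajectory(["AA", "B", "AA"]):
    print(f"aperture={frame['aperture']:.2f}  rounding={frame['rounding']:.2f}")
```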
“This is the missing link in robotics,” said Lipson. “Much of humanoid development focuses on walking or grasping, but facial [expression] is essential for human connection.”
“The more the robot observes human interaction, the better it captures nuanced facial gestures, deepening emotional connection,” noted Yuhang Hu, a researcher at Columbia University’s Creative Machines Lab.
Researchers currently see applications for such lifelike robots across a range of fields, including entertainment, education, medicine, and elder care. However, Lipson expressed cautious optimism, noting that while the technology demonstrates promise, there are also concerns that must be navigated as it develops.
“This technology is powerful,” Lipson said. “We must advance carefully to maximize benefits while minimizing risks.”
“But the potential to unlock human-robot connection is truly exciting,” Lipson added.
Chrissy Newton is a PR professional and the founder of VOCAB Communications. She currently appears on The Discovery Channel and Max and hosts the Rebelliously Curious podcast, which can be found on YouTube and on all audio podcast streaming platforms. Follow her on X: @ChrissyNewton, Instagram: @BeingChrissyNewton, and chrissynewton.com. To contact Chrissy with a story, please email chrissy @ thedebrief.org.
