AI-driven robot. Image by Michal Jarmoluk from Pixabay.

“Turn Ideas into Physical Objects”: If You Describe an Object, This AI-Driven Robot Can Build It

MIT scientists working with Google DeepMind researchers have developed an AI-driven robot that can ‘turn ideas into physical objects’ based on simple text descriptions.

The researchers say their robot can also anticipate the intended use of the final object during construction, improving accuracy and reducing manufacturing errors.

While the current version of the system is restricted to simpler objects like tables and chairs, the team hopes to expand the robot’s functionality so it can create more complex objects, including prototypes with joints, hinges, and other moving parts.

The research team believes that, once commercially available, an AI-driven robot assembly system based on their approach could create complex objects directly in a consumer’s home, reducing or even eliminating manufacturing, assembly, storage, and shipping costs.

“We have shown that we can use generative AI and robotics to turn ideas into physical objects in a fast, accessible, and sustainable manner,” explained MIT professor Randall Davis, senior author of a paper detailing the team’s work.

According to a statement from the researchers, the goal was to combine the ease of use of generative AI models, which can convert simple text prompts into complex designs, with a robot’s ability to rapidly build and assemble an accurate prototype of that design from the user’s initial prompt.

To start, the MIT-led team used a vision-language model (VLM) that was pretrained to understand images and text. For the initial prototype, this VLM was tasked with using just two types of prefabricated parts, structural components and panel components, to build a desired object.

Alex Kyaw, lead author of a paper detailing the team’s work and a graduate student in the MIT departments of Electrical Engineering and Computer Science (EECS) and Architecture, explained that there are many ways one can place panels on a physical object. However, to build an object from a generative AI text prompt, the robot needs to “see the geometry and reason over that geometry” before it can make a final decision about each component’s placement.

“By serving as both the eyes and brain of the robot, the VLM enables the robot to do this,” Kyaw explained.

For example, if a user types a simple construction task such as “make me a chair,” the system first generates an AI image of a chair. Next, the specially programmed VLM “reasons” about the chair to determine where panel components would (or wouldn’t) be placed atop structural components, depending on the desired object’s final functionality.

Given the prompt “Make me a chair” and feedback “I want panels on the seat,” the robot assembles a chair and places panel components according to the user prompt. Credit: Davis, Kyaw, et al.

According to the team, the demonstration version of their AI-driven robot prototyping system outputs its decisions as text, “such as ‘seat’ or ‘backrest,’” and then numbers the sections. This numbered design is fed back into the VLM, where components are assigned names corresponding to the parts of the chair that require panels, such as ‘back’ or ‘seat.’
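For readers curious what this loop might look like in practice, here is a minimal, hypothetical Python sketch of the flow described above. It is not the team’s implementation; the helper names (generate_reference_image, query_vlm, parse_plan) are placeholders standing in for the text-to-image model, the vision-language model, and the numbering step.

```python
# A minimal, hypothetical sketch (not the team's released code) of the flow
# described above: a text prompt becomes a reference image, and a pretrained
# vision-language model (VLM) reasons over numbered structural components to
# decide which ones should receive panel components. generate_reference_image()
# and query_vlm() are assumed placeholders for the two underlying models.

from dataclasses import dataclass

@dataclass
class Component:
    index: int         # number assigned to a structural component
    label: str         # VLM-assigned functional name, e.g. "seat" or "backrest"
    needs_panel: bool  # whether a panel component should be placed on it

def generate_reference_image(prompt: str) -> bytes:
    """Placeholder for a text-to-image generation call."""
    raise NotImplementedError

def query_vlm(image: bytes, question: str) -> str:
    """Placeholder for a pretrained vision-language model call."""
    raise NotImplementedError

def parse_plan(answer: str) -> list[Component]:
    """Turn VLM lines like '1, seat, yes' into Component records."""
    plan = []
    for line in answer.splitlines():
        idx, label, panel = (part.strip() for part in line.split(","))
        plan.append(Component(int(idx), label, panel.lower() == "yes"))
    return plan

def plan_assembly(prompt: str) -> list[Component]:
    image = generate_reference_image(prompt)  # e.g. "make me a chair"
    answer = query_vlm(
        image,
        "For each numbered structural component, give its functional name "
        "(e.g. seat, backrest) and whether a panel should be placed on it, "
        "one per line as: index, name, yes/no",
    )
    return parse_plan(answer)
```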

As a result, instead of simply building an object shaped like a chair, the AI-driven robotic system determines, on its own, that the backrest and seat need panels to enable sitting, but that other surfaces do not. When the researchers asked the system to explain its panel placement choices, they said its answers confirmed that this process was, in fact, working correctly.

“We learned that the vision language model is able to understand some degree of the functional aspects of a chair, like leaning and sitting, to understand why it is placing panels on the seat and backrest,” Kyaw explained. “It isn’t just randomly spitting out these assignments.”

Although the current demonstration version of the AI-driven robot can complete simple construction tasks after an initial text prompt, it is still designed to operate with a human user in the loop. The team said this allows the user to “refine the process” by offering new prompts along the way.
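A similarly hypothetical sketch of that refinement step, building on the placeholders above, could look like the following; the feedback string and prompt wording are illustrative only.

```python
# An equally hypothetical sketch of the human-in-the-loop step, reusing
# query_vlm() and parse_plan() from the previous sketch. After the initial
# plan, the user can type feedback such as "I want panels on the seat", and
# the VLM revises the panel assignments before anything is built.

def refine_plan(image: bytes, plan: list, feedback: str) -> list:
    """Fold the user's feedback back into the VLM query and re-parse the plan."""
    current = "; ".join(
        f"{c.index} {c.label}: {'panel' if c.needs_panel else 'no panel'}" for c in plan
    )
    answer = query_vlm(
        image,
        f"Current assignments: {current}. The user says: '{feedback}'. "
        "Return the updated assignments, one per line as: index, name, yes/no",
    )
    return parse_plan(answer)
```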

These six photos show the AI-driven robot’s text-to-robotic assembly of multi-component objects from different user prompts. Image Credit: Davis, Kyaw, et al.

“The design space is very big, so we narrow it down through user feedback,” Kyaw explained. “We believe this is the best way to do it because people have different preferences, and building an idealized model for everyone would be impossible.”

Richa Gupta, an MIT architecture graduate student and study co-author, agreed, noting that this ‘human-in-the-loop’ process lets users direct the AI-generated designs and “have a sense of ownership” in the final product.

Next, the team said they hope to enhance the AI-driven robot system so it can handle more complex and nuanced text prompts, such as tasking it with building a table from different materials, like metal and glass. The team said they are also looking to expand the range of prefabricated components the robot can incorporate into a prototype, “such as gears, hinges, or other moving parts,” to turn ideas into physical objects with more functionality.

When discussing possible applications of their AI-driven robot system, the researchers said their approach could be “especially useful” for rapid prototyping components used across vastly different fields, from architecture to aerospace. They also suggest their designs could enable a home-based system that can fabricate furniture or other objects locally, “without the need to have bulky products shipped from a central facility.”

While the system is still in the prototype phase, Kyaw said the ultimate goal is to design an adaptable, AI-driven robotics platform that can be applied across different categories of manufacturing systems, enabling humans to communicate their designs more naturally.

“Sooner or later, we want to be able to communicate and talk to a robot and AI system the same way we talk to each other to make things together,” the MIT scientist explained. “Our system is a first step toward enabling that future.”

The paper “Text to Robotic Assembly of Multi-Component Objects using 3D Generative AI and Vision Language Models” was recently presented at the Conference on Neural Information Processing Systems.

Christopher Plain is a Science Fiction and Fantasy novelist and Head Science Writer at The Debrief. Follow and connect with him on X, learn about his books at plainfiction.com, or email him directly at christopher@thedebrief.org.