University of Pennsylvania engineers have developed an AI-powered audio editor called SmartDJ that lets users customize their immersive sound environment with simple, everyday-language voice commands, much as AI image generators respond to plain-text prompts.
The research team behind the ‘ChatGPT for sound’ said that allowing users to customize their audio experience using everyday language prompts, such as “make this sound like a busy office,” rather than specifying each sound found in that environment, could have potential applications in virtual reality (VR) and augmented reality (AR), immersive gaming, and general sound design.
“With SmartDJ, users can describe the outcome they want in natural language, and the system figures out how to make it happen,” explained Mingmin Zhao, Assistant Professor in Computer and Information Science (CIS) and senior author of the study. “We show that AI can help people edit audio in intuitive ways using simple language.”
How AI-Powered Audio Editor ‘SmartDJ’ Creates Immersive Sound Environments
According to a statement announcing the first-of-its-kind AI-powered audio editor, SmartDJ was designed to address the two primary limitations facing earlier-generation AI audio-editing tools. First, most early systems only worked with rigid, template-like commands.
The team said this limitation required users to identify individual sounds to add or remove, rather than giving a general command. In the “busy office” example, this would mean adding or subtracting several background elements, such as the clacking of computer keyboards, people chatting, and phones ringing.
The second limitation of early AI-powered audio editors was that they generally operated on single-channel ‘mono’ audio. The Penn team said the inability to operate in more complex, multi-channel environments results in a loss of the “spatial cues” that are typically deemed necessary for an immersive audio experience.
In contrast, the team said that their system “can interpret high-level instructions and is designed for stereo audio, allowing it to make edits that better preserve or reshape the spatial structure of a scene.”
When highlighting the advancements over other systems, the researchers also noted that SmartDJ is ‘interpretable,’ meaning it allows the user to see each step the AI takes to create the customized sound environment. So, when creating the requested “busy office,” the user can watch SmartDJ fulfill the request by, for example, adding the sound of a phone ringing at a relatively low 3 decibels (dB). If that individual sound is not wanted or needs to be altered, the user can change it without affecting the other parts of the AI-generated sound design.
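A minimal sketch of what such an interpretable edit plan could look like in code; the class, field, and value choices below are illustrative assumptions, not part of the Penn system:

```python
# Hypothetical sketch: representing a SmartDJ-style edit plan as a list of
# discrete, human-readable steps that a user can inspect and tweak individually.
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class EditStep:
    action: str        # e.g. "add", "remove", "reposition"
    sound: str         # e.g. "phone ringing"
    gain_db: float     # relative level of the sound
    pan: float         # stereo position, -1.0 (left) to 1.0 (right)

# The kind of plan a model might produce for "make this sound like a busy office".
plan = [
    EditStep("add", "keyboard clacking", gain_db=-6.0, pan=-0.3),
    EditStep("add", "people chatting",   gain_db=-9.0, pan=0.0),
    EditStep("add", "phone ringing",     gain_db=3.0,  pan=0.6),
]

# Because each step is explicit, a user can adjust one sound
# without touching the rest of the generated scene.
plan[2] = replace(plan[2], gain_db=-3.0)
for step in plan:
    print(step)
```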
Combining Two AI Models Lets Users Use Everyday Language
When discussing the secret formula behind SmartDJ, Zitong Lan, a doctoral student in Electrical and Systems Engineering (ESE) and the first author of the study detailing its development and testing, noted that understanding users’ requests and generating sounds are usually handled by completely different kinds of AI systems. For example, engineers use language models, such as those behind ChatGPT and Siri, to process text. However, AI designers typically use “diffusion models” to edit sounds.
According to Lan, the difference comes down to what each AI system has been trained to do. AI-powered language models, like those used in chatbots, are trained on the patterns humans use when typing and speaking, enabling them to generate the most relevant responses. Conversely, diffusion models are designed to take noisy, diffuse signals and gradually shape them into a coherent audio signal.
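As a rough, self-contained illustration of that second idea, the toy sketch below runs a DDPM-style reverse process that turns pure noise into a short waveform. The “noise predictor” here is an oracle that peeks at the known clean signal, which is only possible in a toy; real audio diffusion models learn a neural network for that role.

```python
# Toy illustration of the diffusion idea: start from noise and iteratively
# denoise toward a coherent waveform.
import numpy as np

rng = np.random.default_rng(0)

# A "clean" target waveform (a short sine burst standing in for real audio).
t = np.linspace(0, 1, 16000)
x0 = 0.5 * np.sin(2 * np.pi * 440 * t)

# Linear noise schedule, as in DDPM-style models.
T = 50
betas = np.linspace(1e-4, 0.05, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

# Start from pure Gaussian noise ...
x = rng.standard_normal(x0.shape)

# ... and run the reverse process. A trained model would predict the noise;
# the oracle below derives it from the clean target, which a real system never sees.
for step in reversed(range(T)):
    a, ab = alphas[step], alpha_bars[step]
    eps = (x - np.sqrt(ab) * x0) / np.sqrt(1.0 - ab)   # oracle "noise prediction"
    mean = (x - (1.0 - a) / np.sqrt(1.0 - ab) * eps) / np.sqrt(a)
    noise = rng.standard_normal(x0.shape) if step > 0 else 0.0
    x = mean + np.sqrt(betas[step]) * noise

print("final error vs. clean waveform:", np.abs(x - x0).max())
```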

For SmartDJ, the Penn team created what they are calling an audio language model, or ‘ALM,’ to bridge the gap between the two kinds of AI systems. This involved training the ALM on both sound and text by having it analyze original audio alongside the associated user prompt. The team said this training allows the model to break a prompt into a sequence of smaller editing actions, “such as adding, removing, or repositioning a sound.”
Next, the diffusion model carries out those smaller individual steps in sequence, allowing the governing SmartDJ AI to interpret language and audio simultaneously. The Penn team said this combined ability lets the ALM act as the audio producer, deciding how the soundscape should change, while the diffusion model takes on the role of the studio musician, turning those directions into coherent audio.
“The language model gives the system direction, (and) the diffusion model performs those directions,” explained Yiduo Hao, a doctoral student in CIS and the study’s other co-author.
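A hedged sketch of that producer/musician division of labor; the planner and renderer interfaces below are assumptions made for illustration, not the actual SmartDJ API:

```python
# Hypothetical two-stage pipeline: a language model plans discrete edits,
# and a diffusion-based renderer applies each one to the stereo mix.
from typing import Callable, List, Tuple
import numpy as np

Audio = np.ndarray  # shape (2, num_samples) for stereo

def edit_scene(prompt: str,
               audio: Audio,
               plan_edits: Callable[[str, Audio], List[dict]],
               render_edit: Callable[[Audio, dict], Audio]) -> Tuple[Audio, List[dict]]:
    """Run the planner, then the renderer, returning edited audio plus the step log."""
    steps = plan_edits(prompt, audio)          # ALM-style "producer": decide the edits
    for step in steps:
        audio = render_edit(audio, step)       # diffusion-style "musician": realize each edit
    return audio, steps                        # the step log keeps the result interpretable

# Stand-in components so the sketch runs end to end.
def toy_planner(prompt: str, audio: Audio) -> List[dict]:
    return [{"action": "add", "sound": "phone ringing", "gain_db": 3.0}]

def toy_renderer(audio: Audio, step: dict) -> Audio:
    gain = 10 ** (step["gain_db"] / 20)
    ring = gain * 0.01 * np.random.default_rng(1).standard_normal(audio.shape)
    return audio + ring                        # a real system would synthesize the sound

edited, log = edit_scene("make this sound like a busy office",
                         np.zeros((2, 16000)), toy_planner, toy_renderer)
print(log)
```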
Training & Testing SmartDJ Yields “More Realistic” Results Than Predecessors
Although the team had a functional design, AI models of this kind require large datasets for training. The team noted, however, that finding examples that bring together high-level verbal or text instructions, the editing steps needed to carry out those requests, and the audio before and after the changes proved unusually challenging.

“This problem needed a very unusual kind of data set,” Lan explained. “It had to capture the goal, the steps, and the result all at once.”
After an extensive search, the team realized the data set they needed did not exist. So, according to their release, they “built it themselves.”
First, they accessed publicly available sound libraries. Then, they created a pipeline that used a large language model (LLM) to “generate high-level editing prompts and the intermediate steps needed to carry them out,” while AI-powered audio signal processing produced the corresponding edited outputs.
“For this to work, we couldn’t just show the model an input and output,” Hao explained. “We had to show (our system) the chain of reasoning in between.”
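A hedged sketch of how such a (prompt, steps, before/after audio) dataset could be assembled; the helper names and the stubbed LLM call are assumptions for illustration, not the team’s actual pipeline:

```python
# Hypothetical data-generation loop: pair each source clip with an LLM-written
# editing prompt, the intermediate steps, and the audio produced by applying
# those steps with ordinary signal processing.
import numpy as np

def propose_prompt_and_steps(clip: np.ndarray) -> tuple[str, list[dict]]:
    # In the real pipeline an LLM would write these; a fixed stub keeps the
    # sketch runnable without inventing any model API.
    prompt = "make this sound like a busy office"
    steps = [{"action": "add", "sound": "keyboard clacking", "gain_db": -6.0}]
    return prompt, steps

def apply_steps(clip: np.ndarray, steps: list[dict]) -> np.ndarray:
    out = clip.copy()
    for step in steps:
        gain = 10 ** (step["gain_db"] / 20)
        out = out + gain * 0.01 * np.random.default_rng(0).standard_normal(clip.shape)
    return out

dataset = []
for clip in [np.zeros(16000), np.ones(16000) * 0.1]:   # stand-ins for library clips
    prompt, steps = propose_prompt_and_steps(clip)
    dataset.append({"prompt": prompt, "steps": steps,
                    "before": clip, "after": apply_steps(clip, steps)})
print(len(dataset), "training examples")
```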
After final refinements, the team tested SmartDJ against earlier audio-editing systems. According to the Penn team’s release, the comparison found that SmartDJ produced “more realistic, better-aligned results” from the simple language prompts users provided than its predecessors did.
“In both quantitative evaluations and human studies, SmartDJ outperformed prior methods on measures including audio quality, how well the results matched the user’s instructions, and how realistically it placed sounds in space,” they explained.
“Making it Easier for More People to Bring Their Ideas to Life”
Although SmartDJ is still confined to a lab setting, the researchers behind its creation said that being able to direct a sound-design AI the same way we direct LLMs has potential applications in VR, AR, gaming, sound design, virtual conferencing, and other interactive media platforms, “where users may want to reshape an audio environment without manually specifying every individual change.”
When discussing his team’s motivation, Zhao said the ultimate goal is to make audio editing more accessible so anyone “with a creative vision” can design their own customized soundscape without complex editing skills or tools.
“For other media, like text and images, users can already use AI to make high-level editing requests,” the researcher explained. “SmartDJ unlocks similar possibilities for audio, making it easier for more people to bring their ideas to life.”
The study “SmartDJ: Declarative Audio Editing With Audio Language Model” was presented at the 2026 International Conference on Learning Representations (ICLR).
Christopher Plain is a Science Fiction and Fantasy novelist and Head Science Writer at The Debrief. Follow and connect with him on X, learn about his books at plainfiction.com, or email him directly at christopher@thedebrief.org.
