Human languages as disparate as English, Japanese, and Russian follow remarkably similar evolutionary paths, according to a new AI-driven study that investigated how new concepts are added to languages over time.
Researchers from Fudan University, Harvard University, and Stony Brook University revealed their findings in a recent paper published in the Proceedings of the Royal Society B. Their work spanned 21 languages, many of them separated not only by distance but also by time, with records reaching back to the medieval period.
The work used word embeddings, a natural language processing technique, to analyze a wide range of languages and uncover their hidden evolutionary connections.
Universal Language
“We are looking for insight into fairly universal questions of how people develop concepts that they feel are worth naming,” Dr. Steven Skiena told The Debrief. “Words are examples of named concepts, and thus the right level to address our study. By proving that the same phenomena holds across many languages, we show that this process is indeed universal in some sense as opposed to culture-specific.”
While languages may have unique sounds and grammar rules, the new work examines similarities in how languages evolve on a large scale, seeking universal properties across disparate languages as more words are added.
“New words, concepts, and ideas are generated all the time,” Dr. Skiena explained. “But do hidden patterns exist that govern which concepts are likely to emerge? And are there simple mathematical models that emulate this process?”
AI Language Tools
“We were inspired by the idea that AI technologies for representing language semantics (word embeddings) give us a rigorous way to reason about the evolution of language,” Dr. Skiena said. “With word embeddings, each distinct vocabulary is associated with a particular point in a high-dimensional feature space. Words with similar meanings are represented by nearby points.”
“In essence,” Dr. Skiena continued, “our paper asks how the vocabulary of languages is distributed in this feature space, and what kind of mathematical process would create a similar distribution.”
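The core idea of word embeddings described above can be illustrated with a minimal sketch. The vectors and words below are made up for illustration (real embedding models, like those in the study, use hundreds of dimensions learned from text corpora); the point is only that semantically related words map to nearby directions in the space, which can be measured with cosine similarity.

```python
import numpy as np

# Toy word embeddings: illustrative 3-dimensional vectors.
# (Real embedding spaces, such as those in the study, have ~300 dimensions.)
embeddings = {
    "king":  np.array([0.90, 0.80, 0.10]),
    "queen": np.array([0.85, 0.82, 0.15]),
    "apple": np.array([0.10, 0.20, 0.90]),
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Words with related meanings sit closer together in the feature space
sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])
print(sim_royal > sim_fruit)  # "king" is nearer to "queen" than to "apple"
```

In a trained embedding model the same comparison works across an entire vocabulary, which is what lets researchers ask where, in this geometric sense, new words tend to appear.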
To perform their work, the researchers needed large-scale word embedding datasets for each language, a requirement that largely determined which languages could be included. Because the research focused on the evolution of language, they favored the greatest possible historical depth, including many historical datasets with embeddings representing languages as far back as the Middle Ages. The challenge was then producing a model that captured how real languages evolve.
“We wanted to prove that certain mathematical models generated embedding spaces that look very much like real natural languages,” Dr. Skiena said. “But what do real natural languages look like?”
“We had to develop a set of four surprising laws/principles that govern the structure of real languages,” Dr. Skiena added, “and then prove that our favored mathematical model generates embedding spaces that also had these unusual properties.”
Language Analysis Results
“I think of cultural influences as the force that shapes the evolution of languages, but it is clear that the brain shapes these cultural influences,” Dr. Skiena said, regarding the similarities the researchers found among different languages. Co-author Dr. Sergiy Verstyuk added, in a conversation with The Debrief, that although there are potential connections between their work and neuroscience studies, that was not the direct aim of their work.
Among the commonalities the researchers discovered was that popular words tend to cluster with other popular words in specific regions of the mathematical space, and the hierarchy of this clustering was strikingly similar across many languages. Word creation usually occurred in bursts, with recent words surrounding other recent words as new concepts entered the vernacular, much like the punctuated bursts of rapid change seen in biological evolution.
“One important aspect of our work is that we constructed a surprisingly simple model that not only replicates the earlier results on the power-law distribution of word frequencies, but that also accounts for new empirical findings across many additional dimensions (specifically, in the 300-dimensional semantic space and in historical time),” explained Verstyuk. “This was done by marrying a well-known cumulative‑advantage process with a far less often used von Mises–Fisher probability distribution.”
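A cumulative-advantage ("rich-get-richer") process of the general kind Verstyuk describes can be sketched in a few lines. This is not the paper's actual model: the parameters are invented, and a simple Gaussian perturbation on the unit sphere stands in for the von Mises–Fisher draw, purely to keep the sketch short. It shows the two qualitative ingredients: existing words are reused in proportion to their popularity, and new words appear near the words that spawned them.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(v):
    return v / np.linalg.norm(v)

dim = 10         # toy dimensionality (the paper works in ~300 dimensions)
noise = 0.2      # assumed spread of new words around their "parent" word
p_new = 0.3      # assumed probability that a step coins a new word

# Start with one word: a random direction on the unit sphere, usage count 1
vectors = [normalize(rng.normal(size=dim))]
counts = [1]

for _ in range(500):
    # Cumulative advantage: pick an existing word with probability
    # proportional to its popularity (preferential attachment)
    probs = np.array(counts) / sum(counts)
    parent = rng.choice(len(vectors), p=probs)
    if rng.random() < p_new:
        # Coin a new word near the parent's direction. (The Gaussian
        # perturbation here is a stand-in for the von Mises-Fisher
        # distribution used in the paper's model.)
        vectors.append(normalize(vectors[parent] + noise * rng.normal(size=dim)))
        counts.append(1)
    else:
        counts[parent] += 1  # existing word reused, gaining popularity

# A few words accumulate most of the usage: a heavy-tailed count distribution
counts.sort(reverse=True)
print("top counts:", counts[:5], " bottom counts:", counts[-5:])
```

Run long enough, this kind of process yields the familiar power-law word-frequency distribution while also placing related words near each other in the vector space, which is the combination of behaviors the authors set out to reproduce.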
“This paper has had an amazingly long gestation: we have been working on this together for more than seven years at this point,” Dr. Skiena concluded. “But it is great to see where we have finally gotten to. I am not sure we are ready to wait another seven years for the next paper. We remain excited about the possibilities of using AI-generated embeddings as a tool for fundamental research in understanding historical processes in cultural evolution, not just for building technological tools.”
The paper, “Statistical Structure and the Evolution of Languages,” appeared in Proceedings of the Royal Society B on April 8, 2026.
Ryan Whalen covers science and technology for The Debrief. He holds an MA in History and a Master of Library and Information Science with a certificate in Data Science. He can be contacted at ryan@thedebrief.org and followed on Twitter @mdntwvlf.
