UC Berkeley researchers develop model to reconstruct lost languages

Proceedings of the National Academy of Sciences/Courtesy

In prehistoric times, the peoples of Southeast Asia and Oceania spoke forms of a now-extinct ancestral language known as Proto-Austronesian. Fast-forward through seven millennia, and UC Berkeley researchers have designed an advanced computer system to reconstruct it automatically.

This linguistic time machine of sorts offers an alternative to the laborious efforts of human linguists, who must reconstruct ancestral languages by hand, analyzing the relationships among their descendants and the patterns by which sounds change.

The program takes in lists of modern words and information on their pronunciation and subsequently outputs a reconstruction of the “mother” language from which these “children” languages descended. Inputting Romance languages like French or Italian, for example, would generate something analogous to Latin.

“When the Roman Empire fell, populations that spoke versions of Latin were isolated and began to evolve independently, and in each region sound changed differently,” said Dan Klein, co-author of the model and UC Berkeley associate professor of computer science. “Each language is a piece of the puzzle.”

The Spanish word for "fire," for example, is "fuego," and the Italian is "fuoco," so one can infer that the Latin word for fire probably started with an "f," Klein explained.

The computer system weaves together linkages such as these to mathematically determine a word's original form. According to the results, published in the journal Proceedings of the National Academy of Sciences on Feb. 11, more than 85 percent of the system's reconstructions came within one character of manual reconstructions performed by specialized linguists.
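The "fuego"/"fuoco" reasoning can be illustrated with a deliberately simple toy heuristic: tally the corresponding sounds across cognates and take the most common one as the guess for the ancestral sound. This majority-vote sketch is the editor's illustration, far cruder than the probabilistic model described in the paper:

```python
from collections import Counter

def reconstruct_initial(cognates):
    """Guess the ancestral initial sound by majority vote across
    daughter-language cognates (a toy heuristic, not the paper's model)."""
    votes = Counter(word[0] for word in cognates)
    sound, _ = votes.most_common(1)[0]
    return sound

# Romance words for "fire": Spanish, Italian, Portuguese, French
print(reconstruct_initial(["fuego", "fuoco", "fogo", "feu"]))  # prints "f"
```

The real system replaces this vote with learned, language-specific probabilities of each sound change, so one divergent daughter language does not derail the reconstruction.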

The computer model not only infers the original root of the descendant languages but can also focus on each individual child language and identify recurring patterns within it.

“Each (language) preserves something different,” Klein said. “Italian keeps the consonant in the middle, Portuguese preserves the vowels better, and by looking at what’s common and by looking at what changes happen typically in language, you can rewind that process.”

Klein added that by studying these “signature sounds,” researchers can estimate the probability of certain changes repeating themselves and essentially predict how present languages may evolve in different regions in the future.

Additionally, the model aims to answer cultural and anthropological questions. By studying the history of words, historians can infer how cultures merged and split, and they can better understand the development of human civilizations.

According to Tom Griffiths, co-author of the research and campus psychology professor, historical linguistics is highly relevant to the study of cultural evolution — how humans learn from each other. He added that this model has the potential to answer fundamental questions on the history of human language and cognition.

“I’m excited about discovering what makes people capable of solving challenging problems like learning languages and making computers better at solving these problems,” said Griffiths, who worked on the study with UC Berkeley graduate student David Hall and Alexandre Bouchard-Côté, an assistant professor of statistics at the University of British Columbia.

The team began collaborating in 2006, when Bouchard-Côté took on the project while studying under Klein as a graduate student at UC Berkeley.

The model relies on a class of algorithms known as Markov chain Monte Carlo, which estimates unknown variables, here the ancestral word forms, by holding some values fixed while repeatedly resampling the rest, gradually converging on likely reconstructions.
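The freeze-and-resample loop described above corresponds loosely to Gibbs sampling, a standard Markov chain Monte Carlo variant. A minimal sketch on a two-variable toy distribution (a correlated bivariate normal), chosen by the editor purely to show the mechanics rather than the paper's actual sound-change model:

```python
import random

def gibbs_bivariate_normal(rho, steps=10000, seed=0):
    """Gibbs sampling for a standard bivariate normal with correlation rho:
    alternately freeze one coordinate and resample the other from its
    conditional distribution, collecting samples along the way."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    cond_sd = (1 - rho * rho) ** 0.5  # conditional standard deviation
    samples = []
    for _ in range(steps):
        x = rng.gauss(rho * y, cond_sd)  # resample x given frozen y
        y = rng.gauss(rho * x, cond_sd)  # resample y given frozen x
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(0.8)
mean_x = sum(x for x, _ in samples) / len(samples)  # should hover near 0
```

In the reconstruction setting, the frozen and resampled quantities are the unknown ancestral word forms and sound-change parameters instead of two coordinates, but the iterate-until-convergence idea is the same.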

In its results, the team weighed in on an ongoing debate among linguists over the functional load hypothesis, which proposes that certain sounds are more likely to vanish than others.

“If you have two sounds that are rarely used to distinguish words, then if they just merged together, no harm would be done, like the ‘th’ sounds in ‘the’ and ‘thin,’” Klein said. “The functional load hypothesis says sounds not carrying their weight, in that they’re not very distinctive, should be more likely to collapse.”

Klein said that through the system's ability to examine trends across hundreds of languages, the team was able to measure how often sounds merged and ultimately found strong statistical support for the hypothesis.
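One common way to make "carrying their weight" concrete is to count minimal pairs: words that a given sound contrast alone keeps distinct. The sketch below, an editor's toy illustration on an invented word list rather than the paper's measure, counts the pairs a merger would collapse:

```python
def minimal_pairs(words, a, b):
    """Find word pairs distinguished only by swapping sound a for b.
    Few such pairs means the a/b contrast carries a low functional load,
    so merging the two sounds would cause little confusion."""
    wordset = set(words)
    pairs = []
    for w in words:
        swapped = w.replace(a, b)
        if swapped != w and swapped in wordset:
            pairs.append((w, swapped))
    return pairs

# The p/b contrast distinguishes several words in this toy lexicon:
lexicon = ["pat", "bat", "pit", "bit", "cap", "cab", "dog"]
print(minimal_pairs(lexicon, "p", "b"))
```

By the functional load hypothesis, a contrast yielding many such pairs should resist merger, while one yielding almost none, like "th" in "the" versus "thin," is a candidate to collapse.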

According to campus professor of historical linguistics Andrew Garrett, many linguists were initially skeptical of the theory, but the team presented “a good statistical argument,” and he and fellow linguists are now “impressed” by this new information.

“It’s very interesting work — they wouldn’t pretend that it’s more than the first step in a larger project, but it’s something that a lot of linguists will be really interested to look at,” Garrett said. “If you can get a little bit of an edge up with computing, that is really great.”

According to Klein, while the system represents a major stride for the field of linguistics, human linguists are not yet obsolete. Certain documents require deeper insight and demand the expertise of human linguists. In poetry, for instance, researchers can tell that sounds have gone missing from the number of syllables a line needs to scan or rhyme, an inference that is for now beyond the capabilities of a computer.

“It’s certainly not the goal that this will replace linguists doing manual reconstruction,” Klein said, “but what our system can do is crunch a whole lot of data, and it can do it fast. It’s a complementary tool to answer new kinds of questions.”

Virgie Hoban covers research and ideas. Contact her at [email protected] and follow her on Twitter @VirgieHoban.