Ohio State researchers use HPC to translate lesser-known languages
Researchers at Ohio State University are using HPC to translate lesser-known languages as part of a project called Low Resource Languages for Emergent Incidents (LORELEI), an initiative through the Defense Advanced Research Projects Agency (DARPA).
The LORELEI program was set up to develop technology for languages about which translators and linguists know nothing.
Dr William Schuler and his colleagues are using the Ohio Supercomputer Center’s Owens Cluster to develop a grammar acquisition algorithm to discover the rules of lesser-known languages, learning the grammar without supervision so disaster relief teams can react quickly.
‘We need to get resources to direct disaster relief and part of that is translating news text, knowing names of cities, what’s happening in those areas,’ Schuler said. ‘It’s figuring out what has happened rapidly, and that can involve automatically processing incident language.’
Schuler’s team is working to build a Bayesian sequence model based on statistical analysis to discover a given language’s grammar. It is hypothesized that this parsing model can be trained to learn a language and make it syntactically useful.
This graph displays an algorithm that explores the space of possible probabilistic grammars and maps out the regions of this space that have the highest probability of generating understandable sentences.
‘The computational requirements for learning grammar from statistics are tremendous, which is why we need a supercomputer, and it seems to be yielding positive results, which is exciting,’ commented Schuler.
On a powerful single server, Schuler’s team can analyze 10 to 15 categories of grammar, according to Dr Lifeng Jin, a student who oversees the computational aspects of the project. But using the GPUs on OSC’s Owens Cluster allows Jin to increase the number of categories significantly.
‘We can increase the complexity of the model exponentially, so we can use 45 to 50 categories and get results in an even shorter amount of time. It’s a more realistic scenario of imitating what humans are doing. The models are really big, so memory is crucial,’ said Jin.
Jin added: ‘The statistical model is also very complicated. In order to train it, we have to do a lot of computation. Say we have 20,000 sentences from a given language; we use that to train the grammar. That’s where OSC comes in. In the first stage, we tried to train the grammar using CPUs, but they’re too slow. So we refactored our code to use GPUs for sampling, and it’s sped up our process greatly.’
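Jin’s point about category counts and memory can be made concrete with a rough back-of-the-envelope sketch. The article does not say which grammar formalism the team uses, so the following assumes, purely for illustration, a probabilistic grammar in Chomsky normal form, where every rule either rewrites one category as two categories or as a single word. The function name `cnf_rule_count` and the vocabulary size are hypothetical and are not taken from the project.

```python
def cnf_rule_count(num_categories: int, vocab_size: int) -> int:
    """Count rule probabilities in a hypothetical Chomsky-normal-form grammar.

    Assumption (not from the article): every category can rewrite as any
    ordered pair of categories (A -> B C) or as any vocabulary word (A -> w).
    """
    binary_rules = num_categories ** 3           # one probability per (A, B, C)
    lexical_rules = num_categories * vocab_size  # one probability per (A, word)
    return binary_rules + lexical_rules


VOCAB_SIZE = 10_000  # hypothetical vocabulary size, chosen only for illustration

for k in (15, 50):
    print(f"{k} categories: {cnf_rule_count(k, VOCAB_SIZE):,} rule probabilities")
```

Under these assumptions, the number of binary rules alone rises from 3,375 at 15 categories to 125,000 at 50 categories, roughly a 37-fold increase, which suggests why a larger category inventory puts pressure on memory and compute.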
Speed is critical to the project because LORELEI’s goal is a rapid response to disasters, which makes high performance computing essential. In August, DARPA organized a trial run to simulate two real disasters in Africa. Schuler’s group used 60 GPUs on the Owens Cluster for seven days for four grammars of two languages, illustrating the importance of OSC’s resources to the project.
‘For rapid grammar acquisition, when minutes count you need lots of power in a hurry,’ Schuler said.
‘We’re answering these fundamental questions about what it means to be human and have language and be the animal that talks to each other. The ability to ask these kinds of questions and get answers is a relatively recent innovation that requires the high performance computing infrastructure OSC gives us. It’s really a game-changer,’ concluded Schuler.