How well can Siri and Alexa understand us? The Alumniportal Deutschland sought to answer this question with its campaign “#apdsiri”. In June 2019, we enlisted the help of some current DAAD scholarship holders at the Scholarship Holders’ Meeting at RWTH Aachen University. We tested Siri’s language competency by having scholarship holders recite tongue twisters, in German and in English, to the device. We then checked whether Siri typed back what the scholarship holders had said. The results were interesting, to say the least. Afterwards, we asked Anna Konstantinova, former DAAD scholarship holder, Germany-Alumna and doctoral candidate at the University of Münster, to share her linguistic expertise with us and explain our varied (and funny) results.
Have you tried asking Siri or Alexa whether you need an umbrella tonight, or what time your dentist appointment is tomorrow? If you have, you most probably noticed that sometimes you succeed, but other times the system seems to have misunderstood you: it replies with unexpected remarks or provides you with some irrelevant web links. For example, a non-native speaker of English decides to search for some pictures of hybrid or mixed breed dogs. Siri then decides to search for some pirate dog pictures instead. You might think to yourself: “Hm, I can imagine the speaker had a very strong foreign accent in English, and no wonder Siri struggled to understand this search request.”
Some foreign accents might seem harder to understand than others, depending on how much exposure the listener has had to the accent and on the listener’s native language. In the beginning, you might need extra effort to follow the speech of a new interlocutor or, for example, a conference presenter who speaks with an unfamiliar accent. After a short while, however, you stop paying attention to it. Moreover, when you come across a similar accent next time, you are likely to perceive it effortlessly. This skill, called perceptual learning, is the most powerful feature of the human speech perception system. Human listeners do not decode the meaning of words simply by joining the sounds together. In the majority of cases, we can already guess the rest of the words of a phrase after we have heard the first sounds. This ability makes our speech recognition skills extremely flexible and efficient.
Presently, there is no speech recognition technology that can correctly process the speech of every speaker of a certain language on an unexpected or unknown topic. Rain, traffic, people chattering or a coffee machine running in the background tend to make the speech recognition rate even lower. A native speaker of a certain language, in turn, has an almost 100% chance of correctly recognizing every utterance in his or her native language. If insight into how humans perceive speech allowed us to model a similar perception system in automatic speech recognition systems, would it not be obvious that linguistic research should play a central role in designing conversational agents?
Interestingly, the technology of teaching machines how to recognize and understand human speech was developed by engineers who had little or no input from the linguistic community. Not that it had never occurred to engineers that linguistic knowledge might be of help. The problem was that the presence of a linguist in an automatic speech recognition group did not lead to much success, and every time the linguist left the group, the system showed better recognition rates. However, this does not come as a surprise. The less you know about a subject, the easier it might be to come up with a model of it, but the more you deal with languages, natural speech and the various extralinguistic factors that influence the way we speak, the more you wonder about the complexities of the communication process. Really, we should all be grateful for every successful conversation that happens in our lives.
Now that systems like Siri and Alexa have become mature enough to leave their laboratory nests and enter our everyday life, we can all play with these systems. A good example of this would be the campaign “#apdsiri” by the Alumniportal Deutschland at the DAAD June Scholarship Holders’ Meeting on the topic of Artificial Intelligence research, where participants, both native and non-native speakers of German, tested Siri’s ability to correctly recognize tongue twisters. Of course, this is not to underestimate the thorny path that engineers and computer scientists have taken in their effort to make such systems highly accessible and, more importantly, highly efficient in performing a particular set of tasks. But apparently, the researchers were not expecting Siri to be faced with the phrase “Tschechisches Streichholzschächtelchen”, which was once recognized as “Check me tschüss 2018 Hessen” and another time as “Tschechisches drei Kreuzberg”.
For a native speaker of German, in contrast, recognizing tongue twisters is not a problem. But why is this the case? There are several aspects which allow humans to outperform machines. The first one is that an automatic speech recognition system treats natural speech as a sequence of sounds with the same qualities. In real life, various factors influence the way each of these smallest language segments sounds, e.g. whether it comes at the beginning or the end of a word, or whether the sound is stressed or unstressed. We also speak differently depending on the situation. Just notice how clearly you speak in front of an audience versus how unintelligible your speech may be early in the morning when you utter something in a hurry. We do not even pay attention to those nuances, because we connect the first sounds we hear with the potential meaning of the word, thus processing speech on the lexical level. One curious issue about interaction with systems like Siri is that users tend to think that the best way to talk to the system is to hyper-articulate, i.e. to speak slowly and make long pauses. Now just imagine how strongly such utterances deviate from natural speech, which serves as data for training intelligent systems.
Campaign #apdsiri: The Results
The problem of treating speech as a sequence of invariant sounds has already been recognized by the designers of speech recognition systems, and findings from the field of variational linguistics have come in very handy. In Germany, for instance, a shopkeeper might ask you where you are from, not because they hear a foreign accent, but because they recognize that you are not from the Cologne area. Moreover, our social status, age and other “community-in” markers influence the way we speak. Logically, the more we know about the factors causing variation, the degree of variation at all linguistic levels and what happens to speech sounds when one or more factors are at play, the more precise the information that can be provided to automatic speech recognizers to learn from.
For Siri to be able to predict how the speech of a non-native speaker will deviate from what the system expects, findings from the linguistic field of cross-linguistic influence can help prepare the system for the features which are likely to deviate from native speakers’ pronunciation. Interestingly, it is not as easy as it sounds. Firstly, it is not only the speaker’s native language or languages that influence the way a person speaks in a foreign language. The languages you studied before also play a role: your ten years of English at school, for example, are likely to influence the way you speak German, even though your native language is Italian. Secondly, cross-linguistic influence does not simply mean that the languages acquired and learnt earlier influence your new language, but also that your new foreign language makes already existing language systems unstable, thus potentially influencing the way you speak your native language. This background information also helps human beings better process speech.
Let us go back to the earlier example of Siri incorrectly recognizing “hybrid dogs” as “pirate dogs” when said by a non-native English speaker. This is an example from my previous research, in which I analysed the speech adaptation strategies of non-native speakers interacting with a British version of Siri. After I collected data from Russian speakers of English, I asked a native speaker of English to rate the accentedness and intelligibility of the participants’ speech. The result was that the person whose utterance was often misunderstood by Siri had the lowest scores possible. This means that there were hardly any features deviating from a native speaker’s speech. Moreover, when native speakers of English tried asking the same questions, they would sometimes also be misunderstood by the system.
On the way to correctly recognizing a whole sentence, lots of things can go wrong. Siri uses algorithms which help the system estimate the most probable combination of sounds and words for what you uttered. “Tschechisches Streichholzschächtelchen” was estimated as a less probable request. Instead, “Check me tschüss 2018 Hessen” won the probability battle. In German, one also has to deal with compound nouns which consist of many words, so the system might have particular problems with processing long nouns. How do native speakers process compound nouns then? Do they treat them as one item or as a combination of many? Psycholinguistic studies on compound noun processing might provide insight and be helpful for this automatic speech recognition problem as well.
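To make the “probability battle” more tangible, here is a minimal sketch in Python. It assumes a simplified textbook picture in which each candidate transcript gets an acoustic score (how well it matches the audio) and a language model score (how plausible the word sequence is), and the candidate with the highest combined score wins. All numbers and the names `candidates`, `log_score` and `best_transcript` are invented for illustration; this is not a description of how Siri actually works.

```python
import math

# Hypothetical scores, invented purely for illustration.
# "acoustic": how well the candidate matches the audio signal.
# "prior": how plausible the word sequence is according to a language model.
candidates = {
    "Tschechisches Streichholzschächtelchen": {"acoustic": 0.60, "prior": 0.0001},
    "Check me tschüss 2018 Hessen": {"acoustic": 0.20, "prior": 0.0010},
    "Tschechisches drei Kreuzberg": {"acoustic": 0.30, "prior": 0.0005},
}


def log_score(scores):
    """Combine both scores in log space (equivalent to multiplying them)."""
    return math.log(scores["acoustic"]) + math.log(scores["prior"])


def best_transcript(cands):
    """Return the candidate text with the highest combined score."""
    return max(cands, key=lambda text: log_score(cands[text]))


if __name__ == "__main__":
    print(best_transcript(candidates))
```

In this toy setup the tongue twister matches the audio best, but because the rare compound noun receives a very low language model score, a more “expected” word sequence ends up with the higher combined score.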
Tongue Twister Duel: Germany vs. USA
Even the Alumniportal Team gave some tongue twisters a try, in the form of a duel. Who won? Find out by watching! #apdsiri
Automatic speech recognition technology can and should go hand in hand with linguistic research. Nowadays, Artificial Intelligence technologies are ubiquitous, and linguistic knowledge can be helpful in multiple areas, for example in Intelligent Language Tutoring Systems. Natural speech synthesis may help people who have already lost or are currently losing their voice to receive a synthesised version of the voice they had before. People who do not yet speak the language of the country they currently find themselves in may be able to have every web page and document translated for them. Moreover, there are some Artificial Intelligence based systems which are able to learn languages themselves. Modelling language learning systems provides important insight for linguists from the field of language acquisition, where scientists study the mechanisms behind children’s ability to learn languages unsupervised. The conclusion we come to here is that there are many possibilities for linguists to apply their knowledge and skills, and that interdisciplinary research projects are very likely to take our understanding of what linguists can actually do to a whole new level.