"The need for psycholinguistic diagnostics for language models. But before we get directly into that, I'd love to know a little bit about your background, and especially what someone in the linguistics department thinks about all the things that one might classify as computer science.

I have a bit of a diverse, interdisciplinary background. I was originally educated in linguistics and psychology, in human cognitive science, and I worked in neuroimaging. After getting my PhD I moved fairly heavily into computer science and computational modeling of cognition as well. So I am a linguist, but I have a perspective that has spent a lot of time in the middle of what's going on in NLP for the past five or ten years; my perspective is quite interdisciplinary in that regard.

Well, the past five or ten years have been kind of a magic time period in NLP. How much of that did you anticipate when you started this journey?

I don't know that I could have anticipated it when I started; I was really just getting my sea legs and getting a sense of the field. It's definitely been very exciting to see what is happening with the rise of deep learning, and with the rise of models like BERT that make use of pre-training and have very impressive performance across a wide range of tasks. But what's also exciting, for people like myself who are interested in better understanding how these models work, is the challenge of taking very complex black-box models like these and trying to find ways to illuminate what's going on inside them, and also to find effective ways of assessing how good they really are at what we might think of as language understanding.

NLP has been coming up internally with some of its own diagnostics, as you mentioned, a lot of those benchmark tests. But how many tools might we be overlooking, ones that the linguistics world already has and that we ought to be applying?

That's not to say that the diagnostics that already exist within NLP are not valuable, of course, and there are many people already in NLP
who have been turning to linguistic diagnostics of various types. But yes, absolutely, I think what you're getting at is that, as we find in this paper, tests from psycholinguistics have a lot of very valuable characteristics, because psycholinguists and cognitive scientists have already spent many, many decades studying a black-box system, namely the brain, trying to understand what it is representing about language at a given time and how it is doing that, and to better understand those underlying mechanisms. So there's a lot of potential in drawing on work from that area, which has been designed to carefully target these types of questions, and applying it to systems like NLP systems, deep learning systems, to ask similar types of questions about what those systems know, so to speak.

Most of the headlines about BERT brag about all of the benchmarks it's knocking out of the park. Can you tell me a little bit about some of the benchmarks it's struggling with?

BERT has been tested on many, many different benchmarks, and I'm certainly not familiar with all of the tasks it's been tried on. But I think some of the most striking results people have shown tend to be those where they test with adversarial inputs, such that BERT is not able to respond effectively to those types of inputs, or where they show that even though BERT is performing quite well on a benchmark on its face, if you look at how the model is really solving that problem, it ends up exploiting cues in the benchmark that allow it to perform well without really needing the types of information the benchmark is designed to test for. So you have examples where, for instance, despite the fact that BERT has performed very well on a given benchmark, if you test it on examples that exploit the cues the researchers hypothesized the model is in fact using, they show that they can get the model down to about zero percent
accuracy by turning those types of cues against BERT.

I definitely want to unpack more of the challenges you put to BERT, but before we get deep into that I want to give it one more compliment. As you point out in the paper, it does have a very strong association of nouns with their hypernyms. For those of us who don't have our linguistics definitions ready at hand, could you share what a hypernym is and how that shows up?

A hypernym names a higher-level category that a given noun is part of. For instance, a robin is a bird, and a poodle is a dog; "bird" is a hypernym of "robin."

One of the surprising areas where BERT was failing was in understanding negation. Could you give a quick example of that?

One of the most striking results we found was that BERT failed completely to show a generalized understanding of the meaning of negation, within the context of the sensitivity tests that we applied. We had sentences like "A robin is a ___," and we looked at how BERT would do at predicting the next word and what probabilities it would assign to different words in that context. We also added negation: we said "A robin is not a ___." Once you add a "not" there, the true continuations reverse, because the things that a robin is not are the opposite of the things that a robin is. Humans obviously know this, and that is what constitutes an understanding of negation: knowing how it changes the proper continuations. What we found with BERT, in testing its assignment of probability to words, is that if we compared the probability it assigned to a word that was a true continuation with the negation against a word that was a true continuation without the negation, it preferred the word that was the true continuation without the negation in both cases. To give a specific example: if you give it "A robin is a ___" and you look at how much it likes the word "bird" versus how much it likes the word "tree," it always prefers the word "bird," even when that creates a sentence that's untrue, like "A robin is not a bird." So I would call
all of that a rather simple example, probably one a young child would be capable of filling in. It's a conundrum, because BERT on one hand is awesome at certain tasks, but then it's surprising to me that it fails on this one. I can maybe see why, but it seems like such an easy task for a human. Where does this leave you in your perception of how advanced something like BERT is, or how far along the path toward AGI?

Absolutely. I think it's an important sanity check, and it's a useful type of thing to be able to identify, because BERT performs so spectacularly well on so many benchmarks, sometimes with superhuman performance, that it leads us to really need to wonder how good this model actually is. If we can't find limitations to the model, has it really solved natural language processing? Has it really reached superhuman-level language understanding? I think most researchers are fairly clear on the fact that this is not the case, and that this is a limitation of our benchmarks. So it's very useful to be able to identify a very basic limitation like this, and it's a limitation that makes a lot of sense if you think about the way these models are trained, as I mention in the paper. Language models are trained to predict words in context; that's their job. And with negation there's not necessarily a clear prediction you can get from a sentence like "A robin is not a ___," because there are very, very many things that a robin is not. So it makes perfect sense that a language model might struggle with picking up on a meaning like that. Being able to find a limitation like this one, which we can explain fairly straightforwardly given the way these models work, helps ground us, and it's the beginning, I hope, of many more clarifications we'll be able to find about the linguistic capacities these models really have. In tandem, of course, people are working on examining the benchmarks that we have and identifying
which cues BERT is exploiting versus what it really is able to use in a deep way, from having learned that linguistic capacity during pre-training."