Listen: Grover: An Algorithm for Making, and Detecting, Fake News
"Did you hear the news? What news? Sesame Street has been cancelled. What? No, don't tell me that. I'm sorry, that's actually fake news. And that's what we're going to talk about today, and it actually connects with Sesame Street in an unusual way. You are listening to Linear Digressions. Actually, maybe it connects to Sesame Street in the most boring way possible, which is that the model we're going to be talking about is named Grover. And, funny side note, in other contexts we've also talked about ELMo, which is a machine learning model, a word-embedding thing, and there's something called BERT. I think Sesame Street naming is just having a moment in machine learning right now. Okay, just to be very clear: Sesame Street is not cancelled, goodness. That would be a terrible mistake. Sesame Street will go on for centuries. I sure hope so. So, machine learning today. Yeah, this is a new algorithm out of the Allen Institute for Artificial Intelligence, which is a group based up in Seattle that does a lot of work across different areas; this comes from their NLP group, their natural language processing group. In particular, as the bad joke early on suggested, Grover is an algorithm aimed at fake news, which as we all know is a problem that's only becoming bigger every day. So the question is, how do we think about it? What involvement does machine learning have in making the problem worse, and, hopefully, in not making it worse? Okay, just before we leave the topic of Sesame Street: was Grover a particularly tricky character? I thought he was just cute and lovable. I think he's mostly cute and lovable; he's a little bit anxious, as a Muppet, that's his main trait. Maybe they should have called this one Oscar or something. Well, number one,
maybe there's some other algorithm called Oscar. Anyway, moving on. So the general idea is that we're interested in an algorithm that can do two things: generate fake news, and discriminate between fake news and real news. And there's actually a lot of substructure to those two ideas, so we should dig into it a little more, but it sounds like you were about to say something. Yeah, I just want to clarify: the purpose of creating fake news here is not to mislead, as most fake news is created to do. It's so the model has something to compare against real news sources. Is that the motivation? Well, yes. To unpack a little what we mean by fake news and real news here: number one, there are two different types of news relevant in this treatment. One is fake news in the sense that it is generated by a computer; it might not correspond to anything in real life, but it reads like a regular news story, if you want to think about it that way. And second, there's the notion of computer-generated what they call propaganda, which is fake news that's deliberately damaging and offensive. Right, so there's just false stuff, and then there's stuff that's maybe more acerbic. Right. So when they were thinking about training this algorithm, they had two different types of data that they trained on. One was ordinary, what I would call mainstream-media written news articles, coming from outlets like NBC, CNN, and The New York Times. And then there's also news, for some versions of the training, that came from places like Breitbart and Infowars, some of the outlets that are a little more famous for, let's say, adhering to less rigorous journalistic standards. And the idea is that these are two different types of
generation and discrimination that the algorithm might want to do. So those are the two different types of fake news, but then there's also, of course, the corresponding version of each of those genres of journalism actually written by humans. Right, okay. So I'm imagining one of those two-by-two grids where you've got the two kinds of news in the columns, and maybe who generated it in the rows. Yeah: is it mainstream news or propaganda-style news on one axis, and was it generated by a computer or by a human on the other. So those are the parameters. And so then what they do is go acquire a pretty big dataset; it sounds like they had to go scrape it themselves from a number of different news sources to put together their training data. And, as far as I can tell, the actual structure of the algorithm itself is really similar to GPT-2, which we talked about a little while ago. That was the language model that generated very interesting passages, the one that we called, and I'm using air quotes here,
the model that was "too dangerous to release" because of how high-quality the fake text it could make was. Imagine the Harry Potter fan fiction it could produce, right, and how quickly. Right. So the general idea is that we have a two-part algorithm: a generator that makes passages, and then a discriminator that tries to tell the computer-generated stuff from the real stuff. Oh right, and these things can kind of play off each other and both learn and grow together. Yeah, exactly. So it's the same general structure, the same idea. They had a few different versions of the model that were different sizes, as in how many free parameters each of the models had, which is roughly how much detail they're training it to capture. And so then they go and create this Grover model, and the paper talks a little about some of its attributes, which I think are kind of interesting. So what are some of those attributes? The first is that exactly what kind of news, and exactly how they generate it, has an impact on how good the generation task is. So they actually broke the problem apart into a few different pieces and then had different versions of the model that they trained. And when you break it apart into different pieces, there are several different parts of a news story that you can imagine. There's the main body of the news story, but there's also accompanying metadata. One piece is the domain, as in what site the article is coming from, something-dot-com. What is the date on which it's published? Who are the authors of this piece?
What is the headline of the piece? And then there's the actual textual body of the piece itself. And so in some cases they would pre-generate, or seed, the starting conditions of the algorithm with, say, the domain, date, and authors, and then have it generate a headline; then take those four attributes, domain, date, authors, headline, and use them to generate the body. Sometimes they would use different combinations of what they started with and what the model had to fill in. But in general, when they started with more of that contextual metadata at the beginning, the model would come up with textual bodies that sounded more believable to the humans who were evaluating these things. Second, and I thought this was pretty interesting: Grover does not make particularly plausible-sounding "real" news. So if you're reading a real news article versus one that was generated by Grover, in aggregate humans are relatively successful at telling the difference between the two. But when it comes to propaganda, the more sensationalistic fake news, Grover actually generates better propaganda than humans do, which I think is a really funny and interesting finding. Yeah, so there was a manual evaluation step: they had three annotators, presumably people, and then on the metrics of style"
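The conditioning scheme described in the episode, seeding the model with some fields of an article (domain, date, authors) and having it fill in the rest (headline, then body), can be sketched in a few lines. This is a hypothetical illustration, not the actual Grover code: the field names come from the discussion above, but the tag format, function names, and placeholder values are invented for the sketch.

```python
# Hypothetical sketch of Grover-style field conditioning (not the real
# Grover implementation). An article is treated as a set of fields, and
# generation is framed as "fill in the missing fields given the known ones."

FIELD_ORDER = ["domain", "date", "authors", "headline", "body"]

def build_prompt(article: dict) -> str:
    """Serialize the known fields into a conditioning prompt.

    Fields absent from `article` are the ones a language model would be
    asked to generate next (e.g. first the headline, then the body).
    """
    parts = []
    for field in FIELD_ORDER:
        if field in article:
            parts.append(f"<{field}> {article[field]} </{field}>")
    return " ".join(parts)

# Seed metadata only; values are placeholders, not from the episode.
seed = {
    "domain": "example-news.com",
    "date": "2019-05-29",
    "authors": "A. Writer",
}
prompt = build_prompt(seed)
# A real model would now sample a headline conditioned on `prompt`,
# append it as a fourth field, and repeat the process to sample the body.
```

The point of the sketch is the ordering: the more metadata fields appear in the prompt before the body is sampled, the more context the generator has, which matches the episode's observation that richer metadata seeding produced more believable article bodies.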