Learning Visiolinguistic Representations with ViLBERT w/ Stefan Lee

This Week in Machine Learning & AI | Automatic Transcript

All right, everyone. I am here at NeurIPS 2019 and I am with Stefan Lee. Stefan is an assistant professor at Oregon State in the School of Electrical Engineering and Computer Science. Stefan, welcome to the podcast.

Yeah, thanks for inviting me.

Absolutely. Let's talk a little bit about your background before we dive into some of the many things that you're presenting here at the conference.

Sure, sure. So I did my PhD at Indiana University, and most of my work there was sort of on the core computer vision side: how do I use computer vision to help scientists do various tasks? A lot of it was replacing what would otherwise be human labeling tasks. I get bored quickly, so in my postdoc I extended out to vision and language. I'm thinking about problems where an agent has to reason about visual input and linguistic input, so it has to not only understand the visual content but also express that understanding by generating language or responding to language, things like that.

And is visual question answering one of the tasks that's interesting to you?

Yep, so visual question answering: I've got a number of pieces of work on that topic. There are also things like captioning, or doing visual dialog, multi-round QA-style dialogs. And then lately I've been extending out to tasks that not only have vision and language but also some form of action. So these are language tasks situated in embodied contexts, where an agent has to see and talk and move to accomplish some sort of task.

And are the agents that we're referring to here simulated agents, mostly?

They're mostly simulated agents, but some recent work has sort of extended out of the simulator onto physical platforms, with some surprising success.

Okay, what's an example of a platform that you're using, for the simulation or for the robotics on the embodied side? What's the kind of dimensionality, how complex are they?

Yeah, so most of the work we've been doing so far has been on this PyRobot platform, which is something that Facebook has recently released. It's a wonderful low-cost robotic platform, LoCoBot I think they call it, that has a really nice interface for machine learning practitioners.

Oh really? I have not come across it.

It's worth looking at. You know, you can say "from pyrobot import Robot" and then tell the robot to go forward one meter, and that's sort of the level of the interface, so a machine learning person can use the robot for things like navigation and grasping.

So here at the conference you've got a number of presentations and posters you're involved in. We'll talk about one of them, ViLBERT, in the most detail, but what are some of the others that you've been focusing on?

Sure. So this year I'm presenting a number of things. One of them is one of these embodied tasks: it's work on what's called vision-and-language navigation, where an agent is sort of dropped into a never-before-seen environment and given a natural-language navigation instruction, something like "go down the hall, turn left at the wolf head, and stop at the third bedroom, the one with the yellow bedspread," or something like this. So instructions are this mix of trajectory cues, like go forward and turn left, and visual grounding landmarks, and then the goal is to have an agent that can reason about all of this while actually following that path in a simulated world.
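[Editor's note: a minimal sketch of the PyRobot-style interface Stefan describes above, assuming the publicly released pyrobot package and a LoCoBot base. The class and method names follow PyRobot's published examples as best I recall them and are not taken from the interview itself.]

    # Rough sketch of the "from pyrobot import Robot" interface mentioned above.
    # Assumes the pyrobot package is installed and a LoCoBot is connected;
    # method names are based on PyRobot's published examples and may differ.
    from pyrobot import Robot

    robot = Robot('locobot')                    # connect to the low-cost LoCoBot platform
    robot.base.go_to_relative([1.0, 0.0, 0.0])  # drive roughly one meter forward (x, y, theta)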
What are the priors that the agent is bringing into this environment? Does it already know what a wolf head is, or does it need to figure that out from training?

So it's a tricky question. A lot of vision-and-language work, and we'll touch on this with ViLBERT, sort of starts from a set of pre-trained image features and a set of pre-trained language features. What it doesn't have is a sense of grounding. So it may be able to represent a wolf head visually, and it may understand the words "wolf head," but the connection between the visual incarnation of a wolf head and the term isn't there. That's something you expect it to learn during the specific task, though in ViLBERT the point is that we're trying to pre-train grounding itself.

And so you've got the agents paper.

Yup.

And then there's ViLBERT. And then I'm also giving a talk at a workshop on emergent communication. So if you have agents that interact to perform some task and you want them to share information with each other, or communicate, how could you make that communication protocol more interpretable to humans? One way to do that is to make them actually use sort of discrete symbols, so something that looks like a word, more or less, even if it doesn't necessarily have that meaning. And then the talk is about how we make these communication protocols more interpretable.

Now this whole field was kind of, I don't know if popularized is the right word, or vilified, a few years ago with the Facebook work on those two agents that were said to develop their own coded language. This is a similar kind of vein of research here?

Yeah, except for the fact that we want to understand what the code is.

Right, yeah. That was the "Deal or No Deal" paper, which had some similar goals. And so it's, in a sense, kind of merging the field, or the desire for explainability, into this emergent communication work.

Yeah, yeah. If you're thinking about building agents that will eventually work with humans, communication ends up being a big part of that; that's how we organize ourselves. So making sure that we understand what the agent is saying, and that it understands what we're saying, is sort of one of the goals of my research. But it's not necessarily constrained to English. It's fine for an agent to come up with its own stuff if we can map it back to some space that has human-like structure. So rather than having a unique word for the concept of "red ball," we would prefer the agent to have a word for red that modifies a word for ball. Even if it's not English, it's something we can map back to a way that we understand language to work.

So ViLBERT, what's it all about?

Yeah, so I hit on this a little bit ago, but ViLBERT is about learning the associations between the visual incarnation of a concept and the linguistic concept. And this is sort of tricky, because most people instantly hallucinate the visual part of something whenever they hear the word. When I said "wolf head," I assume a lot of people immediately thought of, like, Game of Thrones type of things, and there was some visual concept that got brought to mind. But machines don't automatically have that. You can learn language just in a linguistic context, and in fact that's what most of natural language processing does: it's just learning a word based on its association with other words. Likewise on the visual side, you're just sort of learning to represent some sparse set of classes, and those classes often relate to specific nouns.
But they don't have a sense of closeness, right? There's no notion that the feature for a cat should be close to the feature for a tiger because they're related linguistically or in some taxonomy. So the point of ViLBERT is to try to learn these associations between vision and language directly, and this is something we usually call visual grounding in our line of work.

And so what's the general approach there? Presumably it involves BERT, the transformer model, given the name.

Yeah, so if you look at the BERT models and some of the big successes in NLP, it's these large self-supervised tasks. They take a large language corpus and they learn these little proxy tasks that build supervision from unlabeled data: they'll mask out a few words and have the model re-predict them based on the other linguistic context, or ask whether one sentence follows another sentence in the text. And we find analogues for that in the vision-and-language space. There's this dataset called Conceptual Captions that came out recently, which is massive, on the order of about three million image-text pairs, where they just found images online that had alt text, so some human had provided alternative text, usually for people with visual impairments; you might interact with it by mousing over an image and it shows some text tag. They did some processing on top of that, but it's this sort of weakly supervised data that they scraped off the web.

And are the images filtered for simple kinds of image types, or are they just raw, whatever's out there?

Across the board, everything from Flickr-style images to pictures of maps and things like that, so it's pretty broad. But that's where we start, that's our data source, and we perform similar self-supervised training where we're masking out certain parts of the image just at random and then asking the model to reconstruct those given the rest of the image and the language; likewise we're asking whether the sentence matches with this image or not, or masking out parts of the language and having it reconstruct them from the image and the text. So we're designing this sort of self-supervised multimodal task over this large, weakly supervised dataset. And what we get at the end is a model that has built representations that bridge between vision and language, and then we fine-tune that for a wide variety of other tasks.
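[Editor's note: to make the self-supervised setup described above a bit more concrete, here is a small, hypothetical Python sketch of how one ViLBERT-style pretraining example could be assembled: random masking of caption tokens and image-region features for reconstruction, plus an image-text alignment label. The function and field names are illustrative assumptions; the actual ViLBERT pipeline uses object-detector region features, a two-stream transformer, and real loss functions rather than this toy bookkeeping.]

    import random

    MASK = "[MASK]"

    def mask_items(items, mask_prob=0.15):
        # BERT-style masking: hide a random subset and keep the originals as targets.
        masked, targets = [], []
        for item in items:
            if random.random() < mask_prob:
                masked.append(MASK)
                targets.append(item)      # model must re-predict this
            else:
                masked.append(item)
                targets.append(None)      # no loss at unmasked positions
        return masked, targets

    def make_pretraining_example(image_regions, caption_tokens, mismatched_caption=None):
        # Build one multimodal example: mask parts of each modality and record
        # whether the caption actually belongs to the image (alignment task).
        aligned = mismatched_caption is None
        tokens = caption_tokens if aligned else mismatched_caption
        masked_regions, region_targets = mask_items(image_regions)
        masked_tokens, token_targets = mask_items(tokens)
        return {
            "regions": masked_regions,         # input to the visual stream
            "tokens": masked_tokens,           # input to the linguistic stream
            "region_targets": region_targets,  # reconstruct masked region content
            "token_targets": token_targets,    # re-predict masked words
            "is_aligned": aligned,             # image-text matching label
        }

    # Example usage with stand-in region features and a toy caption.
    regions = ["region_%d" % i for i in range(8)]
    caption = "a dog catching a frisbee in the park".split()
    example = make_pretraining_example(regions, caption)
    print(example["is_aligned"], example["tokens"])

[In the paper, as I understand it, masked image regions are supervised by predicting the object detector's class distribution rather than by reproducing raw features, but the sketch above captures the shape of the masking and alignment tasks.]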