Text Analytics and NLP in Financial Services - with Ram Sukumar of IndiumSoft
This is Daniel Faggella, and you're listening to the AI in Financial Services podcast. There's a term that I don't really like using because it's awfully played out, but the term is dark data. Maybe five or six years ago on the podcast, we had people actually saying the term dark data. It turned into a bit of a buzzword for a while, but there is some truth to it, and in the financial services universe there's a tremendous amount of dark data, from microfiche to physical paper to absolutely unintelligible documents and images stored in various and sundry places within a bank or financial services institution. There's a lot of info that we cannot get to, that we have to manually pick apart and look into. So being able to do text analytics and apply natural language processing fluently to get value from that dark data is a big deal for business. It helps us speed up operations, helps us automate certain workflows, and can even enable entirely new capabilities.
We speak this week with Ram Sukumar. Ram is the co-founder and CEO of Indium Software. Indium is headquartered in Cupertino, with the preponderance of their workforce in India. I interviewed Ram from India for this episode, and he speaks with us about particular workflows for text analytics. What does it look like in operation? Where do people get it wrong? Where can it fit into play, and where is the potential value for text AI? When it comes to ideas for use cases, and when it comes to the practical realities of what you need to expect when you're deploying these technologies, Ram provides some useful guidance that I hope will be helpful for all of you podcast listeners.
If you're interested in more use cases and if you're interested in more best practices around AI adoption and ROI, then check out Emerj Plus. It's emerj.com/p1. This is a resource explicitly for enterprise innovation and strategy leaders and consultants and advisers.
If you need to bring AI to life, if you need frameworks and best practices for finding ROI, if you want at your fingertips our entire library of thousands of use cases, then check out Emerj Plus. It may be a useful resource for you if you want to take your insights one step further. Again, that's emerj.com/p1. That's P as in plus, then the number one. Without further ado, we'll fly into this episode. This is Ram with Indium Software, here on the AI in Financial Services podcast.
So Ram, where I wanted to start us off is on extraction. I know text is your world when it comes to AI, and obviously a lot of big companies, whether it's healthcare, financial services, whatever, are struggling with text that they then have to put into their systems somehow. If we just talk about financial services as an example, and I know you work there, what does that process look like now? You know, we get invoices, we get receipts, we get paper forms. What is the manual process to put that in today?
Dan, different companies are adapting to this differently. They are at different levels of maturity. Some just use large BPO outsourcing centers to process these documents with a manual process. Some use a semi-automated process, maybe a combination of some automation and some manual intervention. And of course there are some emerging players that are trying to democratize this whole extraction process. I think the challenge is that the complexity of documents varies from company to company. We're seeing this with companies right from New York to Singapore: a variety of documents with different complexities, different types of labels, nested tables, and the types of data are different. So I think that's where we use our technology as a solution. We think of our text capabilities as deployed around solutions, and we're seeing increasing use. For example, for vision we use Tesseract.
That is an OCR engine, and then we apply models like CNNs. And then we implement these using TensorFlow and a lot of Python libraries. We also use LSTM algorithms, deep learning algorithms, so it is a combination of these to actually find the boundaries of the labels and look at the tabular data within the documents, and then extract them and convert this into meaningful information for the enterprise. The documents can be different: resumes, bank statements, and invoices are probably among the simpler ones. Then you've got investment analyst reports. It's difficult for people to go through a full report page by page, so one has to extract and summarize. So we're seeing an opportunity where we use our solution both for extraction and to summarize them, so that the analysts are able to go through a report on a company, let's say, in a shorter span of time. So I think we're seeing several use cases open up, and some of these deep learning algorithms can be applied, and the benefit could be improved speed at which the documents can be processed and the accuracy at which they can be processed. I think the benefits are many.
Yeah, that's the promise. The promise is we can do it faster, we can do it more accurately, and we can have data that's in a form that we can search, that we can learn from. Maybe we can even train algorithms on it, whatever the case may be. So, you know, a cleaner data ecosystem, a faster process. I think that's the golden dream of what we'd all hoped OCR would turn into, for the folks that have been in this space for a long time.
There's also a lot of data that's coming from voice-to-text, so extracting that information as well, be it in the legal industry or the medical industry. There's a lot of contextual data. It's not just the medical terminology or the legal terminology; there's a lot of contextual learning that needs to be done.
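The extract-then-summarize idea described above for long analyst reports can be illustrated with a minimal sketch. This is not the guest's method: the conversation mentions deep learning models, whereas the code below swaps in a simple word-frequency extractive summarizer, and the function name `extractive_summary` and the stopword list are assumptions made only for illustration.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on",
             "with", "that", "this", "it", "as", "by"}


def _tokens(text: str) -> list[str]:
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def extractive_summary(text: str, max_sentences: int = 2) -> list[str]:
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(_tokens(text))

    def score(sentence: str) -> float:
        # Average document-level frequency of the sentence's words.
        words = _tokens(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:max_sentences])]


report = ("Revenue rose. Revenue and lending revenue rose again. "
          "The weather was mild.")
print(extractive_summary(report, max_sentences=1))
```

A production system would score sentences with a trained model and domain context, but the shape is the same: extract the text first, then rank and keep a short subset for the analyst.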
So I think there are other industries where we can apply text extraction beyond financial services, like medical claims processing, where they want to use it for validating the accuracy of medical claims. I think the use cases are clearly there.
Yeah, extraction is, you know, applicable anywhere we need to extract text from whatever it is, whether it be an image or an ugly-looking Word doc, and store it somewhere. I like to use individual examples, because I like the listeners to have a mental image of the before and a mental image of the after. Without a conceptual grasp, it's very, very hard for people to make the right strategic decisions about these technologies. So if we hang in finance for a second (of course, we've had Nuance on the program, and there's a jillion players in healthcare and the other domains here), we think about when this information is being pulled. You mentioned BPO; that's a business process outsourcing firm. Often they're going to be in India, maybe the Philippines. These are people who are going to read these things; they're going to copy-paste the right sort of data or manually type the right data into the right form fields and then push enter, and then it's all going to be stored in the actual system the right way. I imagine some OCR systems do that as well. For most extraction problems, is there sort of a big custom element? It feels that way because we have to be able to define how we're going to store this, right? Every bank is going to store things differently. Every insurance company is going to store things differently: what exact fields, what exact format. So it feels like there's a bespoke nature from the get-go about, hey, you need these invoices to go in this way, you need these reports to be put in this way. Talk a little bit about how that has to be customized and tweaked.
You hit the nail on the head. Yes, that's one thing that we're seeing: there's no one size fits all. What we have done is build
all these algorithms and the AI engine to be able to handle this process. But within the financial services industry itself, bank statements come in so many different types, and it has to be customized. That's where we use our services capability, where we have experts, where we bring in data scientists to be able to customize this. And the way this is consumed by enterprises also seems to vary. Some want it back in the form of APIs. Some of them just want it back as a file, in some file format they can use within their databases and for their teams to consume. So in how the output is consumed by enterprises, we also see a lot of variation. So it's difficult to do a one-size-fits-all solution. But I would say fifty to sixty percent of this has a lot of commonalities in the models we're using, and then there is some amount of training that we need to do for that particular use case. And then the final output, how each customer uses the final extracted information, is again customized.
Yeah, because everybody's business process is different. It's like, "Oh well, for us, we take these documents, we use a summary, somebody reviews it at this level, and then they go into storage this way," and then another bank might do a totally different thing with different orders and a different process. Maybe that has to do with some legal concern, or who knows, but you've got to find a fit for all of those. My guess is that over time, when it comes to certain kinds of docs, there might only be a set number of categories of things you would do with them. One of them would be, like you said, "Well, we need a simplified form in this format." That's one: turning a piece of paper into another piece of paper that's simpler. Another is, we need to take these seven bits of information for all these docs: you know, who's the client, what's the amount to be paid,
you know, whatever, and we need to enter it into these fields and push enter and push that into a database. So that's another instance. There's probably only so many broad umbrella categories of "what will we do with this document?" Am I missing any there? I mean, I've touched on two, but I wonder if there are any other big ones, or are those pretty much the ones?
There's also the whole question of volume that comes up. There are customers with millions and millions of documents, and there are some with a wide variety of documents but far less volume. So there's a variety in the extraction challenges. The way it's consumed also varies, we've seen. Some like it on the cloud, where we have options for when the volume is less, and then there are some who want it on-prem, where, you know, it's managed as an enterprise solution. So the way customers use these varies significantly. But yes, I think you've covered the main challenges that customers are facing.
Yeah, I am familiar with the BPO world. We've actually, for our market research work, worked with a number of clients in the BPO space who are trying to now get into actually automating these processes, right, because they realize if we're just a houseful of humans, there are going to be limitations there. My personal hope, since you're in India, is that a lot of India wakes up to, "Hey, what if we could climb a little bit upstream and keep some of this value, as opposed to just having the lower-price value proposition," which I don't think is going to last for another ten years. So I'm rooting for India in a big way to do a lot of what you're doing here now. But I'm very familiar with that world. The other approach that people are trying to take today, and we'll go a little bit more into where the state of the art is, is more old-school
OCR approaches. When you think about what OCR used to be eight or ten years ago, and then you think about the state of affairs now, the different combinations of algorithms, what to you is the main area of improvement there? What can we now do that was particularly hard with older-school technologies for extraction?
It has come a long way. These are some of the things we have seen people grapple with. With today's technology options, being able to use some of these algorithms at scale is itself significant; I think there's a lot of technology at play there. Even if you look at Tesseract, it's just an open source OCR engine. I think there's a lot of maturity there, but that's not the end. Our business is to take that and of course use AI on top of it, because that just gives you the OCR. Being able to do that at scale, thanks to the cloud and the technology options that are available today, is a long way from the OCR of not even five years ago. I've seen some of those OCR tools; I don't think they can work at the scale that cloud solutions are able to. For example, our solution is on Google's cloud. We have an instance per customer. We have a customer processing, let's say, a million records a month. We're using some open source technology within that, and just being able to offer that kind of scalable solution is incredible.
So yeah, we've got more compute to just crank documents through. We can deploy a lot of that compute up in the cloud instead of having it sit wherever the heck it used to sit twelve years ago. But also the capabilities of the tech are obviously more expanded. You talked about the contextual data.
You know, the different kinds of format challenges here that you addressed. I've done a ton of work in the search and discovery space and I'm familiar with some of the challenges, where the format for X kind of doc, whether it be an invoice, whether it be a certain kind of report, whatever, is going to come in with variance and variety that a lot of the time we just don't have control over, and so there's going to be a lot of this manual training. How is that adapting to different formats, adapting to different kinds of documents, more powerful now than it was then? What sort of allowed that?
Obviously the OCR was available. But I think the big thing is what it has become able to do. Whether it's handwritten documents or documents with different quality images, I think the accuracy has significantly improved, in line with other digital documents. Being able to even implement some of these models with Python libraries, you know, that has matured over the years. Then there's TensorFlow. There are a lot of these Markov models that we use, which are also used in speech recognition and so many things, and which have been known for a while. I think the overall options available, where some of these packages can be used and deployed at scale with higher quality, are what has really made the difference between what we're doing today and what was possible five years ago.
Okay. We'll get into now what it looks like to deploy AI into this process. So back in the day we had OCR, which was maybe a little more limited in scope to certain bounding boxes of documents that we could reliably train on, maybe without as much variety. And then we had the BPO, where we can kind of send everything in; you know, there are folks who aren't going to cost a ton of money per hour to put that stuff into fields.
Now we move to a world where we can start to expand what is machine-doable, and hopefully again faster, hopefully better quality. That's the dream, right? It's not always easy on the first go. What does it look like to get that process going? Let's just pretend, you know, I'm a bank. I come to you. I say, "Hey, here are these three different kinds of documents that we've just never been able to handle with OCR." What are the steps to start to integrate AI into that process and get the output that they're looking for?
Typically customers come to us with a problem statement, saying, "We have these documents and we've had challenges using traditional OCR or whatever other tools are available out there to extract from them. Can you take a look at this?" The first thing we do is create and train our models for that particular document. OCR is only one aspect of that. We then have to identify whether it is account information or data inside a table, and sometimes you have nested tables that we have to deal with. So we use our AI models, and again we use a lot of machine learning algorithms to be able to do this, and we then customize the model to that particular use case. We have seen everything, right from a document that has a drawing and tables, to a document that had a lot of tables within the text, to a simpler, structured table document, which is all the easier to handle. So I think each one presents a different challenge. The first focus is on extraction, and then comes the contextual part. That's where some of our summarization capabilities come in. A lot of solutions do extraction, but we thrive in complexity, and we thrive when more analytics needs to be applied on that extracted information, to apply context to that particular information. That could be, say, an investment analyst, whose needs will be very different from, well,
let's say, a bank operation, or someone from medical claims processing. Each one presents a different set of context, which we then train the solution on. Everybody wants one hundred percent accuracy, and obviously that's a big ask. So we start with a process where there is some degree of manual intervention to be able to achieve one hundred percent accuracy, but over time, as the model learns, that manual share comes down and it becomes a hundred percent automated process.
Yeah, that's the goal anyway. So let me know if I'm correct on the steps here, Ram, because again I like to make a visual in the minds of the listeners. The way I understand these technologies, and you'll have maybe a more granular way to articulate this than I will, because you do it, while I'm a market researcher and I look at a million other things. So the way I understand this is, we'll find these new complex documents. Okay, we've got images, and then we've got text, we've got tables, text on both sides of it. It's very clunky to train on. In some cases it might be possible to ask the client, "Hey, is it possible to get these reports in a little bit better format in the first place?" Sometimes the answer is yes. Most of the time, I would guess, it's totally out of their hands. So we can try. But then we have the docs as simple as we can get them, which is often not great, and we've got to essentially do some human lifting here and say, "Okay, for this document, this field is going to get entered in this place. These two fields we're going to exclude; we're never going to do anything with them. This field at the bottom, we're looking for these terms, and we're going to put whatever we find into these two slots, and then we're going to enter it into the database." I'm making up an example. And so there might be fifteen slightly different variants of the way these docs come in, and so we need to take X
number of each of those fifteen and manually say, "Here's the bounding box, here's what we're searching for, here's how we enter it, here's what we extracted," and get some of this human training. And then we'd get an algorithm to look at a new doc and say, "Which of the fifteen is it, and which routine am I going to run?" It feels like that's the process at the end of the day. However, again, you're the guy, I'm not, so let me know.
You know, you hit it. I think that's a nice way to put it for someone to understand. Customers rarely change their documents for us. We wish that happened, that would be nice, but it rarely happens. But yes, what we do is what a human would do: look at the document, take that document, "Okay, I've got account information here." What one would otherwise do manually is what we train our algorithms on, and then that uses ML and deep learning algorithms, a combination of those, to be repeatable at scale as more documents come in. That's pretty much what we do. But obviously customers don't come to us to throw a lot of humans at this problem, so we use that step only to learn how to do it the first time, so that we can then train our models to be able to do it. So when customers come to us, there is usually a customization and setup cost, which could take anywhere from a few days to a few weeks. And that's what we do in that period.
Yeah. I can imagine, again, that that time is going to be a little variable. Maybe sometimes they're just working with docs that are fortunately quite simple, and you get a certain number of labeled examples and you can just run, because you've seen a lot of similar stuff, and maybe in other cases it's going to be a little tough. So in that customization period, my guess is you and their subject matter experts on the feature extraction side of the house are going to come together. You're going to look at docs. You're going to edit docs.
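The "which of the fifteen is it, and which routine am I going to run" step described above can be sketched in a few lines. This is only an illustration under loud assumptions: the template names, keywords, and regular expressions are hypothetical, and simple keyword overlap plus regex rules stand in for the trained ML and deep learning models the guest describes.

```python
import re
from dataclasses import dataclass


@dataclass
class Template:
    name: str
    keywords: set[str]        # words that hint at this document variant
    fields: dict[str, str]    # field name -> regex with one capture group


# Hypothetical templates; a real deployment would learn these from labeled docs.
TEMPLATES = [
    Template(
        name="invoice_v1",
        keywords={"invoice", "amount", "due"},
        fields={
            "client": r"Bill To:\s*(.+)",
            "amount": r"Amount Due:\s*\$?([\d,]+\.\d{2})",
        },
    ),
    Template(
        name="bank_statement_v1",
        keywords={"statement", "account", "balance"},
        fields={
            "account": r"Account Number:\s*(\d+)",
            "balance": r"Closing Balance:\s*\$?([\d,]+\.\d{2})",
        },
    ),
]


def classify(text: str) -> Template:
    """Pick the template whose keywords overlap most with the document."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return max(TEMPLATES, key=lambda t: len(t.keywords & words))


def extract(text: str) -> dict:
    """Route the document to its template and run that template's field rules."""
    template = classify(text)
    record = {"template": template.name}
    for field, pattern in template.fields.items():
        match = re.search(pattern, text)
        record[field] = match.group(1).strip() if match else None
    return record


sample = "INVOICE\nBill To: Acme Corp\nAmount Due: $1,250.00\n"
print(extract(sample))
```

Fields that a rule cannot find come back as `None`, which is one natural place to hand a document to a human reviewer during the setup period.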
They're going to give you as much as they need to tell you for you to manually score a bunch of them. Probably there's going to be a period where you're going to run a bunch of fresh ones through, and you're going to work with them again and say, "How did we do?" and calibrate things where you need to. So again, there's this trailing period of training, and there's normally always going to be some ongoing level of, let's say, quality control. Nothing's ever one hundred percent. They're going to come up with new docs we have to adjust to. The algorithm might drift a little bit and start doing things we don't like. What does that long tail look like, for maintaining communication with the clients and making sure you're catching edge cases, you know, keeping things improving, controlling for quality? Can we go through that a bit?
Again, if you look at the solution, once you're done with the customization, you start the processing. There's a processing module, where some degree of processing begins, and then there's also a QC module, where we actually do random checks of some of the processed data. So that QC intervention is something that we're seeing is ongoing. The amount of that depends on the number of documents, on the volume. It could be just a few hours a month, or it could be several man-days a month. So it really depends on the volume and variety, but we do see some degree of QC intervention that needs to be there to ensure accuracy of one hundred percent. That tapers over time as the model matures, as learning happens.
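The random checks in the QC module mentioned above could look roughly like the following. It is a minimal sketch: the function name `pick_qc_sample`, the two percent rate, and the document ID format are assumptions for illustration, not details from the conversation.

```python
import random


def pick_qc_sample(doc_ids, rate, seed=None):
    """Return a random subset of processed documents for manual review."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    doc_ids = list(doc_ids)
    if not doc_ids:
        return []
    # Always review at least one document, even in a very small batch.
    sample_size = max(1, round(len(doc_ids) * rate))
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return sorted(rng.sample(doc_ids, sample_size))


batch = [f"doc-{i:04d}" for i in range(1000)]
to_review = pick_qc_sample(batch, rate=0.02, seed=7)
print(len(to_review))  # 20
```

Lowering `rate` over time is one simple way to express the tapering of QC effort as the model matures.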
The machine learning algorithms' accuracy gets better and better. Let me start over: customers come to us, there's the customization, and then there's a proof of concept period where we're actually doing this for them, and they're seeing the output and the quality, and obviously they're seeing that this makes their life easier. So we strive for as close to one hundred percent as possible, and again there's some degree of QC that we have to factor in over time.
Yeah, what does that look like? So I imagine what it might look like is somebody at the client company is querying the database, and when they see errors they click a little button, they fix it, and then you have a dashboard of what all those quick fixes are. And then you have maybe a call every month or however long to get a sense of whether there are any themes they've picked up on that we haven't, to kind of improve it. I just made that up off the top of my head. I don't run your company. But how does it actually work in reality?
What we do is, we have a dashboard that shows the number of documents processed. Then there's also some degree of QC that the customers do, especially in the initial phase, and then we have a dashboard where they report any errors. We look at those: okay, where was that? Was that because of a bug, the kind of thing where we need to fix the algorithm? Or is it that the context data that we have had so far is missing something new we haven't seen? That is then fed back into the algorithm, so that the error doesn't happen in the future. So we don't have a click-to-fix, something as simple as that, but we do have a dashboard that shows the documents processed and the errors, and then we examine what went wrong and why that happened.
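The dashboard described above, with documents processed, errors reported, and a split between algorithm bugs and context the model has not seen, can be sketched as a small aggregation. The `ErrorReport` structure and the cause labels `algorithm_bug` and `unseen_context` are hypothetical names chosen for this example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorReport:
    doc_id: str
    field: str
    cause: str  # hypothetical labels: "algorithm_bug" or "unseen_context"


def build_dashboard(processed: int, reports: list[ErrorReport]) -> dict:
    """Summarize reported errors against the number of documents processed."""
    docs_with_errors = {r.doc_id for r in reports}
    return {
        "documents_processed": processed,
        "documents_with_errors": len(docs_with_errors),
        "error_rate": len(docs_with_errors) / processed if processed else 0.0,
        "errors_by_cause": dict(Counter(r.cause for r in reports)),
        # Documents showing new context are candidates for retraining data.
        "retraining_queue": sorted(
            {r.doc_id for r in reports if r.cause == "unseen_context"}
        ),
    }


reports = [
    ErrorReport("doc-17", "amount", "algorithm_bug"),
    ErrorReport("doc-17", "client", "algorithm_bug"),
    ErrorReport("doc-42", "account", "unseen_context"),
]
print(build_dashboard(1000, reports))
```

Separating the two causes mirrors the triage in the conversation: bugs go to an algorithm fix, while unseen context is fed back as new training data.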
And then we can tune the model and make sure it doesn't happen again. How long that takes really depends, again, on the complexity and the variety of the documents, but it can range from a month to even several months of continued QC work.
Yeah. And I imagine, in terms of marking the things that are errors and keeping track of those errors, the people on the client side have to know at least somewhere to keep track of them. Is there often a champion who's heading up this project, this integration, this partnership with you, and part of their role is coming up with a mandate of, "Look, when you find an error, here's where we're going to store it"? Because clearly you have to have that data populated, and somebody in there has to own it. If nobody's responsible for it, then no one's going to report errors. Is there often a champion there, or how does that work?
You know, in every enterprise we're talking to, the need for this comes from different departments. It could be the customer service department, it could be operations or finance, it may be an investment analyst team that has this need. So the need varies. What we have seen is that companies now usually have transformation teams, so there's usually a champion looking at better and smarter ways to do these things. So usually we're seeing that kind of champion, who's driving this, working as a single point of contact. It's usually someone from the innovation team or someone from internal operations who is dealing with these things and challenges.
Cool, okay, good to know. You know, all of us have plenty to learn in the next couple of years as to where these things fit in. What's the right combination of our talent and their talent to come together, you know,
make it work as quickly as possible. Those lessons learned are, in my opinion, as hard if not harder than the right combination of algorithms. It's cool to hear your take on that, because I think you're going to be learning a lot coming up here, week by week, finding different use cases and different customer needs. So cool. I know that we went a little bit into overtime, but Ram, I think it was completely worth it. We covered some fun stuff. I think we've got a great mental picture of the before and after. I sincerely appreciate you sharing your insights with us today. Thanks so much.
Thanks, Dan. It was great to join the conversation, and I appreciate all the good stuff you're doing in trying to create awareness about use cases in AI. We'd love to be back in the future.
You bet, Ram.
So that's all for this episode. Thanks to Ram for sharing his insights here. Thanks to you for listening all the way; I certainly appreciate having you here as a listener. If you enjoy what you're hearing, then be sure to follow us on social. You can find Emerj at @emerj on Twitter, and Emerj Artificial Intelligence Research on LinkedIn or on Facebook. We've had a growing social audience and more engagement than ever over the course of 2020, as we've become more active on social channels, sharing our frameworks and best practices, sharing our latest interviews, and also sharing all of our latest articles as soon as they come out. So if you don't want to miss a thing, and you like use cases and you like best-practice information, then follow us on social at Emerj. Otherwise, stay tuned next month for another episode here on the AI in Financial Services podcast.