Internet Archive Book Scanning with Davide Semenzin
Welcome to the show.
Thank you.
You're at the Internet Archive. What does the Internet Archive do?
That's a great question. The Internet Archive is the world's largest digital library, and whereas most people may know of us because of the Wayback Machine, which is this really rather neat tool that allows you to go back in time and see what web pages used to look like, we really are a fully fledged online library, in that we hold different media types. We hold texts, television, audio, images, movies, all sorts of things. So you can think of the Internet Archive as this huge repository of the Internet.
When did you start working there?
I started here in 2016, so I've been here for four years.
And what do you work on there today?
Well, I work on the books. That's mostly what I have always worked on. I'm on the "bits in" side of this. Usually, when we think about our media types, we think in terms of bits in and bits out: how we procure them and how we distribute them. My specialty is working on the book bits-in side. In order to build up our collection of almost four million books, we have to scan them, and my job is to keep running the whole pipeline that allows us to do that. Over the last four years, my team and I built it, and now we have achieved our objective of being able to digitize a million books per year, which we're doing, and it's been a pretty interesting challenge so far.
So you work on book digitization, and I want to talk about that. But first, let's talk more about the Internet Archive at a high level. Tell me about what is being stored across the Internet Archive, who pays for it, and how people use it. Just share a little bit more about the Internet Archive.
Yeah, that's a great question. I'm going to start from who pays for it, because I think that is at the root of that question about the Internet Archive.
If you think about it as a repository, it's essentially a bunch of spinning hard drives connected to the Internet. Somebody's going to pay for both the data center and the connection, and the hard drives and the electricity and all of that. Largely, you can think of our revenues as coming in three different ways. We're a nonprofit and we don't really run for-profit businesses. We don't benefit in any way from the data that comes onto our servers. We do benefit from your donations, and so by and large we are a community-funded effort. If you type slash donate, we actually just added integration with Apple Pay, so if people want to help us, that would be great. So we receive a fair amount of the money that we need to run from patrons, people who support us. On the side, we do have some small businesses. We have Archive-It, our arm where we essentially contract out our Wayback Machine capabilities, and we maintain a very large number of curated website collections. In fact, I think we have about seven hundred organizations that are partnering with us to create these collections, and a few tens of billions of URLs that have been collected for our partners. They pay us to do the service and we do it for them, and the same is true for book digitization. As we have built up the large infrastructure that is required to do this kind of task, we have, to an extent, the ability to contract out to third parties, and so we do get some revenue streams that way. It's not anything particularly substantial in terms of our ability to sustain ourselves, but every little bit helps. And then, obviously, throughout the twenty to twenty-five years of our existence, our founder Brewster Kahle has chipped in here and there, a significant amount, I guess, over the years, to keep us running.
So we have donations, we have a little bit of our nonprofit business, and then we have Brewster, who is there. So this is in terms of who pays for it, but the question would be, I guess, who benefits from it, right? And that's a very, very large segment of the Internet. We're not the biggest website on the Internet. I think we're ranking about two hundred and something in the Alexa rank. But since we've been around for a long time, the users that love us really love us. Every day I am in contact with people who tell me their story about how they use the Internet Archive for their specific need, and I'm always amazed by the depth and breadth of the use cases users bring to us. It spans from teachers to researchers, journalists to lawyers. There is a very, very large diversity, also in terms of the countries and the backgrounds that users come from. So it's kind of hard to paint them with the same brush, but in general I want to say they are people who have some degree of love for knowledge. You may know that our motto, our slogan, our mission is universal access to all knowledge, and so I guess people who have an interest in that eventually land on our website.
Okay. Well, let's talk about book digitization as a particular project under the auspices of the Internet Archive. What is book digitization?
So, book digitization is the effort of transforming physical books into digital artifacts. That's the definition, and it can take many forms. If you have a scanner in your home and you're scanning a document, that's digitization. If you take pictures of a book, that's book digitization. So the definition needs to be applied to the use case at hand. There have been other efforts at large-scale book digitization. Famously, Google had one, but theirs was
different from ours, for instance, in that they did destructive digitization: they would pull the spines from books and turn the process into a sort of sheet-feeding kind of problem. We do non-destructive book digitization, and I think the non-destructive bit is just as important in the definition as the fact that we're digitizing these books, so that we can keep them, so that we don't destroy them. So it's the process by which we turn books into bits and then return the books to wherever they came from or wherever they need to go.
So why would I want to digitize a book, and how many books get digitized each day? Just tell me more about the volume that's going through this.
I'm very happy to answer this. There are multiple reasons why you would want to digitize a book. The first thing that comes to mind is obviously preservation. A famous Brewsterism is that accessibility drives preservation. If you don't have access to something, it's almost like it doesn't exist, especially in this age of information, when we have immediate access to all of these resources. If you actually think about it, if you have to go to the library to procure a certain book, chances are you won't, and if the record of that book doesn't exist, you may never get to it. Where this is a problem is for the huge number of books that were printed in the twentieth century for which there is really no digital equivalent. Books that are published nowadays obviously have an ebook artifact. That stuff is not going to get lost. That stuff is searchable and it's reachable. But we have
tens of millions of books that are unaccounted for and, as time progresses, getting lost. If somebody doesn't save them, they will be lost forever, and that would be a pity and a huge loss of human effort. But first of all, I think it's important to scope the problem. I think the estimate is that there are about one hundred million books out there, give or take, unique books. We're probably not going to scan all one hundred million of them, first of all because you wouldn't be able to source them, and that's by far the hardest thing. So we tried to scope down the problem and figure out: okay, how can we do this in a way that is useful for people? First of all, we had to come up with a list of books that we wanted to get. We knew there were books that are important and that we needed to scan first, so that we would get them to people and this would be immediately useful. A good place for us to start was Wikipedia: we collected a long list of ISBNs that were commonly cited in Wikipedia. The compiled list came out to a few hundred thousand books, and so whenever we come upon one of those in the sourcing process, we make sure that we get it. We can talk about the sourcing process a little bit later, but in general we do have a bit of a concept of priority, or at least we did; this was for the first million, million and a half. And then the problem was that we started running out of books. You would be surprised how hard it is to source books by the half million. And if you do it at a smaller scale, it doesn't really make sense for us in terms of maintaining our economies of scale. The whole system works only if you scan at huge volume, and by huge volume we're talking about a million books a year, which is about three thousand books a day. Some days we'll do thirty-five hundred, some days we'll do twenty-five hundred. On a seven-day week, it averages out to about
between twenty and twenty-five thousand books. Every book is about three hundred pages, so that comes out pretty neatly to about a million pages per day, five to seven million pages per week. That's not a huge amount of data in total. Last time I checked, it was between ten and fifteen terabytes of data a week. So we're not talking about huge amounts, but it's not a small amount either, and we can talk about the challenges of piping data over the Internet in a reliable way later. But it's a significant volume, and this operation is running twenty-four seven. In terms of why even do this: I covered the first part, which is that obviously people want to get to the books. There is a second benefit in having digitized books, which is that it's a wholly new format. It allows you to interact with the body of knowledge in a way that you never could before. A physical book artifact has some very desirable properties: for instance, it has very low random access time, it doesn't depend on a battery, and it's very, very hard to censor. These are not properties of a digital artifact, but the digital artifact is searchable. In fact, we have a pretty amazing text search engine where you can instantly search all forty million text items that we have. So that's a million books plus all of the patents, papers, all sorts of stuff, and you can search that instantly. That was just not possible with the previous format. So I don't think this is a dualism in any way. I think books in their digital format and books in their physical format will continue to coexist. They just help each other out. In fact, if we are able to digitize them in the first place, it is because of the property of physical artifacts that they don't just disappear. If we find one, we can scan it.
Well, that's a great summary of what you do, and I can tell how excited you are about it.
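As a rough check on the volume figures quoted above, the arithmetic can be sketched in a few lines of Python. The only inputs are the numbers mentioned in the conversation (a million books a year, about three hundred pages per book, ten to fifteen terabytes a week); the even spread over 365 days and the decimal terabyte are assumptions made here for illustration, and the names are hypothetical.

```python
# Figures quoted in the interview.
BOOKS_PER_YEAR = 1_000_000
PAGES_PER_BOOK = 300

# Assumption: scanning is spread evenly over every day of the year.
DAYS_PER_YEAR = 365

books_per_day = BOOKS_PER_YEAR / DAYS_PER_YEAR
books_per_week = books_per_day * 7
pages_per_day = books_per_day * PAGES_PER_BOOK
pages_per_week = pages_per_day * 7


def megabytes_per_page(terabytes_per_week: float) -> float:
    """Average data per page, assuming 1 TB = 1,000,000 MB."""
    return terabytes_per_week * 1_000_000 / pages_per_week


print(f"books per day:  {books_per_day:,.0f}")
print(f"books per week: {books_per_week:,.0f}")
print(f"pages per day:  {pages_per_day:,.0f}")
print(f"pages per week: {pages_per_week:,.0f}")
print(f"MB per page at 10 TB/week: {megabytes_per_page(10):.2f}")
print(f"MB per page at 15 TB/week: {megabytes_per_page(15):.2f}")
```

Under these assumptions the target works out to roughly 2,700 books and a bit over 800,000 pages per day, which is in line with the "about three thousand books a day" and "about a million pages per day" quoted in conversation, and to something like two megabytes of data per page.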
Let's talk a little bit more about the high level, and then we'll get into the engineering. Can you describe the steps of digitization in more detail? If I have a book, how do I digitize it?
Yeah. So, the book digitization pipeline is pretty simple, and in a way, if you're an engineer, I think it's kind of what you would expect. There is a physical sorting step where your book is ingested into the system. It's given an ID and it's placed in a container, so we know that the book exists, so to speak. The second step is that it gets to a scanner. The scanner picks it up, loads up the necessary data in the machine, which is the book's metadata (we're going to have to talk about that, I guess; it's a pretty interesting facet of it all), and then proceeds to actually scan it, which means they turn the pages, page by page, and they take pictures of the pages. Once this process is done, they click upload and the book vanishes into the ether. At this point we have a fork: the digital artifact goes onto our servers, and the physical artifact either goes back to the person who gave it to us in the first place or goes into our warehouse. This largely depends on what kind of book it is. Obviously, there is a larger conversation to be had about copyright, and about what books it is okay to scan and under what guise. But suppose we are just scanning your book, Jeff: you just wrote the book and you want to have it digitized, there is no claim on it, and you just want it back at the end. After we're done scanning it, we hand it back to you with a slip inside which tells you the Internet Archive identifier, which is just the name of the item on the Internet Archive. Everything is an item. You just go and type slash details, slash your identifier, and a few hours later you will find your book. While you wait, the second part of the pipeline is going to kick off.
So that's the digital, server-side stuff, and it's divided into essentially three phases. The first phase is a preprocessing stage, where we take a look at the images that came raw from the camera: we crop them, we deskew them, and we make sure that everything is ready to go. Then there is a second phase of manual review. Currently all books that we upload have to be checked by a human for correctness, so this is a step where a reviewer goes through the images and ensures that everything is fine. When this is done, they kick off the third stage of the pipeline, which is the real processing stage, where we take all of these files and compile them in such a way that they are suitable for consumption by our web front end, which we call BookReader. From there we derive what we call derivative formats, such as PDF, ABBYY, EPUB, and even a text file. So OCR all happens at this stage. This is kind of the bird's-eye view of the book digitization pipeline.
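The stages described above can be pictured as a small state machine. What follows is only an illustrative sketch: the stage names, the `Book` class, and the `advance` method are hypothetical and are not taken from the Internet Archive's actual software. It simply encodes the order given in the conversation, including the rule that a human review has to pass before the final processing stage runs.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Stage(Enum):
    # Hypothetical stage names, in the order described in the interview.
    INGESTED = auto()      # physical sorting: book gets an ID and a container
    SCANNED = auto()       # pages photographed one by one
    UPLOADED = auto()      # raw images sent to the servers
    PREPROCESSED = auto()  # images cropped and deskewed
    REVIEWED = auto()      # a human has checked the images
    DERIVED = auto()       # reader-ready files and derivative formats built


# Enum members iterate in definition order.
ORDER = list(Stage)


@dataclass
class Book:
    identifier: str
    stage: Stage = Stage.INGESTED
    history: list = field(default_factory=list)

    def advance(self, review_passed: bool = True) -> Stage:
        """Move to the next stage; manual review must pass to continue."""
        if self.stage is Stage.DERIVED:
            raise ValueError("book has already finished the pipeline")
        next_stage = ORDER[ORDER.index(self.stage) + 1]
        if next_stage is Stage.REVIEWED and not review_passed:
            raise ValueError("manual review failed; book needs attention")
        self.history.append(self.stage)
        self.stage = next_stage
        return self.stage


book = Book("example-identifier")
while book.stage is not Stage.DERIVED:
    book.advance()
print(book.stage.name, [s.name for s in book.history])
```

Modelling the review as a gate rather than as just another automatic step mirrors the point made in the interview that every uploaded book is currently checked by a person before the derivative formats are produced.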