Dr. Oren Etzioni, CEO of the Allen Institute for AI (or AI2), talked to Monte about CORD-19, AI2's initiative to aid researchers studying COVID-19. CORD-19 is an open research database that identifies the most pressing advances using comprehensive natural language processing. Learn more about it in Oren's WIRED article, or get the details in the academic preprint.
Read on for the transcript!
Hi, I’m Monte Zweben, CEO of Splice Machine. You’re listening to ML Minutes, where we solve big problems in little time. Every two weeks, we invite a different thought leader to talk about a problem they’re solving with Machine Learning, with an exciting twist: our guest has only one minute to answer each question. Let’s get started!
This episode, our guest is my friend Dr. Oren Etzioni. Oren is Chief Executive Officer at the Allen Institute for AI, or AI2 to most people. Oren has been a Professor at the University of Washington’s Computer Science department since 1991, and his work has helped to pioneer meta-search, online comparison shopping, machine reading, and Open Information Extraction. Oren has founded or co-founded several companies, including Farecast (acquired by Microsoft). He has written over 200 technical papers, and has been featured in The New York Times, Wired, and Nature. Outside of work, Oren plays way too much Bughouse online!
Thank you, Monte, really kind introduction. I feel like I don't want to say anything, though, ‘cause it's gonna be downhill from there.
Well, that's funny, because I'm gonna ask you to talk more about yourself. Can you tell me a little bit about your journey to how you got to where you are now?
Well, let's start in the beginning. Like a lot of people in high school, I got my hands on a TRS-80, a simple personal computer that some people refer to dismissively as the Trash-80. But I loved it. I started programming in BASIC, and it was just so much fun. I feel like I had a balanced palette. I played basketball, I was very interested in girls. But then there was the TRS-80 and programming. And then later, at the end of high school, I read the book Gödel, Escher, Bach, which connected me with the fundamental questions of intelligence. How do we build human level intelligence into a machine? And that combination is what got me going. Big questions, superb technology.
That is a great answer. And I have to admit, I had a TRS-80. I even programmed it in Assembler. And I read Gödel, Escher, Bach at that time, too. Okay, now, I'd like to take this to your research, I'd like you to talk a little bit about the work you're doing on COVID-19. And, of course, there is a great deal of research in the community trying to find a vaccine. But I'd love to hear what you're doing in this community. What's the problem that CORD-19 is trying to solve?
So we're a nonprofit research institute, and we developed a free search engine for scientific information called Semantic Scholar. One day, early in March, we got a call from the White House, from the CTO of the United States, saying you need to help us take all the research on COVID-19 and the coronavirus and so on, and put it together, make it available for researchers to build AI and information retrieval search systems on top of. And we said, "How much time do we have?" They said, "We need this yesterday." I'm really proud of our team, we had some relevant infrastructure, which is why they contacted us. Within five days, we had the first version of this, within 10 days it was out and available for the public. We had a corpus of research papers that was machine readable, and people ranging from Amazon to the Chan Zuckerberg foundation, from Korea and elsewhere, built search engines and question answering systems on top of that, to help biologists and virologists tackle COVID-19.
That's fantastic. So you had a very short period of time to build upon your original research of information extraction and search, and focus it in this area of research on COVID-19. What are some of the ways that the community used the research?
Well, first of all, we partnered with Kaggle out of Google, and they launched a competition to answer key questions about drugs, about vaccines, about how long the virus lives on various surfaces. And it became the most popular competition ever: there were more than 2 million downloads of our dataset. So we felt we're really in the thick of it. And a whole bunch of specific answers about masking, about convalescent plasma, about the key issues that affect us every day came out of that, and they've been published in medical journals and they've been informing policy groups ever since. Nowadays, CORD-19 is updated daily, has more than 200,000 papers in it, and people are continuing to work on it and use it to hopefully find a vaccine.
That's fantastic. I love the fact that you're able to keep this corpus very up-to-date with all of the new research that's publishing, and getting into the hands of the people who need it. So moving on from the importance of the research, let's look at how you did it. How are you actually solving this problem?
Well, if I had to give you a phrase, it would be machine learning. No surprise there, machine learning is the tide that's lifting all boats. So natural language processing, or NLP for short, which is what we do, is based on modern machine learning techniques. On top of that, we use what's called embedding representations, basically projecting the context of different words into a vector space that allows us to understand the meanings of different words. And then we build up from the meaning of words to the meaning of sentences, from the meaning of sentences to the meaning of documents, and from that to the ability to do things like answer questions, summarize documents quickly, and more importantly, be able to extract findings and medical results from, say, 200,000 research papers.
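Editor's note: the build-up Oren describes, from word meanings to sentence meanings to document meanings, can be illustrated with a toy sketch. This is not AI2's method: the three-dimensional vectors below are invented for illustration, the function names are hypothetical, and simple averaging stands in for the learned models a real system would use.

```python
import math

# Hypothetical 3-dimensional word vectors, invented for illustration only.
# Real embeddings are learned from text and have hundreds of dimensions.
WORD_VECTORS = {
    "virus": [0.9, 0.1, 0.0],
    "vaccine": [0.8, 0.3, 0.1],
    "surface": [0.2, 0.9, 0.1],
    "mask": [0.3, 0.8, 0.2],
}
DIMENSIONS = 3


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors, component by component."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def embed_sentence(sentence):
    """Sentence vector = average of the vectors of its known words."""
    known = [WORD_VECTORS[w] for w in sentence.lower().split() if w in WORD_VECTORS]
    return mean_vector(known) if known else [0.0] * DIMENSIONS


def embed_document(sentences):
    """Document vector = average of its sentence vectors."""
    return mean_vector([embed_sentence(s) for s in sentences])


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    query = embed_sentence("virus")
    doc_a = embed_document(["the virus vaccine"])
    doc_b = embed_document(["a surface mask"])
    # The document about the virus and vaccine should score closer to the query.
    print(cosine(query, doc_a) > cosine(query, doc_b))
```

Once documents and questions live in the same vector space, "find the papers most relevant to this question" becomes a nearest-neighbor comparison, which is the basic intuition behind search over a corpus like CORD-19.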
Yes, and I think that your research that I read a little bit about, tried to take some of the original NLP research, using machine learning to extract meaning from just sentences. You were able with your team to really extract meanings of documents. And that's what led to a, I think, a unique capability. You talked about word embeddings; these are just being able to understand the likelihood of one word being close to another word, or incorporated in a sentence or in a document, and being able to understand meaning through those embeddings. With respect to the research, though, what was one really interesting or significant challenge that you faced along the way?
Well, I think that modern natural language processing research has worked, as you said, at the sentence level, but if you think of a document, like a scientific paper, or even a Shakespearean play, or a memo at a corporation, reading it one sentence at a time, it's like trying to see a movie through a keyhole, okay, it's a very, very limited view. Often, you have to put pieces together across sentences, across different sections, to make sense of the entire document. So we're really scaling up natural language processing, from the sentence level, to the document level. And we look at things like hierarchy, how the document breaks into sections, what's salient? What's the most important sentence or set of sentences in this document? And then we go beyond the document level. So if I have 20,000 papers, written recently, about COVID-19’s persistence on surfaces, and I've asked the question, How do I put the pieces together across thousands of documents that a person might not even have time to read at all?
That's fantastic. So what you're saying here is that, instead of just looking at these micro-features, of sentences at the word level, you are able to take semantic constructs that describe documents and concepts, even beyond documents about the domain and create a representational structure that the machine learning can leverage to learn some of these concepts. That's really interesting. Were there any specific difficulties in trying to incorporate higher level concepts in some of the deep learning methods that you used?
Absolutely, let me get a little bit more technical here in a minute or less. You talked about how all these models, what they do is try to figure out how likely the next word is. If I say, "once upon a," you would say "time," right? So we constantly have expectations based on our experience of what the next word is going to be, and the word after that. But the window of context that's used to compute that is very short in standard models. We had to scale that by two orders of magnitude, which required technical innovation way up and down the stack, to go from sentences to contexts that really have the whole document inside them. That's an example of a major challenge that's familiar to everybody: it's scaling algorithms.
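Editor's note: a small sketch of next-word prediction, and of why the size of the context window matters. It uses a simple count-based n-gram model rather than the neural models discussed in the interview; the tiny corpus and the function names are assumptions made up for this example.

```python
from collections import Counter, defaultdict


def train(tokens, context_size):
    """Count which word follows each run of `context_size` preceding words."""
    counts = defaultdict(Counter)
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        counts[context][tokens[i]] += 1
    return counts


def predict_next(counts, context_words, context_size):
    """Return the most frequent next word for the last `context_size` words,
    or None if that context never appeared in training."""
    context = tuple(context_words[-context_size:])
    options = counts.get(context)
    if not options:
        return None
    return options.most_common(1)[0][0]


if __name__ == "__main__":
    # Invented toy corpus, for illustration only.
    corpus = "once upon a time there was a virus . a virus is small".split()
    prompt = ["once", "upon", "a"]

    short = train(corpus, context_size=1)
    longer = train(corpus, context_size=3)

    # With one word of context, the model only sees "a" and picks its most
    # common follower in the corpus.
    print(predict_next(short, prompt, 1))
    # With three words of context, it sees the whole phrase "once upon a".
    print(predict_next(longer, prompt, 3))
```

With a one-word window the model cannot tell "once upon a" apart from any other "a", so it guesses the most common follower in the corpus. A wider window resolves the ambiguity, and growing that window from a sentence to a whole document is the scaling problem Oren describes.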
Thanks, Oren. That really explains some of the complexity in the research. I've seen some of the work on natural language that uses LSTMs to string together words in a sentence. And you're right, it's usually just a few words apart. And it sounds really interesting how you scaled it to look at a much larger semantic context. What's next in the research? What are you going to tackle next in this really interesting area, NLP and information extraction?
So a huge problem is information overload, right? We're all inundated with tweets, Facebook posts, email messages, Slack messages, right? So an academic researcher, a doctor, a neurologist, somebody that we count on to help fight COVID-19, has all that stuff. And they have all these papers that they have to read and all these reports of clinical findings, and so on. So we really want to go from sentences to documents, to sets of documents, to really tools that help these scientists do their research, do their literature search, find support for examples, understand where the field is going. And we're using, for example, information visualization techniques to find graphs where the different genes and the different proteins and the different viruses, how they all relate to each other. And so that, for example, could be a substitute for reading maybe 20 papers, just look at one graph, you know, they say, a picture's worth 1,000 words. Well, a graph could be worth 10,000 words.
Well, that's interesting. So I guess what's going to happen is, these tools are going to help people do very specific jobs, perhaps summarizing lots of research or lots of research documents into common areas all the way to being able to reformulate the information in these documents into other visualizations and other ways that people can consume them.