How do you know if something is sustainably sourced? It’s hard to tell down here on the ground, but cameras way up in space are capturing movement from all around the world to help answer this question. James Crawford, CEO of Orbital Insight, came in to explain how his company uses computer vision to make observations about the world we live in today, and ways we can use these observations to optimize sustainability in business supply chains. We also discuss the potential for unsupervised learning, including simulated imagery, to improve accuracy in computer vision models.
Read on for the transcript!
Hey, it’s Monte. You have two more days to enter our Apple Watch giveaway! Raffle closes December 25th at midnight. Enter at mlminutes.com/giveaway.
Hi, I’m Monte Zweben, CEO of Splice Machine. You’re listening to ML Minutes, where cutting-edge thought leaders discuss Machine Learning in one minute or less.
This episode, our guest is Jimi Crawford, CEO of Orbital Insight, a startup that uses satellite imagery to understand national and global socio-economic trends. Welcome, Jimi.
Jimi, you and I have run in similar circles: you were at NASA Ames, which was a great experience for me in my career; you worked on supply chain initiatives at i2 Technologies; you built data platforms at Composite. And you even worked on mining the moon! But in your own words, can you tell us about the journey that you're on right now and how you got there?
Certainly. So about six years ago, I was looking at the state of the space industry, and one of the things I noticed was that there were a tremendous number of startups working on launching rockets and launching satellites to look at the Earth. I'm thinking of Skybox, Planet Labs, EarthCast, Black Sky, and they all had plans to build massive constellations. But nobody was thinking about what to do with the imagery when it comes down to Earth. If you do some back-of-the-envelope math, you realize that if you had imagery of the land surface of the Earth and you wanted to look at it all every day, and see all the cars and all the trucks and all the ships and all the planes, all the roads and all the buildings, you would need 8 million people doing nothing all day, every day, but staring at satellite imagery. If you had those 8 million people, and you had them well organized, you could figure out the world's economy, right? What's going on, where the economy is going up or down, who's trading with whom, who's farming well, and who's mining right. But that's a lot of people; nobody's ever going to dedicate 8 million people to looking at satellite imagery all day, every day. So the answer, it seemed to me, was to use AI for this, especially with the tremendous advances we're seeing in computer vision: set up the computer vision so that it can process the imagery and count the cars and the trucks and the ships and the planes.
That's a fantastic insight, and I'm really excited about your venture. What I'd like to know is: how does your company build models from this satellite imagery and other complementary data?
Sure, that's actually relatively straightforward. If we have a type of object we want to find, let's say we want to identify trucks, we simply start with a tremendous amount of imagery, and we do labeling campaigns. It's really very classic supervised labeling from a CV point of view. Now, it's a little bit different from other kinds of CV because we can do rotations, and we can do other kinds of transformations that don't make sense on terrestrial imagery, because, you know, you never see an upside-down truck on a road, but from a satellite point of view you can see trucks in any orientation. So we do different kinds of perturbations on the imagery. But other than that, it's just a traditional labeling campaign: we build large labeled datasets, we try to get different lighting conditions, different parts of the world, different kinds of roads, different-size trucks, and put it all together into a large training and test set. And then beyond that, it's a pretty straightforward, you know, training of a convolutional neural network.
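The rotation and flip perturbations Jimi describes can be sketched in a few lines of plain Python. This is a toy illustration, not Orbital Insight's actual pipeline; the `augment_chip` helper and the list-of-lists "chip" representation are invented here for clarity:

```python
def rot90(grid):
    """Rotate a square list-of-lists grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flip_lr(grid):
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in grid]

def augment_chip(chip):
    """Return the 8 rotation/flip variants of a labeled image chip.

    Overhead imagery has no canonical 'up', so every 90-degree
    rotation and mirror is an equally valid training sample,
    unlike ground-level photos of trucks.
    """
    variants = []
    g = chip
    for _ in range(4):
        variants.append(g)          # current rotation
        variants.append(flip_lr(g)) # its mirror image
        g = rot90(g)
    return variants

# A tiny 2x2 single-band "chip" yields 8 distinct augmented variants.
chip = [[1, 2],
        [3, 4]]
print(len(augment_chip(chip)))  # 8
```

In a real pipeline the same label is attached to every variant, turning one hand-labeled chip into eight training samples for free.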
Fantastic. And so what the supervised learning models are doing is taking those labels and learning exactly which of the images, and what features of those images, predict those labels. One of my favorite applications that I've read about for your company, Orbital Insight, is increasing the transparency of business supply chains. Could you tell us a little bit about what your inspiration was for this? And was there a specific problem you were trying to solve?
Sure. There's a bunch of great problems in supply chain, but one of the most profound ones, I think, is around sustainability. And there are actually several problems that you have to solve. One of them is: what does deforestation mean? Because often supply chain sustainability is about whether you are causing deforestation. Now, if you are managing a plantation and every 10 or 20 years you cut down the trees, that's not actually deforestation. It may be ugly, it may be unfortunate, but it's not deforestation. Deforestation is when you have a virgin rainforest that's been there for a thousand years and you cut it down. So we actually trained the deep learning algorithms, using the same kind of labeling approach, to learn the difference between a managed forest and a virgin rainforest, so that we could then look at incidences of deforestation, go back a couple of years, and see whether the thing that got cut down was a virgin rainforest or a managed forest. So that's a part of sustainability that's really nice.
Excellent. So why is it important for you to be able to locate deforestation from a supply chain perspective? What do you do with that information?
So let's say that you are a big company like Unilever or Bunge, and you're buying a tremendous amount of palm oil; it goes into the things you make. And you want to be able to put a sticker on your goods on the store shelf that says "Sustainably farmed." What that really means is that the process of building that product didn't cause serious environmental harm. And one of the major kinds of serious environmental harm you worry about, for chocolate, for palm oil, and for other products, is deforestation. So you want to look at the places where that product is farmed and make sure that those places weren't virgin rainforests anytime in the last X number of years, where X is defined by your definition of sustainability.
Thank you so much for that. Let me see if I understand, translating that into the machine learning models: does that mean that for any one of these farms, you're creating models that predict the likelihood that the raw materials were sourced from a good location, versus a location that may have committed some sort of deforestation?
Yeah. And then the other part of this, of course, is knowing where the goods came from. So we know where the Unilever factories are, but then we have to figure out where the trucks coming into those factories are coming from, and trace them back to the farms. And we actually use anonymized cell phone data for that. So we get very large amounts of anonymized cell phone pings. We don't know whose phone it is, but we can tell it's the same phone pinging here and pinging there within the course of the day. So we can say these trucks all came from this plant, and they all went to this farm, so this is one of the farms that's supplying that mill. And by doing that thousands and thousands of times, we actually figure out the empirical structure of the supply chain. Right? So now you know where the stuff is coming from, and you look at those places, you go back in time, and you look to see whether or not they were deforested sometime in the last few years. And that gives you a picture of the sustainability of that plant.
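As a rough illustration of the idea, not Orbital Insight's actual method: given anonymized (device, day, location) pings, you can count how often the same device appears at a candidate farm and a known mill on the same day, and treat repeated co-occurrence as evidence of a supply link. All names and data below are made up:

```python
from collections import Counter

def infer_supply_edges(pings, mills):
    """Count devices that pinged both a mill and another location
    on the same day; repeated co-occurrence suggests a supply link."""
    # Group the locations each anonymous device visited per day.
    visits = {}
    for device, day, location in pings:
        visits.setdefault((device, day), set()).add(location)
    edges = Counter()
    for locations in visits.values():
        for mill in locations & mills:
            for farm in locations - mills:
                edges[(farm, mill)] += 1
    return edges

# Invented pings: (anonymous_device_id, day, location).
pings = [
    ("dev1", 1, "farm_A"), ("dev1", 1, "mill_X"),
    ("dev2", 1, "farm_B"), ("dev2", 1, "mill_X"),
    ("dev3", 2, "farm_A"), ("dev3", 2, "mill_X"),
]
edges = infer_supply_edges(pings, mills={"mill_X"})
print(edges[("farm_A", "mill_X")])  # 2 devices link farm_A to mill_X
```

Run over millions of pings, the edge counts sketch out the empirical farm-to-mill structure of the supply chain without ever identifying a driver.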
I see. So you've traced the supply chain using machine learning models, knowing both the sources and the destinations of those routes, to tackle this very important problem for our planet and for business. What's one specific challenge you faced along the way?
In terms of sustainability? Yeah, so one of the interesting problems that we ran into, that I mentioned earlier, is that really the hardest thing for the customer is tracking all the way back to the farms, because there are literally millions of farms. Many of these palm oil plantations are mom-and-pop operations, and so getting the details of that supply chain is quite hard. Monte, when you and I used to work on supply chain optimization, those supply chains were, compared to this, relatively small, short, and well understood, right? These, in Indonesia, are hugely broad. So getting enough data, enough cell phone pings, to really elucidate that supply chain has been a real challenge. We've been looking at multiple providers for that data, and we've been looking at having the drivers working in the supply chain for our customer install custom apps that just ping us, under a contract where we only use those pings to establish supply chain structure. Right? So just getting a dataset together that's rich enough to really understand that very complicated supply chain has been a really interesting challenge.
Fantastic. Maybe one more question: is there a specific challenge that you've had in building a business on machine learning? Not just one particular use case, like the deforestation and sustainability use case we're talking about, but you're taking this very fast-moving science that has really only just emerged into the business world, and building a whole venture on it. What's been a big challenge there?
I think the biggest challenge has been balancing the need for R&D against the need for business certainty. So you go into a customer, and the customer says, you know, we want to track trucks in, let's say, Cairo, because we think it would be a good proxy for Egyptian GDP. And we've got a truck detection algorithm, but the precision and recall is only 60%, and we know, or we think, we need 90% recall in order to get a good correlation with the GDP. So what kind of timeline, and what kind of precision and recall, are we willing to commit to that customer, given that we're going to have to perhaps break new ground in terms of the structure of convolutional neural networks for satellite imagery of trucks? And how aggressive can we reasonably be in those commitments, given that, as I say, we are fundamentally doing a kind of R&D on imagery?
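For readers unfamiliar with the two metrics Jimi mentions: precision is the fraction of detections that are real objects, and recall is the fraction of real objects that get detected. A tiny sketch with invented counts (not figures from Orbital Insight):

```python
def precision_recall(tp, fp, fn):
    """Precision: what share of detections are real trucks.
    Recall: what share of real trucks were detected.
    tp/fp/fn = true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# Invented numbers: 60 correct detections, 40 false alarms, 40 misses.
p, r = precision_recall(tp=60, fp=40, fn=40)
print(p, r)  # 0.6 0.6

# Hitting the 90% target means both missing fewer trucks and
# raising fewer false alarms, e.g. 90 hits, 10 false alarms, 10 misses:
p, r = precision_recall(tp=90, fp=10, fn=10)
print(p, r)  # 0.9 0.9
```

The tension in the anecdote is that closing a 60%-to-90% gap may require genuine research, which is hard to put on a contractual timeline.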
Yeah, that's fascinating, because every project you take on is independent of previous projects; it has its own technical challenges, and you can't know ahead of time whether you can nail it with a high degree of precision and recall. So you must always be balancing that in your sales cycles. Well, now looking forward: what is the biggest innovation you see coming next? What are you most excited about?
So one of the things that we're interested in is how much can be done with synthetic data. The labeling problem, as we mentioned earlier, tends to be the bottleneck. It tends to be the case that if you need higher precision and recall, the quickest, not the easiest and cheapest, but definitely the most certain, way to increase that precision and recall is to just label, you know, ten times as much imagery. But that does get expensive. So the idea is to use basically computer-game-derived image generation, in certain cases maybe going directly from the CAD drawings for the object, or just from an idea of what the object looks like, and use that to generate imagery. Once you've got that working, you can generate arbitrary amounts of labeled imagery. But it comes at a price, because now that is synthetic imagery, not real satellite imagery. And so there's a huge danger that the convolutional neural network will take advantage of artifacts in that synthetic imagery that are not going to be present in the real imagery, so you will overtrain. But if we can solve that problem, it's a fantastic way to get very large training sets.
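One common way to hedge against that overtraining risk, sketched here with hypothetical helpers rather than anything Orbital Insight has described, is to cap the synthetic share of the training mix so the network must still fit the scarce real imagery:

```python
import random

def mix_training_set(real, synthetic, synth_fraction=0.5, seed=0):
    """Blend real and synthetic labeled samples, capping the synthetic
    share so the model cannot fit solely on rendering artifacts."""
    rng = random.Random(seed)
    # Number of synthetic samples so they form synth_fraction of the mix.
    n_synth = int(len(real) * synth_fraction / (1.0 - synth_fraction))
    mixed = real + rng.sample(synthetic, min(n_synth, len(synthetic)))
    rng.shuffle(mixed)
    return mixed

real = [("real_chip", "truck")] * 100         # scarce hand-labeled chips
synthetic = [("synth_chip", "truck")] * 1000  # cheap rendered chips
train = mix_training_set(real, synthetic, synth_fraction=0.5)
print(len(train))  # 200: half real, half synthetic
```

The right ratio is an empirical question; the point is simply that an arbitrarily large pool of rendered imagery should supplement, not replace, the real labeled set.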
Fantastic. Yeah, there are a number of people who have been talking about generating synthesized examples to learn from, or using simulations to generate examples. And it seems like the worry is always bias: that your simulation or your synthesis might actually be introducing a bias that makes the model not as good as it could be from the actual examples. Well, now I'd like to ask you a few questions outside of Orbital Insight, looking at our field in general. If you think about all the work being done in our field of machine learning, what's an innovation that you're really excited about right now?
I think the whole area of unsupervised learning is probably the most exciting, building on the idea of synthetic imagery but going way beyond it. Right? As long as you are working from labeled training sets, you are necessarily approximating human performance. You're always at the mercy of the quality of your labeled data, and you're always limited by the number of labelers you can afford to get. But the more we can push into unsupervised areas... you know, the work that AlphaGo did, where the Go game actually played itself like a gazillion times, and they eventually achieved superhuman performance: I find that incredibly inspirational for the whole field.
Fantastic. So unsupervised learning does not require labels; typically these are clustering and categorization types of techniques. How will you use unsupervised learning if indeed it becomes very powerful?
Well, I think there are a few different areas. For instance, I actually would view the synthetic imagery generation as a kind of unsupervised learning: even though it's technically a labeled image, it doesn't require that people go and write that label. Right? So it can be run at arbitrary scale, which is the real advantage of unsupervised learning. So we will use it in general as a way to get higher and higher accuracy. I think another area that's really interesting is that there's some anecdotal evidence it may be possible to detect things like cars and trucks and roads and buildings in much lower-resolution imagery, where humans can't see them. And the problem is, there is no way to label that, right? Because humans can't see it. So if we could find a way to use synthetic data generation, or some sort of co-collects, or some other unsupervised techniques, we may actually be able, over time, to do some sort of super-resolution-type solution for lower-resolution imagery, which might let us really open up a lot of lower-resolution imagery to this problem of understanding the Earth.
Oh, fantastic. I'm really intrigued to see where you go with this: being able to actually generate examples, and then to find finer-grained models than you could have with actual labeling techniques. All right, fantastic. Well, thank you so much. It's been a pleasure, Jimi, to chat and learn more about how you're using machine learning in various ways, with satellite imagery fused with all kinds of other data. You're doing great things for the planet by helping find sustainable supply chains, and we really appreciate you sharing some of your thoughts today.
Thanks a lot. It's been a lot of fun.
If you want to hear about the specific tools Orbital Insight uses for computer vision, check out our bonus minutes. They're linked in the show notes below and on our website, mlminutes.com. We'll be talking about self-driving cars for the first episode of 2021 on Wednesday, January 6. To stay up to date on our upcoming guests and giveaways, you can follow our Twitter and Instagram @MLMinutes. ML Minutes is produced and edited by Morgan Sweeney. I’m your host, Monte Zweben, and this was an ML Minute.