
Michael Pyrcz - Subsurface Data Analytics and Machine Learning

AAPG Distinguished Lecture Series, 2019-20 Season


A Distinguished Lecture talk given by Michael Pyrcz during 2019-20 AAPG DL Season.

Full Transcript

Howdy, everyone. It's my pleasure to be able to talk to you today about subsurface data analytics and machine learning. Now, I get to do this as part of the AAPG/AAPG Foundation 2020 Distinguished Lecture series. And it's my pleasure to be able to participate in this, in appreciation to AAPG/AAPG Foundation for supporting this. I'm Michael Pyrcz. I'm an associate professor at the University of Texas at Austin.

Now, what's my motivation? What are my goals in giving this talk? Well, it's not just fame and fortune. I am actually interested in how we build new digital competencies for geoscientists. I want to help people be ready for the digital transformation.

I want to also demystify data analytics and machine learning. In other words, what I want to do is provide you with an anti-baffling defense, because I see a lot of people out there, in our industry specifically, getting baffled by the new technologies. I want to share the benefits and limitations of this technology with you, so you can be an informed consumer, so you can understand what it can do and what it cannot do.

And I also want to provide some useful ideas and concepts. In other words, I want this talk to be a call to action, to invite you to try something out, to try some data analytics and machine learning. And you'll see, even later in the lecture, I'll go through some examples that I've provided to you on GitHub, so you can actually follow along. So this is going to be highly interactive.

How are we going to accomplish this? Well, this is what I'll cover. I'll go through and introduce myself. You'll see, it's a little bit of shameless bragging. I promise, I'll keep that to a limit.

But it does kind of set up my perspective and kind of get us started. Then I'll talk about the prerequisites. I'll treat you like you're in one of my courses. And we'll go through the fundamental concepts, terminology, so we can communicate and move forward from there.

Then I'll talk about energy data analytics and machine learning. And I'll make statements around how we're unique, how we're different, and how we should see and use this technology. Then I'll provide a whole set of examples in data analytics and machine learning, and even some of them, you'll be able to work on and follow along with me.

Then I'll go into a bit of philosophy. I'll get on my soapbox. I'll talk about data analytics and machine learning best practice. I'll end up with a little bit of euphoria-- a little bit excited about all of the great opportunities in data analytics and machine learning, by showing you more advanced research, mainly from my research group at the University of Texas at Austin.

Let me start with a couple of slides about myself. First and foremost, let's talk about the most important topic, that's my name, Pyrcz. And it's pronounced just perch, just like the fish. If you say it like that, you're saying it just as well as my family that's been in Canada for about 100 years. I've run into Ukrainians who tell me, we don't actually pronounce it properly. Apparently, we lost that along with the language.

Now, the other thing is, I do have practical experience. I am a professor right now, but it's only been for a short time. I have grown up within the industry, conducting projects in this topic, data analytics, geostatistics, statistical methodologies. I have worked on a wide range of consulting and teaching, and even at one point, leading an R and D-- research and development team, developing and deploying technologies in this area. So I know something about using this technology in our industry.

Another thing about myself is availability. I have an open door policy. Now, my wife has cautioned me about this, because she does notice I come home many days at 8 o'clock at night from campus, because I'm always working with students, always engaged. And it does force me to get some of my work done at night and on the weekends.

But I do think it's great to have an open door policy. You can drop by my office, drop me a line. Recently, actually just a month or so ago, a vice president from Noble Energy called me up and said, you know, Michael, you're one of my favorite professors. You actually pick up your phone. You're available here. And try it, you'll find, I actually pick up my phone. I'm happy to help out.

Background. I left the industry just a couple of years ago, really motivated by the idea of giving back, being involved in education. I love teaching. I love sharing my knowledge. And I thought over those years, I learned something important, and I could share with the next generation.

I'm working quite a bit to help basically change and modernize some of the curriculum that we're teaching at undergrad and graduate levels at the University of Texas at Austin. I'm on a bit of a mission here. But at the same time, I still teach and consult and work in industry as much as I can, I think 20 separate teaching engagements in industry last year alone. I love industry. I'm very comfortable in industry.

I'm very active in outreach and social media and professional organizations like the AAPG. I'm working very hard to support geoscientists and engineers in our field with the digital transformation. I think a professor is a role of service to society. And I think it's part of my role, is to serve and help our industry and our people, our experts, in gaining new skills and work in the digital transformation.

So I've recorded all of my university lectures, every single one of them, and I put them on my YouTube channel. And every single example workflow is put up on GitHub. Anyone, anywhere in the world, can follow along with my courses. And I believe in that. I think it should be open.

Now, what's funny is, I visited Chevron recently. And a manager at Chevron made a joke-- said, Michael, you're more famous around here since you left, because apparently, there's many people around the company using my materials. And I welcome you to use them. I hope you find value in them.

I've been at University just a couple of years now. And I've already built, I think, a really great, exciting team. I've seen a germination of something great, something start-- a wave, in a way. And so I already have 12 PhD students. I've had to stop letting new students in, because it's just too much. And I'm trying to keep up with it. I have a brand new consortium that just kicked off.

I also have a Freshman Research Initiative and Ventures Program-- the only fully funded one from industry, funded by ConocoPhillips-- in the College of Natural Sciences, working on energy analytics. And so there's a whole movement going on right now. I think we're changing things, and we're getting geoscientists and engineers in our field ready to work with data analytics and machine learning, digital technologies.

So let's talk about some prerequisites. So I'm going to go into professor mode now. We're going to cover the basics of data analytics and machine learning. I want to cover those fundamental concepts and terms, so that you're able to follow along with the rest of the discussion. But you're going to see, there's a lot of nuggets nested in there too that are going to help you understand the overall technology and what it can do.

First of all, everybody is talking about big data. Now, the question people have is, do I have big data? Now, if you look up big data online, you're going to find out that the criteria for big data are a set of these, the v's-- volume, velocity, variety, variability, veracity. And if you have enough of these v's, you can claim that you have big data.

So let's talk about a couple of them. Volume. Volume would mean that you have a large number of samples. It's difficult to handle. It's difficult to visualize. Now, I've been to tech meet-ups where people will say, hey, if it fits on your laptop, it's not big data. It's not big enough. I don't really have that feeling about it. I'd say, if it's difficult to handle and visualize, that's big data.

Velocity. This is the idea that the rate of collection of the data is high, continuous relative to the decision-making cycles. Now, many people in a tech meeting will say it has to be real-time data. What I say, is for energy-- given the complexity of our workflows and our decisions and the rate at which we gathered that information, and the vastness, the quantity of the data we collect when we do things like shoot seismic and drill new wells-- I would suggest that we have velocity, too.

Variety. Data from various sources with various types and scales. We clearly win at that. We handle data all the way from the intergranular, the pore scale, all the way up to the drainage radius of individual wells that are producing and beyond that, to basin analysis and the detection of even basins that we should be working in. We definitely have variety.

Variability. Data acquisition changes during the project. We have that, too. I think many of us have worked with projects where you start with two-dimensional seismic lines, three-dimensional seismic surveys. Maybe there's a reprocessing. Maybe there's a reshooting of the seismic to improve it. And then you're installing ocean bottom nodes. You're doing 4D seismic. We have a lot of variability in our projects.

And veracity. Data has various levels of accuracy. We work in the subsurface. Every one of our data sets is uncertain. It's actually very difficult to find any of our data that's truly hard data. It all has some level of softness.

And many times, we work with data that's core, samples, measurements-- we get directly from cores. That does have a pretty high level of precision. But you can still wonder about the issues around recovery of the core from the subsurface. And then we work with seismic, where we have to do very complicated, physics-based inversions. There's a lot of uncertainty.

So what do I say from all this about big data? Every time I go and meet with a tech company, I proudly exclaim, energy has been big data long before tech companies even learned what big data was. We know something about big data.

Statistics. Well, if you go back and you look at the fundamental definition of statistics, it's just about collecting, organizing, and interpreting data. You draw conclusions from it, but everything comes down to making a decision. If you don't impact the decision, you don't add value. And so statistics is all about supporting decision-making.

Geostatistics is a branch of applied statistics. Now, the great thing about it, it was developed back in the 1950s-- really focused on practical needs of subsurface estimation, back in gold mining, but later on, used in oil and gas.

Now, the great thing about it is, because it was based on the practice, it's actually quite intuitive. The theory was just added on later. So we have the math. But really, it's applied statistics, where we have a spatial context, a geologic context, spatial relationships, volumetric support, and uncertainty, which is always there for us. So it's a subset of statistics. And we can draw a Venn diagram like I show over there, where you can see geostatistics as being a subset of statistics.

Now, data analytics, if you look it up online and you try to get a definition of it, you'll find it's all about analysis of data to support decision-making. Now, often, people will talk about business decision-making. They'll put a business slant on it. But what's very interesting, if you look at that definition, I have trouble distinguishing that from statistics. In fact, I would say, data analytics really is the use of statistics and visualization.

Now, big data analytics is the process of examining large and varied data sets-- the big data we talked about just now-- to discover patterns and to make decisions-- shouldn't be any surprises there. And spatial big data analytics is the expert use of spatial statistics, geostatistics on big data to support decision-making.

Now, what can we say about all this? Well, given the fact that data analytics is really the use of statistics and spatial data analytics is the use of geostatistics with visualization to support decision-making, I would say, go back home and update your CV, because you work in data analytics if you understand geostatistics and statistics and you use it in your job.

Machine learning. Well, I did what everybody else does. I looked up machine learning on Google, and the first page was Wikipedia. So I went ahead and copy and pasted from Wikipedia, and this is what I got. Let's break it down and do a little bit of an analysis of what it means.

Well, machine learning is a study of algorithms and mathematical models. Now, you notice that's plural. So what it's telling us, it's a tool kit. It's not one method. It's many methods we work with, that computer systems use to progressively improve their performance on a specific task.

It's learning. It's improving its performance. It's learning. Machine learning algorithms build a mathematical model of sample data known as training data. It's learning from data. It's training with data in order to make predictions or decisions without being explicitly programmed to perform the task, without being programmed to perform the task.

It's general. It can be applied on a wide range of problems. It doesn't need to be programmed how to solve that exact problem, but it can be general and applied to many different problems.

Now, many people stop there. I read to the end of the article. Near the end of the article, you'll find the following phrase, where it is infeasible to develop an algorithm of specific instructions for performing the task. What does that mean? It means this. If you understand the physics, if you understand the geological concepts, use your knowledge. Don't let a machine decide for you.

It's not a panacea. Machine learning is not supposed to be just used on every problem. It's really best for those problems for which the data is too big, the problem is too complicated, we don't understand the physics. But you'll see, we'll talk a little bit more about how we can put it all together.

Machine learning nuts and bolts. What does a machine learning model look like? Well, that's it right there. A machine learning model is simply going to be a function where we take a set of x's and we get a y. Now, let me just define each one of those components.

The x's, x1 through xm, they're the predictor features. Now, before it was all cool and trendy to talk like that, we would have said, those are the independent variables. But in machine learning, we'll call them predictors for the inputs and features instead of variables.

Now, what we're predicting on the other side of the equation, the y, is the response feature. And there might be more than one, but it'll be a response feature or features. And that, back in the good old days, was just the dependent variable. And so that's the output from the model. So when you look at it, machine learning is all about estimating a mathematical model f for the two purposes, inference or prediction. Let's talk about inference first.
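In symbols, the model just described can be written as below. The notation is an assumption for illustration, since the slide itself is not reproduced in the transcript: the hat marks the estimate learned from data, and the error term follows the usual statistical convention.

```latex
y = f(x_1, x_2, \ldots, x_m) + \epsilon,
\qquad
\hat{y} = \hat{f}(x_1, x_2, \ldots, x_m)
```

Here $x_1, \ldots, x_m$ are the predictor features, $y$ is the response feature, and $\hat{f}$ is the estimated model used for inference or prediction.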

Machine learning inference. What is the relationship between each one of the predictor features? That's important in itself. That's understanding the system, the sense of the relationships.

Is it positive or negative relationships? As one goes up, does the other feature go down? What's the shape of the relationship? Are there sweet spots? Are there specific locations where you get certain concentrations of samples?

Maybe there's other combinations of features that don't happen because of physical constraints. And understanding all of the complicated relationships between each one of the predictor features is very powerful. That's about understanding the system.

Now, if you want to understand a little bit deeper about what inference is, think about inferential statistics. I'll give you a very simple example. If I give you three heads and seven tails, and I tell you this is the coin that did that, tell me what the probability is that that coin is a fair coin, 50/50 chance of having heads and tails.

That's inference, given the sample, describe the population. Given the result of the coin tosses, tell me about the coin. That's inferential statistics, which actually is very difficult to do. It is a bit more complicated.

Prediction is actually easier. What is prediction? Well, in the case of machine learning, what we're doing is we're estimating that function f, for the purpose of predicting y. Our focus is on getting the most accurate estimate of y. That's what it's all about. We want to get the best estimate of y.

Now, if you want to understand a little bit deeper once again, think about predictive statistics. It's the case of, given a fair coin, and I tell you that coin is fair, what's the probability of an outcome such as three heads and seven tails? In other words, given an assumption about the population, predict the outcome of the next sample. That's prediction.
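The coin example can be worked through in a few lines of Python. This is a minimal sketch, not something from the lecture: the helper name `binomial_pmf` is hypothetical, and the inference half uses a simple maximum likelihood search as a stand-in, which is a simplification of the harder question of how probable it is that the coin is fair.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n independent tosses with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n_tosses, n_heads = 10, 3

# Prediction: the population (a fair coin) is assumed known; predict the sample.
p_fair = binomial_pmf(n_heads, n_tosses, 0.5)
print(f"P(3 heads, 7 tails | fair coin) = {p_fair:.4f}")  # 120 / 1024 = 0.1172

# Inference: the sample is known; learn about the population.
# Evaluate the likelihood of the observed tosses over candidate values of P(heads).
candidates = [i / 100 for i in range(1, 100)]
likelihood = [binomial_pmf(n_heads, n_tosses, p) for p in candidates]
p_best = candidates[likelihood.index(max(likelihood))]
print(f"Most likely P(heads) given the sample = {p_best:.2f}")  # 0.30
```

The prediction direction is a single formula evaluation, while the inference direction requires searching over possible populations, which is one way to see why inference is the harder task.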

So when we're doing that, building those functions, f for machine learning, inference and prediction, we've got two different types of functions that we can work with. We've got parametric models and non-parametric models. It's very straightforward.

A parametric model, we make an assumption about the functional form or shape of the model. We gain from simplicity. We have an advantage, because now we can describe the entire system with very few parameters. And because of that, we typically can build our models with fewer data.

Now, here's an example right here. Our function is simply a linear model, where we just say that our y is equal to a set of coefficients, multiplied by each one of the predictor features. Not a big deal and so I show a model right there, a simple linear model that describes the relationship between elevation and standardized porosity, which would simply be something equivalent to a-- some type of compaction trend.

Now, working with non-parametric models, that's our other option. In that case, what we do is we don't make any assumption about the functional form or shape. It's much more flexible, because we can fit any possible shape or function.

You could imagine, if we had data like the data we're showing right here, that if we tried to fit a linear model to that, we miss a lot of those cycles, because our model was-- we already assumed it was linear. We don't have the flexibility to fit that. So we have less risk that our function or our estimate of the function is a poor fit for the actual natural system. But typically-- there is always a trade-off, no free lunch-- typically, you need a lot more data for an accurate estimate of that function.
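The contrast between the two model types can be sketched with NumPy. Everything here is an assumption for illustration: the cyclic data are synthetic rather than the lecture's data, and a k-nearest-neighbor average stands in for "a non-parametric model" since the talk does not name a specific method.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Synthetic cyclic data (illustrative assumption, not the lecture's data set).
x = np.sort(rng.uniform(0.0, 4.0 * np.pi, 200))
y = np.sin(x) + rng.normal(0.0, 0.1, x.size)

# Parametric: assume a linear form, y = b1 * x + b0; two parameters describe it.
b1, b0 = np.polyfit(x, y, deg=1)

def predict_linear(x_new):
    return b1 * np.asarray(x_new) + b0

# Non-parametric: no assumed form; average the k nearest training responses.
def predict_knn(x_new, k=5):
    x_new = np.atleast_1d(x_new)
    dist = np.abs(x_new[:, None] - x[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y[nearest].mean(axis=1)

# Compare both models against the known underlying function.
x_grid = np.linspace(0.5, 4.0 * np.pi - 0.5, 100)
truth = np.sin(x_grid)
mse_linear = float(np.mean((predict_linear(x_grid) - truth) ** 2))
mse_knn = float(np.mean((predict_knn(x_grid) - truth) ** 2))
print(f"linear MSE: {mse_linear:.3f}, k-nearest MSE: {mse_knn:.3f}")
```

The linear model cannot follow the cycles no matter how much data it sees, while the neighbor average follows them but only because 200 samples cover the range densely; with a handful of samples it would degrade quickly.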

So how do we build a machine learning model? How do we do it? It's just like this. This slide right here shows you the entire process. What we'll do is we'll take all of our available data, and we'll split it into train and test subsets.

Now, I know there are some experts out there saying, well, it could be much more complicated. There could be validation, train and test and all of that. I'm showing the simplest workflow possible.

So we take the data set. We separate it. Usually, about 80% goes in the train. 20% or so goes in the test. There's papers written about what's the best split or proportions for the split. And then what you'll do is you'll take the train data, and you'll build models with a variety of different levels of complexity.

So at the top, what I'm showing is a very simple model. It's a polynomial that's just linear, a first order polynomial. And at the bottom, it's more like a seventh order polynomial. And then we'll take the parameters of that model, and we will set them so that we get the best fit model to the training data. So for each level of complexity, from the top to the bottom, we're getting the very best model to fit the data. We're minimizing the error.

And then what we do is we take those best fit models, and we check them against the withheld testing data. That testing data was not used to get the best fit model. And we can calculate the error over each one of those testing data, and we'll do that for each level of complexity.

Now we'll pick the model that performs best with the data not used to train it. What we're doing is we're picking the very best level of complexity. In other words, we're tuning the model complexity, or as what I'll show right away here, we're tuning the hyperparameters.

So now we have to define a couple terms, parameters and hyperparameters. The parameters, it's not a big deal. The model parameters are simply-- in the case of our linear model, just those coefficients, the b3, the b2, the b1, and the constant terms. And so we will set those parameters in training such that we minimize the error with regard to the training data. We're getting the very best fit model.

The model hyperparameters are totally different. They're the constraint on the model complexity. We're going to select hyperparameters that maximize accuracy when we're testing against the testing data.

Now, for the case of our polynomial model that we're showing here in this example, the hyperparameter is simply going to be the order of the polynomial. Are we working with a first order, a second order, a third order? I show fifth and seventh orders in the cases in the bottom right hand corner. So that's our hyperparameter. It's the degree of complexity of the model. And we tune our hyperparameter with the testing data.
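The workflow described above (split, fit parameters on the training data for each level of complexity, pick the hyperparameter by testing error) can be sketched as follows. The data are synthetic and the variable names are hypothetical; this is the simplest train and test version, without a separate validation set.

```python
import numpy as np

rng = np.random.default_rng(seed=42)

# Synthetic data (illustrative assumption): a smooth trend plus noise.
x = rng.uniform(-1.0, 1.0, 100)
y = np.sin(3.0 * x) + rng.normal(0.0, 0.2, x.size)

# Split: about 80% train, 20% withheld for testing.
idx = rng.permutation(x.size)
n_train = int(0.8 * x.size)
train, test = idx[:n_train], idx[n_train:]

def mse(coeffs, xs, ys):
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

results = {}
for order in range(1, 8):  # hyperparameter: polynomial order 1 through 7
    # Parameters: coefficients fit to minimize error on the training data only.
    coeffs = np.polyfit(x[train], y[train], deg=order)
    results[order] = (mse(coeffs, x[train], y[train]),
                      mse(coeffs, x[test], y[test]))

# Tune the hyperparameter: keep the order with the lowest testing error.
best_order = min(results, key=lambda k: results[k][1])
for order, (train_mse, test_mse) in results.items():
    print(f"order {order}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
print("selected order:", best_order)
```

Note that the training error can only go down as the order increases, so it cannot be used to choose the order; only the withheld data can.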

Now, when we're doing that tuning, the hyperparameters, why are we doing that? What's going on here? And what it all comes down to is a variance and bias trade-off. We want to get the best estimates, the most accurate estimates in testing.

Now, testing really means that we're trying to mimic the idea of using the model in cases not used to build the model. That's real world application of the model. Now, you could take the mathematics of expected test mean squared error, and you could expand it out. I won't go through the derivation here.

But when you do that, you get three components, additive components of error for the real world use of your machine. And what you'll find, there are model variance, model bias, and irreducible error. Now, let me explain each of them.

Model variance is the error due to sensitivity to the data set. In other words, if you were to collect slightly different data, how much would your model change? Now, you could imagine that if you have a situation where you have a linear regression model, a simple linear model, if you change the training data a little bit, it might wiggle a little bit but not that much.

If I increase the complexity and I have a ninth order polynomial-- if you change the data a little bit, that ninth order polynomial will swing around radically. It will change quite a bit. It's more sensitive. Model variance increases as the model complexity increases, and that's the orange line in the bottom right.
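The sensitivity described here can be checked numerically. The sketch below is an illustration under stated assumptions: the underlying function, noise level, and sample locations are all made up, and "slightly different data" is mimicked by redrawing the noise at fixed locations.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

x = np.linspace(-1.0, 1.0, 20)   # fixed sample locations
x_eval = 0.95                    # where the fitted models are compared
noise_sd = 0.2

def prediction_spread(order, n_repeats=500):
    """Std. dev. of the prediction at x_eval over many noisy versions of the data."""
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        # Hypothetical natural system plus a fresh draw of measurement noise.
        y = np.sin(3.0 * x) + rng.normal(0.0, noise_sd, x.size)
        coeffs = np.polyfit(x, y, deg=order)
        preds[i] = np.polyval(coeffs, x_eval)
    return float(np.std(preds))

spread_linear = prediction_spread(order=1)
spread_ninth = prediction_spread(order=9)
print(f"order 1 spread: {spread_linear:.3f}, order 9 spread: {spread_ninth:.3f}")
```

The spread of the ninth order predictions is the model variance at that location; it is larger than for the linear model because the extra coefficients chase the noise in each data set.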

Now, model bias is the other side of the coin. It's the error due to the fact that you have a too simple model, an approximative model. And so if I use a linear regression model but I have a complicated natural phenomenon, I have high model bias, because my model is not flexible enough.

As model complexity goes up, what actually happens, as you can see with the blue line there in the bottom right corner, model bias goes down. So we're balancing model bias and model variance with complexity. In other words, as we tune our hyperparameter, we're shifting along that line.

Now, irreducible error is another component, and that's error just due to the data, the limitations of the data themselves. You could have features you didn't sample, but they are central to understanding the natural system. If you don't have that information, of course, you're going to have error in the model.

Or there could be combinations of features you've never sampled. Maybe you didn't have enough samples. In other words, this is just the limitation of the data. And even if you've got the world's leading expert in machine learning here to help you out, they can't reduce irreducible error. And so irreducible error is just constant over all possible models you choose, over all levels of complexity.
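For reference, the three additive components can be written in the standard textbook form below. The notation is an assumption, not taken from the slides: $x_0$ is a location not used in training, $f$ is the true system, $\hat{f}$ is the trained model, and $\sigma_\epsilon^2$ is the variance of the noise in the data. Note that model bias enters as a squared term.

```latex
E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
  = \underbrace{\mathrm{Var}\left[\hat{f}(x_0)\right]}_{\text{model variance}}
  + \underbrace{\left(E\left[\hat{f}(x_0)\right] - f(x_0)\right)^2}_{\text{model bias, squared}}
  + \underbrace{\sigma_\epsilon^2}_{\text{irreducible error}}
```

The first two terms change with model complexity; the last term does not depend on the model at all.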

Now, we've got to talk about overfit, because that's what it's really all about when we're talking about balancing variance and bias. It's about avoiding overfit. So what is overfit? Overfit could be defined as fitting the data noise in the model. And data idiosyncrasies now become part of the model, and that's a problem. It's going to lead to very bad predictions with your model.

If you increase the complexity-- as you can see on the top right over there-- if you increase the complexity, you'll generally decrease the error with respect to training data. If you look at the example I'm showing below, you can see where we use a very complicated model, we perfectly fit the data. We have no error at the data locations.

Now, when we decrease the complexity of the model-- over towards the right on the bottom-- you can see, we start to have error, but we still have a good model. So as we increase the complexity, we will reduce the training error to zero. But what will happen-- and we'll find this when we do validation of our model-- we'll find that the testing error will go up.

And so you can see, the blue line-- we reach a point in model complexity where the training error is low, but the testing error is going very high. In other words, you're fooling yourself. You think you know more than you actually do. You have a model that's going to perform very poorly in real world circumstances. That's an overfit model.
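A small numerical illustration of overfit, using made-up data: with 8 training samples, a seventh order polynomial passes through every sample, so the training error is essentially zero, yet the error on withheld data is not. The function and noise level are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

def truth(x):
    return np.sin(3.0 * x)  # hypothetical natural system

# Few training samples, and separate withheld testing samples.
x_train = np.linspace(-1.0, 1.0, 8)
y_train = truth(x_train) + rng.normal(0.0, 0.2, x_train.size)
x_test = np.linspace(-0.95, 0.95, 50)
y_test = truth(x_test) + rng.normal(0.0, 0.2, x_test.size)

def errors(order):
    """Return (training MSE, testing MSE) for a polynomial of the given order."""
    coeffs = np.polyfit(x_train, y_train, deg=order)
    train_mse = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    return train_mse, test_mse

for order in (1, 3, 7):
    train_mse, test_mse = errors(order)
    print(f"order {order}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```

At order 7 the model has reproduced the noise in the training samples exactly, which is the "data idiosyncrasies become part of the model" situation; the training error alone would suggest a perfect model.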

So we covered some of the prerequisites of data analytics and machine learning. Now, let's talk about the specifics of energy data analytics and machine learning. First of all, if no one else has done this yet, let me welcome you to the fourth paradigm of scientific discovery. It just started. Isn't that exciting, to actually be alive during a brand new scientific paradigm, to actually be there and see it get started?

When did we have other paradigms? Well, you learned about this back in high school. The empirical science approach, well, you can go way back to antiquity, and you could see they were running experiments. And they were learning about the natural setting by doing that.

The theoretical science came along later. And we start to develop the equations, the analytical expressions, and discover the laws of the classical mechanics, electrodynamics, to start to understand our natural system. Now, what we found out, back when the computers started to get more powerful, was there is many cases for which these natural laws are not sufficient.

Complicated heterogeneous systems, you can't solve it using just the analytical expressions. You had to run computer simulations. And so that's the computational science simulation paradigm, and that would be the third paradigm.

The fourth paradigm is the data-driven science approach, the idea of detecting patterns and anomalies in big data sets. In modern society, we're surrounded by data, and where we lack the physical explanations, we're now using data-driven science to try to understand. Artificial intelligence is really starting to take off. And so that's the fourth paradigm, and we're there now.

Now, what does that mean for society? Well, if you look at all sectors of our economy, everybody is facing a digital transformation. Deloitte did a recent study. Just last year, they looked at the preparedness of different sectors of our economy.

And when they looked at oil and gas, they put us in the middle. We're not at the lead. We're not trailing behind. But we're somewhere in the middle for all of the ranking as far as readiness. They can see that we're making efforts.

Now, the good thing is we're not alone. Everybody, in all sectors of our economy, are rushing right now to try to find new ways to add value with digital technologies and to add capabilities to their teams around digitalization. We're all doing it together.

Now, I have some biases, though. I should be honest about that. I get asked to speak a bit about the topic. And PricewaterhouseCoopers, just last summer, had me come and sit on a panel. You can see, I kind of stand out. I have the long hair.

And they asked me to stand up there in front of a bunch of energy executives and talk about what's going on with energy digitalization. And this is what I tell them. I think I disappoint and surprise people when I say these things. I say, there's opportunities to do more with our data. I think that right there is the low case, is that we'll just do more with our data, and we'll treat our data better.

There's opportunities to teach data analytics and statistics, machine learning methods to engineers and geoscientists to improve their capabilities. And I mean the students, and I also mean the working professionals. I think it'd be better if we all understood that better.

And geoscience and engineering knowledge and expertise remains core to our business. I don't think we should all be replaced by data scientists. And I'll make a couple more comments around that. I think it's necessary to retain that strong level of geoscience and engineering knowledge.

Now, what am I saying when I make that strong statement? I'm saying that just because we discovered the paradigm of computational science and simulation, it didn't mean we abandoned theoretical science and empirical science. We actually augment new scientific paradigms to our toolkit. We don't replace the older paradigms.

When we have the analytical and theoretical expressions, we're going to use them. When we can solve the system by first principles, we'll do it. When we have to work just based on observations and trying to sample the problem, we're going to do that, too. And when we just have data available to us and we lack physical explanations, we'll use the data-driven approaches, too. They all work together, and they can augment and support each other.

But what's crazy about this, in this data-driven science world that we live in right now, it needs data. And in fact, back in the good old days when I was building subsurface models at Chevron, what we knew was 80% and sometimes 90% of our effort was data preparation, getting the data ready for the model, data preparation, interpretation, and so forth.

We continue to face challenges with data, data curation, the large volumes of data, the large volumes of metadata-- we have a lot of metadata, the data about the data-- variety of data scale, collection methods, the interpretation, the transmission controls and security of our data. They still challenge us.

In other words, clean databases are prerequisite to all of our data analytics and machine learning, and we've got to focus there. We have to work there. That's our foundation for everything we do with our data. And you remember the old adage of garbage in, garbage out? It still stands, even in this modern fourth paradigm.

Now, I also do believe that energy is unique. I would argue that many of the tools and technologies for data analytics and machine learning are not quite ready for what we do, because we are so unique in what we do. We need unique solutions. Why is that? We have sparse, uncertain data, complicated, heterogeneous, open-earth systems.

Compare us to Google or Amazon, they see every click. They have exhaustive data sets. The people working with satellite images, they see every pixel. We sample one trillionth of the subsurface, and even that sample relies a lot on interpretation. And then we have to interpret all kinds of physics-based inversions that go on between those samples to try to understand what's going on.

We have a high degree of necessary geoscience and engineering interpretation of physics because of that. And our decisions are extremely expensive. They are very, very high value decisions. I remember, just years ago, drilling a single well in the Gulf of Mexico. We're talking about hundreds of millions of dollars for a single well in the deepwater Gulf of Mexico, specifically, if you add a production test onto it.

Now, let's compare and contrast that with the very common use of artificial intelligence. I don't know how many of you are using Spotify, but I know many people shop at Amazon. If you've done that, you've encountered what's known as a recommender system. What it does is it tries to look at your behaviors and tries to suggest what you want next.

Maybe this is too much information, but this is my Spotify recommender system back from the summer of 2019. And so what it does is it looks at what I listen to at work, and it tries to figure out what I want to listen to next. Now, I'm going to tell you something. I'm Canadian. I do listen to a lot of Canadian music. And because of that, every once in a while, it recommends Nickelback. I want to assure you that not a single Canadian likes Nickelback any longer.

Now, what happens when it recommends Nickelback to me? Well, it starts playing. And you know Nickelback. We all liked it before. Your head starts to bob. Your foot starts to tap. You think, hey, this is pretty good hard rock. And then you remember, this is Nickelback. You fast forward it. You kind of say, darn, I did that again, and you move on.

What was the consequence? Well, clearly, Spotify got it completely wrong. It's like drilling in the completely wrong location. What was the cost of that mistake? There's actually no cost at all. I didn't cancel my Spotify account. Nobody will. We just move on.

They work in the space of very low value decisions. It would make no sense to have human interaction in those decisions. So we have to recognize energy is quite different. We have to be critical users, consumers, and developers of this technology. It was developed for very different applications.

Don't jump to complexity. Now, you remember, I showed you this idea of variance bias trade-off. Let's look at that equation again. I already explained model variance, model bias, and irreducible error.

Now, I did mention that they're all additive, which means the error in testing-- which in other words, is the error in real world use of your model-- is the summation of those three lines. So that red line over there, that's the summation of irreducible error, model bias, and model variance.
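Written out, the additive relationship described here is usually stated as follows for squared-error loss. The symbols are assumptions chosen for illustration and are not taken from the slide: $\hat{f}$ is the trained model, $f$ the true relationship, and $\sigma_\varepsilon^2$ the noise variance. Note that in this standard form the bias term enters squared.

```latex
\mathbb{E}\left[\left(y_0 - \hat{f}(x_0)\right)^2\right]
= \underbrace{\operatorname{Var}\left[\hat{f}(x_0)\right]}_{\text{model variance}}
+ \underbrace{\left(\mathbb{E}\left[\hat{f}(x_0)\right] - f(x_0)\right)^2}_{\text{model bias, squared}}
+ \underbrace{\sigma_\varepsilon^2}_{\text{irreducible error}}
```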

Now, what do you see? You'll notice that the best performance, the lowest error in real world use is often not the most complicated model. In fact, model variance usually eats your lunch. It really is a problem. Many of the advanced methods of machine learning are all about trying to defeat model variance.

And so by using a simpler model, we have lower model variance, which is very powerful. What else do we gain? We gain a high degree of interpretability. We understand the model, because it's a simpler model. And so don't jump to complexity.

Interpretability is critical. Now, what's interesting, when you develop methods and workflows, it's very important to have diagnostics to understand what the model's doing. Interpretability of complicated machine learning models is, in fact, very low. Sometimes, it's just absent. Application of a machine may become routine and trusted.

Now, here's a problem. When a workflow or method becomes routine-- it becomes the preferred workflow-- what happens is you'll find that if you don't use that workflow in an organization, it becomes kind of a red flag. You have a lot of explaining to do. The machine becomes trusted. And when you don't understand it, you can't interpret it, it becomes an unquestioned authority. That's very dangerous.

Now, there's a really interesting study that was completed. Ribeiro and others, back in 2016, took a bunch of pictures-- about 20 pictures of wolves and dogs-- and they standardized them. They put them through some type of logistic system, a machine whose output said, what's the probability of dog? What's the probability of wolf?

And so what happened when they did this was they put this image into it. And what they found was that it came back high probability of wolf, like 90-something percent wolf. And so they looked at it. And if you're a dog person, you recognize immediately, that's not a wolf. That's a husky.

And so they went back to the machine, and they said, tell us which pixels in this image gave it a high probability of wolf. And now, you see in the image on the right that those pixels, in fact, are the snow in the background. The reason this was classified as wolf was because it was standing in snow. All those pictures of wolves, they're always up in Canada, far in the North, in those scary, dark places, that's why.

And so a bias has been put into the system. And if you could not interpret it, you would not understand the machine was getting it completely wrong. Now, this example was shared by Peter Haas in his TED Talk, which I really appreciated. Go ahead, check that out, great talk about why he was afraid of the AI approaches.

And what he says is even the developers that work on this stuff have no idea what it's doing. And sometimes, I believe that in some of the more complicated methods. And what he also says is that these systems do not fail gracefully. When they get it wrong, they get it completely wrong.

Now, it's important to also talk about meeting technology expectations. The Gartner Technology Hype Cycle is really important to look at. This is a well-known plot that shows us time on the x-axis and expectations on the y-axis. And so we go from the innovation trigger, the discovery of a technology, through a peak of inflated expectations, through a trough of disillusionment. We go up a slope to a plateau of productivity.

And so what we see is that-- when I go to many different companies-- I mentioned before, I think I visited 20 companies or so last year-- I like to show this chart, and I like to ask, where are you right now? And what I find is that the answer depends on the company. And often, it actually depends on the group within the company or the individual within the company.

But I'll tell you what, I haven't had anybody tell me yet that we're at the plateau. In fact, most people suggest that we're somewhere between the innovation trigger and the peak of inflated expectations, if not starting to come down the other side a little bit. Globally, the expectations are very high for this technology, and we need to manage that.

So how are we going to meet these expectations? How are we going to harness this technology? Well, we need operational capability, which inside of a company, it's just fancy speak for saying that we need the skill sets among our people to be able to use these technologies. And I agree, we need data scientists.

The Venn diagram for data scientists is shown here on the right. It's the idea of having domain expertise, understanding the geoscience engineering, the statistics, probability, data analytics, and so forth, and coding, the ability to put together workflows, automating, scripting, and so forth. And that individual, that magic individual who can do all three of those is considered a data scientist.

And I'll tell you what, I go to companies right now, and I'm running into data scientists all the time. They're being hired into our companies. What I'd suggest, in general-- what I find is that often the domain expertise may be a bit low. And so what I've seen a lot of companies do is partner them up with experts with good domain expertise.

Now, this is what I think. I took that Venn diagram, and I wrecked it. It's no longer a Venn diagram, because I made some adjustments to it. And I said, well, let me change the size of the circles to represent where I think we should focus and let me make comments about how we can grow capabilities with our geoscientists and engineers, because I think that's a great idea, too.

What we do in our graduate and undergraduate education at the University of Texas at Austin is I've worked to revamp our program and to put in brand new courses that teach the concepts of data analytics and machine learning, specifically for geoscientists and engineers. I think that's a great idea. We continue to develop the subsurface geoscience and engineering expertise. That's critical.

But what we do is we teach them the statistics. We enhance their understanding of statistics, the practice for data analytics. And we encourage, and we get them coding, which is essential, so that they're able to use the very best tools and build the workflows to add value.

So I think that's what we need to do with our students, and that's what I'm doing right now. But also, what we can do is we can build capability among our existing geoscience and engineering workforce. And that's why I put every one of my lectures online and all of the workflows online, is to support this effort to build that capability. I've seen great companies working to build those capabilities. I very much have enjoyed being part of teaching those individuals.

Let me talk about energy data analytics, machine learning examples. I want to show you a bunch of examples. Now, what are some examples of things we can do with data analytics? One of the great things we can do is work on the problem of feature selection.

It turns out, taking every possible feature and the kitchen sink and throwing it into your model is not a good idea. You'll have a very weak model. It can have very low interpretability. It's way too complicated. And so it's better to choose the best features to work with, the ones that communicate the most information.

So we use data analytics to help pick the very best features to work with, for improved interpretability. We reduce model variance, and we get improved model accuracy while doing it. We have many different features we work with-- and this is a matrix scatter plot shown right here-- a wide variety of different variables or features we'd be working with in an unconventional setting. We may want to go through those and figure out which ones are most indicative or predictive of production of individual wells.

And so that's what we've done right here. We use data analytics. And we said, we can do a correlation analysis. We can look at things such as rank correlation coefficients that are robust, in the case of having outliers and any type of non-linearity that behaves monotonically.
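A minimal sketch of why a rank correlation is robust to monotonic non-linearity, using synthetic data. The variable names and the exponential relationship are assumptions for illustration, not values from the lecture's data set.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(73)

# Hypothetical well data: production rises monotonically but nonlinearly with porosity.
porosity = rng.uniform(5.0, 25.0, size=200)   # percent
production = np.exp(0.3 * porosity)           # arbitrary units

pearson_r, _ = pearsonr(porosity, production)    # linear association
spearman_r, _ = spearmanr(porosity, production)  # monotonic association, computed on ranks

print(f"Pearson: {pearson_r:.2f}, Spearman rank: {spearman_r:.2f}")
```

Because the relationship is perfectly monotonic, the rank coefficient stays at its maximum while the linear coefficient is pulled down by the curvature.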

We can use measures such as partial correlation coefficients-- which I think are way underused-- which are able to actually isolate the influence of individual predictor features on the response. In other words, understand how porosity alone impacts production, very powerful stuff. We can also use model-based importance measures, which are really great. Not only do machine learning models give you great methodologies to make predictions, but they actually can help us rank our features and understand which ones are most impactful.
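One common way to compute a partial correlation is to remove the linear effect of the other predictors from both variables and then correlate the residuals. The sketch below assumes synthetic data in which production is driven by porosity only, while brittleness merely tracks porosity. The function name `partial_correlation` is hypothetical.

```python
import numpy as np

def partial_correlation(x, y, controls):
    """Correlate x and y after linearly removing the controls from both."""
    design = np.column_stack([np.ones(len(x)), controls])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return np.corrcoef(res_x, res_y)[0, 1]

rng = np.random.default_rng(73)
n = 500
porosity = rng.normal(size=n)                               # standardized
brittleness = 0.8 * porosity + 0.6 * rng.normal(size=n)     # correlated with porosity
production = 2.0 * porosity + 0.5 * rng.normal(size=n)      # driven by porosity only

simple_r = np.corrcoef(brittleness, production)[0, 1]
partial_r = partial_correlation(brittleness, production, porosity)

print(f"brittleness vs production: simple {simple_r:.2f}, partial {partial_r:.2f}")
```

The simple correlation makes brittleness look important, but once porosity is controlled for, its isolated influence is close to zero.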

Now, I've got to tell you, sometimes I just go back to traditional statistics. I go back to the idea of just conditional distributions. And so this is an example right here from that same example, where we have a violin plot. And so what we've done is we've taken the wells, and we split them up into low producing wells, high producing wells. And we just look at the conditional distributions over each one of the predictor features.

And what you can do is you can evaluate which ones of these predictor features have the most unique or distinct behavior between those conditional distributions. Porosity is distinctly different, whereas acoustic impedance, it may be difficult to tell them apart. They're a little bit different from each other. This tells us about the sensitivity. It's a good indicator of the importance of each one of the features.
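The same comparison can be put into numbers without a violin plot. This sketch assumes synthetic wells and a simple separation score (difference in group means, in units of the feature's standard deviation); the column names and the score are illustrative choices, not the lecture's workflow.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(73)
n = 400
wells = pd.DataFrame({
    "porosity": rng.normal(15.0, 3.0, n),            # percent
    "acoustic_impedance": rng.normal(3.0, 0.5, n),   # arbitrary units
})
# Assumption: production depends on porosity and not on acoustic impedance.
wells["production"] = 100.0 * wells["porosity"] + rng.normal(0.0, 150.0, n)

# Split the wells into low and high producers at the median.
wells["group"] = np.where(wells["production"] > wells["production"].median(), "high", "low")

features = ["porosity", "acoustic_impedance"]
means = wells.groupby("group")[features].mean()
separation = (means.loc["high"] - means.loc["low"]).abs() / wells[features].std()

print(separation)
```

A feature whose conditional distributions sit far apart gets a large score, and one whose distributions overlap gets a score near zero.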

Now, there's another method we can use in machine learning, that's inferential type of approaches, such as cluster analysis. Cluster analysis is really great. It's an automatic assignment of categories or groups, looking for how things group within your data set. It's a first step to finding patterns.

And so we give ourselves two predictor features. One of them is well average porosity, and the other one is well average acoustic impedance. We have a petrophysical property on the x-axis and a geophysical property on the y-axis. We could do a grouping of rock in such a manner that it's observable in the wells and observable in seismic at the same time.

Now, I use the term rock type. I'm not suggesting facies. I know there's a lot that goes into facies, but I'm just suggesting a grouping of the rock. So we can go ahead and do that. And if we use some type of knowledge about the setting and we say that we expect to see three groups, we can get something like this. We have low porosity, high acoustic impedance, high porosity, low acoustic impedance and something in-between.

Now, this can be very useful. Now, this is a very simple example, though. And those who know something about cluster analysis will recognize that I've just used k-means clustering here.
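A minimal k-means sketch with three groups, assuming synthetic well averages. The group centers, noise levels, and the choice to standardize the features first are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(73)

# Hypothetical group centers: (porosity in percent, acoustic impedance).
centers = np.array([[8.0, 7.0], [14.0, 5.5], [20.0, 4.0]])
samples = np.vstack([c + rng.normal(0.0, [0.8, 0.25], size=(60, 2)) for c in centers])

# Standardize so that neither feature dominates the distance calculation.
scaled = StandardScaler().fit_transform(samples)

# We tell the method to expect three groups, as in the example above.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=73).fit(scaled)
labels = kmeans.labels_

print(np.bincount(labels))
```

The number of groups is the knowledge we bring about the setting; the method only decides which sample goes where.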

Now, there's a lot of methods we can use, methods that can impose prior information, expert knowledge about the setting, methods that can also weight the features based on importance. And we can also integrate complicated group geometries into it, too. Many methods can do that, too. So there's a lot of powerful ways to do this.

Now, of course, we can also do prediction with machine learning, and this is very powerful, too. Now, here's our very first machine learning model for prediction, a linear regression model. We've got density and porosity. And it's useful to be able to go between the two, to infer petrophysical properties directly from measures of the rock.

Now, we have our training data shown there, and we're able to make predictions at unobserved cases, just by using that line right there. That's our model. What's really fun is if you think about the fundamental definition of machine learning, it's hard to argue that linear regression is not machine learning. In fact, a really good quote from one of the professors I work with, Dr. Foster, is when you think of machine learning, just think of a glorified or enhanced version of linear regression, and you'll probably be on the right path.

I challenge my students every term to try to prove to me that linear regression is not machine learning. Nobody's been successful yet. Now, of course, linear regression is not very complicated, and it's very simple. We can go to much more advanced methodologies.
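A sketch of that first prediction model in scikit-learn, assuming synthetic density and porosity values with a negative linear trend. The coefficients used to generate the data are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(73)

# Hypothetical training data: porosity falls as density rises.
density = rng.uniform(1.9, 2.6, size=40)                       # g/cm^3
porosity = 60.0 - 20.0 * density + rng.normal(0.0, 1.0, 40)    # percent

# Train the model: this estimates the slope and intercept of the line.
model = LinearRegression().fit(density.reshape(-1, 1), porosity)

# Predict at an unobserved case.
predicted = model.predict(np.array([[2.2]]))

print(f"slope {model.coef_[0]:.1f}, intercept {model.intercept_:.1f}")
```

Training data, a fitted model, and a prediction at a new case: the same three steps apply to every method that follows.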

Isotonic regression is very cool. It's a piecewise linear regression method, but it allows you to capture certain physics of the problem, in this case, the monotonic relationship between porosity and density. You expect it to be negatively correlated. In other words, as one is high, the other should be low.

And we can capture that within this model in a very flexible manner. It's actually still parametrically a pretty simple model, not a lot of parameters to estimate. And so it's not a difficult model to work with sparse data, like we have right here.
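A sketch of a monotonic constraint using scikit-learn's `IsotonicRegression`, assuming synthetic data. Setting `increasing=False` encodes the expectation that porosity does not rise as density rises; the data and noise level are illustrative.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(73)

# Hypothetical sparse, noisy data with a decreasing trend.
density = np.sort(rng.uniform(1.9, 2.6, size=40))              # g/cm^3
porosity = 60.0 - 20.0 * density + rng.normal(0.0, 1.5, 40)    # percent

# increasing=False forces a non-increasing fit; clip handles values outside the data range.
iso = IsotonicRegression(increasing=False, out_of_bounds="clip").fit(density, porosity)

grid = np.linspace(1.9, 2.6, 50)
fitted = iso.predict(grid)
```

However noisy the samples are, the fitted curve never turns back upward.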

Now, of course, you can go to polynomial regression. The great thing about polynomial regression is that we have a much more flexible model. It's not linear any longer. But we retain the benefit of having just a few parameters to work with.

Now, there's all kinds of devilish details. If you watch my video on polynomial regression, I get into the idea of using orthogonal polynomial-basis expansion and so forth and why that would be beneficial. But we'll just leave it right here, that we have a very flexible method to work with.
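A sketch of polynomial regression as linear regression on expanded features, assuming synthetic data generated from a known quadratic. It uses the plain monomial basis from `PolynomialFeatures`, not the orthogonal polynomial basis mentioned above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(73)

# Hypothetical standardized predictor and a curved response: y = 1 + 2x - 3x^2 + noise.
x = rng.uniform(-1.0, 1.0, size=60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + rng.normal(0.0, 0.1, 60)

# Expand x into [x, x^2], then fit an ordinary linear model on those columns.
poly = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
poly.fit(x.reshape(-1, 1), y)

coefs = poly.named_steps["linearregression"].coef_
print(coefs)
```

The model is flexible in x but still has only three parameters to estimate: the intercept and the two coefficients.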

Ridge regression. You might have heard of this before-- not complicated at all. It's actually linear regression, where we add what's known as a shrinkage term or regularization term to the minimization, to the equation that we're solving in order to get the very best fit line. What does it mean? Well, what we're actually doing when we do regularization is if you look at that fit, it doesn't actually look great. It looks like the slope is too low, that it should actually have a steeper slope, and that's on purpose.

What regularization does-- it actually shrinks the parameters to go towards zero. And so it makes the slope be shallower. Now, in doing that, what it does is it reduces model variance. You remember the model variance, model bias trade-off? But it increases model bias. In many noisy data sets, we get a much better prediction when we go ahead and shrink and reduce that slope. That's ridge regression.
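A sketch of the shrinkage effect, assuming a small, noisy synthetic data set. The penalty weight `alpha=30.0` is an arbitrary illustrative choice; in practice it would be tuned.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(73)

# Hypothetical noisy data with a true slope of 2.
x = rng.normal(size=30)
y = 2.0 * x + rng.normal(0.0, 2.0, 30)
X = x.reshape(-1, 1)

# Ordinary least squares minimizes the sum of squared errors only.
ols = LinearRegression().fit(X, y)

# Ridge minimizes the sum of squared errors plus alpha times the squared slope.
ridge = Ridge(alpha=30.0).fit(X, y)

print(f"least-squares slope {ols.coef_[0]:.2f}, ridge slope {ridge.coef_[0]:.2f}")
```

The ridge slope is always the shallower of the two; whether that improves prediction depends on how noisy the data are.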

Now, what I'll do is I'll switch to an example that has three features. For the following machines, it actually is really cool to look at its behavior over two predictor features. We're going to work with standardized porosity and standardized brittleness. That could be any type of geomechanical property, that indicates anything about frackability or brittleness of the rock. And what we're plotting that against-- the color here is production rate of the individual wells, and this is the samples we're working with from an unconventional data set.

So we'll start with linear regression. When you look at that model, it should look very straightforward. It basically is just a plane in space. It's a linear model. If it's a higher order or a higher dimension model, I should say, you would see a hyperplane that we can project into different dimensions. Very intuitive, very few parameters.

Now, let's move to something more complicated for machines. We have k-nearest neighbor. And what k-nearest neighbor does, and what's really cool about it is you can think about it as basically attempting to do interpolation or mapping in the predictor feature space.

Now, us, as geoscientists and engineers working in subsurface problems, we're very used to the idea of trying to do interpolation or to try to do mapping. And so k-nearest neighbors is just trying to make a map. We're choosing a set of nearest sample data, or training data in this case, to make local predictions.

Now, what's very cool about that is we have a very intuitive hyperparameter. We'll understand this if we've done any type of interpolation. If you use more nearest neighbors, you get a smoother map. If you use fewer nearest neighbors, you get a rougher map, and you fit the data more specifically. What does that mean? We have a very intuitive hyperparameter, k, the number of nearest neighbors.
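A sketch of how k controls smoothness, assuming synthetic standardized porosity and brittleness with a made-up production response. Training error is used here only to show how closely each setting honors the data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(73)

# Hypothetical standardized predictors: column 0 porosity, column 1 brittleness.
X = rng.normal(size=(200, 2))
y = 2000.0 + 800.0 * X[:, 0] + 400.0 * X[:, 1] + rng.normal(0.0, 300.0, 200)

def training_mse(k):
    """Fit k-nearest neighbors and return the mean squared error at the training data."""
    knn = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    return float(np.mean((y - knn.predict(X)) ** 2))

rough_mse = training_mse(1)     # each sample is its own nearest neighbor
smooth_mse = training_mse(30)   # averages over 30 neighbors

print(f"k=1: {rough_mse:.0f}, k=30: {smooth_mse:.0f}")
```

With k = 1 the map reproduces every training sample exactly, noise included; a larger k gives a smoother map that no longer fits each sample.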

Decision tree. Decision trees are pretty simple, but they're used to build much more complicated models, like random forest. So let's start with the tree, so we can understand the forest.

So it's a hierarchical binary segmentation of the predictor feature space into blocks. So if you think about it, we just go through the feature space, and we just make splits-- 1 split, the next split, the next split, the next split. And by doing that, we break it up into a bunch of regions. And inside of each region, we predict with the average of the training data within the region.

Very intuitive model, it's very simple. It's actually non-parametric. But underneath the hood, there aren't actually that many coefficients required, so the number of parameters actually used in the model is pretty low. It's very intuitive.

The hyperparameter in this case is going to be a pruning of the tree. I love the terminology they use. You overgrow the tree, and then you prune it back down to get it to the level of complexity that does the best in testing, hyperparameter tuning.
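A sketch of tuning tree complexity against withheld testing data, assuming synthetic data with four true regions. It limits the number of leaves with `max_leaf_nodes` as a stand-in for pruning back an overgrown tree; scikit-learn also offers cost-complexity pruning through `ccp_alpha`, which is closer to the overgrow-then-prune description.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(73)

# Hypothetical data: production steps up across two thresholds, plus noise.
X = rng.uniform(-2.0, 2.0, size=(300, 2))
y = (np.where(X[:, 0] > 0.0, 3000.0, 1000.0)
     + np.where(X[:, 1] > 0.5, 1000.0, 0.0)
     + rng.normal(0.0, 300.0, 300))

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=73)

# Try a range of complexities and record the error on the withheld testing data.
test_error = {}
for leaves in [2, 4, 8, 16, 32, 64, 128]:
    tree = DecisionTreeRegressor(max_leaf_nodes=leaves, random_state=73).fit(X_train, y_train)
    test_error[leaves] = mean_squared_error(y_test, tree.predict(X_test))

best_leaves = min(test_error, key=test_error.get)
print(f"best number of leaves: {best_leaves}")
```

The selected complexity is whichever one does best in testing, not the one that fits the training data most closely.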

Random forest, what we do is we take a set of trees, and we put them together to get the very best estimate. It's an ensemble learner. We'll take a bunch of trees-- take the average of all of the trees' estimates, and we'll say that's the best. What that does, it reduces model variance through averaging, a very powerful concept.

Now, what's very cool about random forest is you enhance ensemble diversity through randomly subsetting the predictor features. In other words, every time you make a split, you don't get to use all of the features. You only use a random subset.

What's the message here? Diversity is strength. It turns out, by having diversity within all of the estimators, you get a better estimator. And random forest, in fact, competes with many of the leading machine learning methodologies. It's very powerful.
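A sketch comparing one fully grown tree with a random forest on the same noisy synthetic data. Setting `max_features=1` means each split considers only one randomly chosen feature of the two; the data and settings are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(73)

# Hypothetical standardized porosity and brittleness with a noisy production response.
X = rng.normal(size=(400, 2))
y = 2000.0 + 800.0 * X[:, 0] + 400.0 * X[:, 1] + rng.normal(0.0, 500.0, 400)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=73)

# One fully grown tree: low bias, high model variance.
single_tree = DecisionTreeRegressor(random_state=73).fit(X_train, y_train)

# Many trees, each split restricted to a random subset of features, then averaged.
forest = RandomForestRegressor(n_estimators=200, max_features=1, random_state=73)
forest.fit(X_train, y_train)

tree_error = mean_squared_error(y_test, single_tree.predict(X_test))
forest_error = mean_squared_error(y_test, forest.predict(X_test))

print(f"single tree: {tree_error:.0f}, random forest: {forest_error:.0f}")
```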

Now, gradient boosting, what do we do here? It's another ensemble learning method to reduce model variance, but it's actually very cool. What it does is you take a very weak model, a very weak learner, and you calculate the error from that first learner. Then you train another weak learner on the error. You fit the error. And then you have the second order error, the third order error, and you keep going.

Now, people who are maybe around my generation will remember, there was a razor blade commercial that said, the very first blade cuts very close, the second blade, even closer, the third, even-- that's exactly what gradient boosting does. It's going to the first model. It will try to take a cut at the estimate. The second will get even closer, the third, closer. And it turns out, by learning slowly, that we're able-- and using methodologies related to gradient descent optimization, we can actually get a very good estimate, very, very powerful methods.
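The fit-the-error loop can be written by hand in a few lines. This is a simplified sketch for squared-error loss, assuming synthetic data, single-split trees as the weak learners, and a learning rate of 0.1; library implementations such as scikit-learn's `GradientBoostingRegressor` add more machinery.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(73)

# Hypothetical standardized porosity and brittleness with a noisy production response.
X = rng.normal(size=(300, 2))
y = 2000.0 + 800.0 * X[:, 0] + 400.0 * X[:, 1] + rng.normal(0.0, 300.0, 300)

learning_rate = 0.1                       # learn slowly
prediction = np.full_like(y, y.mean())    # start from the simplest estimate, the mean
learners, train_mse = [], []

for _ in range(100):
    residual = y - prediction                                     # error of the current ensemble
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)   # weak learner fit to the error
    prediction = prediction + learning_rate * stump.predict(X)    # take a small step
    learners.append(stump)
    train_mse.append(float(np.mean((y - prediction) ** 2)))

print(f"training error after 1 learner: {train_mse[0]:.0f}, after 100: {train_mse[-1]:.0f}")
```

Each pass cuts a little closer. The training error here only ever goes down, which is why the number of learners has to be tuned against testing data.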

So let's talk about how are we going to use this in practice. Let me give some philosophy, some advice as far as using machine learning within energy. First of all, fit-for-purpose modeling. I've been involved in subsurface modeling for a long time. We knew this.

Right at the very beginning, before you even start making modeling choices, you've got to understand the goals of the model, and you've got to put that into the model. It'll affect all of the decisions you make if you're goal-focused. You may also consider future goals, though, because maybe you'll want to grow a little bit into the model. That's fine. But you've got to account for the resources. You're always resource-constrained-- time, people, expertise.

This is the old Venn diagram again, good, fast, cheap. You can't have good, fast, and cheap. It's not possible. You get to have fast and cheap, good and cheap, or fast and good, but you don't get all three.

Modeling for discomfort. I'm a big fan of Mark Bentley. If you see this video, Mark, hey, howdy. He was worried about this idea of modeling for comfort. He said models often become-- subsurface models often become tools for verification of decisions already partially or fully made. We're just proving to ourselves what we already think.

This is what's known as modeling for comfort. It makes us all feel good, but it's very dangerous. It's really a form of confirmation bias. Mark Bentley actually recommended that we model for discomfort. He wants to make us uncomfortable at work, which I think is great. We're stress testing our current concepts and our decision-making.

So in other words, when we do that, we're really testing the extreme cases for identifying and understanding the upside potential, and we're securing ourselves against the worst case. When I teach this in my courses, I talk about MythBusters, that show-- I don't know if you guys watch that. My kids watched it, so I watched it.

Whenever something didn't actually happen in the show, what would they do? They just put more TNT in. They just put more pressure in. They'd make it break. Go to break, I think we should do that. And in doing all of this, we really need to recognize our biases, as Mark Bentley reminded us. Thanks, Mark.

Now, we've got to remember, too, that our foundation and everything we do is in probability and statistics. If you use methodologies like Naive Bayes classification, it's, in fact, derived directly from Bayesian statistics. And here's the equations right here with the independence assumption. All of the methods have a statistical foundation. To understand the method, you have to understand the statistics.
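For reference, the standard form of that derivation is given below. The notation is an assumption chosen for illustration and may differ from the slide: $C_k$ is a category and $x_1, \ldots, x_m$ are the predictor features. Bayes' theorem gives the first line, and assuming the features are conditionally independent given the category gives the second.

```latex
P(C_k \mid x_1, \ldots, x_m)
= \frac{P(C_k)\, P(x_1, \ldots, x_m \mid C_k)}{P(x_1, \ldots, x_m)}

P(C_k \mid x_1, \ldots, x_m) \propto P(C_k) \prod_{i=1}^{m} P(x_i \mid C_k)
```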

Now, there are times you don't really need complicated machines. If you have enough samples, you can just work directly with conditional, joint and marginal probabilities to make predictions. Here's an example right here on the bottom right, where we have acoustic impedance versus porosity.

We have enough data. We could just calculate the conditional distributions and make predictions with that model. We don't need to build a machine to do that. But remember, machine learning is statistical learning.

Now, let me just give a couple of warnings right now. The concept of parsimony. Start simple: build the simplest model you can, and then build up from that. Models must be understandable and interpretable. Don't jump to complexity. In fact, you may even do worse, as we showed with the variance bias trade-off.

And in fact, I always challenge my students. If they build a complicated model, I always say, well, did you build a linear regression model first? When I did research and development within Chevron's energy technology company, I always had to demonstrate incremental value of every type of workflow I proposed or methodology we developed. It should be the same way. We've got to show that it performs better than a simpler tool, because we're going to lose interpretability, too.

Scientific method. Go back and just think about the rigor of the scientific method, the robust use of statistics, the highly critical approach that should be used. We should always be trying to disprove our models. We should always be trying to disprove.

And we should take inspiration from the trades. My twin brother and my grandfather were both machinists. And I'll tell you what, I respect very much the trades. Their knowledge of their tools, the tools that they carry in their toolkit-- in their tool box-- the tools that they use every day, is exceptional.

They don't wreck a lot of metal. They know exactly how the tool will perform in a wide variety of different circumstances. They know what's the best tool to use. We should have the same level of competencies when it comes to machine learning tools, when we use them in practice. We should be inspired by the trades.

Getting started. There's a lot of resources available to help you get started. This is a great time to get started. This is my talk, so I'm going to promote my resources. But I know, it's shameless. Many people have great resources out there, and you can find them.

All of my lectures are available. On the left-hand side are three example lectures from my three courses: spatial data analytics, data analytics and geostatistics, and machine learning. All of the examples are available on GitHub. So you can follow along, you can go through it. Everything in the course, you can work it out at home. Everything is there for you.

My advice to you, if you haven't done it already, go back home or do it at work if it's allowed-- download Anaconda, install it-- one-stop shop. You'll get Jupyter notebooks. You get the scikit-learn package. You get all the standard packages, NumPy, pandas. Everything will be there for you to get started. You can run the workflows that have my course coding.

The most powerful, flexible methods are in Python packages. Open source is really, really awesome in this area. So you'll need to know a little bit of coding if you want to really maximize your impact. Don't worry about it. It's not like in the dark ages when we were all doing Fortran and C++. I spent many years doing that myself. It's Python and R. And really, it's more about scripting and putting workflows together.

If you don't believe me, well, look at this example right here. So the first line of code is just simply importing the package for the decision tree from scikit-learn, which is awesome. I love scikit-learn. The next line of code is simply loading your data up. And you can actually load that data, too. It's on my GitHub account. It's just a simple, comma-delimited file. You can load it up in Excel. It's very simple.

The next line is building your decision tree. It's actually instantiating or making a decision tree. You're setting the hyperparameters, the degree of complexity in those parameters. The next step-- you're going to fit your data. I had porosity and brittleness data, and I'm trying to predict production of wells.

And the next line-- I'm just taking a combination of porosity and brittleness percentages, and I'm making a prediction at a new location with my model. That's it. That's it. I did machine learning in five lines of code. It's that simple.
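The slide itself is not reproduced in this transcript, so the following is only a sketch of what those steps might look like in scikit-learn. The column names `Por`, `Brittle`, and `Production`, the hyperparameter value, and the synthetic table standing in for the comma-delimited file are all assumptions; with the real file, the table would instead be loaded with `pandas.read_csv`.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor   # import the decision tree

# Stand-in for loading the data file: a synthetic table with hypothetical column names.
rng = np.random.default_rng(73)
data = pd.DataFrame({"Por": rng.uniform(5.0, 25.0, 200),
                     "Brittle": rng.uniform(10.0, 80.0, 200)})
data["Production"] = 200.0 * data["Por"] + 20.0 * data["Brittle"] + rng.normal(0.0, 300.0, 200)

# Instantiate the tree and set its hyperparameter (the complexity; value is illustrative).
tree = DecisionTreeRegressor(max_leaf_nodes=10, random_state=73)

# Fit the tree to the porosity and brittleness data to predict production.
tree = tree.fit(data[["Por", "Brittle"]], data["Production"])

# Predict production at a new location from its porosity and brittleness.
new_well = pd.DataFrame({"Por": [15.0], "Brittle": [40.0]})
estimate = tree.predict(new_well)

print(estimate)
```

Import, load, instantiate, fit, predict: the pattern is the same for nearly every scikit-learn model, which is what makes swapping methods so easy.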

Now, just in case you want to get started-- if you're like really excited and you want to start right now, go to that link right there. And the last five machines that I showed you, from linear regression all the way to gradient boosting, there's a very simple workflow in Jupyter notebooks-- actually in the top right there, that's the header of the Jupyter notebook. It's got instructions. It's pretty well documented.

And you can go ahead and work through the examples that I just showed you, play around with the hyperparameters, change them up, overtrain the model, undertrain or underfit or overfit the model. Have a little fun with it, it can be a lot of fun. The best way to learn machine learning is just like being a machinist-- is to actually use the tools and get practice with the tools.

Just in case you need a little bit more motivation to start coding, here's a couple of points. I like to talk to my students about my top reasons to learn to code. First reason, transparency. No compiler, no computer accepts hand waving. You don't get to hand wave. Coding forces you to make your logic bare. You have to actually show people exactly what you mean.

Reproducibility. Run it, get an answer, hand it over to somebody else. They run it. They get the same answer. This is the main principle of the scientific method. Quantification. Programs need numbers. And so you've got to feed the program. You'll discover new ways to look at the world through quantification. I think that's powerful.

Open-source. As I said before, the very best methods for machine learning are in open source. You can leverage the world of brilliance. I needed to build a subsurface trend model, and the best code I could find to get the job done was Astropy. It's an astronomy package that's being used-- or astrophysics-type package that was being used for all kinds of mapping in space, literally space. And I started to use it for mapping in the subsurface. It was perfect. It worked great. And I didn't have to code it myself.

Deployment. If you have a great idea, you build a workflow and you code it up, you can share it with others. And you can multiply your impact. You may be concerned about your performance metric at the end of the year, or you may just be a super nice person who wants to help people out. It doesn't matter. It's both going to happen. You're going to look great while helping others.

Let me talk about just a couple of technologies or examples that are more advanced machine learning, so again, the idea of what can be done with machine learning. This is the work of one of my PhD students, Wendy Liu. And what she was doing was developing a brand new methodology for spatial data analytics, for spatial anomaly detection.

Now, it turns out that if you have a data set-- I show here porosity, but it could be production. It could be a dense data set with many wells. You may have one well that tends to be a little bit high, and you wonder, is that anomalous? Is that unusual, that something different happened here, or is this something we would expect?

Now, what she's done is she developed a methodology whereby we can build maps. And the map on the bottom right, the purple areas are locations where you have well-to-well transitions that are unusual or have a low probability, therefore, likely anomalous. You can detect discontinuities in your data set spatially. You can also detect when you have an unusual well, maybe something went wrong with the completion, or maybe something went right. What is the difference that makes a difference?

Here's another example. Honggeun Jo, another one of my PhD students, is working on building subsurface models using machine learning and, specifically, on the challenge of precise conditioning to wells. Now, this is very important, because we need to match the wells at the well locations. Remember how expensive that well data is.

The methodology uses semantic inpainting. In fact, it's the same technology that's used for image restoration when you have a rip or a tear in an image-- you know, when the pictures of grandma and granddad have worn out, and you want to fix them using the machine. That's exactly what we use here.

What we do is we build models that approximately honor the well data. We mask or remove around the wells. And then what we're able to do is replace around the wells, such that we match the conceptual information-- the model around the mask-- and the perceptual information, the model elsewhere, so that we match the well data, but we don't mess up the concepts. This example is deepwater lobes, compensation with lobes. It works quite well.
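The mask-and-replace step can be illustrated with a very small sketch. This is not semantic inpainting, which fills the masked region with a trained network. It is a distance-weighted stand-in that shows the two requirements: match the well exactly, and blend back into the untouched model at the edge of the mask. The function name, the grid, the well value, and the radius are hypothetical.

```python
import math

def condition_to_wells(model, wells, radius):
    """Return a copy of `model` (a 2D list) that matches each well exactly.

    wells: list of (row, col, value). Within `radius` cells of a well the
    model is blended toward the well value; outside it is left untouched.
    """
    out = [row[:] for row in model]
    for wr, wc, value in wells:
        for r in range(len(model)):
            for c in range(len(model[0])):
                d = math.hypot(r - wr, c - wc)
                if d < radius:
                    w = 1.0 - d / radius  # 1 at the well, 0 at the mask edge
                    out[r][c] = w * value + (1.0 - w) * out[r][c]
    return out

# Hypothetical 7 x 7 porosity model that misses the well value at the centre.
model = [[0.10] * 7 for _ in range(7)]
wells = [(3, 3, 0.25)]
conditioned = condition_to_wells(model, wells, radius=3)
```

At the well cell the weight is 1, so the well value is reproduced exactly, and the weight falls to 0 at the mask edge, so the surrounding model is preserved.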

Seismic downscaling. Another one of my students, Wen Pan, is working on this. This is the idea of taking a machine-- it's a pix2pix method, a variant of convolutional neural nets-- where we're able to train it on a wide variety of high resolution models.

Now, that model in the center bottom is a truth model with high-resolution architecture, maybe like a meandering fluvial channel belt system. We upscale it to seismic-- the image shown above-- and we train it, so that the machine can tell the difference. It can map between the seismic and the high resolution architecture.

Once we've done that, we can give the machine a variety of seismic images. And it can predict what the architecture might look like within the seismic. That's seismic downscaling. We can get many, many models very fast. They're actually very good. The model on the top right actually looks really good. It's able to actually put the architecture in quite well and honor the wells at the same time.

Now, I'd like to be kind of transparent about this. The image on the bottom right hand side doesn't look so great, and the reason being, the seismic was ambiguous. It's kind of more of a blob. You don't have the nice arcuate shapes. It didn't know where to put the channels. So this is really interesting, the seismic downscaling.
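The training pairs described above, a high resolution truth model and its upscaled counterpart, can be sketched as follows. Block averaging is used here only as a crude stand-in for upscaling to seismic resolution; a real workflow would use a proper seismic forward model. The grid values, the function name, and the factor are hypothetical.

```python
def block_average(fine, factor):
    """Upscale a 2D grid by averaging non-overlapping factor x factor blocks."""
    n_rows, n_cols = len(fine), len(fine[0])
    if n_rows % factor or n_cols % factor:
        raise ValueError("grid dimensions must be divisible by factor")
    coarse = []
    for r0 in range(0, n_rows, factor):
        row = []
        for c0 in range(0, n_cols, factor):
            block = [fine[r][c]
                     for r in range(r0, r0 + factor)
                     for c in range(c0, c0 + factor)]
            row.append(sum(block) / len(block))
        coarse.append(row)
    return coarse

# Hypothetical 4 x 8 high-resolution facies model: 1 = channel sand, 0 = mud.
fine = [
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
]
coarse = block_average(fine, factor=2)

# One training example: the network sees `coarse` and learns to predict `fine`.
training_pair = (coarse, fine)
```

Many such pairs, built from a wide variety of truth models, are what the network would be trained on. The coarse grid keeps the sand proportion of each block but loses the channel shapes, which is the ambiguity the network has to resolve.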

Now, we're going beyond that. We're actually building reservoir models completely using machines. And so here are examples, shown on the right hand side, of a variety of realizations and models built by one of my PhD students, Wen Pan.

And so this is incredible, because these models are much more complicated than the models that we've been building up till now. So they can honor much more of the geologic, geophysical, and engineering information. We get better models by using machines than our traditional geostatistical methodologies, and we get a fast methodology to explore over uncertainty spaces.

Here's another example-- just two more examples-- production forecasting. And so what we've done here-- one of my PhD students, who has graduated and is now working in the industry, worked on this idea of making predictions or forecasts for production over time. And so what we did was we trained the model, which was a long short-term memory network, with 2,500 days of production and injection information. There were nine injectors.

So if you look right here on the left hand side, we have nine injectors and a complicated injection history, a lot of cycling of the injection. Then what we did, after we trained the system, was forecast 1,000 days into the future. And what was really cool is that this system learned those interactions between the injectors and the producers, which were complicated, to the point where it was able to make very good predictions of flow into the future.

What's kind of spooky about this is if you look really carefully at the image, the red well, injector number four, in fact, has injection behavior during the training stage that's totally different than during the testing stage. And we're still able to make good predictions. It learned the system very well.
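The data preparation behind a forecast like this can be sketched as below. It shows only how the 2,500 training days and the 1,000 forecast days might be cut into input windows for a recurrent network; it does not include the network itself. The 30-day lookback, the synthetic injection cycling, and the made-up production response are assumptions, not details from the study.

```python
def make_windows(inputs, target, lookback):
    """Pair each window of `lookback` consecutive input rows with the
    target value on the day right after the window."""
    X, y = [], []
    for t in range(lookback, len(inputs)):
        X.append(inputs[t - lookback:t])
        y.append(target[t])
    return X, y

N_TRAIN, N_TEST, N_INJECTORS = 2500, 1000, 9
LOOKBACK = 30  # assumed window length, not from the talk

n_days = N_TRAIN + N_TEST
# Synthetic stand-in data: each injector cycles on and off with its own
# period, and production is a made-up function of total injection.
injection = [[100.0 if (day // (20 + 5 * k)) % 2 == 0 else 0.0
              for k in range(N_INJECTORS)]
             for day in range(n_days)]
production = [0.1 * sum(rates) for rates in injection]

# Training windows use only the first 2,500 days.
X_train, y_train = make_windows(injection[:N_TRAIN],
                                production[:N_TRAIN], LOOKBACK)

# Test windows may look back into the training period for inputs,
# but every target falls in the 1,000-day forecast period.
X_test, y_test = make_windows(injection[N_TRAIN - LOOKBACK:],
                              production[N_TRAIN - LOOKBACK:], LOOKBACK)
```

Keeping every test target strictly after day 2,500 is what makes the comparison a forecast rather than a fit to data the model has already seen.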

Finally, multi-scale flow proxy models. This is very exciting. One of my other students, Javier Santos, has developed a model that's doing very well at making accurate predictions of flow velocity between the grains in rock. So we're taking very small scale models, like these models-- this model right here is 2 centimeters by 2 centimeters by 2 centimeters. And we've imaged the grains, and we're able to actually make good predictions of what the flow velocities would be under certain pressures and fluids.

This is really, really cool. Because if we did this using Lattice Boltzmann or more complete physics-based calculations, you'd find that it would take a long time to do these calculations. These can be done in well under a second, very quickly. Now, by doing this, by making very quick calculations of small-scale flow, we hope to be able to work out problems around multi-scale permeability, moving from the very fine scale to the more production-relevant scales.

Let me make a couple concluding remarks. First of all, data analytics and machine learning methods provide new tools. It's a fourth paradigm of scientific discovery for all of us to use and add value. How are we going to add value? Efficiency and automation.

What I like to say is, geoscientists will do more geoscience. We'll be able to automate a lot of the more mundane parts of our jobs and be able to focus more on the more scientific. And in fact, we can do things such as detecting anomalies.

Now, that's great because we work in very large data sets. And so we'll be able to focus on what matters, find the locations that matter most and focus our professional time finding new patterns. We can use these methodologies to see our data in totally different ways and to pose new scientific questions, go back to our geoscience and engineering concepts and try to work it out. It's a very nice balancing between the two.

Assisted interpretation. We've been doing that for a while. It helps us a lot. I'm surrounded by people-- I've seen this a lot-- who have actually injured themselves through repetitive stress injuries. It's happened a lot through all that clicking and interpretation. Well, we'll have machines to support and to help us to guide our interpretation even further, so we can spend more time focused on the geoscience questions.

Now, new, improved models. And I hope I've shown a couple examples right there. I'm excited, because these models actually better integrate our expert knowledge and also provide us with real-time feedback. They're very, very fast. So as we're building our models, we'll get feedback.

I like to teach my students, it's like TurboTax. When you do TurboTax at the beginning of the year, you put in your information from a W-2 or W-4-- I forgot now. But basically, what happens is it will immediately tell you how much you're getting back. And you know what happens next. You start answering questions, and all of it disappears. No, I'm kidding. I'm kidding. But it starts to change.

That's how I would like to see subsurface modeling with machines. We make decisions, like fault transmissibility or maybe we make decisions such as fault throws or offsets or structural shape or all of the interpretations. And we immediately see how that will impact-- or maybe it doesn't impact. Maybe we don't want to focus there.

Geoscience and engineering expertise remains core to our business. We'll have augmented abilities and capabilities with the new digital technologies. Before, people used to say, those with the best data win. What I think we say now is, those who have the best data and use the data best win.

So I'd like to acknowledge now the AAPG and the AAPG Foundation, and all of the host organizations-- for when I get on the road and get to tour around a little bit, I'm excited to do that-- for this great opportunity to share this message around machine learning and data analytics. Thank you very much.

Distinguished Lecturer

Michael J. Pyrcz

The University of Texas at Austin

See Also ...


  • Data Analytics and Machine Learning for Energy Geoscience and Engineering
    Every energy company that I visit is interested in growing internal capabilities to add value with data analytics and machine learning. Energy has a long history of working with large, complicated geoscience and engineering datasets and there is a growing toolbox of old and new emerging data-driven methods available that may offer improved efficiency and potentially new insights from vast and complicated subsurface datasets. This talk is an opportunity to link subsurface data analytics and machine learning to fundamental concepts from probability, statistics, geoscience and engineering and to provide an enthusiastic, but at times critical perspective on what we may expect in the data-driven science revolution.
    https://www.aapg.org/career/training/in-person/distinguished-lecturer/abstract/articleid/55547/data-analytics-and-machine-learning-for-energy-geoscience-and-engineering

