Working with Imbalanced Data in Machine Learning Algorithms

Here are some ways to bridge the gap between real and optimal in seismic interpretation.

While scientists and the broader workforce are becoming more at ease with the term "machine learning" and less reluctant to use these methods, there are still many uncertainties about how to use them correctly and how to interpret their output. There is therefore a need to explain what might be happening inside the so-called "black boxes" and the potential causes of high error rates and misclassifications. If not well understood, these issues can lead to incorrect interpretations and economic losses.

We would dare to say that 80 percent of any machine learning method, especially an unsupervised one, relies on input data preparation: the principle of "garbage in, garbage out" applies. We therefore need to understand not only what kind of data we have, but also how it can be optimized. The amount of data, their relationships and their quality are just a few of the aspects to consider. The first of these is our focus here, since disparities in the data (imbalance) can result in significant errors.

The term "imbalance" in data refers to the difference in proportions between classes. The class with the majority of records or instances is called the "negative class," and the underrepresented or minority class is called the "positive class."

Most of the geological settings we study are indeed imbalanced. Imagine a deepwater setting in which channel complexes are surrounded by a much greater proportion of shale, or a setting dominated by salt tectonics in which the salt bodies represent the negative class. Since a perfect dataset would exist only in a utopian world, we need to understand our data and how to optimize results without biasing or overfitting our models.

To understand the impact of class imbalance we need to understand, first, the degree of imbalance between classes and, second, the overlap between classes. The degree of imbalance, also known as the imbalance ratio (IR), is the ratio of the number of negative-class examples to the number of positive-class examples. Figure 1 shows an example in which shale facies represent the majority of a training dataset used to predict deepwater channel facies. Here, five facies, including mass-transport deposits (MTDs), are used as labels. If we compare the background "shale" with any of the other facies, we find it is two to three times as abundant, creating an imbalanced dataset.
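
As a minimal sketch, the imbalance ratio can be computed directly from the class counts of a labeled training set. The facies names and counts below are illustrative placeholders, not the actual proportions in figure 1:

```python
import numpy as np

# Hypothetical facies labels for a training set; counts are placeholders.
labels = np.array(
    ["shale"] * 600 + ["MTD"] * 150 +
    ["axis"] * 100 + ["off-axis"] * 80 + ["margin"] * 70
)

classes, counts = np.unique(labels, return_counts=True)
print(dict(zip(classes, counts.tolist())))

# Degree of imbalance (IR): size of the majority (negative) class divided by
# the size of the minority (positive) class.
ir = counts.max() / counts.min()
print(f"imbalance ratio (IR) = {ir:.1f}")  # 600 / 70 ≈ 8.6
```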

However, when we sum up all the channel facies and the MTDs, together they are roughly equal in proportion to the background shale, suggesting that we should simplify our labels.
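
A sketch of that simplification, using placeholder facies names: the three channel sub-facies are collapsed into a single "channel" class so that the merged classes are closer in proportion to the background shale:

```python
# Collapse the three channel sub-facies (axis, off-axis, margin) into one
# "channel" class, leaving shale and MTD unchanged. Facies names are
# placeholders for whatever labels your training data uses.
merge_map = {
    "shale": "shale",
    "MTD": "MTD",
    "axis": "channel",
    "off-axis": "channel",
    "margin": "channel",
}

facies = ["shale", "axis", "off-axis", "shale", "MTD", "margin"]
simplified = [merge_map[f] for f in facies]
print(simplified)  # ['shale', 'channel', 'channel', 'shale', 'MTD', 'channel']
```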

The type of machine learning technique we intend to use should also be considered. Supervised methods come with error metrics, because both the input and the expected output are known. In unsupervised ML techniques, however, only the input is known, and how to select the right number of clusters is still under debate. Below we provide general guidance, based on our experience, on what to do with each type of method in seismic facies interpretation.

Unsupervised Methods and Imbalanced Data

When using clustering algorithms, such as K-means clustering, self-organizing maps (SOM) and generative topographic mapping (GTM), it is advisable to use seismic attributes suited to the geological target. Analyze their statistical relationships and, if necessary, refine them with a dimensionality-reduction or feature-importance technique (for example, PCA, ICA or SHAP values). Additionally, consider using a method such as an elbow plot to determine the optimal number of clusters. An elbow plot visualizes the relationship between the number of clusters and the explained variance, helping to pinpoint where additional clusters yield diminishing returns in explaining the data's structure. An example is shown in figure 2, which depicts the results and error for an ML technique (GTM) with three clusters where five were expected. Notice how the error is reduced by using the optimal cluster number determined from an elbow plot. This suggests that, although we intend to depict five facies, the data patterns form only three major clusters.
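
A minimal elbow-plot sketch, using scikit-learn's K-means on a stand-in attribute matrix (the random data here is purely illustrative; substitute your own seismic-attribute table):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Stand-in attribute matrix: one row per seismic sample, one column per
# attribute. Replace the random data with your own attribute table.
rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(2000, 4)))

# Fit K-means for a range of cluster counts and record the inertia
# (within-cluster sum of squares) used in the elbow plot.
ks = range(1, 11)
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks
]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters")
plt.ylabel("within-cluster sum of squares (inertia)")
plt.title("Elbow plot")
plt.show()
```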

Interpreting these clusters visually, we can see that one corresponds to the majority shale facies, another represents the MTDs, and the third represents the channel facies (axis, off-axis and margin) combined. It is important to remember that, while we aim to identify all the discrete facies, the properties of the seismic data (frequency, noise, resolution, etc.) and the type of seismic attributes used, as well as their parameterizations, play a fundamental role in whether different seismic facies can be resolved in detail.

Supervised Methods and Imbalanced Datasets

Supervised ML needs labeled input (labels plus features) to train the model. Because the expected facies are known up front, numerical error estimation and performance evaluation become possible.

Figure 2 displays the distribution of the training data. Comparing the seismic attributes used as features makes it apparent that shale overlaps with most of the channel facies, and this overlap will cause those facies to be misclassified as shale. One potential solution is to explore other features that create greater distinction between the classes; another is to resample the data (a resampling sketch follows the list below).

Some of the methods we can employ to address imbalanced datasets include:

  • The use of simple algorithms such as DBSCAN or K-means clustering with realistic labels or a minimum number of classes: Another option is the use of boosting algorithms (such as AdaBoost; note that random forests are a bagging, not a boosting, method), which assign different weights to the training distribution in each iteration. After each iteration, boosting increases the weights associated with the incorrectly classified examples and decreases the weights associated with the correctly classified examples.
  • Oversampling or undersampling: oversampling appends additional minority-class examples to the original dataset, while undersampling removes data from the original dataset, typically from the majority (negative) class, until the classes reach the same proportion or balance (see the sketch after this list). However, these methods can introduce their own problematic consequences, such as overfitting to duplicated minority examples or the loss of informative majority examples, which can potentially hinder learning.
  • The use of the correct statistical metrics: A confusion matrix is a popular tool for understanding and evaluating classification problems. It compares actual versus predicted values: the main diagonal shows the samples that were classified correctly, while the off-diagonal entries show where, and to what extent, misclassifications occur. Although accuracy is the metric normally read from a confusion matrix, it places more weight on common classes than on rare ones, so a classifier can look good on an imbalanced dataset while performing poorly on the rare classes. In such cases it is recommended to use the F score, a weighted harmonic mean of precision and recall. There is also a G score, which uses a geometric mean instead of a harmonic mean.
  • Receiver operating characteristic (ROC) curve and area under the curve (AUC): The ROC curve is used to analyze classifier performance by plotting the false-positive rate on the x-axis against the true-positive rate on the y-axis; the closer the curve is to the upper-left corner, the better the classifier. Calculating the AUC reduces the curve to a single score: an AUC close to 1 indicates a good classifier with a top-left ROC curve, while an AUC at or below 0.5 indicates a classifier no better than random guessing. It is important to note that these measures are designed for binary problems, not the multiclass scenarios common in seismic facies work. In multiclass settings we may need to evaluate ROC or AUC per class, and the measure becomes sensitive to class skew because the negative class is then a combination of the other N-1 classes.
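
The sketch below pulls several of these ideas together on a synthetic two-class problem (a majority "shale" class versus a minority "channel" class): naive random oversampling applied to the training split only, followed by evaluation with a confusion matrix, F1 score and ROC AUC rather than raw accuracy. The data, the class proportions and the choice of a random forest classifier are illustrative assumptions, not the article's specific workflow:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced two-class problem: 0 = majority (e.g., shale),
# 1 = minority (e.g., channel). Replace with real attribute features.
rng = np.random.default_rng(1)
X0 = rng.normal(0.0, 1.0, size=(900, 3))   # negative (majority) class
X1 = rng.normal(1.5, 1.0, size=(100, 3))   # positive (minority) class
X = np.vstack([X0, X1])
y = np.array([0] * 900 + [1] * 100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Naive random oversampling: duplicate minority-class rows in the training
# split until both classes match in size (never resample the test set).
idx_min = np.where(y_tr == 1)[0]
extra = rng.choice(idx_min, size=(y_tr == 0).sum() - idx_min.size, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
y_hat = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]

# Class-aware evaluation: confusion matrix, F1 (harmonic mean of precision
# and recall) and ROC AUC, instead of raw accuracy.
print(confusion_matrix(y_te, y_hat))
print("F1 :", f1_score(y_te, y_hat))
print("AUC:", roc_auc_score(y_te, y_prob))
```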

Conclusions

To summarize the recommendations provided here, we have created an easy-to-follow workflow (figure 3) that may help interpreters who are new to dealing with imbalanced data.

No dataset will ever be perfect, so a very accurate model should arouse suspicion: it probably reflects bias or overfitting. Nor will every imbalanced dataset necessarily yield poor training data. As geoscientists, we need to use tools such as seismic attributes and ML techniques intelligently, be aware of our data's limitations, apply best practices, including parameter optimization, and, most importantly, recognize that in geology nothing is perfect. We still have to bring our understanding of the geological context to bear and draw the resulting inferences ourselves. We believe that machines cannot fully replace us, at least not yet.

(Editor's Note: The Geophysical Corner is a regular column in the EXPLORER, edited by Satinder Chopra, founder and president of SamiGeo, Calgary, Canada, and a past AAPG-SEG Joint Distinguished Lecturer.)
