26 February, 2020

Excel-Based Random Forest Machine Learning Algorithms: Programming and Application

Part of AAPG's GeoAnalytics Credentialing Program

Software engineering is becoming a key skillset in the oil and gas industry. These skills are used to implement data science projects that increase production, lower costs, and improve safety. To boost the value of oil and gas practitioners, this course teaches software development in VBA for machine learning and the creation of a random forest tool in Excel.

Web-Based Participation

This course will be available for online participation only.

Software engineering is becoming a key skillset in the oil and gas industry. These skills are used to implement data science projects that increase production, lower costs, and improve safety across the industry. As a result, employment rates and salaries are significantly higher for those with industry-focused backgrounds who also have software engineering skills. Not only is it important to be able to build and utilize software tools directly, but it is also important to recognize and define opportunities to add value with software in the incumbent workflow. For this reason, those with an understanding of data science methods and the capacity to implement them have superior leadership qualifications and are increasingly likely to control teams and budgets.

Hear from the instructor and get a sneak peek into the course.

The random forest is a powerful and widely applicable machine learning method capable of adding tremendous value in the workplace through both regression and classification. Regression involves using previous samples to predict variables such as well performance, source rock maturity, non-productive time, drill bit fracture, clay swelling, the swab-off effect, likelihood of human accidents, commodity prices, etc. Classification involves assigning a new sample to the correct category, such as tagging reports, rating reserves, deciding between development plans, dividing work among teams, choosing from a list of tools, etc. Random forests are computationally intensive but well within the capacity of common modern computers. Even without any customization at all, a blank slate random forest is easily capable of yielding accuracies in the top 20% of some of the most competitive global machine learning competitions from as recently as 2012. Random forests can also derive strong accuracy without overfitting on small datasets, of which there are many in the oil and gas industry. Using random forests, industry practitioners will be able to drive more accurate data informed decision making across their firms.

There are many benefits to the industry in working with Microsoft Excel. While machine learning efforts on large datasets are often undertaken using computer languages such as Python and R, most of the industry’s data is already stored and manipulated in Microsoft Excel spreadsheets. Today’s powerful processors in standard work computers allow even computationally intensive machine learning methods to be implemented in Excel within tolerable timeframes. Practitioners used to working in Excel can also avoid the sharp learning curve that new computer languages impose and can avoid the security limitations of using new software in the firm. Finally, Excel allows the user to visualize all the data stored in even the intermediate steps of machine learning implementation, increasing transparency and credibility of results. Practitioners can demonstrate exactly how they have arrived at their recommendations from a random forest in Excel.

Despite Excel’s status as the most popular data storage and data manipulation tool in the industry, many practitioners have not been taught how to use the built-in scripting language Visual Basic (VBA) to develop software tools in Excel. In addition, despite Excel’s capabilities, many of the more powerful data science methods have yet to be adapted to Excel. To boost the value of oil and gas practitioners in the industry, this course teaches software development in VBA for machine learning and the creation of a random forest tool in Excel.

Module 1: Practical Software Development Using Visual Basic for Data Science in Excel

This module is an in-depth exploration of coding and data science techniques available in the Visual Basic (VBA) scripting language in Microsoft Excel. The course will teach attendees to build tools which can programmatically analyze and manipulate dynamic data samples. Attendees will learn to perform tasks such as finding and correcting errors in data, conditionally sorting data into categories, automating calculations and formatting for single-button report generation, and even implementation of more advanced data science techniques.

Beginning with an overview of the contemporary utility of Excel, the module will then progress to cover the additional value VBA adds to Excel and how it is set up and used in conjunction with Excel. The next section touches on best practices in setting up data in Excel for use of data science to extract value. This section also covers the challenges presented by conducting data science projects in Excel and how VBA can help to solve them. From there, the module moves to a deep dive into VBA, covering how to reference the Excel worksheet, how to arrange software for maximal effectiveness and for ease of debugging, and the specific scripting techniques available in VBA with their variety of uses. The module includes regular individual and group exercises as well as introductions of valuable free online tools to help troubleshoot implementations of VBA in Excel. This module is designed for attendees who are already proficient Excel users, but the material is appropriate for attendees who are new to coding in general or new to VBA in particular. Proficient Excel users are those who are familiar with creating, saving, and accessing Excel files; with manipulating rows and columns; as well as with using formulas in cells and across tabs. Examples throughout the module are taken from the oil and gas industry.

Module 2: Random Forest Machine Learning Implementation in Visual Basic for Excel

Module 2 walks attendees through using the techniques from Module 1 in VBA and Excel to create a random forest tool, which is a powerful and flexible machine learning algorithm typically unavailable in Excel. Random forests are capable of working even with smaller data sets to achieve accurate results without overfitting the data.

The module begins with a brief history of machine learning and the general reasons behind its relevance and utility today. After comparative advantages and drawbacks of different machine learning approaches are explored, the module moves to a deep dive into random forest. This technique is covered thoroughly in interactive sections to develop an intuitive understanding of the appropriate use cases, the overall purpose of the method, and each of the component parts. Following a brief section on practical considerations for random forests in Excel, the module then covers the pseudo code for each component in visual basic. Participants engage in activities to create and connect these components from their pseudocode, and results are compared to a master random forest file shared with each attendee. The module then progresses through the entire back end of the tool with intermediary outputs into Excel to help build visual understanding. After another round of exercises putting the tool to use on data sets, the module concludes with a recap of the tool, resources for continued development, and suggested projects for using VBA in Excel as well as the random forest tool in the workplace. Examples and datasets used throughout are taken from the oil and gas industry.

The modules are recommended to be taken together in series, but each can be taken independently depending on the goals of the attendee. The first module covers general methods for scripting in VBA, so it is still useful on its own to attendees planning on building small scale software to manipulate and learn from data in Excel even if they have no other interest in random forests or in practicing larger projects in a course setting. Module 2 involves putting together the various building blocks covered in Module 1, however, so it is still good practice even for those uninterested in random forests. Likewise, the examples and context in Module 1 will prepare attendees for specific challenges presented by the random forest in Module 2. For this reason, it is recommended that even attendees already familiar with scripting in VBA take both modules.

Module 1: Practical Software Development Using Visual Basic for Data Science in Excel

Microsoft Excel refresher
1. What is Excel used for
2. What is Excel best at
Intro to Visual Basic (VBA) for Excel
1. When to write scripts instead of using formulas
2. How to set up and run VBA
Data science in Excel
1. Data engineering
2. Data clean and prep
3. Methods for insight extraction from data
How to implement data science projects in VBA
1. Worksheet as backend
2. Functions and subfunctions
3. Data types
4. Loops and conditional statements
5. Debugging best practices
6. DIY learning tools
Curated practice projects

Module 2: Random Forest Machine Learning Implementation in Visual Basic for Excel

Machine learning introduction
1. History and contemporary relevance of machine learning
2. Case studies for effective use of machine learning in the industry
3. Implementation methods and their drawbacks
Random forest deep dive
1. What are random forests used for in machine learning
2. How random forests work
3. Example use cases and results
How to set up a random forest in Excel using VBA
1. Practicalities of random forest in Excel
2. Intuition and pseudocode
3. Random forest implementation walkthrough
Curated practice projects
Conclusion
1. Recap of the tool you now have and how you can use it
2. Resources for continued development
3. Suggested projects using VBA and using random forests