Excel-Based Random Forest Machine Learning Algorithms: Programming and Application

14-15 April 2020

Texas, United States

Overview
Description
Pricing
Venue

Who Should Attend

Geoscientists, engineers, and data scientists who want a powerful tool that works with data in a familiar spreadsheet format
Individuals who need to develop models, simulations, and scenarios that can quickly and effectively accommodate rapid change, uncertainty and complex interactions
Professionals who need to develop models to manage risk and to support data-driven decision-making
Individuals involved in supply chain, transportation, and water / emissions management
Industry leaders seeking clarity on the do’s and don’ts of machine learning
Consultants seeking to boost their capacities and provide high value to clients
Students and members in transition seeking to build their resumes

Objectives

This course will teach attendees to:

Build and use a powerful tool capable of making valuable predictions and classifications from data in the workplace
Continue to adapt the tool to add new functionality or customization
Develop and deploy basic software in Microsoft Excel using Visual Basic (VBA)
Recognize opportunities for new data science software to add value in the firm
Collect and organize data for use in data science projects in Excel
Identify free online tools to continue their learning outside of the course
Determine when machine learning should be used on a problem
Describe the random forest method and component steps in rigorous detail
Explain random forest’s place in the world of machine learning
Create effective pseudocode and commenting in larger-scale software
Judge the effort and time requirements of software and data science projects

Why This Course Stands Out

You keep the tool: The random forest tool that is developed in the class and its master copy from the instructor are production-ready and will be given to all attendees.
The course is built for you: There are many courses teaching software engineering and data science today. Rather than dilute the lessons in order to teach to a general audience, this course has been built for an oil and gas attendee skillset and use cases. All examples and exercises will reflect common workplace problems faced by attendees.
Move quicker in Excel: Learning to deploy random forest can take several weeks of training in graduate data science programs, but this course makes use of your existing knowledge of Excel to accelerate the learning. You can also skip all the parts about learning to create a robust back end and just rely on Excel spreadsheets for data storage.
Random forests add considerable value: Random forests are widely applicable, do not require extremely large data sets, can be used for both regression and categorization, and are typically capable of yielding high accuracy results without extensive tuning. They are generally not available in Excel, but this workshop makes them so.
Understand every step: Rather than teach advanced data science methodologies using off-the-shelf tools that do not allow visualization of intermediate steps, this course deliberately pushes all intermediate output into Excel, making it easy to see, understand, trust, and adapt everything created in the course for deployment on the job.

Web-Based Participation

This course will be available for online participation only.

Course Content

Web-Based Participation

This course will be available for online participation only.

Software engineering is becoming a key skillset in the oil and gas industry. These skills are used to implement data science projects that increase production, lower costs, and improve safety across the industry. As a result, employment rates and salaries are significantly higher for those with industry-focused backgrounds who also have software engineering skills. Not only is it important to be able to build and utilize software tools directly, but it is also important to recognize and define opportunities to add value with software in the incumbent workflow. For this reason, those with an understanding of data science methods and the capacity to implement them have superior leadership qualifications and are increasingly likely to control teams and budgets.

Hear from the instructor and get a sneak peek into the course.

The random forest is a powerful and widely applicable machine learning method capable of adding tremendous value in the workplace through both regression and classification. Regression involves using previous samples to predict variables such as well performance, source rock maturity, non-productive time, drill bit fracture, clay swelling, the swab-off effect, likelihood of human accidents, commodity prices, etc. Classification involves assigning a new sample to the correct category, such as tagging reports, rating reserves, deciding between development plans, dividing work among teams, choosing from a list of tools, etc. Random forests are computationally intensive but well within the capacity of common modern computers. Even without any customization at all, a blank slate random forest is easily capable of yielding accuracies in the top 20% of some of the most competitive global machine learning competitions from as recently as 2012. Random forests can also derive strong accuracy without overfitting on small datasets, of which there are many in the oil and gas industry. Using random forests, industry practitioners will be able to drive more accurate data informed decision making across their firms.

There are many benefits to the industry in working with Microsoft Excel. While machine learning efforts on large datasets are often undertaken using computer languages such as Python and R, most of the industry’s data is already stored and manipulated in Microsoft Excel spreadsheets. Today’s powerful processors in standard work computers allow even computationally intensive machine learning methods to be implemented in Excel within tolerable timeframes. Practitioners used to working in Excel can also avoid the sharp learning curve that new computer languages impose and can avoid the security limitations of using new software in the firm. Finally, Excel allows the user to visualize all the data stored in even the intermediate steps of machine learning implementation, increasing transparency and credibility of results. Practitioners can demonstrate exactly how they have arrived at their recommendations from a random forest in Excel.

Despite Excel’s status as the most popular data storage and data manipulation tool in the industry, many practitioners have not been taught how to use the built-in scripting language Visual Basic (VBA) to develop software tools in Excel. In addition, despite Excel’s capabilities, many of the more powerful data science methods have yet to be adapted to Excel. To boost the value of oil and gas practitioners in the industry, this course teaches software development in VBA for machine learning and the creation of a random forest tool in Excel.

Module 1: Practical Software Development Using Visual Basic for Data Science in Excel

This module is an in-depth exploration of coding and data science techniques available in the Visual Basic (VBA) scripting language in Microsoft Excel. The course will teach attendees to build tools which can programmatically analyze and manipulate dynamic data samples. Attendees will learn to perform tasks such as finding and correcting errors in data, conditionally sorting data into categories, automating calculations and formatting for single-button report generation, and even implementation of more advanced data science techniques.

Beginning with an overview of the contemporary utility of Excel, the module will then progress to cover the additional value VBA adds to Excel and how it is set up and used in conjunction with Excel. The next section touches on best practices in setting up data in Excel for use of data science to extract value. This section also covers the challenges presented by conducting data science projects in Excel and how VBA can help to solve them. From there, the module moves to a deep dive into VBA, covering how to reference the Excel worksheet, how to arrange software for maximal effectiveness and for ease of debugging, and the specific scripting techniques available in VBA with their variety of uses. The module includes regular individual and group exercises as well as introductions of valuable free online tools to help troubleshoot implementations of VBA in Excel. This module is designed for attendees who are already proficient Excel users, but the material is appropriate for attendees who are new to coding in general or new to VBA in particular. Proficient Excel users are those who are familiar with creating, saving, and accessing Excel files; with manipulating rows and columns; as well as with using formulas in cells and across tabs. Examples throughout the module are taken from the oil and gas industry.

Module 2: Random Forest Machine Learning Implementation in Visual Basic for Excel

Module 2 walks attendees through using the techniques from Module 1 in VBA and Excel to create a random forest tool, which is a powerful and flexible machine learning algorithm typically unavailable in Excel. Random forests are capable of working even with smaller data sets to achieve accurate results without overfitting the data.

The module begins with a brief history of machine learning and the general reasons behind its relevance and utility today. After comparative advantages and drawbacks of different machine learning approaches are explored, the module moves to a deep dive into random forest. This technique is covered thoroughly in interactive sections to develop an intuitive understanding of the appropriate use cases, the overall purpose of the method, and each of the component parts. Following a brief section on practical considerations for random forests in Excel, the module then covers the pseudo code for each component in visual basic. Participants engage in activities to create and connect these components from their pseudocode, and results are compared to a master random forest file shared with each attendee. The module then progresses through the entire back end of the tool with intermediary outputs into Excel to help build visual understanding. After another round of exercises putting the tool to use on data sets, the module concludes with a recap of the tool, resources for continued development, and suggested projects for using VBA in Excel as well as the random forest tool in the workplace. Examples and datasets used throughout are taken from the oil and gas industry.

The modules are recommended to be taken together in series, but each can be taken independently depending on the goals of the attendee. The first module covers general methods for scripting in VBA, so it is still useful on its own to attendees planning on building small scale software to manipulate and learn from data in Excel even if they have no other interest in random forests or in practicing larger projects in a course setting. Module 2 involves putting together the various building blocks covered in Module 1, however, so it is still good practice even for those uninterested in random forests. Likewise, the examples and context in Module 1 will prepare attendees for specific challenges presented by the random forest in Module 2. For this reason, it is recommended that even attendees already familiar with scripting in VBA take both modules.

Module 1: Practical Software Development Using Visual Basic for Data Science in Excel

Microsoft Excel refresher
1. What is Excel used for
2. What is Excel best at
Intro to Visual Basic (VBA) for Excel
1. When to write scripts instead of using formulas
2. How to set up and run VBA
Data science in Excel
1. Data engineering
2. Data clean and prep
3. Methods for insight extraction from data
How to implement data science projects in VBA
1. Worksheet as backend
2. Functions and subfunctions
3. Data types
4. Loops and conditional statements
5. Debugging best practices
6. DIY learning tools
Curated practice projects

Module 2: Random Forest Machine Learning Implementation in Visual Basic for Excel

Machine learning introduction
1. History and contemporary relevance of machine learning
2. Case studies for effective use of machine learning in the industry
3. Implementation methods and their drawbacks
Random forest deep dive
1. What are random forests used for in machine learning
2. How random forests work
3. Example use cases and results
How to set up a random forest in Excel using VBA
1. Practicalities of random forest in Excel
2. Intuition and pseudocode
3. Random forest implementation walkthrough
Curated practice projects
Conclusion
1. Recap of the tool you now have and how you can use it
2. Resources for continued development
3. Suggested projects using VBA and using random forests