
Machine Learning with Spark™ and Python®

Essential Techniques for Predictive Analytics

Second Edition

Michael Bowles

I dedicate this book to my expanding family of children and grandchildren, Scott, Seth, Cayley, Rees, and Lia. Being included in their lives is a constant source of joy for me. I hope it makes them smile to see their names in print. I also dedicate it to my close friend Dave, whose friendship remains steadfast in spite of my best efforts. I hope this makes him smile too.

Mike Bowles, Silicon Valley 2019

About the Author

Dr. Michael Bowles (Mike) holds bachelor's and master's degrees in mechanical engineering, an ScD in instrumentation, and an MBA. He has worked in academia, technology, and business. Mike currently works with companies where artificial intelligence or machine learning is integral to success. He serves variously as a member of the management team, a consultant, or an advisor. He also teaches machine learning courses at UC Berkeley and Hacker Dojo, a co-working space and startup incubator in Mountain View, CA.

Mike was born in Oklahoma and took his bachelor's and master's degrees there, then, after a stint in Southeast Asia, went to Cambridge for his ScD and held the C. Stark Draper Chair at MIT after graduation. Mike left Boston to work on communications satellites at Hughes Aircraft Company in Southern California and then, after completing an MBA at UCLA, moved to the San Francisco Bay Area to take roles as founder and CEO of two successful venture-backed startups.

Mike remains actively involved in technical and startup-related work. Recent projects include the use of machine learning in industrial inspection and automation, financial prediction, predicting biological outcomes on the basis of molecular graph structures, and financial risk estimation. He has participated in due diligence work on companies in the artificial intelligence and machine learning arenas. Mike can be reached through mbowles.com.

About the Technical Editor

James York-Winegar is an Infrastructure Principal with Accenture Enkitec Group. James helps companies of all sizes, from startups to enterprises, manage their data lifecycle by bridging the gap between systems management and data science. He began his career in physics, running large-scale quantum chemistry simulations on supercomputers, before moving into technology. He holds a master's in Data Science from Berkeley.

Acknowledgments

I'd like to acknowledge the splendid support that people at Wiley have offered during the course of writing this book and making the revisions for this second edition. It began with Robert Elliot, the acquisitions editor who first contacted me about writing a book—very easy to work with. Tom Dinse has done a splendid job editing this second edition. He's been responsive, thorough, flexible, and completely professional, as I've come to expect from Wiley. I thank you.

I'd also like to acknowledge the enormous comfort that comes from having such a quick, capable computer scientist as James York-Winegar doing the technical editing on the book. James has brought a more consistent style and has made a number of improvements that will make the code that accompanies the book easier to use and understand. Thank you for that.

The example problems used in the book come from the University of California at Irvine's data repository. UCI does the machine learning community a great service by gathering these data sets, curating them, and making them freely available. The reference for this material is:

Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository (http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

Introduction

Extracting actionable information from data is changing the fabric of modern business in ways that directly affect programmers. One way is the demand for new programming skills. Market analysts predict that demand for people with advanced statistics and machine learning skills will exceed supply by 140,000 to 190,000 by 2018. That means good salaries and a wide choice of interesting projects for those who have the requisite skills. Another development that affects programmers is progress in developing core tools for statistics and machine learning. This relieves programmers of the need to program intricate algorithms for themselves each time they want to try a new one. Among general-purpose programming languages, Python has been at the forefront, with developers building state-of-the-art machine learning tools, but there is a gap between having the tools and being able to use them efficiently.

Programmers can gain general knowledge about machine learning in a number of ways: online courses, well-written books, and so on. Many of these give excellent surveys of machine learning algorithms and examples of their use, but because so many different algorithms are available, it's difficult for a survey to cover the details of their usage.

This leaves a gap for the practitioner. The wealth of available algorithms forces choices that a programmer new to machine learning may not be equipped to make without trying several, and surveys leave the programmer to fill in the details of how these algorithms are used in the context of overall problem formulation and solution.

This book attempts to close that gap. The approach taken is to restrict the algorithms covered to two families of algorithms that have proven to give optimum performance for a wide variety of problems. This assertion is supported by their dominant usage in machine learning competitions, their early inclusion in newly developed packages of machine learning tools, and their performance in comparative studies (as discussed in Chapter 1, “The Two Essential Algorithms for Making Predictions”). Restricting attention to two algorithm families makes it possible to provide good coverage of the principles of operation and to run through the details of a number of examples showing how these algorithms apply to problems with different structures.

The book largely relies on code examples to illustrate the principles of operation for the algorithms discussed. I've discovered in the classes I have taught at the University of California, Berkeley, Galvanize, the University of New Haven, and Hacker Dojo that programmers generally grasp principles more readily by seeing simple code illustrations than by looking at math.

This book focuses on Python because it offers a good blend of functionality and specialized packages containing machine learning algorithms. Python is a widely used language that is well known for producing compact, readable code. That fact has led a number of leading companies to adopt Python for prototyping and deployment. Python developers are supported by a large community of fellow developers, development tools, extensions, and so forth. Python is widely used in industrial applications and in scientific programming as well. It has a number of packages that support computationally intensive applications like machine learning, and it has a good collection of the leading machine learning algorithms (so you don't have to code them yourself). Python is a better general-purpose programming language than specialized statistical languages such as R or SAS (Statistical Analysis System). Its collection of machine learning algorithms incorporates a number of top-flight algorithms and continues to expand.

Who This Book Is For

This book is intended for Python programmers who want to add machine learning to their repertoire, either for a specific project or as part of keeping their toolkit relevant. Perhaps a new problem has come up at work that requires machine learning. With machine learning being covered so much in the news these days, it's a useful skill to claim on a resume.

This book provides the following for Python programmers:

  • A description of the basic problems that machine learning attacks
  • Several state-of-the-art algorithms
  • The principles of operation for these algorithms
  • Process steps for specifying, designing, and qualifying a machine learning system
  • Examples of the processes and algorithms
  • Hackable code

To get through this book easily, your primary background requirements include an understanding of programming or computer science and the ability to read and write code. The code examples, libraries, and packages are all Python, so the book will prove most useful to Python programmers. In some cases, the book runs through code for the core of an algorithm to demonstrate the operating principles, but then uses a Python package incorporating the algorithm to apply the algorithm to problems. Seeing code often gives programmers an intuitive grasp of an algorithm in the way that seeing the math does for others. Once the understanding is in place, examples will use developed Python packages with the bells and whistles that are important for efficient use (error checking, handling input and output, developed data structures for the models, defined predictor methods incorporating the trained model, and so on).

In addition to having a programming background, some knowledge of math and statistics will help get you through the material easily. Math requirements include some undergraduate-level differential calculus (knowing how to take a derivative) and a little bit of linear algebra (matrix notation, matrix multiplication, and matrix inverse). The main use of these will be to follow the derivations of some of the algorithms covered. Many times, that will be as simple as taking a derivative of a simple function or doing some basic matrix manipulations. Being able to follow the calculations at a conceptual level may aid your understanding of the algorithm. Understanding the steps in the derivation can help you to understand the strengths and weaknesses of an algorithm and can help you to decide which algorithm is likely to be the best choice for a particular problem.

This book also uses some general probability and statistics. The requirements for these include some familiarity with undergraduate-level probability and concepts such as the mean value of a list of real numbers, variance, and correlation. You can always look through the code if some of the concepts are rusty for you.
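
If mean, variance, or correlation are rusty, a few lines of numpy (which you will be installing anyway) serve as a quick refresher; the numbers below are made up purely for illustration:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    print("mean of x:    ", np.mean(x))                # average value
    print("variance of x:", np.var(x))                 # spread around the mean
    print("corr(x, y):   ", np.corrcoef(x, y)[0, 1])   # linear association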

This book covers two broad classes of machine learning algorithms: penalized linear regression (for example, Ridge and Lasso) and ensemble methods (for example, Random Forest and Gradient Boosting). Each of these families contains variants that will solve regression and classification problems. (You learn the distinction between classification and regression early in the book.)
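
As a concrete preview, here is a minimal sketch (my own illustration, not one of the book's examples) showing that a member of each family can be fit in a few lines of scikit-learn; the synthetic data set and parameter values are arbitrary:

    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor   # ensemble method
    from sklearn.linear_model import Lasso               # penalized linear regression

    # Synthetic regression problem, purely for illustration
    X, y = make_regression(n_samples=200, n_features=10, noise=0.5,
                           random_state=0)

    lasso = Lasso(alpha=0.1).fit(X, y)
    forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

    print("Lasso R^2 on training data:        ", lasso.score(X, y))
    print("Random forest R^2 on training data:", forest.score(X, y))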

Readers who are already familiar with machine learning and are only interested in picking up one or the other of these can skip to the two chapters covering that family. Each method gets two chapters—one covering principles of operation and the other running through usage on different types of problems. Penalized linear regression is covered in Chapter 4, “Penalized Linear Regression,” and Chapter 5, “Building Predictive Models Using Penalized Linear Methods.” Ensemble methods are covered in Chapter 6, “Ensemble Methods,” and Chapter 7, “Building Ensemble Models with Python.” To familiarize yourself with the problems addressed in the chapters on usage of the algorithms, you might find it helpful to skim Chapter 2, “Understand the Problem by Understanding the Data,” which deals with data exploration. Readers who are just starting out with machine learning and want to go through from start to finish might want to save Chapter 2 until they start looking at the solutions to problems in later chapters.

What This Book Covers

As mentioned earlier, this book covers two algorithm families that are relatively recent developments and that are still being actively researched. They both depend on, and have somewhat eclipsed, earlier technologies.

Penalized linear regression represents a relatively recent development in ongoing research to improve on ordinary least squares regression. Penalized linear regression has several features that make it a top choice for predictive analytics. Penalized linear regression introduces a tunable parameter that makes it possible to balance the resulting model between overfitting and underfitting. It also yields information on the relative importance of the various inputs to the predictions it makes. Both of these features are vitally important to the process of developing predictive models. In addition, penalized linear regression yields the best prediction performance in some classes of problems, particularly underdetermined problems and problems with very many input parameters such as genetics and text mining. Furthermore, there's been a great deal of recent development of coordinate descent methods, making training penalized linear regression models extremely fast.
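
A small sketch makes the role of that tunable parameter concrete (this is my own illustration using scikit-learn's Lasso on synthetic data; the alpha values are arbitrary). A larger penalty drives more coefficients to exactly zero, yielding a simpler model, while a smaller penalty lets the model fit the training data more closely:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    # 20 candidate features, only 5 of which actually matter
    X, y = make_regression(n_samples=100, n_features=20, n_informative=5,
                           noise=1.0, random_state=0)

    for alpha in [10.0, 1.0, 0.1]:
        model = Lasso(alpha=alpha).fit(X, y)
        n_nonzero = np.sum(model.coef_ != 0)
        print(f"alpha={alpha:5}: {n_nonzero} nonzero coefficients")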

To help you understand penalized linear regression, this book recapitulates ordinary linear regression and other extensions to it, such as stepwise regression. The hope is that these will help cultivate intuition.

Ensemble methods are one of the most powerful predictive analytics tools available. They can model extremely complicated behavior, especially for problems that are vastly overdetermined, as is often the case for many web-based prediction problems (such as returning search results or predicting ad click-through rates). Many seasoned data scientists use ensemble methods as their first try because of their performance. They are relatively simple to use, and they also rank variables in terms of predictive performance.
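
The variable ranking comes almost for free. Here is a minimal sketch (again my own, on synthetic data) of how scikit-learn's gradient boosting reports the relative importance of each input:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    # Synthetic classification problem: 8 features, 3 of them informative
    X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                               random_state=0)

    model = GradientBoostingClassifier().fit(X, y)
    ranked = sorted(enumerate(model.feature_importances_),
                    key=lambda pair: -pair[1])
    for index, importance in ranked:
        print(f"feature {index}: importance {importance:.3f}")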

Ensemble methods have followed a development path parallel to penalized linear regression. Whereas penalized linear regression evolved from overcoming the limitations of ordinary regression, ensemble methods evolved to overcome the limitations of binary decision trees. Correspondingly, this book's coverage of ensemble methods covers some background on binary decision trees because ensemble methods inherit some of their properties from binary decision trees. Understanding them helps cultivate intuition about ensemble methods.

What Has Changed Since the First Edition

In the three years since the first edition was published, Python has more firmly established itself as the primary language for data science. Developers of platforms like Spark for big data or TensorFlow and Torch for deep learning have adopted Python interfaces to reach the widest set of data scientists. The two classes of algorithms emphasized in the first edition continue to be heavy favorites and are now available as part of PySpark.

The beauty of this marriage is that the code required to build machine learning models on truly gargantuan data sets is no more complicated than what's required on smaller data sets.

PySpark reflects several important developments that make it clean and easy to invoke very powerful machine learning tools through relatively simple, easy-to-read Python code. When the first edition of this book was written, building machine learning models on very large data sets required spinning up hundreds of processors, which in turn required deep knowledge of data center processes and programming. It was cumbersome and, frankly, not very effective. The Spark architecture was developed to correct this difficulty.

Spark made it possible to easily rent and employ large numbers of processors for machine learning, and PySpark added a Python interface. The result is that the code to run a machine learning algorithm in PySpark is not much more complicated than the code to run the plain Python version. Since the algorithms that were the focus of the first edition continue to be heavily used favorites and are available in Spark, it seemed natural to add PySpark examples alongside the Python examples in order to familiarize readers with PySpark.
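
To give a flavor of that similarity, here is a hypothetical PySpark sketch (the file name and column names are placeholders rather than examples from the book). The pattern of assembling features and calling fit closely parallels the plain Python code shown earlier:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import RandomForestRegressor

    spark = SparkSession.builder.appName("sketch").getOrCreate()

    # Placeholder file with a numeric "label" column to predict
    df = spark.read.csv("data.csv", header=True, inferSchema=True)
    feature_cols = [c for c in df.columns if c != "label"]
    assembled = VectorAssembler(inputCols=feature_cols,
                                outputCol="features").transform(df)

    model = RandomForestRegressor(featuresCol="features",
                                  labelCol="label").fit(assembled)
    print(model.featureImportances)  # same variable ranking idea as before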

In this edition all the code examples are in Python 3, since Python 2 is due to fall out of support. In addition to providing the code in text form, I've made it available in a Jupyter notebook for each chapter. When executed, the notebook code draws the graphs and tables that you see in the figures.

How This Book Is Structured

This book follows the basic order in which you would approach a new prediction problem. The beginning involves developing an understanding of the data and determining how to formulate the problem, and then proceeds to try an algorithm and measure the performance. In the midst of this sequence, the book outlines the methods and reasons for the steps as they come up. Chapter 1 gives a more thorough description of the types of problems that this book covers and the methods that are used. The book uses several data sets from the UC Irvine data repository as examples, and Chapter 2 exhibits some of the methods and tools that you can use for developing insight into a new data set. Chapter 3, “Predictive Model Building: Balancing Performance, Complexity, and Big Data,” talks about the difficulties of predictive analytics and techniques for addressing them. It outlines the relationships between problem complexity, model complexity, data set size, and predictive performance. It discusses overfitting and how to reliably sense overfitting. It talks about performance metrics for different types of problems. Chapters 4 and 5, respectively, cover the background on penalized linear regression and its application to problems explored in Chapter 2. Chapters 6 and 7 cover background and application for ensemble methods.

What You Need to Use This Book

To run the code examples in the book, you need to have Python 3.x, SciPy, numpy, pandas, scikit-learn, and PySpark. These can be difficult to install due to cross-dependencies and version issues. To make the installation easy, I've used a free distribution of these packages that's available from Continuum Analytics (http://continuum.io/). Its Anaconda product is a free download and includes Python 3.x and all the packages you need to run the code in this book (and more). I've run the examples on Ubuntu 14.04 Linux but haven't tried them on other operating systems.
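
Once Anaconda is installed, a quick sanity check (my suggestion, not part of the book's downloadable code) is to confirm that the packages import cleanly and to note which versions you have:

    import sys
    import numpy, scipy, pandas, sklearn

    print("Python:      ", sys.version.split()[0])
    print("numpy:       ", numpy.__version__)
    print("scipy:       ", scipy.__version__)
    print("pandas:      ", pandas.__version__)
    print("scikit-learn:", sklearn.__version__)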

PySpark needs a Linux environment. If you're not running on Linux, then probably the easiest way to run the examples is to use a virtual machine. VirtualBox is free, open source virtualization software; follow the directions to download VirtualBox, then install Ubuntu 18.04 and use Anaconda to install Python, PySpark, and so on. You'll only need a VM to run the PySpark examples. The non-Spark code will run anywhere you can open a Jupyter notebook.

Reader Support for This Book

Source code available in the book's repository can help speed your learning. The chapters include installation instructions so that you can code along as you read the book.

Source Code

As you work through the examples in this book, you may choose either to type in all the code manually or to use the source code files that accompany the book. All the source code used in this book is available for download from http://www.wiley.com/go/pythonmachinelearning2e. You will find the code snippets from the source code are accompanied by a download icon and note indicating the name of the program so that you know it's available for download and can easily locate it in the download file.

Besides being provided in text form, the code is also included in a Jupyter notebook for each chapter. If you know how to run a Jupyter notebook, you can run the code cell by cell. The output will appear in the notebook: the figures will get drawn, and printed output will appear below the code block.

After you download the code, just decompress it with your favorite compression tool.

How to Contact the Publisher

If you believe you've found a mistake in this book, please bring it to our attention. At John Wiley & Sons, we understand how important it is to provide our customers with accurate content, but even with our best efforts an error may occur.

In order to submit your possible errata, please email it to our Customer Service Team at wileysupport@wiley.com with the subject line “Possible Book Errata Submission”.