WILEY SERIES ON METHODS AND APPLICATIONS IN DATA MINING

Series Editor: Daniel T. Larose

Practical Text Mining with Perl • Roger Bilisoly

Knowledge Discovery with Support Vector Machines • Lutz Hamel

Data Mining for Genomics and Proteomics: Analysis of Gene and Protein Expression Data • Darius M. Dziuda

Discovering Knowledge in Data: An Introduction to Data Mining, Second Edition • Daniel T. Larose and Chantal D. Larose

Data Mining and Predictive Analytics • Daniel T. Larose and Chantal D. Larose

Data Mining and Learning Analytics: Applications in Educational Research • Samira ElAtia, Donald Ipperciel, and Osmar R. Zaïane

Pattern Recognition: A Quality of Data Perspective • Władysław Homenda and Witold Pedrycz

DATA SCIENCE USING PYTHON AND R



CHANTAL D. LAROSE

Eastern Connecticut State University
Windham, CT, USA

DANIEL T. LAROSE

Central Connecticut State University
New Britain, CT, USA








PREFACE

DATA SCIENCE USING PYTHON AND R

Why this Book is Needed

Reason 1. Data Science is Hot. Really hot. Bloomberg called data scientist “the hottest job in America.”1 Business Insider called it “The best job in America right now.”2 Glassdoor.com rated it the best job in the world in 2018 for the third year in a row.3 The Harvard Business Review called data scientist “The sexiest job in the 21st century.”4

Reason 2. Top Two Open‐source Tools. Python and R are the top two open‐source data science tools in the world.5 Analysts and coders from around the world work hard to build analytic packages that Python and R users can then apply, free of charge.

Data Science Using Python and R will awaken your expertise in this cutting‐edge field using the most widespread open‐source analytics tools in the world. In Data Science Using Python and R, you will find step‐by‐step hands‐on solutions of real‐world business problems, using state‐of‐the‐art techniques. In short, you will learn data science by doing data science.

Written for Beginners and Non‐Beginners Alike

Data Science Using Python and R is written for the general reader, with no previous analytics or programming experience. We know that the information‐age economy is making many English majors and History majors retool to take advantage of the great demand for data scientists.6 This is why we provide the following materials to help those who are new to the field hit the ground running.

  • An entire chapter dedicated to learning the basics of using Python and R, for beginners. Which platform to use. Which packages to download. Everything you need to get started.
  • An appendix dedicated to filling in any holes you might have in your introductory data analysis knowledge, called Data Summarization and Visualization.
  • Step‐by‐step instructions throughout. Every instruction for every action.
  • Every chapter has Exercises, where you may check your understanding and progress.

Those with analytics or programming experience will enjoy having a one‐stop shop for learning how to do data science using both Python and R. Managers, CIOs, CEOs, and CFOs will enjoy being able to communicate better with their data analysts and database analysts. The emphasis in this book on accurately accounting for model costs will help everyone uncover the most profitable nuggets of knowledge from the data, while avoiding the potential pitfalls that may cost your company millions of dollars.

Data Science Using Python and R covers exciting new topics, such as the following:

  • Random Forests,
  • General Linear Models, and
  • Data‐driven error costs to enhance profitability.

Data sets and supplemental materials can be found under the Related Resources section at https://bcs.wiley.com/he-bcs/Books?action=index&itemId=1119526817&bcsId=11765 and https://bcs.wiley.com/he-bcs/Books?action=index&itemId=1119526817&bcsId=11712.

Data Science Using Python and R as a Textbook

Data Science Using Python and R naturally fits the role of textbook for a one‐semester course or two‐semester sequence of courses in introductory and intermediate data science. Faculty instructors will appreciate the exercises at the end of every chapter, totaling over 500 exercises in the book. There are three categories of exercises, from testing basic understanding toward more hands‐on analysis of new and challenging applications.

  • Clarifying the Concepts. These exercises test the students' basic understanding of the material, to make sure the students have absorbed what they have read.
  • Working with the Data. These applied exercises ask the student to work in Python and R, following the step‐by‐step instructions that were presented in the chapter.
  • Hands‐on Analysis. Here is the real meat of the learning process for the students, where they apply their newly found knowledge and skills to uncover patterns and trends in new data sets. Here is where the students' expertise is challenged, in near real‐world conditions. More than half of the exercises in the book consist of Hands‐on Analysis.

The following supporting materials are also available to faculty adopters of the book at no cost.

  • Full solutions manual, providing not just the answers, but how to arrive at the answers.
  • PowerPoint presentations of each chapter, so that you may help the students understand the material, rather than just assigning them to read it.

To obtain access to these materials, contact your local Wiley representative and ask them to email the authors confirming that you have adopted the book for your course.

Data Science Using Python and R is appropriate for advanced undergraduate or graduate‐level courses. No previous statistics, computer programming, or database expertise is required. What is required is a desire to learn.

How the Book is Structured

Data Science Using Python and R is structured around the Data Science Methodology.

The Data Science Methodology is a phased, adaptive, iterative approach to the analysis of data, within a scientific framework.

  1. Problem Understanding Phase. First, clearly enunciate the project objectives. Then, translate these objectives into the formulation of a problem that can be solved using data science.
  2. Data Preparation Phase. Data cleaning/preparation is probably the most labor‐intensive phase of the entire data science process.
    • Covered in Chapter 3: Data Preparation.
  3. Exploratory Data Analysis Phase. Gain insights into your data through graphical exploration.
    • Covered in Chapter 4: Exploratory Data Analysis.
  4. Setup Phase. Establish baseline model performance. Partition the data. Balance the data, if needed.
    • Covered in Chapter 5: Preparing to Model the Data.
  5. Modeling Phase. The core of the data science process. Apply state‐of‐the‐art algorithms to uncover some seriously profitable relationships lying hidden in the data.
    • Covered in Chapters 6 and 8–14.
  6. Evaluation Phase. Determine whether your models are any good. Select the best‐performing model from a set of competing models.
    • Covered in Chapter 7: Model Evaluation.
  7. Deployment Phase. Interface with management to adapt your models for real‐world deployment.
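
The Setup Phase steps listed above (establish a baseline, partition the data, balance the data if needed) can be sketched in a few lines of Python. This is a minimal illustration, not the book's own code: it uses only the standard library, and the record layout, the function names (partition, baseline_accuracy, rebalance), the 25% test fraction, and the 30% target share are all assumptions chosen for the example. Resampling the rare class with replacement is only one possible way to balance a training set.

```python
import random
from collections import Counter


def partition(records, test_fraction=0.25, seed=7):
    """Shuffle a copy of the records and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def rebalance(train, label_of, rare_label, target_share, seed=7):
    """Add resampled rare-class records until they form about target_share."""
    rng = random.Random(seed)
    rare = [r for r in train if label_of(r) == rare_label]
    if not rare:
        return list(train)
    # Solve (n_rare + x) / (n_total + x) = target_share for x.
    needed = (target_share * len(train) - len(rare)) / (1 - target_share)
    extra = max(0, round(needed))
    return list(train) + [rng.choice(rare) for _ in range(extra)]


# Hypothetical data: 100 records, 20 of which carry the rare "yes" label.
records = [{"id": i, "response": "yes" if i % 5 == 0 else "no"}
           for i in range(100)]

baseline = baseline_accuracy([r["response"] for r in records])
train, test = partition(records)
balanced = rebalance(train, lambda r: r["response"], "yes", target_share=0.30)
```

Only the training set is rebalanced here; the test set is left untouched so that any later evaluation still reflects the original class proportions. A model that cannot beat the baseline accuracy is not adding value over always guessing the majority class.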

Notes

ABOUT THE AUTHORS

Chantal D. Larose, PhD, and Daniel T. Larose, PhD, form a unique father–daughter pair of data scientists. This is their third book as coauthors. Previously, they wrote:

  • Data Mining and Predictive Analytics, Second Edition, Wiley, 2015.
    • This 800‐page tome would be a wonderful companion to this book, for those looking to dive deeper into the field.
  • Discovering Knowledge in Data: An Introduction to Data Mining, Second Edition, Wiley, 2014.

Chantal D. Larose completed her PhD in Statistics at the University of Connecticut in 2015, with dissertation Model‐Based Clustering of Incomplete Data. As an Assistant Professor of Decision Science at SUNY, New Paltz, she helped develop the Bachelor of Science in Business Analytics. Now, as an Assistant Professor of Statistics and Data Science at Eastern Connecticut State University, she is helping to develop the Mathematical Science Department's data science curriculum.

Daniel T. Larose completed his PhD in Statistics at the University of Connecticut in 1996, with dissertation Bayesian Approaches to Meta‐Analysis. He is a Professor of Statistics and Data Science at Central Connecticut State University. In 2001, he developed the world's first online Master of Science in Data Mining. This is the 12th textbook that he has authored or coauthored. He runs a small consulting business. He also directs the online Master of Data Science program at CCSU.

ACKNOWLEDGMENTS

CHANTAL'S ACKNOWLEDGMENTS

Deepest thanks to my father Daniel, for his corny quips when proofreading. His guidance and passion for the craft reflect and enhance my own, and make working with him a joy. Many thanks to my little sister Ravel, for her boundless love and incredible musical and scientific gifts. My fellow‐traveler, she is an inspiration. Thanks to my brother Tristan, for all his hard work in school and letting me beat him at Mario Kart exactly once. Thanks to my mother Debra, for food and hugs. Also, coffee. Many, many thanks to coffee.

Chantal D. Larose, PhD
Assistant Professor of Statistics & Data Science
Eastern Connecticut State University

DANIEL'S ACKNOWLEDGMENTS

It is all about family. I would like to thank my daughter Chantal, for her insightful mind, her gentle presence, and for the joy she brings to every day. Thanks to my daughter Ravel, for her uniqueness, and for having the courage to follow her dream and become a chemist. Thanks to my son Tristan, for his math and computer skills, and for his help moving rocks in the backyard. I would also like to acknowledge my stillborn daughter Ellyriane Soleil. How we miss what you would have become. Finally, thanks to my loving wife, Debra, for her deep love and care for all of us, all these years. I love you all very much.

Daniel T. Larose, PhD
Professor of Statistics and Data Science
Central Connecticut State University
www.ccsu.edu/faculty/larose