Series Editor: Daniel T. Larose
Practical Text Mining with Perl • Roger Bilisoly
Knowledge Discovery with Support Vector Machines • Lutz Hamel
Data Mining for Genomics and Proteomics: Analysis of Gene and Protein Expression Data • Darius M. Dziuda
Discovering Knowledge in Data: An Introduction to Data Mining, Second Edition • Daniel T. Larose and Chantal D. Larose
Data Mining and Predictive Analytics • Daniel T. Larose and Chantal D. Larose
Data Mining and Learning Analytics: Applications in Educational Research • Samira ElAtia, Donald Ipperciel, and Osmar R. Zaïane
Pattern Recognition: A Quality of Data Perspective • Władysław Homenda and Witold Pedrycz
This edition first published 2019
© 2019 John Wiley & Sons, Inc.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise, except as permitted by law. Advice on how to obtain permission to reuse material from this title is available at http://www.wiley.com/go/permissions.
The right of Chantal D. Larose and Daniel T. Larose to be identified as the authors of this work has been asserted in accordance with law.
Registered Office
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
Editorial Office
111 River Street, Hoboken, NJ 07030, USA
For details of our global editorial offices, customer services, and more information about Wiley products visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print‐on‐demand. Some content that appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written sales materials or promotional statements for this work. The fact that an organization, website, or product is referred to in this work as a citation and/or potential source of further information does not mean that the publisher and authors endorse the information or services the organization, website, or product may provide or recommendations it may make. This work is sold with the understanding that the publisher is not engaged in rendering professional services. The advice and strategies contained herein may not be suitable for your situation. You should consult with a specialist where appropriate. Further, readers should be aware that websites listed in this work may have changed or disappeared between when this work was written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.
Library of Congress Cataloging‐in‐Publication Data
Names: Larose, Chantal D., author. | Larose, Daniel T., author.
Title: Data science using Python and R / Chantal D. Larose, Eastern Connecticut State University, Connecticut, USA, Daniel T. Larose, Central Connecticut State University, Connecticut, USA.
Description: Hoboken, NJ : John Wiley & Sons, Inc, 2019. | Includes index. |
Identifiers: LCCN 2019007280 (print) | LCCN 2019009632 (ebook) | ISBN 9781119526834 (Adobe PDF) | ISBN 9781119526841 (ePub) | ISBN 9781119526810 (hardback)
Subjects: LCSH: Data mining. | Python (Computer program language) | R (Computer program language) | Big data. | Data structures (Computer science)
Classification: LCC QA76.9.D343 (ebook) | LCC QA76.9.D343 L376 2019 (print) | DDC 006.3/12–dc23
LC record available at https://lccn.loc.gov/2019007280
Cover Design: Wiley
Cover Image: © LumenGraphics/Shutterstock
Reason 1: Data Science is Hot. Really hot. Bloomberg called data scientist “the hottest job in America.”1 Business Insider called it “The best job in America right now.”2 Glassdoor.com rated it the best job in the world in 2018 for the third year in a row.3 The Harvard Business Review called data scientist “The sexiest job in the 21st century.”4
Reason 2: Top Two Open‐source Tools. Python and R are the top two open‐source data science tools in the world.5 Analysts and coders from around the world work hard to build analytic packages that Python and R users can then apply, free of charge.
Data Science Using Python and R will awaken your expertise in this cutting‐edge field using the most widespread open‐source analytics tools in the world. In Data Science Using Python and R, you will find step‐by‐step hands‐on solutions of real‐world business problems, using state‐of‐the‐art techniques. In short, you will learn data science by doing data science.
Data Science Using Python and R is written for the general reader, with no previous analytics or programming experience. We know that the information‐age economy is making many English majors and History majors retool to take advantage of the great demand for data scientists.6 This is why we provide the following materials to help those who are new to the field hit the ground running.
Those with analytics or programming experience will enjoy having a one‐stop shop for learning how to do data science using both Python and R. Managers, CIOs, CEOs, and CFOs will enjoy being able to communicate better with their data analysts and database analysts. The emphasis in this book on accurately accounting for model costs will help everyone uncover the most profitable nuggets of knowledge from the data, while avoiding the potential pitfalls that may cost a company millions of dollars.
Data Science Using Python and R covers exciting new topics, such as the following:
Data sets and supplemental materials can be found under the Related Resources section at https://bcs.wiley.com/he-bcs/Books?action=index&itemId=1119526817&bcsId=11765 and https://bcs.wiley.com/he-bcs/Books?action=index&itemId=1119526817&bcsId=11712.
Data Science Using Python and R naturally fits the role of textbook for a one‐semester course or two‐semester sequence of courses in introductory and intermediate data science. Faculty instructors will appreciate the exercises at the end of every chapter, totaling over 500 exercises in the book. There are three categories of exercises, from testing basic understanding toward more hands‐on analysis of new and challenging applications.
The following supporting materials are also available to faculty adopters of the book at no cost.
To obtain access to these materials, contact your local Wiley representative and ask them to email the authors confirming that you have adopted the book for your course.
Data Science Using Python and R is appropriate for advanced undergraduate or graduate‐level courses. No previous statistics, computer programming, or database expertise is required. What is required is a desire to learn.
Data Science Using Python and R is structured around the Data Science Methodology.
The Data Science Methodology is a phased, adaptive, iterative approach to the analysis of data, within a scientific framework.
Chantal D. Larose, PhD, and Daniel T. Larose, PhD, form a unique father–daughter pair of data scientists. This is their third book as coauthors. Previously, they wrote:
Chantal D. Larose completed her PhD in Statistics at the University of Connecticut in 2015, with the dissertation Model‐Based Clustering of Incomplete Data. As an Assistant Professor of Decision Science at SUNY New Paltz, she helped develop the Bachelor of Science in Business Analytics. Now, as an Assistant Professor of Statistics and Data Science at Eastern Connecticut State University, she is helping to develop the Mathematical Science Department's data science curriculum.
Daniel T. Larose completed his PhD in Statistics at the University of Connecticut in 1996, with the dissertation Bayesian Approaches to Meta‐Analysis. He is a Professor of Statistics and Data Science at Central Connecticut State University. In 2001, he developed the world's first online Master of Science in Data Mining. This is the 12th textbook that he has authored or coauthored. He runs a small consulting business. He also directs the online Master of Data Science program at CCSU.
Deepest thanks to my father Daniel, for his corny quips when proofreading. His guidance and passion for the craft reflect and enhance my own, and make working with him a joy. Many thanks to my little sister Ravel, for her boundless love and incredible musical and scientific gifts. My fellow‐traveler, she is an inspiration. Thanks to my brother Tristan, for all his hard work in school and for letting me beat him at Mario Kart exactly once. Thanks to my mother Debra, for food and hugs. Also, coffee. Many, many thanks to coffee.
It is all about family. I would like to thank my daughter Chantal, for her insightful mind, her gentle presence, and for the joy she brings to every day. Thanks to my daughter Ravel, for her uniqueness, and for having the courage to follow her dream and become a chemist. Thanks to my son Tristan, for his math and computer skills, and for his help moving rocks in the backyard. I would also like to acknowledge my stillborn daughter Ellyriane Soleil. How we miss what you would have become. Finally, thanks to my loving wife, Debra, for her deep love and care for all of us, all these years. I love you all very much.