
Additional Praise for Strategies in Biomedical Data Science: Driving Force for Innovation

“The allure of data analytics is in knowing what is currently unknowable by identifying patterns in apparent chaos. If these insights could be applied in the healthcare field to individualized patient care, the improvement in outcomes could be profound indeed. This type of research and innovation right here in Tempe (ASU) demonstrates why ASU is ranked as the number one most innovative university in the nation.

“Industry analysts expect there will be three to four connected Internet of Things (IoT) devices for every person on the planet by 2020. Healthcare can and is leading the way in IoT adoption. To prepare for the coming deluge of IoT data, healthcare IT organizations should be investing in data analytics capability to convert that raw data flood into actionable information that delivers better healthcare outcomes.”

—Steve Phillips

Senior Vice President and Chief Information Officer, Avnet, Inc.

Twitter: @Avnet

“I think it is really great that Jay Etchings is working on this; the dearth of information for dealing with large, complex biomedical data sets makes building systems capable of supporting precision medicine very challenging. I would say that we are not yet at the “blueprint” stage, but we certainly can use help in getting the right people thinking about this, so we can build the recipes going forward. While true clinical application at scale is still not here, we are rapidly approaching that event horizon, and as we have learned in biomedical research, the infrastructure challenges alone require careful planning and very deliberate applications of the proper technologies to deal with the vast amount of data that is generated. The algorithms to automate things such as true clinical decision guidance have yet to be written, and although some approaches such as neuro-linguistic programming or machine learning look promising, actually creating a “doc in a box” is probably many years off. This does not mean we should not be striving to move forward as rapidly as possible, because the impact that can be had on a patient’s life is truly inspirational and that should always be remembered. This is not building systems to showcase technology or how smart we are, it is to help propel a truly world changing methodology of how medicine is practiced.”

—James Lowey, CIO

TGen, The Translational Genomics Research Institute

Twitter: @loweyj, @Tgen

“The journey to precision medicine will require the confluence and analysis of enormous amounts of data from genomics, clinical and fundamental research, clinical care, and environmental and lifestyle data, including connected health data from the “Internet of Medical Things.” The entire healthcare ecosystem needs to work together, along with the information and communications technology ecosystems, to collect, transport, analyze, and leverage the vast amount of data that can be honed to develop insights and recommendations for precision medicine. The opportunity to improve healthcare is compelling, the data is vast and will continue to grow, and we need to work together to realize improved outcomes. We need to build the technology and process-enabled capabilities to protect the data and the people. The need for increased TIPPSS—trust, identity, privacy, protection, safety, and security—mechanisms is critical to the success and safety in our ongoing healthcare journey.”

—Florence D. Hudson

Senior Vice President and Chief Innovation Officer, Internet2

Twitter: @FloInternet2

“In the last decade, the wave of data coming off modern sequencing instruments is transforming bioscience into a digital science. Not only are the data sets enormous, the need to work through them quickly to have a real-time impact on therapy is crucial, requiring all of the elements of high-performance computing: fast compute, storage and networking, sophisticated data management, and highly parallel application codes.

“The ability to quickly crunch massive amounts of disease and patient data is at the heart of precision medicine. While much of the promise of precision medicine is still on the horizon, advances have already led to life-saving treatments for children and adults with lethal cancers and genetic diseases. At the Center for Pediatric Genomic Medicine (CPGM) at Children’s Mercy Hospital in Kansas City, MO, researchers used 25 hours of supercomputer time to decode the genetic variants of an infant suffering from liver failure. Thanks to the fast genomic diagnosis, doctors were able to proceed with the most effective treatment and the baby is alive and well.”

—Tiffany Trader

Managing Editor, HPCwire

Wiley & SAS Business Series

The Wiley & SAS Business Series presents books that help senior-level managers with their critical management decisions.

Titles in the Wiley & SAS Business Series include:

For more information on any of the above titles, please visit www.wiley.com.

Strategies in Biomedical Data Science

Driving Force for Innovation



Jay Etchings

Foreword

The emergence of data science is radically transforming the biomedical knowledge generation paradigm. While modern biomedicine has been a pioneer in evidence-based science, its approach for decades has largely followed a well-worn path of experimental design, data collection, analysis, and interpretation. Data science introduces an alternative pathway—one that starts with the vast collections of diverse digital data increasingly accessible to the community.

While the data science evidence generation concept has many birth parents, Jim Gray of Microsoft best described the unique opportunity afforded by this new paradigm. In a 2007 address to the U.S. National Research Council, Gray argued: “With an exaflood of unexamined data and teraflops of cheap computing power, we should be able to make many valuable discoveries simply by searching all that information for unexpected patterns” [1]. Gray coined the phrase “data-intensive scientific discovery.” Notably, he broke with the high-performance computing “high priests” and advocated the adoption of new models of computing. Following Gray’s untimely death shortly after his address, his colleagues captured this concept in a collection of essays ultimately published as The Fourth Paradigm: Data-Intensive Scientific Discovery [2]. It was within these essays that the term “big data” was introduced.

“Data science” and “big data” are now overburdened terms with many meanings. The most useful definitions are operational in nature. One of the most colorful comes from John Myles White of Facebook, who indicates that big data is any problem “so large that traditional approaches to data analysis are doomed to failure” [3]. I find the definition of the chief architect of Data.gov, Philip Ashlock, most elucidating: “Analysis that can help you find patterns, anomalies, or new structures amidst otherwise chaotic or complex data points” [3].

Data science remains controversial in biomedicine. Jeff Drazen, the editor in chief of the New England Journal of Medicine, has described data science practitioners as “research parasites” [4]. More subtly, Robert Weinberg openly questions whether such approaches have any potential to generate real insight in his article describing an emerging crisis in understanding cancer, “Coming Full Circle—From Endless Complexity to Simplicity and Back Again” [5].

I have been an eyewitness and co-conspirator in the data science transformation occurring in biomedicine. I grew up with the Human Genome Project and the rapid accumulation of large volumes of data it generated. I have made contributions through the “Discovery Science” paradigm that the Genome Project made acceptable in biomedicine. For example, my colleagues and I at the Cooperative Human Linkage Center were early adopters of computational science and the Internet (then NSFnet) in our efforts to construct the map of human inheritance [6]. For us at the time, big data topped out at a gigabyte! While serving as the founding director of the Center for Biomedical Informatics and Information Technology at the National Institutes of Health’s National Cancer Institute, I was tasked with helping bring data science to the cancer community. The charge was broad, including basic science, clinical research, and health encounter data. It was technologically challenging, predating many technology paradigms now taken for granted as standard in data science. Through these pioneering efforts, I experienced the aforementioned controversial nature of data science and the second of Arthur C. Clarke’s laws: “The only way of discovering the limits of the possible is to venture a little way past them into the impossible” [7].

Strategies in Biomedical Data Science is an ambitious attempt to look at “the limits of the possible” for data science in biomedicine. Unique in its scope, it takes a comprehensive look at all aspects of data science. Work in the sciences is routinely compartmentalized and segregated among specialists. This segregation is particularly true in biomedicine as it wrestles with the integration of data science and its underpinning in information technology. While such specialization is essential for progress within disciplines, the failure to have cross-cutting discussions results in lost opportunities. This book is significant in that it purposely embraces the “transdisciplinary” nature of biomedical data science. Transdisciplinary research (a foundational aspect of Arizona State University’s “New American University”) brings together different disciplines to create innovations that are beyond the capacity of any single specialty. Data science is definitionally transdisciplinary and somewhat ironically is discipline-agnostic.

Strategies in Biomedical Data Science unapologetically mixes biology, analytics, and information technology. Its transdisciplinary topics cover diverse data types—genomic, clinical encounter, personal monitoring devices—and the data science opportunities (and challenges) in each. Within each of these topics, it provides insights into the software capabilities that are used to wrangle Gray’s “exaflood” of data and to find his “unexpected patterns.” It provides insightful discussions of the underpinning computational and network infrastructure necessary to realize the potential of data science. More specifically, it provides practical blueprints that translate Gray’s suggested alternative to traditional high-performance computing paradigms into reality. Within each of these, it provides case studies written by experts that transition the topics from concept to real-world examples. Importantly, these case studies are provided by both academics and industry sources, demonstrating the importance of both to the biomedical data science progress as well as the need to blend these often-adversarial communities.

I have had the opportunity to know the author, Jay Etchings, for over three years. Jay is a true computational renaissance man, as reflected in the breadth of topics facilely presented in Strategies in Biomedical Data Science. I was first introduced to Jay when he was an architect for Dell. Jay translated ASU’s vision for a first-generation, purpose-built data science research platform into the operational Next Generation Cyber Capability (NGCC) described in the book. The NGCC is a physical instantiation of what Gray envisioned. Now at ASU as the director of Research Computing Operations, Jay and his team deliver biomedical data science to a diverse collection of international scientists.

Jay brings a fresh perspective and a diverse pedigree of work experiences to biomedical data science. He has been at the forefront of developing and deploying big data capabilities throughout his career. For example, Jay was on the leading edge in bringing big data infrastructure to the gaming industry, a community that is always an early adopter of breakthrough technology. Jay has hands-on experience in the complexities of biomedical data from his efforts to provide support for the Centers for Medicare and Medicaid Services. Jay’s commercial background brings with it a can-do approach to problems and a low tolerance for the arcane consternation that often paralyzes academics. This fresh perspective and his enthusiasm for biomedicine pervade his writing. Strategies in Biomedical Data Science is a one-stop shop of data science essentials and is likely to serve as the go-to resource for years to come.

Ken Buetow, Ph.D.,

Professor, Arizona State University

Director, Computational Science and Informatics Core Program

Director, Complex Adaptive Systems Initiative

NOTES

  1. David Snyder. 2016. “The Big Picture of Big Data—IEEE—The Institute.” http://theinstitute.ieee.org/ieee-roundup/members/achievements/the-big-picture-of-big-data.
  2. Anthony J. G. Hey, ed. 2009. The Fourth Paradigm: Data-Intensive Scientific Discovery. Redmond, WA: Microsoft Research.
  3. Jennifer Dutcher. 2014. “What Is Big Data?” September 3. https://datascience.berkeley.edu/what-is-big-data/.
  4. Dan L. Longo and Jeffrey M. Drazen. 2016. “Data Sharing.” New England Journal of Medicine 374, no. 3: 276–277. doi: 10.1056/NEJMe1516564.
  5. Robert A. Weinberg. 2014. “Coming Full Circle—From Endless Complexity to Simplicity and Back Again.” Cell 157, no. 1: 267–271. doi: 10.1016/j.cell.2014.03.004.
  6. J. C. Murray, K. H. Buetow, J. L. Weber, S. Ludwigsen, T. Scherpbier-Heddema, F. Manion, et al. 1994. “A Comprehensive Human Linkage Map with Centimorgan Density. Cooperative Human Linkage Center (CHLC).” Science 265, no. 5181: 2049–2054.
  7. Arthur C. Clarke. 1962. “Hazards of Prophecy: The Failure of Imagination.” In Profiles of the Future: An Inquiry into the Limits of the Possible. New York: Harper & Row.

Acknowledgments

Most broadly, this book has been inspired by the need for a collaborative and multidisciplinary approach to solving the intricate puzzle that is cancer. Cancer poses a complex adaptive challenge that reaches across all domains: medicine, biology, technology, and the social sciences. Transdisciplinary collaboration is the only true path to the future. Ubiquitous research computing in support of “open science” and open big data has an essential role to play in this collaborative process.

More specifically, this book is dedicated to Sue Stigler and the family she leaves behind. Her three-and-a-half-year battle with cancer came to a close on December 7, 2015. Sue’s kindness and devotion, and her endless support for others even while ill, were remarkable; her selflessness will always be remembered. If you would like to donate to the Stigler family college fund, please visit their GoFundMe page, https://www.gofundme.com/bpebavas.

Author proceeds support childhood brain cancer research through an ASU Foundation account supporting Dr. Joshua LaBaer’s work in the Biodesign Institute. Dr. LaBaer is conducting cutting-edge research on pediatric low-grade astrocytomas (PLGAs), which are the most common cancers of the brain in children.

In the research and discovery leading to this book, I have worked with more amazing and committed individuals than I could have ever imagined. My mentor and friend Ken Buetow is fond of saying, “If you’re the smartest person in the room, you are in the wrong room.” Time and again I have been in the right room. I am able to count some of the smartest people on the planet as colleagues and friends. Publication of this book was made a reality by their support and example.

A very special thanks to my good friend Phil Simon for convincing me to put thoughts, concepts, and theory on paper and share it with the world.

At Arizona State University I would like to thank Gordon Wishon, Dr. Elizabeth Cantwell, and Dr. Sethuraman Panchanathan (“Panch”) for giving me the opportunity to drive innovation at the university.

I would also like to recognize the dedication of our Research Computing team at Arizona State University for the continued commitment to our “commander’s intent” and to Christopher Myhill for sharing the commander’s intent with me while at Dell Enterprise.

Tremendous thanks to the teamwork of Jon McNally, Johnathan “Jr.” Lee, Lee Reynolds, Ram Polur, Daniel Penaherrera, Sheetal Shetty, James Napier, Tiffany Marin, Deborah Whitten, Curtis Thompson, Srinivasa Mathkur, Marisa Brazil, and of course Carol Schumacher, arguably the best administrative assistant alive. Special thanks also to Wendy “DigDug” Cegielski for her editing hours and continued motivation; next year you will be Dr. Wendy.

In no specific order I also would like to thank this list of super-smart and generous folks as well as our many terrific and invaluable partners: NimbleStorage, Brocade, Internet2, ESNET, Penguin Computing, TGEN, SwiftStack, MarkLogic, the Open Daylight Foundation, the Linux Foundation, Open Networking Foundation, IT Partners, friends at University of Arizona, Northern Arizona University, Dell Enterprise, University of Massachusetts-Lowell, Baylor University, Washington State University, Georgia Tech, Broad Institute of MIT and Harvard, University of Nevada Las Vegas, and the College of Southern Nevada (formerly CCSN), and thanks for the support and mentorship from domain professionals both public and private like Mark “Pup” Roberts, Brandon Mikkelsen, Sean Dudley, Joel Dudley, James Lowey, Todd Decker, Jeff Creighton, Jim Scott, Gregory Palmer, Neela Jacques, Al Ritacco, and of course my engineer stepbrother Pedro Victor Gomes.

Last but certainly not least, I would like to recognize my awesome team of Jacob, Dixon, and Annika for their enduring patience throughout the never-ending collecting of the data and experience that make up this text.

Heather, though you have departed from my arms, there is always a place for you in my heart.