No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifi cally disclaim all warranties, including without limitation warranties of fi tness for a particular purpose. No warranty may be created or extended by sales or promotional materials.
The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley publishes in a variety of print and electronic formats and by print-on-demand. Some material included with standard print versions of this book may not be included in e-books or in print-on-demand. If this book refers to media such as a CD or DVD that is not included in the version you purchased, you may download this material at http://booksupport.wiley.com. For more information about Wiley products, visit www.wiley.com.
Library of Congress Control Number: 2015956232
Trademarks: Wiley, and the Wiley logo are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affi liates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. John Wiley & Sons, Inc., is not associated with any product or vendor mentioned in this book.
About the Authors
Ralph Kimball founded the Kimball Group. Since the mid-1980s, he has been the DW/BI industry's thought leader on the dimensional approach and has trained more than 20,000 IT professionals. Prior to working at Metaphor and founding Red Brick Systems, Ralph co-invented the Star workstation at Xerox's Palo Alto Research Center (PARC). Ralph has a Ph.D. in Electrical Engineering from Stanford University.
Margy Ross is President of the Kimball Group and Decision Works Consulting. She has focused exclusively on data warehousing and business intelligence since 1982. Margy has consulted with hundreds of clients and taught DW/BI best practices to tens of thousands. Prior to working at Metaphor and co-founding DecisionWorks Consulting, she graduated with a B.S. in Industrial Engineering from Northwestern University.
Credits
Project Editor
Tom Dinse
Production Editor
Christine O'Connor
Copy Editor
Kim Cofer
Manager of Content Development & Assembly
Mary Beth Wakefield
Marketing Director
David Mayhew
Marketing Manager
Carrie Sherrill
Professional Technology & Strategy Director
Barry Pruett
Business Manager
Amy Knies
Associate Publisher
Jim Minatel
Project Coordinator, Cover
Brent Savage
Proofreader
Nancy Carrasco
Indexer
Johnna VanHoose Dinse
Cover Designer
Wiley
Cover Image
The Kimball Group
Warren Thornthwaite (1957–2014)
The Kimball Group lost Warren to a brain tumor in 2014. He wrote many insightful articles that appear in this Reader. All of us at the Kimball Group miss Warren dearly—his intellect, curiosity, creativity, and most especially, his friendship and sense of humor. As any of you who met Warren will attest, he was truly one of a kind!
Acknowledgments
First, we want to thank the 33,000 subscribers to the Kimball Design Tips, and the uncounted numbers who have visited the Kimball Group website to peruse our archive. This book brings the remastered Design Tips and articles together in what we hope is a very usable form.
The Kimball Group Reader would not exist without the assistance of our business partners. Kimball Group members Bob Becker, Joy Mundy, and Warren Thornthwaite wrote many of the valuable articles and Design Tips included in the book. Thanks to Julie Kimball for her insightful comments. Thanks also to former Kimball Group member Bill Schmarzo for his contributions on analytic applications.
Thanks to our clients and students who have embraced, practiced, and validated the Kimball methods with us. We have learned as much from you as you have from us!
Jim Minatel, our executive editor at Wiley Publishing, project editor Tom Dinse, and the rest of the Wiley team have supported this project with skill, encouragement, and enthusiasm. It has been a pleasure to work with them.
To our families, thank you for your support over the twenty year span during which we wrote these Design Tips and articles. Julie Kimball and Scott Ross: We couldn't have done it without you! And, of course, thanks to our children, Sara Kimball Smith, Brian Kimball, and Katie Ross, who have grown into adults over the same time!
Introduction
The Kimball Group's article and Design Tip archive has been the most popular destination on our website (www.kimballgroup.com). Stretching back twenty years to Ralph's original 1995 DBMS magazine articles, the archive explores more than 250 topics, sometimes in more depth than provided by our books or courses.
With The Kimball Group Reader, Second Edition, we have organized all of the articles in a coherent way. But The Reader is more than merely a collection of our past magazine articles and Design Tips verbatim. We have trimmed the redundancy, made sure all the articles are written with the same consistent vocabulary, and updated many of the figures. This is a new and improved remastered compilation of our writings.
After considerable discussion, we decided to update many time references and edit content throughout the book to provide the perspective of 2015 rather than leaving old dates or outdated concepts in the articles. Thus an article written in 2007 may use 2015 in an example! When articles refer to the number of years that have passed, we have updated these references relative to 2015. For example, if a 2005 article originally said “during the past five years,” the article now reads “during the past fifteen years.” Mentions regarding our years of experience, number of books sold, articles written, or students taught have also been updated to 2015 figures. Finally, we occasionally changed references from outmoded technologies such as “modems” to more modern technologies, especially “internet.” We trust these changes will not mislead or cause confusion, but rather make your reading experience more natural.
Intended Audience and Goals
The primary reader of this book should be the analyst, designer, modeler, or manager who is delivering a data warehouse in support of business intelligence. The articles in this book trace the entire lifecycle of DW/BI system development, from original business requirements gathering all the way to final deployment. We believe that this collection of articles serves as a superb reference-in-depth for literally hundreds of issues and situations that arise in the development of a DW/BI system.
The articles range from a managerial focus to a highly technical focus, although in all cases, the tone of the articles strives to be educational. These articles have been accessed thousands of times per day on the Kimball Group website over a span of 20 years, so we're confident they're useful. This book adds significant value by organizing the archive, and systematically editing the articles to ensure their consistency and relevance.
Preview of Contents
Following two introductory chapters, the book's organization will look somewhat familiar to readers of The Data Warehouse Lifecycle Toolkit, Second Edition (Wiley, 2008) because we've organized the articles topically to correspond with the major milestones of a data warehouse/business intelligence (DW/BI) implementation. Not surprisingly given the word “Kimball” is practically synonymous with dimensional modeling, much of The Reader focuses on that topic in particular.
Chapter 1: The Reader at a Glance. We begin the book with a series of articles written by Ralph several years ago for DM Review magazine. This series succinctly encapsulates the Kimball approach in a cohesive manner, so it serves as a perfect overview, akin to CliffsNotes, for the book.
Chapter 2: Before You Dive In. Long-time readers of Ralph's articles will find that this chapter is a walk down memory lane, as many of the articles are historically significant. Somewhat amazingly, the content is still very relevant even though most of these articles were written in the 1990s.
Chapter 3: Project/Program Planning. With an overview and history lesson under your belt, Chapter 3 moves on to getting the DW/BI program and project launched. We consider both the project team's and sponsoring stakeholders' responsibilities, and then delve into the Kimball Lifecycle approach.
Chapter 4: Requirements Definition. It is difficult to achieve DW/BI success in the absence of business requirements. This chapter delivers specific recommendations for effectively eliciting the business's needs. It stresses the importance of organizing the requirements findings around business processes, and suggests tactics for reaching organizational consensus on appropriate next steps.
Chapter 5: Data Architecture. With a solid understanding of the business requirements, we turn our attention to the data (where we will remain through Chapter 11). This chapter begins with the justification for dimensional modeling. It then describes the enterprise data warehouse bus architecture, discusses the agile development approach to support data warehousing, provides rationalization for the requisite integration and stewardship, and then contrasts the Kimball architecture with the Corporate Information Factory's hub-and-spoke.
Chapter 6: Dimensional Modeling Fundamentals. This chapter introduces the basics of dimensional modeling, starting with distinguishing a fact from a dimension, and the core activities of drilling down, drilling across, and handling time in a data warehouse. We also explore familiar fables about dimensional models.
Chapter 7: Dimensional Modeling Tasks and Responsibilities. While Chapter 6 covers the fundamental “what and why” surrounding dimensional modeling, this chapter focuses on the “how, who, and when.” Chapter 7 describes the dimensional modeling process and tasks, with the aim of organizing an effective team, whether starting with a blank slate or revisiting an existing model.
Chapter 8: Fact Table Core Concepts. The theme for Chapter 8 could be stated as “just the facts, and nothing but the facts.” We begin by discussing granularity and the three fundamental types of fact tables, and then turn our attention to fact table keys and degenerate dimensions. The chapter closes with a potpourri of common fact table patterns, including null, textual, and sparsely populated metrics, as well as facts that closely resemble dimension attributes.
Chapter 9: Dimension Table Core Concepts. We shift our focus to dimension tables in Chapter 9, starting with a discussion of surrogate keys and the ever-present time (or date) dimensions. We then explore role playing, junk, and causal dimension patterns, before launching into a thorough handling of slowly changing dimensions, including four new advanced dimension types. Hang onto your hats.
Chapter 10: More Dimension Patterns and Considerations. Chapter 10 complements the previous chapter with more meaty coverage of dimension tables. We describe snowflakes and outriggers, as well as a significantly updated section on bridges for handling both multi-valued dimension attributes and ragged variable hierarchies. We discuss nuances often encountered in customer dimensions, along with internationalization issues. The chapter closes with a series of case studies covering insurance, voyages and networks, human resources, finance, electronic commerce, text searching, and retail; we encourage everyone to peruse these vignettes as the patterns and recommendations transcend industry or application boundaries.
Chapter 11: Back Room ETL and Data Quality. We switch gears from designing the target dimensional model to populating it in Chapter 11. Be forewarned: This is a hefty chapter, as you'd expect given the subject matter. This updated edition of the Reader has a wealth of new material in this chapter. We start by describing the 34 subsystems required to extract, transform, and load (ETL) the data, along with the pros and cons of using a commercial ETL tool. From there, we delve into data quality considerations, provide specific guidance for building fact and dimension tables, and discuss the implications of real-time ETL.
Chapter 12: Technical Architecture Considerations. It's taken us until Chapter 12, but we're finally discussing issues surrounding the technical architecture, starting with server oriented architecture (SOA), master data management (MDM), and packaged analytics. A new section on big data features two in-depth Kimball Group white papers written by Ralph. Final sections in this chapter focus on the presentation server, including the role of aggregate navigation and online analytical processing (OLAP), user interface design, metadata, infrastructure, and security.
Chapter 13: Front Room Business Intelligence Applications. In Chapter 13, we step into the front room of the DW/BI system where business users are interacting with the data. We describe the lifecycle of a typical business analysis, starting with a review of historical performance but not stopping there. We then turn our attention to standardized BI reports before digging into data mining and predictive analytics. The chapter closes by exploring the limitations of SQL for business analysis.
Chapter 14: Maintenance and Growth Considerations. In this penultimate chapter, we provide recommendations for successfully deploying the DW/BI system, as well as keeping it healthy for sustained success.
Chapter 15: Final Thoughts. The Reader concludes with final perspectives on data warehousing and business intelligence from each Kimball Group principal. The insights range from the most important hard won lessons we have learned to some glimpses of what the future of data warehousing may hold.
Navigation Aids
Given the breadth and depth of the articles in The Kimball Group Reader, we have very deliberately identified over two dozen articles as “Kimball Classics” because they captured a concept so effectively that we, and many others in the industry, have referred to these articles repeatedly over the past twenty years. The classic articles are designated with a special icon that looks like this:
We expect most people will read the articles in somewhat random order, rather than digesting the book from front to back. Therefore, we have put special emphasis on The Reader's index as we anticipate many of you will delve in by searching the index for a particular technique or modeling situation.
Terminology Notes
We are very proud that the vocabulary established by Ralph has been so durable and broadly adopted. Kimball “marker words” including dimensions, facts, slowly changing dimensions, surrogate keys, fact table grains, factless fact tables, and degenerate dimensions, have been used consistently across the industry for more than twenty years. But in spite of our best intentions, a few terms have morphed since their introduction; we have retroactively replaced the old terms with the accepted current ones.
Artificial keys are now called surrogate keys.
Data mart has been replaced with business process dimensional model, business process subject area, or just subject area, depending on the context.
Data staging is now known as extract, transform, and load.
End user applications have been replaced by business intelligence applications.
Helper tables are now bridge tables.
Since most people won't read this book from cover to cover, we need to introduce some common abbreviations up front:
DW/BI is shorthand for the end-to-end data warehouse/business intelligence system. This abbreviation is useful for brevity, but it also explicitly links data warehousing and business intelligence as codependent. Finally, it reflects the shift of emphasis from the data warehouse being an end in itself to business intelligence (BI) really driving everything we do. After all, the data warehouse is the platform for all forms of BI.
Many figures in The Reader include the DD, FK, and PK abbreviations, which stand for degenerate dimension, foreign key, and primary key, respectively.
ETL means extract, transform, and load, the standard paradigm for acquiring data and making it ready for exposure to BI tools.
ER refers to entity-relationship. We frequently use ER when we discuss third normal form (3NF) or normalized data models, as opposed to dimensional data models.
OLAP stands for online analytical processing, typically used to differentiate dimensional models captured in a multidimensional database or cube from dimensional models in a relational DBMS called star schemas. These relational star schemas are sometimes referred to as ROLAP.
SCD is the abbreviation for slowly changing dimension, referring to the techniques we've established for handling dimension attribute changes.