Microsoft® SQL Server® 2008 Integration Services

Contents

Introduction

Chapter 1: SSIS Solution Architecture

Problem

Macro Challenge: Enterprise Data Spaghetti

Micro Challenge: Data-Processing Confusion

Problems with Execution and Troubleshooting

Infrastructure Challenges

Other Challenges

Design

Choosing the Right Tool

Overall Solution Architecture

Data Integration or Consolidation

Data Warehouse ETL

Project Planning and Team Development

Agile Benefits

Agile Cautions and Planning

Data Element Documentation

Package Design Patterns

Modular Packages

Master Packages

Server and Storage Hardware

Server Hardware

Development and Test Servers

ETL Collocation

Storage Hardware

Package Execution Location

Package Storage Location Versus Execution Location

Execute SQL Task and Bulk Insert Task Execution

Package Execution and the Data Flow

Design Review

Solution

Setting the Stage: Management and Deployment

Source Data: Files, Tables, and Data Cleansing

Data Warehouse ETL and Cube Processing

Advanced ETL: Scripting, High Availability, and Performance

Summary

Chapter 2: SSIS Management Framework Design

Problem

Challenges of Not Having a Framework Implementation

Different Development Methods

Changing Metadata

Getting the Right Information to the Right People at the Right Time

Reusability

Framework Implementation

Configuration Management Scheme

Logging and Auditing

Template Package

Framework Benefits and Value

Design

Configuration Management

Overall Design

Environment Variable Configuration

XML Configuration File

SQL Server Configuration

Logging and Auditing Mechanism

Storage

Tables

Stored Procedures

Template Package

Implementation Guidelines

Solution

Configuration Management

Logging and Auditing Mechanism

Storage and Tables

Stored Procedures

Template Package

Development

Installation

Other Considerations

Customizations

ETL Process Framework

Process Owner

Reporting

Summary

Chapter 3: Package Deployment and Storage Decisions

Problem

Standardization

Environment

Application

Desired Implementation

Design

Storage Methods

SSIS Service

File System Storage

SQL Server Storage

Storage Comparison

Deployment Methods

SSIS Deployment Wizard

Manual Deployment of Packages

DTUtil Scripting

Solution

Storage Methodology

Existing Infrastructure

Security

Package Makeup

Back to the Flowchart

Deployment Methodology

Storage Method

Existing Infrastructure

Deployment Needs

Back to the Flowchart

Total Methodology

Summary

Chapter 4: File-Handling and Processing Methods

Problem

Simple File Operations

Remote File Operations

File Criteria

File Iteration

File Notification

Design

File System Task

Source and Destination Connection

Attributes

Overwriting the Destination

Multiple Files

FTP Task

FTP Connection

HTTP

Foreach Loop

Applying Advanced Criteria

File Watcher Task

WMI Event Watcher Task

Solution

Summary

Chapter 5: Data Extraction Best Practices

Problem

Extraction Data Criteria

Source System Impact

Incremental Extraction

Deleted Rows

Staging Database

Data Lineage and Metadata

File Sources

Design

Package Connections and Source Adapters

Package Connections

Source Adapters

Incremental Data Extraction

Incremental Extraction Using a Change Identifier Value

Targeting Change Records through Dynamic Source Queries

Retrieving Incremental Identifier Values and Updating Package Variables

Capturing the Maximum Change Identifier Value

Incremental Extraction from SQL Server without a Trigger

Using SSIS to Handle All Aspects of an Incremental Extraction

Data Staging Method

Tracking Data Lineage Identifiers

Solution

Metadata-Driven Extraction Example

Metadata Tables

Control Flow

Looping through the Sources

Inside the Sources Loop

The Script

Read Variables

Open Connections

Get Max Change Indicator

Extract Changed Source Rows

Close Connections

Summary

Chapter 6: Data-Cleansing Design

Problem

Candidate Key Profiles

Column Length Distribution Profiles

Column Null Ratio Profiles

Column Pattern Profiles

Column Statistics Profiles

Column Value Distribution Profiles

Functional Dependency Profiles

Design

Using the Script Transform for Data Scrubbing

Using the Fuzzy Grouping Transform to De-duplicate Data

Using the Fuzzy Lookup Transform to Cleanse Data

Dealing with Multiple Record Types in a Single File

Using the Raw File

Solution

Summary

Chapter 7: Dimension Table ETL

Problem — Fundamental Dimension ETL

Dimensions: The Basics

Dimension ETL: The Challenge

Design — Fundamental Dimension ETL

Data Preparation

Dimension Change Types

Type 1 (Changing) Attribute: A Closer Look

Type 2 (Historical) Attribute: A Closer Look

Inferred Members

Solution — Fundamental Dimension ETL

Preparing Your Source Data for Dimension ETL

SSIS Slowly Changing Dimension Wizard

Advanced Properties and Additional Outputs of the SCD

Problem — Advanced Dimension ETL

SCD Wizard Advantages and Disadvantages

Dimension Volume and Complexity

Design — Advanced Dimension ETL

Optimizing the Built-in SCD

Index Optimizations

Update Optimizations

Snowflake Dimension Tables

Parent-Child Dimension ETL

Date Dimension ETL

Profile Dimension and Junk Dimension ETL

Creating a Custom Slowly Changing Package

Solution — Advanced Dimension ETL

Snowflake Dimensions

Parent-Child Dimensions

Profile and Junk Dimensions

Date Dimension ETL

Custom Dimension ETL

Determining Dimension Changes

Inserts and Updates

Summary

Chapter 8: Fact Table ETL

Problem

Fact Tables: The Basics

Fact Table ETL: The Challenge

Preparing the Data

Mapping Dimension Keys

Calculating Measures

Adding Metadata

Fact Table Updates

Fact Table Inserts

Design

Data Preparation

Data Preparation with SSIS Transformations

Data Preparation Examples

Acquiring the Dimension Surrogate Key in SSIS

Identifying the Dimension Surrogate Keys with SSIS

Surrogate Key Examples in SSIS

Measure Calculations

Measure Calculation Types

Handling Measure Calculations in SSIS

Managing Fact Table Changes

Approaches to Identifying Changed Fact Records

Fact Update Examples in SSIS

Optimizing Fact Table Inserts

Optimizing Inserts with Fast Load

Optimizing Inserts with Index Management

Solution

Internet and Reseller Sales Fact Table ETL

Fact Internet and Reseller Sales Extraction and Transform Process

Fact Internet and Reseller Sales Load Process

Snapshot Fact Table Example — Call Center Fact Table

Advanced Fact Table ETL Concepts

Handling Missing Dimension Lookups

Handling Late-Arriving Facts

Summary

Chapter 9: SSAS Processing Architecture

Problem

SSAS Object Population

Schedule

Partition Management

The Complete Package

Design

SSAS Objects and Processing Basics

Dimensions

Partitions

Mining Structures

SSIS Tasks and Components

Analysis Services Processing Task

Analysis Services Execute DDL Task

Execute Process Task with ASCMD

Data Flow Destinations for SSAS Objects

Script Task with AMO

Creating and Managing Partitions

Overall Design

Solution

Preparation for SSAS Integration

Process Dimensions Package

Process Task

Parallel XMLA

Process Partitions Package

Storing and Loading Metadata

SSAS Processing

Overall Solution

Summary

Chapter 10: Implementing Scale-Out ETL Process

Problem

Design

Design Components Overview

Central Common Services Server

File Processor and Pre-Aggregation Scale-Out Processes Servers

Design Details

File Management Tasks

Data File Management

Work Allocation Process

Scale-Out Source File Process

Work Reassignment Process

Data Aggregation Tasks

Hourly Data Pre-Aggregation Process

Hourly Data Aggregation Process

Daily Data Aggregation Process

Archival and Clean-up Processes

Data File Archival Process

Stage Table Clean-up Process

Design Conclusions

Solution

Central Server Services

Multiple File Processor and Pre-Aggregator Processes

Database Tables Required on the Central Server

Stored Procedures

Procedures on the Central Server

Procedures on the Staging Servers

SSIS Packages

File-Processing Server Packages

Central Server Packages

Summary

Chapter 11: Scripting Design Patterns

Problem — Advanced File Management

Script Task

Scenario

Design — Advanced File Management

Script Language

Accessing Objects

Custom Assemblies

Scripting

Solution — Advanced File Management

Create an External Assembly

Access the External Assembly

Archive the Files

Summary — Advanced File Management

Problem — Call Center Fact ETL

Reasons to Use Scripting

Scenario

Design — Call Center Fact ETL

Component Type

Output Type

Synchronous Outputs

Asynchronous Outputs

Other Component Types

Design Choice

Overridable Methods

Solution — Call Center Fact ETL

Package Setup

Script Component

Reusable Methods

Row-By-Row Processing

Process Input

Entire Script

Package Wrap-Up

Summary — Call Center Fact ETL

Summary

Chapter 12: SSIS Package Scaling

Problem

Identifying Task Durations

Identifying Data Flow Destination Issues

Identifying Transformation and Memory Bottlenecks

Identifying Data Flow Source Performance Issues

Design

Balancing SQL Operations with the Data Flow

Data Flow Advantages as Compared with SQL

SQL Advantages when Compared with the Data Flow

Applying SQL Code in Your SSIS Packages

SSIS Pipeline Architecture Concepts

Data Flow Design Example

SQL-Centric SSIS Process

Rewritten Data Flow–Centric SSIS Process

Solution

Tuning Your Data Flow

Use Blocking Transformations Sparingly

Limit Row-by-Row Operations

Manage Data Flow Backpressure

Pre-sort Sources as an Alternative to the Sort Transformation

Optimize the Lookup and Managing the Cache Size

Remove Unused Columns from the Data Flow

Be Mindful of the SSIS Logging Impact

Regulate Transactions

Setting Data Flow Properties

Up the EngineThreads Value

Optimize the Temporary Storage Locations

Leave RunInOptimizedMode as True

Tune Buffers

Database and Destination Optimization

Limiting Database Growth

Consider Dropping and Re-creating Indexes on Destination Tables

Using the OLE DB Destination Adapter

Use Advanced Oracle and Teradata Destination Adapters

Handling Partitioned Tables

Summary

Index

Microsoft® SQL Server® 2008 Integration Services: Problem–Design–Solution

Published by
Wiley Publishing, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256

www.wiley.com

Published by Wiley Publishing, Inc., Indianapolis, Indiana

Published simultaneously in Canada

ISBN: 978-0-470-52576-0

Manufactured in the United States of America

10 9 8 7 6 5 4 3 2 1

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.

Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this work may have changed or disappeared between when this work was written and when it is read.

For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Control Number: 2009935233

Trademarks: Wiley, the Wiley logo, Wrox, the Wrox logo, Programmer to Programmer, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.

To my family: Amy, Meg, Nate, Kate, and Caleb

—Erik Veerman

About the Authors

Erik Veerman is a Mentor for Solid Quality Mentors, focusing on training, mentoring, and architecting solutions on the SQL Server Business Intelligence (BI) platform. His industry recognition includes Microsoft’s Worldwide BI Solution of the Year and SQL Server Magazine’s Innovator Cup winner. Veerman has designed dozens of BI solutions across a broad business spectrum — telecommunications, marketing, retail, commercial real estate, finance, supply chain, and information technology. His experience with high-volume, multi-terabyte environments and SQL Server 64-bit has enabled clients to scale their Microsoft-based BI solutions for optimal potential. As an expert in on-line analytical processing (OLAP) design, extract, transform, and load (ETL) processing, and dimensional modeling, Veerman is a presenter, author, and instructor. He led the ETL architecture and design for the first production implementation of SQL Server Integration Services (SSIS), and helped to drive the ETL standards and best practices for SSIS on Microsoft’s SQL Server 2005 reference initiative, Project REAL. Veerman is also co-author of Professional SQL Server 2008 Integration Services (Indianapolis: Wiley, 2008), Expert SQL Server 2005 Integration Services (Indianapolis: Wiley, 2006), Professional SQL Server 2005 Integration Services (Indianapolis: Wiley, 2007), and lead author for the Microsoft Press SQL Server 2005 and SQL Server 2008 series, focusing on Business Intelligence Development and Implementation. As a resident of Atlanta, Georgia, Veerman participates in the local Atlanta SQL Server User’s Group, a Professional Association of SQL Server (PASS) chapter.

Jessica M. Moss (SQL Server MVP) is a well-known practitioner, trainer, author, and speaker for Microsoft SQL Server Business Intelligence (BI). As a Mentor with Solid Quality Mentors, she has created numerous data warehousing solutions for a variety of companies in the retail, Internet, health services, finance, and energy industries, and has delivered training courses on SQL Server Integration Services (SSIS), SQL Server Retail Services (SSRS), and SQL Server Analysis Services (SSAS). While working for a major clothing retailer, Moss participated in the SQL Server 2005 TAP program, where she developed best implementation practices for SSIS. Moss has authored technical content for multiple magazines and Web sites, and has spoken internationally at conferences such as the Professional Association for SQL Server (PASS) Community Summit, SQL Teach International Conference, and the devLink Technical Conference. As a strong proponent of developing user-to-user community relations, Moss organizes meetings for a User Group that discusses different technologies. She also actively participates in local User Groups and code camps in central Virginia. In addition, Moss volunteers her time to help educate people through the PASS organization.

Brian Knight (SQL Server MVP, MCITP, MCSE, MCDBA) is the owner and founder of Pragmatic Works. He is also the co-founder of SQLServerCentral.com and JumpstartTV.com. He runs the local SQL Server Users Group in Jacksonville, Florida (JSSUG). Brian is a contributing columnist for several technical magazines and does regular Webcasts at “Jumpstart TV.” He is the author of ten books on the SQL Server technology. Knight has spoken at conferences such as the Professional Association of SQL Server (PASS), SQL Connections, and TechEd, as well as many Code Camps. His blog can be found at www.pragmaticworks.com. Knight lives in Jacksonville, where he loves watching the National Football League Jaguars lose on a regular basis.

Jay Hackney is a mentor and consultant for Solid Quality Mentors, where he helps customers utilize Microsoft technology to build real-world solutions. He has 15 years of development and consulting experience on the Microsoft data platform, and has worked with SQL Server Integration Services (SSIS) since the earliest private beta. Since then, he has used SSIS to deliver data warehousing, extract, transform, and load (ETL), and data migration solutions for many well-known companies, and has presented on SSIS topics at events such as Professional Association for SQL Server (PASS) Summit, Microsoft Tech-Ed, and SQL Server product launches.

Rushabh Mehta is a Mentor for Solid Quality Mentors, and is also the Managing Director for Solid Quality India Pvt. Ltd. Mehta has actively worked with the Microsoft SQL Server and Business Intelligence (BI) platform since 1998. He has developed numerous BI solutions, including predictive-analysis systems, health-care solutions, multi-terabyte financial decision support systems, as well as designing and building an analytical platform to support the Microsoft Windows Azure platform. His experience spans a wide breadth of industrial sectors, including manufacturing, retail, finance, and government, working with clients such as Raymond James Financials, Jackson Hewitt, Publix, the U.S. Department of Veterans Affairs, the U.S. Department of Defense, and Microsoft. Mehta has also developed a number of best practices and standards for implementing BI solutions. He has also developed and delivered advanced training courses on Microsoft BI practices and technologies. An active speaker at a number of large conferences such as the Professional Association for SQL Server (PASS) and TechEd, Mehta also frequently speaks at SQL Server and BI User Groups around the world. Mehta is the Executive Vice President for PASS, the largest global user-driven community for SQL Server professionals.

Credits

Executive Editor
Robert Elliott

Project Editor
Kevin Shafer

Technical Editor
Paul S. Waters

Production Editor
Kathleen Wisor

Copy Editor
Paula Lowell

Editorial Director
Robyn B. Siesky

Editorial Manager
Mary Beth Wakefield

Production Manager
Tim Tate

Vice President and Executive Group Publisher
Richard Swadley

Vice President and Executive Publisher
Barry Pruett

Associate Publisher
Jim Minatel

Proofreader
Scott Klemp, Word One

Indexer
Robert Swanson

Acknowledgments

I want to first thank Jessica, Brian, and Jay for adding their valuable expertise and experience to several of the chapters of this book, and sharing their approach and code practices from the SSIS framework, to the dynamic extraction logic, to data cleansing best practices. A big collective thanks to our clients! The unique projects and challenges helped shape and refine this content. I added up the years of experience for all of us, and these pages represent about 50 years of work with data processing and ETL — and that doesn’t include late nights and long weekends.

I also want to thank Paul Waters and Rushabh Mehta for their roles in this book as Technical Editor and contributing author, respectively. A big thanks to the Wrox team for their support, hard work, and valuable feedback. And a thanks to you, the reader, because I actually read the Amazon feedback and try to address the constructive criticism in each new work. Hopefully, the diligence has paid off. If so, help us and let the world know!

Last, but not least, thanks to my family for their support. A special thanks to Amy and her encouragement and hard work in so many ways.

— Erik Veerman

As a first-time book author, this acknowledgment holds special meaning to me. I will try to refrain from mentioning everyone I have ever met, and focus on a few special people that truly affected the outcome of this book.

First of all, a huge thank you is due to Erik Veerman, the lead author on this book. He not only convinced me to join this troupe of authors, but also took the time to patiently answer all of my questions about the process. Next, thank you to Jay Hackney, Brian Knight, Rushabh Mehta, Paul Waters, and Wrox for their authoring, technical, and editing contributions. I truly appreciated your support and help.

On a personal note, I would like to thank my friends and family, Mom, Dad, and Lianna, for their never-ending support and patience as I disappeared for days at a time. Finally, a few other people who I would like to recognize for their previous support are Fred Creasy, Georgia Willis, Kathi Walker, and Paul Reynolds — thank you.

— Jessica M. Moss

Thanks to everyone who made this book possible. As always, I owe a huge debt to my wife, Jenn, for putting up with my late nights, and to my children, Colton, Liam, and Camille, for being so patient with their tired dad who has always overextended. Thanks also to the makers of Guinness and other hard liquors for sustaining my ability to keep on writing technical books. Finally, thanks for all the User Group leaders out there who work so hard to help others become proficient in technology. You make a huge difference!

—Brian Knight

I’d like to thank Erik for this opportunity and his hard work on this book.

— Jay Hackney

Introduction

Our world is a world full of data and, as developers, database administrators (DBAs), or data architects, your job will no doubt require processing data — consolidating data, integrating system data, migrating data, cleansing data, transforming data, or loading data into a data warehouse. This book is about solving problems and building solutions with SQL Server 2008 Integration Services (SSIS). It focuses on the design best practices for the top uses and challenge of SSIS.

In this book, you will learn about how to approach the following kinds of projects, and how to effectively work within SSIS to apply its functionality to these common project requirements and challenges:

Designing Extraction Transformation and Loading (ETL) solutions with SSIS to handle file management, data warehouse processing, systems integration, and data cleansing.
Resolving common SSIS package deployment and auditing challenges that often get in the way of designing a solid solution.
Building an SSIS-based solution that performs well, and can deal with failures or unplanned system outages

As an author team, our vision for this book is very different than the other books on the shelves, which focus mainly on how to use the knobs and buttons in SSIS. Books such as the Professional Microsoft SQL Server 2008 Integration Services (Indianapolis: Wiley, 2008) are very valuable to have on your bookshelf for reference and for learning SSIS. But this book is about going to the next level of solution design. Our goal in writing this book was to focus on solving problems, building solutions, and providing design best practices.

In summary, the difference between this SSIS books and all the others out there is that other books simply focus on the product features with little emphasis on solution design. If you go out and buy a new power saw, the manual is going to tell you how to angle the blade, set the right depth, and make a clean cut. But the manual doesn’t tell you how to build something. This book shows you how to build the furniture, not just how to use the saw.

To be sure, you must know how to use SSIS before you can build a solution. But going from knowledge to design requires guidance on the right approach, and how to avoid the common pitfalls.

This book empowers you with the confidence, the knowledge, and the understanding to make the right choices in your ETL design that enables easy administration, meets your data processing requirements, and performs well for current and future scalability.

Who This Book Is For

This book is targeted at three primary SSIS users: data architects, ETL developers, and DBAs responsible for ETL administration and support.

Because this book focuses on problems and solutions, a base understanding of SSIS is required. A couple of areas of the book walk you through the more advanced features of SSIS, but most of the book builds on top of a foundation of SSIS knowledge and experience. If you have taken an SSIS class, or have read another book and tried out SSIS, or you have built a few packages for various purposes, then that base knowledge will give you enough background for this book. But you should be up for the challenge of learning a new tool in the context of applying it!

The perfect reader of this book is someone who is in the early stages of a new ETL or data integration project (or a redesign project), and is eager to know how to approach the effort with the best practices when designing SSIS packages.

If you are supporting an existing project and must make some changes to aid in administration, deployment, or scalability, then you will also benefit from several of the chapters herein.

What This Book Covers

This book is an expanded, revised, and rewritten version of the Expert SQL Server 2005 Integration Services book (Indianapolis: Wiley, 2007). The authors of this book have expanded the coverage to address the current trends in ETL and SSIS, including creating a scaling-out execution model, performing advanced data profiling and cleansing, and handling file management and file processing.

This book also addresses some of the challenges in SSIS surrounding auditing, configurations, and execution management. Two chapters focus on solving these administrative challenges:

Chapter 2 covers building an SSIS package framework, and provides you with an auditing, configuration, and reporting structure that you can use in the packages of your solution.
Chapter 3 also helps you in the area of administration by addressing package deployment and storage. It answers the questions, “Where should I store my packages — in the file system or in SQL Server?” and “What is the best SSIS deployment approach for my environment?” The right answer is that “it depends,” and Chapter 3 walks you through all the dependencies, while leading you to the right answer for your environment.

Every ETL or data integration solution involves data extraction of one kind or another. Maybe you’re extracting data from flat files, or perhaps your source is coming from an enterprise resource planning (ERP) system, or a line-of-business application. Regardless, you must implement a data extraction methodology that is efficient, reduces the impact on the source, and adequately handles changes and data tracking. Chapter 5 dives into many of the data extraction areas, and even provides you with a dynamic data extraction approach.

Another area that this book covers is data warehouse ETL. In fact, one of the more common uses of SSIS is data warehousing ETL. Chapters 7, 8, and 9 are dedicated to solving the data warehouse ETL challenges all the way from dimension and fact table ETL, to processing cubes in SQL Server Analysis Services (SSAS).

Not to be overlooked, Chapter 11 examines control flow and data flow scripting, and provides examples of common (but complex) challenges that can be solved by scripting, including advanced file management and data processing. Chapter 12 provides performance troubleshooting steps and best practices on data flow design. This chapter also contrasts the right use of SQL commands versus the data flow.

Before you even begin diving into the details of an SSIS-based solution, you must start out on the right foot! Chapter 1 reviews solution architecture, project approach and methodology, package design patterns, the importance of I/O performance, and data element tracking. It answers the question, “Where do I begin and how do I approach an ETL solution?

In all, this book presents a comprehensive picture of SSIS solution challenges and design best practices.

How This Book Is Structured

Before reviewing the chapter organization, let’s first note how each chapter is internally structured. Every chapter contains three primary elements — a “Problem” section, a “Design” section, and a “Solution” section. In fact, some chapters are structured with more than one Problem–Design–Solution grouping. Each collection of Problem–Design–Solution addresses the following:

“Problem” — Discusses either the current challenges related to the area of focus, or the SSIS-specific problems when trying to design an aspect of the solution.
“Design” — Presents the SSIS best practices, or outlines the design aspects and recommendations for a specific solution area.
“Solution” — Ties all the aspects of the solution together, and gives either an example solution related to the area, or considers the SSIS solution approach, and factors involved in solving the problem outlined earlier.

This book is generally organized in the way that you would approach a data integration or ETL project. The chapter flow builds on the natural progression that a developer or administrator would go through when designing an SSIS solution.

After beginning with an overview of architecture, the book then moves into putting together the underlying support structure of a solution — the storage, deployment, and management framework. Next, the natural progression is to handle the source data, whether that is in files or extracted from a relational database system (RDBMS), and often requiring a data-cleansing process. Next, the chapters delve into dimension and fact table loading, as well as the cube-processing steps. The final chapters address advanced data handling through scripting, and provide the package availability and performance that many solutions require.

The following chapter provides a solution overview and discusses planning:

Chapter 1, “SSIS Solution Architecture,” addresses the overall solution challenges and planning best practices.

The following chapters examine SSIS administration and the deployment foundation:

Chapter 2, “SSIS Management Framework Design,” reviews and demonstrates the importance of building an SSIS auditing and execution framework.
Chapter 3, “Package Deployment and Storage Decisions,” discusses the practices and decisions related to SSIS package deployment and storage.

The following chapters discuss file management and data extraction:

Chapter 4, “File Handling and Processing Methods,” covers how to deal with different files types, file sources, and file management.
Chapter 5, “Data Extraction Best Practices,” dives into how to manage data source extractions, including incremental and change data management.

The following chapter discusses data cleansing:

Chapter 6, “Data-Cleansing Design,” reviews data profiling in SSIS, and how to perform data cleansing of complex and dirty data.

The following chapters cover data warehouse ETL:

Chapter 7, “Dimension Table ETL,” covers the challenges related to dimension table data changes, and loading for different dimension table types.
Chapter 8, “Fact Table ETL,” addresses ETL for data warehouse fact table.
Chapter 9, “SSAS Processing Architecture,” looks at how to handle SSAS cube processing in different ways, based on your processing requirements.

The following chapters examine advanced ETL concepts:

Chapter 10, “Implementing Scale-Out ETL Process,” looks at how to deal with package reliability, and provides an approach to scaling out an SSIS solution.
Chapter 11, “Scripting Design Patterns,” presents a few different solution examples geared toward handling advanced data and processing requirements with scripts.
Chapter 12, “SSIS Package Scaling,” presents troubleshooting steps, and provides best practices for package design with a focus on performance.

The Problem–Design–Solution format and order of the chapters together provide a well thought-out, organized, and systematic approach to SSIS solution design.

What You Need To Use This Book

Following is a breakdown of what you need to use this book:

Since this book is about SQL Server 2008 Integration Services, you will get more out of it if you have the Developer Edition of SQL Server 2008 installed, including the sample applications and the tools.
If you don’t have a licensed copy of SQL Server 2008 Developer Edition, you can download a trial version from the Microsoft Web site, http://download.microsoft.com. Search for “SQL Server 2008 Evaluation.”
The Adventure Works 2008 samples are required for several of the examples in the chapters. The samples are available on Microsoft Open Source community site, www.codeplex.com. Search for “SQL Server 2008 Product Samples.”
The source code for the chapters is posted on the Wrox Web site (www.wrox.com). See the later section, “Source Code,” for directions on how to access to the material. Each chapter has the source code. In addition, a database called SSIS_PDS is included in the source code download, and has all the database objects required for the examples.

This book has been written for SQL Server 2008 with SP1 installed (which was released in February 2007).

Conventions

To help you get the most from the text and keep track of what’s happening, a number of conventions have been used throughout the book.

Boxes like this one hold important, not-to-be forgotten information that is directly relevant to the surrounding text.

Tips, hints, tricks, and asides to the current discussion are offset and placed in italics like this.

As for styles in the text:

Important new terms and important words are highlighted when we introduce them.
Keyboard strokes are shown like this: Ctrl+A.
File names, URLs, and code within the text are shown like this: persistence.properties.
Code is presented in the following two ways:

In code examples we highlight new and important code with boldfaced text.
The boldface is not used for code that's less important in the present context,
or has been shown before.

Source Code

As you work through the examples in this book, you may choose either to type in all the code manually, or use the source code files that accompany the book. All of the source code used in this book is available for download at www.wrox.com. Once at the site, simply locate the book’s title (either by using the Search box, or by using one of the title lists), and click the Download Code link on the book’s detail page to obtain all the source code for the book.

Because many books have similar titles, you may find it easiest to search by ISBN; for this book, the ISBN is 978-0-470-52576-0.

Once you download the code, just decompress it with your favorite compression tool. Alternately, you can go to the main Wrox code download page at www.wrox.com/dynamic/books/download.aspx to see the code available for this book and all other Wrox books.

Errata

We make every effort to ensure that there are no errors in the text or in the code. However, no one is perfect, and mistakes do occur. If you find an error in one of our books (such as a spelling mistake or faulty piece of code), we would be very grateful for your feedback. By sending in errata, you may save another reader hours of frustration, and, at the same time, you will be helping us provide even higher-quality information.

To find the errata page for this book, go to www.wrox.com and locate the title using the Search box or one of the title lists. Then, on the book details page, click the Book Errata link. On this page, you can view all errata that has been submitted for this book and posted by Wrox editors. A complete book list including links to each book’s errata is also available at www.wrox.com/misc-pages/booklist.shtml.

If you don’t spot “your” error on the Book Errata page, go to www.wrox.com/contact/techsupport.shtml and complete the form there to send us the error you have found. We’ll check the information and, if appropriate, post a message to the book’s errata page and fix the problem in subsequent editions of the book.

p2p.wrox.com

For author and peer discussion, join the P2P forums at p2p.wrox.com. The forums are a Web-based system for you to post messages relating to Wrox books and related technologies, and to interact with other readers and technology users. The forums offer a subscription feature to email you topics of interest of your choosing when new posts are made to the forums. Wrox authors, editors, other industry experts, and your fellow readers are present on these forums.

At http://p2p.wrox.com you will find a number of different forums that will help you not only as you read this book, but also as you develop your own applications. To join the forums, just follow these steps:

1. Go to p2p.wrox.com and click the Register link.

2. Read the terms of use and click Agree.

3. Complete the required information to join, as well as any optional information you wish to provide, and click Submit.

4. You will receive an email with information describing how to verify your account and complete the joining process.

You can read messages in the forums without joining P2P, but to post your own messages, you must join.

Once you join, you can post new messages and respond to messages other users post. You can read messages at any time on the Web. If you would like to have new messages from a particular forum emailed to you, click the “Subscribe to this Forum” icon by the forum name in the forum listing.

For more information about how to use the Wrox P2P, be sure to read the P2P FAQs for answers to questions about how the forum software works, as well as many common questions specific to P2P and Wrox books. To read the FAQs, click the FAQ link on any P2P page.

SSIS Solution Architecture

Imagine that this is the first day of a new internal or client project. You will have responsibility on the data and processing layer of the solution, which involves processing data — a lot of data — from several sources, and then either integrating systems, or maybe consolidating data for reporting. Maybe your background is a developer, or a database administrator (DBA), or a data designer, and you know SSIS fairly well. But now they are calling you the “data and ETL architect.”

ETL is the acronym for Extraction, Transformation, and Loading, which is used to describe the data-processing layer in a data-integration or data warehouse solution.

The project manager approaches you and says that the Vice President of Technology has asked the team to give him an estimate of the infrastructure needed. Furthermore, the business owner wants a high-level overview of how the solution architecture will help the company achieve the business need most efficiently. The project manager also wants your thoughts on the best way to approach the solution, how the development should be coordinated between team members, and how deployment should be handled.

Three meetings are scheduled for tomorrow to address these things, and you’ve been asked to provide your feedback in those areas.

Where do you start? How should you approach the solution design with SSIS as the main technology? How should all the pieces work together?

This chapter examines how to lay the foundation for successful solutions based on SQL Server Integration Services (SSIS). And, in fact, this whole book is about SSIS solutions to real-world requirements and challenges. It addresses questions such as the following:

What are the problems and issues?
What are the design principles?
How do you put it all together for a complete and successful solution?

Before you dive into the technical challenges of a project, you must first step back and ensure that you are laying the right foundation. Jumping right in is tempting! But resist the urge, because you want to (and need to) set the precedence and patterns for the solution upfront. If you don’t, chances are you won’t be able to go back and make changes down the road.

As with all chapters in this book, this chapter is organized into the following three major sections:

“Problem” — Coordinating and architecting an SSIS solution is not an easy task. The “Problem” section reveals some common challenges and common mistakes when planning and extending your ETL environment.
“Design” — The “Design” section in this chapter examines the right approach to a project, and the long-term project aspects that you should set in motion early in a project.
“Solution” — In many ways, the remainder of this book provides you with the solutions to make it all work together. This section launches you into the rest of the book, and shows how you can follow the chapters to build or redesign your SSIS solution.

Problem

Data and ETL projects have many challenges. Some challenges relate to data, some to enterprise integration, some to project coordination, and some to general expectations. This section begins by looking at the bigger picture of data within an organization, but then quickly looks at ETL projects and SSIS packages and execution.

Macro Challenge: Enterprise Data Spaghetti

Maybe your SSIS project is only a small solution in a bigger enterprise pond. The problem is that it can still cause a ripple effect when you tie it into your environment. Or, you can have challenges caused by an unwieldy enterprise environment when you try to implement your solution.

Figure 1-1 shows a not-so-nice telephone/electricity pole that illustrates the data nervous system of many organizations.

Figure 1-1

The problem with Figure 1-1 is that this mess didn’t happen overnight! It grew into this twisted unorganized process because of poor planning, coordination, and execution. However, be aware that, a lot of the time, a corporation’s politics may lead to this type of situation. Departments hire their own technical people and try to go around IT. Systems don’t talk to each other nicely. Project pressures (such as time and budget) cause designers to cut corners.

Following are a few reasons why this kind of tangled mess happens in an organization’s data processing, and examples of the many problems that this “unorganization” causes:

Debilitating dependency chains — The biggest problem is that often systems are built on top of systems on top of systems. The core source data has been connected through so many precedence links that it takes more time and administrative and development overhead. Systems at the source and in the middle of dependencies become un-replaceable because of the amount of effort that switching to a new system would take.
Unknown and uncontrollable data processing — The operations that process data are not centralized, and, in many cases, unknown because of department applications that are created without coordination with other applications. Processes run at uncontrolled times, and may impact systems within the processes even during peak times, which affects work efficiency.
Fragile enterprise changes — Changes to applications are difficult and costly. They may break processes, or cause data integration or reporting applications to be inaccurate.
Delayed data access — Even when the processes are somewhat controlled, the complicated system dependencies cause delays in data availability and nightly overhead processes that often run into mid-morning schedules. When they break, customer perception and employee efficiency are affected.

The “Design” section later in this chapter discusses how to approach your SSIS-based ETL project in the right way, and ensure that you are helping to solve the problem, rather than adding to it.

Micro Challenge: Data-Processing Confusion

Another common problem with data processing is when the logic contained to process data is overly complicated and confusing. Just like the macro enterprise problem, this problem usually is the result of changes over time where logic is modified and appended. It usually comes in one of two ways:

Runaway stored procedures — Procedures that run with complicated logic and lots of temporary tables, inserts, updates, and/or deletes can be difficult to manage, and are often inefficient. Supporting the procedures is also very difficult because the logic is difficult to follow, and, many times, the developers or DBAs who wrote the code are unavailable. Overall, this type of process requires a lot of administration and wasted time spent on following and learning the process.
Unmanageable packages — SSIS packages can also be designed with difficult-to-follow logic and sometimes complex precedence with hundreds of components used in a single package. These kinds of packages have challenges similar to those of runaway stored procedures, such as troubleshooting and the learning curve required for the process. Figure 1-2 shows the control flow of a package that has too many components to effectively manage. (The SSIS designer is zoomed in at 50 percent to fit on the screen.)

Figure 1-2

The overly complex control flow shown in Figure 1-2 is similar to an overly complex data flow, where too many components are used, thus making the development, troubleshooting, and support difficult to manage. The “Design” section later in this chapter proposes a better approach for SSIS packages called the modular package approach.

In summary, both of these types of processes (runaway procedures and unmanageable packages) are very difficult to support, and not suited to team development, error handling, and scalability (all of which are addressed in Chapter 12).

Problems with Execution and Troubleshooting

A couple of other issues that often come up in an ETL or data-integration solution are poor process coordination and difficulty doing root cause analysis. If the “what happened?” question can’t be answered quickly and with confidence, then likely there is a problem with the overall solution execution and logging strategy.

Figure 1-3 shows the command-line output of an example SSIS package execution.

Figure 1-3

If you were to consider spending time trying to work through this output when trying to figure out what went wrong, then you should consider implementing a better execution and auditing structure. This includes package execution in your development environment.

If you have just turned on the out-of-the-box SSIS logging and are capturing results to output to a table, it still may not be enough. If you write custom queries every time against the SSIS logging table to figure out what happened, then you also need a better strategy.

Infrastructure Challenges

When designing an SSIS ETL solution, how do you determine the infrastructure components such as server hardware, memory, processor cores, network switches, disk arrays, storage networks, and I/O controllers? Related to that, where should you run your SSIS packages taking into consideration sources, destinations, and other applications, while balancing hardware scalability and location within your network topology?

These questions are not trivial, and the answers depend on a lot of factors, including processing windows, source and destination availability, application impact and availability, network bandwidth, fault-tolerance requirements, and so on.

I/O is usually the biggest bottleneck, and the one most often overlooked. I/O (or, more precisely, disk I/O) is the throughput that the system can handle on the drive volumes. And this challenge is not just about trying to get the greatest throughput on a single drive. You must consider staging and temporary environments, logging, and current and historical data. And you must balance it all with hardware availability and budget.

The reason disk I/O is so important when considering a data-integration or ETL effort is because of the nature of what you are doing, including the following:

Bulk operations — ETL and integration efforts typically process data in bulk, meaning that, when a process is kicked off (hourly, daily, or weekly), data is moved from sources to destinations with some transformation processes in-between. The processes usually move or integrate thousands or millions of records. That can be a lot of data that moves between systems, and it generates a lot of disk activity.
Source databases — Processes that are extracting a lot of data from sources incur disk overhead either by the sheer volume of the data, or when complicated queries against large tables require temporary operations that use disks (such as the SQL Server TempDB).
Destination databases — The nature of relational databases requires that data be stored to disk before a transaction is marked as complete. When inserting or updating a lot of data, the server must wait until the data is committed for the process to be complete.
Staging databases
File management — A lot of ETL operations deal with files such as delimited files, XML, EDI, and so on. Large files require file management, archiving, and sometimes processing, and, therefore, incur disk I/O overhead.