Contents
Introduction
Chapter 1: SSIS Solution Architecture
Problem
Macro Challenge: Enterprise Data Spaghetti
Micro Challenge: Data-Processing Confusion
Problems with Execution and Troubleshooting
Infrastructure Challenges
Other Challenges
Design
Choosing the Right Tool
Overall Solution Architecture
Data Integration or Consolidation
Data Warehouse ETL
Project Planning and Team Development
Agile Benefits
Agile Cautions and Planning
Data Element Documentation
Package Design Patterns
Modular Packages
Master Packages
Server and Storage Hardware
Server Hardware
Development and Test Servers
ETL Collocation
Storage Hardware
Package Execution Location
Package Storage Location Versus Execution Location
Execute SQL Task and Bulk Insert Task Execution
Package Execution and the Data Flow
Design Review
Solution
Setting the Stage: Management and Deployment
Source Data: Files, Tables, and Data Cleansing
Data Warehouse ETL and Cube Processing
Advanced ETL: Scripting, High Availability, and Performance
Summary
Chapter 2: SSIS Management Framework Design
Problem
Challenges of Not Having a Framework Implementation
Different Development Methods
Changing Metadata
Getting the Right Information to the Right People at the Right Time
Reusability
Framework Implementation
Configuration Management Scheme
Logging and Auditing
Template Package
Framework Benefits and Value
Design
Configuration Management
Overall Design
Environment Variable Configuration
XML Configuration File
SQL Server Configuration
Logging and Auditing Mechanism
Storage
Tables
Stored Procedures
Template Package
Implementation Guidelines
Solution
Configuration Management
Logging and Auditing Mechanism
Storage and Tables
Stored Procedures
Template Package
Development
Installation
Other Considerations
Customizations
ETL Process Framework
Process Owner
Reporting
Summary
Chapter 3: Package Deployment and Storage Decisions
Problem
Standardization
Environment
Application
Desired Implementation
Design
Storage Methods
SSIS Service
File System Storage
SQL Server Storage
Storage Comparison
Deployment Methods
SSIS Deployment Wizard
Manual Deployment of Packages
DTUtil Scripting
Solution
Storage Methodology
Existing Infrastructure
Security
Package Makeup
Back to the Flowchart
Deployment Methodology
Storage Method
Existing Infrastructure
Deployment Needs
Back to the Flowchart
Total Methodology
Summary
Chapter 4: File-Handling and Processing Methods
Problem
Simple File Operations
Remote File Operations
File Criteria
File Iteration
File Notification
Design
File System Task
Source and Destination Connection
Attributes
Overwriting the Destination
Multiple Files
FTP Task
FTP Connection
HTTP
Foreach Loop
Applying Advanced Criteria
File Watcher Task
WMI Event Watcher Task
Solution
Summary
Chapter 5: Data Extraction Best Practices
Problem
Extraction Data Criteria
Source System Impact
Incremental Extraction
Deleted Rows
Staging Database
Data Lineage and Metadata
File Sources
Design
Package Connections and Source Adapters
Package Connections
Source Adapters
Incremental Data Extraction
Incremental Extraction Using a Change Identifier Value
Targeting Change Records through Dynamic Source Queries
Retrieving Incremental Identifier Values and Updating Package Variables
Capturing the Maximum Change Identifier Value
Incremental Extraction from SQL Server without a Trigger
Using SSIS to Handle All Aspects of an Incremental Extraction
Data Staging Method
Tracking Data Lineage Identifiers
Solution
Metadata-Driven Extraction Example
Metadata Tables
Control Flow
Looping through the Sources
Inside the Sources Loop
The Script
Read Variables
Open Connections
Get Max Change Indicator
Extract Changed Source Rows
Close Connections
Summary
Chapter 6: Data-Cleansing Design
Problem
Candidate Key Profiles
Column Length Distribution Profiles
Column Null Ratio Profiles
Column Pattern Profiles
Column Statistics Profiles
Column Value Distribution Profiles
Functional Dependency Profiles
Design
Using the Script Transform for Data Scrubbing
Using the Fuzzy Grouping Transform to De-duplicate Data
Using the Fuzzy Lookup Transform to Cleanse Data
Dealing with Multiple Record Types in a Single File
Using the Raw File
Solution
Summary
Chapter 7: Dimension Table ETL
Problem — Fundamental Dimension ETL
Dimensions: The Basics
Dimension ETL: The Challenge
Design — Fundamental Dimension ETL
Data Preparation
Dimension Change Types
Type 1 (Changing) Attribute: A Closer Look
Type 2 (Historical) Attribute: A Closer Look
Inferred Members
Solution — Fundamental Dimension ETL
Preparing Your Source Data for Dimension ETL
SSIS Slowly Changing Dimension Wizard
Advanced Properties and Additional Outputs of the SCD
Problem — Advanced Dimension ETL
SCD Wizard Advantages and Disadvantages
Dimension Volume and Complexity
Design — Advanced Dimension ETL
Optimizing the Built-in SCD
Index Optimizations
Update Optimizations
Snowflake Dimension Tables
Parent-Child Dimension ETL
Date Dimension ETL
Profile Dimension and Junk Dimension ETL
Creating a Custom Slowly Changing Package
Solution — Advanced Dimension ETL
Snowflake Dimensions
Parent-Child Dimensions
Profile and Junk Dimensions
Date Dimension ETL
Custom Dimension ETL
Determining Dimension Changes
Inserts and Updates
Summary
Chapter 8: Fact Table ETL
Problem
Fact Tables: The Basics
Fact Table ETL: The Challenge
Preparing the Data
Mapping Dimension Keys
Calculating Measures
Adding Metadata
Fact Table Updates
Fact Table Inserts
Design
Data Preparation
Data Preparation with SSIS Transformations
Data Preparation Examples
Acquiring the Dimension Surrogate Key in SSIS
Identifying the Dimension Surrogate Keys with SSIS
Surrogate Key Examples in SSIS
Measure Calculations
Measure Calculation Types
Handling Measure Calculations in SSIS
Managing Fact Table Changes
Approaches to Identifying Changed Fact Records
Fact Update Examples in SSIS
Optimizing Fact Table Inserts
Optimizing Inserts with Fast Load
Optimizing Inserts with Index Management
Solution
Internet and Reseller Sales Fact Table ETL
Fact Internet and Reseller Sales Extraction and Transform Process
Fact Internet and Reseller Sales Load Process
Snapshot Fact Table Example — Call Center Fact Table
Advanced Fact Table ETL Concepts
Handling Missing Dimension Lookups
Handling Late-Arriving Facts
Summary
Chapter 9: SSAS Processing Architecture
Problem
SSAS Object Population
Schedule
Partition Management
The Complete Package
Design
SSAS Objects and Processing Basics
Dimensions
Partitions
Mining Structures
SSIS Tasks and Components
Analysis Services Processing Task
Analysis Services Execute DDL Task
Execute Process Task with ASCMD
Data Flow Destinations for SSAS Objects
Script Task with AMO
Creating and Managing Partitions
Overall Design
Solution
Preparation for SSAS Integration
Process Dimensions Package
Process Task
Parallel XMLA
Process Partitions Package
Storing and Loading Metadata
SSAS Processing
Overall Solution
Summary
Chapter 10: Implementing Scale-Out ETL Process
Problem
Design
Design Components Overview
Central Common Services Server
File Processor and Pre-Aggregation Scale-Out Processes Servers
Design Details
File Management Tasks
Data File Management
Work Allocation Process
Scale-Out Source File Process
Work Reassignment Process
Data Aggregation Tasks
Hourly Data Pre-Aggregation Process
Hourly Data Aggregation Process
Daily Data Aggregation Process
Archival and Clean-up Processes
Data File Archival Process
Stage Table Clean-up Process
Design Conclusions
Solution
Central Server Services
Multiple File Processor and Pre-Aggregator Processes
Database Tables Required on the Central Server
Stored Procedures
Procedures on the Central Server
Procedures on the Staging Servers
SSIS Packages
File-Processing Server Packages
Central Server Packages
Summary
Chapter 11: Scripting Design Patterns
Problem — Advanced File Management
Script Task
Scenario
Design — Advanced File Management
Script Language
Accessing Objects
Custom Assemblies
Scripting
Solution — Advanced File Management
Create an External Assembly
Access the External Assembly
Archive the Files
Summary — Advanced File Management
Problem — Call Center Fact ETL
Reasons to Use Scripting
Scenario
Design — Call Center Fact ETL
Component Type
Output Type
Synchronous Outputs
Asynchronous Outputs
Other Component Types
Design Choice
Overridable Methods
Solution — Call Center Fact ETL
Package Setup
Script Component
Reusable Methods
Row-By-Row Processing
Process Input
Entire Script
Package Wrap-Up
Summary — Call Center Fact ETL
Summary
Chapter 12: SSIS Package Scaling
Problem
Identifying Task Durations
Identifying Data Flow Destination Issues
Identifying Transformation and Memory Bottlenecks
Identifying Data Flow Source Performance Issues
Design
Balancing SQL Operations with the Data Flow
Data Flow Advantages as Compared with SQL
SQL Advantages when Compared with the Data Flow
Applying SQL Code in Your SSIS Packages
SSIS Pipeline Architecture Concepts
Data Flow Design Example
SQL-Centric SSIS Process
Rewritten Data Flow–Centric SSIS Process
Solution
Tuning Your Data Flow
Use Blocking Transformations Sparingly
Limit Row-by-Row Operations
Manage Data Flow Backpressure
Pre-sort Sources as an Alternative to the Sort Transformation
Optimize the Lookup and Managing the Cache Size
Remove Unused Columns from the Data Flow
Be Mindful of the SSIS Logging Impact
Regulate Transactions
Setting Data Flow Properties
Up the EngineThreads Value
Optimize the Temporary Storage Locations
Leave RunInOptimizedMode as True
Tune Buffers
Database and Destination Optimization
Limiting Database Growth
Consider Dropping and Re-creating Indexes on Destination Tables
Using the OLE DB Destination Adapter
Use Advanced Oracle and Teradata Destination Adapters
Handling Partitioned Tables
Summary
Index

Microsoft® SQL Server® 2008 Integration Services: Problem–Design–Solution
Published by
Wiley Publishing, Inc.
10475 Crosspoint Boulevard
Indianapolis, IN 46256
www.wiley.com
Copyright © 2010 by Wiley Publishing, Inc., Indianapolis, Indiana
Published by Wiley Publishing, Inc., Indianapolis, Indiana
Published simultaneously in Canada
ISBN: 978-0-470-52576-0
Manufactured in the United States of America
10 9 8 7 6 5 4 3 2 1
No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Sections 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permissions.
Limit of Liability/Disclaimer of Warranty: The publisher and the author make no representations or warranties with respect to the accuracy or completeness of the contents of this work and specifically disclaim all warranties, including without limitation warranties of fitness for a particular purpose. No warranty may be created or extended by sales or promotional materials. The advice and strategies contained herein may not be suitable for every situation. This work is sold with the understanding that the publisher is not engaged in rendering legal, accounting, or other professional services. If professional assistance is required, the services of a competent professional person should be sought. Neither the publisher nor the author shall be liable for damages arising herefrom. The fact that an organization or Web site is referred to in this work as a citation and/or a potential source of further information does not mean that the author or the publisher endorses the information the organization or Web site may provide or recommendations it may make. Further, readers should be aware that Internet Web sites listed in this work may have changed or disappeared between when this work was written and when it is read.
For general information on our other products and services please contact our Customer Care Department within the United States at (877) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.
Library of Congress Control Number: 2009935233
Trademarks: Wiley, the Wiley logo, Wrox, the Wrox logo, Programmer to Programmer, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates, in the United States and other countries, and may not be used without written permission. All other trademarks are the property of their respective owners. Wiley Publishing, Inc., is not associated with any product or vendor mentioned in this book.
To my family: Amy, Meg, Nate, Kate, and Caleb
—Erik Veerman

Erik Veerman is a Mentor for Solid Quality Mentors, focusing on training, mentoring, and architecting solutions on the SQL Server Business Intelligence (BI) platform. His industry recognition includes Microsoft’s Worldwide BI Solution of the Year award and recognition as SQL Server Magazine’s Innovator Cup winner. Veerman has designed dozens of BI solutions across a broad business spectrum — telecommunications, marketing, retail, commercial real estate, finance, supply chain, and information technology. His experience with high-volume, multi-terabyte environments and SQL Server 64-bit has enabled clients to scale their Microsoft-based BI solutions to their optimal potential. As an expert in on-line analytical processing (OLAP) design, extract, transform, and load (ETL) processing, and dimensional modeling, Veerman is a presenter, author, and instructor. He led the ETL architecture and design for the first production implementation of SQL Server Integration Services (SSIS), and helped to drive the ETL standards and best practices for SSIS on Microsoft’s SQL Server 2005 reference initiative, Project REAL. Veerman is also co-author of Professional SQL Server 2008 Integration Services (Indianapolis: Wiley, 2008), Expert SQL Server 2005 Integration Services (Indianapolis: Wiley, 2007), and Professional SQL Server 2005 Integration Services (Indianapolis: Wiley, 2006), and lead author for the Microsoft Press SQL Server 2005 and SQL Server 2008 series, focusing on Business Intelligence Development and Implementation. As a resident of Atlanta, Georgia, Veerman participates in the local Atlanta SQL Server User’s Group, a Professional Association for SQL Server (PASS) chapter.

Jessica M. Moss (SQL Server MVP) is a well-known practitioner, trainer, author, and speaker for Microsoft SQL Server Business Intelligence (BI). As a Mentor with Solid Quality Mentors, she has created numerous data warehousing solutions for a variety of companies in the retail, Internet, health services, finance, and energy industries, and has delivered training courses on SQL Server Integration Services (SSIS), SQL Server Reporting Services (SSRS), and SQL Server Analysis Services (SSAS). While working for a major clothing retailer, Moss participated in the SQL Server 2005 TAP program, where she developed best implementation practices for SSIS. Moss has authored technical content for multiple magazines and Web sites, and has spoken internationally at conferences such as the Professional Association for SQL Server (PASS) Community Summit, SQL Teach International Conference, and the devLink Technical Conference. As a strong proponent of developing user-to-user community relations, Moss organizes meetings for a User Group that discusses different technologies. She also actively participates in local User Groups and code camps in central Virginia. In addition, Moss volunteers her time to help educate people through the PASS organization.

Brian Knight (SQL Server MVP, MCITP, MCSE, MCDBA) is the owner and founder of Pragmatic Works. He is also the co-founder of SQLServerCentral.com and JumpstartTV.com. He runs the local SQL Server Users Group in Jacksonville, Florida (JSSUG). Brian is a contributing columnist for several technical magazines and does regular Webcasts at “Jumpstart TV.” He is the author of ten books on SQL Server technology. Knight has spoken at conferences such as the Professional Association for SQL Server (PASS), SQL Connections, and TechEd, as well as many Code Camps. His blog can be found at www.pragmaticworks.com. Knight lives in Jacksonville, where he loves watching the NFL’s Jacksonville Jaguars lose on a regular basis.
Jay Hackney is a mentor and consultant for Solid Quality Mentors, where he helps customers utilize Microsoft technology to build real-world solutions. He has 15 years of development and consulting experience on the Microsoft data platform, and has worked with SQL Server Integration Services (SSIS) since the earliest private beta. Since then, he has used SSIS to deliver data warehousing, extract, transform, and load (ETL), and data migration solutions for many well-known companies, and has presented on SSIS topics at events such as Professional Association for SQL Server (PASS) Summit, Microsoft Tech-Ed, and SQL Server product launches.

Rushabh Mehta is a Mentor for Solid Quality Mentors, and is also the Managing Director for Solid Quality India Pvt. Ltd. Mehta has actively worked with the Microsoft SQL Server and Business Intelligence (BI) platform since 1998. He has developed numerous BI solutions, including predictive-analysis systems, health-care solutions, and multi-terabyte financial decision support systems, and has designed and built an analytical platform to support the Microsoft Windows Azure platform. His experience spans a wide breadth of industrial sectors, including manufacturing, retail, finance, and government, working with clients such as Raymond James Financial, Jackson Hewitt, Publix, the U.S. Department of Veterans Affairs, the U.S. Department of Defense, and Microsoft. Mehta has developed a number of best practices and standards for implementing BI solutions, and has also developed and delivered advanced training courses on Microsoft BI practices and technologies. An active speaker at a number of large conferences such as the Professional Association for SQL Server (PASS) and TechEd, Mehta also frequently speaks at SQL Server and BI User Groups around the world. Mehta is the Executive Vice President for PASS, the largest global user-driven community for SQL Server professionals.
Credits
Executive Editor
Robert Elliott
Project Editor
Kevin Shafer
Technical Editor
Paul S. Waters
Production Editor
Kathleen Wisor
Copy Editor
Paula Lowell
Editorial Director
Robyn B. Siesky
Editorial Manager
Mary Beth Wakefield
Production Manager
Tim Tate
Vice President and Executive Group Publisher
Richard Swadley
Vice President and Executive Publisher
Barry Pruett
Associate Publisher
Jim Minatel
Proofreader
Scott Klemp, Word One
Indexer
Robert Swanson
Cover Image
©iStockphoto.com/wolv
Acknowledgments
I want to first thank Jessica, Brian, and Jay for adding their valuable expertise and experience to several of the chapters of this book, and sharing their approach and code practices from the SSIS framework, to the dynamic extraction logic, to data cleansing best practices. A big collective thanks to our clients! The unique projects and challenges helped shape and refine this content. I added up the years of experience for all of us, and these pages represent about 50 years of work with data processing and ETL — and that doesn’t include late nights and long weekends.
I also want to thank Paul Waters and Rushabh Mehta for their roles in this book as Technical Editor and contributing author, respectively. A big thanks to the Wrox team for their support, hard work, and valuable feedback. And a thanks to you, the reader, because I actually read the Amazon feedback and try to address the constructive criticism in each new work. Hopefully, the diligence has paid off. If so, help us and let the world know!
Last, but not least, thanks to my family for their support. A special thanks to Amy and her encouragement and hard work in so many ways.
— Erik Veerman
As a first-time book author, this acknowledgment holds special meaning to me. I will try to refrain from mentioning everyone I have ever met, and focus on a few special people that truly affected the outcome of this book.
First of all, a huge thank you is due to Erik Veerman, the lead author on this book. He not only convinced me to join this troupe of authors, but also took the time to patiently answer all of my questions about the process. Next, thank you to Jay Hackney, Brian Knight, Rushabh Mehta, Paul Waters, and Wrox for their authoring, technical, and editing contributions. I truly appreciated your support and help.
On a personal note, I would like to thank my friends and family, Mom, Dad, and Lianna, for their never-ending support and patience as I disappeared for days at a time. Finally, a few other people who I would like to recognize for their previous support are Fred Creasy, Georgia Willis, Kathi Walker, and Paul Reynolds — thank you.
— Jessica M. Moss
Thanks to everyone who made this book possible. As always, I owe a huge debt to my wife, Jenn, for putting up with my late nights, and to my children, Colton, Liam, and Camille, for being so patient with their tired dad who has always overextended himself. Thanks also to the makers of Guinness and other hard liquors for sustaining my ability to keep on writing technical books. Finally, thanks to all the User Group leaders out there who work so hard to help others become proficient in technology. You make a huge difference!
—Brian Knight
I’d like to thank Erik for this opportunity and his hard work on this book.
— Jay Hackney
Introduction
Our world is full of data, and as a developer, database administrator (DBA), or data architect, your job will no doubt require processing data — consolidating data, integrating system data, migrating data, cleansing data, transforming data, or loading data into a data warehouse. This book is about solving problems and building solutions with SQL Server 2008 Integration Services (SSIS). It focuses on design best practices for the top uses and challenges of SSIS.
In this book, you will learn how to approach these kinds of projects, and how to work effectively within SSIS to apply its functionality to common project requirements and challenges.
As an author team, our vision for this book is very different from that of the other books on the shelves, which focus mainly on how to use the knobs and buttons in SSIS. Books such as Professional Microsoft SQL Server 2008 Integration Services (Indianapolis: Wiley, 2008) are very valuable to have on your bookshelf for reference and for learning SSIS. But this book is about going to the next level of solution design. Our goal in writing this book was to focus on solving problems, building solutions, and providing design best practices.
In summary, the difference between this SSIS book and all the others out there is that the others simply focus on the product features, with little emphasis on solution design. If you go out and buy a new power saw, the manual is going to tell you how to angle the blade, set the right depth, and make a clean cut. But the manual doesn’t tell you how to build something. This book shows you how to build the furniture, not just how to use the saw.
To be sure, you must know how to use SSIS before you can build a solution. But going from knowledge to design requires guidance on the right approach, and how to avoid the common pitfalls.
This book empowers you with the confidence, knowledge, and understanding to make the right choices in your ETL design, choices that enable easy administration, meet your data-processing requirements, and perform well for current and future scalability.
Who This Book Is For
This book is targeted at three primary SSIS users: data architects, ETL developers, and DBAs responsible for ETL administration and support.
Because this book focuses on problems and solutions, a base understanding of SSIS is required. A couple of areas of the book walk you through the more advanced features of SSIS, but most of the book builds on top of a foundation of SSIS knowledge and experience. If you have taken an SSIS class, or have read another book and tried out SSIS, or you have built a few packages for various purposes, then that base knowledge will give you enough background for this book. But you should be up for the challenge of learning a new tool in the context of applying it!
The perfect reader of this book is someone who is in the early stages of a new ETL or data integration project (or a redesign project), and is eager to know how to approach the effort with the best practices when designing SSIS packages.
If you are supporting an existing project and must make some changes to aid in administration, deployment, or scalability, then you will also benefit from several of the chapters herein.
What This Book Covers
This book is an expanded, revised, and rewritten version of Expert SQL Server 2005 Integration Services (Indianapolis: Wiley, 2007). The authors have expanded the coverage to address the current trends in ETL and SSIS, including creating a scale-out execution model, performing advanced data profiling and cleansing, and handling file management and file processing.
This book also addresses some of the challenges in SSIS surrounding auditing, configurations, and execution management. Two chapters, Chapter 2 and Chapter 3, focus on solving these administrative challenges.
Every ETL or data integration solution involves data extraction of one kind or another. Maybe you’re extracting data from flat files, or perhaps your source is coming from an enterprise resource planning (ERP) system, or a line-of-business application. Regardless, you must implement a data extraction methodology that is efficient, reduces the impact on the source, and adequately handles changes and data tracking. Chapter 5 dives into many of the data extraction areas, and even provides you with a dynamic data extraction approach.
Another area that this book covers is data warehouse ETL. In fact, one of the more common uses of SSIS is data warehousing ETL. Chapters 7, 8, and 9 are dedicated to solving the data warehouse ETL challenges all the way from dimension and fact table ETL, to processing cubes in SQL Server Analysis Services (SSAS).
Not to be overlooked, Chapter 11 examines control flow and data flow scripting, and provides examples of common (but complex) challenges that can be solved by scripting, including advanced file management and data processing. Chapter 12 provides performance troubleshooting steps and best practices on data flow design. This chapter also contrasts the right use of SQL commands versus the data flow.
Before you even begin diving into the details of an SSIS-based solution, you must start out on the right foot! Chapter 1 reviews solution architecture, project approach and methodology, package design patterns, the importance of I/O performance, and data element tracking. It answers the question, “Where do I begin, and how do I approach an ETL solution?”
In all, this book presents a comprehensive picture of SSIS solution challenges and design best practices.
How This Book Is Structured
Before reviewing the chapter organization, let’s first note how each chapter is internally structured. Every chapter contains three primary elements: a “Problem” section, a “Design” section, and a “Solution” section. In fact, some chapters are structured with more than one Problem–Design–Solution grouping. In each grouping, the “Problem” section frames the challenge at hand, the “Design” section covers the design principles and best practices for addressing it, and the “Solution” section demonstrates a working implementation.
This book is generally organized in the way that you would approach a data integration or ETL project. The chapter flow builds on the natural progression that a developer or administrator would go through when designing an SSIS solution.
After beginning with an overview of architecture, the book then moves into putting together the underlying support structure of a solution — the storage, deployment, and management framework. Next, the natural progression is to handle the source data, whether that is in files or extracted from a relational database management system (RDBMS), which often requires a data-cleansing process. Next, the chapters delve into dimension and fact table loading, as well as the cube-processing steps. The final chapters address advanced data handling through scripting, and provide the package availability and performance that many solutions require.
The following chapter provides a solution overview and discusses planning: Chapter 1, “SSIS Solution Architecture.”
The following chapters examine SSIS administration and the deployment foundation: Chapter 2, “SSIS Management Framework Design,” and Chapter 3, “Package Deployment and Storage Decisions.”
The following chapters discuss file management and data extraction: Chapter 4, “File-Handling and Processing Methods,” and Chapter 5, “Data Extraction Best Practices.”
The following chapter discusses data cleansing: Chapter 6, “Data-Cleansing Design.”
The following chapters cover data warehouse ETL: Chapter 7, “Dimension Table ETL”; Chapter 8, “Fact Table ETL”; and Chapter 9, “SSAS Processing Architecture.”
The following chapters examine advanced ETL concepts: Chapter 10, “Implementing Scale-Out ETL Process”; Chapter 11, “Scripting Design Patterns”; and Chapter 12, “SSIS Package Scaling.”
The Problem–Design–Solution format and order of the chapters together provide a well thought-out, organized, and systematic approach to SSIS solution design.
What You Need To Use This Book
Following is a breakdown of what you need to use this book:
This book has been written for SQL Server 2008 with Service Pack 1 (SP1) installed; SP1 was released in April 2009.
Conventions
To help you get the most from the text and keep track of what’s happening, a number of conventions have been used throughout the book.
Boxes like this one hold important, not-to-be forgotten information that is directly relevant to the surrounding text.
Tips, hints, tricks, and asides to the current discussion are offset and placed in italics like this.
As for styles in the text, code is presented as follows: in code examples, we highlight new and important code with boldfaced text. Boldface is not used for code that is less important in the present context, or that has been shown before.
Source Code
As you work through the examples in this book, you may choose either to type in all the code manually, or use the source code files that accompany the book. All of the source code used in this book is available for download at www.wrox.com. Once at the site, simply locate the book’s title (either by using the Search box, or by using one of the title lists), and click the Download Code link on the book’s detail page to obtain all the source code for the book.
Because many books have similar titles, you may find it easiest to search by ISBN; for this book, the ISBN is 978-0-470-52576-0.
Once you download the code, just decompress it with your favorite compression tool. Alternately, you can go to the main Wrox code download page at www.wrox.com/dynamic/books/download.aspx to see the code available for this book and all other Wrox books.
Errata
We make every effort to ensure that there are no errors in the text or in the code. However, no one is perfect, and mistakes do occur. If you find an error in one of our books (such as a spelling mistake or faulty piece of code), we would be very grateful for your feedback. By sending in errata, you may save another reader hours of frustration, and, at the same time, you will be helping us provide even higher-quality information.
To find the errata page for this book, go to www.wrox.com and locate the title using the Search box or one of the title lists. Then, on the book details page, click the Book Errata link. On this page, you can view all errata that has been submitted for this book and posted by Wrox editors. A complete book list including links to each book’s errata is also available at www.wrox.com/misc-pages/booklist.shtml.
If you don’t spot “your” error on the Book Errata page, go to www.wrox.com/contact/techsupport.shtml and complete the form there to send us the error you have found. We’ll check the information and, if appropriate, post a message to the book’s errata page and fix the problem in subsequent editions of the book.
p2p.wrox.com
For author and peer discussion, join the P2P forums at p2p.wrox.com. The forums are a Web-based system for you to post messages relating to Wrox books and related technologies, and to interact with other readers and technology users. The forums offer a subscription feature that emails you when new posts are made to the forums on topics of your choosing. Wrox authors, editors, other industry experts, and your fellow readers are present on these forums.
At http://p2p.wrox.com you will find a number of different forums that will help you not only as you read this book, but also as you develop your own applications. To join the forums, just follow these steps:
1. Go to p2p.wrox.com and click the Register link.
2. Read the terms of use and click Agree.
3. Complete the required information to join, as well as any optional information you wish to provide, and click Submit.
4. You will receive an email with information describing how to verify your account and complete the joining process.
You can read messages in the forums without joining P2P, but to post your own messages, you must join.
Once you join, you can post new messages and respond to messages other users post. You can read messages at any time on the Web. If you would like to have new messages from a particular forum emailed to you, click the “Subscribe to this Forum” icon by the forum name in the forum listing.
For more information about how to use the Wrox P2P, be sure to read the P2P FAQs for answers to questions about how the forum software works, as well as many common questions specific to P2P and Wrox books. To read the FAQs, click the FAQ link on any P2P page.
1
SSIS Solution Architecture
Imagine that this is the first day of a new internal or client project. You will have responsibility for the data and processing layer of the solution, which involves processing data — a lot of data — from several sources, and then either integrating systems or consolidating data for reporting. Maybe your background is as a developer, a database administrator (DBA), or a data designer, and you know SSIS fairly well. But now they are calling you the “data and ETL architect.”
ETL is the acronym for Extraction, Transformation, and Loading, which is used to describe the data-processing layer in a data-integration or data warehouse solution.
The project manager approaches you and says that the Vice President of Technology has asked the team to give him an estimate of the infrastructure needed. Furthermore, the business owner wants a high-level overview of how the solution architecture will help the company achieve the business need most efficiently. The project manager also wants your thoughts on the best way to approach the solution, how the development should be coordinated between team members, and how deployment should be handled.
Three meetings are scheduled for tomorrow to address these things, and you’ve been asked to provide your feedback in those areas.
Where do you start? How should you approach the solution design with SSIS as the main technology? How should all the pieces work together?
This chapter examines how to lay the foundation for successful solutions based on SQL Server Integration Services (SSIS). And, in fact, this whole book is about SSIS solutions to real-world requirements and challenges. It addresses questions such as those just posed: where to start, how to approach the solution design, and how all the pieces should work together.
Before you dive into the technical challenges of a project, you must first step back and ensure that you are laying the right foundation. Jumping right in is tempting! But resist the urge, because you want to (and need to) set the precedent and patterns for the solution up front. If you don’t, chances are you won’t be able to go back and make changes down the road.
As with all chapters in this book, this chapter is organized into three major sections: “Problem,” “Design,” and “Solution.”
Problem
Data and ETL projects have many challenges. Some challenges relate to data, some to enterprise integration, some to project coordination, and some to general expectations. This section begins by looking at the bigger picture of data within an organization, and then narrows the focus to ETL projects, SSIS packages, and package execution.
Macro Challenge: Enterprise Data Spaghetti
Maybe your SSIS project is only a small solution in a bigger enterprise pond. The problem is that it can still cause a ripple effect when you tie it into your environment. Or, you can have challenges caused by an unwieldy enterprise environment when you try to implement your solution.
Figure 1-1 shows a not-so-nice telephone/electricity pole that illustrates the data nervous system of many organizations.
Figure 1-1

The problem with Figure 1-1 is that this mess didn’t happen overnight! It grew into this twisted, disorganized process because of poor planning, coordination, and execution. Be aware, however, that a lot of the time a corporation’s politics may lead to this type of situation. Departments hire their own technical people and try to go around IT. Systems don’t talk to each other nicely. Project pressures (such as time and budget) cause designers to cut corners.
This kind of tangled mess develops in an organization’s data processing for reasons like these, and the resulting “unorganization” causes many problems of its own.
The “Design” section later in this chapter discusses how to approach your SSIS-based ETL project in the right way, and ensure that you are helping to solve the problem, rather than adding to it.
Micro Challenge: Data-Processing Confusion
Another common problem with data processing occurs when the logic used to process data is overly complicated and confusing. Just like the macro enterprise problem, this situation usually results from changes over time, where logic is modified and appended. It usually comes in one of two forms: runaway procedures, where layer upon layer of logic is nested in database code, or unmanageable packages, where a single SSIS package tries to do too much in its control flow and data flow. Figure 1-2 shows an example of the latter.
Figure 1-2

The overly complex control flow shown in Figure 1-2 is similar to an overly complex data flow, where too many components are used, making development, troubleshooting, and support difficult to manage. The “Design” section later in this chapter proposes a better approach for SSIS packages, called the modular package approach.
In summary, both of these types of processes (runaway procedures and unmanageable packages) are very difficult to support, and not suited to team development, error handling, and scalability (all of which are addressed in Chapter 12).
Problems with Execution and Troubleshooting
A couple of other issues that often come up in an ETL or data-integration solution are poor process coordination and difficulty doing root cause analysis. If the “what happened?” question can’t be answered quickly and with confidence, then likely there is a problem with the overall solution execution and logging strategy.
Figure 1-3 shows the command-line output of an example SSIS package execution.
Figure 1-3

If you find yourself spending time working through output like this to figure out what went wrong, then you should consider implementing a better execution and auditing structure. This includes package execution in your development environment.
If you have just turned on the out-of-the-box SSIS logging and are capturing the results to a table, that still may not be enough. If you have to write custom queries against the SSIS logging table every time you need to figure out what happened, then you also need a better strategy.
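For example, troubleshooting with the built-in SQL Server log provider often devolves into ad hoc queries such as the following sketch. It assumes the default dbo.sysssislog table that the SSIS 2008 SQL Server log provider creates in the logging database; adjust the table and event names to your own environment. Chapter 2 replaces this kind of one-off querying with a reusable logging and auditing framework.

    -- Ad hoc troubleshooting query against the default SSIS 2008 log table (dbo.sysssislog).
    -- Returns the errors and warnings logged by the most recent package execution.
    SELECT TOP (50)
           source,      -- the task or package that raised the event
           event,       -- OnError, OnWarning, and so on
           starttime,
           message
    FROM   dbo.sysssislog
    WHERE  event IN ('OnError', 'OnWarning')
      AND  executionid = (SELECT TOP (1) executionid
                          FROM   dbo.sysssislog
                          ORDER BY starttime DESC)
    ORDER BY starttime;

Queries like this work, but they answer only one question at a time, which is exactly why a standardized execution and auditing structure pays off.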
Infrastructure Challenges
When designing an SSIS ETL solution, how do you determine the infrastructure components such as server hardware, memory, processor cores, network switches, disk arrays, storage networks, and I/O controllers? Related to that, where should you run your SSIS packages taking into consideration sources, destinations, and other applications, while balancing hardware scalability and location within your network topology?
These questions are not trivial, and the answers depend on a lot of factors, including processing windows, source and destination availability, application impact and availability, network bandwidth, fault-tolerance requirements, and so on.
I/O is usually the biggest bottleneck, and the one most often overlooked. I/O (or, more precisely, disk I/O) is the throughput that the system can handle on the drive volumes. And this challenge is not just about trying to get the greatest throughput on a single drive. You must consider staging and temporary environments, logging, and current and historical data. And you must balance it all with hardware availability and budget.
The reason disk I/O is so important when considering a data-integration or ETL effort is the nature of what you are doing, including the following: