The author has taken care in the preparation of this book but makes no expressed or implied warranty of any kind and assumes no responsibility for errors or omissions. No liability is assumed for incidental or consequential damages in connection with or arising out of the use of the information or programs contained herein.
All rights reserved. This publication is protected by copyright, and permission must be obtained from the publisher prior to any prohibited reproduction, storage in a retrieval system, or transmission in any form or by any means, electronic, mechanical, photocopying, recording, or likewise. For information regarding permissions, write to:
Craig S. Mullins
c/o Mullins Consulting, Inc.
15 Coventry Court
Sugar Land TX 77479
ISBN: 978-1-54-394916-2
First printing
© Copyright 2019 by Craig S. Mullins.
All rights reserved.
The following terms are trademarks of International Business Machines Corporation in many jurisdictions worldwide:
IBM, Db2, z/OS, CICS, IMS, Parallel Sysplex, Optim, MQSeries. DS6000, DS8000, FlashCopy, QMF, MVS, Db2 Universal Database, developerWorks, RACF, InfoSphere, WebSphere, Db2 Connect, POWER, pureQuery, pureData, pureScale, pureXML, Redbooks, VTAM, Resource Measurement Facility, GDPS, System z, Distributed Relational Database Architecture, iSeries, AS/400, AIX, z Systems, Cognos, SPSS, THINK, Watson.
Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at:
www.ibm.com/legal/copytrade.shtml.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates. UNIX is a registered trademark of The Open Group in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, or service names may be trademarks or service marks of others.
This book is dedicated to my wife,
Beth Mullins,
who is loving, hard-working, and always
supportive of me and my writing projects.
Introduction
Welcome and thank you for purchasing this book.
The purpose of this book is to give advice and direction to Db2 application developers and programmers on writing efficient, well-performing programs. If you write code and access data in a Db2 database, then this book is for you.
The material is written for all Db2 professionals. It will be useful for users of Db2 for Linux, Unix and Windows as well as for users of Db2 for z/OS. When there are pertinent differences between the two it will be pointed out in the text.
Much of the material in this book will apply to any relational database system, not just Db2; nevertheless, the book is written and intended specifically for Db2 users.
The focus of the book is on programming, coding and developing applications. As such, it will not focus on database administration, design, and data modeling, nor will it cover most Db2 utilities, DDL, and other non-programming related details.
If you are a DBA, most of the material will still be of interest to you because DBAs are responsible for overall Db2 performance. Therefore, it makes sense to understand the programming aspect of performance. That said, the book does not discuss DBA and system administration level performance.
It is important also to understand that the book is not about performance monitoring and tuning. These activities are important but are typically not the domain of application developers. Instead, the book offers guidance on application development procedures, techniques, and philosophies. The goal of the book is to educate developers on how to write good application code that lends itself to optimal performance. Of course, this does not mean that every program you write will perform at top speed. But by following the principles in this book you should be able to write code that does not require significant remedial, after-the-fact modifications by performance analysts. If you follow the guidelines in this book your DBAs and performance analysts will love you!
The assumption is made that the reader has some level of basic SQL knowledge and therefore it will not cover how to write Db2 SQL code or code a Db2 program.
It is also important to point out that the book will not rehash material that is freely available in Db2 manuals that can be downloaded or read online. It is assumed that the reader has access to the Db2 manuals for their environment (Linux, Unix, Windows, z/OS).
What you will get from reading this book is a well-grounded basis for designing and developing efficient Db2 applications that perform well.
About the Author
Craig S. Mullins is a data management strategist, researcher, and consultant. He is currently president and principal consultant of Mullins Consulting, Inc.
Craig has been named by IBM as a Gold Consultant and an IBM Champion for Analytics. He was also named one of the Top 200 Thought Leaders in Big Data & Analytics by Analytics Week magazine. He is a former Gartner analyst, having covered the topic area of database administration as a research director.
Craig has over three decades of experience in all facets of database systems development and has worked with Db2
since V1. Through Mullins Consulting, Inc., Craig offers data management services and solutions that meet the demanding requirements of all size organizations.
Service offerings available through Mullins Consulting, Inc. include:
Implementation and migration assistance
Database design
Performance audits and reviews
Application performance tuning and SQL optimization
Security and governance audits and assessments
Specialty Db2 services
Craig has written two other books:
DB2 Developer's Guide, 6th edition (IBM Press, ISBN: 978-0132836425): this best-selling book contains over 1,500 pages of easy-to-read, yet in-depth coverage of all essential Db2 for z/OS topics.
Database Administration: The Complete Guide to DBA Practices and Procedures, 2nd edition (Addison-Wesley, ISBN: 978-0321822949); the industry's only comprehensive, technology-independent guide to the modern discipline of database administration.
You can follow Craig on Twitter (@craigmullins) and view his professional profile and recommendations on LinkedIn at www.linkedin.com/in/craigmullins/
Additional details and information available at Craig’s web site: www.mullinsconsulting.com.
Acknowledgements
Writing a book is a lengthy process that involves a lot of solitary time at the keyboard. But it also requires a community of people providing assistance, both personally and professionally.
Personally, I would like to thank my wife, Beth Mullins, who is always there for me and supportive in all that I choose to do.
Professionally, I would like to thank all my co-workers over the years. I have worked at many different companies, both as a full-time employee and a consultant, and I have been lucky to have many co-workers, managers, and clients that I can call friends. And many of them have helped to me achieve the success I enjoy today.
Finally, I want to specifically thank the technical reviewers of this book. Bill Arledge helped me to review the initial concept for the book, as well as reading the entire manuscript and providing helpful input from a Db2 for z/OS perspective. Michael Krafick lent his Db2 for LUW expertise to the review process and his input was essential to ensuring the accuracy of the content for non-mainframe platforms. Doug Partch provided his in-the-trenches experience to the entire review process, as well as providing the framework for the chapter on SQL paging.
This book is surely better given the guidance and advice provided by the technical reviewers. Of course, any oversights or errors are completely my responsibility.
Table of Contents
1 Db2 Performance Essentials
2 Code Relationally
3 Minimize Passes Through the Data
4 Black Boxes and Code Generators
5 Code for Concurrency
6 Locking and Isolation Levels
7 Load, Unload, Merge and
8 The Importance of Indexes
9 Data Clustering
10 Relational Optimization
11 Sequential vs. Random Data Access
12 Types of Joins
13 Stages, Indexability, and Sargability
14 Filter Factors and Statistics
15 Access Paths and Explain
16 Sorting and Grouping
17 Parallelism
18 Functions
19 Stored Procedures
20 Static versus Dynamic SQL
21 Writing SQL: Multiple Ways to Code a Statement
22 SQL Tips and Techniques for Performance
23 Tweaking SQL
24 Using Optimization Hints and Profiles
25 SQL Paging in Db2 Applications
26 Working with Views
27 Advanced SQL Topics
28 Testing Considerations
29 DevOps, Agile and Continuous Delivery
30 Additional Resources
Chapter 1
Db2 Performance Essentials
We must first understand what “performance” means before discussing optimal development techniques
If you work with Db2 (or indeed, any DBMS), you most likely deal with performance related issues. But have you ever stopped for a moment and tried to define what "database performance" means? Doing so can be a worthwhile exercise to help organize your thinking and behavior.
Defining Database Performance
To better understand the concept of database performance, consider the familiar concepts of supply and demand. Users demand information from Db2 and Db2 supplies information to those requesting it. The rate at which Db2 supplies the demand for information can be loosely thought of as "database performance."
But let’s investigate at a deeper level. There are five factors that influence database performance: workload, throughput, resources, optimization, and contention.
The workload requested of the DBMS defines the demand. It is a combination of transactions, web requests, batch jobs, ad hoc queries, business intelligence queries, analytics requests, utilities, and system commands directed through the DBMS at any given time. Workload can fluctuate drastically from day to day, hour to hour, minute to minute, and even second to second. Sometimes workload can be predicted (such as heavy month-end payroll processing, or very light access after 6 pm, when most users have left for the day). At other times it is unpredictable. The overall workload can have a major impact on database performance.
Throughput defines the overall capability of the computer to process data. It is a composite of I/O speed, CPU speed, parallel capabilities, and the efficiency of the operating system and system software. Do not just base your throughput assumptions on hardware capacity figures (e.g., MHz for Wintel boxes, MSUs and MIPS for mainframes). Perhaps you have installed hard or soft capping on your box, which can impact throughput. And don't forget about those specialty processors (such as zIIPs) if you program on IBM mainframes.
The hardware and software tools at the system’s disposal are called the resources of the system. Examples include memory (such as that allocated to buffer pools or address spaces), disk, cache controllers, and microcode.
The fourth defining element of database performance is optimization. All types of systems can be optimized, but relational database systems are unique in that query optimization is primarily accomplished internal to the DBMS. Ensuring that you have provided up-to-date and accurate database statistics for the query optimizer is of the utmost importance in achieving optimal SQL queries. Keep in mind, though, that there are other factors that need to be optimized (SQL, database parameters, system parameters, etc.) to enable the relational optimizer to create the most efficient access paths. And there are optimization aspects that are outside the scope and control of the relational optimizer, too, such as efficient program and script coding, proper application design, coding efficient utility options, and so on.
When the demand (workload) for a resource is high, contention can result. Contention is the condition in which two or more components of the workload are attempting to use a single resource simultaneously in a conflicting way (for example, dual updates to the same piece of data). When one program tries to read data that is in the process of being changed by another, the DBMS must prohibit access until the modification is complete to ensure the integrity and accuracy of that data. Db2 uses a locking mechanism to enable multiple, concurrent users to access and modify data in the database. Using locks, Db2 automatically guarantees the integrity of data (at least in terms of what you have requested). DBMS locking strategies permit multiple users from multiple environments to access and modify data in the database at the same time. As contention increases, locks are taken and throughput can decrease.
So, putting these factors together: the definition of database performance is the optimization of resource usage to maximize throughput, minimize contention, and process the largest possible workload.
In addition, database applications regularly communicate with other system software, which must also be factored into performance planning. Many factors influence not only the performance of the DBMS and applications accessing its databases, but also the performance of the other system components (e.g., transactions processor, network software, application servers, etc.)
The Three Aspects of Database Performance
Now that we understand the high-level definition of database performance let’s discuss the three aspects of database systems where performance must be managed:
• the application,
• the database structures, and
• the system.
The Application
Application code must be designed and coded appropriately and efficiently. Many performance problems are caused by improperly coded applications. SQL is the primary culprit; coding efficient SQL statements can be complicated. Developers need to be taught how to properly formulate, monitor, and tune SQL statements.
Not all application problems are due to improperly coded SQL. The host language application code in which the SQL has been embedded may be causing the problem. For example, Java, COBOL, C++, Python, and many others supported by Db2. The host language code may be inefficient, causing database application performance to suffer.
Techniques for improving your host language and SQL coding is the primary focus of this book.
Database Structures
The physical design of your database structures can also have a significant impact on performance. Important factors in this category include normalization, disk storage, number of tables, index design, and use of DDL and its associated parameters.
Design is not the only component of database performance. The organization of the database will change over time. As data is inserted, updated, and deleted from the database, the efficiency of the database will degrade. Moreover, the files that hold the data may need to expand as more data is added. Perhaps additional files, or file extents, will need to be allocated. Both disorganization and file growth can degrade performance.
Indexes also need to be monitored, analyzed, and tuned to optimize data access and to ensure that they are not having a negative impact on data modification.
It is important to understand these issues but controlling and managing the performance of database structures is a task for DBAs, not application programmers and developers.
The System
System tuning occurs at the highest level and has the greatest impact on the overall health of database applications because every application depends on the system. For the purposes of this discussion, we will define the system as comprising the DBMS itself and all the related components on which it relies. No amount of tuning is going to help a database or application when the server it is running on is short on resources or improperly installed.
The Db2 system and its environment can and must be tuned to assure optimum performance. The way in which the DBMS software is installed, its memory, disk, CPU, other resources, and any configuration options can impact database application performance.
The other systems software with which the DBMS interacts includes the operating system, networking software, message queueing systems, middleware, and transaction processors. System tuning comprises installation, configuration, and integration issues, as well as ensuring connectivity of the software to the DBMS and database applications.
Again, though it is important to acknowledge and understand the importance of tuning the DBMS system, this task typically falls to a system administrator or DBA, not the application development staff.
Three Primary Performance Indicators
When it all boils down to it, though, there are three primary things that impact the performance of Db2 applications: CPU, I/O, and concurrency.
Figure 1. The Three Primary Performance Indicators
CPU is the amount of machine processor power that is consumed to perform an activity. Simply activities generally consume less CPU than more complex activities. The fewer CPU cycles required to achieve a task, the more efficient the process will be.
I/O involves reading data from and writing data to disk. Disk drives are mechanical and therefore it takes time for the physical parts to move to the proper location to retrieve or write the requested data. This latency – the time between when the data is requested and when it is obtained – contributes to performance degradation. Therefore, the fewer disk reads required to achieve a task the better performance will be.
Concurrency is the ability to perform more than one task at the same time. In the context of database processing, concurrency requires a lock manager. The lock manager controls which processes are accessing and updating which pieces of data. This allows multiple tasks to all access the same database table at the same time without corrupting the data. The more tasks that can be accomplished in the same time window the better throughput will be.
I/O and CPU are computing resources to be optimized. The transactions and SQL of your applications constitute the workload that needs to be processed. Throughput defines the capacity of the system to process workload. When the demand (workload) for a resource is high, contention can result. Contention is the condition in which two or more components of the workload are attempting to use a single resource at the same time in a conflicting way.
As Figure 1 shows, the goal is for your programs to minimize the amount of CPU and I/O they consume, while also maximizing the concurrency of all programs. These concepts are somewhat inter-related, too. When the same amount of work is performed using less I/O, CPU savings occur. This is so because there are many system level processes required to perform an I/O operation. Furthermore, note that how you code your programs will impact the concurrency of other programs, so just like grade school it is important to write code that “works well with others.”
Keep these high-level concepts in mind as you read through the rest of this book. We will regularly reference these three performance indicators as we discuss the various aspects of efficient Db2 application development.
The Rest of the Book
As you progress through the rest of this book there will be three broad sections.
In the first section, general application development guidelines and techniques will be presented. For these chapters, very little actual code will be addressed, instead the content focuses on approaches, mindsets and philosophies of development that result in efficient Db2 applications.
The second section moves into discussing data access methods and techniques and how Db2 accomplishes them. It will give guidance on how many types of application development requirements translate into Db2 and how to tackle them effectively.
Finally, the third part will give guidance on Db2 SQL development for performance. There will be a lot of SQL code and examples in this section.
When you understand each of these areas you will be well on your way to becoming a Db2 application developer that writes efficient programs… and that earns the respect of your peers and managers.
Chapter 2
Code Relationally
It is necessary to develop a relational mindset to become a successful Db2 programmer who codes with performance in mind. But what does this mean? First, we must understand what a relational database system is and how that differs from other types of data storage.
What is a Database?
Before we talk about relational database, let’s first answer the question: “What is a database?” A database is a large structured set of persistent data. So, a phone book is a database. But within the world of IT, a database usually is associated with software. A simple database might be a single file containing many records, each of which contains the same fields having the same data type and length. In short, a database is an organized store of data wherein the data is accessible by named data elements.
A Database Management System (DBMS) is a software package designed to create, store, and manage databases. The DBMS software enables end users or application programmers to share data. It provides a systematic method of creating, updating, retrieving and storing information in a database. DBMSs are generally responsible for data integrity, data access control, and automated rollback, restart and recovery.
In layman’s terms, you can think of a database as a file of information. You can think of the filing cabinet itself along with the file folders and labels as the DBMS. A DBMS manages databases. You implement and access database instances using the capabilities of the DBMS.
Db2 is a database management system. Your payroll application uses the payroll database, which may be implemented using Db2 (or some other DBMS). It is important to understand this distinction to avoid confusion as we move forward.
Relational Database Systems
Relational database systems became the norm in IT in the 1980s as low-cost servers became powerful enough to make them widely practical and relatively affordable. There are other types of database systems available (such as NoSQL, hierarchical, and network) but the RDBMS, of which Db2 is one of the leading offerings, continues to be the leader in terms of usage, revenue, and installed base.
Relational technology is based on the mathematics of set theory. Relational databases provide data storage, access and protection with reasonable performance for most applications, whether operational or analytical in nature. The RDBMS is adaptable to most use cases in a reliable and efficient way.
The term relational comes from the mathematical term relation. In set theory, a relation is a set of unordered elements — all of the same type. A relational DBMS is based on relations. This overview of relational theory offers a quick introduction.
It is important to note that today’s database systems that are referred to as relational do not conform to all the requirements and definition of relational theory. For additional references that can offer more details, consult the Bibliography at the end of this chapter.
How to Think About Data in a Db2 Database
Working with a relational mindset is an important requirement for writing efficient Db2 application programs. Doing so requires an understanding of how data is stored and referenced by a relational DBMS like Db2.
A database is a complex set of inter-related data designed for a specific intent. Do not think of the database as a set of files. Files have no relationships set within and among them, whereas your database does.
Furthermore, do not think of tables as files because they are based on sets: Sets are not ordered whereas files have a specific order to them. Although there are performance-specific physical storage details that are important to learn (and we will cover them later) your relational mindset should be that tables are unordered sets of data. And members of each set are all of the same type. That means that each row has the same number of columns each of the same data type.
When you perform an operation on a set, the action happens "all at once" to all the members of the set. Programmers tend to think in terms of sequential operations such as: Read x, multiply it by 2, save it to a new location, read another x until there are no more. We can accomplish all of this in one SQL statement with something like this:
There is no looping and all the actions are contained in the single SQL statement. It is imperative to be able to think this way to write Db2 applications programs that perform well.
Additionally, rows are not records. Records contained in files or data sets are sequential, stored in the order they were written. Db2 rows have no specific physical order and can be accessed by coding the appropriate SQL WHERE clauses.
Finally, columns are not fields. Columns are typed and can be null. This is not so for fields. Without a program, a field has no meaning.
How This Should Impact Your Coding
Application developers accustomed to processing data a record-at-a-time will make very poor Db2 programmers without some training in the set-at-a-time nature of accessing data in a relational database.
If you have experience programming with flat files you must unlearn the “flat file” mentality. Forget sequentially accessing data record by record. Access what you need using the features of SQL.
Master file processing is not appropriate for optimal Db2 applications. With master file processing two or more files are read with one driving reads to the other. For example, consider a program designed to send offers to all customers who purchased dairy items in November.
The master file approach would read the customer purchase history file looking for dairy items purchased in November. When it finds one it will take the customer id read from the history file and use it to read from the customer file to gather the customer address.
The SQL approach simply joins the two tables (customer purchase history and customer) using the customer id with where conditions to limit the output to dairy items in November. Here is what the SQL solution looks like:
Another aspect of coding relationally is to understand cursors. Remember, a SQL select statement can return multiple rows. A cursor is used to enable application programs to access individual rows. The select statement is assigned to a cursor, which is opened by the program, and then rows are fetched from the cursor one by one. For example:
When you open a cursor in your program to process a SQL statement it is not the same as opening a file. Opening a cursor can cause a lot of activity to occur (e.g. sorting) whereas opening a file is a benign operation.
Set-at-a-Time Processing and Relational Closure
Every operation performed on a relational database operates on a table (or set of tables) and results in another table. This feature of relational databases is called relational closure.
All SQL data manipulation operations—Select, Insert, Update, Delete—are performed at a set level. One retrieval statement can return multiple rows; one modification can modify multiple rows.
Application developers must learn a different way of interacting with data because of the set-at-a-time nature of SQL. Most programming languages operate on data one record-at-a-time. When a program requires relational data, though, it must request the data using SQL, creating an impedance mismatch. The program expects data to be returned a single row-at-a-time, but SQL returns data a set-at-a-time. Db2 provides a feature called a cursor that accepts the input from a SQL request and provides a mechanism to fetch individual rows of the results set. Some programming tools automatically transform multi-row sets to single rows when communicating with Db2.
Furthermore, many programmers are accustomed to hard-wiring data-navigational instructions into their programs. SQL specifies what to retrieve but not how to retrieve it. Db2 determines how best to retrieve the data based on the request. Programmers unaccustomed to database processing are unlikely to grasp this concept without some training.
Relational Optimization
Db2 determines the best method for accessing and modifying data based on your SQL statements, information about your system and database statistics. The same SQL statement can be optimized by Db2 to do the same work in many ways. This is a key benefit of using SQL instead of writing host language code.
There is a component of Db2, known as the Optimizer, that processes SQL and creates executable code for it. The Optimizer is very complex and understanding all the nuances of how it works is not something most Db2 programmers need to know. There are important aspects of optimization, such as types of access paths, filter factors, and indexing, that will be covered later in this text.
Let Db2 Do the Work
An important guideline for coding relationally is to let Db2 do as much work as possible by coding as many of your requirements into the SQL as you can. The more work that can be done in Db2, the more efficient your program will tend to be.
Why is this so? Remember from Chapter 1 the three components of performance: I/O, CPU and concurrency. The more work that Db2 can do without moving data to your program the more we can reduce I/O operations. Many programmers without Db2 experience tend to revert to their earlier programming practices.
For example, instead of coding appropriate SQL WHERE clauses a novice programmer may choose to open a cursor using a simple Select statement, read every row of data, and examine it using If-Then-Else logic in their program. This is inefficient because it causes more I/O than coding a SQL statement with all the needed WHERE clauses. The SQL/WHERE approach can minimize I/O operations because rows that do not qualify may not need to be read and then returned to the program.
Consider our earlier example joining customer and purchase history information. The two WHERE clauses in the SQL limit the rows that need to be processed by your program to only those required. If instead we were to remove the two WHERE clauses, then after fetching each row from the cursor the program would have to check if the purchase was a dairy item and made in November. If not, the row is not processed and the I/O operation for that data essentially was wasted.
Another common mistake for the novice SQL developer is to read each table separately, whereas joining tables together is more efficient. Consider changing the prior example SQL to split it into two separate statements. This would require more CPU, I/O, and application code to retrieve the same results. For example:
Since the first select of the history table can retrieve many rows, looping logic is required to process each row returned. And for each history row returned, Db2 must fetch the customer address from the customer table.
On the other hand, if the program requirements are not read only, but we need to update the Customer address for some reason, this example may be optimal. It all depends upon your specific requirements and situation.
For the most part, the general rule of thumb is to put the work into the SQL, not into the application logic. SQL and Db2 code will be more efficient than anything you can write. And SQL can be re-optimized by Db2 if data patterns change, whereas your program would have to be re-written to take advantage of changes.
Do Not Fear Complex SQL
Although simple SQL statements obviously are easier to code and understand, you should not strive to keep your SQL simple. If we follow the previous guidance and put as much work as possible into our SQL statements, then the SQL can become quite complex. It is not uncommon for a well-written SQL statement to span multiple screens or printed pages.
From a performance-perspective, complex SQL is good!
Ask for Only What is Needed
When you write your SQL statements to access Db2 data be sure to define requests as narrowly as possible. In other words, ask for only the data that you need. This makes sense, right? Why code your program to return data that it does not need? Unfortunately, this simple requirement is something that SQL programmers regularly forgo.
It is critically important that Db2 SQL programmers embrace the following three guidelines. If you do so, you are well on your way to building efficient Db2 applications. Of course, SQL performance is a complex topic and to understand every nuance of how SQL performs can take a lifetime. That said, adhering to the following simple rules puts you on the right track to achieving high-performing Db2 applications.
The first rule is to always provide only the exact columns that you need to retrieve in the Select-list of each SQL Select statement. If there are 25 columns in a table that your program is accessing, and you only need 24 of them, only specify those 24.
Db2 will consume additional resources for every column that is requested to be returned. If the program does not need the data, it should not ask for it. Even if the program needs every column, it is better to explicitly ask for each column by name in the SQL statement for clarity and maintainability. Another way of stating this is: Do not use SELECT *. The shorthand SELECT * means retrieve all columns from the table(s) being accessed. This is fine for quick and dirty queries but is bad practice for inclusion in application programs because Db2 tables may need to be changed in the future to include additional columns. SELECT * will retrieve those new columns, too, and your program may not be capable of handling the additional data without requiring time-consuming changes.
The second rule is: Do not ask for what you already know. This may sound simplistic, but most programmers violate this rule at one time or another. For a typical example, consider what is wrong with the following SQL statement:
Give up? The problem is that the Job and Workdept columns are included in the Select-list. You already know the values of Job and Workdept because they are referenced in WHERE clauses. This is the case if you use literals, like in the example, or host variables.
But with Job and Workdept listed in the WHERE clause, Db2 will dutifully retrieve those columns. Each column retrieved causes additional overhead to be incurred thereby degrading performance. Although the overhead is minimal, if your program runs hundreds or thousands of times a day, that overhead occurs hundreds or thousands of times. And that overhead, which can easily be avoided, adds up.
The third rule is one we have already discussed: Use the WHERE clause to filter data in the SQL instead of bringing it all into your program to filter. This too is a common rookie mistake. It is much better for Db2 to filter the data before returning it to your program. The fewer rows passed to your program, the more efficient your SQL will be.
Consider the following SQL:
This SQL is better than simply reading all the data without the WHERE clause and then programming to check each row to see if the Salary is greater than 50000.00.
These rules, though, are not the be-all, end-all of SQL performance tuning – not by a long shot. Additional, in-depth tuning may be required. But following the above rules will ensure that you are not making “rookie” mistakes that can kill application performance.
Design for Performance from the Start
All programmers should be well-versed in the programming languages they use as well as the system software (such as Db2) used to support them. This knowledge is essential if you want to write efficient code. Optimal performance is only possible if you understand the tools at your disposal.
It is better to design for performance from the start. Such design is difficult, if not impossible, if you do not have a firm grasp on the technology. If you make a design mistake, correcting it later in the development process is difficult. The further into the development process you are, the more painful it becomes to make program changes. It is even harder to change production code.
Proper design based on appropriate knowledge is a requirement for any type of development project, even for modern agile development teams.
Code Modular Db2 Programs
Although not specifically a relational database issue, it is a good idea to design your Db2 programs to be modular. Work with the philosophy of “One program accomplishes a single, well-defined task.”
If you need to execute multiple tasks, structure the programs so that tasks can be strung together by having the programs call one another. This approach is preferable to a single, large program that accomplishes many tasks for multiple reasons. Single tasks in separate programs make the programs easier to understand and maintain.
If each task can be executed either alone or with other tasks, isolating the tasks in a program enables easier execution of any single task or list of tasks.
Performance monitoring and tuning becomes easier. Many reports generated by performance monitors can be produced by package, thereby simplifying performance measurement and reporting.
It is also a good philosophy to work on designing your Db2 programs to be as small as possible. Streamlining your application code to remove unnecessary statements results in better performance because CPU is avoided when code is eliminated.
Bibliography
Celko, Joe (2008). Thinking in Sets: Auxiliary, Temporal and Virtual Tables in SQL. Morgan Kaufmann. ISBN 978-0-12-374137-0
Codd, E.F. (1990). The Relational Model for Database Management, Version 2. Addison-Wesley. ISBN 0-201-14192-2
Date, C.J. (2007). Logic and Databases: The Roots of Relational Theory. Trafford Publishing. ISBN 1-4251-2290-6
Date, C.J. (2015). SQL and Relational Theory: How to Write Accurate SQL Code, 3rd edition. O’Reilly. ISBN 978-1-491-94117-1
Chapter 3
Minimize Passes Through the Data
There are many important aspects to consider as you design your program with performance in mind. Minimizing the number of times that you access the same data should be one of the most important factors to consider. Eliminating unnecessary passes through the data will reduce I/O operations and CPU time; thereby improving performance.
That said, it is not uncommon for programmers to build in multiple passes through the same data as they design application workflows. This chapter will look at several examples and examine the impact of unnecessarily reading the same data multiple times.
Application Design
Before writing any code, a seasoned developer will map out the requirements of their application. As part of this up-front design process you must identify and document where data will be accessed and modified. Seek to minimize or eliminate steps that access the same data multiple times during design to reduce your coding effort.
As you create and review your application design, look for places where the same table is accessed more than once in a single program. Of course, this may not be a problem, depending on the predicates specified in your WHERE clauses. For example, consider the following two statements:
Although the same table is being accessed, different data is being requested. The first is looking for employees that do not receive a bonus, whereas the second is looking for employees that receive a bonus larger than their base salary.
On the other hand, consider these two SQL statements: