As one of Europe’s leading suppliers of DB2 services, my organization has been involved in a vast number of technical support calls and incidents with our customers. These situations range from the tragic to the downright bizarre, but each and every one can be avoided (or at least mitigated) with the right processes and procedures in place.
In this column, I’d like to take you on a quick tour of some of my favourite DB2 support incidents and war stories for Linux, UNIX and Windows. All of these actually happened, but I have omitted some details in order to keep them suitably anonymous. With thanks to the many Triton consultants who contributed, and who still bear the scars of their involvement.
Issue 1: The Law of Unintended Consequences
This one started (as many do) with a call for help from one of our DB2 customers. All of the rows had been deleted from a critical table in a pre-production environment. The junior DBA had bravely admitted to accidentally connecting to the wrong system to clear down the table, but couldn’t understand why he’d been able to delete the rows as he shouldn’t have had the authority to do so.
In all such circumstances, the priority is always to get the data back and the developers working again. This was quickly accomplished with a standard recovery using a recent backup and no problems were encountered. We then turned our attention to the task of figuring out how the junior DBA had been able to delete the data in the first place.
After much investigation, we discovered the following:
- The junior DBA had connected to the wrong DB2 instance by mistake, using an instance owner userid that was common to all of the development, pre-production and production systems. Not good practice, but at least the passwords were different on each — which still left the question of why he’d been able to connect to the pre-production instance with the password of the development user.
- Some months earlier, a system administrator had been trying to get federation to work and in addition to enabling the FEDERATED database manager parameter, the FED_NOAUTH (bypass federated authentication) parameter had also been set to YES.
- When FED_NOAUTH is set to YES, FEDERATED is set to YES, and authentication is set to SERVER or SERVER_ENCRYPT, authentication at the instance is bypassed, as it is assumed that authentication will happen at the data source. So, it was possible to connect to the database as any user without having to get the password right! Once connected, you still only had access to the tables the user (or its group) was authorized to use. However, this meant that anyone who knew the username of the DB2 instance owner could select, add or delete any data they liked.
Needless to say, the FED_NOAUTH parameter was rapidly set back to NO and normal authentication checking was reinstated.
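For anyone wanting to check their own exposure, here’s a minimal sketch of the check and the fix, assuming a UNIX shell running as the instance owner (an instance restart may be needed for the change to take effect):

    # Check the three parameters involved: FED_NOAUTH = YES together with
    # FEDERATED = YES and SERVER/SERVER_ENCRYPT authentication means passwords
    # are not being verified at the instance
    db2 get dbm cfg | grep -i -E "FEDERATED|FED_NOAUTH|AUTHENTICATION"

    # Reinstate normal authentication checking
    db2 update dbm cfg using FED_NOAUTH NO
    db2stop
    db2start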
Lessons learned:
- Beware of the law of unintended consequences: DB2 can be a complex beast, and there is a danger in fiddling with system settings you don’t fully understand. If in doubt, seek expert help.
- Change the default instance owner, and use different owners for development, pre-production and production environments to avoid accidentally connecting to the wrong system.
Issue 2: DBA Performance is Important Too
Many of us spend a lot of time tuning databases and applications, but how often do we think about our own performance? Time and again, we see examples of DBAs and support staff who take the long way round and use more time than necessary to perform routine tasks. One particular aspect of this is the use of the command line versus a GUI such as the DB2 Control Center.
I know plenty of people who are passionate supporters of the GUI, and indeed it can be a very good way to quickly accomplish one-off activities such as an ad-hoc reorg. However, there are plenty of situations where use of the command line (or better still, DB2 commands embedded in a simple script) will make much more sense. The task is then easily repeatable, actions and their results can be recorded (by redirecting the output to a text file), and a script is a lot easier to hand over to a colleague than pages of instructions and GUI screenshots!
A common example is the routine process of promoting database objects between development, test and production environments. A good script can accomplish that task in a fraction of the time needed to do the same thing from the GUI, and utilities such as db2look can be used to compare database structures to ensure consistency.
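As a flavour of what such a script might look like, here’s a minimal sketch that compares the structure of two environments using db2look and then applies a reviewed change script, recording the output as it goes. The database names and file names are hypothetical:

    #!/bin/sh
    SRC_DB=DEVDB
    TGT_DB=PREPROD

    # Extract the DDL from each environment so the structures can be compared
    db2look -d $SRC_DB -e -o ${SRC_DB}_ddl.sql
    db2look -d $TGT_DB -e -o ${TGT_DB}_ddl.sql
    diff ${SRC_DB}_ddl.sql ${TGT_DB}_ddl.sql > ddl_differences.txt

    # Apply the reviewed promotion script, recording commands and results
    db2 connect to $TGT_DB
    db2 -tvf promote_changes.sql -z promote_changes.log
    db2 terminate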
Lessons learned:
- With staff costs forming an ever-larger proportion of IT budgets, productivity is increasingly important. Use the DB2 command line and scripting to fully or partially automate common activities, reducing human error and driving down the elapsed time needed to complete them.
Issue 3: Beware of the SQL Generator
Many of us are working in environments where the SQL presented to DB2 is not written directly by a human being, but generated automatically by a tool. Common examples include Java frameworks, 3GLs and business intelligence tools such as Cognos.
In this example, the customer reported a long elapsed time for a Cognos-generated query (over 30 minutes) running against a DB2 9.5 database, with the query consistently failing due to the DB2 temp space being exceeded. The users repeatedly increased the temp space and re-tried the query, but with exactly the same result each time. In desperation they even ran the DB2 Index Advisor and created additional indexes, but temp space still blew before the query finished.
A support incident was opened, and analysis using a third-party query monitoring tool showed a very high query execution cost. Further examination of the actual SQL being generated by Cognos showed a Cartesian join of two tables, each with more than 25 million rows, with the new indexes not being used at all. Going back to the Cognos definitions showed that a key database relationship had not been defined to Cognos, so it had to filter vast amounts of data locally.
The solution was simply to re-work the Cognos report definitions to add the missing database relationships. This caused Cognos to generate correct JOIN predicates and the reports ran to successful completion in under 2 minutes. The developers were also educated to look at both the Cognos and generated “native” SQL in the future.
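When you suspect the generated SQL, the standard explain tooling is usually all you need to see what the optimizer is really being asked to do. Here’s a minimal sketch; the database name and the file containing the captured Cognos SQL are hypothetical:

    # One-off setup: create the explain tables for the current user
    db2 connect to BIDB
    db2 -tf ~/sqllib/misc/EXPLAIN.DDL

    # Explain the captured SQL without executing it
    db2 set current explain mode explain
    db2 -tvf cognos_query.sql
    db2 set current explain mode no

    # Format the access plan: a join with no join predicate and huge
    # estimated cardinalities stands out immediately
    db2exfmt -d BIDB -1 -o cognos_query_plan.txt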
Lessons learned:
- Many applications (such as BI, ERP and Java frameworks) are generating their own SQL – it can be difficult to know exactly what’s getting thrown at DB2.
- Correct tooling (and the skills to interpret its output) is essential.
Issue 4: We Don’t Need a DBA!
This particular issue happened in a development environment where DB2 for LUW 9.7 was the back-end database, but it could really have been any other RDBMS. A highly-skilled offshore team was developing a new application, with a local system administration team looking after the OS support.
Triton was originally contacted as a one-off exercise, as the developers were complaining about consistently poor query performance and asking to move to another RDBMS as DB2 was clearly not up to the job. We found a surprisingly common situation: the team of developers, highly skilled in SQL and able to formulate complex queries, were completely focused on the application code and no one was looking after the data. Consequently, there was no backup regime in place, no RUNSTATS or other housekeeping, and the only indexes were primary ones created when the database was first designed. Not surprising then that the query performance had been so poor — none of the tables had ever had stats collected.
Further questioning revealed that the developers had no DB2-specific knowledge whatsoever, and no desire to learn. As far as they were concerned, the database was a black box that they could throw SQL at and get data back.
We quickly implemented a basic set of automated housekeeping procedures that resolved the performance issues and made the developers happy. However, in the process we uncovered a lot of data quality issues (such as developers coding inefficient SQL to get around duplicate data that shouldn’t exist). A more formal design review of the database was conducted, and we were asked to provide an ongoing remote DBA service to avoid the same issues arising in the future.
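The housekeeping itself doesn’t need to be sophisticated. Here’s a sketch of the sort of nightly script involved; the database, schema and backup path are hypothetical, and an online backup assumes archive logging has been enabled:

    #!/bin/sh
    DB=DEVDB
    db2 connect to $DB

    # Refresh optimizer statistics for every application table
    db2 -x "select 'runstats on table ' || rtrim(tabschema) || '.' || rtrim(tabname) || ' with distribution and detailed indexes all;' from syscat.tables where type = 'T' and tabschema = 'APPSCHEMA'" > runstats.sql
    db2 -tvf runstats.sql -z runstats.log

    # Flag tables and indexes that would benefit from a reorg
    db2 reorgchk current statistics on table all > reorgchk.out

    # Take a backup so the database is finally recoverable
    db2 backup database $DB online to /db2/backups compress

    db2 terminate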
Lessons learned:
- Developers can rarely get away with looking at the database as a generic black box — specific knowledge is needed to obtain the best performance.
- Skilled developers are not enough — a DBA is needed for most applications, otherwise performance or data integrity will usually suffer.
Issue 5: When the Senior DBA is Away…
This is a great example of how “finger trouble” can cause red faces and application outage. The Senior DBA for a major public services application took his annual vacation, leaving a recently-trained Junior DBA in charge. Responding to a request to clear down a development database, the Junior DBA fired up his telnet session, dropped and re-created the relevant database and congratulated himself on a job well done … until the phones started ringing with users complaining that they couldn’t use the production application. The DBA had accidentally dropped a critical production database, thinking that he was connected to the UAT system.
To his credit, the DBA was quick to own up to the mistake and Triton was called in to assist with the recovery. This was conducted without further drama, but the best part of a business day had been lost by the time everything was back to normal.
Lessons learned:
- Always make sure you’re connected to the right server. It’s a good idea to colour-code your GUI/Telnet sessions so you have an additional visual cue that you’re working on the right environment (see the sketch after this list).
- Consider restricting day-to-day access to production data to prevent accidents.
- Get proper DBA cover for vacation/sickness/pregnancy etc. — relying on inexperienced staff can result in major incidents.
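One cheap way of providing that extra visual cue at the command line is to make the production environment announce itself. Here’s a sketch of the kind of fragment we add to the instance owner’s profile on production servers, plus the quick sanity checks worth running before anything destructive; the instance name is hypothetical:

    # In the production instance owner's .profile
    echo "**********  PRODUCTION DATABASE SERVER: $(hostname)  **********"
    export PS1='PROD db2inst1@$(hostname):$PWD> '

    # Before doing anything destructive, confirm where you really are
    hostname
    db2 get instance
    db2 list active databases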
Issue 6: Don’t Touch the Logs!
We received a frantic call one evening from a customer asking for immediate help. DB2 had “hung” and no activity could be performed on the production database. As the OLTP system had a 24×7 online SLA, it was not surprising that the senior management was waiting when we arrived at the customer site. Upon investigation, it became apparent that DB2 was looking for an active transaction log file that had gone “missing” and was nowhere to be found.
After investigation, the story started to become clearer. In a desperate attempt to create some space on the production database server, a junior sysadm had stumbled upon the DB2 transaction log directory (which in this case housed both the active and archived logs) and thought, “Hmm, this directory could do with some clean-up.” Using his judgment of the age of the files, he deleted some of them. He gained space but unfortunately lost an active transaction log file in the process.
Despite our advice that a restore from a previous backup was the only solution to their problem, the local Oracle DBA with some DB2 knowledge tried various methods to “trick” DB2, like creating a dummy log file with the same name as the missing one, not knowing that the DB2 transaction log files have header information within them. After a lot of delay, it was finally agreed to carry out a restore from the most recent backup. This proved somewhat of a challenge since no backups were stored on disk. So, the correct tape had to be found and mounted. The restore and subsequent rollforward to a consistent point in time did successfully take place and sighs of relief could be heard in the early morning hours. Even though some hours of business had been lost, jobs had been saved! And yes, even the junior sysadm was allowed to stay on since he owned up to his mistake…
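For the record, the recovery itself amounted to a standard restore and point-in-time rollforward, along the lines of the sketch below; the device, database name and timestamps are hypothetical:

    # Restore the most recent backup image from tape
    db2 restore database PRODDB from /dev/rmt0 taken at 20120131040000 replace existing

    # Roll forward through the surviving archived logs to a point in time
    # just before the deleted log file, then complete the recovery
    db2 rollforward database PRODDB to 2012-01-31-17.30.00.000000 using local time and stop

    # Progress of a long-running restore or rollforward can be checked
    # from another session
    db2 list utilities show detail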
Lessons learned:
- Do not touch the active log directory.
- Configure a log archiving strategy that archives log files to a different location from the active log directory (a sketch follows this list).
- Have a scheduled clean-up procedure for the archive log directory.
- The oldest log file DB2 still needs is reported in the database configuration (the “First active log file” entry). Make absolutely sure you don’t need a log file before deleting it.
- If possible, keep at least one backup image on disk.
- Don’t give just anyone permission to access tablespace data, active transaction logs, etc., on the production database server.
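Here’s a minimal sketch of the configuration and check referred to above; the database name and archive path are hypothetical:

    # Archive completed log files away from the active log directory
    db2 update db cfg for PRODDB using LOGARCHMETH1 DISK:/db2/archive_logs/

    # The oldest log file DB2 still needs is shown in the database
    # configuration: never touch this file or anything newer
    db2 get db cfg for PRODDB | grep -i "First active log file"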
Issue 7: As Bad as it Gets
Here’s a nightmare support scenario: a broken database on an unsupported version of DB2, with no backups or log files to recover the database.
An erroneous script had deleted a few transaction log files that were more than 45 days old. The same script had caused other errors and a database restart was required, but the database would not start: it was looking for an old log file that had just been deleted by the script. As the policy was to retain backups and archived logs for 30 days, the same log file had already been purged from the archive too.
The database was tiny — less than 50GB. Nevertheless, it was a very important one, with a number of web-facing applications relying on it for key features. To make matters worse, the version of DB2 in use had passed its “End of Service” date, so DB2 support was not willing to investigate (though they were happy to guide).
When we got involved a few hours after the incident, panic had set in. Based on the information available (saved snapshots and the db2diag.log file), we were able to conclude that a transaction had started in the log file the database was looking for and had never been completed. The rate of data change was so low that the configured number of log files spanned more than 45 days of activity, which is how a log file that old could still be needed.
The options available were to extract the data from the latest backup image (using tools like High Performance Unload) or extract the data from the damaged database (using db2dart). The latter option was chosen as this would allow us to recover the most recent data.
Without further delay, we ran db2dart on the database to check for any errors and to get the tablespace ID, table ID and total number of pages allocated to each table. We were then able to use that information to run db2dart with the /DDEL option and extract the data in delimited format. db2dart with the /DDEL option is interactive: when the command is run, it prompts for the tablespace ID, table ID and the page range to extract. This meant that the extract could not be scripted but had to be done manually for each of the 300+ tables.
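For the curious, the two db2dart invocations looked roughly like this (the database name is hypothetical, and db2dart should only be run when nothing is connected to the database):

    # Full inspection report: lists the tablespace ID, table ID and number
    # of pages allocated to each table
    db2dart WEBDB

    # Dump table data as delimited ASCII; /DDEL prompts interactively for
    # the table, tablespace ID and page range, once per table
    db2dart WEBDB /DDEL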
Once that mind-numbing task was complete, we created a new database with the DDL that was available (thankfully, they had a db2look output taken from the production database less than a week before the incident). Finally, we loaded the extracted data into the new database and ran RUNSTATS on the tables and indexes.
After a few hiccups and 15 hours of db2dart, import/load, RUNSTATS and data fixes, the database was available to the application. The database was down for more than 20 hours, but it was back in one piece with almost no data loss. Quite an achievement under the circumstances!
Lessons learned:
- See previous issue for advice on log management.
- Be aware of advanced recovery tools such as db2dart — they can be lifesavers in extreme situations.
Issue 8: Handling Diagnostic Data
We often have to ask customers for the DB2 diagnostic log file (db2diag.log), only to be told “it’s too large to send” or “it’s taking too long to open.” This is because the DB2 diagnostic files have always been append-only, their growth restricted only by the space available in the file system they reside in. The only way to curb this growth was to rename the files, which forced DB2 to create new ones; scripts had to be written to automate this process and to delete old files after a certain period, and many customers simply didn’t have these processes in place.
With the advent of DB2 9.7, all this is now history! The new DIAGSIZE database manager parameter allows a DBA to control the combined maximum size of the DB2 diagnostic log and administration notification log files. When this parameter is set to a non-zero value (the default of 0 preserves the old pre-9.7 behaviour), a series of 10 rotating diagnostic log files, plus rotating administration notification log files on UNIX and Linux, is used. DB2 is also smart enough to clean up old log files from the diagnostic directory: when the 10th file is full, the oldest file is deleted and a new file is created.
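Here’s a sketch of both approaches, run as the instance owner; the archive directory is hypothetical, and the DIAGSIZE change takes effect when the instance is restarted:

    # DB2 9.7 onwards: cap the combined size of the rotating diagnostic and
    # notification logs at 1 GB (the value is in MB; the default of 0 keeps
    # the old single ever-growing file)
    db2 update dbm cfg using DIAGSIZE 1024

    # On earlier releases: archive the current db2diag.log to another
    # directory; DB2 starts a fresh file automatically
    db2diag -archive /db2/diag_archive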
Lessons learned:
- New features are delivered in DB2 all the time, via fixpaks and new releases. Make sure you stay on top of them, as they can make a big difference.
Issue 9: HADR Complications
The customer was using HADR to a remote server for resilience, running in ASYNC mode to minimize the performance impact. No cluster software was used for monitoring/failover, but the HADR state was being monitored on a regular basis using a shell script.
On the primary site, a few disks failed, which caused some of the tablespaces to be put into “Rollforward Pending” state. Transactions accessing data in these tablespaces failed, but others were successful. The last run of the HADR state monitoring script indicated Peer state, and therefore it was decided to issue a TAKEOVER command on the DR site to switch roles. When the application started, some transactions failed with the same error as on the primary site — not good!
A LIST TABLESPACES command showed a number of tablespaces in “Rollforward Pending” state. To get out of the pending state, the ROLLFORWARD command was issued with the list of affected tablespaces. The rollforward tried to retrieve a log file that was a few thousand logs older than the current one; not surprisingly, this log was no longer available in the archive. After a few more attempts, the ROLLFORWARD approach was abandoned, the database was restored from the latest backup image and the application was started.
So what happened? We went through the db2diag.log and the notification logs. We could see that physical errors had been reported in some of the tablespaces on the DR site (the HADR standby) around 100 days before the incident. These were recorded in the db2diag.log, and the affected tablespaces were “excluded from the rollforward set.” Based on other entries in the db2diag.log, we were able to confirm that the log file requested for the rollforward on the DR site was the one in use at the time the physical errors occurred there. HADR continued to apply logs for the other tablespaces and reported itself to be in “Peer” state. In reality, some of the tablespaces were being silently ignored.
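Here’s a sketch of the kind of standby health check that would have caught this, run on the DR server alongside the existing state-monitoring script; the database name is hypothetical:

    #!/bin/sh
    DB=PRODDB

    # HADR role, state and log positions as seen from this server
    db2pd -db $DB -hadr

    # Tablespace states, without needing a connection (works on a standby)
    db2pd -db $DB -tablespaces

    # Any Error or Severe records written in the last day, including the
    # physical tablespace errors that exclude tablespaces from the
    # rollforward set
    db2diag -level "Error,Severe" -H 1d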
Lessons learned:
- Regular monitoring of the diagnostic and notification logs is essential to identify and resolve any issues on the DR site before an incident occurs. Failure to do so may mean that your “hot standby” system is useless just when you need it most.
Issue 10: IBM Support Procedures
I’ll round off my top ten with one of the most basic issues we encounter, but unfortunately one of the most common. It’s amazing how many DB2 sites don’t have basic procedures in place for contacting IBM support in the event of a serious DB2 issue.
If you’re having a serious production problem and you need IBM’s assistance, every minute is vital. That is not the time to be scrabbling around to try to dig out customer numbers and URLs for the IBM Support Portal. We have even seen situations where the IBM support site is blocked by the organisation’s web browsing policy!
IBM is generally happy for you to open “test” PMRs to prove that the process works and ensure you can contact them quickly and efficiently if the need arises. Finally, don’t forget that it’s possible for your IBM Business Partner to open and manage your PMRs for you — we do this for many of our customers as a matter of course.
Lessons learned:
- Know your IBM customer number and how to raise and track PMRs using the Support Portal.
- If necessary, open a “test” PMR to prove the process works smoothly — especially after changes to your web infrastructure (browser upgrades, new internet firewall policy, etc.).
- Have a proven process in place for transmitting large files (dumps, diagnostic logs, etc.) to IBM.
I hope you’ll be able to learn something from at least some of these scenarios, and take steps to avoid similar situations happening in your environment.