To solve this problem, someone brilliantly comes up with the idea of creating a "reporting" database to support the reporting needs. A simple solution in theory, but not nearly as simple in practice.
I've now been involved in these types of tasks numerous times and am always suprised at how non-trivial it is to setup correctly. My most recent reporting project involved the following working constraints:
· No performance impact to current production OLTP transactions
· Maximum of 15 minute lag time between reporting system and transactional system
· Dramatic improvement of report performance
· Cannot use SQL Server Reporting Services (don't ask why)
· OLTP system must be purged on a regular basis
· Reporting database must not be purged
These seemingly simple requirements added more complexity to the project than one might think. We mulled over numerous ideas and varying solutions to satisfy the business needs. Our final solution involved the use of a staging database setup on the reporting server (a seperate box) that is populated via SQL Server mirroring. ETL Scripts are then written in SSIS that utilize a snapshot of the mirror database to denormalize the data and populate two reporting databases (details below). The solution turns out to be elegant, to perform well and has basically zero negative impact on the current OLTP system.
In order to migrate these various data changes to the reporting system, our initial thought was to use SSIS to pump data directly out of the OLTP system and right into the reporting database (sounds simple enough). However, our primary concern was the health and performance of the transactional database. Remember, a critical requirement was that the performance of the OLTP system could not be negatively impacted. If we were to write SSIS packages that read directly from the OLTP database, we would be taxing the OLTP system unnecessarily and would compete for resources on the OLTP box. Not only would we have to worry about record locking issues on the data itself, we would also have to worry about competing for IO resources, RAM, CPU, etc. These concerns prompted us to create a staging database on a seperate physical box so that we could then use the staged data as the source for our ETL extracts. This solution effectively alleviated any contention with the OLTP environment.
Once we were able to get a staging database setup on the "Reporting Server", the next question then, was how to get the staged data copied over on a regular basis. We considered various options including restored backups, log shipping, replication and finally SQL Server 2005 mirroring. Each of the alternatives posed their own specific issues and the debates here lengthy and heated. We ultimately decided on SQL Server Mirroring due to its low impact on the OLTP system, its ease of setup and its ease of maintenance.
Mirrored databases - Unavailable
To create a snapshot of an existing database, you just use the CREATE DATABASE T-SQL statement as follows:
CREATE DATABASE MySnapshot ON
Every time we run the ETL Scripts, we just re-create the snapshot. This is a very fast operation even though our database is a fairly sizable 50GB.
In order to relieve this pressure, we decided to create 2 seperate reporting databases on the reporting server; "ReportingDB1" and "ReportingDB2". By having seperate databases, we would alleviate blocking issues between our ETL jobs and people running reports. Although we didn't have money in the budget for seperate physical servers, we could seperate these 2 databases on seperate Luns on the SAN to at least relieve IO contention.
The architecture includes logic in the ETL loaders to keep track of the 2 database statuses. One database is always "live" and the other is always "loading". The 2 databases are updated in round robin fashion every 15 minutes. The reporting UI also monitors these status values to determine which of the two databases to point to when a user requests a report.
In the future, if budgets allow, we can easily move one or both of the databases to it's own box.
Our Resulting architecture looks as follows:
Zoom in Open in new window
Issues to consider
Record Deltas
One of the challenges was in determining what data had to be migrated from the OLTP system and into the Reporting system with each ETL iteration. This turned out to be fairly simple. We just had to add auditing columns (CreateDate, and LastEditDate) to every table in the OLTP system. These columns were not in place prior to this endeavor so we had concerns that adding columns could break middle tier code. Luckily all of our code was protected due to the use of Stored Procedures and the impact here was minimal. Once we put the auditing columns in place, our ETL logic then would look at the values in these columns and compare them to the last known run date of the ETL jobs and would act accordingly.
ETL Load
If you've never used DTS or SSIS before please beware. There is a large learning curve to it. using these tools is kind of like boxing. Anyone can throw a pair of gloves on and jump into the ring with little or no training. Howver, it takes a long time to get good at it and getting your butt kicked while learning really sucks. Let's just say, we got punched in the gut on more then one occasion even though I have had years of experience with DTS (We used SSIS). Some of the issues we ran into were fixed with SP2 of SQL Server 2005.
IO IO IO
The completion of data intensive ETL jobs is greatly the function of IO resources. Certainly our ETL jobs are CPU intensive as well. But, the bottleneck for us is always IO. I cannot stress enough that you really need to plan accordingly for the IO load when architecting this type of system. Currently our architecture is holding its own in production, but the IO subsystem (an EMC Symmetrix SAN) is really kicking up a sweat. Do your research on SQL Server file layout best practices and work with your IO system vendor for best practices and optimization.
Summary
Creating a reporting system to alleviate load in your OLTP environment is one of the most common requests in the life of a DBA or Data Architect. There's no single solution as to how to do this. You cant just click an "Easy Button" in SQL Server and have your reporting environment all setup for you. My hope in writing this article is to describe one solution in order to get you heading in the direction of implementing your own custom solution. I look forward to the numerous "why didn't you use replication" questions and other questions\recommendations. I look forward to the feedback on how to improve what we have....maybe version 2 of our reporting system will be even better.
By Gregory Jackson, 2008/01/21
Link for the ERL framwork article