White Paper:
Data Warehouse
Overview
Where Is the Value in a Data Warehouse?
What Does a Data Warehouse Look Like?
How Do We Build a Data Warehouse?
Overview
Some databases are optimized to enter the data while other databases are optimized to get the data out again. Databases to enter data are generally the OLTP (On-Line Transaction Processing) type. Databases designed to help users access and analyze data are generally a type of data warehouse.
OLTP databases are usually surrounded by programs that do a large number of small write actions to the database. The database assumes some of these programs may be ill-behaved, and the database contains processes for checking and enforcing data integrity. These databases are usually designed with many small interconnected tables that are "normalized" to remove redundancy. This design supports the needs for very fast writes and database integrity.
A data warehouse has only a few programs that load data. They usually run no more than once per day, and they load thousands of rows at a time. So, unlike an OLTP database, a data warehouse is not faced with load program security or load write speed issues. Accessing data that is scattered in many tables requires high user programming skills and significant computing resources. To avoid this, data warehouse data is "denormalized" into a small number of tables. The purpose of a data warehouse is to make it fast and easy for users to get to the data.
Efficiency
The original data may be located on different types of computers and in different types of databases. In any one database the data might be in many small normalized tables. The data may be in different formats, such as financial data in multiple currencies. Every time a study is done, it is necessary to gather and prepare data from these many sources. If the same preparation work is being done repeatedly, then it makes sense to try to find a way to do most of the preparation phase once and then reuse it.
Time is invested in building and maintaining a data warehouse. Time is saved when reports are run against the warehouse. If the preparatory work is costly and done frequently, then it is easy to find a positive return on investment (ROI ) in a data warehouse.
Quality Assurance
Loading data into the data warehouse is a good opportunity to ensure that the data is complete and accurate.
Consistent Viewpoint
A data warehouse provides one consistent assembled view of the data, rather than a collection of scattered files. There is less chance for confusion and misunderstanding, and different studies will return consistent results.
Accessible
Some of the data may not even be known to exist. And if it has been located, different platforms and databases may make the data almost inaccessible. Only a small number of staff will have the specialized knowledge to get answers. Every request has to be passed through them.
A data warehouse places the data in one place in a form that is accessible to most users. If the data is then also made available in Excel spreadsheets or exposed through a browser, it becomes accessible to almost everyone in the organization.
Data Exploration
Information in the data warehouse is organized to facilitate data exploration. It becomes easy for users to get answers to their questions.
User Perspective
From the user's perspective the data warehouse detail data looks like a table. The columns are the variables that experience has shown are used repeatedly to answer questions about a particular subject area. The rows represent detail activities such as individual sales. This central detail table is called a "Fact Table."
Some of the columns in the Fact Table can be joined with matching columns in "Dimension Tables." The Dimension Tables allow additional information to be joined to the detail data when needed.
The Fact Table and Dimension Tables are referred to as a "Star Schema." (Imagine a star with radiating lines.) This is a very common structure for a subject area in a data warehouse. There may also be other tables in the warehouse such as a monthly summary table.
Designer Perspective
From the designer's perspective, the data warehouse starts with raw feed data that may be on different computer platforms and in several different types of databases. These files must first be accessed, followed by transformation in preparation for loading into the warehouse detail table. Data transformation can involve finding and fixing invalid entries and converting data attributes and values to a standard consistent form.
New data is loaded into the detail tables according to a schedule. Some data is further processed to populate summary tables. Finally, a data exploitation layer is established that can include approaches such as producing regular reports, delivering data subsets in Excel spreadsheets, and exposing data through a browser interface.
The entire process is documented by the metadata.
Metadata
Metadata is data about the data - in other words, it is all of the documentation describing the data and the processes that create it. This can include both technical and business information:
Business metadata:
- the logic used in computing a variable
- business and financial definitions
- notes and pointers to business design documents
- who are the clients
- who are the subject matter experts
- quality assurance tests to be applied
Technical metadata:
- variable name
- variable description
- data warehouse subject area
- data vendor
- data warehouse table type (fact, dimension, summary)
- table/directory location
- variable type (numeric or character) and variable length
- programming logic
- notes and pointers to technical design documents
- name and location of source files
- programs creating each variable
Metadata exists to answer questions such as "What is the business logic used to identify senior shoppers? Where is the table containing monthly sports equipment sales? How do we do QA on refunds at the stores? What program loads the actual inventory into the data warehouse?"
The metadata helps to document the production process. It is accessible to both technical and nontechnical staff for use in answering a wide variety of questions. The metadata is largely completed before any programming is done, and it serves as a design document guiding programming.
We use a Rapid Application Development (RAD) methodology. This is a cyclical process that touches on each subtask during each cycle. RAD is a good approach for a data warehouse since it recognizes that management information projects are dynamic and iterative.
There are at least seven significant subtasks in a data warehouse project:
- Directory Structure: Design the directories that will hold the feed files and fact, dimension, and summary tables. This structure needs to accommodate monthly data loads and processing.
- Fact Table: Identify the variables that will be in the Fact Table.
- Dimension Tables: Identify the Dimension Tables and the variables they will contain.
- Metadata: Document the tables and variables in the data warehouse.
- Quality Assurance: Design the quality assurance programs and write the code.
- Data Loading: Design the programs to load data into the Fact, Dimension, and Summary Tables and write the code.
- Access Methods: Identify and program the reports and other methods to provide results to the warehouse clients.
The table below can be used for time estimates to complete the first cycle through the seven tasks. At the end of this first cycle the data warehouse would exist and be able to provide basic monthly reporting. The table also includes some estimates for what percentage of the "final" warehouse would have been completed by the end of the first RAD cycle. Several more cycles would be required. Some subtasks, such as building the directory structures, would likely reach completion relatively soon, while other subtasks might undergo continuing refinement throughout the project.
First RAD Cycle Estimates for a Data Warehouse
(The Hours column is for estimates to complete the subtasks in the first
RAD cycle. The Percent of Final column is for estimates how much of
the total warehouse project is completed in the first RAD cycle.)
| Subtask |
Hours |
Percent of Final |
| |
|
|
| Directory Structure |
|
|
| Fact Table |
|
|
| Dimension Tables |
|
|
| Metadata |
|
|
| Quality Assurance |
|
|
| Data Loading |
|
|
| Access Methods |
|
|
| |
|
|
| Total |
|
|
Any of the books by Ralph Kimball can serve as a good reference for data warehousing.
A reference for the rapid application development is:
McConnell, Steve, Rapid Development: Taming Wild Software Schedules, Redmond, WA: Microsoft Press, 1996.
A data warehousing reference in the SAS environment is:
Welbrock, P.R., Strategic Data Warehousing Principles Using SAS® Software, Cary, NC: SAS Institute Inc., 1998.
|