Developing A Data Warehouse
A Data Warehouse is designed to help organizations analyze data,
for instance, to learn more about sales or how profitable different
customer groups are. It is distinct from transactional technology
used in order and sales processing.
A data warehouse defined by subject matter, such as sales, makes
the data warehouse subject oriented.
A data warehouse consists of a lot of pieces of technology; not
one single technology. It is not possible to buy a data warehouse
out of the box, even with SAP, it needs customization.
Data warehouses also need to evolve over time, so they are incrementally
built.
Data Warehouse Architecture
A data warehouse architecture is a description of the elements
and services of the warehouse. It includes details of the elements
and services of the warehouse, how components fit together and how
the system will grow over time.
The most successful data warehouses are developed using an explicit
architectural plan, to ensure flexibility, ease of management, growth
and change. A data warehouse is the framework for product selection.
Explicity design documents include: plans, models, drawings, and
specifications, with separate sections for each key component area
and enough detail to allow their implementation by skilled professionals.
This is not a requirements document. The requirements document says
what the architecture needs to do. It also isn't a project plan.
Currently there is little documentation of what systems are available
and data warehouse terminology is extremely loose.
The architecture has to be driven by the business. For instance,
if your requirement is to have nightly updates, this has implications
for the architecture, and you must understand the technical requirements
to achieve what you want to have. Once you understand the business
requirements, you can detail the general technical considerations
for each:
- Nightly updates - adequate staging horsepower.
- Worldwide availability - parallel or distributed servers.
- Customer-level analysis - [large] server size.
- New data sources - flexible tools with support for meta data.
- Reliability - job control features.
Key Component Areas
A complete data warehouse architecture includes data and technical
elements. This can be broken into three broad areas.
- Data architecture - centered on business processes.
- Infrastructure - includes hardware, networking,
operating systems, and desktop machines.
- Technical area - the decision-making technologies
needed by the users, as well as their supporting structures.
Data Architecture
The data architecture part of the overall data warehouse architecture
is driven by business processes. In a manufacturing environment
the data model might include orders, shipping, and billing. Each
functional area draws on a different set of dimensions. But where
dimensions intersect in the data model the definitions have to be
the same— to give a single total view of the customer. data
items should have a common structure and content, and involve a
single process to create and maintain.
Data Representation
How data is represented in the warehouse—entity/relationship
or dimensional?
If you have a star schema then use dimensional.
Is your detail normalized or dimensional? Will users be querying
detail? Then use dimensional.
Star schema designs produce a large fact table and many, smaller
dimension tables which extend the different aspects of the facts.
By processing the dimension tables first, fewer of the detailed
records in the fact table need to be scanned to complete the query.
Normalization
Normalization is the process of removing all model structures that
provide multiple ways to know the same fact; a method of controlling
and eliminating redundancy in data storage.
Most data warehousing experts are in substantial agreement; the
[data] sources are typically entity/relationship models and the
front end is a dimensional model. The only issue is where you draw
the line between the warehouse itself and the data staging area.
Data Presentation
Presenting data to users entails a choice of tools, also defined
by business requirements. Product capabilities are beginning to
merge, like MOLAP and ROLAP.
MOLAP works well if you stay within the cube built. It is fast
and allows for flexible querying—within the confines of the
cube. Its weaknesses are size (overall and within a dimension),
design constraints (limited by the cube structure), and the need
for a proprietary data base.
Infrastructure Architecture
Hardware - the hardware platform and boxes required
for a data warehouse can be vast. Most boxes are used for databases
and application servers.
The issues with hardware and DBMS choices are size, scalability,
and flexibility. In about 80 percent of data warehousing projects,
obtaining enough power to handle the DW needs is not an issue.
Network - check the data sources, the warehouse
staging area, and everything in between to ensure there's enough
bandwidth to move data around.
Desktop - run the tools and actually get some
data through them to determine if there's enough power for retrieval.
Desktops must be powerful enough to run current-generation access
tools.
Implement a software distribution mechanism.
Technical Architecture
The technical architecture is driven by the meta data catalog.
Everything should be meta data-driven.
The services should draw the needed parameters from tables, rather
than hard-coding them.
An important component of technical architecture is the data staging
process, which covers five major areas:
- Extract - data comes from multiple sources and is of
multiple types. Data compression and encryption handling must
be considered at this area, if it applies.
- Transform - data transformation includes surrogate key
management, integration, de-normalization, cleansing, conversion,
aggregation, and auditing.
- Load - loading is often done to multiple targets, with
load optimization and support for the entire load cycle.
- Security - administrator access and data encryption policies.
- Job control - this includes job definition, job scheduling
(time and event), monitoring, logging, exception handling, error
handling, and notification.
Staging Box
The staging box needs to be able to extract data from multiple
sources, like MVS, Oracle, VM, and others, so be specific when you
choose your products.
It must handle data compression and encryption, transformation,
loading (possibly to multiple targets), and security (at the front
end this is challenging, Thornthwaite says). In addition, the staging
activities need to be automated. Many vendors' offerings do different
things, so he advises that most organizations will need to use multiple
products.
Data Monitoring - A system for monitoring data
warehouse use is valuable for capturing queries and tracking usage,
and performance tuning is also helpful.
Performance optimization - includes cost estimation
through a governor tool, and should include ad hoc query
scheduling.
Query management services - provided by middleware
.
Tools for all of these and other related tasks are available for
the front end, for server-based query management, and for data from
multiple sources.
Tools are also available for reporting, connectivity, and infrastructure
management.
Data access - should include reporting services
(such as publish and subscribe), a report library, a scheduler,
and a distribution manager.
Meta Data
The creation and management of data has the following steps
in the data warehouse process:
- warehouse model
- source definitions
- table definitions
- source-to-target maps
- map and transformation information
- physical information (table spaces, etc.)
- extracted data
- transformed data
- load statistics
- business descriptions
- query requests
- the data itself
- query statistics
To show how important meta data is, of the steps listed above only
three involve real data—7, 8, and 12. Everything
else is meta data, says Thornthwaite, and the whole
data warehouse process relies on it. The major technical elements
of a meta data catalog include:
- Business rules - includes definitions, derivations, related
items, validation, and hierarchy information (versions, dates,
etc.).
- Movement/transformation information - source/destination
information, as well as DDL (data types, names, etc.).
- Operations information - data load job schedules, dependencies,
notification, and reliability information (such as host redirects
and load balancing).
- Tool-specific information - graphic display information
and special function support.
- Security rules - authentication and authorization.
Developing an Architecture
To develop the technical architecture model:
- Draft the architecture requirements document first.
- Map each business requirement to architecture implications.
- Group these implications according to architecture areas (remote
access, staging, data access tools, etc.)
- Identify how it fits in with the other areas.
- Capture the definition of the area and its contents.
- Then refine and document the model.
Developing a data warehouse architecture is difficult, ajust
do it approach is a dangerous approach. Using the Zachman
framework may be overkill for most organisations need for data warehousing,
so a reasonable compromise consisting of a four-layer process fits
most purposes:
- business requirements
- technical architecture
- standards
- products
Business Requirements
Business requirements essentially drive the architecture, so talk
to business managers, analysts, and power users. Identify major
business issues, as well as indicators of business strategy, direction,
frustrations, business processes, timing, availability, and performance
expectations. Document everything well.
Technical Architecture
From an IT perspective, talk to existing data warehouse/DSS support
staff, OLTP application groups, and DBAs; as well as networking,
OS, and desktop support staff. Also speak with architecture and
planning professionals. Here you want to get their opinions on data
warehousing considerations from the IT viewpoint. Learn if there
are existing architecture documents, IT principles, standards statements,
organizational power centers, etc.
Standards
Few standards exist for data warehousing, but there are standards
for most of the components:
- Middleware - ODBC, OLE, OLE DB, DCE, ORBs, and JDBC.
- Data base connectivity - ODBC, JDBC, OLE DB, and others.
- Data management - ANSI SQL and FTP.
- Network access - DCE, DNS, and LDAP.
Products
Regardless of the standards they support, major data warehousing
tools are meta data-driven. However, they don't often share meta
data with each other and vary in terms of openness, so selection
of tools must be done carefully, using the architecture as the guide.
Development Phasing
To define the level of detail required for a data warehouse architecture
ask "Is this enough information to allow a competent team to
build a warehouse that meets the needs of the business?"
The architecture effort will grow exponentially as more people
are added for its development, and more complex the resulting system
needs to be [functional complexity].
Data warehousing is best approached as an iterative process. It
is too big to do all at once, in spite of any impatience on behalf
of the company. The data warehouse market is still emerging, so
begin with high leverage, high-value parts of the process. Then,
use success to make a case for additional phases.
Conclusions
To sum up, the benefits of having a data warehouse architecture
are as follows:
- Provides an organizing framework - the architecture draws
the lines on the map in terms of what the individual components
are, how they fit together, who owns what parts, and priorities.
- Improved flexibility and maintenance - allows you to
quickly add new data sources, interface standards allow plug and
play, and the model and meta data allow impact analysis and single-point
changes.
- Faster development and reuse - warehouse developers are
better able to understand the data warehouse process, data base
contents, and business rules more quickly.
- Management and communications tool - define and communicate
direction and scope to set expectations, identify roles and responsibilities,
and communicate requirements to vendors.
- Coordinate parallel efforts - multiple, relatively independent
efforts have a chance to converge successfully. Also, data marts
without architecture become the stovepipes of tomorrow.
Companies must align with business requirements but be practical.
Need to keep up with advances in the data warehouse industry. Planned
and explicit aarchitectures have a better chance of succeeding.
Back To Top
Business Intelligence | Augmented
BI | BI Governance | BI
Architecture | BI Interfaces | BI
Tools | BI 2.0 | BI
1/2 Comparison | Using BI |
BI Program Initiation | BI
Implementation | BI Products
| Intelligent Networks | Intelligent
Information Networks
|