Developing A Data Warehouse


A Data Warehouse is designed to help organizations analyze data, for instance, to learn more about sales or how profitable different customer groups are. It is distinct from transactional technology used in order and sales processing.

Because a data warehouse is organized around subject matter, such as sales, it is described as subject oriented.

A data warehouse consists of many pieces of technology, not a single product. It is not possible to buy a data warehouse out of the box; even a packaged offering such as SAP needs customization.

Data warehouses also need to evolve over time, so they are built incrementally.

 

Data Warehouse Architecture

A data warehouse architecture is a description of the elements and services of the warehouse: how the components fit together and how the system will grow over time.

The most successful data warehouses are developed using an explicit architectural plan, to ensure flexibility, ease of management, growth and change. The architecture is also the framework for product selection.

Explicit design documents include plans, models, drawings, and specifications, with separate sections for each key component area and enough detail to allow their implementation by skilled professionals. This is not a requirements document; the requirements document says what the architecture needs to do. Nor is it a project plan.

Currently there is little documentation of what systems are available and data warehouse terminology is extremely loose.

The architecture has to be driven by the business. For instance, if your requirement is to have nightly updates, this has implications for the architecture, and you must understand the technical implications of achieving what you want. Once you understand the business requirements, you can detail the general technical considerations for each:

  • Nightly updates - adequate staging horsepower.
  • Worldwide availability - parallel or distributed servers.
  • Customer-level analysis - [large] server size.
  • New data sources - flexible tools with support for meta data.
  • Reliability - job control features.

 

Key Component Areas

A complete data warehouse architecture includes data and technical elements. This can be broken into three broad areas.

  1. Data architecture - centered on business processes.
  2. Infrastructure - includes hardware, networking, operating systems, and desktop machines.
  3. Technical area - the decision-making technologies needed by the users, as well as their supporting structures.

Data Architecture

The data architecture part of the overall data warehouse architecture is driven by business processes. In a manufacturing environment the data model might include orders, shipping, and billing. Each functional area draws on a different set of dimensions, but where dimensions intersect in the data model the definitions have to be the same, to give a single, consistent view of the customer. Shared data items should have a common structure and content, and a single process to create and maintain them.

Data Representation

How should data be represented in the warehouse: entity/relationship or dimensional? If the design is a star schema, the representation is dimensional.

Is the detail data normalized or dimensional, and will users be querying the detail directly? If so, keep the detail dimensional as well.

Star schema designs produce one large fact table and many smaller dimension tables that describe the different aspects of the facts.

By processing the dimension tables first, fewer of the detailed records in the fact table need to be scanned to complete the query.
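
As an illustration, the sketch below builds a hypothetical star schema in SQLite (a sales fact table surrounded by date, product, and customer dimensions) and runs a query that constrains the dimensions before joining to the fact table. The table and column names are invented for the example, not taken from this article.

    import sqlite3

    # In-memory database with a hypothetical star schema.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    cur.executescript("""
        CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, calendar_year INTEGER, month INTEGER);
        CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
        CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, region TEXT);

        -- The fact table holds one row per sale, keyed by the surrounding dimensions.
        CREATE TABLE fact_sales (
            date_key     INTEGER REFERENCES dim_date(date_key),
            product_key  INTEGER REFERENCES dim_product(product_key),
            customer_key INTEGER REFERENCES dim_customer(customer_key),
            quantity     INTEGER,
            sales_amount REAL
        );
    """)

    # A typical star-join: the small dimension tables are filtered first,
    # so only the matching detail rows in the fact table need to be scanned.
    query = """
        SELECT p.category, c.region, SUM(f.sales_amount) AS total_sales
        FROM fact_sales f
        JOIN dim_date d     ON f.date_key = d.date_key
        JOIN dim_product p  ON f.product_key = p.product_key
        JOIN dim_customer c ON f.customer_key = c.customer_key
        WHERE d.calendar_year = 1998
        GROUP BY p.category, c.region
    """
    for row in cur.execute(query):
        print(row)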

Normalization

Normalization is the process of removing all model structures that provide multiple ways to know the same fact; a method of controlling and eliminating redundancy in data storage.

Most data warehousing experts are in substantial agreement: the data sources are typically entity/relationship models and the front end is a dimensional model. The only issue is where you draw the line between the warehouse itself and the data staging area.
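
A minimal, hypothetical sketch of what normalization removes: in the first structure the customer's region is repeated on every order and can drift out of sync; in the normalized structure it is stored once and looked up by key. The record layouts are invented for illustration.

    # Redundant layout: the same fact (Acme's region) is stored on every order.
    orders_denormalized = [
        {"order_id": 1, "customer": "Acme", "region": "West", "amount": 120.0},
        {"order_id": 2, "customer": "Acme", "region": "West", "amount": 75.0},
    ]

    # Normalized layout: each fact is stored exactly once and referenced by key,
    # so there is a single place to create and maintain it.
    customers = {"C1": {"name": "Acme", "region": "West"}}
    orders = [
        {"order_id": 1, "customer_key": "C1", "amount": 120.0},
        {"order_id": 2, "customer_key": "C1", "amount": 75.0},
    ]

    # Re-deriving the denormalized view when it is needed for querying.
    for o in orders:
        cust = customers[o["customer_key"]]
        print(o["order_id"], cust["name"], cust["region"], o["amount"])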

Data Presentation

Presenting data to users entails a choice of tools, also defined by business requirements. Product capabilities are beginning to converge; for example, the distinction between MOLAP and ROLAP tools is blurring.

MOLAP works well if you stay within the cube built. It is fast and allows for flexible querying—within the confines of the cube. Its weaknesses are size (overall and within a dimension), design constraints (limited by the cube structure), and the need for a proprietary data base.
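A rough sketch of the cube idea, with hypothetical data and dimensions: the example pre-aggregates detail rows along two fixed dimensions (product category and region). Questions that stay within those dimensions are answered instantly from the aggregate; a question about a dimension left out of the cube, such as individual customer, forces a return to the detail data.

    from collections import defaultdict

    # Detail rows (the "fact" data) with hypothetical dimensions.
    detail = [
        {"category": "Widgets", "region": "West", "customer": "Acme", "sales": 120.0},
        {"category": "Widgets", "region": "East", "customer": "Bolt", "sales": 80.0},
        {"category": "Gears",   "region": "West", "customer": "Acme", "sales": 60.0},
    ]

    # Build a small "cube": totals pre-computed for every (category, region) cell.
    cube = defaultdict(float)
    for row in detail:
        cube[(row["category"], row["region"])] += row["sales"]

    # Fast, flexible querying, as long as the question fits the cube's dimensions.
    print(cube[("Widgets", "West")])  # answered straight from the aggregate

    # A per-customer question is outside the cube and must go back to the detail rows.
    print(sum(r["sales"] for r in detail if r["customer"] == "Acme"))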

Infrastructure Architecture

Hardware - the range of hardware platforms and servers required for a data warehouse can be vast. Most machines are used as database and application servers.

The issues with hardware and DBMS choices are size, scalability, and flexibility. In about 80 percent of data warehousing projects, obtaining enough power to handle the DW needs is not an issue.

Network - check the data sources, the warehouse staging area, and everything in between to ensure there's enough bandwidth to move data around.

Desktop - run the tools and actually get some data through them to determine if there's enough power for retrieval. Desktops must be powerful enough to run current-generation access tools.

Also implement a software distribution mechanism so that desktop tools can be kept up to date.

Technical Architecture

The technical architecture is driven by the meta data catalog. Everything should be meta data-driven.

The services should draw the needed parameters from tables, rather than hard-coding them.
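
A minimal sketch of that idea, with invented parameter names: instead of hard-coding the source file, target table, and batch window into the load service, the service looks them up in a parameter table that administrators maintain.

    # Hypothetical parameter table; in practice this would live in the meta data
    # catalog (a database table), not in the code itself.
    JOB_PARAMETERS = {
        "load_sales": {
            "source_file":  "/staging/sales_extract.csv",   # hypothetical path
            "target_table": "fact_sales",
            "batch_window": "01:00-05:00",
            "on_error":     "notify_dba",
        },
    }

    def run_load(job_name: str) -> None:
        """Generic load service: all job-specific values come from the table."""
        params = JOB_PARAMETERS[job_name]
        print(f"Loading {params['source_file']} into {params['target_table']} "
              f"during window {params['batch_window']}")

    run_load("load_sales")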

An important component of technical architecture is the data staging process, which covers five major areas:

  • Extract - data comes from multiple sources and is of multiple types. Data compression and encryption handling must be considered at this stage, if they apply.
  • Transform - data transformation includes surrogate key management, integration, de-normalization, cleansing, conversion, aggregation, and auditing.
  • Load - loading is often done to multiple targets, with load optimization and support for the entire load cycle.
  • Security - administrator access and data encryption policies.
  • Job control - this includes job definition, job scheduling (time and event), monitoring, logging, exception handling, error handling, and notification.
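
The sketch below strings those areas together as a small, hypothetical staging pipeline: extract from a source file, transform (including surrogate key assignment and cleansing), and load to a target, with simple job control around the run. It is only an outline of the flow, not a real staging tool.

    import csv
    import logging
    from datetime import datetime

    logging.basicConfig(level=logging.INFO)

    def extract(path):
        """Extract: read raw records from one of possibly many sources."""
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows, key_map, next_key):
        """Transform: cleanse values and replace source keys with surrogate keys."""
        out = []
        for row in rows:
            natural_key = row["customer_id"].strip()
            if natural_key not in key_map:          # surrogate key management
                key_map[natural_key] = next_key
                next_key += 1
            out.append({"customer_key": key_map[natural_key],
                        "amount": float(row["amount"])})
        return out, next_key

    def load(rows, target):
        """Load: write to the target (here just appended to a list)."""
        target.extend(rows)

    def run_job(source_path):
        """Job control: logging, error handling, and notification around the run."""
        started = datetime.now()
        try:
            warehouse, key_map = [], {}
            rows = extract(source_path)
            clean, _ = transform(rows, key_map, next_key=1)
            load(clean, warehouse)
            logging.info("Loaded %d rows in %s", len(clean), datetime.now() - started)
        except Exception:
            logging.exception("Load failed; notifying operations")
            raise

    # run_job("/staging/sales_extract.csv")  # assumes such a source file exists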

Staging Box

The staging box needs to be able to extract data from multiple sources, like MVS, Oracle, VM, and others, so be specific when you choose your products.

It must handle data compression and encryption, transformation, loading (possibly to multiple targets), and security (at the front end this is challenging, Thornthwaite says). In addition, the staging activities need to be automated. Many vendors' offerings do different things, so he advises that most organizations will need to use multiple products.

Data Monitoring - a system for monitoring data warehouse use is valuable for capturing queries and tracking usage, and it is also helpful for performance tuning.

Performance optimization - includes cost estimation through a “governor” tool, and should include ad hoc query scheduling.

Query management services - provided by middleware.

Tools for all of these and other related tasks are available for the front end, for server-based query management, and for data from multiple sources.

Tools are also available for reporting, connectivity, and infrastructure management.

Data access - should include reporting services (such as publish and subscribe), a report library, a scheduler, and a distribution manager.

 

Meta Data

The creation and management of data has the following “steps” in the data warehouse process:

  1. warehouse model
  2. source definitions
  3. table definitions
  4. source-to-target maps
  5. map and transformation information
  6. physical information (table spaces, etc.)
  7. extracted data
  8. transformed data
  9. load statistics
  10. business descriptions
  11. query requests
  12. the data itself
  13. query statistics

To show how important meta data is, of the steps listed above only three involve “real” data—7, 8, and 12. “Everything else is meta data,” says Thornthwaite, “and the whole data warehouse process relies on it.” The major technical elements of a meta data catalog include:

  • Business rules - includes definitions, derivations, related items, validation, and hierarchy information (versions, dates, etc.).
  • Movement/transformation information - source/destination information, as well as DDL (data types, names, etc.).
  • Operations information - data load job schedules, dependencies, notification, and reliability information (such as host redirects and load balancing).
  • Tool-specific information - graphic display information and special function support.
  • Security rules - authentication and authorization.
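
To make the catalog's contents concrete, the sketch below models a few of those elements as simple Python structures. The field names and the sample entry are illustrative assumptions, not a standard catalog schema.

    from dataclasses import dataclass, field

    @dataclass
    class BusinessRule:
        """Business meta data: definition, derivation, validation information."""
        item: str
        definition: str
        derivation: str = ""
        valid_values: list = field(default_factory=list)

    @dataclass
    class MovementRule:
        """Movement/transformation meta data: source, destination, DDL details."""
        source: str
        destination: str
        column_types: dict = field(default_factory=dict)

    @dataclass
    class OperationsInfo:
        """Operations meta data: schedules, dependencies, notification targets."""
        job_name: str
        schedule: str
        depends_on: list = field(default_factory=list)
        notify: str = ""

    # A tiny catalog entry tying the three together for one warehouse item.
    catalog = {
        "net_revenue": (
            BusinessRule("net_revenue", "Gross sales minus returns and discounts",
                         derivation="gross_sales - returns - discounts"),
            MovementRule("orders.gross_sales", "fact_sales.net_revenue",
                         column_types={"net_revenue": "DECIMAL(12,2)"}),
            OperationsInfo("load_sales", "nightly 02:00",
                           depends_on=["extract_orders"], notify="dw-ops@example.com"),
        )
    }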

 

Developing an Architecture

To develop the technical architecture model:

  1. Draft the architecture requirements document first.
  2. Map each business requirement to architecture implications.
  3. Group these implications according to architecture areas (remote access, staging, data access tools, etc.).
  4. Identify how each area fits in with the others.
  5. Capture the definition of each area and its contents.
  6. Then refine and document the model.
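
As a small, hypothetical illustration of steps 2 and 3, the mapping below records each business requirement's architecture implications and then regroups them by architecture area; the requirements are the ones used as examples earlier in this article, and the area names are assumptions.

    # Step 2: each business requirement mapped to its architecture implications.
    requirement_implications = {
        "nightly updates":         ["adequate staging horsepower"],
        "worldwide availability":  ["parallel or distributed servers"],
        "customer-level analysis": ["large server size"],
        "new data sources":        ["flexible tools with meta data support"],
        "reliability":             ["job control features"],
    }

    # Step 3: regroup the implications by architecture area.
    area_of = {
        "adequate staging horsepower":           "staging",
        "job control features":                  "staging",
        "parallel or distributed servers":       "infrastructure",
        "large server size":                     "infrastructure",
        "flexible tools with meta data support": "data access tools",
    }

    by_area = {}
    for requirement, implications in requirement_implications.items():
        for implication in implications:
            by_area.setdefault(area_of[implication], []).append((requirement, implication))

    for area, items in by_area.items():
        print(area, items)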

Developing a data warehouse architecture is difficult, and a “just do it” approach is dangerous. Using the full Zachman framework may be overkill for most organisations' data warehousing needs, so a reasonable compromise is a four-layer process that fits most purposes:

  1. business requirements
  2. technical architecture
  3. standards
  4. products

Business Requirements

Business requirements essentially drive the architecture, so talk to business managers, analysts, and power users. Identify major business issues, as well as indicators of business strategy, direction, frustrations, business processes, timing, availability, and performance expectations. Document everything well.

Technical Architecture

From an IT perspective, talk to existing data warehouse/DSS support staff, OLTP application groups, and DBAs; as well as networking, OS, and desktop support staff. Also speak with architecture and planning professionals. Here you want to get their opinions on data warehousing considerations from the IT viewpoint. Learn if there are existing architecture documents, IT principles, standards statements, organizational power centers, etc.

Standards

Few standards exist for data warehousing, but there are standards for most of the components:

  • Middleware - ODBC, OLE, OLE DB, DCE, ORBs, and JDBC.
  • Data base connectivity - ODBC, JDBC, OLE DB, and others.
  • Data management - ANSI SQL and FTP.
  • Network access - DCE, DNS, and LDAP.
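
As a small illustration of one of these standards in use, the sketch below opens an ODBC connection with the third-party pyodbc module; the DSN name, credentials, and table are assumptions made for the example.

    import pyodbc  # third-party ODBC bridge; requires an installed ODBC driver and DSN

    # "WarehouseDSN" is a hypothetical data source name configured in the ODBC manager.
    conn = pyodbc.connect("DSN=WarehouseDSN;UID=report_user;PWD=secret")
    cur = conn.cursor()

    # Standard SQL over a standard connectivity layer keeps the access tool
    # independent of the particular warehouse DBMS behind the DSN.
    cur.execute("SELECT product_key, SUM(sales_amount) FROM fact_sales GROUP BY product_key")
    for product_key, total in cur.fetchall():
        print(product_key, total)

    conn.close()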

Products

Regardless of the standards they support, major data warehousing tools are meta data-driven. However, they don't often share meta data with each other and vary in terms of openness, so selection of tools must be done carefully, using the architecture as the guide.

 

Development Phasing

To define the level of detail required for a data warehouse architecture, ask: "Is this enough information to allow a competent team to build a warehouse that meets the needs of the business?"

The architecture effort grows with the number of people involved in its development and with the functional complexity required of the resulting system.

Data warehousing is best approached as an iterative process. It is too big to do all at once, in spite of any impatience on the part of the company. The data warehouse market is still emerging, so begin with high-leverage, high-value parts of the process, then use success to make the case for additional phases.

 

Conclusions

To sum up, the benefits of having a data warehouse architecture are as follows:

  • Provides an organizing framework - the architecture draws the lines on the map in terms of what the individual components are, how they fit together, who owns what parts, and priorities.
  • Improved flexibility and maintenance - allows you to quickly add new data sources, interface standards allow plug and play, and the model and meta data allow impact analysis and single-point changes.
  • Faster development and reuse - warehouse developers can understand the data warehouse process, database contents, and business rules more quickly.
  • Management and communications tool - define and communicate direction and scope to set expectations, identify roles and responsibilities, and communicate requirements to vendors.
  • Coordinate parallel efforts - multiple, relatively independent efforts have a chance to converge successfully. Also, data marts without architecture become the stovepipes of tomorrow.

Companies must align the architecture with business requirements but remain practical, and they need to keep up with advances in the data warehouse industry. Planned and explicit architectures have a better chance of succeeding.

 
