1.An Introduction to Data Warehousing
2.2 Data, Data everywhere yet ... I can’t find the data I need data is scattered over the network many versions, subtle differences I can’t get the data I need need an expert to get the data I can’t understand the data I found available data poorly documented I can’t use the data I found results are unexpected data needs to be transformed from one form to other
3.So What Is a Data Warehouse? Definition: A single, complete and consistent store of data obtained from a variety of different sources made available to end users in a what they can understand and use in a business context. [Barry Devlin] By comparison: an OLTP (on-line transaction processor) or operational system is used to deal with the everyday running of one aspect of an enterprise. OLTP systems are usually designed independently of each other and it is difficult for them to share information.
4.Why Do We Need Data Warehouses? Consolidation of information resources Improved query performance Separate research and decision support functions from the operational systems Foundation for data mining, data visualization, advanced reporting and OLAP tools
5.5 Which are our lowest/highest margin customers ? Who are my customers and what products are they buying? Which customers are most likely to go to the competition ? What impact will new products/services have on revenue and margins? What product prom- -otions have the biggest impact on revenue? What is the most effective distribution channel? Why Data Warehousing?
6.What Is a Data Warehouse Used for? Knowledge discovery Making consolidated reports Finding relationships and correlations Data mining Examples Banks identifying credit risks Insurance companies searching for fraud Medical research
7.Goals Structure Size Performance optimization Technologies used How Do Data Warehouses Differ From Operational Systems?
8.Comparison Chart of Database Types Data warehouse Operational system Subject oriented Transaction oriented Large (hundreds of GB up to several TB) Small (MB up to several GB) Historic data Current data De-normalized table structure (few tables, many columns per table) Normalized table structure (many tables, few columns per table) Batch updates Continuous updates Usually very complex queries Simple to complex queries
9.Design Differences Star Schema Data Warehouse Operational System ER Diagram
10.Supporting a Complete Solution Operational System- Data Entry Data Warehouse- Data Retrieval
11.Data Warehouses, Data Marts, and Operational Data Stores Data Warehouse – The queryable source of data in the enterprise. It is comprised of the union of all of its constituent data marts. Data Mart – A logical subset of the complete data warehouse. Often viewed as a restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group. Operational Data Store (ODS) – A point of integration for operational systems that developed independent of each other. Since an ODS supports day to day operations, it needs to be continually updated.
12.12 Decision Support Used to manage and control business Data is historical or point-in-time Optimized for inquiry rather than update Use of the system is loosely defined and can be ad-hoc Used by managers and end-users to understand the business and make judgements
13.13 What are the users saying... Data should be integrated across the enterprise Summary data had a real value to the organization Historical data held the key to understanding data over time What-if capabilities are required
14.14 Data Warehousing -- It is a process Technique for assembling and managing data from various sources for the purpose of answering business questions. Thus making decisions that were not previous possible A decision support database maintained separately from the organization’s operational database
15.15 Data Warehouse Architecture Relational Databases Legacy Data Purchased Data Data Warehouse Engine Optimized Loader Extraction Cleansing Analyze Query Metadata Repository
16.16 From the Data Warehouse to Data Marts Departmentally Structured Individually Structured Data Warehouse Organizationally Structured Less More History Normalized Detailed Data Information
17.17 Users have different views of Data Organizationally structured OLAP Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data Farmers: Harvest information from known access paths Tourists: Browse information harvested by farmers
18.17 Users have different views of Data Organizationally structured OLAP Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data Farmers: Harvest information from known access paths Tourists: Browse information harvested by farmers
19.19 Old Retail Paradigm Wal *Mart Inventory Management Merchandise Accounts Payable Purchasing Supplier Promotions: National, Region, Store Level Suppliers Accept Orders Promote Products Provide special Incentives Monitor and Track The Incentives Bill and Collect Receivables Estimate Retailer Demands
20.20 New (Just-In-Time) Retail Paradigm No more deals Shelf-Pass Through (POS Application) One Unit Price Suppliers paid once a week on ACTUAL items sold Wal*Mart Manager Daily Inventory Restock Suppliers (sometimes SameDay) ship to Wal*Mart Warehouse-Pass Through Stock some Large Items Delivery may come from supplier Distribution Center Supplier’s merchandise unloaded directly onto Wal*Mart Trucks
21.21 Information as a Strategic Weapon Daily Summary of all Sales Information Regional Analysis of all Stores in a logical area Specific Product Sales Specific Supplies Sales Trend Analysis, etc. Wal*Mart uses information when negotiating with Suppliers Advertisers etc.
22.22 Schema Design Database organization must look like business must be recognizable by business user approachable by business user Must be simple Schema Types Star Schema Fact Constellation Schema Snowflake schema
23.23 Star Schema A single fact table and for each dimension one dimension table Does not capture hierarchies directly T i m e p r o d c u s t c i t y f a c t date, custno, prodno, cityname, sales
24.24 Dimension Tables Dimension tables Define business in terms already familiar to users Wide rows with lots of descriptive text Small tables (about a million rows) Joined to fact table by a foreign key heavily indexed typical dimensions time periods, geographic region (markets, cities), products, customers, salesperson, etc.
25.25 Fact Table Central table Typical example: individual sales records mostly raw numeric items narrow rows, a few columns at most large number of rows (millions to a billion) Access via dimensions
26.26 Snowflake schema Represent dimensional hierarchy directly by normalizing tables. Easy to maintain and saves storage T i m e p r o d c u s t c i t y f a c t date, custno, prodno, cityname, ... r e g i o n
27.27 Fact Constellation Fact Constellation Multiple fact tables that share many dimension tables Booking and Checkout may share many dimension tables in the hotel industry Hotels Travel Agents Promotion Room Type Customer Booking Checkout
28.28 Data Granularity in Warehouse Summarized data stored reduce storage costs reduce cpu usage increases performance since smaller number of records to be processed design around traditional high level reporting needs tradeoff with volume of data to be stored and detailed usage of data
29.29 Granularity in Warehouse Solution is to have dual level of granularity Store summary data on disks 95% of DSS processing done against this data Store detail on tapes 5% of DSS processing against this data