1.Chapter 25, 6e - 24, 5e Distributed Databases Prof. Steven A. Demurjian , Sr. Computer Science & Engineering Department The University of Connecticut 191 Auditorium Road, Box U-155 Storrs, CT 06269-3155 firstname.lastname@example.org http://www.engr.uconn.edu/~steve (860) 486 - 4818 A portion of these slides are being used with the permission of Dr. Ling Lui , Associate Professor, College of Computing, Georgia Tech. Remaining slides represent new material.
2.Classical and Distributed Architectures Classic/Centralized DBMS Dominated the Commercial Market from 1970s Forward Problems of this Approach Difficult to Scale w.r.t. Performance Gains If DB Overloaded, replace with a Faster Computer this can Only Go So Far - Disk Bottlenecks Distributed DBMS have Evolved to Address a Number of Issues Improved Performance Putting Data “Near” Location where it is Needed Replication of Data for Fault Tolerance Vertical and Horizontal Partitioning of DB Tuples
3.Common Features of Centralized DBMS Data Independence High-Level Representation via Conceptual and External Schemas Physical Representation (Internal Schema) Hidden Program Independence Multiple Applications can Share Data Views/External Schema Support this Capability Reduction of Program/Data Redundancy Single, Unique, Conceptual Schema Shared Database Almost No Data Redundancy Controlled Data Access Reduces Inconsistencies Programs Execute with Consistent Results
4.Common Features of Centralized DBMS Promote Sharing: Automatically Provided via CC No Longer Programmatic Issue Most DBMS Offer Locking for Key Shared Data Oracle Allows Locks on Data Item (Attributes) For Example, Controlling Access to Shared Identifier Coherent and Central DB Administration Semantic DB Integrity via the Automatic Enforcement of Data Consistency via Integrity Constraints/Rules Data Resiliency Physical Integrity of Data in the Presence of Faults and Errors Supported by DB Recovery Data Security: Control Access for Authorized Users Against Sensitive Data
5.Shared Nothing Architecture In this Architecture, Each DBMS Operates Autonomously There is No Sharing Three Separate DBMSs on Three Different Computers Applications/Clients Must Know About the External Schemas of all Three DBMSs for Database Retrieval Client Processing Complicates Client Different DBMS Platforms (Oracle, Sybase, Informix, ..) Different Access Modes (Query, Embedded, ODBC) Difficult for SWE to Code
6.Difficulty in Access – Manage Multiple APIs Each Platform has a Different API API 1 , API 3 , …. , API n An App Programmer Must Utilize All three APIs which could differ by PL – C++, C, Java, REST, etc. Any interactions Across 3 DBs – must be programmatically handled without DB Capabilities API 1 API 2 API n
7.NW Architecture with Centralized DB High-Speed NWs/WANs Spawned Centralized DB Accessible Worldwide Clients at Any Site can Access Repository Data May be “Far” Away - Increased Access Time In Practice, Each Remote Site Needs only Portion of the Data in DB1 and/or DB2 Inefficient, no Replication w.r.t. Failure
8.Fully Distributed Architecture The Five Sites (Chicago, SF, LA, NY, Atlanta) each have a “Portion” of the Database - its Distributed Replication is Possible for Fault Tolerance Queries at one Site May Need to Access Data at Another Site (e.g., for a Join) Increased Transaction Processing Complexity
9.Distributed Database Concepts A transaction can be executed by multiple networked computers in a unified manner. A distributed database (DDB) processes a Unit of execution (a transaction) in a distributed manner. A distributed database (DDB) can be defined as Collection of multiple logically related database distributed over a computer network Distributed database management system as a software system that manages a distributed database while making the distribution transparent to the user.
10.Goals of DDBMS Support User Distribution Across Multiple Sites Remote Access by Users Regardless of Location Distribution and Replication of Database Content Provide Location Transparency Users Manipulate their Own Data Non-Local Sites “Appear” Local to Any User Provide Transaction Control Akin to Centralized Case Transaction Control Hides Distribution CC and Serializability - Must be Extended Minimize Communications Cost Optimize Use of Network - a Critical Issue Distribute DB Design Supported by Partitioning (Fragmentation) and Replication
11.Goals of DDBMS Improve Response Time for DB Access Use a More Sophisticated Load Control for Transaction Processing However, Synchronization Across Sites May Introduce Additional Overhead System Availability Site Independence in the Presence of Site Failure Subset of Database is Always Available Replication can Keep All Data Available, Even When Multiple Sites Fail Modularity Incremental Growth with the Addition of Sites Dedicate Sites to Specific Tasks
12.Advantages of DDBMS There are Four Major Advantages Transparency Distribution/NW Transparency User Doesn’t Know about NW Configuration (Location Transparency) User can Find Object at any Site (Naming Transparency) Replication Transparency (see next PPT) User Doesn’t Know Location of Data Replicas are Transparently Accessible Fragmentation Transparency Horizontal Fragmentation (Distribute by Row) Vertical Fragmentation (Distribute by Column)
13.Data Distribution and Replication
14.Other Advantages of DDBMS Increased Reliability and Availability Reliability - System Always Running Availability - Data Always Present Achieved via Replication and Distribution Ability to Make Single Query for Entire DDBMS Improved Performance Sites Able to Utilize Data that is Local for Majority of Queries Easier Expansion Improve Performance of Site by Upgrading Processor of Computer Adding Additional Disks Splitting a Site into Two or More Sites Expansion over Time as Business Grows
15.Challenges of DDBMS Tracking Data - Meta Data More Complex Must Track Distribution (where is the Data) V & H Fragmentation (How is Data Split) Replication (Multiple Copies for Consistency) Distributed Query Processing Optimization, Accessibility, etc., More Complex Block Analysis of Data Size Must also Now Consider the NW Transmitting Time Distributed Transaction Processing TP Potentially Spans Multiple Sites Submit Query to Multiple Sites Collect and Collate Results Distributed Concurrency Control Across Nodes
16.Challenges of DDBMS Replicated Data Management TP Must Choose the Replica to Access Updates Must Modify All Replica Copies Distributed Database Recovery Recovery of Individual Sites Recovery Across DDBMS Security Local and Remote Authorization During TP, be Able to Verify Remote Privileges Distributed Directory Management Meta-Data on Database - Local and Remote Must maintain Replicas of this - Every Site Tracks the Meta-Data for All Sites
17.A Complete Schema with Keys ... Keys Allow us to Establish Links Between Relations
18. …and Corresponding DB Tables which Represent Tuples/Instances of Each Relation 1 4 5 5 A S C null W B null null
19. …with Remaining DB Tables
20.What is Fragmentation? Fragmentation Divides a DB Across Multiple Sites Two Types of Fragmentation Horizontal Fragmentation Given a Relation R with n Total Tuples, Spread Entire Tuples Across Multiple Sites Each Site has a Subset of the n Tuples Essentially Fragmentation is a Selection Vertical Fragmentation Given a Relation R with m Attributes and n Total Tuples, Spread the Columns Across Multiple Sites Essentially Fragmentation is a Projection Not Generally Utilized in Practice In Both Cases, Sites can Overlap for Replication
21.Horizontal Fragmentation A horizontal subset of a relation which contain those of tuples which satisfy selection conditions. Consider Employee relation with condition DNO = 5 All tuples satisfying this create a subset which will be a horizontal fragment of Employee relation. A selection condition may be composed of several conditions connected by AND or OR . Derived horizontal fragmentation: Partitioning of a primary relation to other secondary relations which are related with Foreign keys.
22.Horizontal Fragmentation Site 2 Tracks All Information Related to Dept. 5
23.Horizontal Fragmentation Site 3 Tracks All Information Related to Dept. 4 Note that an Employee Could be Listed in Both Cases, if s/he Works on a Project for Both Departments
24.Refined Horizontal Fragmentation Further Fragment from Site 2 based on Dept. that Employee Works in Notice that G1 + G2 + G3 is the Same as WORKS_ON5 there is no Overlap
25.Refined Horizontal Fragmentation Further Fragment from Site 3 based on Dept. that Employee Works in Notice that G4 + G5 + G6 is the Same as WORKS_ON4 Note Some Fragments can be Empty
26.Vertical Fragmentation Subset of a relation created via a subset of columns. A vertical fragment of a relation will contain values of selected columns. There is no selection condition used in vertical fragmentation. A strict vertical slice/partition Consider the Employee relation. A vertical fragment of can be created by keeping the values of Name, Bdate , Sex, and Address. Since no condition for creating a vertical fragment Each fragment must include the primary key attribute of the parent relation Employee. All vertical fragments of a relation are connected.
27.Vertical Fragmentation Example Partition the Employee Table as Below Notice Each Vertical Fragment Needs Key Column EmpDemo EmpSupvrDept
28.Homogeneous DDBMS Homogeneous Identical Software (w.r.t. Database) One DB Product (e.g., Oracle, DB2, Sybase) is Distributed and Available at All Sites Uniformity w.r.t. Administration, Maintenance, Client Access, Users, Security, etc. Interaction by Programmatic Clients is Consistent (e.g., JDBC or ODBC or REST API …)
29.Non-Federated Heterogeneous DDBMS Non-Federated Heterogeneous Different Software (w.r.t. Database) Multiple DB Products (e.g., Oracle at One Site, MySQL at another, Sybase, Informix, etc.) Replicated Administration (e.g., Users Needs Accounts on Multiple Systems) Varied Programmatic Access - SWEs Must Know All Platforms/Client Software Complicated Very Close to Shared Nothing Architecture