12-Grid Computing


1. CSS434 Grid Computing
Textbook: No Corresponding Chapters
Professor: Munehiro Fukuda
A portion of these slides was compiled from The Grid: Blueprint for a New Computing Infrastructure.

2. Network Infrastructure
- Users first log in to their own organization's systems, locally or remotely.
- If they are affiliated with other organizations:
  - They can log in from the system they mainly use to some other systems. (This gives them an opportunity to use those resources in parallel.)
- Problems:
  - They must orchestrate job execution among the resources they use.
  - Should those resources be limited to such a small handful of researchers?
[Figure: systems connected by a high-speed information highway]

3. Purposes of Computational Grid
- Use computing resources connected to a high-speed information highway as if we were using an electric power grid.
  - Only 30% utilization in academic/commercial environments.
  - Many applications have only episodic requirements. So why don't we share computational resources?
  - Computational results and data should also be made available to all users.
- Users:
  - Computational scientists and engineers
  - Experimental scientists
  - Associations and corporations
  - Training and education
  - Consumers (E-commerce)

4. Grid Applications
Category                   | Examples                          | Characteristics
Distributed supercomputing | DIS and stellar dynamics          | Very large problems needing lots of computing resources at a time
High throughput            | Chip design and parameter studies | Harnessing many idle resources to increase aggregate throughput
On demand                  | Medical instrumentation           | Allocating special resources dynamically
Data intensive             | Sky survey                        | Using distributed data and needing high-volume data flows
Collaborative              | Collaborative design, education   | Supporting communication or collaborative work

5. Grid Services Architecture (from a www.globus.org slide)
- Applications: high-energy physics data analysis, collaborative engineering, on-line instrumentation, regional climate studies, parameter studies
- Application Toolkit Layer: distributed computing, collab. design, remote control, data-intensive, remote viz
- Grid Services Layer: information, resource mgmt, security, data access, fault detection, ...
- Grid Fabric Layer: transport, multicast, instrumentation, control interfaces, QoS mechanisms, ...

6. Programming Model: Uniform Access
- Paradigm
  - Bag of tasks or master-workers (Condor-MW)
  - Client-server (NetSolve)
  - Object-oriented (Legion)
  - Synchronous applications (not suited for massively parallel computation)
- Language Support
  - MPI-G: message passing (Globus)
  - OpenMP: shared memory
  - Math library: remote procedure (NetSolve)
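The bag-of-tasks (master-workers) paradigm listed above can be illustrated with a minimal local sketch. It uses Python threads in place of grid nodes, and the name `run_bag_of_tasks` is hypothetical; it is not the Condor-MW API, only an illustration of workers pulling independent tasks from a shared bag until it is empty.

```python
import queue
import threading


def run_bag_of_tasks(tasks, work_fn, num_workers=4):
    """Master fills a shared bag; workers pull tasks until the bag is empty.

    Hypothetical helper for illustration only (threads stand in for nodes).
    """
    bag = queue.Queue()
    for task_id, task in enumerate(tasks):
        bag.put((task_id, task))

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task_id, task = bag.get_nowait()
            except queue.Empty:
                return  # nothing left in the bag: this worker is done
            value = work_fn(task)
            with lock:  # protect the shared result table
                results[task_id] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # The master reassembles results in the original task order.
    return [results[i] for i in range(len(tasks))]


if __name__ == "__main__":
    print(run_bag_of_tasks([1, 2, 3, 4, 5], lambda x: x * x, num_workers=3))
```

Because tasks are independent, a fast worker simply takes more tasks than a slow one, which is why this paradigm suits pools of idle, heterogeneous machines.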

7. Resource Management: Discovery, Allocation, and Scheduling
- Centralized resource manager
  Systems: resource descriptions; front-end process; resource manager; job launcher
  - Globus: RSL (resource spec. language); Broker and MDS; GRAM
  - Condor: ClassAd and DAGMan; Schedd Agent and startd; Matchmaker; Sandbox (Starter)
  - Legion: IDL (interface def. language); Scheduler; Collection; Enactor
  - +: easy to manage
  - -: a bottleneck
- Decentralized resource manager
  - A collection of centralized managers (Condor's gate flocking)
  - A combination of meta and local schedulers
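To make the centralized matchmaking idea concrete, here is a minimal sketch loosely inspired by Condor's ClassAd and Matchmaker roles. It does not use the real ClassAd language or any Condor API: the ad fields (`name`, `owner`, `memory_mb`, `requirements`, `rank`) and the function `matchmake` are assumptions chosen for illustration. Each side advertises a requirement on the other, and a job is paired with the highest-ranked machine that satisfies both.

```python
def matchmake(job_ads, machine_ads):
    """Pair each job with the best free machine where both sides agree.

    Ads are plain dicts (hypothetical format): "requirements" is a predicate
    over the other party's ad, and a job's "rank" scores candidate machines.
    """
    matches = {}
    free = list(machine_ads)
    for job in job_ads:
        candidates = [
            m for m in free
            if job["requirements"](m) and m["requirements"](job)
        ]
        if not candidates:
            continue  # job stays unmatched in this round
        best = max(candidates, key=job["rank"])
        matches[job["name"]] = best["name"]
        free = [m for m in free if m is not best]  # machine is now claimed
    return matches


machines = [
    {"name": "m1", "memory_mb": 512,
     "requirements": lambda job: job["owner"] != "guest"},
    {"name": "m2", "memory_mb": 2048,
     "requirements": lambda job: True},
]
jobs = [
    {"name": "j1", "owner": "alice",
     "requirements": lambda m: m["memory_mb"] >= 256,
     "rank": lambda m: m["memory_mb"]},
    {"name": "j2", "owner": "guest",
     "requirements": lambda m: m["memory_mb"] >= 256,
     "rank": lambda m: m["memory_mb"]},
]

if __name__ == "__main__":
    print(matchmake(jobs, machines))
```

In this example `j1` claims `m2`, and `j2` stays unmatched because the only remaining machine refuses guest jobs. A single matchmaker like this is easy to manage but sees every request, which is the bottleneck noted above.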

8. Fault Tolerance
- Checkpointing
  - At the master (Condor)
  - At each node but collected at the master (Catalina)
  - Use a whiteboard (Optimal Grid)
- Re-execution of faulty worker jobs from the beginning (Bayanihan, Optimal Grid)
- Error code (NetSolve)
  - The user is responsible for handling errors.
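The difference between checkpointing and re-execution from the beginning can be shown with a small sketch. This is an application-level illustration under stated assumptions, not how Condor or Catalina implement checkpoints: the function `run_with_checkpoints`, its `fail_at` parameter (used only to simulate a crash), and the pickle file format are all hypothetical.

```python
import os
import pickle


def run_with_checkpoints(n_steps, checkpoint_path, fail_at=None):
    """Sum 0..n_steps-1, saving (next_step, total) after every step.

    If a checkpoint file exists, the computation resumes from it instead of
    starting over. `fail_at` simulates a crash just before the given step.
    """
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            step, total = pickle.load(f)
    else:
        step, total = 0, 0

    while step < n_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated worker crash")
        total += step
        step += 1
        # Write to a temporary file, then rename, so that a crash during
        # the write never leaves a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump((step, total), f)
        os.replace(tmp_path, checkpoint_path)
    return total
```

A first call with `fail_at=5` crashes after completing steps 0 through 4; calling the function again with the same checkpoint path resumes at step 5 rather than step 0. Deleting the checkpoint file before the second call corresponds to the re-execution strategy.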

9. Security
- Resources covered with security layers
  - Legion (Message/MayI layers)
  - Entropia (intercepting all system calls)
- Use of commodity tools
  - SSL
  - Public key
  - Security certificate
  - Java sandbox
  - Kerberos

10. NetSolve http://icl.cs.utk.edu/netsolve/
- RPC-based approach
- Clients
  - Include a set of APIs called as (asynchronous) RPCs
- Agents
  - Match clients' requests for services with servers
- Servers
  - Encapsulate remotely accessed numerical libraries
[Figure: clients, agents, and a network of servers (scalar server, MPP servers); labels: request, choice, reply]
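The client/agent/server roles above can be sketched as a small in-process simulation. This is not the NetSolve API: the classes `Server`, `Agent`, and `Client`, the service name `dot`, and the "fewest requests assigned so far" selection rule are assumptions made for illustration, and a thread pool stands in for the network.

```python
from concurrent.futures import ThreadPoolExecutor


class Server:
    """Wraps a set of numerical routines (service name -> callable)."""

    def __init__(self, name, services):
        self.name = name
        self.services = services
        self.assigned = 0  # number of requests the agent has sent here


class Agent:
    """Matches a requested service with a server that offers it."""

    def __init__(self, servers):
        self.servers = servers

    def choose(self, service):
        candidates = [s for s in self.servers if service in s.services]
        if not candidates:
            raise LookupError(f"no server offers {service!r}")
        # Assumed policy: pick the server with the fewest assigned requests.
        server = min(candidates, key=lambda s: s.assigned)
        server.assigned += 1
        return server


class Client:
    """Issues blocking or asynchronous RPC-style calls through the agent."""

    def __init__(self, agent, max_workers=4):
        self.agent = agent
        self.pool = ThreadPoolExecutor(max_workers=max_workers)

    def call_async(self, service, *args):
        server = self.agent.choose(service)
        # Returns a Future immediately; the caller collects the reply later.
        return self.pool.submit(server.services[service], *args)

    def call(self, service, *args):
        return self.call_async(service, *args).result()


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    agent = Agent([Server("scalar", {"dot": dot}), Server("mpp", {"dot": dot})])
    client = Client(agent)
    future = client.call_async("dot", [1, 2, 3], [4, 5, 6])
    print(future.result())
```

The asynchronous call returns a future so the client can issue several requests before waiting for any reply, which mirrors the "(asynchronous) RPCs" bullet.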

11. Legion http://legion.virginia.edu/
- Legion classes
  - Act as managers and make policy
- Core objects
  - Provide mechanisms that classes use to implement policies: hosts (processors), vaults (memory), context, binding agents, etc.
- Per-program scheduling
  - Participating sites can assure their local policies.
  - Users can choose a scheduling policy.
[Figure: Legion scheduling. Labels: Prog, request, Scheduler, Enactor, search, reserve, Collection (resource database), Class, Host, tty, Resources; converted to Legion object ID by context objects; converted to Legion object address by binding agents]

12. Condor http://www.cs.wisc.edu/condor/
- A: User's local agent
- R: Each computer resource
- M: Central manager
- I/O forwarded to a user's home

13. AgentTeamwork at UWB: Architecture

14. Paper Review by Students
- Globus
- Legion
- Condor
- NetSolve
- Discussions
  - What programming or execution model is each system based on?
  - What resource allocation and scheduling algorithm does each system use?
  - Are they fault-tolerant?
  - Do they provide any special security features of their own?