1.CS 525 Advanced Distributed Systems, Spring 2017. Indranil Gupta (Indy). Lecture 4, 5: Peer to Peer Systems. January 26-31, 2017. All slides © IG
2.Why Study Peer to Peer Systems? First distributed systems that seriously focused on scalability with respect to number of nodes. P2P techniques abound in cloud computing systems: key-value stores (e.g., Cassandra, Riak, Voldemort) use Chord p2p hashing.
3.Why Study Peer to Peer Systems?
4.A Brief History: [6/99] Shawn Fanning (freshman, Northeastern U.) releases Napster online music service. [12/99] RIAA sues Napster, asking $100K per download. [3/00] 25% of UWisc traffic is Napster; many universities ban it; 60M users. [2/01] US Federal Appeals Court: users are violating copyright laws, and Napster is abetting this. [9/01] Napster decides to run a paid service and pay a percentage to songwriters and music companies. [Today] Napster protocol is open; people are free to develop opennap clients and servers (http://opennap.sourceforge.net). Gnutella: http://www.limewire.com (deprecated). Peer to peer working groups: http://p2p.internet2.edu
5.What We Will Study: Widely-deployed P2P systems: Napster, Gnutella, FastTrack (Kazaa, Kazaalite, Grokster), BitTorrent. P2P systems with provable properties: Chord, Pastry, Kelips.
6.Napster Structure: napster.com servers (S) store a directory, i.e., filenames with peer pointers. Client machines (“Peers”, P) store their own files. Example directory entry: filename PennyLane.mp3; info about it: Beatles, @ 126.96.36.199:1006.
7.Napster Operations: Client: connect to a Napster server; upload list of music files that you want to share. Server maintains list of <filename, ip_address, portnum> tuples. Server stores no files.
8.Napster Operations: Client (contd.): Search: send server keywords to search with (server searches its list with the keywords); server returns a list of hosts, as <ip_address, portnum> tuples, to client; client pings each host in the list to find transfer rates; client fetches file from best host. All communication uses TCP (Transmission Control Protocol), a reliable and ordered networking protocol.
9.Napster Search: napster.com servers (S) store peer pointers for all files; client machines (“Peers”, P) store their own files. Steps: 1. Query (client to server). 2. All servers search their lists (ternary tree algorithm). 3. Response. 4. Ping candidates. 5. Download from best host.
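The Napster operations above (upload list, keyword search, pick best host) can be sketched as follows. This is a minimal in-memory illustration, not Napster's actual code: the class `DirectoryServer`, the function `pick_best_host`, and the `measured_rate` callback (standing in for the ping step) are all hypothetical names, and a plain linear scan is swapped in for the ternary tree search mentioned on the slide.

```python
from collections import defaultdict


class DirectoryServer:
    """Hypothetical stand-in for a Napster-style directory server.

    It stores only <filename, ip_address, portnum> tuples, never the files.
    """

    def __init__(self):
        # filename -> set of (ip_address, portnum) peers sharing that file
        self.index = defaultdict(set)

    def upload_list(self, ip_address, portnum, filenames):
        """Record the files a newly connected client wants to share."""
        for name in filenames:
            self.index[name].add((ip_address, portnum))

    def search(self, keywords):
        """Return (filename, ip_address, portnum) for names containing every keyword.

        Linear scan for clarity; the slide's ternary tree is not implemented here.
        """
        words = [w.lower() for w in keywords]
        hits = []
        for name, peers in self.index.items():
            if all(w in name.lower() for w in words):
                hits.extend((name, ip, port) for ip, port in sorted(peers))
        return hits


def pick_best_host(hits, measured_rate):
    """Pick the hit whose host reports the highest transfer rate.

    measured_rate(ip, port) is supplied by the caller and models the ping step.
    """
    return max(hits, key=lambda hit: measured_rate(hit[1], hit[2]))
```

A client would call `upload_list` once after connecting, then `search(["penny"])`, then `pick_best_host` on the returned list before fetching the file directly from that peer.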
10.Joining a P2P system: Can be used for any p2p system. Send an HTTP request to a well-known URL for that P2P service (http://www.myp2pservice.com). The message is routed (after lookup in DNS = Domain Name System) to the introducer, a well-known server that keeps track of some recently joined nodes in the p2p system. The introducer initializes the new peer’s neighbor table.
11.Problems: Centralized server is a source of congestion. Centralized server is a single point of failure. No security: plaintext messages and passwords. napster.com declared to be responsible for users’ copyright violation (“indirect infringement”). Next system: Gnutella.
12.Gnutella: Eliminate the servers. Client machines search and retrieve amongst themselves. Clients act as servers too, called servents. [3/00] Released by AOL, immediately withdrawn, but 88K users by 3/03. Original design underwent several modifications.
13.Gnutella: Servents (“Peers”, P) are connected in an overlay graph (== each link is an implicit Internet path). They store their own files and also store “peer pointers”.
14.How do I search for my Beatles file? Gnutella routes different messages within the overlay graph. Gnutella protocol has 5 main message types: Query (search), QueryHit (response to query), Ping (to probe network for other peers), Pong (reply to ping, contains address of another peer), Push (used to initiate file transfer). We’ll go into the message structure and protocol now. All fields except IP address are in little-endian format: 0x12345678 is stored as 0x78 in the lowest address byte, then 0x56 in the next higher address, and so on.
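One way to picture the introducer is as a small bounded list of recently joined nodes. The sketch below is an assumption-laden illustration: the class name `Introducer`, the `max_tracked` bound, and the policy of handing back every tracked node are hypothetical choices, and the HTTP and DNS steps are left out.

```python
from collections import deque


class Introducer:
    """Hypothetical introducer: remembers only the most recently joined peers."""

    def __init__(self, max_tracked=3):
        # Oldest entries fall off automatically once the bound is reached.
        self.recent = deque(maxlen=max_tracked)

    def join(self, peer_addr):
        """Return an initial neighbor table for peer_addr, then start tracking it."""
        neighbors = [p for p in self.recent if p != peer_addr]
        self.recent.append(peer_addr)
        return neighbors
```

The first peer to join gets an empty table; later peers are seeded with whichever nodes joined most recently.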
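The little-endian rule can be checked with Python's standard `struct` module. The snippet packs the slide's value 0x12345678 both ways; treating the IP address as big-endian (network order) is an assumption here, since the slide only says IP addresses are the exception.

```python
import socket
import struct

value = 0x12345678

# "<" selects little-endian: least significant byte at the lowest address.
little = struct.pack("<I", value)

# ">" selects big-endian, shown only for contrast.
big = struct.pack(">I", value)

# Assumption: IP addresses travel in network (big-endian) order,
# which is what socket.inet_aton produces.
ip_bytes = socket.inet_aton("126.96.36.199")

print(little.hex())    # 78563412
print(big.hex())       # 12345678
print(list(ip_bytes))  # [126, 96, 36, 199]
```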
15.How do I search for my Beatles file? Gnutella Message Header Format (Descriptor Header, followed by the Payload):
    Bytes 0-15:  Descriptor ID. ID of this search transaction.
    Byte 16:     Payload descriptor. Type of payload: 0x00 Ping, 0x01 Pong, 0x40 Push, 0x80 Query, 0x81 QueryHit.
    Byte 17:     TTL. Decremented at each hop; message dropped when ttl=0; ttl_initial usually 7 to 10.
    Byte 18:     Hops. Incremented at each hop.
    Bytes 19-22: Payload length. Number of bytes of message following this header.
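The 23-byte descriptor header can be packed and parsed with `struct`. This is a sketch under stated assumptions: the helper names (`pack_header`, `unpack_header`, `next_hop`, `new_descriptor_id`) are made up for illustration, and generating the Descriptor ID from a random UUID is one convenient choice, not something the slide prescribes.

```python
import struct
import uuid

# Little-endian, no padding: 16-byte ID, three 1-byte fields, 4-byte length = 23 bytes.
HEADER = struct.Struct("<16sBBBI")

PING, PONG, PUSH, QUERY, QUERY_HIT = 0x00, 0x01, 0x40, 0x80, 0x81


def new_descriptor_id():
    """Assumption: any unique 16-byte value will do; a random UUID is used here."""
    return uuid.uuid4().bytes


def pack_header(descriptor_id, payload_descriptor, ttl, hops, payload_length):
    """Serialize the header fields in the order shown on the slide."""
    return HEADER.pack(descriptor_id, payload_descriptor, ttl, hops, payload_length)


def unpack_header(data):
    """Return (descriptor_id, payload_descriptor, ttl, hops, payload_length)."""
    return HEADER.unpack(data[:HEADER.size])


def next_hop(ttl, hops):
    """Apply the per-hop rule: decrement TTL, increment Hops.

    Returns the updated (ttl, hops), or None when the TTL reaches 0
    and the message must be dropped instead of forwarded.
    """
    ttl -= 1
    hops += 1
    if ttl <= 0:
        return None
    return ttl, hops
```

A forwarding peer would call `unpack_header`, run `next_hop` on the TTL and Hops fields, and re-pack the header only if the result is not `None`.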
16.How do I search for my Beatles file? Payload Format in Gnutella Query Message, Query (0x80): bytes 0-1: Minimum Speed; bytes 2 onward: Search criteria (keywords).
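A Query payload can be built in a couple of lines. The slide gives only the two fields and their offsets, so the details below are assumptions: the search criteria are encoded as ASCII and terminated with a single zero byte, and the function names are hypothetical.

```python
import struct


def pack_query(min_speed, keywords):
    """Build a Query payload: 2-byte little-endian minimum speed, then the keywords.

    Assumption: keywords are ASCII and end with one 0x00 terminator byte.
    """
    return struct.pack("<H", min_speed) + keywords.encode("ascii") + b"\x00"


def unpack_query(payload):
    """Return (min_speed, keywords) from a payload built by pack_query."""
    (min_speed,) = struct.unpack_from("<H", payload, 0)
    criteria = payload[2:].split(b"\x00", 1)[0].decode("ascii")
    return min_speed, criteria
```

For example, `pack_query(56, "penny lane")` yields a 13-byte payload whose first two bytes are `38 00`.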
17.Gnutella Search: “Who has PennyLane.mp3?” Queries are flooded out, ttl-restricted, and forwarded only once (example: TTL=2).
18.Gnutella Search: Payload Format in Gnutella QueryHit Message. QueryHit (0x81): successful result to a query.
    Byte 0:                         Num. hits.
    Bytes 1-2:                      port (info about responder).
    Bytes 3-6:                      ip_address (info about responder).
    Bytes 7-10:                     speed (info about responder).
    Bytes 11 up to n:               Results, as (fileindex, filename, fsize) entries.
    From byte n, ending at n+16:    servent_id. Unique identifier of responder; a function of its IP address.
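The fixed part of a QueryHit payload and its trailing servent_id can be separated with `struct`. In this sketch the result set is deliberately left as opaque bytes, because the slide does not give the byte width of each (fileindex, filename, fsize) entry. The function names are hypothetical, and the IP address is assumed to be in network byte order.

```python
import socket
import struct

# Num. hits (1 byte), port (2), ip_address (4, raw), speed (4) = 11 bytes, little-endian.
FIXED = struct.Struct("<BH4sI")
SERVENT_ID_LEN = 16


def pack_query_hit(num_hits, port, ip_address, speed, results, servent_id):
    """Assemble a QueryHit payload; `results` is passed through as raw bytes."""
    fixed = FIXED.pack(num_hits, port, socket.inet_aton(ip_address), speed)
    return fixed + results + servent_id


def unpack_query_hit(payload):
    """Split a QueryHit payload into its fields; the result set stays unparsed."""
    num_hits, port, raw_ip, speed = FIXED.unpack_from(payload, 0)
    return {
        "num_hits": num_hits,
        "port": port,
        "ip_address": socket.inet_ntoa(raw_ip),
        "speed": speed,
        "results": payload[FIXED.size:-SERVENT_ID_LEN],
        "servent_id": payload[-SERVENT_ID_LEN:],
    }
```

The requestor would use `ip_address` and `port` from the parsed payload to contact the responder directly.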
19.Gnutella Search: “Who has PennyLane.mp3?” Successful results: QueryHits are routed on the reverse path.
20.Avoiding excessive traffic: To avoid duplicate transmissions, each peer maintains a list of recently received messages. A Query is forwarded to all neighbors except the peer from which it was received. Each Query (identified by DescriptorID) is forwarded only once. A QueryHit is routed back only to the peer from which the Query with the same DescriptorID was received. Duplicates with the same DescriptorID and Payload descriptor (msg type, e.g., Query) are dropped. A QueryHit with a DescriptorID for which no Query was seen is dropped.
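The flooding rules above can be simulated on a small overlay graph. This is a simplified sketch, not a network implementation: messages are processed from a FIFO queue as if every link had equal delay, the names `Peer`, `connect`, `flood_query`, and `reverse_path` are hypothetical, and the `came_from` table plays the role of the "recently received messages" list.

```python
from collections import deque


class Peer:
    """Hypothetical servent: files it stores, neighbors, and per-query state."""

    def __init__(self, name, files=()):
        self.name = name
        self.files = set(files)
        self.neighbors = []
        # DescriptorID -> neighbor that first delivered the Query (None at the origin)
        self.came_from = {}


def connect(a, b):
    """Add an undirected overlay link between two peers."""
    a.neighbors.append(b)
    b.neighbors.append(a)


def reverse_path(responder, descriptor_id):
    """Names of the peers a QueryHit visits, from responder back to requestor."""
    path = [responder.name]
    hop = responder.came_from[descriptor_id]
    while hop is not None:
        path.append(hop.name)
        hop = hop.came_from[descriptor_id]
    return path


def flood_query(origin, descriptor_id, filename, ttl):
    """Flood a Query from origin; return [(responder, reverse path), ...]."""
    origin.came_from[descriptor_id] = None
    queue = deque((n, origin, ttl) for n in origin.neighbors)
    hits = []
    while queue:
        peer, sender, ttl_left = queue.popleft()
        if descriptor_id in peer.came_from:
            continue  # duplicate DescriptorID: drop, never forward twice
        peer.came_from[descriptor_id] = sender
        if filename in peer.files:
            hits.append((peer.name, reverse_path(peer, descriptor_id)))
        if ttl_left - 1 > 0:  # decrement TTL; stop forwarding when it reaches 0
            queue.extend((n, peer, ttl_left - 1)
                         for n in peer.neighbors if n is not sender)
    return hits
```

On a chain A-B-C-D where only D has the file, a Query from A with TTL=2 stops at C and returns nothing, while TTL=3 reaches D and the QueryHit travels back along D, C, B, A.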
21.After receiving QueryHit messages: Requestor chooses “best” QueryHit responder and initiates an HTTP request directly to responder’s ip+port: GET /get/<File Index>/<File Name>/ HTTP/1.0
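The download request is ordinary HTTP text, so it can be assembled as a string. The sketch below is illustrative: `build_get_request` is a hypothetical helper, the request line puts a space before the HTTP version as HTTP request lines require, and the `Range` header uses standard HTTP byte-range syntax to ask for the file from a given offset.

```python
def build_get_request(file_index, file_name, range_start=0):
    """Return the request a requestor would send to the chosen responder.

    file_index and file_name are the values taken from the QueryHit result.
    range_start lets an interrupted transfer resume from a byte offset.
    """
    return (
        f"GET /get/{file_index}/{file_name}/ HTTP/1.0\r\n"
        f"Range: bytes={range_start}-\r\n"
        "\r\n"
    )
```

Calling `build_get_request(42, "PennyLane.mp3", 1024)` asks for the file starting at byte 1024, which is how a partial transfer could be resumed.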
22.After receiving QueryHit messages (2): HTTP is the file transfer protocol. Why? Because it’s standard, well-debugged, and widely used. Why the “range” field in the GET request? To support partial file transfers. What if responder is behind a firewall that disallows incoming connections?
23.Dealing with Firewalls: Responder has PennyLane.mp3 but is behind a firewall. Requestor sends Push to responder asking for file transfer.
24.Dealing with Firewalls: Push (0x40) payload fields, in order: servent_id and fileindex (same as in received QueryHit); ip_address and port (address at which requestor can accept incoming connections).
25.Dealing with Firewalls: Responder establishes a TCP connection to the ip_address, port specified, then sends: GIV <File Index>:<Servent Identifier>/<File Name>
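The Push exchange can be sketched in two small pieces: the Push payload the requestor sends, and the GIV line the firewalled responder sends after connecting out. Several details are assumptions, since the slides give only field names and order: the field widths (16-byte servent_id, 4-byte fileindex, 4-byte IP, 2-byte port), the hex rendering of the servent identifier in the GIV line, and the blank-line terminator. All function names are hypothetical.

```python
import socket
import struct

# Assumed widths: servent_id (16), fileindex (4), ip_address (4, raw), port (2).
PUSH = struct.Struct("<16sI4sH")


def pack_push(servent_id, file_index, ip_address, port):
    """Build a Push payload naming the file and where the requestor is listening."""
    return PUSH.pack(servent_id, file_index, socket.inet_aton(ip_address), port)


def unpack_push(payload):
    """Return (servent_id, file_index, ip_address, port)."""
    servent_id, file_index, raw_ip, port = PUSH.unpack(payload)
    return servent_id, file_index, socket.inet_ntoa(raw_ip), port


def giv_line(file_index, servent_id, file_name):
    """Line the responder sends once its outgoing TCP connection is established.

    Assumption: the servent identifier is written as hex and the line
    ends with a blank line.
    """
    return f"GIV {file_index}:{servent_id.hex()}/{file_name}\n\n"
```

Because the responder opens the connection, the firewall sees only outgoing traffic; the requestor can then issue its GET over that same connection.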
26.Ping-Pong: Peers initiate Pings periodically. Pings are flooded out like Queries; Pongs are routed along the reverse path like QueryHits. Pong replies are used to update the set of neighboring peers, to keep neighbor lists fresh in spite of peers joining, leaving and failing. Ping (0x00): no payload. Pong (0x01) payload fields, in order: port, ip_address, num. files shared, num. KB shared.
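A Pong payload and the neighbor-list refresh it drives can be sketched as follows. The field widths (2-byte port, 4-byte IP, two 4-byte counters) are assumptions, as the slide lists only the field order. The `refresh_neighbors` policy of simply adding a peer while below a cap is a hypothetical simplification.

```python
import socket
import struct

# Assumed widths: port (2), ip_address (4, raw), num. files (4), num. KB (4) = 14 bytes.
PONG = struct.Struct("<H4sII")


def pack_pong(port, ip_address, num_files, num_kb):
    """Build a Pong payload advertising one peer's address and shared content."""
    return PONG.pack(port, socket.inet_aton(ip_address), num_files, num_kb)


def unpack_pong(payload):
    """Return (ip_address, port, num_files, num_kb)."""
    port, raw_ip, num_files, num_kb = PONG.unpack(payload)
    return socket.inet_ntoa(raw_ip), port, num_files, num_kb


def refresh_neighbors(neighbors, pong_payload, max_neighbors):
    """Add the peer named in a Pong to the neighbor set if there is room."""
    ip, port, _, _ = unpack_pong(pong_payload)
    if (ip, port) in neighbors or len(neighbors) < max_neighbors:
        neighbors.add((ip, port))
    return neighbors
```

Here `max_neighbors` corresponds to the user-specified list size, so peers configured with a larger cap end up with more neighbors.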
27.Gnutella Summary: No servers. Peers/servents maintain “neighbors”; this forms an overlay graph. Peers store their own files. Queries are flooded out, ttl-restricted. QueryHits (replies) are reverse-path routed. Supports file transfer through firewalls. Periodic Ping-Pong continuously refreshes neighbor lists. List size is specified by the user at each peer: heterogeneity means some peers may have more neighbors. Gnutella found to follow a power law distribution: P(#links = L) ~ L^{-k} (k is a constant).
28.Problems: Ping/Pong constituted 50% of traffic. Solution: multiplex, cache and reduce frequency of pings/pongs. Repeated searches with same keywords. Solution: cache Query, QueryHit messages. Modem-connected hosts do not have enough bandwidth for passing Gnutella traffic. Solution: use a central server to act as proxy for such peers. Another solution: FastTrack system (soon).
29.Problems (contd.): Large number of freeloaders: 70% of users in 2000 were freeloaders, who only download files and never upload their own. Flooding causes excessive traffic. Is there some way of maintaining meta-information about peers that leads to more intelligent routing? Structured peer-to-peer systems, e.g., Chord system (coming up next lecture).