1.Scaling MySQL at Venmo Percona Live 2019, Austin TX Dong Wang (PayPal Inc), Van Pham (Venmo), Heidi Wang (PayPal Inc) 1 © 2019 PayPal Inc. Confidential and proprietary.
2.Agenda Introduction Venmo History Application Ecosystem and MySQL Architecture Scalability Challenges Short Term Tactical Improvements Long Term Strategic Improvements Wrap up: Q & A © 2019 PayPal Inc. Confidential and proprietary.
3.The History of Venmo © 2019 PayPal Inc. Confidential and proprietary.
4.How Venmo Started • Venmo was founded by Andrew Kortina and Iqram Magdon-Ismail, as freshman roommates at the University The of Pennsylvania History of Venmo • The original prototype sent money through text messages, and eventually transitioned to a smartphone app • In 2012, the company was acquired by Braintree for $26.2 million • In December 2013, PayPal acquired Braintree and by default Venmo © 2019 PayPal Inc. Confidential and proprietary
5.How Venmo Works As of Q1 2019 • Venmo The has History of 40 Venmo million users Just Venmo Me • Account for 14.5% of PayPal's total user base • Venmo posted $21 billion in volume • Growth of 73% in volume annually Pay with Venmo © 2019 PayPal Inc. Confidential and proprietary.
6.Venmo Growing Pains Challenges Business Requirement • Growth in partnership with vendors • Add new features and initiatives • Exponential growth in the user base • Increase fraud detections and security • Exponential growth in payment volume • Scale the payment volume • Exponential growth in data volume • Increase user satisfaction • Keep what makes Venmo unique © 2019 PayPal Inc. Confidential and proprietary.
7.Venmo Payment Volume 2 Year Trend © 2019 PayPal Inc. Confidential and proprietary.
8.Application Ecosystem and Database Architecture ©2018 PayPal Inc. Confidential and proprietary. 8
9. Venmo Application Ecosystem Full Stack in AWS Mobile Web iOS Android Amazon CDN Web Admin (CloudFront) (Shabu) (Scope/VU) Routing Amazon Route 53 Nginx Envoy Orch Developer API REST Task Workers (risk/fraud/comp, etc) Misc (cron, etc) DB celery brokers Locks Core Analytics Auth OFAC Social Graph Feed Pub/Friend/User Feeds Login Events Queries © 2019 PayPal Inc. Confidential and proprietary.
10.Venmo Application Framework MVT Framework for both web and web services Model ORM Generated DB SQL View Template © 2019 PayPal Inc. Confidential and proprietary.
11.Database Architecture Amazon Aurora © 2019 PayPal Inc. Confidential and proprietary.
12.Amazon Aurora - Pros and Cons Pros Cons • Managed services • Less visibility to system and storage • Low latency read replicas layer • Stable and better performance • Limited vertical and horizontal scalability • Easy provisioning and scaling • Maximum cluster volume of 64 TB • Faster backup and cloning • Writer restart causes all readers to • Point-in-time recovery reboot • Custom end points • Monitoring tools © 2019 PayPal Inc. Confidential and proprietary.
13.Scalability Challenges at Venmo ©2018 PayPal Inc. Confidential and proprietary. 13
14.2018 Infrastructure Challenges Area Symptom Impact Scalability • Limited horizontal scalability with • Bad user experience more read-only nodes for MongoDB and • Higher call volume to MySQL customer support • Uneven CPU/connection distribution • Low user ratings in the • Read traffic not using read replicas app stores effectively • Can’t handle increase in payment volume Platform • Old version of MySQL, MongoDB and • Slow performance Cassandra • Inconsistent data DR Readiness (in • Limited distribution of DB nodes in US- • Degraded performance progress) East AZs in a regional failure • Lack of DB regional parity in US-West scenario © 2019 PayPal Inc. Confidential and proprietary.
15.2018 Architecture and Performance Challenges Area Symptom Impact Query • Bad performance from queries generated by • High latency during peak Performance ORM time • Top 10 slow queries > 75% of slow query • High CPU usage time Changes in • Too many indexes not used by the application • Slow queries Access Pattern • Can’t add covering index to large tables • Slower updates & payment per second Data Model • No data retention policy • Payment failure or low • Heavily skewed data distribution in MySQL payment per second • Unoptimized keyspace in Cassandra and • A high rate of time out MongoDB • Maintenance challenge Transaction • Multiple DBs involved • Payment failure Model • High number of deadlocks • Reduce payment • High number of blocking reads (Select for per second Update) © 2019 PayPal Inc. Confidential and proprietary.
16.2018 Operation Challenges Area Symptom Impact Monitoring • Many monitoring tools including legacy and • Low confidence in new Grafana, New Relic, DataDog, Sumo metrics validity Logic, PMM, MongoDB Cloud Manager • Only monitor basic metrics for all • Lack of notification for datastores critical metrics • Harder to troubleshoot problems Release Process • No dedicated release engineering org • Frequent incidents • Limited QA review of releases • Bad user experience • Lack of sufficient DBA review • Higher call volume to • Lack of sufficient testing before production release customer support • Low user ratings in the app stores © 2019 PayPal Inc. Confidential and proprietary.
17.Short Term Tactical Improvements ©2018 PayPal Inc. Confidential and proprietary. 17
18.Short Term Tactical Improvements Principles • Aim for peak traffic during Super Bowl • Target availability improvement • Minimize code change 2.5x pps • No big surgery on data models • Align with strategic moves • Provide foundational benefits for both short/long term © 2019 PayPal Inc. Confidential and proprietary.
19.Infrastructure Scaling • MySQL upgrade and row-based replication • Vertical scale of writer node • Vertical and horizontal scale of reader nodes • Read/write traffic separation • Domain isolation © 2019 PayPal Inc. Confidential and proprietary.
20.Infrastructure Scaling Improved DML latency Reduced blocked transaction ©2018 PayPal Inc. Confidential and proprietary. 20
21.Infrastructure Scaling Improvement in CPU usage Improvement in RAM available ©2018 PayPal Inc. Confidential and proprietary. 21
22.Application Optimization/Query Tuning Task Outcome Django Code Optimization On Single Table • Avoid querying millions of rows and then Select throwing away • 25% CPU reduction across the board Tuning of 7 Table Joining Queries • Workaround the plan instability by avoiding order by PK column • Query execution time reduced from >60 seconds to millisecond Django Code Optimization to Reduce DB Round • Get the one-row result set directly without Trips counting the number of rows first • Queries to DB reduced by 50% for a GET call Fixing Slow Queries of Critical Jobs Due to Plan • Root cause analysis using advanced techniques Instability • Avoid cascading slow queries • No more critical job failure © 2019 PayPal Inc. Confidential and proprietary.
23.Optimization on Single Table Select Avoid unnecessary query and then throw away fetched data 1233 def _function(cls, users): 1234 """ 1235 :param uses: A list of `User` objects. 1236 :return: A tuple of `User` objects which meets the condition 1237 """ 1238 1239 if not users or len(users) == 0: 1240 return [ ] Time CPU % Payments/Sec (PPS) CPU% for 100 PPS 12/26/2018 19:00 UTC 32.9 45 32.9/45*100 = 73.1 12/28/2018 19:00 UTC 30.1 55 30.1/55*100 = 54.7 Net Reduction (73.1 – 54.7) / 73.1 = 25% ©2018 PayPal Inc. Confidential and proprietary. 23
24.Workaround Plan Instability of a Slow 7-Table Joining Query Avoid Order By PK Column ©2018 PayPal Inc. Confidential and proprietary. 24
25.Root Cause of Ever Growing Cost Estimate on Using 2ndary Indexes Avoid blocking of index page merges ©2018 PayPal Inc. Confidential and proprietary. 25
26.Operational Enhancements Task Outcome Elasticity of Dev API Pool • Proactive expansion of application server pools during peak Improve Capacity Planning • Simulate peak handling in load test environment • Realistic projections of system capacity for peak traffic Improved Monitoring • Enable Performance Insight and performance schema. Single glass view of all databases Knowing All Levers and Knobs for Peak Handling • Optimize cron job timing • Turn feature off/on PayPal Risk Calls Capacity Preparation • Vastly improved the risk integration with PayPal and reduced losses Playbooks, Event Coordination/Communication • Planned execution of preemptive steps • Anticipation of issue handling with playbooks • War room, slack channels, dedicated bridge, point of contacts • Creation of the Performance & Scalability Team © 2019 PayPal Inc. Confidential and proprietary.
27.Release Process Enhancements Task Outcome Improve release process • Less rollback of releases Improve incident management • Reduction in the number of incidents Improve root cause analysis process • Analysis result in actionable ticket Improve QA & load test • Drastic reduction in Serv1 & Serv2 Mandatory DBA review of new data model and • Major improvement in query performance queries • Drastic reduction in the number of slow queries © 2019 PayPal Inc. Confidential and proprietary.
28.Core MySQL DB Performance © 2019 PayPal Inc. Confidential and proprietary.
29.Long Term Strategic Improvements ©2018 PayPal Inc. Confidential and proprietary. 29