Running MongoDB in Production part 2



1.Running MongoDB in Production, Part II Tim Vaillancourt Sr Technical Operations Architect, Percona Speaker Name

2.`whoami` { name: “tim”, lastname: “vaillancourt”, employer: “percona”, techs: [ “mongodb”, “mysql”, “cassandra”, “redis”, “rabbitmq”, “solr”, “mesos” “kafka”, “couch*”, “python”, “golang” ] }

3.Agenda ● Architecture and High-Availability ● Hardware ● Tuning MongoDB ● Tuning Linux

4.Architecture and High-Availability

5.High Availability ● Replication ○ Asynchronous ■ Write Concerns can provide psuedo-synchronous replication ■ Changelog based, using the “Oplog” ○ Maximum 50 members ○ Maximum 7 voting members ■ Use “vote:0” for members $gt 7 ○ Oplog ■ The “” capped-collection in “local” storing changes to data ■ Read by secondary members for replication ■ Written to by local node after “apply” of operation

6.Architecture ● Datacenter Recommendations ○ Minimum of 3 x physical servers required for High-Availability ○ Ensure only 1 x member per Replica Set is on a single physical server!!! ● EC2 / Cloud Recommendations ○ Place Replica Set members in 3 Availability Zones, same region ○ Use a hidden-secondary node for Backup and Disaster Recovery in another region ○ Entire Availability Zones have been lost before!


8.Hardware: Mainframe vs Commodity ● Databases: The Past ○ Buy some really amazing, expensive hardware ○ Buy some crazy expensive license ■ Don’t run a lot of servers due to above ○ Scale up: ■ Buy even more amazing hardware for monolithic host ■ Hardware came on a truck ○ HA: When it rains, it pours

9.Hardware: Mainframe vs Commodity ● Databases: A New Era ○ Everything fails, nothing is precious ○ Elastic infrastructures (“The cloud”, Mesos, etc) ○ Scale up: add more cheap, commodity servers ○ HA: lots of cheap, commodity servers - still up!

10.Hardware: Block Devices ● Isolation ○ Run Mongod dbPaths on separate volume ○ Optionally, run Mongod journal on separate volume ● RAID Level ○ RAID 10 == performance/durability sweet spot ○ RAID 0 == fast and dangerous ● SSDs ○ Benefit MMAPv1 a lot ○ Benefit WT and RocksDB a bit less ○ Keep about 20-30% free space for internal GC

11.Hardware: Block Devices ● EBS / NFS / iSCSI ○ Risks / Drawbacks ■ Exponentially more things to break (more on this) ■ Block device requests wrapped in TCP is extremely slow ■ You probably already paid for some fast local disks ■ More difficult (sometimes nearly-impossible) to troubleshoot ■ MongoDB doesn’t really benefit from remote storage features/flexibility ● Built-in High-Availability of data via replication ● MongoDB replication can bootstrap new members ● Strong write concerns can be specified for critical data

12.Hardware: Block Devices ● EBS / NFS / iSCSI ○ Things to break or troubleshoot… ■ Application needs a block from disk ■ System call to kernel for block ■ Kernel frames block request in TCP ● No logic to align block sizes ■ TCP connection/handshake (if not pooled) ■ TCP packet moves across wire, routers, switches ● Ethernet is exponentially slower than SATA/SAS/SCSI ■ Storage server parses TCP to block device ■ Storage server system calls to kernel for block ■ Storage server storage driver calls RAID/storage controller ■ Block is returned (finally!)

13.Hardware: CPUs ● Cores vs Core Speed ○ Lots of cores > faster cores (4 CPU minimum recommended) ○ Thread-per-connection Model ● CPU Frequency Scaling ○ ‘cpufreq’: a daemon for dynamic scaling of the CPU frequency ○ Terrible idea for databases or any predictability! ○ Disable or set governor to 100% frequency always, i.e mode: ‘performance’ ○ Disable any BIOS-level performance/efficiency tuneable ○ Set ENERGY_PERF_BIAS to ‘performance’ on CentOS/Red Hat

14.Hardware: Network Infrastructure ● Datacenter Tiers ○ Network Edge ○ Public Server VLAN ■ Servers with Public NAT and/or port forwards from Network Edge ■ Examples: Proxies, Static Content, etc ■ Calls backends in Backend VLAN ○ Backend Server VLAN ■ Servers with port forwarding from Public Server VLAN (w/Source IP ACLs) ■ Optional load balancer for stateless backends ■ Examples: Webserver, Application Server/Worker, etc ■ Calls data stores in Data VLAN

15.Hardware: Network Infrastructure ● Datacenter Tiers ○ Data VLAN ■ Servers, filers, etc with port forwarding from Backend Server VLAN (w/Source IP ACLs) ■ Examples: Databases, Queues, Filers, Caches, HDFS, etc

16.Hardware: Network Infrastructure ● Network Fabric ○ Try to use 10GBe for low latency ○ Use Jumbo Frames for efficiency ○ Try to keep all MongoDB nodes on the same segment ■ Goal: few or no network hops between nodes ■ Check with ‘traceroute’ ● Outbound / Public Access ○ Databases don’t need to talk to the internet* ■ Store a copy of your Yum, DockerHub, etc repos locally ■ Deny any access to Public internet or have no route to it ■ Hackers will try to upload a dump of your data out of the network!!

17.Hardware: Why So Quick? ● MongoDB allows you to scale reads and writes with more nodes ○ Single-instance performance is important, but deal-breaking ● You are the most expensive resource! ○ Not hardware anymore

18.Tuning MongoDB

19.Tuning MongoDB: MMAPv1 ● A kernel-level function to map file blocks to memory ● MMAPv1 syncs data to disk once per 60 seconds (default) ○ If a server with no journal crashes it can lose 1 min of data!!! ● In memory buffering of Journal ○ Synced every 30ms ‘journal’ is on a different disk ○ Or every 100ms ○ Or 1/3rd of above if change uses j:true WC

20.Tuning MongoDB: MMAPv1 ● Fragmentation ○ Can cause serious slowdowns on scans, range queries, etc ○ WiredTiger and RocksDB have little-no fragmentation due to checkpoints / compaction

21.Tuning MongoDB: WiredTiger ● WT syncs data to disk in a process called “Checkpointing”: ○ Every 60 seconds or >= 2GB data changes ● In-memory buffering of Journal ○ Journal buffer size 128kb ○ Synced every 50 ms (as of 3.2) ○ Or every change with Journaled write concern

22.Tuning MongoDB: RocksDB ● Deprecated in PSDMB 3.6+ ● Level-based strategy using immutable data level files ○ Built-in Compression ○ Block and Filesystem caches ● RocksDB uses “compaction” to apply changes to data files ○ Tiered level compaction ○ Follows same logic as MMAPv1 for journal buffering

23.Tuning MongoDB: Storage Engine Caches ● WiredTiger ○ In heap ■ 50% available system memory ■ Uncompressed WT pages ○ Filesystem Cache ■ 50% available system memory ■ Compressed pages ● RocksDB ○ Internal testing planned from Percona in the future ○ 30% in-heap cache recommended by Facebook / Parse Platform

24.Tuning MongoDB: Durability ● storage.journal.enabled = <true/false> ○ Default since 2.0 on 64-bit builds ○ Always enable unless data is transient ○ Always enable on cluster config servers ● storage.journal.commitIntervalMs = <ms> ○ Max time between journal syncs ● storage.syncPeriodSecs = <secs> ○ Max time between data file flushes

25.Tuning MongoDB: Don’t Enable! ● “cpu” ○ External monitoring is recommended ● “rest” ○ Will be deprecated in 3.6+ ● “smallfiles” ○ In most situations this is not necessary unless ■ You use MMAPv1, and ■ It is a Development / Test environment ■ You have 100s-1000s of databases with very little data inside (unlikely) ● Profiling mode ‘2’ ○ Unless troubleshooting an issue / intentional

26.Tuning Linux

27.Tuning Linux: Love your OS! ● “I can login via SSH...we’re done!” ● The database is only as fast as as the kernel ● Expect a default Linux install to be optimised for a cheap laptop, not your $$$ hardware

28.Tuning Linux: The Linux Kernel ● Avoid Linux earlier than 3.10.x - 3.12.x ● Large improvements in parallel efficiency in 3.10+ (for Free!) ● More:

29.Tuning Linux: NUMA ● A memory architecture that takes into account the locality of memory, caches and CPUs for lower latency ○ But no databases want to use it :( ● MongoDB codebase is not NUMA “aware” ○ Unbalanced memory allocations to a single zone