14/06 - Apache Cassandra Best Practices at Ebay


1. Cassandra Best Practices at eBay Inc.
Feng Qu, principal database engineer, eBay Inc.
September 11, 2014
CassandraSummit2014 | #CassandraSummit

2. Agenda
• eBay Inc. Cassandra footprint
• NoSQL life cycle
• Cassandra best practices
• Q&A

3. eBay Inc.

4. eBay Inc. Database Platforms
• We manage thousands of databases powering eBay and PayPal

5. Why NoSQL?
• Challenges of traditional RDBMS
  • Performance penalty to maintain ACID features
  • Lack of native sharding and replication features
  • Lack of linear scalability
  • Cost of software/hardware
  • Higher cost of commit
• NoSQL used in eBay Inc.
  • Cassandra, Couchbase, MongoDB managed by DBAs
  • HBase, Redis, OpenTSDB managed by developers

6. Cassandra @ eBay Inc.
• Started in 2011 at eBay and later expanded to PayPal
• Started with Apache Cassandra 0.8, now using Apache Cassandra 2.0 and DataStax Enterprise 4.0
• Over a dozen production clusters on hundreds of servers across 3 data centers
• A choice between a dedicated cluster for large or critical use cases and a multi-tenant cluster for small use cases
• Over 20 billion daily reads/writes to Cassandra
• Cluster size varies from 4 nodes to 80 nodes
• 100TB+ of user data on HDD, local SSD and SSD array
• One cluster is estimated to grow to more than a few PB

7. NoSQL Life Cycle
[Diagram: Use Case Analysis, Data Modeling, Capacity Planning, Deployment, Operation]

8. Data Modeling Phase
• The development team requests a review meeting with a data architect for a new use case
• Once the data architect understands the requirements, they recommend a suitable data store. It could be either an RDBMS or one of the NoSQL products we support
• Both parties work on data modeling together
• The output of the engagement is a set of tickets, for tracking purposes, which capture project information and the data configuration for the chosen data store

9. Data Modeling Best Practices
• Data modeling for Cassandra is quite different from data modeling for a traditional RDBMS
• Model around query patterns, not entities
• De-normalize to improve read performance
• Separate read-heavy data from write-heavy data
• Store values in column names, as names are already physically sorted
• Former eBay architect Jay Patel published a few technical blog posts on Cassandra data modeling
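The point about storing values in column names can be illustrated with a toy model. The sketch below is not Cassandra code; `WideRow`, its methods, and the sample timestamps and item names are all hypothetical. It only shows why keeping column names sorted lets a range of values be read as one contiguous slice.

```python
import bisect


class WideRow:
    """Toy stand-in for one wide row: column names are kept in sorted order."""

    def __init__(self):
        self._names = []    # sorted column names (here: (timestamp, item) tuples)
        self._values = {}   # column name -> column value (often empty)

    def insert(self, name, value=b""):
        # The information lives in the column name; the value can stay empty.
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start, end):
        # Because names are sorted, a range query is one contiguous scan.
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


row = WideRow()
for ts, item in [(1410400800, "item-42"), (1410397200, "item-7"), (1410404400, "item-99")]:
    row.insert((ts, item))

# Everything recorded between two timestamps, already in time order.
recent = row.slice((1410400000, ""), (1410410000, ""))
```

The row is modeled around the query ("what happened in this time window?") rather than around an entity, which is the modeling style the slide recommends.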

10. Data Modeling Best Practices - indexing
• Secondary index
  + Less overhead, as it is built in
  + Data and index are changed atomically
  - Does not scale well with high-cardinality data
• Column family as index
  + No hot spot
  - Index is maintained manually by the application
  - Index changes are not atomic
• Avoid secondary indexes and use a column family as the index if possible
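The trade-off of the column-family-as-index approach can be sketched in plain Python. This is an illustration under assumptions, not driver code: the two dictionaries stand in for a data column family and an index column family, and `save_user` and `find_by_city` are hypothetical names.

```python
from collections import defaultdict

users = {}                         # data column family: user_id -> row
users_by_city = defaultdict(set)   # index column family: city -> user_ids


def save_user(user_id, city):
    """The application issues separate writes; nothing makes them atomic."""
    old = users.get(user_id)
    users[user_id] = {"city": city}                   # write 1: the data row
    if old is not None and old["city"] != city:
        users_by_city[old["city"]].discard(user_id)   # write 2: remove stale index entry
    users_by_city[city].add(user_id)                  # write 3: add new index entry


def find_by_city(city):
    # One lookup against the index, no scan over the data rows.
    return sorted(users_by_city.get(city, ()))


save_user("u1", "San Jose")
save_user("u2", "San Jose")
save_user("u1", "Austin")
```

If the process fails between write 1 and write 3, the index and the data disagree until the application repairs them, which is the "not atomic" cost listed on the slide.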

11. Benchmark Testing
• Benchmark testing is key to capacity planning
• Performance baseline with near-real traffic in a production-size environment
  • for different types of hardware
  • for different software releases
  • for different use cases or workloads
• A proactive and repetitive process

12. Capacity Planning Phase
• Key to avoiding surprises in production
• The concept behind capacity planning is simple, but the mechanics are harder
• Business requirements may increase, so we need to forecast how much resource must be added to the system to ensure that the user experience continues uninterrupted
  • Input: a clearly defined capacity goal coming from business requirements, and a performance baseline from benchmark tests
  • Output: the resources to be added, such as memory, CPU, storage, I/O, network
• Always prepare for peak + headroom
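The "peak + headroom" rule reduces to simple arithmetic once a benchmark baseline exists. The sketch below is one possible way to express it; the function name, the 30% default headroom, and the use of the replication factor as a minimum cluster size are assumptions for illustration, not figures from the talk.

```python
import math


def nodes_needed(peak_ops_per_sec, per_node_ops_per_sec,
                 headroom=0.3, replication_factor=3):
    """Estimate node count from a capacity goal and a benchmark baseline.

    peak_ops_per_sec:     capacity goal taken from business requirements
    per_node_ops_per_sec: baseline measured in a benchmark test, assumed to be
                          measured with replication already in place
    headroom:             extra fraction reserved on top of peak (assumed value)
    """
    target = peak_ops_per_sec * (1 + headroom)
    nodes = math.ceil(target / per_node_ops_per_sec)
    # Never plan fewer nodes than there are replicas of the data.
    return max(nodes, replication_factor)


# Hypothetical numbers: 400k ops/s at peak, 25k ops/s per node, 25% headroom.
estimate = nodes_needed(400_000, 25_000, headroom=0.25)
```

The same calculation can be repeated per resource (storage, network, I/O), and the largest result drives the plan.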

13. Deployment Best Practices
• Software packages with customized optimization
  • kernel, JVM heap, compaction
• Deployment automation for efficiency
• Multi data center deployment for load balancing and disaster recovery
• Vnodes are a must for manageability
• SSD as default storage requires additional OS-level tuning

14. Operation Best Practices
• Collect system and database metrics
• Monitoring and alerting
  • event-driven and metrics-driven alerts
• Operation runbook
  • Reduce human error
• Performance tuning runbook
  • nodetool tpstats for dropped requests
  • nodetool cfhistograms for latency distribution
• Troubleshooting runbook
  • Document previous incidents as future reference

15. Operation Best Practices
• Routine repair is not really needed if there are no deletes. You still need to run repair after bringing a node back up if it has been down for a while
• Use a CNAME in client configuration to avoid client configuration changes when hardware is replaced with a new IP/name
• Reduce gc_grace to reduce overall data size
• Disable row cache, unless you have <100K rows
• Collect statistics, real-time or historical, to monitor overall system performance
• Disable swap to avoid a slow node

16. Capacity Review
• Routine capacity review and adjustment
• When to scale up and when to scale out
  • In general, scale out by adding nodes to increase capacity with NoSQL
  • Sometimes it’s cost-efficient to scale up at the component level by identifying the scaling bottleneck, then resolving it accordingly
    • Network bandwidth: upgrade to a 10 Gbps network
    • I/O latency: upgrade to (better) SSD
    • Storage: add/expand data volume

17. Typical Use Cases
• Write intensive: metrics collection, logging
  • Collecting metrics from tens of thousands of devices periodically
• Read intensive: home page feeds
  • Recommendation backend to generate dynamic taste graph
• Mixed workload: personalization, classification
  • Data is loaded from the data warehouse periodically in bulk and from user events consistently
  • Data is retrieved in real time when a user visits the eBay site

18. Metrics Collection Application

19. The End
• We are hiring for NoSQL talent.
• Contact:
  • fengqu@ebay.com
  • www.linkedin.com/in/fengqu/
• Q&A