Validating Big Data Jobs - Stopping Failures Before Production on Apache Spark, with Holden Karau

As big data jobs move from the proof-of-concept phase to powering real production services, we need to start considering what will happen when everything eventually goes wrong (such as recommending inappropriate products, or other decisions made on bad data). We will all eventually board the failboat (especially with roughly 40% of respondents automatically deploying their Spark job results to production), so it's important to automatically detect when things have gone wrong so we can stop a deployment before we need to update our resumes. Figuring out when something has gone very wrong is trickier than it first appears, since we want to catch errors before our users notice them (or, failing that, before CNN notices them). We will explore general techniques for validation, look at responses from people validating big data jobs in production environments, and cover libraries that can help us write relative validation rules based on historical data. For those working on streaming, we will discuss the unique challenges of attempting validation in near-real-time systems, and what we can do besides keeping an up-to-date resume for when things go wrong. To keep the talk interesting there are real-world examples (with company names removed), along with several Creative-Commons-licensed cat pictures and cute panda GIFs.

1. Validating Big Data & ML Pipelines (Apache Spark) Melinda Seckington Now mostly “works”*

2.Holden: ● My name is Holden Karau ● Preferred pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC, contributor to many others (including Airflow) ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● Slide share http://www.slideshare.net/hkarau ● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ● Spark Talk Videos http://bit.ly/holdenSparkVideos ● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback

3.

4.What is going to be covered: Andrew ● Why my employer cares about this stuff ● My assumptions about y’all ● A super brief look at property testing ● What validation is & why you should do it for your data pipelines ● How to make simple validation rules & our current limitations ● ML Validation - Guessing if our black box is “correct” ● Cute & scary pictures ○ I promise at least one cat

5.Some of the reasons my employer cares* Khairil Zhafri ● We have a hosted Spark/Hadoop solution (called Dataproc) ● We also have hosted pipeline management tools (based on Airflow, called Cloud Composer) ● Being good open source community members *Probably, it’s not like I go to all of the meetings I’m invited to.

6.Who I think you wonderful humans are? ● Nice* people ● Like silly pictures ● Possibly familiar with one of Scala, Java, or Python? ● Possibly familiar with Spark? ● Want to make better software ○ (or models, or w/e) ● Or just want to make software good enough to not have to keep your resume up to date

7.So why should you test? ● Makes you a better person ● Avoid making your users angry ● Save $s ○ Having an ML job fail in hour 26 to restart everything can be expensive... ● Waiting for our jobs to fail is a pretty long dev cycle ● Honestly you’re probably not watching this unless you agree

8.So why should you validate? ● tl;dr - Your tests probably aren’t perfect ● You want to know when you're aboard the failboat ● Our code will most likely fail at some point ○ Sometimes data sources fail in new & exciting ways (see “Call me Maybe”) ○ That jerk on that other floor changed the meaning of a field :( ○ Our tests won’t catch all of the corner cases that the real world finds ● We should try and minimize the impact ○ Avoid making potentially embarrassing recommendations ○ Save having to be woken up at 3am to do a roll-back ○ Specifying a few simple invariants isn’t all that hard ○ Repeating Holden’s mistakes is still not fun

9.So why should you test & validate: Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark

10.So why should you test & validate - cont Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark

11.Why don’t we test? ● It’s hard ○ Faking data, setting up integration tests ● Our tests can get too slow ○ Packaging and building scala is already sad ● It takes a lot of time ○ and people always want everything done yesterday ○ or I just want to go home see my partner ○ Etc. ● Distributed systems is particularly hard

12.Why don’t we test? (continued)

13.Why don’t we validate? ● We already tested our code ○ Riiiight? ● What could go wrong? Also extra hard in distributed systems ● Distributed metrics are hard ● not much built in (not very consistent) ● not always deterministic ● Complicated production systems

14.What happens when we don’t itsbruce ● Personal stories go here ○ I have no comment about where these stories are from This talk is being recorded so we’ll leave it at: ● Negatively impacted the brand in difficult-to-quantify ways with words with multiple meanings ● Breaking a feature that cost a few million dollars ● Almost recommended illegal content (caught by a lucky manual review) ● Every search result was a coffee shop

15.Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455

16.Where do folks get the data for pipeline tests? Lori Rielly ● Most people generate data by hand ● If you have production data you can sample you are lucky! ○ If possible you can try and save in the same format ● If our data is a bunch of Vectors or Doubles Spark’s got tools :) ● Coming up with good test data can take a long time ● Important to test different distributions, input files, empty partitions etc.
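Hand-generated data along these lines is usually just a few literals per edge case. A minimal sketch (the JSON shapes and values below are invented for illustration):

```scala
// Hand-built fixtures covering the edge cases called out above:
// empty input, typical records, malformed records, and a skewed key.
val empty: Seq[String] = Seq.empty // empty input / empty partitions
val typical: Seq[String] = Seq(
  """{"user": 1, "score": 0.5}""",
  """{"user": 2, "score": 0.9}""")
val malformed: Seq[String] = Seq("""{"user": """, "definitely not json")
val skewed: Seq[String] = Seq.fill(100)("""{"user": 1, "score": 0.5}""") // one hot key

assert(empty.isEmpty)
assert(skewed.length == 100)
```

In Spark tests these can become RDDs via sc.parallelize(typical); parallelizing the empty sequence across several slices is one way to get genuinely empty partitions.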

17.Property generating libs: QuickCheck / ScalaCheck PROtara hunt ● QuickCheck (haskell) generates tests data under a set of constraints ● Scala version is ScalaCheck - supported by the two unit testing libraries for Spark ● Sscheck (scala check for spark) ○ Awesome people*, supports generating DStreams too! ● spark-testing-base ○ Also Awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs *I assume

18.With spark-testing-base & a million entries

test("map should not change number of elements") {
  implicit val generatorDrivenConfig =
    PropertyCheckConfig(minSize = 0, maxSize = 1000000)
  val property = forAll(RDDGenerator.genRDD[String](sc)) { rdd =>
    importantBusinessLogicFunction(rdd).count() == rdd.count()
  }
  check(property)
}

19.But that can get a bit slow for all of our tests ● Not all of your tests should need a cluster (or even a cluster simulator) ● If you are ok with not using lambdas everywhere you can factor out that logic and test with traditional tools ● Or if you want to keep those lambdas - or verify the transformation logic without the overhead of running a local distributed system - you can try a library like kontextfrei ○ Don’t rely on this alone (but can work well with something like scalacheck)
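The factored-out approach can be sketched in plain Scala: the record-level logic lives in an ordinary function (the name and clamping rule below are invented for illustration), so ordinary unit tests exercise it without any SparkContext:

```scala
// Record-level logic factored out of the Spark lambdas so it can be
// unit tested without a cluster. normalizeScore is a made-up example.
object BusinessLogic {
  // Clamp a raw score into [0, 1].
  def normalizeScore(raw: Double): Double =
    math.max(0.0, math.min(1.0, raw))
}

// In the job itself this would just be: scores.map(BusinessLogic.normalizeScore)
// In a plain unit test, no Spark needed:
assert(BusinessLogic.normalizeScore(1.5) == 1.0)
assert(BusinessLogic.normalizeScore(-0.2) == 0.0)
assert(BusinessLogic.normalizeScore(0.3) == 0.3)
```

The trade-off is losing inline lambdas in the job code; kontextfrei, as mentioned above, is one way to keep them while still testing off-cluster.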

20.Let’s focus on validation some more: *Can be used during integration tests to further validate integration results

21.So how do we validate our jobs? Photo by: Paul Schadler ● The idea is, at some point, you made software which worked. ● Maybe you manually tested and sampled your results ● Hopefully you did a lot of other checks too ● But we can’t do that every time, our pipelines are no longer write-once run-once they are often write-once, run forever, and debug-forever.

22. How many people have something like this? Lilithis

val data = ...
val parsed = data.flatMap(x =>
  try {
    Some(parse(x))
  } catch {
    case _ => None // Whatever, it's JSON
  })

23. But if we’re going to validate...

val data = ...
data.cache()
val validData = data.filter(isValid)
val badData = data.filter(x => !isValid(x))
if (validData.count() < badData.count()) {
  // Ruh Roh! Special business error handling goes here
}
...

Pager photo by Vitachao CC-SA 3

24.Well that’s less fun :( Sn.Ho ● Our optimizer can’t just magically chain everything together anymore ● My flatMap.map.map is fnur :( ● Now I’m blocking on a thing in the driver

25. Counters* to the rescue**! Miguel Olaya ● Spark has built in counters ○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc. ○ Shown in the UI; can also register a listener from the spark-validator project ● We can add counters for things we care about ○ invalid records, users with no recommendations, etc. ○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option ● We can _pretend_ we still have nice functional code *Counters/Accumulators are your friends, but the kind of friends who steal your lunch money ** In a similar way to how regular expressions can solve problems….
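Custom counters like the ones above are registered through the SparkContext. A minimal sketch (the accumulator names and local master are illustrative, not from the talk):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: named accumulators show up per stage in the Spark UI.
val spark = SparkSession.builder
  .master("local[2]")
  .appName("validation-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Named so they are easy to find in the UI and in listener events.
val happyCounter = sc.longAccumulator("valid-records")
val sadCounter = sc.longAccumulator("invalid-records")

// Increment inside transformations, then read happyCounter.value /
// sadCounter.value on the driver, only after an action has completed.
```

One reason these friends "steal your lunch money": updates made inside transformations (rather than actions) can be double-counted when tasks are retried, which is part of what SPARK-12469 discusses.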

26.First counters free….

27. Just a little bit of code for the next ones…. Phoebe Baker

val parsed = data.flatMap(x =>
  try {
    val result = Some(parse(x))
    happyCounter.add(1)
    result
  } catch {
    case _ =>
      sadCounter.add(1)
      None // Whatever, it's JSON
  })
// Special business data logic (aka wordcount)
// Much much later* business error logic goes here

Pager photo by Vitachao CC-SA 3

28.Ok but what about those *s Miguel Olaya ● Turns out accumulators aren’t really great for tracking data properties ● Turns out sometimes for validation we really care about data properties ● But we can kind of fake it and hope

29.General Rules for making Validation rules Photo by: Paul Schadler ● According to a sad survey most people check execution time & record count ● spark-validator is still in early stages but an interesting proof of concept ● Sometimes your rules will misfire and you’ll need to manually approve a job ● Remember those property tests? Could be validation rules ● Historical data ● Domain specific solutions
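A historical-data rule of the kind listed above can be as simple as comparing the current run's record count against recent runs. This sketch (the function name and 50% tolerance are invented; tune per pipeline) flags runs that drift too far from the recent mean:

```scala
// Sketch: flag a run whose record count deviates too far from the
// mean of recent runs. tolerance = 0.5 means "within 50% of the mean".
def withinHistoricalBounds(history: Seq[Long],
                           current: Long,
                           tolerance: Double = 0.5): Boolean = {
  if (history.isEmpty) true // no history yet: accept and start recording
  else {
    val mean = history.sum.toDouble / history.size
    math.abs(current - mean) <= tolerance * mean
  }
}

// Counts hovering around 1000: a normal run passes, a run that only
// produced 100 records fails and should block deployment.
assert(withinHistoricalBounds(Seq(950L, 1000L, 1050L), 1000L))
assert(!withinHistoricalBounds(Seq(950L, 1000L, 1050L), 100L))
```

The same shape works for any counter you track (invalid records, users with no recommendations); the hard part, as the slide notes, is handling legitimate shifts, which is where manual approval of misfiring rules comes in.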