在Spark错误上使用Spark ML - 集群告诉我们什么? 与霍尔顿卡劳

如果您订阅了user@spark.apache.org,或者在大公司工作,您可能会看到一些常见的Spark错误消息。 在过去的几年里,即使参加Spark Summit,你也会看到像Spark中的“Top K Mistakes”这样的会谈。虽然很酷的非基于机器学习的工具确实存在以检查Spark的日志 - 但它们不使用机器学习,因此不是很酷但也受到人类为他们编写规则所付出的努力的限制。 本PPT将介绍当我们在堆栈跟踪上训练“常规”聚类模型时会发生什么,并探索用于将用户消息分类到Spark列表的DL模型。 来确保机器人无法自我修复,并留下来学习如何在我们的机器人朋友的帮助下更好地工作。 这次是Spark ML on Spark输出,加上一点Tensorflow对整个集群来说很有趣,但可能没有自动响应用户列表的提示。

1.Using Spark ML on Spark Errors What Do the Clusters Tell Us?

2.Who am I? ● My name is Holden Karau ● Prefered pronouns are she/her ● Developer Advocate at Google focused on OSS Big Data ● Apache Spark PMC (think committer with tenure) ● Contributor to a lot of other projects ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of High Performance Spark & Learning Spark (+ more) ● Twitter: @holdenkarau ● Slideshare http://www.slideshare.net/hkarau ● Linkedin https://www.linkedin.com/in/holdenkarau ● Github https://github.com/holdenk ● Related Spark Videos http://bit.ly/holdenSparkVideos


4.Normally I’d introduce my co-speaker Sylvie burr ● However she was organizing the Apache Beam Summit and is just too drained to be able to make it. ● I did have to cut a few corners (and re-use a few cat pictures) as a result

5.Some links (slides & recordings will be at): CatLoversShow Today’s talk: http://bit.ly/2QoZuKz Yesterday’s talk (Validating Pipelines): https://bit.ly/2QqQUea

6.Who do I think you all are? Amanda ● Nice people* ● Familiar-ish to very familiar with Spark ● Possibly a little bit jaded (but also maybe not)

7.What we are going to explore together! ● The Spark Mailing Lists ○ Yes, even user@ ● My desire to be lazy ● The suspicion that srowen has a robot army to help ● A look at how much work it would be to build that robot army ● The depressing realization “heuristics” are probably better anyways (and some options)

8.Some of the reasons my employer cares* Khairil Zhafri ● We have a hoted Spark/Hadoop solution (called Dataproc) ● We also have hosted pipeline management tools (based on Airflow called Cloud Composer) ● Being good open source community members *Probably, it’s not like I go to all of the meetings I’m invited to.

9.The Spark Mailing Lists & friends Petfu l ● user@ ○ Where people to to ask questions about using Spark ● dev@ ○ Discussion about developing Spark ○ Also where people sometimes go when no one answers user@ ● Stackoverflow ○ Some very active folks here as well ● Books/etc.

10.~ unanswered Spark posts Richard J 8536 :(

11. Stack overflow growth over time Khalid Abduljaleel Petfu l *Done with bigquery. Sorry!

12.Discoverability might matter Petfu l

13.Anyone have an outstanding PR? koi ko

14.So how do we handle this? Helen Olney ● Get more community volunteers ○ (hard & burn out) ● Answer more questions ○ (hard & burn out) ● Answer less questions? ○ (idk maybe someone will buy a support contract) ● Make robots! ○ Hard and doesn’t work entirely

15.How many of you have had? Helen Olney ● Java OOM ● Application memory overhead exceeded ● Serialization exception ● Value is bigger than integer exception ● etc.

16.Maaaaybe robots could help? Matthew Hurst ● It certainly seems like some folks have common issues ● Everyone loves phone trees right? ○ Press 1 if you’ve had an out-of-memory exception press 2 if you’re running Python ● Although more seriously some companies are building recommendation systems on top of Spark to solve this for their customers

17.Ok well, let’s try and build some clusters? _torne ● Not those cluster :p ● Lets try k=4, we had 4 common errors right?

18.I’m lazy so let’s use Spark: _torne body_hashing = HashingTF(inputCol="body_tokens", outputCol="raw_body_features", numFeatures=10000) body_idf =IDF(inputCol="raw_body_features", outputCol="body_features") assembler = VectorAssembler( inputCols=["body_features", "contains_python_stack_trace", "contains_java_stack_trace", "contains_exception_in_task", "is_thread_start", "domain_features"], outputCol="features") kmeans = KMeans(featuresCol="features", k=4, seed=42)

19.Damn not quite one slide :( _torne dataprep_pipeline = Pipeline(stages=[tokenizer, body_hashing, body_idf, domains_hashing, domains_idf, assembler]) pipeline = Pipeline(stages=[dataprep_pipeline, kmeans])

20.f ford Pinto by Morven

21.ayphen f ford Pinto by Morven

22.“Northern Rock”

23.Let’s see what the results Rikki's Refuge ● Let’s grab an email or two from each cluster and take a peek

24.Waiiiit…. Rikki's Refuge

25.Oh hmm. Well maybe 4*4 right? Sherrie Thai ● 159 non group-zero messages…

26.Well when do we start to see something? w.vandervet *Not actually a great way to pick K

27.Let’s look at some of the records - 73 Susan Young 1 {plain=Hi All,\n Greetings ! I needed some help to read a Hive table\nvia Pyspark for which the transactional property is set to 'True' (In other\nwords ACID property is enabled). Following is the entire stacktrace and the\ndescription of the hive table. Would you please be able to help me resolve\nthe error:\n\n18/03/01 11:06:22 INFO BlockManagerMaster: Registered BlockManager\n18/03/01 11:06:22 INFO EventLoggingListener: Logging events to\nhdfs:///spark-history/local-1519923982155\nWelcome to\n ____ __\n / __/__ ___ _____/ /__\n _\ \/ _ \/ _ `/ __/ '_/\n /__ / .__/\_,_/_/ /_/\_\ version 1.6.3\n /_/\n\nUsing Python version 2.7.12 (default, Jul 2 2016 17:42:40)\nSparkContext available as sc, HiveContext available as sqlContext.\n>>> from pyspark.sql import HiveContext\n>>> hive_context = HiveContext(sc)\n>>> hive_context.sql("select count(*)

28.Let’s look at some of the records - 133 nagy.tamas 5 {plain=Hi Gourav,\n\nMy answers are below.\n\nCheers,\nBen\n\n\n> On Feb 23, 2017, at 10:57 PM, Gourav Sengupta <gourav.sengupta@gmail.com> wrote:\n> \n> Can I ask where are you running your CDH? Is it on premise or have you created a cluster for yourself in AWS? Our cluster in on premise in our data center.\n> \n> Also I have really never seen use s3a before, that was used way long before when writing s3 files took a long time, but I think that you are reading it. \n> \n> Anyideas why you are not migrating to Spark 2.1, besides speed, there are lots of apis which are new and the existing ones are being deprecated. Therefore there is a very high chance that you are already working on code which is being deprecated by the SPARK community right now. We use CDH and upgrade with whatever Spark version they include, which is 1.6.0. We are waiting for the move to Spark 2.0/2.1.\n> \n> And besides that would you not want to work on a platform which is at least 10 times faster What would that be?\n> \n> Regards,\n> Gourav Sengupta\n> \n> On Thu, Feb 23, 2017 at 6:23 PM, Benjamin Kim <bbuild11@gmail.com <mailto:bbuild11@gmail.com>> wrote:\n> We are trying to use Spark 1.6 within CDH 5.7.1 to retrieve a 1.3GB Parquet file from AWS S3. We can read the schema and show some data when the file is loaded into a DataFrame, but when we try to do some operations, such as count, we get this error below.\n> \n> com.cloudera.com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain\n> at com.cloudera.com.amazonaws.auth.AWSCredentialsProviderChain.getCredentials(AWSCredentialsProviderChain.java:117)\n> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3779)\n> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.headBucket(AmazonS3Client.java:1107)\n> at com.cloudera.com.amazonaws.services.s3.AmazonS3Client.doesBucketExist(AmazonS3Client.java:1070)\n> at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:239)\n> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2711)\n> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:97)\n> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2748)\n> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2730)\n> at

29.Let’s look at some of the records - 133 翮郡 陳 6 {plain=I see, that’s quite interesting. For problem 2, I think the issue is that Akka 2.0.5 *always* kept TCP connections open between nodes, so these messages didn’t get lost. It looks like Akka 2.2 occasionally disconnects them and loses messages. If this is the case, and this behavior can’t be disabled with a flag, then it’s a problem for other parts of the code too. Most of our code assumes that messages will make it through unless the destination node dies, which is what you’d usually hope for TCP.\n\nMatei\n\nOn Oct 31, 2013, at 1:33 PM, Imran Rashid <imran@quantifind.com> wrote:\n\n> pretty sure I found the problem -- two problems actually. And I think one of them has been a general lurking problem w/ spark for a while.\n> \n> 1) we should ignore disassociation events, as you suggested earlier. They seem to just indicate a temporary problem, and can generally be ignored. I've found that they're regularly followed by AssociatedEvents, and it seems communication really works fine at that point.\n> \n> 2) Task finished messages get lost. When this message gets sent, we dont' know it actually gets there:\n> \n> https://github.com/apache/incubator-spark/blob/scala-2.10/core/src/main/scala/org/apache/spark/executor/StandaloneExecutorBackend.scala #L90\n> \n> (this is so incredible, I feel I must be overlooking something -- but there is no ack somewhere else that I'm overlooking, is there??) So, after the patch, spark wasn't hanging b/c of the unhandled DisassociatedEvent. It hangs b/c the executor has sent some taskFinished messages that never get received by the driver. So the driver is waiting for some tasks to finish, but the executors think they are all done.\n> \n> I'm gonna add the reliable proxy pattern for this particular interaction and see if its fixes the problem\n> http://doc.akka.io/docs/akka/2.2.3/contrib/reliable-proxy.html#introducing-the-reliable-proxy\n> \n> imran\n> \n> \n> \n> On Thu, Oct 31, 2013 at 1:17 PM, Imran Rashid <imran@quantifind.com> wrote:\n> Hi Prashant,\n> \n> thanks for looking into this. I don't have any answers yet, but just wanted to send you an update. I finally figured out how to get all the akka logging turned on, so I'm looking at those for more info. One thing immediately jumped out at me -- the Disassociation is actually immediatley followed by an Association! so maybe I came to the wrong conclusion of our test of ignoring the DisassociatedEvent. I'm going to try it again -- hopefully w/ the logging on, I can find out more about what is going on. I might ask on akka list for help w/ what to look for. also this thread makes me think that it really should just re-associate:\n> https://groups.google.com/forum/#!searchin/akka-user/Disassociated/akka-user/SajwwbyTriQ/8oxjbZtawxoJ\n> \n> also, I've