Introduction to Spark and Multi-Data-Source Analysis with Spark
1. Introduction to Spark and Multi-Data-Source Analysis with Spark
Contact: liwei.li@alibaba-inc.com
2.• • Spark • • Spark •
3.
4.• • -> Spark • • -> HBase • • -> HDFS/ • • …… • • •
5. MapReduce (Hadoop), TEZ, and Spark compared
• MapReduce (Hadoop): map and shuffle; Mapper and Reducer; Java, SQL (Hive)
• TEZ: DAG; DAG, Vertex, and Edge; Java, SQL (Hive)
• Spark: RDD (DAG) + Cache; Stage and shuffle; SQL and Dataset structured APIs; streaming, MLlib, GraphX; SQL, Scala, Java, Python, R
• Named alongside Spark: HBase, MongoDB, Datastax, SQLServer
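6. Why Spark
Stack: MLlib, SQL, Streaming, and GraphX on top of the Spark Core API; languages: SQL, Python, Scala, Java, R
• Speed: queries benefit from Cache; Spark is cited as up to 100 times faster than Hadoop.
• Spark SQL
• Languages: SQL, Python, Scala, Java, R
• Data sources: Kafka, HBase, Cassandra, MongoDB, Redis, MYSQL, SQL Server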
7. Why Spark
Data source packages: https://spark-packages.org/?q=tags%3A%22Data%20Sources%22
8.Spark
9. HBase+Spark 90%
• Spark Streaming runs ETL and writes to HBase/Phoenix.
• HBase/Phoenix
• Spark SQL analyzes data in HBase/Phoenix.
10.
• Kafka, Streaming
• Spark Streaming consumes Kafka, runs ETL, and writes to HBase/Phoenix.
• Spark Streaming, HBase/Phoenix
• LogService, MongoDB, MYSQL
11.
• HBase and RDS data is archived by Spark to Parquet.
• Spark
• Kafka, HBase, RDS, MongoDB
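Archiving a relational table to Parquet with Spark, as this slide mentions, might look like the sketch below. It is only an illustration under assumptions: the JDBC URL, table name, credentials, and output directory are hypothetical, and the matching JDBC driver is assumed to be on the Spark classpath.

```python
from datetime import date


def archive_path(base_dir: str, table: str, day: date) -> str:
    """Build a date-partitioned output directory such as <base>/<table>/dt=2024-01-31."""
    return f"{base_dir.rstrip('/')}/{table}/dt={day.isoformat()}"


def archive_table(spark, jdbc_url, table, user, password, base_dir, day):
    """Read one table over JDBC and store it as Parquet for later analysis in Spark.

    `spark` is an existing SparkSession; every other argument is a
    hypothetical placeholder supplied by the caller.
    """
    df = (
        spark.read.format("jdbc")
        .option("url", jdbc_url)
        .option("dbtable", table)
        .option("user", user)
        .option("password", password)
        .load()
    )
    target = archive_path(base_dir, table, day)
    # Overwrite only this day's directory so a rerun stays idempotent.
    df.write.mode("overwrite").parquet(target)
    return target
```

The archived directories can later be read back with `spark.read.parquet(...)`, which keeps analytical scans off the online database.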
12.
• Spark Streaming, MLlib
• Kafka, HBase, RDS, MongoDB
13.
• HBase, RDS, MongoDB
• Spark Cache
• Cache in Spark
14.
15.
• (1) Data arrives in Kafka.
• (2) ETL: Spark Streaming consumes Kafka, runs ETL, and writes to HBase/Phoenix.
• (3) Spark SQL analyzes data in HBase/Phoenix and writes results back to HBase/Phoenix.
• (4) BI tools query HBase/Phoenix over JDBC.
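Steps (2) and (3) above can be sketched in PySpark roughly as follows. This is a minimal sketch, not the speaker's implementation. It assumes Structured Streaming with Spark's Kafka source, JSON messages, and the phoenix-spark connector (format `org.apache.phoenix.spark`) on the classpath. The broker address, topic, table names, column names, and ZooKeeper URL are all hypothetical, and the Phoenix target tables are assumed to exist already.

```python
# Sketch of steps (2) and (3): Kafka -> Spark ETL -> Phoenix, then Spark SQL over Phoenix.
# pyspark is imported inside the functions so the option helpers work without it.


def kafka_source_options(brokers, topic):
    """Options for Spark's built-in Kafka source."""
    return {
        "kafka.bootstrap.servers": brokers,
        "subscribe": topic,
        "startingOffsets": "latest",
    }


def phoenix_options(table, zk_url):
    """Options for the phoenix-spark connector (assumed to be installed)."""
    return {"table": table, "zkUrl": zk_url}


def start_etl(spark, brokers, topic, table, zk_url):
    """Step (2): consume Kafka, parse and filter events, write each micro-batch to Phoenix."""
    from pyspark.sql import functions as F
    from pyspark.sql.types import LongType, StringType, StructField, StructType

    # Hypothetical event layout; adjust to the real message format.
    schema = StructType([
        StructField("USER_ID", StringType()),
        StructField("ACTION", StringType()),
        StructField("TS", LongType()),
    ])
    raw = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(brokers, topic))
        .load()
    )
    events = (
        raw.select(F.from_json(F.col("value").cast("string"), schema).alias("e"))
        .select("e.*")
        .where(F.col("USER_ID").isNotNull())  # drop malformed messages
    )

    def write_batch(batch_df, batch_id):
        # The connector requires mode "overwrite"; rows are upserted by primary key.
        (batch_df.write.format("org.apache.phoenix.spark")
         .mode("overwrite")
         .options(**phoenix_options(table, zk_url))
         .save())

    return events.writeStream.foreachBatch(write_batch).start()


def run_report(spark, source_table, report_table, zk_url):
    """Step (3): analyze a Phoenix table with Spark SQL and write the result back."""
    src = (
        spark.read.format("org.apache.phoenix.spark")
        .options(**phoenix_options(source_table, zk_url))
        .load()
    )
    src.createOrReplaceTempView("events")
    report = spark.sql("SELECT ACTION, COUNT(*) AS CNT FROM events GROUP BY ACTION")
    (report.write.format("org.apache.phoenix.spark")
     .mode("overwrite")
     .options(**phoenix_options(report_table, zk_url))
     .save())
```

Step (4) needs no Spark code: a BI tool connects to Phoenix through its JDBC driver and queries the report table directly.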
16.
17.Spark
18. Spark resources
• Spark Streaming: https://github.com/jaceklaskowski/spark-structured-streaming-book and https://github.com/lw-lin/CoolplaySpark
• Spark Core: https://github.com/JerryLead/SparkInternals
• Spark SQL: https://github.com/jaceklaskowski/mastering-spark-sql-book
• Spark MLlib: https://github.com/endymecy/spark-ml-source-analysis
• https://yq.aliyun.com/teams/382?spm=a2c4e.11153940.0.0.4f4c6d3dCRQqzB (submit)
19.
20.
Diagram: Spark (SQL, Streaming, MLlib, GraphX) connected to HBase, RDS, MongoDB, and Redis; (Meta+OSS); (ECS); HBase, Spark
• "SQL ThriftServer" and "LivyServer" for Spark
• (0) SLA, Spark
• Spark and HBase: SQL, Join, HBase
• Spark
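A SQL join across two of the stores named on this slide, HBase (through Phoenix) and RDS, could be sketched as below. This is an illustration only: the table names, column names, and connection settings are hypothetical, and the phoenix-spark connector and a JDBC driver for the RDS database are assumed to be available to Spark.

```python
def jdbc_options(url, table, user, password):
    """Options for Spark's built-in JDBC source (used here for RDS)."""
    return {"url": url, "dbtable": table, "user": user, "password": password}


def orders_per_user(spark, zk_url, rds):
    """Join a Phoenix table with an RDS table in Spark SQL.

    `spark` is an existing SparkSession and `rds` is a dict built by
    jdbc_options(). ORDERS (Phoenix) and the users table (RDS) are
    hypothetical, as are their columns.
    """
    orders = (
        spark.read.format("org.apache.phoenix.spark")
        .options(table="ORDERS", zkUrl=zk_url)
        .load()
    )
    users = spark.read.format("jdbc").options(**rds).load()

    # Temp views let both sources be addressed from one SQL statement.
    orders.createOrReplaceTempView("orders")
    users.createOrReplaceTempView("users")
    return spark.sql(
        """
        SELECT u.name, COUNT(*) AS order_count
        FROM orders o
        JOIN users u ON o.user_id = u.id
        GROUP BY u.name
        """
    )
```

Spark executes the join itself, so the two stores do not need to know about each other.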
21.- 1 link https://dwz.cn/Fvqv066s - 2 ( ):
22.