Near Real-Time Analytics with Apache Spark: Ingestion, ETL, and Interactive Quer


Spark Open Source Community / 8413 views
Near real-time analytics has become a common requirement for many data teams as the technology has caught up to the demand. One of the hardest aspects of enabling near real-time analytics is making sure the source data is ingested and deduplicated often enough to be useful to analysts, while writing the data in a format that is usable by your analytics query engine. This is usually the domain of several tools, since the problem has three distinct aspects: streaming ingestion of data, deduplication using an ETL process, and interactive analytics. With Spark, this can be done with one tool. This talk will walk you through how to use Spark Streaming to ingest change-log data, use Spark batch jobs to perform major and minor compaction, and query the results with Spark SQL. At the end of this talk you will know what is required to set up near real-time analytics at your organization, the common gotchas (including file formats and distributed file systems), and how to handle the unique data integrity issues that arise from near real-time analytics.
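To make the deduplication step concrete, here is a minimal sketch of what compacting change-log data can mean: for each key, keep only the most recent change and drop keys whose latest change is a delete. This is plain Python rather than a Spark job, and the record layout (`key`, `seq`, `op`, `value`) and the `compact` function are hypothetical names chosen for illustration, not taken from the talk.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class Change(NamedTuple):
    """One change-log record (hypothetical layout, for illustration only)."""
    key: str                # primary key of the source row
    seq: int                # increasing sequence number assigned by the source
    op: str                 # "upsert" or "delete"
    value: Optional[dict]   # row contents; None for deletes


def compact(base: Iterable[Change], deltas: Iterable[Change]) -> List[Change]:
    """Merge newly ingested deltas into a base snapshot.

    Keeps the change with the highest sequence number per key, then
    removes keys whose surviving change is a delete.
    """
    latest: Dict[str, Change] = {}
    for change in list(base) + list(deltas):
        current = latest.get(change.key)
        # Compare by sequence number, so out-of-order arrival is handled.
        if current is None or change.seq > current.seq:
            latest[change.key] = change
    survivors = (c for c in latest.values() if c.op != "delete")
    return sorted(survivors, key=lambda c: c.key)


base = [
    Change("a", 1, "upsert", {"v": 1}),
    Change("b", 2, "upsert", {"v": 2}),
]
deltas = [
    Change("a", 3, "upsert", {"v": 10}),
    Change("b", 4, "delete", None),
    Change("c", 5, "upsert", {"v": 3}),
    Change("a", 2, "upsert", {"v": 5}),  # late arrival, superseded by seq 3
]
result = compact(base, deltas)
```

In a Spark batch job the same idea would typically be expressed as a grouping or windowing operation over the key, but the details depend on the file format and job layout, which this sketch does not attempt to reproduce.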