申请试用
HOT
登录
注册
 

Maps and Meaning- Graph-based Entity Resolution in Apache Spark

Spark开源社区
/
发布于
/
4275
人观看

Data integration and the automation of tedious data extraction tasks are the fundamental building blocks of a data-driven organizations and are overlooked or underestimated at times. Aside from data extraction, scraping and ETL tasks, entity resolution is a crucial step in successfully combining datasets. The combination of data sources is usually what provides richness in features and variance. Building an expertise in entity resolution is important for data engineerings to successfully combine data sources. Graph-based entity resolution algorithms have emerged as a highly effective approach.

This talk will present the implementation of a graph-bases entity resolution technique in GraphX and in GraphFrames respectively. Working from concept, through how to implement the algorithm in Spark, the technique will also be illustrated by walking through a practical example. The technique will exhibit an example where efficacy can be achieved based on simple heuristics, and at the same time map a path to a machine-learning assisted entity resolution engine with a powerful knowledge graph at its center.

The role of ML can be found upstream in building the graph, for example by using classification algorithms in determining the link strength between nodes based on data, or downstream where dimensionality reduction can play a role in clustering and reduce the computational load in the resolution stage. The audience will leave with a clear picture of a scalable data pipeline performing entity resolution effectively and a thorough understanding of the internal mechanism, ready to apply it to their use cases.

6点赞
2收藏
0下载
确认
3秒后跳转登录页面
去登陆