Big Data and Data Mining

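This deck covers the ecosystem of today's big data frameworks, the basic programming models, and the fundamentals of the related components. It is information-dense and works well as an introductory course for beginners.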

1.Big Data Mining
Course: 1052DM02, MI4 (M2244) (3069), Thu 8, 9 (15:10-17:00) (B130)
Instructor: Min-Yuh Day (戴敏育), Assistant Professor, Dept. of Information Management, Tamkang University
http://mail.tku.edu.tw/myday/
2017-02-23
Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem

2.Syllabus
Week  Date        Subject/Topics
1     2017/02/16  Course Orientation for Big Data Mining
2     2017/02/23  Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem
3     2017/03/02  Association Analysis
4     2017/03/09  Classification and Prediction
5     2017/03/16  Cluster Analysis
6     2017/03/23  Case Study 1 (Cluster Analysis - K-Means using SAS EM)
7     2017/03/30  Case Study 2 (Association Analysis using SAS EM)

3.Syllabus
Week  Date        Subject/Topics
8     2017/04/06  Off-campus study
9     2017/04/13  Midterm Project Presentation
10    2017/04/20  Midterm Exam
11    2017/04/27  Case Study 3 (Decision Tree, Model Evaluation using SAS EM)
12    2017/05/04  Case Study 4 (Regression Analysis, Artificial Neural Network using SAS EM)
13    2017/05/11  Deep Learning with Google TensorFlow
14    2017/05/18  Final Project Presentation
15    2017/05/25  Final Exam

4.2017/02/23 Fundamental Big Data: MapReduce Paradigm, Hadoop and Spark Ecosystem
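The MapReduce paradigm named in this lecture title can be illustrated with a minimal single-process sketch. It is not Hadoop's or Spark's API: the function names map_phase, shuffle, reduce_phase and word_count are hypothetical, and the sample documents are made up. It only shows the three conceptual steps (map, group by key, reduce) on the classic word-count task.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values under their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence on a list of strings."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Made-up input purely for illustration.
    print(word_count(["big data mining", "Big data analytics"]))
```

In a real cluster the map and reduce calls would run in parallel on different machines and the shuffle would move data over the network; here everything runs in one process to keep the idea visible.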

5.Big Data Analytics and Data Mining

6.Architectures of Big Data Analytics

7.Architecture of Big Data Analytics
Figure: pipeline stages and their labels
Big Data Sources: internal, external; multiple formats, multiple locations, multiple applications
Big Data Transformation (raw data to transformed data): middleware; extract, transform, load; data warehouse; traditional formats (CSV, tables)
Big Data Platforms & Tools: Hadoop, MapReduce, Pig, Hive, Jaql, Zookeeper, HBase, Cassandra, Oozie, Avro, Mahout, others
Big Data Analytics / Big Data Analytics Applications: queries, reports, OLAP, data mining
Source: Stephan Kudyba (2014), Big Data, Mining, and Analytics: Components of Strategic Decision Making, Auerbach Publications

8.Architecture of Big Data Analytics
Figure: the same pipeline as on slide 7, with Data Mining highlighted among the Big Data Analytics Applications
Source: Stephan Kudyba (2014), Big Data, Mining, and Analytics: Components of Strategic Decision Making, Auerbach Publications
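The flow in this architecture (raw data, then extract/transform/load, then a query on the transformed data) can be sketched in a few lines. This is only an illustration under stated assumptions: the CSV content, the column names region and amount, the in-memory list standing in for a data warehouse, and all function names are hypothetical and do not come from the slides.

```python
import csv
import io

# Made-up raw data standing in for a "traditional format" CSV source.
RAW_CSV = "region,amount\nNorth, 10\nsouth,5\nnorth,7\n"


def extract(text):
    """Extract: read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize region names and convert amounts to integers."""
    return [
        {"region": row["region"].strip().lower(), "amount": int(row["amount"])}
        for row in rows
    ]


def load(rows, warehouse):
    """Load: append transformed rows to the (in-memory) warehouse."""
    warehouse.extend(rows)
    return warehouse


def total_by_region(warehouse):
    """Analytics application: a simple aggregate query over the warehouse."""
    totals = {}
    for row in warehouse:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return totals


if __name__ == "__main__":
    warehouse = load(transform(extract(RAW_CSV)), [])
    print(total_by_region(warehouse))
```

The point of the sketch is the separation of stages: the query at the end never sees raw data, only the cleaned rows that the transform step produced.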

9.Architecture for Social Big Data Mining (Hiroshi Ishikawa, 2015)
Figure labels:
Layers: Physical Layer, Logical Layer, Conceptual Layer
Elements: hardware, software, social data; integrated analysis, multivariate analysis, data mining, application-specific task
Analysts: model construction; explanation by model; construction and confirmation of individual hypothesis; description and execution of application-specific task
Enabling Technologies: integrated analysis model, natural language processing, information extraction, anomaly detection, discovery of relationships among heterogeneous data, large-scale visualization, parallel distributed processing
Source: Hiroshi Ishikawa (2015), Social Big Data Mining, CRC Press

10.Business Intelligence (BI) Infrastructure Source: Kenneth C. Laudon & Jane P. Laudon (2014), Management Information Systems: Managing the Digital Firm, Thirteenth Edition, Pearson.

11.Data Warehouse Data Mining and Business Intelligence
Figure: layers listed from top to bottom, with increasing potential to support business decisions toward the top
Decision Making
Data Presentation: Visualization Techniques
Data Mining: Information Discovery
Data Exploration: Statistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, from top to bottom: End User, Business Analyst, Data Analyst, DBA
Source: Jiawei Han and Micheline Kamber (2006), Data Mining: Concepts and Techniques, Second Edition, Elsevier

12.The Evolution of BI Capabilities Source: Turban et al. (2011), Decision Support and Business Intelligence Systems

13.Data Science and Business Intelligence Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015

14.Data Science and Business Intelligence: Predictive Analytics and Data Mining (Data Science)
Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015

15.Data Science and Business Intelligence: Predictive Analytics and Data Mining (Data Science)
Questions: What if …? What's the optimal scenario for our business? What will happen next? What if these trends continue? Why is this happening?
Techniques: optimization, predictive modeling, forecasting, statistical analysis
Data: structured/unstructured data, many types of sources, very large datasets
Source: EMC Education Services, Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data, Wiley, 2015

16.Data Mining at the Intersection of Many Disciplines Source: Turban et al. (2011), Decision Support and Business Intelligence Systems

17.A Taxonomy for Data Mining Tasks Source: Turban et al. (2011), Decision Support and Business Intelligence Systems

18.Traditional Analytics
Figure labels: Operational Data Sources, EDW, Data Marts, Analytic Marts, BI and Analytics
Unstructured, semi-structured and streaming data (e.g. sensor data) is often handled outside the warehouse flow.
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

19.Hadoop as a “new data” Store
Figure labels: Operational Data Sources, EDW, Data Marts, Analytic Marts, BI and Analytics
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

20.Hadoop as an additional input to the EDW
Figure labels: Operational Data Sources, EDW, Data Marts, Analytic Marts, BI and Analytics
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

21.Hadoop Data Platform as a “staging layer” as part of a “data lake”
Downstream stores could be Hadoop, data appliances or an RDBMS.
Figure labels: Operational Data Sources, EDW, Data Marts, Analytic Marts, BI and Analytics
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

22.SAS Big data Strategy – SAS areas Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

23.SAS Big data Strategy – SAS areas Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

24.SAS® Within the HADOOP ECOSYSTEM
Figure labels:
Layers: User Interface, Metadata, Data Access, Data Processing, File System
Users: SAS® User, Next-Gen SAS® User
SAS clients: SAS® Enterprise Guide® (EG), SAS® Enterprise Miner™ (EM), SAS® Visual Analytics (VA), SAS® Data Integration
SAS components: SAS Metadata, Base SAS & SAS/ACCESS® to Hadoop™, In-Memory Data Access, SAS Embedded Process Accelerators, SAS® LASR™ Analytic Server, SAS® High-Performance Analytic Procedures, SAS® In-Memory Statistics for Hadoop
Hadoop-side components: Hive, Pig, Impala, MapReduce, MPI Based, HDFS
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

25.SAS enables the entire lifecycle around HADOOP
Lifecycle stages: Identify / Formulate Problem, Data Preparation, Data Exploration, Transform & Select, Build Model, Validate Model, Deploy Model, Evaluate / Monitor Results
Tools named in the figure:
SAS Visual Analytics
SAS Visual Statistics
SAS In-Memory Statistics for Hadoop
SAS High Performance Analytics offerings, supported by relevant clients like SAS Enterprise Miner, SAS/STAT etc.
Decision Manager
SAS Scoring Accelerator for Hadoop
SAS Code Accelerator for Hadoop
Note on two of the stages: done using either the Data Preparation, Data Exploration or Build Model tools
Source: Deepak Ramanathan (2014), SAS Modernization architectures - Big Data Analytics

26.Data Mining Process

27.Data Mining Process
A manifestation of best practices
A systematic way to conduct DM projects
Different groups have different versions
Most common standard processes:
CRISP-DM (Cross-Industry Standard Process for Data Mining)
SEMMA (Sample, Explore, Modify, Model, and Assess)
KDD (Knowledge Discovery in Databases)
Source: Turban et al. (2011), Decision Support and Business Intelligence Systems
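The five SEMMA steps named above (Sample, Explore, Modify, Model, Assess) can be sketched as a small pipeline. This is a toy illustration, not SAS Enterprise Miner's implementation: the function names, the (x, y) record format, the choice of a least-squares line as the model, and mean absolute error as the assessment are all assumptions made for the example.

```python
import random
import statistics


def sample(records, fraction, seed=0):
    """Sample: draw a reproducible random subset of the records."""
    rng = random.Random(seed)
    k = max(2, int(len(records) * fraction))
    return rng.sample(records, min(k, len(records)))


def explore(records):
    """Explore: summarize size and number of records with missing values."""
    missing = sum(1 for x, y in records if x is None or y is None)
    return {"n": len(records), "missing": missing}


def modify(records):
    """Modify: drop records that have a missing x or y."""
    return [(x, y) for x, y in records if x is not None and y is not None]


def model(records):
    """Model: fit y = a * x + b by ordinary least squares."""
    mean_x = statistics.mean([x for x, _ in records])
    mean_y = statistics.mean([y for _, y in records])
    sxx = sum((x - mean_x) ** 2 for x, _ in records)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def assess(params, records):
    """Assess: mean absolute error of the fitted line on the given records."""
    a, b = params
    return statistics.mean([abs(y - (a * x + b)) for x, y in records])


def run_semma(records, fraction=0.5):
    """Chain the five steps; assessment uses all cleaned records."""
    subset = sample(records, fraction)
    summary = explore(subset)
    params = model(modify(subset))
    return {"explore": summary, "params": params,
            "mae": assess(params, modify(records))}


if __name__ == "__main__":
    # Made-up data: y = 2x + 1 with two records deliberately damaged.
    data = [(x, 2 * x + 1) for x in range(20)]
    data[3] = (3, None)
    data[7] = (None, 15)
    print(run_semma(data))
```

The sketch keeps each step as its own function so that the order of the acronym stays visible in run_semma; a real project would iterate between the steps rather than run them once.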

28.Data Mining Process (SOP of DM)
What main methodology are you using for your analytics, data mining, or data science projects?
Source: http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html

29.Data Mining Process Source: http://www.kdnuggets.com/polls/2014/analytics-data-mining-data-science-methodology.html