Module 5: Advanced Analytics - Technology and Tools (cse.sc.edu)


1. Module 5: Advanced Analytics - Technology and Tools

2. Module Objectives
Upon completion of this module, you should be able to:
- Perform analytics on unstructured data using the MapReduce programming paradigm
- Use Hadoop, HDFS, Hive, Pig, and other products in the Hadoop ecosystem for unstructured data analytics
- Effectively use advanced SQL functions and Greenplum extensions for in-database analytics
- Use MADlib to solve analytics problems in-database

3. Lesson 1: Analytics for Unstructured Data - MapReduce and Hadoop
During this lesson the following topics are covered:
- MapReduce and Hadoop
- HDFS - the Hadoop Distributed File System

4. Putting the Data Analytics Lifecycle into Practice
From Module 4, the lifecycle:
- Phase 1: Discovery
- Phase 2: Data Preparation
- Phase 3: Model Planning
- Phase 4: Model Building
- Phase 5: Results & Key Findings
- Phase 6: Operationalize
You have "big data" - how can you make it suitable for analysis, and obtain results and key findings in a timely manner? Combine Phases 1 and 2.

5. Why Hadoop? Answer: big datasets!

6. Why MapReduce?
"In pioneer days, they used oxen for heavy pulling. When one ox couldn't budge a log, they didn't try to grow a larger ox... We shouldn't be trying to grow bigger computers, but to add more systems of computers." - Grace Hopper
The MapReduce paradigm helps you add more oxen. By definition, big data is too large to handle by conventional means; sooner or later, you just can't scale up anymore.

7. What is MapReduce?
A parallel programming model suitable for big data processing:
- Split data into distributable chunks ("shards")
- Define the steps to process those chunks
- Run that process in parallel on the chunks
- Scale by adding more machines to process chunks
- Leverage commodity hardware to tackle big jobs
The foundation for Hadoop: MapReduce is a parallel programming model; Hadoop is a concrete platform that implements it.

9. The Map Part of MapReduce
Transform (Map) input values to output values: <k1,v1> -> <k2,v2>
- Input - key/value pairs. For instance, key = line number, value = text string
- Map function - steps to transform input pairs to output pairs. For example, count the different words in the input
- Output - key/value pairs. For example, key = <word>, value = <count>
Map output is the input to Reduce.
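The transform above can be sketched in plain Python (no Hadoop involved; `map_words` is an illustrative name, not a Hadoop API):

```python
def map_words(key, value):
    """Map <k1,v1> -> list of <k2,v2>: here (line number, text) -> (word, 1) pairs."""
    # The key (line number) is not needed for word count; only the text matters.
    return [(word, 1) for word in value.split()]
```

One input pair can produce many output pairs, which is exactly what makes the word-count Reduce step necessary.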

10. The Reduce Part of MapReduce
Merge (Reduce) values from the Map phase. Reduce is optional: sometimes all the work is done in the Mapper.
- Input - the values for a given key from all the Mappers
- Reduce function - steps to combine the values (sum? count? print?)
- Output - print the values? Load them into a database? Send them to the next MapReduce job?
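A matching plain-Python sketch of the Reduce step (again, illustrative names rather than the Hadoop API):

```python
def reduce_counts(key, values):
    """Reduce: combine all values emitted for one key -- here, by summing counts."""
    return (key, sum(values))
```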

11. Motivating Example: Word Count
This is the "Hello World" of MapReduce. Distribute the text of millions of documents over hundreds of machines.
- Map: Mappers can be word-specific. They run through the stacks and shout "One!" every time they see the word "beach", emitting (Beach, 1), (Beach, 1), (Beach, 1), (Beach, 1), (Beach, 1)
- Transition: intermediate partial totals such as (Beach, 2), (Beach, 1), (Beach, 2) are formed
- Reduce (finalize): Reducers listen to all the Mappers and total the counts for each word: (Beach, 5)
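The whole word-count flow, including the shuffle that Hadoop performs between Map and Reduce, can be simulated in a few lines of plain Python (a sketch, not the Hadoop API):

```python
from collections import defaultdict

def word_count(documents):
    # Map: every document emits a (word, 1) pair per word.
    mapped = [(word, 1) for doc in documents for word in doc.split()]
    # Shuffle: group the emitted values by key (Hadoop does this between phases).
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce: total the counts for each word.
    return {key: sum(values) for key, values in groups.items()}
```

In real Hadoop the three phases run on different machines; here they run in sequence, but the data flow is the same.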

13. Social Triangle: First Directed Edge
Mapper1 maps two regular-expression searches over each message:
  To: Michael, Dan, Lori, Susan
  From: Walt
and emits the outbound directed edge of the social graph:
  <Key, Value> = <Walt, [Michael, Dan, Lori, Susan]>
Reducer1 gets the output from the mappers with different values:
  <Key, Value> = <Walt, [Michael, Dan, Lori, Susan]>
  <Key, Value> = <Walt, [Lori, Susan, Jeff, Ken]>
and unions the values for the first directed edge:
  <Key, Value> = <Walt, [Dan, Jeff, Ken, Lori, Michael, Susan]>
The data is reduced by about a third.
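A plain-Python sketch of Mapper1 and Reducer1, assuming a simple `From:`/`To:` header format for the messages (the message format and function names are assumptions, not from the course materials):

```python
import re

def map_outbound(message):
    """Mapper1 sketch: two regex searches extract the sender and recipients,
    then emit the outbound edge <sender, [recipients]>."""
    sender = re.search(r'From:\s*(\w+)', message).group(1)
    to_line = re.search(r'To:\s*([^\n]+)', message).group(1)
    recipients = re.findall(r'\w+', to_line)
    return (sender, recipients)

def reduce_outbound(sender, recipient_lists):
    """Reducer1 sketch: union the recipient lists emitted for one sender."""
    return (sender, sorted(set(r for lst in recipient_lists for r in lst)))
```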

14. Social Triangle: Second Directed Edge
Mapper2 reverses the previous Map. From:
  To: Michael, Dan, Lori, Susan
  From: Walt
it emits the inbound directed edges of the social graph:
  <Key, Value> = <Susan, Walt>; <Lori, Walt>; <Dan, Walt>; etc.
Reducer2 gets the output from the mappers with different values:
  <Key, Value> = <Susan, Walt>
  <Key, Value> = <Susan, Jeff>
and unions the values for the second directed edge:
  <Key, Value> = <Susan, [Jeff, Ken, Walt]>
The data is again reduced by about a third.
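A matching sketch of Mapper2 and Reducer2 (names and signatures are illustrative):

```python
def map_inbound(sender, recipients):
    """Mapper2 sketch: reverse each outbound edge, emitting <recipient, sender> pairs."""
    return [(recipient, sender) for recipient in recipients]

def reduce_inbound(recipient, senders):
    """Reducer2 sketch: union all senders seen for one recipient."""
    return (recipient, sorted(set(senders)))
```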

15. Social Triangle: Third Directed Edge
Mapper3 joins the [inbound] and [outbound] lists by key:
  Walt, [Jeff, Ken, Lori, Susan], [Jeff, Lori, Stanley, Walt]
and emits <Person, Person> pairs with the level of association:
  <Key, Value> = <Walt:Susan, reciprocal>; <Walt:Lori, directed>; etc.
Reducer3 unions the output of the mappers and presents the rules:
  <Key, Value> = <Walt:Susan, reciprocal>
  <Key, Value> = <Walt:Lori, directed>
The third reducer can shape the data any way that serves the business objective.
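A sketch of the join-and-classify step, under the assumption that "reciprocal" means a contact appears in both the outbound and inbound lists while "directed" means it appears in only one (the slide's labels are not defined precisely, so this interpretation is an assumption):

```python
def classify_edges(person, outbound, inbound):
    """Mapper3 sketch: join one person's outbound and inbound contact lists and
    label each contact 'reciprocal' (in both lists) or 'directed' (in one list)."""
    labels = {}
    for contact in set(outbound) | set(inbound):
        both = contact in outbound and contact in inbound
        labels['%s:%s' % (person, contact)] = 'reciprocal' if both else 'directed'
    return labels
```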

16. Natural Language Processing
Unstructured text mining means extracting "features" from a document. Features are structured meta-data representing the document. The goal is to "vectorize" the documents - but getting to the underlying data can be very difficult.
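A minimal bag-of-words sketch of what "vectorize" means here (the vocabulary is a hypothetical feature list; real systems would also strip punctuation, stem words, weight terms, and so on):

```python
def vectorize(document, vocabulary):
    """Turn free text into a fixed-length count vector, one slot per vocabulary term."""
    tokens = document.lower().split()
    return [tokens.count(term) for term in vocabulary]
```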

17. Example: UFOs Attack
July 15th, 2010. Raytown, Missouri:
"When I fist noticed it, I wanted to freak out. There it was an object floating in on a direct path, It didn't move side to side or volley up and down. It moved as if though it had a mission or purpose. I was nervous, and scared, So afraid in fact that I could feel my knees buckling. I guess because I didn't know what to expect and I wanted to act non aggressive. I though that I was either going to be taken, blasted into nothing, or..."
Q: What is the witness describing? A: An encounter with a UFO.
Q: What is the emotional state of the witness? A: Frightened, ready to flee.
Source: http://www.infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada

18. Example: UFOs Attack (continued)
The same account, annotated with the problems a text-mining system faces: a typo ("fist" for "first"), a machine error, a turn of phrase ("freak out"), ambiguous meaning, and the keyword "UFO" missing entirely. Strangely, the computer finds this account unreliable! Yet if we really are on the cusp of a major alien invasion, eyewitness testimony is the key to our survival as a species.

19. Example: UFOs Attack - Computer-Aided Text Mining
Investigators need to:
- Search for keywords and phrases - but the topic may be very complicated, or keywords may be misspelled within the document
- Manage document meta-data such as time, location, and author - identifying this meta-data early may be key to later retrieval, and the document may be amenable to structure
- Understand content - via sentiment analysis, custom dictionaries, natural language processing, clustering, classification, and good ol' domain expertise

20. Machine Learning
After vectorization, advanced techniques can be applied: clustering, classification, decision trees, and scoring. Once models have been built, use them to automatically categorize incoming documents.
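A minimal sketch of the scoring step, assuming a clustering model that has been reduced to one centroid vector per category (the model, labels, and vectors here are hypothetical):

```python
def score(vector, centroids):
    """Assign a vectorized document to the nearest cluster centroid."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(vector, centroids[label]))
```

Each incoming document is vectorized and then routed to the category whose centroid it sits closest to.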

21. What is Hadoop?
People use "Hadoop" to mean one of four things:
- The MapReduce paradigm
- HDFS: the Hadoop Distributed File System
- Java classes for HDFS types and MapReduce job management
- Massive unstructured data storage on commodity hardware
Some of these are ideas; others are the actual Hadoop software. With Hadoop, you can do MapReduce jobs quickly and efficiently.

22. What Do We Mean by Hadoop? Two Main Components
A framework for performing big data analytics - an implementation of the MapReduce paradigm. Hadoop glues the storage and analytics together and provides reliability, scalability, and management.
- Storage (big data): HDFS, the Hadoop Distributed File System - a reliable, redundant, distributed file system optimized for large files
- MapReduce (analytics): a programming model for processing sets of data - mapping inputs to outputs and reducing the output of multiple Mappers to one (or a few) answer(s)

23. Hadoop and HDFS

24. Hadoop Operational Modes
- Java MapReduce mode: write Mapper, Combiner, and Reducer functions in Java using the Hadoop Java APIs; records are read one at a time
- Streaming mode: uses *nix pipes and standard input and output streams, so any language works (Python, Ruby, C, Perl, Tcl/Tk, etc.); input can be a line at a time or a stream at a time

25. Hadoop Classes in Java

  public static class MapClass extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer itr = new StringTokenizer(line);
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
    }
  }

Notes:
- The Mapper takes four type arguments (LongWritable, Text, Text, IntWritable) and is defined as a Java class with Hadoop types
- It emits pairs via the output.collect(,) function, following the standard Java coding paradigm
- Hadoop defines a set of classes that extend the scalar classes in Java (examples: IntWritable, Text) and offers a number of base classes to provide a framework for jobs
- This Mapper incorporates the MapReduceBase, Reporter, and OutputCollector classes explicitly

26. Example: Hadoop Streaming Mode
Script to invoke Hadoop (the -input and -output paths are relative to HDFS):

  hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
      -input input/ \
      -output output \
      -mapper mapper.py \
      -reducer reducer.py \
      -file scripts/mapper.py \
      -file scripts/reducer.py

27. Simple Example of a Mapper: mapper.py

  import sys

  for line in sys.stdin:               # input comes from STDIN
      line = line.strip()              # strip leading/trailing whitespace
      words = line.split()             # split line based on whitespace
      for word in words:               # for each word in the collection
          print '%s\t%s' % (word, 1)   # write word and a count of "1", tab delimited

28. Simple Example of a Reducer: reducer.py

  import sys

  cur_word = None
  cur_count = 0

  for line in sys.stdin:
      line = line.strip()
      word, count = line.split('\t', 1)
      try:
          count = int(count)
      except ValueError:
          continue                     # skip lines whose count is not a number
      if cur_word == word:
          cur_count += count
      else:
          if cur_word:
              print '%s\t%s' % (cur_word, cur_count)
          cur_count, cur_word = (count, word)

  if cur_word == word:                 # flush the count for the last word
      print '%s\t%s' % (cur_word, cur_count)

29. Putting It All Together: MapReduce and HDFS
Diagram: a client/developer loads a large data set (log files, sensor data) into the Hadoop Distributed File System (HDFS) and submits a job to the Job Tracker; the Job Tracker coordinates multiple Task Trackers, each running Map and Reduce jobs over the data in HDFS.