Apache NiFi - Meetup

Introduction to building DataFlows with Apache NiFi. Andrew Psaltis. HDF / IoT / Cybersecurity Architect. apsaltis@hortonworks.com. @itmdata.

1.Introduction to building DataFlows with Apache NiFi. Andrew Psaltis, HDF / IoT / Cybersecurity Architect. apsaltis@hortonworks.com, @itmdata, https://www.linkedin.com/in/andrewpsaltis

2.Hortonworks Company Profile: ONLY 100% open source Apache Hadoop data platform. Founded in 2011. 1st Hadoop provider to go public: IPO 4Q14 (NASDAQ: HDP). 800+ subscription customers. ~1000 employees across 16 countries. 1,600+ technology partners.

3.Participating with a Growing and Thriving Ecosystem: 1600+ partners, 3000+ members, 15,000+ weekly visitors.

4.Key Highlights: Doubled the customer base in 2015. Customer growth in 2015: 800+.

5.Hortonworks Momentum in Every Industry: Financial Services customers: 55 of the top 100. Retail customers: 75 of the top 100. Telco customers: 8 of the top 9 in North America. Automotive customers: 8 of the world's top 20. Watch and read about our customers at: www.hortonworks.com/customers

6.INTERNET OF ANYTHING, AGE OF DATA: Open source is the norm, and Apache is the center of gravity. Founded: 2011. IPO: 2014.

7.Simplistic View of Enterprise Data Flow: [Diagram labels: Data Flow; Acquire Data; Store Data; Process and Analyze Data]

8.Realistic View of Enterprise Data Flow: different organizations/business units across different geographic locations…

9.Realistic View of Enterprise Data Flow: interacting with different business partners and customers.

10.Apache NiFi: Created to address the challenges of global enterprise dataflow. Key features: Visual Command and Control; Data Lineage (Provenance); Data Prioritization; Data Buffering/Back-Pressure; Control Latency vs. Throughput; Secure Control Plane / Data Plane; Scale Out Clustering; Extensibility.

11.Apache NiFi: What is Apache NiFi used for? Reliable and secure transfer of data between systems; delivery of data from sources to analytic platforms; enrichment and preparation of data (conversion between formats, extraction/parsing, routing decisions). What is Apache NiFi NOT used for? Distributed computation; complex event processing; joins / complex rolling window operations.

12.Apache NiFi Deep Dive

13.Terminology: FlowFile: unit of data moving through the system; content + attributes (key/value pairs). Processor: performs the work, can access FlowFiles. Connection: links between processors; queues that can be dynamically prioritized. Process Group: set of processors and their connections; receives data via input ports, sends data via output ports.
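
The terminology above can be pictured with a toy model. This is a minimal Python sketch, not NiFi code (NiFi itself is written in Java); the names `FlowFile`, `Connection`, and `uppercase_processor` are hypothetical and chosen only to mirror the terms on the slide.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy stand-in for a FlowFile: content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Toy stand-in for a connection: a queue linking two processors."""

    def __init__(self):
        self._queue = deque()

    def put(self, flowfile):
        self._queue.append(flowfile)

    def poll(self):
        # Return the next queued FlowFile, or None when the queue is empty.
        return self._queue.popleft() if self._queue else None


def uppercase_processor(incoming, outgoing):
    """Hypothetical processor: takes one FlowFile, transforms it, passes it on."""
    flowfile = incoming.poll()
    if flowfile is None:
        return False  # nothing to do on this run
    result = FlowFile(flowfile.content.upper(),
                      {**flowfile.attributes, "transformed": "true"})
    outgoing.put(result)
    return True


source_to_proc = Connection()
proc_to_sink = Connection()
source_to_proc.put(FlowFile(b"hello", {"filename": "a.txt"}))
uppercase_processor(source_to_proc, proc_to_sink)
out = proc_to_sink.poll()
print(out.content, out.attributes)
```

In this picture a Process Group would simply be a named collection of such processors and connections, with designated input and output connections acting as its ports.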

14.Visual Command & Control: Drag and drop processors to build a flow. Start, stop, and configure components in real time. View errors and corresponding error messages. View statistics and health of the data flow. Create templates of common processors & connections.

15.Provenance/Lineage: Tracks data at each point as it flows through the system. Records, indexes, and makes events available for display. Handles fan-in/fan-out, i.e. merging and splitting data. View attributes and content at given points in time.
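
One way to picture how lineage survives fan-out and fan-in is an event log that records parent and child identifiers for every split and merge. The Python sketch below is an illustration under that assumption, not NiFi's provenance repository; `ProvenanceLog`, its methods, and the event labels are hypothetical names used for the example.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_type: str   # illustrative labels: "RECEIVE", "FORK", "JOIN"
    parents: tuple    # ids of the FlowFiles that went in
    children: tuple   # ids of the FlowFiles that came out


class ProvenanceLog:
    """Toy event log: every split or merge records parent and child ids."""

    def __init__(self):
        self.events = []
        self._ids = itertools.count(1)

    def receive(self):
        fid = next(self._ids)
        self.events.append(Event("RECEIVE", (), (fid,)))
        return fid

    def fork(self, parent, n):
        # Fan-out: one FlowFile is split into n children.
        children = tuple(next(self._ids) for _ in range(n))
        self.events.append(Event("FORK", (parent,), children))
        return children

    def join(self, parents):
        # Fan-in: several FlowFiles are merged into one child.
        child = next(self._ids)
        self.events.append(Event("JOIN", tuple(parents), (child,)))
        return child

    def lineage(self, fid):
        """Return the ids of every ancestor of fid by walking events backwards."""
        ancestors = set()
        frontier = [fid]
        while frontier:
            current = frontier.pop()
            for event in self.events:
                if current in event.children:
                    for parent in event.parents:
                        if parent not in ancestors:
                            ancestors.add(parent)
                            frontier.append(parent)
        return ancestors


log = ProvenanceLog()
original = log.receive()          # id 1
parts = log.fork(original, 2)     # ids 2 and 3
merged = log.join(parts)          # id 4
print(sorted(log.lineage(merged)))  # [1, 2, 3]
```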

16.Prioritization: Configure a prioritizer per connection. Determine what is important for your data: time based, arrival order, importance of a data set. Funnel many connections down to a single connection to prioritize across data sets. Develop your own prioritizer if needed.
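
The idea of a per-connection prioritizer can be sketched as a queue whose ordering is delegated to a pluggable function. This is a hedged Python illustration, not the NiFi prioritizer interface; `PrioritizedConnection`, `by_priority_attribute`, and the `priority` attribute name are assumptions made for the example, and FlowFiles are represented as plain dictionaries.

```python
import heapq
import itertools


class PrioritizedConnection:
    """Toy connection whose dequeue order is decided by a pluggable prioritizer."""

    def __init__(self, prioritizer):
        self._prioritizer = prioritizer   # callable: flowfile -> sort key (lower first)
        self._heap = []
        self._seq = itertools.count()     # tie-breaker that preserves arrival order

    def put(self, flowfile):
        key = self._prioritizer(flowfile)
        heapq.heappush(self._heap, (key, next(self._seq), flowfile))

    def poll(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


def by_priority_attribute(flowfile):
    # Hypothetical prioritizer: a lower "priority" attribute is served first.
    return int(flowfile["attributes"].get("priority", 9))


conn = PrioritizedConnection(by_priority_attribute)
conn.put({"id": "a", "attributes": {"priority": "5"}})
conn.put({"id": "b", "attributes": {"priority": "1"}})
conn.put({"id": "c", "attributes": {"priority": "5"}})
print([conn.poll()["id"] for _ in range(3)])  # ['b', 'a', 'c']
```

Funnelling several upstream connections into one such queue is what lets a single prioritizer rank FlowFiles across data sets; swapping the function passed to the constructor corresponds to developing your own prioritizer.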

17.Back-Pressure: Configure back-pressure per connection, based on number of FlowFiles or total size of FlowFiles. The upstream processor is no longer scheduled to run until the queue is below the threshold.
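
The back-pressure rule above can be sketched as a scheduler check against two thresholds. This is a simplified Python model, not NiFi's implementation; `BoundedConnection`, `schedule_upstream`, and the parameter names `max_count` and `max_bytes` are hypothetical.

```python
from collections import deque


class BoundedConnection:
    """Toy connection with a FlowFile-count threshold and a total-size threshold."""

    def __init__(self, max_count, max_bytes):
        self.max_count = max_count
        self.max_bytes = max_bytes
        self._queue = deque()
        self._bytes = 0

    def is_full(self):
        # Back-pressure applies once either threshold is reached.
        return len(self._queue) >= self.max_count or self._bytes >= self.max_bytes

    def put(self, content):
        self._queue.append(content)
        self._bytes += len(content)

    def poll(self):
        if not self._queue:
            return None
        content = self._queue.popleft()
        self._bytes -= len(content)
        return content


def schedule_upstream(produce, connection):
    """Run the upstream processor once, unless back-pressure is in effect."""
    if connection.is_full():
        return False          # skipped: downstream queue is at its threshold
    connection.put(produce())
    return True


conn = BoundedConnection(max_count=2, max_bytes=1024)
results = [schedule_upstream(lambda: b"data", conn) for _ in range(3)]
print(results)                # [True, True, False]
conn.poll()                   # downstream consumes one FlowFile
print(schedule_upstream(lambda: b"data", conn))  # True
```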

18.Latency vs. Throughput: Choose between lower latency or higher throughput on each processor. Higher throughput allows the framework to batch together all operations for the selected amount of time for improved performance. The processor developer determines whether to support this by using the @SupportsBatching annotation.
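
The trade-off can be illustrated with a small simulation: the longer the batching window, the fewer commits are made, but the longer each result waits before it is released. The Python sketch below uses simulated integer milliseconds and is only a model of the idea, not how NiFi implements batching; `run_processor`, `run_duration_ms`, and `cost_per_item_ms` are hypothetical names.

```python
def run_processor(items, run_duration_ms, cost_per_item_ms=1):
    """Process items and commit once per batching window.

    run_duration_ms=0 commits after every item (lowest latency);
    larger values group more work into each commit (higher throughput).
    Time is simulated: each item costs cost_per_item_ms.
    """
    commits = []
    pending = []
    now = 0
    window_start = 0
    for item in items:
        now += cost_per_item_ms            # simulate the work for one item
        pending.append(item.upper())
        if now - window_start >= run_duration_ms:
            commits.append(pending)        # one commit for the whole window
            pending = []
            window_start = now
    if pending:
        commits.append(pending)            # flush whatever is left at the end
    return commits


print(len(run_processor(list("abcdef"), run_duration_ms=0)))  # 6
print(run_processor(list("abcdef"), run_duration_ms=3))
# [['A', 'B', 'C'], ['D', 'E', 'F']]
```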

19.Security: Control Plane: pluggable authentication (2-Way SSL, LDAP, Kerberos); pluggable authorization (file-based authority provider out of the box; multiple roles to define access controls); audit trail of all user actions. Data Plane: optional 2-Way SSL between cluster nodes; optional 2-Way SSL on Site-To-Site connections (NiFi-to-NiFi); encryption/decryption of data through processors; provenance for audit trail of data.

20.Extensibility: Built from the ground up with extensions in mind. Service-loader pattern for… Processors, Controller Services, Reporting Tasks, Prioritizers. Extensions packaged as NiFi Archives (NARs): deploy to the NiFi lib directory and restart; provides ClassLoader isolation; same model as standard components.

21.Rapid Ecosystem Adoption: 130+ Processors: HTTP, Syslog, Email, HTML, Image, Hash, Encrypt, Extract, Tail, Merge, Evaluate, Duplicate, Execute, Scan, GeoEnrich, Replace, Convert, Split, Translate, HL7, FTP, UDP, XML, SFTP, Route Content, Route Context, Route Text, Control Rate, Distribute Load, AMQP (NEW).

22.Architecture: [Diagram] Master: the NiFi Cluster Manager (NCM) runs in a JVM on an OS/host, with a web server and a request replicator. Slaves: NiFi nodes, each running in a JVM on an OS/host with a web server and a flow controller (Processor 1 … Extension N), backed on local storage by a FlowFile Repository, a Content Repository, and a Provenance Repository.

23.Apache NiFi Site-To-Site

24.Site-To-Site: Direct communication between two NiFi instances. Push to an Input Port on the receiver, or pull from an Output Port on the source. Communicate between clusters, standalone instances, or both. Handles load balancing and reliable delivery. Secure connections using certificates (optional).

25.Site-To-Site Push: Source connects a Remote Process Group (RPG) to an Input Port on the destination. Site-To-Site takes care of load balancing across the nodes in the cluster. [Diagram: a standalone NiFi with an RPG pushes to Input Ports on Node 1 and Node 2 of a cluster managed by the NCM.]
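
The slide does not say which load-balancing policy Site-To-Site applies, so the following Python sketch assumes one plausible policy purely for illustration: send each FlowFile to the node that currently reports the fewest queued FlowFiles. The names `pick_node` and `push`, and the shape of the `queued_counts` mapping, are hypothetical.

```python
def pick_node(queued_counts):
    """Pick the least-loaded node; ties are broken by node name for determinism."""
    return min(queued_counts, key=lambda node: (queued_counts[node], node))


def push(flowfiles, queued_counts):
    """Assign each FlowFile to a destination node under the assumed policy."""
    counts = dict(queued_counts)       # work on a copy of the reported loads
    assignment = {}
    for flowfile in flowfiles:
        node = pick_node(counts)
        assignment.setdefault(node, []).append(flowfile)
        counts[node] += 1              # the chosen node now holds one more FlowFile
    return assignment


print(push(["a", "b", "c"], {"node1": 2, "node2": 0}))
# {'node2': ['a', 'b'], 'node1': ['c']}
```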

26.Site-To-Site Pull: Destination connects a Remote Process Group (RPG) to an Output Port on the source. If the source were a cluster, each node would pull from each node in the cluster. [Diagram: Node 1 and Node 2 of a cluster managed by the NCM each use an RPG to pull from the Output Port of a standalone NiFi.]

27.Site-To-Site Client: Code for Site-To-Site broken out into a reusable module (https://github.com/apache/nifi/tree/master/nifi-commons/nifi-site-to-site-client). Foundation for integration with stream processing platforms. [Diagram: a Java program uses the Site-To-Site Client to reach Output Ports on Node 1 and Node 2 of a cluster managed by the NCM.]

28.Current Stream Processing Integrations: Spark Streaming: NiFi Spark Receiver (https://github.com/apache/nifi/tree/master/nifi-external/nifi-spark-receiver). Storm: NiFi Spout & Bolt (https://github.com/apache/nifi/tree/master/nifi-external/nifi-storm-spout). Flink: NiFi Source & Sink (https://github.com/apache/flink/tree/master/flink-streaming-connectors/flink-connector-nifi). Apex: NiFi Input Operators & Output Operators (https://github.com/apache/incubator-apex-malhar/tree/master/contrib/src/main/java/com/datatorrent/contrib/nifi).

29.Bi-Directional Data Flows