NiFi Cluster - Amazon Simple Storage Service (S3)

1. Data Integration with Apache NiFi. Marco Garcia, CTO and Founder, Cetax, TutorPro

2. Introduction. With more than 20 years of experience in IT, 18 of them exclusively in Business Intelligence, Data Warehouse, and Big Data, Marco Garcia is certified by Kimball University in the USA, where he was taught in person by Ralph Kimball, one of the leading Data Warehouse gurus. First Hortonworks Certified Instructor in LATAM. Data Architect and Instructor at Cetax Consultoria.

3. Data Flows

4. Where Do We Find Data Flow?
- Remote sensor delivery (Internet of Things, IoT)
- Intra-site / inter-site / global distribution (Enterprise)
- Ingest for feeding analytics (Big Data)
- Data processing (Simple Event Processing)

5. Simplistic View of Enterprise Data Flow
Diagram labels: Acquire Data; Store Data; Process and Analyze Data; "The Data Flow Thing".

6. Basics of Connecting Systems
For every connection between a producer (P1) and a consumer (C1), these must agree: protocol, format, schema, priority, size of event, frequency of event, authorization access, relevance.
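
Illustrative sketch (not from the slides): one way to picture the agreement between a producer and a consumer is as a small contract that both sides declare and that is compared before connecting. The `Contract` class, its four fields, and the sample values are hypothetical and cover only part of the list above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Contract:
    """Hypothetical subset of what a producer and a consumer must agree on."""
    protocol: str
    data_format: str
    schema: str
    max_event_bytes: int


def mismatches(producer: Contract, consumer: Contract) -> list[str]:
    """Return the names of the contract fields on which the two sides disagree."""
    return [
        f.name
        for f in fields(Contract)
        if getattr(producer, f.name) != getattr(consumer, f.name)
    ]


# Sample values are made up for illustration.
p1 = Contract("https", "json", "sensor-v2", 1_000_000)
c1 = Contract("https", "avro", "sensor-v2", 1_000_000)
disagreements = mismatches(p1, c1)  # the two sides differ only on data_format
```

Every pair of systems connected directly needs such an agreement, which is the point-to-point burden that a data flow tool is meant to absorb.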

7. IoT Is Driving New Requirements

8. IoAT Data Grows Faster Than We Consume It
Much of the new data exists in flight between systems and devices as part of the Internet of Anything (IoAT).
The opportunity: unlock transformational business value from full-fidelity data and analytics for all data.
Chart labels: NEW vs. TRADITIONAL data; ability to consume data.
Sources named on the chart: ERP, CRM, SCM (traditional data sources); files and emails; server logs; geolocation; clickstream; web and social; sensors and machines.

9. Internet of Anything Is Driving New Requirements
- Need trusted insights from data, from the very edge to the data lake, in real time and with full fidelity.
- Data is generated by sensors, machines, geolocation devices, logs, clickstreams, social feeds, etc.
- Modern applications need access to both data in motion and data at rest.
- IoAT data flows are multi-directional and point-to-point. This is very different from existing ETL, data movement, and streaming technologies, which are generally one-directional.
- The perimeter is outside the data center and can be very jagged. This "jagged edge" creates new opportunities for security, data protection, data governance, and provenance.

10. Meeting IoAT Edge Requirements
Gather, deliver, prioritize. Track from the edge through to the data center.
- Small footprints: operate with very little power.
- Limited bandwidth: can create high latency.
- Data availability: exceeds transmission bandwidth; recoverability.
- Data must be secured throughout its journey: both the data plane and the control plane.

11. The Need for Data Provenance
- For operators: traceability and lineage; recovery and replay.
- For compliance: audit trail; remediation.
- For business: value sources; value IT investment.
Diagram labels: BEGIN, END, LINEAGE.

12. The Need for Fine-grained Security and Compliance
It is not enough to say you have encrypted communications:
- Enterprise authorization services: entitlements change often.
- People and systems with different roles require different access levels.
- Tagged/classified data.

13. Real-time Data Flow
It’s not just how quickly you move data; it’s about how quickly you can change behavior and seize new opportunities.

14. HDF Powered by Apache NiFi Addresses Modern Data Flow Challenges
- Collect: Bring Together (logs, files, feeds, sensors). Aggregate all IoAT data from sensors, geolocation devices, machines, logs, files, and feeds via a highly secure, lightweight agent.
- Conduct: Mediate the Data Flow (deliver, secure, govern, audit). Mediate point-to-point and bi-directional data flows, delivering data reliably to real-time applications and storage platforms such as HDP.
- Curate: Gain Insights (parse, filter, transform, fork, clone). Parse, filter, join, transform, fork, and clone data in motion to empower analytics and perishable insights.

15. Apache NiFi Manages Data in Motion
Diagram labels: Sources; Regional Infrastructure; Core Infrastructure.
Characteristics contrasted on the diagram: constrained, high-latency, localized context versus hybrid (cloud / on-premises), low-latency, global context.
Apache NiFi, Apache MiNiFi, Apache Kafka, and Apache Storm are trademarks of the Apache Software Foundation.

17. A Brief History
- 2006: NiagaraFiles (NiFi) is first conceived at the National Security Agency (NSA).
- November 2014: NiFi is donated to the Apache Software Foundation (ASF) through the NSA’s Technology Transfer Program and enters the ASF incubator.
- July 2015: NiFi reaches ASF top-level project status.

18. Designed in Response to Real-World Demands (HDF Powered by Apache NiFi)
- Visual user interface: drag and drop for efficient, agile operations.
- Immediate feedback: start, stop, tune, and replay dataflows in real time.
- Adaptive to volume and bandwidth: any data, big or small.
- Provenance metadata: governance, compliance, and data evaluation.
- Secure data acquisition and transport: fine-grained encryption for controlled data sharing.

19. Apache NiFi
- Powerful and reliable system to process and distribute data.
- Directed graphs of data routing and transformation.
- Web-based user interface for creating, monitoring, and controlling data flows.
- Highly configurable: modify data flow at runtime, dynamically prioritize data.
- Data provenance tracks data through the entire system.
- Easily extensible through development of custom components. [1]

20. NiFi Use Cases
- Ingest logs for cyber security: integrated and secure log collection for real-time data analytics and threat detection.
- Feed data to streaming analytics: accelerate big data ROI by streaming data into analytics systems such as Apache Storm or Apache Spark Streaming.
- Data warehouse offload: convert source data to streaming data and use HDF for data movement before delivering it for ETL processing. This lets ETL processing be offloaded to Hadoop without having to change source systems.
- Move data internally: optimize resource utilization by moving data between data centers or between on-premises infrastructure and cloud infrastructure.
- Capture IoT data: transport disparate and often remote IoT data in real time, despite any limitations in device footprint, power, or connectivity, avoiding data loss.
- Big data ingest: easily and efficiently ingest data into Hadoop.

21. NiFi Architecture

22. Apache NiFi: The Three Key Concepts
- Manage the flow of information.
- Data provenance.
- Secure the control plane and data plane.

23. Apache NiFi: Key Features
- Guaranteed delivery
- Data buffering: backpressure, pressure release
- Prioritized queuing
- Flow-specific QoS: latency vs. throughput, loss tolerance
- Data provenance
- Recovery / recording a rolling log of fine-grained history
- Visual command and control
- Flow templates
- Multi-tenant authorization
- Designed for extension
- Clustering
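
Illustrative sketch (not from the slides, and not NiFi's implementation): data buffering with back pressure and prioritized queuing can be pictured as a bounded priority queue between two processing steps. The class name `PrioritizedConnection` and the object-count threshold are hypothetical. In NiFi itself back pressure works by no longer scheduling the upstream component; here the queue simply refuses new items.

```python
import heapq
import itertools


class PrioritizedConnection:
    """Toy bounded, prioritized queue between two processing steps."""

    def __init__(self, backpressure_threshold: int):
        # Hypothetical limit on the number of queued objects.
        self.backpressure_threshold = backpressure_threshold
        self._heap = []
        # Tie-breaker that keeps FIFO order among items of equal priority.
        self._order = itertools.count()

    def is_full(self) -> bool:
        return len(self._heap) >= self.backpressure_threshold

    def offer(self, item, priority: int) -> bool:
        """Enqueue unless back pressure is active; lower number = higher priority."""
        if self.is_full():
            return False  # the upstream side should pause and retry later
        heapq.heappush(self._heap, (priority, next(self._order), item))
        return True

    def poll(self):
        """Return the highest-priority item, or None when the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


conn = PrioritizedConnection(backpressure_threshold=2)
accepted = [
    conn.offer("bulk log", priority=5),
    conn.offer("alert", priority=1),
    conn.offer("another log", priority=5),  # refused: the queue is at its threshold
]
first_out = conn.poll()  # the alert leaves first despite arriving second
```

Once the consumer drains an item, the queue drops below its threshold and accepts new data again, which is the "pressure release" half of the feature.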

24. Flow-Based Programming (FBP)
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
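
Illustrative sketch (not from the slides): the FBP terms above can be mapped onto a few lines of Python. All names here (`FlowFile`, `Connection`, `uppercase_processor`, `run_flow`) are hypothetical stand-ins for the concepts, not NiFi's Java API, and the round-robin loop is only a placeholder for what a real flow controller does.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Information packet: content bytes plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Buffer between processors (unbounded here to keep the sketch short)."""

    def __init__(self):
        self.queue = deque()


def uppercase_processor(inp: Connection, out: Connection) -> None:
    """Black box: take one FlowFile, transform its content, pass it on."""
    if inp.queue:
        ff = inp.queue.popleft()
        new_attributes = {**ff.attributes, "transformed": "true"}
        out.queue.append(FlowFile(ff.content.upper(), new_attributes))


def run_flow(steps, rounds: int) -> None:
    """Scheduler stand-in: give every processor one turn per round."""
    for _ in range(rounds):
        for processor, inp, out in steps:
            processor(inp, out)


source, sink = Connection(), Connection()
source.queue.append(FlowFile(b"hello", {"filename": "a.txt"}))
run_flow([(uppercase_processor, source, sink)], rounds=1)
result = sink.queue[0]
```

Because processors only see their input and output connections, they can be rearranged or grouped without changing their code, which is what the Process Group row describes.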

25. NiFi Architecture

26. NiFi Architecture

27. Primary Components
NiFi executes within a JVM running on a host operating system. The primary components of NiFi within the JVM are:
- Web Server: hosts NiFi’s HTTP-based command and control API.
- Flow Controller: the brains of the operation. It provides threads for extensions to run on and manages the schedule of when extensions receive resources to execute.
- Extensions: there are various types of NiFi extensions, which are described in other documents. The key point here is that extensions operate and execute within the JVM.

28. Primary Components (cont.)
- FlowFile Repository: where NiFi keeps track of the state of each FlowFile that is presently active in the flow. The default implementation is a persistent write-ahead log located on a specified disk partition.
- Content Repository: where the actual content bytes of a given FlowFile live. The default implementation stores blocks of data in the file system. More than one file system storage location can be specified, so that different physical partitions are engaged and contention on any single volume is reduced.
- Provenance Repository: where all provenance event data is stored. The repository construct is pluggable; the default implementation uses one or more physical disk volumes. Within each location, event data is indexed and searchable.
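
Illustrative sketch (not from the slides, and far simpler than NiFi's actual repository): the idea behind a persistent write-ahead log is that every state change is appended to disk before it is applied in memory, so the state can be rebuilt after a restart by replaying the log. The class `WriteAheadLog`, the JSON record layout, and the sample FlowFile ids are hypothetical.

```python
import json
import os
import tempfile


class WriteAheadLog:
    """Toy write-ahead log: append each change to disk, then apply it in memory."""

    def __init__(self, path: str):
        self.path = path
        self.state = {}  # FlowFile id -> what is known about that FlowFile
        if os.path.exists(path):
            self._replay()

    def _replay(self) -> None:
        # Rebuild the in-memory state from the records already on disk.
        with open(self.path, encoding="utf-8") as log:
            for line in log:
                self._apply(json.loads(line))

    def _apply(self, record: dict) -> None:
        if record["op"] == "put":
            self.state[record["id"]] = record["value"]
        elif record["op"] == "delete":
            self.state.pop(record["id"], None)

    def _append(self, record: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps(record) + "\n")
            log.flush()
            os.fsync(log.fileno())  # make the record durable before applying it
        self._apply(record)

    def put(self, flowfile_id: str, value: dict) -> None:
        self._append({"op": "put", "id": flowfile_id, "value": value})

    def delete(self, flowfile_id: str) -> None:
        self._append({"op": "delete", "id": flowfile_id})


log_path = os.path.join(tempfile.mkdtemp(), "flowfile.wal")
repo = WriteAheadLog(log_path)
repo.put("ff-1", {"queue": "queue-a", "size": 42})
repo.put("ff-2", {"queue": "queue-a", "size": 7})
repo.delete("ff-1")
recovered = WriteAheadLog(log_path)  # simulates a restart: state comes from the log
```

Only the small state records go through the log in this sketch; the content bytes would live elsewhere, mirroring the split between the FlowFile Repository and the Content Repository.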

29. NiFi Cluster
Starting with the NiFi 1.x / HDF 2.x release, a Zero-Master Clustering paradigm is employed.
- NiFi Cluster Coordinator: the node in a NiFi cluster that is responsible for managing the nodes in the cluster. It determines which nodes are allowed in the cluster and provides the most up-to-date flow to newly joining nodes.
- Nodes: each cluster is made up of one or more nodes. The nodes do the actual data processing.
- Primary Node: every cluster has one Primary Node. On this node, it is possible to run "Isolated Processors".
- ZooKeeper Server: used to automatically elect a Primary Node and a Cluster Coordinator.
NiFi clustering is covered in detail in the following lessons.
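
Illustrative sketch (not from the slides): a toy model of the roles above. The class `ToyCluster` and the node names are hypothetical, and the election rule used here (keep the current holder while it is connected, otherwise pick the smallest node name) is only a stand-in for ZooKeeper's election, which works differently.

```python
from typing import Optional


class ToyCluster:
    """Toy model of cluster roles; the election rule is an assumption."""

    def __init__(self):
        self.nodes = set()
        self.coordinator: Optional[str] = None
        self.primary: Optional[str] = None

    def _elect(self) -> None:
        # Re-elect a role only when its current holder is no longer connected.
        for role in ("coordinator", "primary"):
            if getattr(self, role) not in self.nodes:
                setattr(self, role, min(self.nodes) if self.nodes else None)

    def join(self, node: str) -> None:
        self.nodes.add(node)
        self._elect()

    def leave(self, node: str) -> None:
        self.nodes.discard(node)
        self._elect()

    def runs_on(self, node: str, isolated: bool) -> bool:
        """Isolated processors run only on the Primary Node; others run on every node."""
        return node in self.nodes and (not isolated or node == self.primary)


cluster = ToyCluster()
for name in ("node-1", "node-2", "node-3"):
    cluster.join(name)
primary_before = cluster.primary
isolated_on_node2_before = cluster.runs_on("node-2", isolated=True)
cluster.leave("node-1")  # losing the role holder triggers a new election
primary_after = cluster.primary
isolated_on_node2_after = cluster.runs_on("node-2", isolated=True)
```

The point of the sketch is that no node is permanently special: when the node holding a role drops out, another connected node takes the role over and isolated processors move with it.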