Easily Build a Smart Pulsar Stream Processor - Simon Crosby

1. Continuous Stream Processing with Pulsar and Swim • Simon Crosby, CTO, Swim • swim.ai

2. SwimOS is an Apache 2.0 licensed platform that makes it easy to build applications that deliver continuous intelligence from streaming data, at scale • swimos.org

3. [Figure panels: PM2.5 Pollution; NOx Pollution]

4. Streaming Platforms • Examples: Apache Pulsar, Apache Kafka, Apache Beam, CNCF NATS, Amazon Kinesis, Google pub/sub, Azure Enterprise Data Bus, Salesforce Kafka, Confluent Cloud, … • Support pub/sub at scale • Buffer data between pubs & subs • Event-time ordered delivery • Events stored in arrival order • Don’t run applications → SwimOS is a stream processor that delivers continuous intelligence from streaming data

5. Stream Processors • Stream processors subscribe to a broker to analyze streaming event data • Their insights can be asynchronously consumed by publishing back to the broker • The broker offers a low-latency API that gives the stream processor events in real-time • Pulsar does not control execution of the stream processor
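
The subscribe, analyze, publish-back loop on this slide can be sketched with a toy in-memory broker. This is only an illustration under stated assumptions: `InMemoryBroker`, `RunningMeanProcessor`, the topic names `raw` and `insights`, and the event fields are all hypothetical, and none of this is the Pulsar client API.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for a pub/sub broker (hypothetical, not the Pulsar API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, event):
        # Deliver each event, in arrival order, to every subscriber of the topic.
        for callback in list(self._subscribers[topic]):
            callback(event)


class RunningMeanProcessor:
    """Subscribes to a raw topic and publishes a per-device running mean."""

    def __init__(self, broker, in_topic, out_topic):
        self._broker = broker
        self._out_topic = out_topic
        self._count = defaultdict(int)
        self._total = defaultdict(float)
        # The broker only delivers events; the processor runs its own logic.
        broker.subscribe(in_topic, self.on_event)

    def on_event(self, event):
        device = event["device"]
        self._count[device] += 1
        self._total[device] += event["value"]
        mean = self._total[device] / self._count[device]
        # Publish the insight back so other consumers can read it asynchronously.
        self._broker.publish(self._out_topic, {"device": device, "mean": mean})


def run_demo():
    broker = InMemoryBroker()
    insights = []
    broker.subscribe("insights", insights.append)
    RunningMeanProcessor(broker, "raw", "insights")
    for value in (10.0, 20.0, 30.0):
        broker.publish("raw", {"device": "d1", "value": value})
    return insights
```

Calling `run_demo()` returns one insight per raw event. In a real deployment the broker would be a Pulsar cluster and delivery would be asynchronous; the synchronous callbacks here only keep the sketch short.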

6. SwimOS is a Stateful, Real-time Stream Processor • Application: Distributed, stateful, concurrent graph of Web Agents & real-time UIs • Infra: Distributed, p2p mesh of instances on k8s using WebSockets • Builds and auto-scales apps from real-world event data, creating a stateful graph that continuously computes, driven by data • Automates infrastructure operation • Load balances, secures, persists and auto-scales the application • Apps are easy to develop • Delivers unimaginable performance

7.Major Mobile Provider • > 150M devices • > 10Gb/s of streaming data from Pulsar • Continuous analysis, aggregation & reduction • Millisecond latency • Pervasively real-time UI • Distributed across AZ 6

8. Pulsar’s Many Pros • Event Processing: filtering, transformation, counts / windows, alerts • Serverless is a great abstraction • SQL-style API • Storage tiering • Delivery guarantees • Multi-tenancy • Replication • Scaling [Diagram label: Database]

9. Challenges… • How many topics do you need?

10. Challenges… [Diagram labels: engine_temp: 290; fan_temp: 188; coolant_vol: 25]

11. Challenges… Streaming analytics (#solved!) ☞ polling → not real-time [Diagram labels: Client Application; Client (×4); engine_temp: 290; fan_temp: 188; coolant_vol: 25] Databases don’t drive computation! (though in-memory is faster) • What DB architecture do you need? • Scaling / clustering / consistency … Continuous Intelligence demands • Data driven computation • Analysis in context, everywhere, concurrently • Stateful, in-memory, distributed • Pervasively real-time computation

12.> Palo Alto, CA

13. 60 TB/day • ~600 TB/day • There’s more data than your cloud could store ▶ (mostly ephemeral)

14.Intelligence is driven by (a flow of) state changes - not raw data

15. Users Want Stateful, Continuous, Contextual Analysis [Diagram labels: λ, λ, x_{n-1}] • Streams are a sequence of state changes • They never stop… (so “store-then-analyze” is silly) • Applications always have to have an answer • “Meaning” depends on granular contextual relationships

16. Introducing Swim Web Agents • SwimOS subscribes to event streams from real-world sources • It creates a stateful, concurrent web agent for each data source • Each web agent cleans, labels, analyzes data from its real-world twin • Agents dynamically link to related agents, creating a stateful in-memory graph • Containment, proximity… logical relationships eg: pod/cluster … • Computed relationships: correlated… • Linked web agents share their states in real-time • Web Agents are vertices in the graph • Each continually computes on its own state & state of its links, as data flows over the graph – and streams its results in real-time over its links • This is data-driven, stateful, continuous computation
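
A minimal sketch of the idea on this slide: one stateful agent per data source, linked to related agents, each recomputing on its own state and the state of its links as events arrive. The names `WebAgent`, `AgentGraph`, `route` and `neighborhood_mean` are hypothetical and chosen for illustration; this is not the SwimOS API, and the "analysis" is just a mean over linked states.

```python
class WebAgent:
    """Hypothetical stateful agent standing in for one real-world source."""

    def __init__(self, source_id):
        self.source_id = source_id
        self.state = None              # latest value seen from the source
        self.links = []                # related agents (graph edges)
        self.neighborhood_mean = None  # computed from own + linked states

    def link(self, other):
        # Links are symmetric here; both ends recompute when the edge appears.
        if other not in self.links:
            self.links.append(other)
            other.links.append(self)
        self._recompute()
        other._recompute()

    def on_event(self, value):
        # New data changes this agent's state, which drives computation
        # in this agent and in every agent linked to it.
        self.state = value
        self._recompute()
        for neighbor in self.links:
            neighbor._recompute()

    def _recompute(self):
        values = [a.state for a in [self, *self.links] if a.state is not None]
        self.neighborhood_mean = sum(values) / len(values) if values else None


class AgentGraph:
    """Creates an agent the first time a source id appears in the stream."""

    def __init__(self):
        self.agents = {}

    def route(self, source_id, value):
        agent = self.agents.get(source_id)
        if agent is None:
            agent = self.agents[source_id] = WebAgent(source_id)
        agent.on_event(value)
        return agent
```

For example, routing events for sources `"a"` and `"b"` creates two agents; after `a.link(b)`, a new event for `"a"` also updates `b.neighborhood_mean`. The graph grows only as new sources appear in the data, which is the sense in which the data builds the application.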

17. Web Agents Continuously Compute - Driven by Data • Analyze data to determine state [Diagram labels: Real-world / Stateful Web Agent / Analytics / Relational Analysis / Relational Graph / MapReduce / Learning & Prediction]

18.I’m green • Web agents are stateful

19.I’m red • Raw data can typically be discarded

20.I’m still red • Noisy / redundant updates are discarded
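
The green / red / still-red sequence on slides 18 to 20 can be sketched as an agent that turns raw readings into a state and pushes only when that state changes. The threshold rule and the names `StatefulAgent` and `pushed_updates` are assumptions made for illustration; the slides do not say how the state is derived.

```python
class StatefulAgent:
    """Hypothetical agent that keeps only its current state, not raw data."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.state = None

    def on_reading(self, value):
        # Assumed rule: readings above the threshold are "red", others "green".
        new_state = "red" if value > self.threshold else "green"
        if new_state == self.state:
            return None  # redundant update: nothing to push
        self.state = new_state
        return new_state  # state change: push downstream


def pushed_updates(readings, threshold):
    """Return only the state changes produced by a sequence of raw readings."""
    agent = StatefulAgent(threshold)
    pushes = []
    for value in readings:
        change = agent.on_reading(value)
        if change is not None:
            pushes.append(change)
    return pushes
```

With a threshold of 50, the six readings `[10, 12, 95, 97, 99, 11]` produce three pushes (`green`, `red`, `green`); the raw values themselves are discarded once the state is known.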

21. [Diagram labels: I’m green; I’m still red; No push; No push] Streaming data auto-scales the application, composed of concurrent web agents, at low cost, in real time, as data arrives

22.A Swim Application is an Active Graph

23. A graph of linked active Web Agents [Map scale: 1000m]

24. Web Agents Link to Form a Computational Graph ① SwimOS creates a web agent for each source in streaming data ② Agents interlink to reflect real-world relationships ③ Powerful operators for analysis, learning & prediction continuously compute on state & stream results • Web agents continuously compute on their own state and the state of linked web agents, enabling granular contextual analysis on-the-fly

25. E.g.: Unsupervised Training & Prediction [Diagram labels: Observed; Predicted; D; Training; Back Propagation]

26.• A scaled application is a graph dynamically built from data • Objects are stateful and concurrent

27. SwimOS Eliminates “the Stack” • Developer defines entities & their relationships as Java objects • Swim builds a stateful, distributed graph of concurrent web agents that statefully represent real-world sources, from streaming data • Web agents collaborate to analyze, learn, predict and respond on the fly • They continuously stream real-time insights to UIs & applications

28. Pulsar and Swim: Better Together • Application: Distributed, stateful, concurrent graph of Web Agents & real-time UIs • Infra: Distributed, p2p mesh of instances on k8s using WebSockets • Builds and auto-scales apps from real-world event data, creating a stateful graph that continuously computes, driven by data • Automates infrastructure operation • Load balances, secures, persists and auto-scales the application • Apps are easy to develop • Delivers unimaginable performance

29. Questions? swim.ai

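StreamNative is an open-source infrastructure software company building a next-generation streaming data platform around Apache Pulsar and Apache BookKeeper. Guided by two convictions, that event streaming is the future cornerstone of big data and that open source is the future of infrastructure software, it focuses on building the open-source ecosystem and community and is committed to cutting-edge technology.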