Real-time monitoring of Mobile Internet Quality of Experience us

在电信领域,用户体验是核心竞争力,我们来分享一下基于Apache Flink,Hadoop, YARN, Kafka, Nifi, Druid还有一些高级可视化组件,完成状态数据流的基本操作,比如JOIN,Windows,聚合,CEP(复杂实践处理)来产生告警信息等等,我们基于此构建出下一代平台给SOC(服务操作中)团队使用,确保我们的客户有最优的用户体验。

1.Real-time monitoring of Mobile Internet QoE using Flink everis, an NTT DATA Company San Francisco April 2018

2.Speakers • Senior Telecom Analytics Consultant at everis. • Analytics Manager at everis with 10+ years of experience in the Telecom sector. • 5+ years of experience in Telecom sector. • Analytics for: Commercial, Sales, Operational, • Working in batch and streaming analytics Technical, Regulatory compliance. innovative use cases. • Interests: tackling BDA challenges from the • Interests: next-generation stream processing definition to the implementation with disruptive engines, bussiness-applied machine learning use technologies. Data should be played. cases…

3.everis, an NTT Data Company 17.000 15 professionals countries

4.The Challenge: Mobile Internet Service Quality Are customers satisfied? Is network performance good enough? How do customers perceive the service? CeX is the next big Battle Ground!

5.The Challenge has evolved Voice connectivity Voice quality Internet connectivity Mobile voice calls Smartphones mobile Applications usage data connectivity

6.Classic tools and processes are not enough Customer complaints analysis, Net Promoter Score, Churn over technical issues Network performance & trouble Business support systems management New problems require new tools!

7.Network performance # Customers affected by failures Throughput benchmark Network Performance Index

8. Getting ready to face the challenge Research Define • How does the customer perceive Mobile Internet Quality? Best Mean Opinion Score metric possible for: • How do main players measure customer perception? • Video Streaming • What tools can be used to calculate experience metrics? • Web Browsing • Which data is available to calculate CeX metrics? • Smartphone Apps All logos are registered trademarks or trademarks of their respective owners and are only used for illustrative purposes

9.Let’s hit the target, with the right tools Mean Opinion Score for Video Streaming, Web Browsing and Apps Usage Service operations center for customer experience monitoring Root cause analysis and CapEx / OpEx prioritization Hit straight into service perception

10.Who are the users of this new tool? Service Operations Center Identify massive MOS problems in geographical zones or per individual customer Analyze root causes over the network that affect MOS Real-time End to End monitoring of service perception

11.Benefits • Proactively avoid customer contacts • Identify customer satisfaction and enable troubleshooting • Characterize high value customers and prioritize monitoring efforts • As a result: • Enhance customer recommendation • Reduce Churn • Save costs and expenses • Invest better

12.Why Flink? Per event processing Event time processing Session Windowing Multiple Window Firings Complex event processing

13.Architecture Network Performance KPI Definitions & Other Datasources for DPI Data • Leverages existing HDP® stack Data Alarm Rules enrichment Ingestion Alarms Enriched KPIs Ingested Sources • Loosely coupled Processing Complex Event KPI Calculations & Processing Enrichment • Fault tolerant (HA) Raw data long- • Easily extendable Storage term storage Analytical Datastore Visualization • Flexible reporting Interface to Custom Exploratory Alarms Manager Dashboards Analysis

14.How Big is Big? 30 # Events x Day (in billions) TB 53 • We’re growing really fast! 25 • Now processing 23 billion 20 records a day. 15 23 23 • Processing roughly a half 10 of ingested data 7 5 • Aiming to more than duplicate soon. - 2017Q3 2017Q4 2018Q1 2018Q2 2018Q3

15. Challenge #1 – Multiple Stream Enriching Sources for enrichment = 1 Sources for enrichment = 2 (1st Attempt) val res = DPIInfoXNetworkElementXGeoInfo(...) val res = NetworkElementXGeoInfo(...) Sources for enrichment = 2 (2nd Attempt) TL;DR: Keep the stream flowing! val res = (DPIInfo, NetworkElement, GeoInfo) And a quick tip: Keep intermediate data objects as flexible as possible val res = (DPIInfo, NetworkElement)

16. Challenge #2 – Session Windowing The problem Long gone are the days when web pages were simple! Now 100s of requests per web page are the norm. Solution env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) We defined a “session” as the val hdrStream = env.addSource(kafkaConsumer).filter(...) minimum interaction of the end hdrStream.keyBy(e => (e.userKey, e.webPageKey)) .window(EventTimeSessionWindows.withGap(Time.minutes(GAP_MINUTES))) user to which the system .aggregate(new CalcWebQoEAggregateFunction) .addSink(kafkaProducer) assigns a score. So simple!

17. Challenge #3 – Multiple Window Firings (I) The problem Solution env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) 𝑓 𝑣, 𝑤, 𝑥, 𝑦, 𝑧 = 𝑎𝑣 + 𝑏𝑤 + 𝑐𝑥 + 𝑑𝑦 + 𝑒𝑧 val consumer = new FlinkKafkaConsumer010[...](“pmTopic", MySerdeSchema, readerProps) .assignTimestampsAndWatermarks( new BoundedOutOfOrdernessTimestampExtractor[...](Time.minutes(60)){ We need to compute these override def extractTimestamp(element: ...): Long = element.startTimestamp formulas for each window on }) val pmStream = env.addSource(consumer).flatMap(...) hourly data, but variables often pmStream.keyBy(e => (e.time, e.networkElementId, e.formulaId)) come at (very) different processing .window(TumblingEventTimeWindows.of(Time.minutes(60))) .aggregate(new CalcFormulasAggregateFunction) times. .addSink(kafkaProducer) 7:00 8:00 Watermark 8:00 8:00 Too much latency! 𝒘 𝒚 8:00 8:00 8:00 9:00 𝒗 𝒙 𝒛 𝒗 Processing Time 8:00 9:00 9:10 9:15 10:00

18. Challenge #3 – Multiple Window Firings (II) New problems New Solution • If data comes early, we still need to env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime) //not EventTime wait until the next hour to trigger the val consumer = new FlinkKafkaConsumer010[...](“pmTopic", MySerdeSchema, readerProps) window. val pmStream = env.addSource(consumer).flatMap(...) pmStream.keyBy(e => (e.time, e.networkElementId, e.formulaId)) • If data comes later (it does all the .timeWindow(Time.days(1)) //max time we’re waiting for data .trigger(ContinuousProcessingTimeTrigger.of[TimeWindow](Time.minutes(10))) time!) we’d need to make our out-of- .evictor(new RemoveAfterCalculateEvictor()) //removes all data of a calc’d window .aggregate(new CalcFormulaAggregateFunction()) //returns Option[...] orderness parameter bigger, making .filter(_.isDefined).map(_.get) .addSink(kafkaProducer) our latency problem even worse. 1st Firing = None 2nd Firing = Some(...) TL;DR: Much better latency and scalability at the cost of a few more CPU cycles and a 8:00 8:00 bigger state. 𝒘 𝒚 8:00 8:00 8:00 9:00 𝒗 𝒙 𝒛 𝒗 Processing Time 8:00 9:00 9:10 9:15 9:20 10:00

19. Challenge #4 – Making CEP Dynamically The problem Solution • We need to run several complex event • Make it dynamic! (more or less) patterns on some different datasources. • Use JSONs to configure patterns. • New rules are added, and the existing • Flink parses the JSONs and starts ones are changed from time to time. running the pattern matching. • Hard-code this is not an option! • Needs the job to be restarted  { "name": “VideoQoE CEP", "eventClass": "", “inputTopic": “video_qoe", "outputTopic": "alarms", “alarmType": 12345, “alarmBody": "Alarm! The current Video QoE score in zone ${geographicalArea} is ${score}", "patterns": [ {"name": "start", "quantifier": "+", "condition": "score > 3 && score <= 3.5", "optional": false, "greedy": true}, {"name": "end", "quantifier": "1", "condition": "score < 3", "optional": false, "greedy": false, "contiguity": "relaxed", "within": "10 min"} ] }

20.What’s Next • Add more sources to the mix (Fixed network, Contact Center, etc.) • Use ML to enhance our anomaly detection capabilities. • Further develop interfaces to external systems. • Calibrate math models to better resemble quality expectations of customers. • Improve our CEP module. • Data Science / ML work area.

21.Muchas gracias Consulting, Transformation, Technology and Operations