Data Scientists and Enterprise Engineers
1. data science and enterprise engineering: how data scientists and engineers work in tandem to achieve real-time personalization at overstock.com
2. © Overstock. overstock: pushing boundaries of retail since 1999. Midvale, Utah; 5 million unique products for sale; billions of visits and page views; 160+ countries
3. team in the beginning... (overstock marketing dev)
• pigs (remember Animal Farm?)
  • data science team: 3 data scientists
  • engineering team: 2 developers, 1 QA, and a product owner
• chickens
  • plenty of business owners
  • lots of channel managers
4. the problems we were up against (overstock marketing dev)
• data scientists were not working with the business to solve business problems
• engineers were not working with data scientists
• engineering was in a relational data mindset
• not regularly delivering business value
• solving yesterday's problems, today
5. days, minutes, seconds
6. 1st problem we came together to solve (business problem)
• real-time bidding (RTB) is a means by which advertising inventory is bought and sold on a per-impression basis via an instantaneous auction, similar to financial markets
• low-latency operation: need to bid in under 10 ms
• pre-compute scores nightly
• push scores to a cache in AWS
• partnered with Beeswax
• replace existing 3rd-party partners
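The "pre-compute nightly, look up at bid time" pattern above can be sketched as follows. This is a minimal illustration, not the team's implementation: a plain dict stands in for the AWS-hosted cache, and every name (`nightly_batch`, `bid_price`, the CPM multiplier rule) is a hypothetical chosen for the example.

```python
# Hypothetical sketch: a dict stands in for the cache hosted in AWS.
DEFAULT_SCORE = 0.0  # assumption: unknown users get the lowest score


def nightly_batch(raw_scores):
    """Build the cache payload: user_id -> propensity score clamped to [0, 1]."""
    return {uid: min(max(score, 0.0), 1.0) for uid, score in raw_scores.items()}


def bid_price(cache, user_id, base_bid_cpm=1.0, max_multiplier=3.0):
    """Constant-time lookup at bid time; no model inference on the request path."""
    score = cache.get(user_id, DEFAULT_SCORE)
    # Scale the base bid linearly from 1x (score 0) to max_multiplier (score 1).
    return round(base_bid_cpm * (1.0 + (max_multiplier - 1.0) * score), 4)


cache = nightly_batch({"u1": 0.5, "u2": 1.7})
print(bid_price(cache, "u1"))       # 2.0
print(bid_price(cache, "unknown"))  # 1.0
```

Keeping the expensive scoring in the nightly job is what lets the bid path stay within a tight latency budget; the trade-off is that scores can be up to a day old.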
7. propensity to purchase (problem to solve)
• identifying unique patterns and tendencies that indicate a user is ready to purchase
• user score: represents a customer's likelihood of purchase
• collected at the visit/user level
• basic steps
  • turn raw user interactions into 'features'
  • train classifiers on months of data with label = purchase vs. no purchase
  • predict on new users/visits
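The three basic steps above (features, train, predict) can be sketched in a few lines. The slides do not say which features or classifier were used, so this example assumes two toy features (page views and add-to-cart events) and a plain logistic regression fitted by gradient descent; the event names and data are invented for illustration.

```python
import math


def featurize(events):
    """Step 1: turn a visit's raw event list into a fixed-length feature vector."""
    return [
        1.0,                                 # bias term
        float(events.count("page_view")),    # hypothetical feature
        float(events.count("add_to_cart")),  # hypothetical feature
    ]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(visits, labels, lr=0.1, epochs=500):
    """Step 2: fit logistic regression (label 1 = purchase) by batch gradient descent."""
    X = [featurize(v) for v in visits]
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(X, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w


def predict(w, events):
    """Step 3: score a new visit; the result is a purchase propensity in (0, 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, featurize(events))))


# Toy training data, invented for this example.
visits = [
    ["page_view"] * 3,
    ["page_view"] * 5,
    ["page_view", "add_to_cart"],
    ["page_view"] * 2 + ["add_to_cart"] * 2,
    ["page_view"],
    ["page_view"] * 4 + ["add_to_cart"],
]
labels = [0, 0, 1, 1, 0, 1]
weights = train(visits, labels)
cart_score = predict(weights, ["page_view", "add_to_cart", "add_to_cart"])
browse_score = predict(weights, ["page_view"] * 3)
```

In production the same three-step shape would apply, with months of logs in place of the six toy visits.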
8. class imbalance (data science challenges)
• many new customers
  • billions of unique page views in a calendar year
  • many users we are (potentially) seeing for the first time
• low conversion
  • a small percentage of sessions end in a purchase
  • sparse web logs mean we have to digest an enormous amount of data to generate useful features
• we are interested in accuracy on the positive label
  • recall over precision
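To make "recall over precision" concrete, the sketch below computes both metrics on a toy, imbalanced set of ten sessions with two purchases. The labels, scores, and thresholds are invented; the point is only that lowering the decision threshold trades precision for recall on the rare positive class.

```python
def precision_recall(labels, scores, threshold):
    """Precision and recall on the positive (purchase) label at a given threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data: 2 purchases out of 10 sessions.
labels = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
scores = [0.9, 0.1, 0.2, 0.4, 0.05, 0.3, 0.15, 0.35, 0.1, 0.2]

# Always predicting "no purchase" is 80% accurate yet finds no buyers,
# which is why plain accuracy is a poor guide here.
always_negative_accuracy = labels.count(0) / len(labels)

print(precision_recall(labels, scores, 0.5))  # (1.0, 0.5)
print(precision_recall(labels, scores, 0.3))  # (0.5, 1.0)
```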
9. what the team was up against (challenges)
• constraints with current infrastructure and processes
  • used to pulling data instead of streaming it
• data scientists are a scarce resource
  • working on many unrelated tasks
  • not a lot of experience with the care and feeding of enterprise software
  • a bit set in the academic mindset
11. Faster, Scotty: what we accomplished in 6 months (recap)
• we were solving today's problems... by the end of the day
• engineering was thinking about streaming data
• had the pieces in place to start scoring users in minutes
12. days, minutes, seconds
13. business needed to move faster (data science + engineering challenges)
• score users faster
  • moved from daily to hourly
• picked up a fraud project
  • assign fraud score within 5 mins
14. moving from daily batches to more regular micro-batches (data science + engineering challenges)
• hourly jobs
  • much more responsive scoring = larger server footprint
  • can still train offline in batch, but must have pipelines better tuned
• every 5 minutes
  • ETL must be tightly honed
  • at this point you hit the edge of what is possible in the batch setting
  • feedback loops become critical for success
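One way to picture the 5-minute micro-batch is as a fixed time window over the event stream: each event is assigned to the window it falls in, and each window is scored as a small batch. The sketch below assumes events arrive as `(unix_timestamp, user_id)` pairs; that shape and the function names are assumptions made for the example.

```python
from collections import defaultdict

WINDOW_SECONDS = 300  # the 5-minute cadence named on the slide


def window_start(ts):
    """Start (in seconds) of the fixed window containing timestamp ts."""
    return ts - ts % WINDOW_SECONDS


def micro_batches(events):
    """Group (timestamp, user_id) events into 5-minute batches, ordered by window."""
    batches = defaultdict(list)
    for ts, user_id in events:
        batches[window_start(ts)].append(user_id)
    return dict(sorted(batches.items()))


events = [(0, "a"), (299, "b"), (300, "a"), (905, "c")]
print(micro_batches(events))  # {0: ['a', 'b'], 300: ['a'], 900: ['c']}
```

Shrinking `WINDOW_SECONDS` further is where the batch approach runs out of room: each run still pays fixed startup and ETL costs, which is the "edge of what is possible" the slide refers to.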
15. balancing business goals with enterprise engineering (business vs engineering challenges)
• as adoption increases, visibility increases
• need standardized data representations
• have to make sure architecture is hardened and robust
• need fail-safe mechanisms for critical processes
• MOST IMPORTANT
  • you can never, ever slow down overstock.com
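A common fail-safe for "never slow down the site" is to give the personalization call a hard time budget and serve a static default when the budget is exceeded or the call fails. The slides do not describe the actual mechanism, so the sketch below is one possible approach; the 50 ms budget, the fallback list, and all function names are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)
FALLBACK_RECS = ["bestseller-1", "bestseller-2"]  # hypothetical static default


def recs_with_budget(fetch, user_id, budget_s=0.05):
    """Return fetch(user_id) if it finishes within budget_s, else the fallback."""
    future = _pool.submit(fetch, user_id)
    try:
        return future.result(timeout=budget_s)
    except FutureTimeout:
        future.cancel()  # best effort; an already-running call is not interrupted
        return FALLBACK_RECS
    except Exception:
        # Any failure in the scoring path degrades to the default, never to an error page.
        return FALLBACK_RECS


def fast(user_id):
    return [f"rec-for-{user_id}"]


def slow(user_id):
    time.sleep(0.3)  # simulates a scoring service that blows the budget
    return [f"rec-for-{user_id}"]


def broken(user_id):
    raise RuntimeError("model unavailable")
```

The page render only ever waits `budget_s`; the cost is that a slow call still occupies a worker thread until it finishes, so the pool size bounds how many stragglers can pile up.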
16. [diagram; labels: User ID, User Training Data, User ML, Attributes, Trained Models, Experience, Training Data, Content, Content Attributes, Asset, Asset ML, Asset Training Data, Data Platform]
17. needed to move faster (data science + engineering recap)
• what was accomplished in 6 months
  • fraud scores in a minute or less
  • users were being scored in a minute
• what was left
  • still not fast enough for real-time personalization on overstock.com
18. days, minutes, seconds
19. near real-time personalization on the shopping site (business problem)
• near real-time personalization on overstock
• putting custom recs in front of the user
• can't slow the site down
20. from micro-batches to near real-time (data science + engineering + business challenges)
• near real-time can limit the size of models and complexity of calculations
  • we aren't trading stocks, we sell home goods
• we may not know much about a user we are trying to personalize for
  • what can you personalize for a new user as they initially interact with your site?
  • heuristics moving towards full models
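One reading of "heuristics moving towards full models" is a blend that leans on a simple heuristic for a brand-new user and shifts weight to the model as interactions accumulate. The slides do not specify how the hand-off works, so the linear ramp and the `full_trust_at` cutoff below are assumptions made for illustration.

```python
def blend_weight(n_events, full_trust_at=20):
    """Weight given to the model: 0 for a brand-new user, 1 once enough events exist."""
    return min(n_events / full_trust_at, 1.0)


def personalized_score(heuristic_score, model_score, n_events, full_trust_at=20):
    """Linear blend of a heuristic score and a model score, based on how much we know."""
    w = blend_weight(n_events, full_trust_at)
    return (1.0 - w) * heuristic_score + w * model_score


# A first-time visitor gets the pure heuristic; a well-known user gets the pure model.
new_user = personalized_score(0.2, 0.8, n_events=0)
halfway = personalized_score(0.2, 0.8, n_events=10)
known_user = personalized_score(0.2, 0.8, n_events=20)
```

Because the heuristic is cheap to compute, this kind of blend also keeps the first-page experience inside a near real-time budget while the heavier model warms up.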
21. from micro-batches to near real-time (data science + engineering + business challenges)
• real-time data flow requires real-time analytics and strategy
  • introspection into processes becomes critical
  • must have automated process control to prevent algorithms from running wild
  • empower business owners to operate strategically without suffering from information overload
22. from micro-batches to near real-time (data science + engineering + business challenges)
• every piece of the process needs to function as a cohesive flow
  • do we need to wait for the data to get there, then act? (fraud)
  • or act on the data we have? (recommendations for a shopping site)
23. [diagram: audiences & user attributes; labels: user exhaust; auction, bid, & win logs; user attributes; user events; bids; tracking deep links; pcm trking id; raw data; ml features; live rex; campaign logs; asset events; asset attributes; assets; asset data platform; targeted assets; asset exhaust]
24. what was done and undone (data science + engineering + business recap)
• still not quite to near-real-time/low-latency recs for shopping site
• fraud score and email recommendations are fast enough
25. where we are going (data science + engineering + business roadmap)
• page-less personalization
• deep learning for fraud
• balancing real-time needs with efficient gains
We all need to be daring and collaborative to accelerate innovation!
26. lessons learned (data science + engineering + business takeaways)
• the cloud presents a whole new set of problems
• tech is hard, politics are harder
• we can't operate in silos
• we must be aligned with the business
• the process must be iterative
27. questions? and of course, we are hiring!