Building A Feature Factory
Building, managing, and maintaining thousands of features across thousands of models is repetitive, tedious, and extremely challenging to scale. We will explore the 'Feature Factory' built at Databricks and implemented at several clients, along with the processes that are imperative for democratizing feature development and deployment. The feature factory makes feature creation repeatable, simplifies scoring, and enables massive scalability through feature multiplication.
2. Building A Feature Factory
Daniel Tomes, Databricks
#UnifiedDataAnalytics #SparkAISummit
3. Me
- Norman, OK
  - Undergrad: OU (SOONER)
  - Masters: OK State
- ConocoPhillips
- Raleigh, NC
- Cloudera
- Databricks
4. Domains: Retail, Farming/Oil, Fraud, Ads, Marketing, InfoSec, General
Concepts: Channel • Geospatial Pattern • Campaign • Partners • Intrusions • Weather • Sales • IoT • Manipulation • Bids • Sales • Octets • Geo • Costs • Weather • Market • Social • Customer • ISPs • Municipal • Revenue • Customer • Purchases • Region • Market • Employee • Equipment • Contract • Competitor • Competitor • Vendor • Service • Social • Loyalty • Coupon • Gov • Interaction • Campaign
5. Metrics are easy:
- Fewer
- Better defined
- Better documented
- Likely been describing the business for years
Implement through metrics
Grouping & Multiplying Concepts = Feature Engineering
6-9. Measure vs Metric vs Feature
- Measure: Numbers or values that can be summed and/or averaged, such as sales, leads, distances, durations, temperatures, and weight.
- Metric: A quantifiable measure that is used to track and assess the status of a specific process.
- Feature: An individual measurable property or characteristic of an observation. A raw, aggregated or altered metric that can provide predictive power in pattern recognition, classification, and regression.
- Values added as the slide builds: 31; +31 (Country Code); .002428571
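To make the three terms concrete, here is a minimal sketch in plain Python. The store names and sale amounts are made up for illustration and do not come from the talk: individual sale amounts play the role of measures, the per-store total is a metric, and the min-max scaled total is one possible feature.

```python
# Hypothetical data: individual sale amounts (measures) per store.
sales = {
    "store_a": [120.0, 80.0, 50.0],
    "store_b": [300.0, 200.0],
    "store_c": [40.0, 10.0],
}

# Metric: a quantifiable summary used to track a process (total sales per store).
total_sales = {store: sum(amounts) for store, amounts in sales.items()}

# Feature: the metric altered for modeling, here min-max scaled to [0, 1].
lo, hi = min(total_sales.values()), max(total_sales.values())
scaled_total_sales = {
    store: (total - lo) / (hi - lo) for store, total in total_sales.items()
}
```

Scaling is only one of many possible alterations; binning or ratios against another metric would fit the same definition.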
10. How It Goes
Data Engineer
- Identify data scope and scale
- Understand target if applicable
- Scope down to relevant data
- Scope up to include more data
- Explore available data
- Understand data models
- Understand business rules
- Identify Metrics & Features
- Write code that writes code
- Join, Union, Agg
- Optimize
Data Scientist
- Modeling
- Data Filtering
- Twisting (Sales X Time Ranges)
- Tweaking (Scaling/Binning)
- Clustering/PCA/Correlation
- Pearson/Outlier
- Model Stacking
- Data Leaks
- Model Tuning
- Evaluation
11. Feature Factory
Same task breakdown as slide 10.
12. End Result
[Diagram labels: Concept, Feature Family, Feature Set, Base DF, Feature Factory, Magic Sauce]
13. Why A Feature Factory
- Rapidly prototype and deliver 1000s of features
- Build them all and let science decide
  - Univariate Selection Algorithms
  - Feature Importance Models (XGBoost)
  - Correlation Matrices
  - High-Dimensional PCA
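"Build them all and let science decide" can be sketched as a ranking step. The example below is an assumption-laden stand-in, not the talk's code: it ranks hypothetical features by absolute Pearson correlation with a target and keeps the top n. A real pipeline would more likely use one of the methods listed on the slide, such as a feature importance model.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def top_n_features(features, target, n):
    """Rank feature names by |correlation with target| and keep the top n."""
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:n]


# Hypothetical generated features and target, for illustration only.
features = {
    "total_sales_1m": [1.0, 2.0, 3.0, 4.0, 5.0],
    "total_sales_6m": [2.0, 1.0, 4.0, 3.0, 5.0],
    "constant_flag": [1.0, 1.0, 1.0, 1.0, 1.0],
}
target = [10.0, 20.0, 30.0, 40.0, 50.0]
selected = top_n_features(features, target, n=2)
```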
14. Why A Feature Factory
- Feature reusability
- Consistent logic (joins and formulas)
- Optimized feature generation
- Process Documentation – Finally!
- Scalable (10K+ features)
15. What Is A Feature Factory
- Code Base - APIs
- Accelerator – Configurable – Not OEM
- Extensible & Customizable…Incomplete
16. How It Works
- Land the scaffolding
- Gut the demo
- Structure, Configure your Concepts
- Initialize your data and your metrics
17. Abstract Architecture
18. Concrete Example (TPC-DS)
- Store Sales
- Catalog Sales
- Web Sales
19. Concrete Example (TPC-DS)
- Store Returns
- Catalog Returns
- Web Returns
20. TPC-DS Architecture
- Store
- Web
- Catalog
21. Architecture
[Diagram labels: Master Concept, Implemented Concept, Feature Family, Implemented Feature]
22. Implementation Process
   1. Define the concept (channel): Rename & Define Channel
   2. Implement the concepts: Implement Store
   3. Build Feature Families: Implement Feature Family
   4. Implement Features
23. Feature - Definition
Store -> Sales -> Feature
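The concept, feature family, and feature hierarchy (Store -> Sales -> Feature) could be modeled roughly as follows. This is a minimal sketch in plain Python; the class names `Concept`, `FeatureFamily`, and `Feature`, the column names, and the sample rows are all hypothetical and are not taken from the Feature Factory code base, which the slides do not show.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, float]


@dataclass
class Feature:
    name: str
    compute: Callable[[List[Row]], float]  # aggregates rows into one value


@dataclass
class FeatureFamily:
    name: str
    features: List[Feature] = field(default_factory=list)


@dataclass
class Concept:
    """A channel such as Store, Catalog or Web, holding its own base rows."""
    name: str
    rows: List[Row]
    families: Dict[str, FeatureFamily] = field(default_factory=dict)

    def add_family(self, family: FeatureFamily) -> None:
        self.families[family.name] = family

    def build(self) -> Dict[str, float]:
        """Compute every feature, naming it Concept_Family_Feature."""
        return {
            f"{self.name}_{family.name}_{feature.name}": feature.compute(self.rows)
            for family in self.families.values()
            for feature in family.features
        }


# Steps 1-2: define the channel concept and implement it for Store.
store = Concept(
    "Store",
    rows=[
        {"net_paid": 20.0, "quantity": 2.0},
        {"net_paid": 30.0, "quantity": 1.0},
    ],
)
# Step 3: build a feature family. Step 4: implement its features.
sales = FeatureFamily(
    "Sales",
    [
        Feature("total_sales", lambda rows: sum(r["net_paid"] for r in rows)),
        Feature("total_quantity", lambda rows: sum(r["quantity"] for r in rows)),
    ],
)
store.add_family(sales)
store_features = store.build()
```

Under this layout, adding the Catalog or Web channel would mean creating another `Concept` and reusing the same `FeatureFamily` definition.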
24. Highlights - Multipliers
Feature Families
- Sales
- Customer
- Weather
- Geo
Multipliers
- Time
- Categorical
- Trends
Feature name anatomy: Base Metrics (Sales/Customer) + Time Window (Multiplier) + Base Metric (Weather/Geo) + Categorical (Category)
- Total_Sales_6m_Sunny_Category-MensShoes
- Total_Customers_3m_GeoRange_CheckoutMethod-Self
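Feature multiplication amounts to a Cartesian product over the naming parts. The sketch below assumes four short, hypothetical lists built from the two example names on the slide and uses `itertools.product` to generate every combination; it only produces names, not the underlying computations.

```python
from itertools import product

# Hypothetical parts, taken from the two example names on the slide.
base_metrics = ["Total_Sales", "Total_Customers"]
time_windows = ["3m", "6m"]
conditions = ["Sunny", "GeoRange"]  # weather / geo base metrics
categoricals = ["Category-MensShoes", "CheckoutMethod-Self"]

# Every combination of one part from each list becomes a feature name.
feature_names = [
    "_".join(parts)
    for parts in product(base_metrics, time_windows, conditions, categoricals)
]
```

Two values in each of four lists already yield 16 names, which is why a handful of lines can fan out into thousands of features.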
25. Highlights - Multipliers
- Base metrics: Sales Metrics (8), Customer Metrics (8)
- Time Windows: 1m, 3m, 6m, 12m, 1w, 2w, 3w, 4w
- Categorical Dims: Item Category (8), Demographics (12)
- 8 * 9 * 8 * 8 * 12 = 55,296 possible features & < 20 lines of code
Common Example
- 8 sales metrics * 4 time windows * 5 dims with avg of 12 distincts
- 8 * 4 * 5 * 12 = 1,920 features
- Send to feature importance/selection process and pick top n
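The two feature counts can be checked directly. The factors below are copied from the slide as written; the snippet only verifies the arithmetic and makes no claim about which dimension each factor stands for.

```python
from math import prod

# Factors exactly as they appear on the slide.
full_example = prod([8, 9, 8, 8, 12])
common_example = prod([8, 4, 5, 12])

print(f"{full_example:,} possible features")
print(f"{common_example:,} features in the common example")
```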
26. Highlights – Joiners/Groupers
- Automatic aggs/joins when features need it
- Accurate & Optimized
- Once & Only once
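One way to read "once and only once" is that each join is planned a single time no matter how many features need it. The sketch below illustrates that idea in plain Python; the `required_joins` mapping, the table names, and `plan_joins` are hypothetical and do not describe how the Feature Factory itself resolves joins.

```python
from typing import Dict, List

# Hypothetical mapping of feature name -> dimension tables it needs.
required_joins: Dict[str, List[str]] = {
    "Total_Sales_6m_Sunny": ["date_dim", "weather"],
    "Total_Sales_6m_Category-MensShoes": ["date_dim", "item"],
    "Total_Customers_3m": ["date_dim", "customer"],
}


def plan_joins(requested: List[str]) -> List[str]:
    """Return each needed table exactly once, in first-seen order."""
    plan: List[str] = []
    for feature in requested:
        for table in required_joins[feature]:
            if table not in plan:
                plan.append(table)
    return plan


# All three features share date_dim, but it is planned only once.
join_plan = plan_joins(list(required_joins))
```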
27. Highlights – Canned Data
- Where is relevant Data?
- Just browse the data related to the concept