Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System

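We describe the design and implementation of Cognitive Database, a Spark-based relational database that demonstrates novel capabilities of AI-enabled SQL queries. A key aspect of our approach is to first view the structured data source as meaningful unstructured text, and then use that text to build an unsupervised neural network model using a natural language processing (NLP) technique called word embedding. We integrate the word embedding model into the existing SQL query infrastructure and use it to enable a new class of SQL-based analytics queries called Cognitive Intelligence (CI) queries.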

1. Cognitive Database: An Apache Spark-Based AI-Enabled Relational Database System
Rajesh Bordawekar, IBM T. J. Watson Research Center, bordaw@us.ibm.com
#AI5SAIS

2. Outline
• Word Embedding Overview
• Cognitive Database Design
• Cognitive Intelligence (CI) Queries
• Spark Implementation Details
• Case Study: Image and Text Database
• Summary

3. Word Embedding Overview
• Unsupervised, neural-network-based NLP approach that captures the meanings of words from their neighborhood context
  – Meaning is captured as the collective contribution of the words in the neighborhood
• Generates semantic representations of words as low-dimensional vectors (200-300 dimensions)
• Semantic similarity is measured using a distance metric (e.g., cosine distance) between vectors
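The similarity measure named on this slide can be illustrated with a minimal sketch. The three-dimensional vectors below are made-up values chosen for illustration only (real models would use the 200-300 dimensions mentioned above), and `cosine_similarity` is a hypothetical helper name, not part of the system described in the talk.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention used here: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (assumed values, not taken from a trained model).
toy_vectors = {
    "bananas": [0.9, 0.1, 0.0],
    "apples": [0.8, 0.2, 0.1],
    "crayons": [0.0, 0.1, 0.9],
}

sim_fruit = cosine_similarity(toy_vectors["bananas"], toy_vectors["apples"])
sim_mixed = cosine_similarity(toy_vectors["bananas"], toy_vectors["crayons"])
print(sim_fruit > sim_mixed)  # tokens with similar vectors score higher
```

Cosine distance is one minus this value, so a higher similarity corresponds to a smaller distance.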

4. Cognitive Database Key Ideas
• Uses a dual view of relational data: tables and meaningful text, with all relational entities mapped to text, without loss of information
• Uses the word-embedding approach to extract latent information from database tables
• The classical word embedding model is extended to capture constraints of the relational model (e.g., primary keys)
• Enables relational databases to capture and exploit semantic contextual similarities

5. Using Embedding Models
[Figure: Word Embedding in Structured Query Systems. Cognitive Intelligence queries run over structured data sources (relational tables) and return structured results (relational tables). The embedding model is either built from the data source being queried, or is a pre-trained model built from external unstructured and structured data sources.]

6. Cognitive Database Features
• Enables SQL-based information retrieval based on semantic context rather than data values
• Unlike analytics databases, does not view database tables as feature and model repositories
• Latent features are exposed to users via standard SQL-based Cognitive Intelligence (CI) queries
• Users can invoke standard SQL queries using typed relational variables over a semantic model built over untyped strings

7. Customer Analytics Workload
custID | Date  | Merchant    | State | Category      | Items                    | Amount
custA  | 9/16  | Whole Foods | NY    | Fresh Produce | Bananas, Apples          | 200
custB  | 10/16 | Target      | NJ    | Stationery    | Crayons, Pens, Notebooks | 60
custC  | 10/16 | Trader Joes | CT    | Fresh Produce | Bananas, Oranges         | 80
custD  | 9/16  | Walmart     | NY    | Stationery    | Crayons, Folders         | 25
• Text representation of a table row: “custD 9/16 Walmart NY Stationery ‘Crayons, Folders’ 25”
• A meaning vector is built for every token
• Words in the neighborhood contribute to the overall meaning of “custID”
• Words similar in meaning are closer in vector space
• For this relational view, custA is similar to custC, and custB is similar to custD
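The row-to-text view on this slide can be sketched as follows. The tokenization rules here (joining multi-word values with underscores and splitting the Items list into separate tokens) are assumptions made for the illustration; the slide only shows the resulting sentence, not the exact rules used. `row_to_sentence` is a hypothetical helper name.

```python
def row_to_sentence(row, columns):
    """Flatten one table row into a space-separated sentence of tokens."""
    tokens = []
    for col in columns:
        value = row[col]
        # A multi-valued cell (e.g. Items) contributes one token per item.
        items = value if isinstance(value, (list, tuple)) else [value]
        for item in items:
            # Assumption: multi-word values become a single token.
            tokens.append(str(item).strip().replace(" ", "_"))
    return " ".join(tokens)

columns = ["custID", "Date", "Merchant", "State", "Category", "Items", "Amount"]
rows = [
    {"custID": "custA", "Date": "9/16", "Merchant": "Whole Foods", "State": "NY",
     "Category": "Fresh Produce", "Items": ["Bananas", "Apples"], "Amount": 200},
    {"custID": "custB", "Date": "10/16", "Merchant": "Target", "State": "NJ",
     "Category": "Stationery", "Items": ["Crayons", "Pens", "Notebooks"], "Amount": 60},
    {"custID": "custC", "Date": "10/16", "Merchant": "Trader Joes", "State": "CT",
     "Category": "Fresh Produce", "Items": ["Bananas", "Oranges"], "Amount": 80},
    {"custID": "custD", "Date": "9/16", "Merchant": "Walmart", "State": "NY",
     "Category": "Stationery", "Items": ["Crayons", "Folders"], "Amount": 25},
]

# Each row becomes one training sentence for the embedding model.
sentences = [row_to_sentence(r, columns) for r in rows]
for s in sentences:
    print(s)
```

Because custA and custC share neighbors such as Fresh_Produce and Bananas, an embedding trained on these sentences would tend to place them close together, which is the effect the slide describes.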

8. Cognitive Intelligence Queries
• Semantic Similarity/Dissimilarities
• Semantic Clustering
• Cognitive OLAP queries
• Inductive Reasoning queries
• Semantic Relational Operations
Can work with externally trained models and over multiple data types.

9. CI Query Example
Cognitive UDF
• Operates on relational variables, which can be sets or sequences
• For each input variable, fetches vectors from the embedding model
• Computes semantic similarity between vectors using nearest neighbor approaches
    val result_df = spark.sql(s"""
      SELECT VENDOR_NAME,
             proximityCust_NameUDF(VENDOR_NAME, '$v') AS proximityValue
      FROM Index_view
      HAVING proximityValue > 0.5
      ORDER BY proximityValue DESC
    """)
CI similarity query: find entities similar to a given entity (VENDOR_NAME) based on transaction characteristic similarities

10. Cognitive Database Applications
• Analysis over multi-modal data (Retail, Health, Insurance)
• Entity similarity queries (Customer Analytics, IT Ticket Management, Time-series)
• Cognitive OLAP (Finance, Insurance…)
• Entity Resolution (Master Data Management)
• Analysis of time-series data (IoT, Health)

11. Cognitive Databases Stages
[Figure: three stages (Cognitive ETL, Vector Storage, Query Execution) laid out across a vector domain, a text domain, and the relational system. Labels recoverable from the slide: pre-computed external learned vectors, learned vectors, UDFs, external text sources, tokenized relations, relational tables, relational system tables, CI queries.]

12. Training from source database
[Figure: training pipeline]
• Relational tables → data cleaning, then one branch per data type:
  – text: create unique tokens (Python)
  – numerical values: k-means clustering (Numpy/Scipy), then create unique tokens (Python)
  – images: get image tags (Watson VRS), then create image features
• The branches produce a training text file
• Word embedding training on the text file, with hyperparameter tuning* (window size, vector dimensions), yields the word embedding model
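The numerical branch of this pipeline (k-means clustering followed by unique token creation) can be sketched as below. The slide names Numpy/Scipy for this step; the sketch instead uses a small pure-Python Lloyd's iteration on scalars so that it runs without dependencies. The function names, the deterministic initialization, and the `Amount_c0`-style token format are all assumptions made for illustration.

```python
def kmeans_1d(values, k, iterations=20):
    """Lloyd's algorithm on scalars; returns the sorted cluster centers."""
    ordered = sorted(values)
    # Deterministic initialization: k values spread across the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the previous center when a cluster receives no points.
        centers = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
    return sorted(centers)

def numeric_token(column, value, centers):
    """Replace a number with a token naming its column and nearest cluster."""
    cluster = min(range(len(centers)), key=lambda i: abs(value - centers[i]))
    return f"{column}_c{cluster}"

amounts = [200, 60, 80, 25]  # the Amount column of the customer table
centers = kmeans_1d(amounts, k=2)
tokens = [numeric_token("Amount", a, centers) for a in amounts]
print(centers)
print(tokens)
```

Mapping numbers to cluster tokens lets nearby values share a token, so that they can contribute to each other's context during embedding training instead of each appearing as a rare, unrelated word.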

13. Why Spark?
• Usability across multiple user domains: the database community (Spark SQL + UDFs, Scala/Python) and the data science community (PySpark/Pandas APIs via Jupyter)
• Dataframes-based representation; version label on the slide: 2.2.0
• Portability across multiple platforms/OS (IBM Z zOS/zLinux, IBM P Linux, AIX, x86)
• Support for Spark SQL based Cognitive Intelligence queries using standardized SQL queries
• GPU acceleration: opportunities for acceleration
• Flexibility over multiple input data formats (relational databases, CSV files, JSON, …)

14. Cognitive Database: Spark Execution Flow
[Figure: the input table (source data) is loaded into a Spark DataFrame; a Spark SQL query invokes a Spark SQL UDF that performs the similarity computation (specialized nearest neighbor) over the trained word embedding model; the output table is returned as a Spark DataFrame.]
SQL query shown on the slide:
    SELECT X.custID, X.custName, proximityAvg(X.InvestType, Y.InvestType)
    FROM cust X, cust Y
    WHERE Y.custID = '471' AND proximityAvg(X.InvestType, Y.InvestType)
    LIMIT 5
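The kind of computation a similarity UDF such as proximityAvg might perform can be sketched in plain Python. The slides do not give the UDF's definition, so the averaging of pairwise cosine similarities below is an assumption suggested only by its name. The two-dimensional vectors, the customer data, and the helper names are likewise hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def proximity_avg(tokens_x, tokens_y, model):
    """Average pairwise similarity of two token sets (assumed semantics).

    Tokens missing from the model are skipped.
    """
    sims = [cosine(model[a], model[b])
            for a in tokens_x if a in model
            for b in tokens_y if b in model]
    return sum(sims) / len(sims) if sims else 0.0

def top_similar(cust_id, table, model, limit=5):
    """Rank other customers by similarity to cust_id, best first."""
    target = table[cust_id]
    # Choice made in this sketch: the queried customer is left out of the result.
    scored = [(other, proximity_avg(tokens, target, model))
              for other, tokens in table.items() if other != cust_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

# Hypothetical embedding model and InvestType values per customer.
model = {"stocks": [1.0, 0.0], "bonds": [0.6, 0.8], "real_estate": [0.0, 1.0]}
cust = {
    "471": ["stocks"],
    "102": ["stocks", "bonds"],
    "233": ["real_estate"],
    "518": ["bonds"],
}

ranking = top_similar("471", cust, model)
print(ranking)
```

In the Spark flow on the slide, a function of this shape would be registered as a Spark SQL UDF and the ranking would be expressed in SQL; the sketch keeps everything in local Python to stay self-contained.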

15. Invoking Cognitive Database in Jupyter

16. Case Study: Application Database with links to images
Picture ID | National Park | Country      | Path of JPEG Image
PK_01      | Corbett       | India        | ./Img_Folder/Img_01.JPEG
PK_05      | Kruger        | South Africa | ./Img_Folder/Img_05.JPEG
PK_09      | Sunderbans    | India        | ./Img_Folder/Img_09.JPEG
PK_11      | Serengeti     | Tanzania     | ./Img_Folder/Img_11.JPEG
Internal training database with features extracted from linked images:
Picture Id | Image Id    | National Park | Country      | Animal Name | Class   | Dietary Habit | color
PK_01      | Img_01.JPEG | Corbett       | India        | Elephant    | Mammal  | Herbivores    | Gray
PK_05      | Img_05.JPEG | Kruger        | South Africa | Rhinoceros  | Mammal  | Herbivores    | Gray
PK_09      | Img_09.JPEG | Sunderbans    | India        | Crocodile   | Reptile | Carnivorous   | Gray
PK_11      | Img_11.JPEG | Serengeti     | Tanzania     | Lion        | Mammal  | Carnivorous   | Yellow
The above merged data is used as input to train the word embedding model, which generates an embedding for each unique token based on its neighborhood. Each row of the database is viewed as a sentence.

17. CI Semantic Clustering Query
Find all images whose similarity to the user-chosen images of [lion, vulture, shark], computed using the attributeSimAvg UDF, is greater than 0.75.
    SELECT X.imagename, X.classA, X.classB, X.classC, X.classD
    FROM ImageDataTable X
    WHERE (X.imagename <> 'n01314663_7147.jpeg')
      AND (X.imagename <> 'n01323781_13094.jpeg')
      AND (X.imagename <> 'n01314663_8531.jpeg')
      AND (attributeSimAvgUDF('n01314663_7147.jpeg', 'n01323781_13094.jpeg',
                              'n01314663_8531.jpeg', X.imagename) > 0.75)

18. Output
Columns: X.Imagename, X.classB, X.classC, X.classD (cell boundaries were lost in extraction; values are listed per image in their extracted order)
• n01604330_12473: bird_of_prey, new_world_vulture, andean_condor, condor, sloth_bear mammal carnivore
• n01316422_1684: mammal, carnivore, eagle glutton_wolverine, piste_ski_run, bird_of_prey downhill_skiing, ern, ski_slope
• n01324431_7056: bird_of_prey, new_world_vulture, andean_condor, tayra mammal carnivore
Images shown: n01604330_12473, n01316422_1684, n01324431_7056

19. CI Analogy Query
Find all images whose classD satisfies the analogy query [reptile : monitor_lizard :: aquatic_vertebrate : ?] using the analogyQuery UDF, with a similarity score greater than 0.5.
    SELECT X.imagename, X.classA, X.classB, X.classC, X.classD
    FROM ImageDataTable X
    WHERE (analogyQuery('reptile', 'monitor_lizard', 'aquatic_vertebrate', X.classD, 1) > 0.5)
Output:
X.Imagename    | X.classB           | X.classC          | X.classD
n02512053_1493 | aquatic_vertebrate | spiny_finned_fish | permit, archerfish
n02512053_3292 | aquatic_vertebrate | spiny_finned_fish | archerfish, mojarra
n02512053_602  | aquatic_vertebrate | spiny_finned_fish | lookdown, permit
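An analogy of the form a : b :: c : ? is commonly scored with vector arithmetic: build the target vector b - a + c and compare each candidate to it with cosine similarity. The sketch below uses that common approach as an assumption; the slides do not state how the analogyQuery UDF is implemented, and the meaning of its fifth argument is not covered here. The three-dimensional vectors and helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def analogy_score(a, b, c, candidate, model):
    """Score how well `candidate` completes the analogy a : b :: c : ?"""
    # Target vector: the offset from a to b, applied to c.
    target = [vb - va + vc
              for va, vb, vc in zip(model[a], model[b], model[c])]
    return cosine(target, model[candidate])

# Hypothetical toy embeddings (assumed values, not from a trained model).
model = {
    "reptile": [1.0, 0.0, 0.0],
    "monitor_lizard": [1.0, 0.0, 1.0],
    "aquatic_vertebrate": [0.0, 1.0, 0.0],
    "archerfish": [0.0, 1.0, 1.0],
    "sloth_bear": [0.5, 0.0, 0.0],
}

candidates = ["archerfish", "sloth_bear"]
matches = [
    name for name in candidates
    if analogy_score("reptile", "monitor_lizard", "aquatic_vertebrate",
                     name, model) > 0.5
]
print(matches)
```

With these toy vectors only the fish token passes the 0.5 threshold, mirroring the shape of the output table on the slide.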

20. CI Query using external knowledge base
Find all images of animals whose classD similarity score to the concept of "Hypercarnivore" from Wikipedia, computed using the proximityAvgForExtKB UDF, is greater than 0.5. Exclude images that are already tagged as carnivore, herbivore, omnivore or scavenger.
    SELECT X.imagename, X.classA, X.classB, X.classC, X.classD
    FROM ImageDataTable X
    WHERE (proximityAvgAdvForExtKB('CONCEPT_Hypercarnivore', X.classD) > 0.5)
    ORDER BY SimScore DESC

21. Summary
• Novel relational database system that uses the word embedding approach to enable semantic queries in SQL
• Spark-based implementation that loads data from a variety of sources and invokes Cognitive Intelligence queries using Spark SQL
• Demonstration of the cognitive database capabilities using a multi-modal (text+image) dataset
• Illustration of seamlessly integrating AI capabilities into the relational database ecosystem

22. References
• Bordawekar and Shmueli, Enabling Cognitive Intelligence Queries in Relational Databases using Low-dimensional Word Embeddings, arXiv:1603.07185, March 2016
• Bordawekar, Bandopadhyay, and Shmueli, Cognitive Database: A Step Towards Endowing Relational Databases with Artificial Intelligence Capabilities, arXiv:1712.07199, December 2017