ASTERIX: Scalable Warehouse-Style Web Data Integration


Sattam Alsubaiee, Alexander Behm, Raman Grover, Rares Vernica,
Vinayak Borkar, Michael J. Carey, Chen Li
University of California, Irvine
{salsubai,abehm,ramang,vborkar,mjcarey,chenli}@ics.uci.edu, rares.vernica@hp.com

ABSTRACT
A growing wealth of digital information is being generated on a daily basis in social networks, blogs, online communities, etc. Organizations and researchers in a wide variety of domains recognize that there is tremendous value and insight to be gained by warehousing this emerging data and making it available for querying, analysis, and other purposes. This new breed of "Big Data" applications poses challenging requirements on data management platforms in terms of scalability, flexibility, manageability, and analysis capabilities. At UC Irvine, we are building a next-generation database system, called ASTERIX, in response to these trends. We present ongoing work that approaches the following questions: How does data get into the system? What primitives should we provide to better cope with dirty/noisy data? How can we support efficient data analysis on spatial data? Using real examples, we show the capabilities of ASTERIX for ingesting data via feeds, supporting set-similarity predicates for fuzzy matching, and answering spatial aggregation queries.

Categories and Subject Descriptors
H.2.4 [DATABASE MANAGEMENT]: Systems; H.2.7 [DATABASE MANAGEMENT]: Database Administration—Data warehouse and repository

General Terms
Design, Management, Performance

Keywords
Data-intensive computing, Cloud computing, Semistructured data, ASTERIX, Hyracks

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
IIWeb '12, May 20, 2012, Scottsdale, AZ, USA.
Copyright 2012 ACM 978-1-4503-1239-4/12/05 ...$10.00.

1. INTRODUCTION
We started the ASTERIX project [1, 5] at UC Irvine approximately two and a half years ago. Our goal at the outset was to design and implement a highly scalable platform for information storage, search, and analytics. By combining and extending ideas from semistructured data management, parallel database systems, and first-generation data-intensive computing platforms (MapReduce and Hadoop), ASTERIX was envisioned to be a parallel, semistructured information management system with the ability to ingest, store, index, query, analyze, and publish very large quantities of semistructured data. ASTERIX is well suited to use cases ranging all the way from rigid, relation-like data collections, whose types are well understood and invariant, to flexible and more complex data, where little is known a priori and the instances in data collections are highly variant and self-describing.

Traditionally, there have been two major approaches to integrating data from disparate sources [7]: data warehousing and virtual data integration (a.k.a. federation), both of which have been extensively studied and implemented. ASTERIX follows a "web warehousing" philosophy in which social and web data are ingested into, and analyzed on, a single scalable platform. In this paper, we introduce specific features of ASTERIX that emerged from this goal.

Figure 1 provides an overview of how the various software components of ASTERIX map to nodes in a shared-nothing cluster. The bottom-most layer provides storage facilities for managed datasets based on LSM-trees, which can be targets of ingestion. Further up the stack lies our data-parallel runtime called Hyracks [8], which sits at roughly the same level that Hadoop does in implementations of other high-level languages such as Pig [17], Hive [3], or Jaql [11]. The topmost layer of ASTERIX is a parallel DBMS, with a full, flexible data model (ADM) and query language (AQL) for describing, querying, and analyzing data. AQL is comparable to languages such as Pig, Hive, or Jaql, but ADM and AQL support both native storage and indexing of data as well as access to external data (e.g., in HDFS). As part of the AQL compiler, we have developed Algebricks, a model-agnostic, algebraic "virtual machine" for optimizing parallel queries. Algebricks is the target for AQL compilation, but it can also be the target for other declarative data languages.

1.1 EXAMPLE SCENARIO
Consider the upcoming US presidential elections. The elections are bound to generate a significant amount of online activity (tweets, blogs, etc.) and are expected to be extensively covered by news media. Each tweet or news article related to the event contributes to a large information repository that can provide useful insights if subjected to analysis. We will use a simple example based on the election context to show the capabilities of ASTERIX.

The ASTERIX data model (ADM) is based on ideas from JSON, with additional primitive types as well as type constructors borrowed from object databases [16]. Figure 2 shows how tweets and CNN news articles can be represented as records using ADM. Notice that the record types shown in the figure are open types, signifying that instances of these types will conform to their specifications but are allowed to contain arbitrary additional fields that vary from one instance to the next. The example also illustrates how ADM includes features such as optional fields with known types (location in TweetType) and nested collections of primitive values (hashTags in TweetType).
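The contract of an open type can be illustrated outside of ADM as well. The following Python sketch is an illustration only, not ASTERIX code: the type-description format and the extra field name are assumptions. A record conforms if all required fields are present; optional fields may be absent, and arbitrary extra fields are allowed.

```python
# Illustrative sketch of an ADM-style "open" type, mirroring TweetType
# in Figure 2. The dict-based type description is hypothetical.

TWEET_TYPE = {
    "required": {"id", "username", "text", "timestamp"},
    "optional": {"location", "hashTags"},
}

def conforms_to_open_type(record, type_desc):
    """A record conforms to an open type if every required field is
    present; optional and extra fields are unconstrained."""
    return type_desc["required"].issubset(record)

tweet = {
    "id": "t1",
    "username": "voter42",
    "text": "Primary night! #Santorum",
    "timestamp": "2012-02-21T19:02:00Z",
    "retweetCount": 7,   # extra field: permitted by an open type
}
```

A closed type would additionally reject fields outside the declared set; open types deliberately skip that check.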

[Figure 1: ASTERIX system architecture. A shared-nothing cluster connected by a high-speed network; each node runs an Asterix Client Interface, an AQL Compiler and Metadata Manager, the Hyracks Dataflow Layer, Dataset/Feed Storage, and an LSM Tree Manager over local disks. Data loads and feeds arrive from external sources (JSON, XML, ...); data and query results are published to external sources and applications.]

    create type TweetType            create type NewsType
    as open {                        as open {
      id: string,                      id: string,
      username: string,                title: string,
      location: point?,                description: string,
      text: string,                    link: string,
      timestamp: string,               topics: {{string}}?
      hashTags: {{string}}?          };
    };

    Figure 2: Metadata definition for the running example.

Data storage in ASTERIX is based on the concept of a "dataset", a declared collection of instances of a given type. ASTERIX supports both system-managed datasets, which are stored and managed by ASTERIX as partitioned LSM-based B+ trees with optional secondary indexes, and external datasets, where the data can reside in existing HDFS files or collections of files in the cluster nodes' local file systems. More information about ADM can be found in [5].

2. DATA INGESTION/FEEDS
To amass data from services such as Twitter or CNN news, and serve as an effective platform for tracking and analyzing social media activity, ASTERIX supports continuous data ingestion via data feeds. Data can be collected from a wide variety of sources using wrappers (adapters in ASTERIX) that abstract away the mechanism of connecting with an external service, receiving data in either push or pull mode, and transforming the data into ADM records understood by ASTERIX. Continuously arriving data is persisted in ASTERIX as a feed dataset. Feed datasets differ from regular datasets only in where their data comes from; they are essentially append-only datasets bundled with connections to a data provider.

Figure 3 shows AQL DDL statements to create a Twitter feed and a CNN news feed. In the context of our running example, the Twitter feed Tweets contains ADM records that conform to the TweetType described in Figure 2. Similarly, the CNN news feed News contains instances of NewsType. Each feed has an associated adapter and may be given adapter-specific configuration parameters. The TwitterAdapter in the example is configured with a time interval (seconds) between successive requests to the remote Twitter service for fetching tweets.

To facilitate analysis on the received data, a user may specify a pre-processing function which is applied to each feed record before persisting it to a dataset. For example, the Twitter feed utilizes a user-defined function addHashTagsToTweet that extracts all hash tags¹ in a tweet. The extracted hash tags are appended to the feed record and used subsequently for further analysis (Sections 3 and 4). Similar pre-processing to extract topics is applied to each news feed record. Like any ASTERIX dataset, a feed definition also requires a partitioning clause that informs ASTERIX about how to partition the incoming data. In our running example, we simply hash partition both feed datasets by their id.

    create feed dataset Tweets(TweetType)
    using TwitterAdapter("interval"="10")
    apply function addHashTagsToTweet
    partitioned by key id;

    create feed dataset News(NewsType)
    using CNNFeedAdapter("topic"="politics","interval"="600")
    apply function getTaggedNews
    partitioned by key id;

    create index locationIndex on Tweets(location) type rtree;

    begin feed Tweets;
    begin feed News;

    Figure 3: Feed definitions for the running example.

Once a data feed has been defined, data ingestion can be triggered with a begin feed statement² (as shown in Figure 3). Feed datasets are first-class citizens; a user may write AQL queries against them, create secondary indexes, etc. In our example, we create an R-tree index on the location attribute of TweetType.

Runtime Execution: A feed ingestion workflow is a DAG executed as an ever-running parallel Hyracks job. The adapter associated with the feed may run on multiple nodes. Each adapter instance connects to the external feed source and receives data. The output ADM records are spread across the cluster nodes based on the partitioning criterion specified in the feed definition. Any pre-processing (user-defined function) is applied at the recipient nodes before the ADM instances are inserted in the local LSM-B+ tree(s) for that dataset. Secondary indexes are also updated if necessary.

Related Work: Data feeds may seem similar to streams from the data stream management systems literature [2, 4] or to complex event processing systems [13, 24]. There are several important differences, however. Data feeds in ASTERIX are a "plumbing" concept; they are simply the mechanism for having data arrive into the ASTERIX system from external sources that produce data continuously, and for having that data incrementally populate a persisted dataset. To our knowledge, this will be the first system to explore the challenges involved in building a feed ingestion facility that deals with semi-structured data and employs partitioned parallelism in order to scale the facility and couple it with high-volume and/or parallel external data sources. The most closely related work is the AT&T Bistro data feed management system [21], but the focus of that work is on routing large amounts of file-based data from pre-determined feeds to the applications that need access to them.

¹ Words beginning with #; hash tags in a tweet are symbolic of topics associated with the tweet.
² ASTERIX also provides suspend feed, resume feed, and end feed statements for controlling the lifecycle of a feed.
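The ingestion path just described (an adapter fetches records, a user-defined function enriches each one, and records are hash-partitioned by key across nodes) can be illustrated with a small, single-process Python sketch. This is not ASTERIX code: the record layout, the hashtag regex, and the three-way partitioning are assumptions for illustration only.

```python
import hashlib
import re

NUM_PARTITIONS = 3  # hypothetical cluster size

def add_hash_tags_to_tweet(record):
    """Pre-processing UDF: append extracted hash tags to the record,
    mirroring the role of addHashTagsToTweet in Figure 3."""
    record["hashTags"] = re.findall(r"#(\w+)", record["text"])
    return record

def partition_for(record):
    """Hash-partition by primary key, as in 'partitioned by key id'."""
    digest = hashlib.md5(record["id"].encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def ingest(feed_records, preprocess):
    """Route each pre-processed record to a per-partition store."""
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for record in feed_records:
        record = preprocess(record)
        partitions[partition_for(record)].append(record)
    return partitions

tweets = [
    {"id": "t1", "text": "Primary night! #Santorum #SuperTuesday"},
    {"id": "t2", "text": "Watching the debate"},
]
parts = ingest(tweets, add_hash_tags_to_tweet)
```

In the real system each partition is a node-local LSM-B+ tree rather than a Python list, and the adapter, the UDF, and the partitioned insert run as stages of a continuously running parallel Hyracks job.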

3. FUZZY MATCHING
Having as a target use case the archiving, querying, and analysis of semistructured data drawn from Web sources, it is evident that data quality is an issue. Social network users often publish data informally, and can post tweets from mobile devices, resulting in many abbreviated keywords and typos. Fuzzy matching capabilities need to be added to ASTERIX, and we are in the process of adding fuzzy selection and join queries, described as follows.

Fuzzy Selection Queries: Many people mistype (perhaps purposely) the name of a currently prominent politician, "Rick Santorum", as "Rick Sanitarium". Therefore, when querying the system for tweets about Rick Santorum, we should include similar matches as well, e.g., using edit distance. One may even infer the sentiment of the tweeter based on the misspelling. Currently, we are in the process of adding support for fuzzy selection queries.

Fuzzy Join Queries: In addition, analyzing such data, e.g., to make recommendations or to identify sub-populations and trends in social networks, often requires the matching of multiple sets of data based on set-similarity measures. For example, suppose a news provider like CNN wants to optimize its web page layout by analyzing the success of past news stories. Apart from article-specific measures, such an analysis could take into account the general impact of news stories on certain combinations of topics on the Twitter community. The AQL query in Figure 4 illustrates a possible query formulation. Intuitively, the query returns the top ten most popular articles based on their relevance to topics mentioned in tweets. The first part of the query (up to group by) generates all pairs of tweets and news articles that have similar topics based on the Jaccard similarity of their topic lists. The ~= operator means "similar to", and has been qualified with set simfunction and set simthreshold. The second part of the query counts the number of related tweets per news article, and returns the top ten articles with the highest tweet count.

    set simfunction "jaccard"
    set simthreshold "0.5f"

    for $tweet in dataset('Tweets')
    for $article in dataset('News')
    where $tweet.hashTags ~= $article.topics
    group by $a := $article.id with $article
    order by count($article)
    limit 10
    return {"article": $article, "popularity": count($article)}

    Figure 4: A set-similarity join to find the top ten most popular news articles based on their relevance to topics in tweets.

Runtime Execution: Executing queries with set-similarity (or string-similarity) predicates is challenging on large amounts of data. First, the predicates themselves are expensive, rendering brute-force solutions impractical. For example, computing the Jaccard similarity requires the union and intersection of two sets, and the classic edit distance algorithm uses dynamic programming, all of which are computationally expensive. Second, standard divide-and-conquer strategies based on hash partitioning are not directly applicable, because similar items may not have the same hash value. Currently, ASTERIX can execute fuzzy joins efficiently based on principles that we developed while studying how to perform fuzzy joins in the context of Hadoop [23, 22] (without pre-existing indexes). We have recently started implementing indexed support for such fuzzy queries based on our earlier work in [6]. We expect to speed up both selection and join queries with secondary indexes.

Related Work: There have been many studies on set-similarity and string-similarity selection queries [15, 10] and join queries [20, 25], some in the context of relational database systems [9]. Our work in ASTERIX complements this previous work in that we are extending the core (single-machine) techniques to implement end-to-end support for set-similarity queries in a parallel database system, including parallel joins and optimizations via distributed secondary indexes.

4. SPATIAL AGGREGATION
Large volumes of events and social data can be aggregated and analyzed to derive knowledge valuable to businesses, governments, and society. For instance, consider the case where the campaign manager of a presidential candidate such as Mitt Romney wants to know how potential voters are reacting to the Republican presidential primaries in a certain geographic area.

A useful piece of information is the level of voters' interest in the rival Rick Santorum in February/March 2012 (close to Super Tuesday of the primary election) in different geographical regions. Such information is clearly valuable to the decision-making process of the campaign. The increasing availability and popularity of social data and event data make such information more readily available and more timely. We can derive such knowledge by doing a spatial aggregation on Twitter data as follows: We formulate a query to find the tweets mentioning "Santorum" posted from February 15, 2012 to March 1, 2012, group them on a grid structure, and compute the number of such tweets in each cell of the grid. By doing this spatial analysis, the campaign staff could gain an understanding of public opinion, and make informed decisions such as broadcasting more political ads in certain areas.

Given the importance of spatial aggregation queries, we have added capabilities for them in ASTERIX. Such a query specifies a grid structure, including a spatial range and a resolution, and asks for a density distribution (histogram) of data within the grid. The query may optionally include a time interval and keywords and do an aggregation on the data records satisfying these additional conditions. A spatial aggregation query partitions data records into groups and applies an aggregation function to all records in each group. Figure 5 shows a color-coded density grid on the map that visualizes the results of a spatial aggregation query using the Google Maps API.

[Figure 5: A visualization of the results of a spatial aggregation query. The color of each cell indicates the tweet count.]

Consider the AQL query in Figure 6, which spatially aggregates election-related tweets. It starts by constraining tweets to a bounding rectangle inside the US, a datetime window, and those containing the hashtag "Santorum". The spatial-cell function determines which grid cell a tweet belongs to. This function receives the location of the tweet, the origin of the bounding rectangle, and the latitude and longitude increments (to specify the resolution of the grid). It returns the cell (represented by a rectangle) that the tweet belongs to. Those tweets are then grouped according to their containing grid cells. Finally, the count function is applied to each group of tweets to return the final answer as pairs of cell and number of tweets (that satisfy the predicates) in that cell.
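The cell assignment and grouping just described can be sketched in a few lines of Python. This is only an illustration of what a spatial-cell-style function computes (snapping a point to the grid cell determined by an origin and latitude/longitude resolutions), followed by a group-by/count; it is not the ASTERIX implementation, and the (lat, long) tuple representation of points is an assumption.

```python
from collections import Counter

def spatial_cell(point, origin, lat_res, long_res):
    """Return the grid cell (identified by its lower-left corner) that
    contains `point`, for a grid anchored at `origin` with the given
    latitude/longitude cell sizes. Points are (lat, long) pairs."""
    lat, lon = point
    o_lat, o_lon = origin
    cell_lat = o_lat + ((lat - o_lat) // lat_res) * lat_res
    cell_lon = o_lon + ((lon - o_lon) // long_res) * long_res
    return (cell_lat, cell_lon)

def aggregate_by_cell(points, origin, lat_res, long_res):
    """Group points by containing cell and count per cell, mirroring
    the group by / count steps of the query in Figure 6."""
    return Counter(spatial_cell(p, origin, lat_res, long_res)
                   for p in points)

origin = (33.13, -124.27)  # $leftBottom in Figure 6
locations = [(33.7, -117.8), (34.1, -118.2),  # two LA-area tweets
             (40.7, -74.0)]                   # one NYC tweet
counts = aggregate_by_cell(locations, origin, 3.0, 3.0)
```

With a 3.0-degree resolution, the two Los Angeles area points fall into the same cell while the New York point lands in its own, so `counts` holds two cells with counts 2 and 1.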

    for $tweet in dataset('Tweets')
    let $searchHashTag := "Santorum"
    let $leftBottom := create-point(33.13,-124.27)
    let $rightTop := create-point(48.57,-66.18)
    let $latResolution := 3.0
    let $longResolution := 3.0
    let $region := create-rectangle($leftBottom, $rightTop)
    where spatial-intersect($tweet.location, $region) and
          $tweet.time > datetime("2012-02-15T00:00:00Z") and
          $tweet.time < datetime("2012-03-01T23:59:59Z") and
          some $hashTag in $tweet.hashTags
            satisfies ($hashTag = $searchHashTag)
    group by $c := spatial-cell($tweet.location, $leftBottom,
                                $latResolution, $longResolution)
    with $tweet
    return { "cell": $c, "count": count($tweet) }

    Figure 6: Spatial aggregation query over tweets that were generated by US users close to Super Tuesday of the Republican primary election, containing the hashtag "Santorum".

Runtime Execution: Since ASTERIX provides rich spatial support, spatial aggregation queries are executed efficiently by using a secondary R-tree index. Thus, all records outside of the query bounding region are filtered quickly. One path that we are investigating to further boost performance is to incrementally pre-aggregate the data into a spatial index (akin to a materialized view).

Related Work: There are existing studies on answering spatial aggregation queries [18, 12] and spatio-temporal aggregation [19], whose goal is to do an aggregation based on space and time conditions simultaneously. Mathioudakis et al. [14] proposed a framework to identify spatial burstiness assuming the space is decomposed using a grid-based layout. Spatial aggregation in ASTERIX is different from these earlier studies since we are interested in finding the density distribution of spatial objects, possibly with textual and temporal predicates. Moreover, in the ASTERIX system, spatial aggregation queries are part of a general spatial framework, where the goal is to support different types of useful spatial queries rather than supporting specific queries in an ad-hoc way.

5. FUTURE/ONGOING WORK
In this paper, we presented three key features of the ASTERIX system for warehousing and analyzing social and Web data: data feeds, fuzzy matching, and spatial aggregation. We have described our data feed mechanism for continuously ingesting data and showed how ASTERIX can help with inferring useful information via fuzzy matching and spatial aggregation. Currently, the ADM/AQL layer of ASTERIX is able to run parallel queries – including lookups, large scans, parallel joins (regular and fuzzy), and parallel aggregates – for data stored in partitioned LSM B+ trees and indexed via secondary indexes such as LSM-based R-trees. The system's external data access and data feed features are also operational. We plan to offer a first open-source release of ASTERIX during the latter part of 2012, and we are now seeking early partners who would like to try ASTERIX on their favorite "Big Data" problems. Our ongoing work includes hardening and documenting the ASTERIX code base for initial public release, adding indexing support for fuzzy selection queries, improving the performance of spatial aggregation, adding support for continuous queries, extending AQL with windowing features, and starting to work with a few early users and use cases to learn by experience where we should go next.

Acknowledgements: This project is supported by NSF IIS awards 0910989, 0910859, 0910820, and 0844574, a grant from the UC Discovery program, and a matching donation from eBay.

6. REFERENCES
[1] ASTERIX Website. http://asterix.ics.uci.edu/.
[2] D. J. Abadi et al. The design of the Borealis stream processing engine. In CIDR, 2005.
[3] Apache Hive. http://hadoop.apache.org/hive.
[4] A. Arasu et al. STREAM: The Stanford stream data manager. IEEE Data Eng. Bull., 26(1):19–26, 2003.
[5] A. Behm, V. R. Borkar, M. J. Carey, R. Grover, C. Li, N. Onose, R. Vernica, A. Deutsch, Y. Papakonstantinou, and V. J. Tsotras. ASTERIX: Towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases, 2011.
[6] A. Behm, C. Li, and M. J. Carey. Answering approximate string queries on large data sets using external memory. In ICDE, 2011.
[7] P. A. Bernstein and L. M. Haas. Information integration in the enterprise. Commun. ACM, 51(9):72–79, Sept. 2008.
[8] V. R. Borkar et al. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, 2011.
[9] L. Gravano et al. Approximate string joins in a database (almost) for free. In VLDB, pages 491–500, 2001.
[10] M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava. Fast indexes and algorithms for set similarity selection queries. In ICDE, 2008.
[11] Jaql. http://www.jaql.org.
[12] I. Lazaridis and S. Mehrotra. Progressive approximate aggregate queries with a multi-resolution tree structure. In SIGMOD, 2001.
[13] M. Li, M. Mani, E. A. Rundensteiner, and T. Lin. Complex event pattern detection over streams with interval-based temporal semantics. In Proc. of the 5th ACM Int. Conf. on Distributed Event-Based Systems, pages 291–302, 2011.
[14] M. Mathioudakis, N. Bansal, and N. Koudas. Identifying, attributing and describing spatial bursts. PVLDB, 3(1), 2010.
[15] G. Navarro. A guided tour to approximate string matching. ACM Comput. Surv., 33(1):31–88, 2001.
[16] Object database management systems. http://www.odbms.org/odmg/.
[17] C. Olston et al. Pig Latin: A not-so-foreign language for data processing. In SIGMOD, 2008.
[18] D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP operations in spatial data warehouses. In SSTD, 2001.
[19] D. Papadias, Y. Tao, P. Kalnis, and J. Zhang. Indexing spatio-temporal data warehouses. In ICDE, 2002.
[20] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
[21] V. Shkapenyuk, T. Johnson, and D. Srivastava. Bistro data feed management system. In SIGMOD, 2011.
[22] R. Vernica. Efficient Processing of Set-Similarity Joins on Large Clusters. Ph.D. thesis, UC Irvine, 2011.
[23] R. Vernica, M. J. Carey, and C. Li. Efficient parallel set-similarity joins using MapReduce. In SIGMOD, 2010.
[24] D. Wang, E. A. Rundensteiner, R. T. Ellison, and H. Wang. Active complex event processing infrastructure: Monitoring and reacting to event streams. In ICDE Workshops, 2011.
[25] C. Xiao, W. Wang, and X. Lin. Ed-Join: An efficient algorithm for similarity joins with edit distance constraints. In VLDB, 2008.