Why is large-scale data mining a thing?


1. CSCI 6900: Mining Massive Datasets
Shannon Quinn (with content graciously and viciously borrowed from William Cohen's 10-605 Machine Learning with Big Data and Stanford's MMDS MOOC, http://www.mmds.org/)

2. "Big Data"

3. Astronomy
Sloan Digital Sky Survey (New Mexico, 2000): 140TB over 10 years
Large Synoptic Survey Telescope (Chile, 2016): will acquire 140TB every five days 1
1 http://www.economist.com/node/15557443

4. Particle Physics
Large Hadron Collider (LHC): 150 million sensors; 40 million data points/second (before filtering); 100 collisions of interest (after filtering) 1
Even after rejecting 199,999 of every 200,000 collisions, generates 15PB of data per year 1,2
If all collisions were recorded, the LHC would generate 500EB of data per day; for comparison, ~900EB is transmitted over IP per year 3
1 http://cds.cern.ch/record/1092437/files/CERN-Brochure-2008-001-Eng.pdf
2 http://www.nature.com/news/2011/110119/full/469282a.html
3 http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/VNI_Hyperconnectivity_WP.html

5. Biology
Nucleotide sequences from 120,000+ species in GenBank 1
European Bioinformatics Institute (EBI): 20PB of data (genomic data doubles in size each year) 2
A single sequenced human genome can be around 140GB in size 2
Heterogeneous data, spread out over many labs
1 http://www.nature.com/nature/journal/v455/n7209/full/455047a.html
2 http://www.nature.com/nature/journal/v498/n7453/full/498255a.html

6.

7. Data Mining: knowledge discovery, "Big Data," "Predictive Analysis," "Data Science"

8. Data scientists in demand

9. Why is large-scale data mining a thing? Why not use the same algorithms on larger data?

10. Big ML c. 1993 (Cohen, "Efficient…Rule Learning", IJCAI 1993)

11. Big ML c. 1993 (Cohen, "Efficient…Rule Learning", IJCAI 1993)

12. Related paper from 1995…

13. So in the mid-1990s… experimental datasets were small, and many commonly used algorithms were asymptotically "slow"

14. Big ML c. 2001 (Banko & Brill, "Scaling to Very Very Large…", ACL 2001)
Task: distinguish pairs of easily-confused words ("affect" vs. "effect") in context

15. Big ML c. 2001 (Banko & Brill, "Scaling to Very Very Large…", ACL 2001)

16. So in 2001… we're learning "there's no data like more data": for many tasks, there's no real substitute for using lots of data

17. …and in 2009
Eugene Wigner's article "The Unreasonable Effectiveness of Mathematics in the Natural Sciences" examines why so much of physics can be neatly explained with simple mathematical formulas such as f = ma or e = mc². Meanwhile, sciences that involve human beings rather than elementary particles have proven more resistant to elegant mathematics. Economists suffer from physics envy over their inability to neatly model human behavior. An informal, incomplete grammar of the English language runs over 1,700 pages. Perhaps when it comes to natural language processing and related fields, we're doomed to complex theories that will never have the elegance of physics equations. But if that's so, we should stop acting as if our goal is to author extremely elegant theories, and instead embrace complexity and make use of the best ally we have: the unreasonable effectiveness of data.
Halevy, Norvig, Pereira, "The Unreasonable Effectiveness of Data", 2009

18. …and in 2012 [figure dated Dec 2011]

19. …and in 2013

20. How do we use very large amounts of data?
Working with big data is not about:
- code optimization
- learning the details of today's hardware/software: GraphLab, Hadoop, parallel hardware, …
Working with big data is about:
- understanding the cost of what you want to do
- understanding what the available tools offer
- understanding how much can be accomplished with linear or nearly-linear operations (e.g., sorting, …)
- understanding how to organize your computations so that they effectively use whatever's fast
- understanding how to test/debug/verify with large data*
* according to William Cohen / Shannon Quinn

21. Asymptotic Analysis: Basic Principles
Usually we only care about positive f(n), g(n), and n here…
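The formulas on this slide are an image in the original deck; a sketch of the standard definitions these principles rest on (the positivity restriction mentioned above is what keeps the inequalities clean):

```latex
% Standard asymptotic definitions, for positive f(n), g(n):
f(n) = O(g(n)) \iff \exists\, c > 0,\ n_0 \text{ s.t. } f(n) \le c\, g(n) \text{ for all } n \ge n_0
f(n) = \Omega(g(n)) \iff \exists\, c > 0,\ n_0 \text{ s.t. } f(n) \ge c\, g(n) \text{ for all } n \ge n_0
f(n) = \Theta(g(n)) \iff f(n) = O(g(n)) \text{ and } f(n) = \Omega(g(n))
```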

22. Asymptotic Analysis: Basic Principles
Less pedantically, some useful rules:
- Only highest-order terms matter
- Leading constants don't matter
- The degree of something inside a log doesn't matter
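Worked instances of the three rules (illustrative examples, not from the slide):

```latex
% Only highest-order terms matter:
3n^2 + 5n + 7 = O(n^2)
% Leading constants don't matter:
100\, n \log n = O(n \log n)
% The degree of something inside a log doesn't matter:
\log(n^k) = k \log n = O(\log n) \quad \text{for constant } k
```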

23. Empirical analysis of complexity: plot run-time on a log-log plot and measure the slope (using linear regression)
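A minimal sketch of this technique. To keep the result deterministic it counts basic operations of a toy quadratic algorithm rather than wall-clock time, then fits the log-log slope with ordinary least squares; with real timings the same fit applies, just with noise.

```python
import math

def quadratic_work(n):
    """Toy algorithm whose cost we measure by counting basic operations
    (a deterministic stand-in for wall-clock time)."""
    ops = 0
    for i in range(n):
        for j in range(n):
            ops += 1
    return ops

# Measure "run time" (operation count) at several input sizes.
sizes = [100, 200, 400, 800]
costs = [quadratic_work(n) for n in sizes]

# Fit a line to (log n, log cost) by ordinary least squares;
# the slope estimates the exponent k in cost ~ n^k.
xs = [math.log(n) for n in sizes]
ys = [math.log(c) for c in costs]
mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)

print(f"estimated exponent: {slope:.2f}")  # ~2.00 for a quadratic algorithm
```

In practice you would time each run (e.g. with `timeit`) and take the median of several repetitions before fitting, since real measurements are noisy.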

24. Where do asymptotics break down?
- When the constants are too big or n is too small
- When we can't predict what the program will do (e.g., how many iterations before convergence? Does that depend on data size or not?)
- When there are different types of operations with different costs: we need to understand what we should count

25. Where do asymptotics break down?
- When the constants are too big or n is too small
- When we can't predict what the program will do (e.g., how many iterations before convergence? Does that depend on data size or not?)
- When there are different types of operations with different costs: we need to understand what we should count
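The first bullet above can be made concrete with hypothetical cost models (the constants 0.5 and 40 are chosen purely for illustration): an asymptotically worse O(n²) algorithm with a small constant beats an O(n log n) algorithm with a large constant until n gets big enough.

```python
import math

# Hypothetical cost models: a "slow" O(n^2) algorithm with a tiny constant
# versus a "fast" O(n log n) algorithm with a large constant.
def quadratic_cost(n):
    return 0.5 * n * n

def nlogn_cost(n):
    return 40.0 * n * math.log2(n)

# For small n the asymptotically worse algorithm is actually cheaper;
# the crossover here happens somewhere below n = 1024.
for n in [16, 64, 256, 1024, 4096]:
    winner = "n^2" if quadratic_cost(n) < nlogn_cost(n) else "n log n"
    print(f"n = {n:5d}: cheaper algorithm is {winner}")
```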

26. Numbers (Jeff Dean says) Everyone Should Know
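The table on this slide is an image in the original deck. The values below are the approximate, widely circulated figures from Dean's talks (order-of-magnitude numbers, not exact measurements); computing the ratios between adjacent levels shows why the memory hierarchy dominates big-data costs:

```python
# Approximate latencies (nanoseconds) from Jeff Dean's widely circulated
# "Numbers Everyone Should Know" list; order-of-magnitude figures only.
latency_ns = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "main memory reference": 100,
    "round trip within datacenter": 500_000,
    "disk seek": 10_000_000,
}

# The gaps between levels are what make data layout matter:
# a single disk seek costs about 100,000 main-memory references.
l1, l2 = latency_ns["L1 cache reference"], latency_ns["L2 cache reference"]
ram, seek = latency_ns["main memory reference"], latency_ns["disk seek"]
print(f"L1 -> L2:        {l2 / l1:.0f}x")
print(f"L2 -> RAM:       {ram / l2:.0f}x")
print(f"RAM -> disk seek: {seek / ram:.0f}x")
```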

27. A typical CPU (not to scale)
K8 core in the AMD Athlon 64 CPU
[figure: successive zoom-outs labeled "16x bigger", "256x bigger", and "128x bigger", ending at a 1TB hard disk]

28. A typical disk

29. Numbers (Jeff Dean says) Everyone Should Know
[figure: the latency table annotated with relative gaps of ~10x, ~15x, ~100,000x, and 40x]