EDWARD'S LECTURE NOTES:
More notes at http://tanguay.info/learntracker
C O U R S E 
Introduction to Data Science
Bill Howe, University of Washington
https://www.coursera.org/course/datasci
C O U R S E   L E C T U R E 
Scalability Basics
Notes taken on July 2, 2014 by Edward Tanguay
what does scalability mean?
operationally
in the past, "scale up"
needs to work even if data doesn't fit in main memory on one machine
need to bring in data, work on it, then bring in more data, work on it, etc.
what was important was the size of the memory footprint
this was "out of core" processing of large datasets
but this stopped being enough, since you couldn't move data on and off disk fast enough to keep up with the load
now, "scale out"
you increase the speed at which you can process the data by making use of 1000s of cheap computers
algorithmically
in the past
an algorithm was considered scalable if, given n data items, it does no more than n^m operations, where m is a small number like 1 or 2; by m = 4 and beyond it becomes impractical for large datasets
this notion of tractability became the definition of scalable
now:
n^m/k where k is the number of computers you can apply to the problem
soon, "streaming data"
"if you have n data items, you should do no more that n * log(n) operations
as data grows, you may only get one pass at it
whenever you hear "log" you should think "trees": this bound lets you look at each item once and put it into some tree data structure
example: Large Synoptic Survey Telescope (30TB / night)
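a minimal Python sketch of this one-pass, n * log(n) idea (my own illustration, not from the lecture): each streamed item is looked at once and pushed onto a heap, a tree-like structure where each insert costs about log(n)

import heapq
import random

def one_pass(stream):
    # single pass: each item is seen once and pushed onto a binary heap
    # each push costs O(log n), so n items cost O(n log n) in total
    heap = []
    for item in stream:
        heapq.heappush(heap, item)
    return heap

# simulate a stream we only get to read once
stream = (random.random() for _ in range(1_000_000))
heap = one_pass(stream)
print(heapq.heappop(heap))  # smallest item seen, retrieved in O(log n)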
example 1: "Needle in Haystack"
find matching DNA sequences that equal "GATTACGATATTA"
linear search
move through each, compare, until you get a match
40 records, 40 comparisons
N records, N comparisons
algorithmic complexity is order N
sort the sequences
sort them first, then start in the middle, and move in the direction of the one you are looking for
after the first check you have eliminated half the data, and each further check halves what remains
40 records, only a handful of comparisons (about log2(40), i.e. 5 or 6)
but we had to sort the data ahead of time
algorithmic complexity is log(N) comparisons
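a rough Python sketch of the two strategies (the list of sequences is made-up toy data; only the target string comes from the lecture):

import bisect

sequences = ["ACGT", "CATTAG", "GATTACGATATTA", "TTAGGC"]  # toy data, already sorted
target = "GATTACGATATTA"

def linear_search(records, needle):
    # compare record by record until a match: O(N) comparisons
    for i, rec in enumerate(records):
        if rec == needle:
            return i
    return -1

def binary_search(sorted_records, needle):
    # each comparison halves the remaining data: O(log N) comparisons
    i = bisect.bisect_left(sorted_records, needle)
    if i < len(sorted_records) and sorted_records[i] == needle:
        return i
    return -1

print(linear_search(sequences, target))  # 2
print(binary_search(sequences, target))  # 2, found with only ~log2(N) comparisons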
this type of search approach is baked into databases, which are built for these "needle in haystack" problems
e.g. CREATE INDEX seq_idx ON sequence
this is old-style scalability, i.e. it fits in memory
databases give you a platform for building and reusing indexes
"there is a lot of work that you are getting for free just by turning your problem into an SQL statement"
example 2: "Read Trimming"
given a set of DNA sequences, trim a suffix off each one and create a new dataset
this is a standard preprocessing operation
linearly
loop through and process each
no index is going to help us here; you have to touch every record at least once
how can we do better?
the processing of each record is independent of the processing of the others
therefore, break the dataset into a number of groups and give the groups to different processors, which work on them simultaneously (see the sketch after this list)
complexity is N/k, where
N is the number of items to process
k is the number of workers
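a minimal Python sketch of this parallel read trimming (the fixed 3-base trim and the toy reads are my own stand-ins for whatever the real preprocessing rule is):

from multiprocessing import Pool

def trim_suffix(read, n=3):
    # drop the last n bases; a stand-in for a real trimming rule
    return read[:-n]

if __name__ == "__main__":
    reads = ["GATTACGATATTA", "ACGTACGTACGT", "TTAGGCCATT"] * 1000  # toy dataset

    k = 4  # number of workers
    with Pool(processes=k) as pool:
        # the pool splits the reads into chunks; each worker trims its chunk
        # independently, so wall-clock time is roughly N / k
        trimmed = pool.map(trim_suffix, reads)

    print(trimmed[:3])  # ['GATTACGATA', 'ACGTACGTA', 'TTAGGCC']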