EDWARD'S LECTURE NOTES:
More notes at http://tanguay.info/learntracker
C O U R S E 
Introduction to Data Science
Bill Howe, University of Washington
https://www.coursera.org/course/datasci
C O U R S E   L E C T U R E 
Scalability Basics
Notes taken on July 2, 2014 by Edward Tanguay
what does scalability mean?
operationally
in the past, "scale up"
needs to work even if data doesn't fit in main memory on one machine
need to bring in data, work on it, then bring in more data, work on it, etc.
what was important was the size of the memory footprint
this was "out of core" processing of large datasets
but this stopped being enough, since you couldn't move data on and off disk fast enough to keep up with the load
now, "scale out"
you increase the speed at which you can process the data by making use of 1000s of cheap computers
algorithmically
in the past
an algorithm was considered scalable if, given n data items, it does no more than n^m operations, where m is a small number like 1 or 2; by m = 4 and beyond it becomes impractical for large datasets
this notion of tractability became the definition of scalable
now:
n^m/k where k is the number of computers you can apply to the problem
soon, "streaming data"
"if you have n data items, you should do no more that n * log(n) operations
as data grows, you may only get one pass at it
whenever you hear "log" you should think "trees": this bound lets you look at each item once and put it into some tree data structure
example: Large Synoptic Survey Telescope (30TB / night)
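a minimal Python sketch of this one-pass, n * log(n) idea (my own illustration, not from the lecture): each streamed item is looked at once and pushed onto a heap, a tree-like structure where each insert costs about log(n)

import heapq
import random

def one_pass(stream):
    # single pass: each item is seen once and pushed onto a binary heap
    # each push costs O(log n), so n items cost O(n log n) in total
    heap = []
    for item in stream:
        heapq.heappush(heap, item)
    return heap

# simulate a stream we only get to read once
stream = (random.random() for _ in range(1_000_000))
heap = one_pass(stream)
print(heapq.heappop(heap))  # smallest item seen, retrieved in O(log n)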
example 1: "Needle in Haystack"
find matching DNA sequences that equal "GATTACGATATTA"
linear search
move through each, compare, until you get a match
40 records, 40 comparisons
N records, N comparisons
algorithmic complexity is order N
sort the sequences
sort them first, then start in the middle, and move in the direction of the one you are looking for
after the first check you have eliminated half the data, and each further check halves what remains
40 records, only a handful of comparisons (about log2(40), i.e. 5 or 6)
but we had to sort the data ahead of time
algorithmic complexity is log(N) comparisons
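a rough Python sketch of the two strategies (the list of sequences is made-up toy data; only the target string comes from the lecture):

import bisect

sequences = ["ACGT", "CATTAG", "GATTACGATATTA", "TTAGGC"]  # toy data, already sorted
target = "GATTACGATATTA"

def linear_search(records, needle):
    # compare record by record until a match: O(N) comparisons
    for i, rec in enumerate(records):
        if rec == needle:
            return i
    return -1

def binary_search(sorted_records, needle):
    # each comparison halves the remaining data: O(log N) comparisons
    i = bisect.bisect_left(sorted_records, needle)
    if i < len(sorted_records) and sorted_records[i] == needle:
        return i
    return -1

print(linear_search(sequences, target))  # 2
print(binary_search(sequences, target))  # 2, found with only ~log2(N) comparisons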
this type of search approach is baked into databases, which are built for these "needle in haystack" problems
e.g. CREATE INDEX seq_idx ON sequence
this is old-style scalability, i.e. it fits in memory
databases give you a platform for building and reusing indexes
"there is a lot of work that you are getting for free just by turning your problem into an SQL statement"
example 2: "Read Trimming"
given a set of DNA sequences, trim a suffix off each one and create a new dataset
this is a standard preprocessing operation
linearly
loop through and process each
no index is going to help us here; you have to touch every record at least once
how can we do better?
the processing of each record is independent of the processing of the others
therefore, break the dataset into a number of groups and give the groups to different processors, which work on them simultaneously (see the sketch after this list)
complexity is N/k, where
N is the number of items to process
k is the number of workers
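a minimal Python sketch of this parallel read trimming (the fixed 3-base trim and the toy reads are my own stand-ins for whatever the real preprocessing rule is):

from multiprocessing import Pool

def trim_suffix(read, n=3):
    # drop the last n bases; a stand-in for a real trimming rule
    return read[:-n]

if __name__ == "__main__":
    reads = ["GATTACGATATTA", "ACGTACGTACGT", "TTAGGCCATT"] * 1000  # toy dataset

    k = 4  # number of workers
    with Pool(processes=k) as pool:
        # the pool splits the reads into chunks; each worker trims its chunk
        # independently, so wall-clock time is roughly N / k
        trimmed = pool.map(trim_suffix, reads)

    print(trimmed[:3])  # ['GATTACGATA', 'ACGTACGTA', 'TTAGGCC']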