Spinner
Although HaLoop and Twister handle many iterative algorithms efficiently, many machine learning and graph algorithms still perform poorly on them, because these systems cannot exploit the (sparse) computational dependencies present in such tasks.
Spinner refers to the recomputed solution as the partial solution, and distinguishes two kinds of iteration: bulk iteration, where each iteration produces a completely new partial solution from the previous iteration's result, and incremental iteration, where each iteration's result differs only partially from the previous result.
Existing data flow systems support bulk iteration, where the whole result is consumed to produce a new one, whereas incremental iteration evolves the result by adding or updating some data points. This implies a mutable state that is carried over to the next iteration.
Comparing the two kinds of iteration:
1.) A bulk iteration is repeated until a termination condition is met, whereas in an incremental iteration the partial solution s is a set of data points, and the algorithm does not fully recompute si+1 from si but rather updates si by adding or updating some of its data points.
2.) The performance implications of bulk and incremental iterations may differ significantly: an incremental iteration touches only some of the data points instead of processing the whole data set in each step.
Spinner therefore supports both bulk and incremental iteration for computing iterative algorithms, which allows the computation to be performed more selectively and efficiently.
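The bulk style can be illustrated with a minimal sketch (hypothetical code, not from the paper): the step function consumes the whole partial solution and produces a completely new one, repeating until a termination condition is met.

```python
# Hypothetical sketch of a bulk iteration: every step recomputes the
# ENTIRE partial solution from the previous one, and the loop stops
# when a termination condition (convergence or max_iters) is met.

def bulk_iterate(step, initial, max_iters=100, tol=1e-9):
    s = initial
    for _ in range(max_iters):
        s_next = step(s)  # a completely new partial solution
        if all(abs(a - b) < tol for a, b in zip(s, s_next)):
            return s_next  # termination condition met
        s = s_next
    return s

# Toy fixed point: replacing every value by the mean converges immediately.
values = [0.0, 4.0, 8.0]
result = bulk_iterate(lambda s: [sum(s) / len(s)] * len(s), values)
# result is [4.0, 4.0, 4.0]
```

Note that even when only a few entries would actually change, the bulk form still rebuilds the whole solution each step, which is exactly the inefficiency incremental iteration avoids.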
An incremental iteration can be expressed as a bulk iteration over two data sets (S and W) for the partial solution, with a step function combining the update function u and the delta Δ. The step function reads both data sets and computes a new version of S and W. However, recall that the primary motivation for incremental iterations is to avoid creating a completely new version of the partial solution and to apply point updates instead; the updated partial solution should be implicitly carried to the next iteration.
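The S/W scheme can be sketched as follows with connected components, a classic example of an incrementally iterable algorithm (a hypothetical illustration, not the paper's implementation): S maps each vertex to a component ID, W holds the vertices whose entries changed, and each superstep applies point updates to S while producing the next working set.

```python
# Hypothetical sketch of an incremental (delta) iteration:
# S is the partial solution (vertex -> component id), W is the working
# set of changed vertices. Each superstep touches only vertices in W
# and applies point updates to S instead of recomputing it wholesale.

def incremental_cc(edges, vertices):
    S = {v: v for v in vertices}   # partial solution, mutated in place
    W = set(vertices)              # working set: initially everything
    neighbours = {v: [] for v in vertices}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    while W:                       # terminate when the working set is empty
        W_next = set()
        for v in W:                # only the changed data points are read
            for n in neighbours[v]:
                if S[v] < S[n]:
                    S[n] = S[v]    # point update to S
                    W_next.add(n)  # n changed, so revisit it next superstep
        W = W_next
    return S

components = incremental_cc([(1, 2), (2, 3), (4, 5)], [1, 2, 3, 4, 5])
# vertices 1-3 end up in component 1, vertices 4-5 in component 4
```

The empty working set doubles as the termination condition: once no entry of S changes, no further work is scheduled, which is how the updated partial solution is implicitly carried forward between iterations.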
Performance Evaluation
Figure 1 compares Spinner (Stratosphere) with other frameworks such as Spark and Giraph.
Figure 2 compares the execution times of running the algorithm on various datasets. The runtime is similar in Spark, Giraph, and Stratosphere for the small Wikipedia dataset. Spark and Giraph could not be used with the large datasets, because the number of messages created exceeds the heap size on each node.
References
https://www.stratosphere.eu/sites/default/files/papers/spinningFastIterativeDataFlows_12.pdf