Recently there was an interesting problem on Project Euler, Problem 415, called Titanic Sets. The problem essentially asks how many sets can be formed from the points of a GxG lattice grid such that at least one line passes through exactly two points of the set. This is not a very hard problem if the grid size is reasonable. But when the grid size is 10^11, the problem all of a sudden becomes extremely challenging. In some ways this is similar to the big data problem. Everyone has been dealing with data for the last few decades. But the volume of data being gathered in recent years is what makes it more challenging.
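For small grids the problem really is easy: a brute force over all subsets works. Here is a minimal sketch of that brute force, assuming my paraphrase of the statement above is accurate; it only scales to tiny grids, since the number of subsets grows as 2 to the power of the number of points.

```cpp
#include <cstdint>
#include <iostream>
#include <utility>
#include <vector>

// Brute force for tiny grids: a set S is "titanic" if some line passes
// through exactly two points of S. We enumerate every subset of the
// (N+1)x(N+1) lattice points and test that condition directly.
int main() {
    const int N = 2;  // grid coordinates run 0..N (only tiny N is feasible)

    std::vector<std::pair<int, int>> pts;
    for (int x = 0; x <= N; ++x)
        for (int y = 0; y <= N; ++y)
            pts.push_back({x, y});

    const int P = static_cast<int>(pts.size());
    std::uint64_t titanic = 0;

    for (std::uint64_t mask = 1; mask < (1ULL << P); ++mask) {
        bool found = false;
        for (int i = 0; i < P && !found; ++i) {
            if (!((mask >> i) & 1)) continue;
            for (int j = i + 1; j < P && !found; ++j) {
                if (!((mask >> j) & 1)) continue;
                bool exactlyTwo = true;
                for (int k = 0; k < P; ++k) {
                    if (k == i || k == j || !((mask >> k) & 1)) continue;
                    // Cross product is zero iff point k lies on the line i-j.
                    long long cross =
                        1LL * (pts[j].first - pts[i].first) * (pts[k].second - pts[i].second) -
                        1LL * (pts[j].second - pts[i].second) * (pts[k].first - pts[i].first);
                    if (cross == 0) { exactlyTwo = false; break; }
                }
                if (exactlyTwo) found = true;  // the line through i and j works
            }
        }
        if (found) ++titanic;
    }

    std::cout << "titanic sets for N=" << N << ": " << titanic << "\n";
}
```

Clearly nothing like this survives a grid size of 10^11; that is the whole point of the problem.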
To cope with this data volume, we now have several Big Data solutions. Any big data solution typically consists of a component to store lots of data and another component to process all that data. Map/Reduce is one such component for processing large volumes of data.
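For those who haven't seen it, the shape of a Map/Reduce computation is simple: map each input record to key/value pairs, group the pairs by key, and reduce each group to a result. The sketch below is the canonical word-count example collapsed into a single process, just to show that shape; a real framework distributes the map and reduce phases across the cluster.

```cpp
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Toy, single-process illustration of the Map/Reduce shape:
// map records to (key, value) pairs, group by key, reduce each group.
int main() {
    std::vector<std::string> records = {"to be or not to be", "be quick"};

    // Map phase: emit (word, 1) for every word of every record.
    std::vector<std::pair<std::string, int>> mapped;
    for (const auto& rec : records) {
        std::istringstream in(rec);
        std::string word;
        while (in >> word) mapped.push_back({word, 1});
    }

    // Shuffle/group phase: collect values belonging to the same key.
    std::map<std::string, std::vector<int>> grouped;
    for (const auto& kv : mapped) grouped[kv.first].push_back(kv.second);

    // Reduce phase: fold each group down to a single count.
    for (const auto& g : grouped) {
        int total = 0;
        for (int v : g.second) total += v;
        std::cout << g.first << ": " << total << "\n";
    }
}
```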
Now, coming back to the Euler problem: as of this writing there are only 41 solvers, even though the problem has been up for 15 days. That is unusual, because the smart people on Project Euler typically solve a problem within a couple of hours to just a few days, filling up the top 50 to 100 spots. So the fastest-solvers table is a good indicator of a problem's complexity. Of course, what's hard for one person may be very easy for another and vice versa. However, the table does serve as a good indicator on average.
When I managed to solve this problem and looked at the private forum, I found that one of the solutions took less than 5 minutes running on a 2,000-node cluster! Another PE member estimated that his current algorithm would take 6+ days.
Isn’t it excellent to be able to use a big cluster and get the result in just a few minutes instead of days? Finally we are able to make use of all that silicon that Intel is producing. There is clearly a benefit, and knowledge to be gained, in learning to program algorithms for distributed computing, and actually pulling off something like that in practice is highly commendable (I don’t have access to such large clusters, so I must say I envy people who do :)).
In comparison, my C++ code took 1 minute 22 seconds running single-threaded on a Core i7 2.8 GHz machine. A few solutions are within the one-minute rule. And the person who proposed the problem actually has a solution that gives the result within 13 seconds! These algorithms are sub-linear and hence very fast even for an input on the order of 10^11.
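To give a feel for what "sub-linear" means here (without spoiling the actual algorithm for Problem 415, which I won't reproduce), here is a classic example of the kind of trick such problems reward: the sum of floor(n/k) over k = 1..n, computed in O(sqrt(n)) steps by grouping all k that share a quotient. For n = 10^11 that is roughly 630,000 loop iterations instead of 10^11.

```cpp
#include <cstdint>
#include <iostream>

// A standard sub-linear trick (not the Problem 415 algorithm): compute
// sum_{k=1..n} floor(n/k) in O(sqrt(n)) by grouping all k that share the
// same quotient, instead of looping over all n values of k.
std::uint64_t divisor_summatory(std::uint64_t n) {
    std::uint64_t total = 0;
    for (std::uint64_t k = 1; k <= n; ) {
        std::uint64_t q = n / k;      // quotient is constant on [k, n/q]
        std::uint64_t last = n / q;   // largest k with the same quotient
        total += q * (last - k + 1);  // add the whole block in one step
        k = last + 1;
    }
    return total;
}

int main() {
    // About 2 * sqrt(10^11), roughly 630,000 iterations, instead of 10^11.
    std::cout << divisor_summatory(100000000000ULL) << "\n";
}
```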
Clearly, when faced with a large computational problem, one has choices on how to reduce the computation time. One is to reduce the order of the computation with more and more sophisticated algorithms; the other is to go the distributed-computing route and use Map/Reduce or some similar infrastructure. Of course, not every problem can be solved by reducing the order, especially when dealing with large volumes of data where the bottleneck is the data sitting on the disks (as opposed to this Project Euler problem, which is driven by a large number rather than by data). In such cases it is definitely better to use distributed computing and take advantage of the aggregate I/O bandwidth of multiple storage devices rather than a single device. Perhaps a hybrid approach can provide the best of both worlds.
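As a rough sketch of what such a hybrid could look like (scaled down to one machine, with threads standing in for cluster nodes), here is the same toy divisor sum from above with its k-range partitioned across workers. This is purely illustrative and not the algorithm for the Euler problem; in particular, equal-sized chunks leave most of the work in the first chunk, so a real system would partition more carefully.

```cpp
#include <algorithm>
#include <cstdint>
#include <future>
#include <iostream>
#include <vector>

// Hybrid sketch: keep the sub-linear grouping from the previous snippet,
// but split the k-range across workers (threads here, nodes in a cluster).
// Equal-sized chunks are NOT load-balanced (most iterations fall in the
// first chunk); this only illustrates the structure. Build with -pthread.
std::uint64_t partial_sum(std::uint64_t n, std::uint64_t lo, std::uint64_t hi) {
    std::uint64_t total = 0;
    for (std::uint64_t k = lo; k <= hi; ) {
        std::uint64_t q = n / k;
        std::uint64_t last = std::min(n / q, hi);  // stay inside this chunk
        total += q * (last - k + 1);
        k = last + 1;
    }
    return total;
}

int main() {
    const std::uint64_t n = 100000000000ULL;  // 10^11
    const int workers = 8;
    const std::uint64_t chunk = n / workers;

    std::vector<std::future<std::uint64_t>> parts;
    for (int w = 0; w < workers; ++w) {
        std::uint64_t lo = w * chunk + 1;
        std::uint64_t hi = (w == workers - 1) ? n : (w + 1) * chunk;
        parts.push_back(std::async(std::launch::async, partial_sum, n, lo, hi));
    }

    std::uint64_t total = 0;
    for (auto& p : parts) total += p.get();
    std::cout << total << "\n";
}
```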
However, it should be noted that it's not always easy to come up with a sophisticated algorithm that reduces the order, and it's probably even more challenging to implement such an algorithm correctly. So investing a lot of effort in reducing the order of complexity may not give the desired return on investment. It depends on how long it would take to devise and code the algorithm, but also on how many times the algorithm is going to be run (maybe every day for a couple of years on a petabyte data store?). The more it's going to be used, the more you may want to invest in further optimization. In reality it's probably not a one-shot develop-and-deploy effort, but a repeated cycle of developing, deploying, learning the system's behavior, and tweaking the algorithm to make it even faster. And while it's easy to throw more nodes at the cluster to scale, and not worry about the cost of continuously refining the algorithm, the cost of operating such a large cluster (data-center space, electricity, hardware failures, and so on) should be kept in mind.
In conclusion, large computational tasks over large volumes of data should make use of a combination of distributed infrastructure such as Map/Reduce while, at the same time, reducing the order of the algorithm as much as possible.