Hadoop helps make big data tasks feasible by providing two important services: HDFS introduces controlled redundancy to prevent data loss, and the MapReduce framework encourages algorithm designers to read and write data sequentially, which optimizes throughput and resource utilization. In this talk we dive into the details of how sequential access affects performance. In the first part of the talk, we show that sequential access is important not only for hard drives, but for all storage components used in today's computers. Based on this observation, we then discuss statistical techniques to improve the performance of common analytical tasks. In particular, we show how randomness can be used strategically to improve speed and possibly accuracy.
Presenter: Ulrich Rückert, Datameer
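For illustration only, the sketch below shows one well-known way randomness and sequential access work together: reservoir sampling maintains a uniform random sample in a single sequential pass over the data, from which aggregates such as a mean can be estimated quickly. This is an assumed example, not necessarily the technique covered in the talk; the class name, data stream, and sample size are hypothetical.

    import java.util.Arrays;
    import java.util.Random;
    import java.util.stream.DoubleStream;

    // Illustrative sketch (not from the talk): reservoir sampling keeps a uniform
    // random sample of a stream in one sequential pass, so approximate aggregates
    // can be computed without rescanning the full data set.
    public class ReservoirSample {

        // Draw a uniform sample of size k from a sequentially read stream of values.
        static double[] sample(DoubleStream stream, int k, Random rng) {
            double[] reservoir = new double[k];
            long[] seen = {0};  // number of items consumed so far
            stream.forEachOrdered(x -> {
                long n = seen[0]++;
                if (n < k) {
                    reservoir[(int) n] = x;              // fill the reservoir first
                } else {
                    long j = (long) (rng.nextDouble() * (n + 1));
                    if (j < k) reservoir[(int) j] = x;   // replace a slot with probability k/(n+1)
                }
            });
            return reservoir;
        }

        public static void main(String[] args) {
            Random rng = new Random(42);
            // Hypothetical data: one sequential pass over a large stream of measurements.
            DoubleStream data = DoubleStream.generate(rng::nextGaussian).limit(1_000_000);
            double[] s = sample(data, 1_000, rng);
            double estimate = Arrays.stream(s).average().orElse(Double.NaN);
            System.out.println("Estimated mean from a 1,000-element sample: " + estimate);
        }
    }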