High-level languages for Big Data Analytics
Jose Luis Lopez Pino
jllopezpino@gmail.com
Janani Chakkaradhari
chjananicse@yahoo.com
June 19, 2013
Abstract
This work presents a review of the literature on the high-level languages
that have emerged since the MapReduce programming model and its Hadoop
implementation shook up parallel programming over huge datasets. MapReduce
was a major step forward in the field, but it has severe limitations that
the high-level programming languages try to overcome in different ways.
Our work compares three of the main high-level languages (Pig Latin,
HiveQL and Jaql) based on four criteria that we consider highly relevant
and for which the existing studies are, in our opinion, consistent. Those
criteria are expressive power, performance, query processing and JOIN
implementation.
Analysis based on multiple criteria reveals the differences between the
languages, but it shows that none of the languages analysed (Pig Latin,
HiveQL and Jaql) beats the others in every criterion. Depending on the
scenario or application, these comparison results should be considered
when choosing the most suitable language for implementation.
Finally, we address two well-known pitfalls of MapReduce: latency and
the implementation of complex algorithms.
1 Introduction
1.1 The MapReduce programming model
The MapReduce programming model was introduced by Google in 2004 [5]. This
model allows programmers without any experience in parallel coding to write
highly scalable programs and hence process voluminous datasets. This high
level of scalability is achieved by decomposing the problem into a large
number of tasks.
MapReduce is a completely different approach to big data analysis and it
has been proven effective in large cluster systems. This model is based on two
functions that are coded by the user: map and reduce.
• The Map function takes a single key/value pair as input and produces a
set of key/value pairs.
• The Reduce function takes a key and a set of values related to that key
as input; it may also produce a set of values, but it commonly emits
only one or zero values as output.
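As a minimal sketch of this model, the classic word count fits the two functions naturally. The following is a toy, single-process Python simulation (our own illustration, not Hadoop's actual API):

```python
from collections import defaultdict

def map_fn(_, line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the partial counts collected for one word."""
    yield word, sum(counts)

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate the framework: map every record, group by key, reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    return dict(kv for k, vs in sorted(groups.items())
                   for kv in reduce_fn(k, vs))

counts = run_mapreduce(enumerate(["big data", "big deal"]), map_fn, reduce_fn)
# counts == {'big': 2, 'data': 1, 'deal': 1}
```

The grouping step between `map_fn` and `reduce_fn` is what the real framework performs in its shuffle phase, distributed across machines.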
Using this model, programmers do not have to care about how the data is
distributed, how to handle failures or how to balance the system. However,
this model also has important drawbacks: it is complicated to code very simple
and common tasks using this single dataflow, many of the tasks are expensive to
perform, the user code is difficult to debug, there are no schemas or indexes,
and a lot of network bandwidth may be consumed [10]. The purpose of the
different high-level languages is to address some of those shortcomings.
1.2 Hadoop
Nowadays, the amount of data generated per day is measured in petabytes.
These large amounts of data make it necessary to store the data on more
than one system at a time, which means partitioning the data and storing it
on separate machines. File systems that manage storage across a network of
machines are called distributed file systems. The challenging aspect of a
DFS is fault tolerance, since failures can lead to data loss. Hadoop is a
solution to this problem.
Doug Cutting, the creator of Hadoop, named it after his son’s toy elephant.
Hadoop has two layers: a storage layer and an execution layer. The storage
layer is the Hadoop Distributed File System (HDFS), designed for storing very
large files with streaming data access patterns, running on clusters of
commodity hardware. The default block size of HDFS is 64 MB. The execution
layer is Hadoop MapReduce and it is responsible for running a job in parallel
on multiple servers at the same time.
Since the nodes of a Hadoop cluster are commodity hardware, the cost is low
and the architecture is very simple. A typical Hadoop cluster has a single
master server, where a name node process manages the file system and a job
tracker process manages the jobs. This node should be high-quality hardware
with high processing speed. The cluster also has multiple slave nodes, called
data nodes, that run the tasks on their local server.
In general, the data is replicated (3 times) on different nodes to support
fault tolerance. If any of these data nodes fails, the master node detects
the failure, re-replicates its data on nodes that are active and updates the
metadata [24].
2 High level languages
After MapReduce was publicly announced and the Hadoop framework was created,
multiple high-level languages were created specifically to deal with some of
the problems of the model mentioned before. Some already existing languages
have also been integrated to work over this model, such as R (MapR,
Ricardo [4]) and SQL.
Concerning the selection of the languages for our comparison, for the sake of
consistency we have chosen the three programming languages that are present in
all the comparisons (Pig Latin, HiveQL and Jaql), but we also consider it
important to mention some other interesting high-level query languages (Meteor
and DryadLINQ). Additionally, we study how each query is processed by the
system, which translates it into a MapReduce workflow.
2.1 Pig Latin
Pig Latin is executed over Hadoop, an open-source implementation of the
MapReduce programming model, and was formerly developed at Yahoo
Research [14]. It is a high-level procedural language that implements
high-level operations similar to those that we can find in SQL, plus some
other interesting operators listed below:
• FOREACH to apply a transformation to every tuple of the set. To make
it possible to parallelise this operation, the transformation of one row
must not depend on any other.
• COGROUP to group related tuples of multiple datasets. It is similar to
the first step of a join.
• LOAD to load the input data and its structure and STORE to save data
in a file.
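To illustrate COGROUP’s semantics, here is a toy single-process Python sketch (the function name and data are ours, not Pig’s API): for each key, it pairs the bag of matching tuples from every input dataset.

```python
from collections import defaultdict

def cogroup(*datasets):
    """For every key, collect one bag of matching tuples per input dataset.
    A join is this grouping followed by flattening the product of the bags."""
    keys = set()
    indexed = []
    for ds in datasets:
        bags = defaultdict(list)
        for key, value in ds:
            bags[key].append(value)
        indexed.append(bags)
        keys.update(bags)
    return {k: tuple(bags.get(k, []) for bags in indexed) for k in sorted(keys)}

results = cogroup([("alice", 3), ("bob", 5)], [("alice", "x"), ("alice", "y")])
# results == {'alice': ([3], ['x', 'y']), 'bob': ([5], [])}
```

Note that, unlike a join, keys without matches on one side (such as `bob` here) survive with an empty bag.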
The main goal of Pig Latin is to reduce development time. For this
purpose, it includes features like a nested data model, user-defined functions
and the possibility of executing some analytic queries over text files without
loading the data.
Unlike SQL, the procedural nature of the language also allows the programmer
to have more control over the execution plan, meaning that he can speed up
performance without relying on the query optimiser for this task.
2.2 HiveQL
HiveQL belongs to Hive, an open-source project initially developed by
Facebook. Hive is a system built on top of Hadoop that uses MapReduce for
execution and HDFS for storage, and keeps the metadata in an RDBMS.
In simple terms, Hive can be described as a data warehouse built on top of
Hadoop. The main advantage of Hive is familiarity: it extends the
functionality of SQL and its queries look similar to SQL.
Scalability and performance are, of course, two other features. Hive tables
can be defined directly on HDFS, while the schemas are stored in an RDBMS.
Hive supports complex column types such as map, array and struct in addition
to the basic types [23].
Hive supports most of the traditional SQL constructs, such as:
• Subqueries
• Different kinds of joins: inner, left outer, right outer and full outer joins
• Cartesian products, group-bys and aggregations
• Union all
• Create table as select
Hive uses a traditional RDBMS to store the metadata. The metadata storage is
accessed frequently, so it is preferable to keep the metadata in a store with
random access rather than sequential access. As HDFS is not well suited for
random access, Hive stores the metadata in databases like MySQL or Oracle.
It is also important to note that the latency is low whenever HiveQL
accesses the metadata. In spite of this impedance mismatch, Hive maintains the
consistency between metadata and data [23].
2.3 JAQL
Jaql[3] is a declarative scripting language built on top of Hadoop and used in
some IBM products (InfoSphere BigInsights and Cognos Consumer Insight).
This language was developed after Pig and Hive and hence it was designed
to be more scalable, flexible and reusable than the alternatives that
existed at the time.
Simplicity is one of the key goals of the Jaql data model, which is clearly
inspired by JSON: values are always trees, there are no references and the
textual representation is very similar. This simplicity has two advantages: it
facilitates development and it makes the distribution of the program between
nodes easier.
The other main goal of the data model is adaptability. Jaql can handle
semi-structured documents (data without a schema) but also structured records
validated against a schema. In consequence, programs written in Jaql can read
and write information in different sources, from relational databases to
delimited plain files.
The flexibility of the language relies on the data model but also on the
control over the evaluation plan because the programmer can work at different
levels of abstraction using Jaql’s syntax:
• Full definition of the execution plan.
• Use of hints to indicate to the optimizer some evaluation features. This
feature is present in most of the database engines that use SQL as query
language.
• Declarative programming, without any control over the flow.
2.4 Other languages
2.4.1 Meteor
Stratosphere[2] is a system designed to process massive datasets, and one of
its main components is the Pact programming model. Pact[1] is an extension of
the MapReduce programming model, also inspired by functional programming. One
of the limitations of MapReduce is that it is based on only two simple
second-order functions; this new model addresses the problem by including new
operators (called contracts) to perform those analyses more easily and
more efficiently:
• Cross: performs the Cartesian product over the input sets.
• CoGroup: groups all the pairs with the same key and processes them with
a user-defined function.
• Match: it also matches key/value pairs from the input data, but pairs
with the same key might be processed separately by the user function.
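The three contracts can be sketched in Python as single-process analogues (a hedged illustration of their semantics, not the Pact API; all names and data are ours):

```python
from collections import defaultdict
from itertools import product

def cross(left, right, udf):
    """Cross: apply the user function to every pair of the Cartesian product."""
    return [udf(l, r) for l, r in product(left, right)]

def cogroup(left, right, udf):
    """CoGroup: hand the user function ALL values sharing a key, both sides."""
    groups = defaultdict(lambda: ([], []))
    for k, v in left:
        groups[k][0].append(v)
    for k, v in right:
        groups[k][1].append(v)
    return [udf(k, ls, rs) for k, (ls, rs) in sorted(groups.items())]

def match(left, right, udf):
    """Match: call the user function once per PAIR with the same key, so
    pairs sharing a key can be processed independently (unlike CoGroup)."""
    rights = defaultdict(list)
    for k, v in right:
        rights[k].append(v)
    return [udf(k, lv, rv) for k, lv in left for rv in rights.get(k, [])]

pairs = match([(1, "a"), (1, "b")], [(1, "x")], lambda k, l, r: (l, r))
# pairs == [('a', 'x'), ('b', 'x')]
```

The difference between CoGroup and Match is visible in the signatures: CoGroup’s user function sees whole groups, Match’s sees individual matching pairs.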
In addition to this new programming model, the Stratosphere stack includes
Meteor, a query language, and Sopremo, the operator model used by Meteor[8].
Meteor, like the other high-level languages presented before, was designed to
facilitate the task of developing parallel programs using the programming
model (in this case Pact); in consequence, Meteor programs are ultimately
translated into Pact programs. Sopremo helps to manage collections of
semantically rich operators that can be extended and that are grouped into
packages.
To speed up the execution of the code, optimization is applied in two
steps: first the logical plan, which consists of Sopremo operators, is
optimized; secondly, Pact’s compiler applies physical optimizations to the
resulting program.
2.4.2 DryadLINQ
Dryad [25] is another execution engine for large-scale analysis over
datasets. DryadLINQ gained the community’s interest because it is embedded in
.NET programming languages, and a large number of programmers are already
familiar with this development platform and with LINQ in particular.
The designers of this system made a big effort to support almost all the
operators available in LINQ, plus some extra operators interesting for
parallel programming. This framework also allows developers to include their
own implementations of the operators[9].
After the DryadLINQ code is extracted from the program, it is translated
into a Dryad plan and then optimized. The optimizer mainly performs four
tasks: it pipelines operations that can be executed by a single machine,
removes redundancy, pushes down aggregations and reduces the network traffic.
3 Comparing high level languages
The design motivations of the languages are diverse and therefore the differences
between the languages are multiple. To compare these three high level languages
we have decided to choose four criteria that are interesting from our point of
view and that are well described in the literature that we have reviewed.
First of all, we have analysed the expressiveness and the general performance
of the languages. In general, developers prefer a language that allows them to
write concise code (expressive power) and that is efficient (performance).
After that, we dive into two criteria that also have an important impact on
performance: the join implementation and the query processing. Join
algorithms are a well-known burden on performance when working with sets,
and therefore we analyse the different algorithms implemented by those
languages.
We consider these criteria sufficient to identify the solution that best suits
our needs; however, there are many other factors that are also mentioned or
studied in the literature written so far, like the programming paradigm, the
code size or scalability. Scalability is a very relevant criterion, since it
motivated the creation of MapReduce; however, it is not easy to find
consistent literature covering all the topics that we consider significant.
3.1 Expressive power
Robert Stewart[20] classifies the high-level languages into three categories
according to their computational power, from less to more powerful:
• Relational complete: a language is considered relational complete if it
includes the primitive operations of the relational algebra: the selection,
the rename, the projection, the set union, the set difference and the cross
(Cartesian) product. The different kinds of joins implemented in each
language are compared in a separate section.
• SQL equivalent: SQL is a standard language for querying data stored in
relational database management systems. It provides all the operations of
the relational algebra plus aggregate functions, which are not part of the
relational algebra although they are a common extension for data compu-
tation.
• Turing complete: a Turing complete language must allow conditional
branching, indefinite iterations by means of recursion and emulation of
an infinite memory model.
Pig Latin and HiveQL are considered SQL equivalent because they are more
powerful than relational algebra (they include numeric aggregation functions).
Hence we can consider their expressive power equivalent, although there are
evident differences. SQL is an industry standard that has been developed over
almost 40 years, including several extensions, and in consequence it is not
easy to develop a SQL-compliant system.
HiveQL is inspired by SQL but it does not support the full repertoire
included in the SQL-92 specification. On the other hand, HiveQL has also
included some features notably inspired by MySQL and MapReduce that are not
part of this specification. A comparison between SQL and HiveQL by Tom White
in 2009 revealed some limitations of HiveQL, like the lack of indexes,
transactions, subqueries outside the FROM clause, etc.[24] In the case of Pig
Latin, its syntax and functionality are not inspired by SQL and the
differences are more obvious. For instance, Pig Latin does not have an OVER
clause and includes the COGROUP operator, which is not present in SQL.
Jaql includes basic flow control using if-else structures and recursion in
higher-order functions, and hence it is considered Turing complete. However,
taking into account that Pig and HiveQL programs can be extended using user-
defined functions, Pig Latin and HiveQL might also be considered Turing
complete.
Finally, Jaql[3] makes it possible to compile high-level declarative
expressions into lower-level function calls. As a result, the low-level
functions can be extended. Neither Pig Latin nor HiveQL includes this
feature, called source-to-source compilation, which could increase the
expressiveness of the language.
3.2 Performance
The usual benchmark to measure Pig’s performance is PigMix, a set of queries
that tests scalability and latency [18]. The Hive performance benchmark is
mainly based on the queries specified by Pavlo et al. [15]; these queries
basically cover a selection task, an aggregation task and a join task.
There are also Pig Latin and HiveQL implementations of the TPC-H
queries [19].
Even though the objective of each of these languages is to generate
equivalent MapReduce jobs for its input script, the runtime measurements of
these languages show different results for the same kind of benchmarking
applications, as reported in the experiments of [21].
First, the paper defines scale-up, scale-out, scale-in and runtime as its
performance metrics. In the scale-up experiments, the size of the cluster is
fixed and the computation is increased, i.e. the number of nodes in the Hadoop
environment is kept constant. Interestingly, the performance of all three
languages varied based on the distribution of the data: for skewed data, Pig
and Hive seem to be more effective than JAQL.
In the scale-out experiments, the computation is fixed, in the sense that
there is no increase in computation for a given experiment as the number of
nodes grows. Again, the paper argues that Pig and Hive are better at utilizing
the increase in cluster size than JAQL, although beyond a certain point there
is no improvement in performance from adding nodes.
Moreover, Pig and Hive allow the user to explicitly specify the number of
reduce tasks, and it has been argued that this feature has a significant
influence on performance.
3.3 Query processing
In order to make a good comparison, we should have a basic knowledge of how
these HLQLs work. In this section we focus on answering the following
question: how is the abstract user representation of the query or script
converted into MapReduce jobs? For data-intensive parallel computations, the
choice of high-level language mainly depends on the specific application
scenario[17]. Taking this into account, we can see the importance of
understanding the query compilation methods implemented by these languages.
3.3.1 Pig Latin
The structure of Pig Latin is similar in style to SQL. The goal of writing a
Pig Latin script is to produce equivalent MapReduce jobs that can be executed
in the Hadoop environment. Pig is considered to have the basic characteristics
of a query language; hence the initial steps of its compilation are similar to
SQL query processing [6].
Pig programs are first passed to a parser component, which checks the
syntactic correctness of the Pig Latin script. The result of the parser is
a complete logical plan. Unlike SQL, where parsing produces a parse tree, the
parsing phase of Pig compilation produces a directed acyclic graph (DAG). The
logical plan is then passed to the logical optimizer component, where
classical optimizations such as pushing down projections are carried out. The
result of the logical optimizer is passed to the MapReduce compiler to compute
a sequence of MapReduce jobs, which is then passed to an optimization phase
and finally submitted to Hadoop for execution[6]. The following example
(Figure 1) describes the generation of the logical plan for a simple word
count program in Pig Latin. The output of each operator is shown next to the
rectangles.
Figure 1: Compilation of Pig Latin to Logical Plan
Pig then translates the logical plan into a physical plan, replacing the
logical operators with physical operators in the MapReduce jobs. In most cases
a logical operator becomes the equivalent physical operator; here LOAD,
FOREACH and STORE remain the same. The GROUP operator is translated into
LOCAL REARRANGE, GLOBAL REARRANGE and PACKAGE in the physical plan.
Rearranging means either hashing or sorting by key. The combination of local
and global rearranges produces the result in such a way that the tuples
having the same group key are moved to the same machine[6].
The input data is broken down into chunks, and the map tasks all run
independently in parallel to process the input data; this is handled by the
job tracker of the MapReduce framework. Map takes the input data, constructs
the key/value pairs and sorts the data by key. The shuffle phase is managed
by Hadoop: it fetches the corresponding partition of data from the map phase,
merges it into a single sorted list and groups it by key. This is the input
for the reduce phase, which usually performs the aggregation part of the
query, in our case the count. The equivalent MapReduce jobs for our example
Pig Latin script are shown in Figure 2.
Figure 2: Physical Plan to Map Reduce Jobs
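The rearrange/package decomposition described above can be sketched as a toy Python simulation (the operator names mirror the physical plan; the code and data are ours, not Pig’s internals):

```python
from collections import defaultdict

def local_rearrange(tuples, key_of):
    """Map side: tag each tuple with its group key."""
    return [(key_of(t), t) for t in tuples]

def global_rearrange(tagged, n_machines):
    """Shuffle: hash-partition by key so equal keys land on the same machine."""
    machines = [[] for _ in range(n_machines)]
    for key, t in tagged:
        machines[hash(key) % n_machines].append((key, t))
    return machines

def package(partition):
    """Reduce side: collect the tuples of each key into one group."""
    groups = defaultdict(list)
    for key, t in partition:
        groups[key].append(t)
    return dict(groups)

words = [("big",), ("data",), ("big",)]
tagged = local_rearrange(words, key_of=lambda t: t[0])
grouped = {}
for part in global_rearrange(tagged, n_machines=2):
    grouped.update(package(part))
# grouped == {'big': [('big',), ('big',)], 'data': [('data',)]}
```

Because partitioning is by key, both `("big",)` tuples necessarily end up in the same partition, which is exactly the guarantee GROUP needs.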
3.3.2 HiveQL
In a DBMS, the query processor transforms the user queries into a sequence of
database operations and executes those operations. Initially, the query is
turned into a parse tree and then transformed into relational algebraic
notation; this is termed the logical query plan [Garcia]. As HiveQL has a
SQL-like declarative structure, its query processing is the same as SQL
processing in a traditional database engine. The following steps briefly
describe query processing in HiveQL [23]:
• It gets the HiveQL query string from the client
• The parser phase converts it into a parse tree representation
• The semantic analyser component converts the parse tree into a block-based
internal query format
• The logical query plan generator converts it into a logical query
representation and then optimizes it, pruning the columns early and pushing
the predicates closer to the tables
• Finally, the logical plan is converted to a physical plan and then to
MapReduce jobs
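The two logical optimizations mentioned (early column pruning and predicate pushdown) can be illustrated with a small Python sketch of a scan operator (our own toy representation, not Hive’s actual plan classes): both are applied as rows are read, so less data reaches the expensive shuffle.

```python
def scan(table, columns, predicate):
    """Scan with a pushed-down predicate and early column pruning:
    rows are filtered and trimmed as they are read, before any shuffle."""
    for row in table:
        if predicate(row):
            yield {c: row[c] for c in columns}

employees = [
    {"name": "ann", "dept": "db", "salary": 10, "notes": "..."},
    {"name": "bob", "dept": "ml", "salary": 20, "notes": "..."},
]
# Logical plan for: SELECT name FROM employees WHERE dept = 'db'
rows = list(scan(employees, columns=["name"],
                 predicate=lambda r: r["dept"] == "db"))
# rows == [{'name': 'ann'}]
```

Without pushdown, the filter and projection would run after the shuffle, and the wide, unfiltered rows would cross the network first.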
3.3.3 JAQL
A JAQL script written by the application is first evaluated by the compiler.
MapReduce jobs can be called directly from a JAQL script, but usually the user
relies on the JAQL compiler to convert the script into MapReduce jobs. JAQL
includes two higher-order functions, mapReduceFn and mrAggregate, to execute
map/reduce and aggregate operations respectively. The rewriter engine
generates calls to mapReduceFn or mrAggregate by identifying the relevant
parts of the script and moving them into the map, reduce and aggregate
function parameters. Based on a set of rules, the rewriter converts them into
an Expr tree. Finally, it checks for the presence of algebraic aggregates and,
if they are present, it invokes mrAggregate; in other words, it can then
complete the task with a single MapReduce job (Figure 3).
Figure 3: Jaql-Query processing stages
Each language has its own implementation of query processing. During the
review we noted that:
• Pig currently misses out on optimized storage structures like indexes and
column groups. HiveQL provides more optimization functionality, such as
performing joins in the map phase instead of the reduce phase and, in the
case of sampling queries, pruning the buckets that are not needed.
• JAQL’s physical transparency is an added-value feature because it allows
the user to add new runtime operators without affecting JAQL’s internals.
3.4 JOIN implementation
Join is an essential operation in relational database models. The basic need
for joins comes from the fact that relations are in normalized form, so in
the computation of aggregations, and in many kinds of OLAP operations, the
join becomes a necessary step to compute the expected results.
3.4.1 Pig Latin
Pig Latin supports inner joins, equijoins and outer joins. The JOIN operator
always performs an inner join. Pig executes joins in two flavours: first, a
join can be achieved by a COGROUP operation followed by FLATTEN [4]. Second,
the inner join can be extended to three specialized joins [16].
• Skewed joins: the basic idea is to compute a histogram of the key space
and use this data to allocate reducers for a given key. Currently Pig
allows a skewed join of only two tables. The join is performed in the
reduce phase.
• Merge joins: Pig allows a merge join only if the input relations are
already sorted, so the tuples can be joined by streaming the sorted
inputs. The join is performed in the map phase.
• Fragment-replicate joins: this is only possible if one of the two
relations is small enough to fit into memory. In this case, the big
relation is distributed across the Hadoop nodes and the smaller relation
is replicated on each node; the entire join operation is performed in the
map phase. This is, of course, the trivial case.
The choice of join strategy can be specified by the user while writing the
script. An example join operation in Pig Latin is shown in Figure 4.
Figure 4: Join code in Pig Latin
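As an illustration of the fragment-replicate strategy, here is a hedged Python sketch of what a single map task does (names and data are invented for the example):

```python
from collections import defaultdict

def fragment_replicate_join(big_fragment, small_relation):
    """Map-side join: the small relation is replicated (and indexed in
    memory) on every node; each map task streams its fragment of the big
    relation against it, so no shuffle or reduce phase is needed."""
    index = defaultdict(list)          # in-memory copy of the small side
    for key, value in small_relation:
        index[key].append(value)
    return [(key, big_value, small_value)
            for key, big_value in big_fragment
            for small_value in index.get(key, [])]

joined = fragment_replicate_join(
    big_fragment=[("us", "order1"), ("de", "order2")],
    small_relation=[("us", "United States"), ("de", "Germany")],
)
# joined == [('us', 'order1', 'United States'), ('de', 'order2', 'Germany')]
```

The whole win is that the small side fits in memory: the join degenerates into a local hash lookup inside each map task.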
3.4.2 HiveQL
In the early stages, HiveQL only supported the common join. In this join, the
joining tables are read in the map phase and pairs of join key and value are
written into an intermediate file, which is passed to the shuffle phase
handled by Hadoop. In the shuffle phase, Hadoop sorts and combines these
key/value pairs and sends the tuples with the same key to the same reducer,
which performs the actual join operation. Here the shuffle and reduce phases
are expensive since they involve sorting.
To overcome this, the map-side join was introduced; it is only possible if
one of the joining tables fits entirely into memory. It is similar to the
replicate join in Pig Latin.
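A minimal Python simulation of this common (reduce-side) join may help; it compresses map, shuffle and reduce into one process, and the names and data are ours, not Hive’s:

```python
from collections import defaultdict

def common_join(left, right):
    """Reduce-side join: maps tag each tuple with its origin table, the
    shuffle groups tuples by join key, and each reduce call combines the
    two sides for one key."""
    shuffled = defaultdict(lambda: ([], []))
    for key, value in left:            # map phase, left table
        shuffled[key][0].append(value)
    for key, value in right:           # map phase, right table
        shuffled[key][1].append(value)
    out = []
    for key in sorted(shuffled):       # one reduce call per sorted key
        lefts, rights = shuffled[key]
        out += [(key, l, r) for l in lefts for r in rights]
    return out

rows = common_join([("de", "order2"), ("us", "order1")],
                   [("us", "United States")])
# rows == [('us', 'order1', 'United States')]
```

The sorting implicit in the shuffle is what the map-side join avoids entirely.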
3.4.3 JAQL
JAQL supports only equijoins. JOIN is expressed between two or more input
arrays and supports multiple join types, including natural, left-outer,
right-outer and full outer joins. One of the advantages of Jaql is that its
physical transparency allows users to add new join operators and use them in
queries without modifying anything in the query compiler.
The following points summarize the join implementations:
• Both Pig and Hive can perform the join in the map phase instead of the
reduce phase.
• For skewed distributions of data, the performance of JAQL joins is not
comparable to that of the other two languages.
4 Future work
4.1 Interactive queries
One of the main problems of MapReduce and of all the languages built on top
of this framework (Pig, Hive, etc.) is latency. As a complement to those
technologies, some new frameworks that allow programmers to query large
datasets interactively have been developed, like Dremel[12] or the
open-source project Apache Drill.
In order to reduce query latency compared to other tools for large dataset
analysis, Dremel stores the information as nested columns, uses a multi-level
tree architecture for query execution and balances the load by means of a
query dispatcher.
We do not have many details of Dremel’s query language, but we know that it
is based on SQL and includes the usual operations (selection, projection,
etc.) and features (user-defined functions or nested subqueries) of SQL-like
languages. The characteristic that distinguishes this language is that it
operates with nested tables as inputs and outputs.
4.2 Machine learning
MapReduce is a way to process big data that clearly performs well for basic
operations such as selection; on the other hand, it is more complicated to
address complex queries with this processing technique. The challenging
aspect of machine learning algorithms is that they do not simply compute
aggregates over datasets: they identify hidden patterns in the given data.
An example of such a question is: which page will the visitor visit next?
Some of the ML algorithms and the general approach to processing the data
with MapReduce are discussed in [7]. A Bayes classifier requires counting the
occurrences in the training data. For a large dataset the extraction of
features is intensive, and at least the reduce task should be configured to
compute the summation for each (feature, label) pair.
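A toy Python sketch of this counting job (our own simplified, single-process version of the approach discussed in [7]; labels and features are invented):

```python
from collections import Counter

def map_example(label, features):
    """Map: emit a ((feature, label), 1) pair for every feature occurrence."""
    for feature in features:
        yield (feature, label), 1

def count_feature_label_pairs(training_data):
    """Reduce: sum the partial counts per (feature, label) pair; these
    counts are the sufficient statistics a naive Bayes classifier needs."""
    counts = Counter()
    for label, features in training_data:
        for pair, one in map_example(label, features):
            counts[pair] += one
    return counts

counts = count_feature_label_pairs([
    ("spam", ["free", "win"]),
    ("ham",  ["meeting"]),
    ("spam", ["free"]),
])
# counts[("free", "spam")] == 2
```

Since the sum is algebraic, the same function could run as a combiner on each node before the final reduce, cutting the network traffic.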
Mahout is an Apache project for building scalable machine learning
algorithms. These algorithms include clustering, classification,
collaborative filtering and frequent itemset mining, which in turn are
predominantly used in recommendations. The collaborative filtering supports
both user-user and item-item similarity [22].
• Pig Latin has extensions to deal with predictive analytics capabilities.
Twitter has implemented learning algorithms by placing them in Pig
storage functions [11]; the storage functions are called in the final
reduce stage of the overall dataflow.
• There is recent work on extensive support of ML in Hive [13]. The author
tries to follow the same approach that was implemented for Pig by
Twitter; here the machine learning algorithms are treated as UDAFs.
• A new data analytics platform, Ricardo, has been proposed that combines
the functionalities of R and Jaql. It basically takes advantage of the
statistical computing features provided by R together with a high-level
language that generates MapReduce jobs using Jaql [4].
5 Conclusions
In this literature review we have first introduced the MapReduce programming
model, paying attention to its main drawbacks, and its main open-source
implementation (Hadoop).
After that, we have briefly described some high-level languages that try to
address the problems mentioned, from different perspectives, focusing on
those that are popular in the literature available at the time of writing
(Pig Latin, HiveQL and Jaql) and on some interesting alternatives (DryadLINQ
and Meteor).
Based on the consistent and relevant studies reviewed, it is clear that there
is no single language that beats all the other options. Jaql was created
after the other two languages, and that probably gave it some advantages in
its design. Based on the first criterion analysed, we can state that Jaql is
expressively more powerful, since it includes basic flow control using
if-else structures, while with the other two this is only possible using
UDFs. However, we have seen that Jaql also shows the worst performance in the
benchmarks described before. Pig and Hive probably perform better in those
benchmarks because they support map-phase joins. Hive also adopts advanced
optimization techniques for query processing that certainly speed up the
resulting code.
Finally, we have seen how high-level languages for big data analytics are
addressing some of the problems of this paradigm. Real-time processing
demands a very low response latency, and this is one of the main
disadvantages of the MapReduce model. In consequence, some new languages for
large dataset analytics that do not use this model have been designed.
Additionally, some machine learning algorithms are difficult to implement
using this model. Some alternatives have shown up in the last years; for
instance, the Apache Software Foundation is developing Mahout, a library
that implements scalable machine learning algorithms using the map/reduce
paradigm.
References
[1] Alexander Alexandrov, Stephan Ewen, Max Heimel, Fabian Hueske, Odej
Kao, Volker Markl, Erik Nijkamp, and Daniel Warneke. MapReduce and
PACT - comparing data parallel programming models. In Proceedings of
the 14th Conference on Database Systems for Business, Technology, and
Web (BTW), BTW 2011, pages 25–44, Bonn, Germany, 2011. GI.
[2] Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl,
and Daniel Warneke. Nephele/PACTs: A programming model and execu-
tion framework for web-scale analytical processing. In Proceedings of the
1st ACM symposium on Cloud computing, SoCC ’10, pages 119–130, New
York, NY, USA, 2010. ACM.
[3] Kevin S Beyer, Vuk Ercegovac, Rainer Gemulla, Andrey Balmin, Mohamed
Eltabakh, Carl-Christian Kanne, Fatma Ozcan, and Eugene J Shekita.
Jaql: A scripting language for large scale semistructured data analysis.
In Proceedings of VLDB Conference, 2011.
[4] Sudipto Das, Yannis Sismanis, Kevin S Beyer, Rainer Gemulla, Peter J
Haas, and John McPherson. Ricardo: integrating r and hadoop. In Proceed-
ings of the 2010 ACM SIGMOD International Conference on Management
of data, pages 987–998. ACM, 2010.
[5] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing
on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[6] Alan F Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shra-
van M Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh
Srinivasan, and Utkarsh Srivastava. Building a high-level dataflow sys-
tem on top of map-reduce: the pig experience. Proceedings of the VLDB
Endowment, 2(2):1414–1425, 2009.
[7] Dan Gillick, Arlo Faria, and John DeNero. Mapreduce: Distributed com-
puting for machine learning. Berkley (December 18, 2006), 2006.
[8] Arvid Heise, Astrid Rheinländer, Marcus Leich, Ulf Leser, and Felix
Naumann. Meteor/sopremo: An extensible query language and operator model.
In Proceedings of the International Workshop on End-to-end Management
of Big Data (BigData) in conjunction with VLDB 2012, 2012.
[9] Michael Isard and Yuan Yu. Distributed data-parallel computing using a
high-level programming language. In Proceedings of the 2009 ACM SIG-
MOD International Conference on Management of data, pages 987–994.
ACM, 2009.
[10] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and
Bongki Moon. Parallel data processing with mapreduce: a survey. ACM
SIGMOD Record, 40(4):11–20, 2012.
[11] Jimmy Lin and Alek Kolcz. Large-scale machine learning at twitter. In
Proceedings of the 2012 international conference on Management of Data,
pages 793–804. ACM, 2012.
[12] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva
Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: interactive analy-
sis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330–
339, 2010.
[13] Extension of Hive to support Machine Learning. Hiveql.
[14] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and
Andrew Tomkins. Pig latin: a not-so-foreign language for data process-
ing. In Proceedings of the 2008 ACM SIGMOD international conference
on Management of data, pages 1099–1110. ACM, 2008.
[15] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J Abadi, David J
DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of ap-
proaches to large-scale data analysis. In Proceedings of the 35th SIGMOD
international conference on Management of data, pages 165–178. ACM,
2009.
[16] Pig. Fragment replicate join.
[17] Caetano Sauer and Theo Härder. Compilation of query languages into
mapreduce. Datenbank-Spektrum, pages 1–11, 2013.
[18] Benchmarking standards. Pigmix.
[19] Benchmarking standards Hive. Hiveql.
[20] Robert Stewart. Performance and programmability of high level data par-
allel processing languages: Pig, hive, jaql & java-mapreduce, 2010. Heriot-
Watt University.
[21] Robert J Stewart, Phil W Trinder, and Hans-Wolfgang Loidl. Comparing
high level mapreduce query languages. In Advanced Parallel Processing
Technologies, pages 58–72. Springer, 2011.
[22] Ronald C Taylor. An overview of the hadoop/mapreduce/hbase frame-
work and its current applications in bioinformatics. BMC bioinformatics,
11(Suppl 12):S1, 2010.
[23] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad
Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy.
Hive: a warehousing solution over a map-reduce framework. Proceedings of
the VLDB Endowment, 2(2):1626–1629, 2009.
[24] Tom White. Hadoop: The definitive guide. O’Reilly Media, 2012.
[25] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson,
Pradeep Kumar Gunda, and Jon Currey. Dryadlinq: A system for general-
purpose distributed data-parallel computing using a high-level language.
In Proceedings of the 8th USENIX conference on Operating systems design
and implementation, pages 1–14, 2008.
16

Contenu connexe

Tendances

Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduceFARUK BERKSÖZ
 
Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache sparkdatamantra
 
Java High Level Stream API
Java High Level Stream APIJava High Level Stream API
Java High Level Stream APIApache Apex
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streamingdatamantra
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFramePrashant Gupta
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentationpunesparkmeetup
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Incorta spark integration
Incorta spark integrationIncorta spark integration
Incorta spark integrationDylan Wan
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streamingdatamantra
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streamingdatamantra
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringGeorge Ang
 
Optimizing Supervised and Implementing Unsupervised Machine Learning Algorith...
Optimizing Supervised and Implementing Unsupervised Machine Learning Algorith...Optimizing Supervised and Implementing Unsupervised Machine Learning Algorith...
Optimizing Supervised and Implementing Unsupervised Machine Learning Algorith...HPCC Systems
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Sparkdatamantra
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Soumee Maschatak
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseNick Dimiduk
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 

Tendances (20)

Apache Spark Streaming
Apache Spark StreamingApache Spark Streaming
Apache Spark Streaming
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduce
 
Evolution of apache spark
Evolution of apache sparkEvolution of apache spark
Evolution of apache spark
 
Java High Level Stream API
Java High Level Stream APIJava High Level Stream API
Java High Level Stream API
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Sumedh Wale's presentation
Sumedh Wale's presentationSumedh Wale's presentation
Sumedh Wale's presentation
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Incorta spark integration
Incorta spark integrationIncorta spark integration
Incorta spark integration
 
Hpcc
HpccHpcc
Hpcc
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streaming
 
Subhabrata Deb Resume
Subhabrata Deb ResumeSubhabrata Deb Resume
Subhabrata Deb Resume
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means Clustering
 
Optimizing Supervised and Implementing Unsupervised Machine Learning Algorith...
Optimizing Supervised and Implementing Unsupervised Machine Learning Algorith...Optimizing Supervised and Implementing Unsupervised Machine Learning Algorith...
Optimizing Supervised and Implementing Unsupervised Machine Learning Algorith...
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Spark
 
Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)Managing Big data Module 3 (1st part)
Managing Big data Module 3 (1st part)
 
Apache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBaseApache Big Data EU 2015 - HBase
Apache Big Data EU 2015 - HBase
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 

En vedette

En vedette (8)

Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Big data
Big dataBig data
Big data
 
Big data concepts
Big data conceptsBig data concepts
Big data concepts
 
What is big data?
What is big data?What is big data?
What is big data?
 
BigData selon IBM
BigData selon IBM BigData selon IBM
BigData selon IBM
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Big Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should KnowBig Data - 25 Amazing Facts Everyone Should Know
Big Data - 25 Amazing Facts Everyone Should Know
 

Similaire à High level languages for Big Data Analytics (Report)

Introduction to HADOOP.pdf
Introduction to HADOOP.pdfIntroduction to HADOOP.pdf
Introduction to HADOOP.pdf8840VinayShelke
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoopVarun Narang
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methodspaperpublications3
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHitendra Kumar
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System cscpconf
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce cscpconf
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoopManoj Jangalva
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010BOSC 2010
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with HadoopNalini Mehta
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Cognizant
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop GuideSimplilearn
 
BIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfBIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfDIVYA370851
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Modeinventionjournals
 
Google Data Engineering.pdf
Google Data Engineering.pdfGoogle Data Engineering.pdf
Google Data Engineering.pdfavenkatram
 

Similaire à High level languages for Big Data Analytics (Report) (20)

Cppt Hadoop
Cppt HadoopCppt Hadoop
Cppt Hadoop
 
Cppt
CpptCppt
Cppt
 
Cppt
CpptCppt
Cppt
 
Introduction to HADOOP.pdf
Introduction to HADOOP.pdfIntroduction to HADOOP.pdf
Introduction to HADOOP.pdf
 
Seminar_Report_hadoop
Seminar_Report_hadoopSeminar_Report_hadoop
Seminar_Report_hadoop
 
Survey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization MethodsSurvey on Performance of Hadoop Map reduce Optimization Methods
Survey on Performance of Hadoop Map reduce Optimization Methods
 
Hadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log ProcessingHadoop a Natural Choice for Data Intensive Log Processing
Hadoop a Natural Choice for Data Intensive Log Processing
 
Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System Design Issues and Challenges of Peer-to-Peer Video on Demand System
Design Issues and Challenges of Peer-to-Peer Video on Demand System
 
Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce Survey of Parallel Data Processing in Context with MapReduce
Survey of Parallel Data Processing in Context with MapReduce
 
2.1-HADOOP.pdf
2.1-HADOOP.pdf2.1-HADOOP.pdf
2.1-HADOOP.pdf
 
Overview of big data & hadoop v1
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
 
project report on hadoop
project report on hadoopproject report on hadoop
project report on hadoop
 
Taylor bosc2010
Taylor bosc2010Taylor bosc2010
Taylor bosc2010
 
Managing Big data with Hadoop
Managing Big data with HadoopManaging Big data with Hadoop
Managing Big data with Hadoop
 
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi...
 
Big Data and Hadoop Guide
Big Data and Hadoop GuideBig Data and Hadoop Guide
Big Data and Hadoop Guide
 
Hadoop ppt2
Hadoop ppt2Hadoop ppt2
Hadoop ppt2
 
BIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdfBIGDATA MODULE 3.pdf
BIGDATA MODULE 3.pdf
 
Introduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone ModeIntroduction to Big Data and Hadoop using Local Standalone Mode
Introduction to Big Data and Hadoop using Local Standalone Mode
 
Google Data Engineering.pdf
Google Data Engineering.pdfGoogle Data Engineering.pdf
Google Data Engineering.pdf
 

Plus de Jose Luis Lopez Pino

Lessons learnt from applying PyData to GetYourGuide marketing
Lessons learnt from applying PyData to GetYourGuide marketingLessons learnt from applying PyData to GetYourGuide marketing
Lessons learnt from applying PyData to GetYourGuide marketingJose Luis Lopez Pino
 
BDS14 Big Data Analytics to the masses
BDS14 Big Data Analytics to the massesBDS14 Big Data Analytics to the masses
BDS14 Big Data Analytics to the massesJose Luis Lopez Pino
 
Massive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using RMassive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using RJose Luis Lopez Pino
 
Scheduling and sharing resources in Data Clusters
Scheduling and sharing resources in Data ClustersScheduling and sharing resources in Data Clusters
Scheduling and sharing resources in Data ClustersJose Luis Lopez Pino
 
RDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use itRDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use itJose Luis Lopez Pino
 
RDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use itRDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use itJose Luis Lopez Pino
 
Firefox Vs. Chromium: Guerra de los navegadores libres
Firefox Vs. Chromium: Guerra de los navegadores libresFirefox Vs. Chromium: Guerra de los navegadores libres
Firefox Vs. Chromium: Guerra de los navegadores libresJose Luis Lopez Pino
 
Presentacion Proyecto Fin De Carrera
Presentacion Proyecto Fin De CarreraPresentacion Proyecto Fin De Carrera
Presentacion Proyecto Fin De CarreraJose Luis Lopez Pino
 
Presentacion Visuse para el Hachathón
Presentacion Visuse para el HachathónPresentacion Visuse para el Hachathón
Presentacion Visuse para el HachathónJose Luis Lopez Pino
 
Desarrollar un módulo para Visuse
Desarrollar un módulo para VisuseDesarrollar un módulo para Visuse
Desarrollar un módulo para VisuseJose Luis Lopez Pino
 

Plus de Jose Luis Lopez Pino (20)

Lessons learnt from applying PyData to GetYourGuide marketing
Lessons learnt from applying PyData to GetYourGuide marketingLessons learnt from applying PyData to GetYourGuide marketing
Lessons learnt from applying PyData to GetYourGuide marketing
 
BDS14 Big Data Analytics to the masses
BDS14 Big Data Analytics to the massesBDS14 Big Data Analytics to the masses
BDS14 Big Data Analytics to the masses
 
Massive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using RMassive scale analytics with Stratosphere using R
Massive scale analytics with Stratosphere using R
 
Metadata in Business Intelligence
Metadata in Business IntelligenceMetadata in Business Intelligence
Metadata in Business Intelligence
 
Scheduling and sharing resources in Data Clusters
Scheduling and sharing resources in Data ClustersScheduling and sharing resources in Data Clusters
Scheduling and sharing resources in Data Clusters
 
Distributed streaming k means
Distributed streaming k meansDistributed streaming k means
Distributed streaming k means
 
RDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use itRDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use it
 
RDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use itRDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use it
 
Firefox Vs. Chromium: Guerra de los navegadores libres
Firefox Vs. Chromium: Guerra de los navegadores libresFirefox Vs. Chromium: Guerra de los navegadores libres
Firefox Vs. Chromium: Guerra de los navegadores libres
 
Esteganografia
EsteganografiaEsteganografia
Esteganografia
 
Presentacion Proyecto Fin De Carrera
Presentacion Proyecto Fin De CarreraPresentacion Proyecto Fin De Carrera
Presentacion Proyecto Fin De Carrera
 
Memoria Proyecto Fin de Carrera
Memoria Proyecto Fin de CarreraMemoria Proyecto Fin de Carrera
Memoria Proyecto Fin de Carrera
 
Presentacion CUSL nacional
Presentacion CUSL nacionalPresentacion CUSL nacional
Presentacion CUSL nacional
 
Resumen del proyecto Visuse
Resumen del proyecto VisuseResumen del proyecto Visuse
Resumen del proyecto Visuse
 
Presentacion cusl granadino
Presentacion cusl granadinoPresentacion cusl granadino
Presentacion cusl granadino
 
Como hacer un módulo para Visuse
Como hacer un módulo para VisuseComo hacer un módulo para Visuse
Como hacer un módulo para Visuse
 
Visuse: resumen del I Hackathon
Visuse: resumen del I HackathonVisuse: resumen del I Hackathon
Visuse: resumen del I Hackathon
 
Presentacion Visuse para el Hachathón
Presentacion Visuse para el HachathónPresentacion Visuse para el Hachathón
Presentacion Visuse para el Hachathón
 
Desarrollar un módulo para Visuse
Desarrollar un módulo para VisuseDesarrollar un módulo para Visuse
Desarrollar un módulo para Visuse
 
Control de versiones y Subversion
Control de versiones y SubversionControl de versiones y Subversion
Control de versiones y Subversion
 

Dernier

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Dernier (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

High level languages for Big Data Analytics (Report)

  • 1. High-level languages for Big Data Analytics Jose Luis Lopez Pino jllopezpino@gmail.com Janani Chakkaradhari chjananicse@yahoo.com June 19, 2013 Abstract This work presents a review of the literature about the high-level lan- guages which came out since the MapReduce programming model and Hadoop implementation shook up the parallel programming over huge datasets. MapReduce was a major step forward in the field, but it has severe limitations that the high-level programming languages try to over- come them in different ways. Our work intends to compare three of the main high level languages (Pig Latin, HiveQl and Jaql) based on four different criteria that we con- sider are very relevant and which studies are consistent in our opinion. Those criteria are expressive power, performance, query processing and JOIN implementation Analyses based on multiple criteria reveals the differences between the languages but it is shown that none of the languages analysed (Pig Latin, HiveQL and Jaql) beats all the other in every criterion. It depends on the scenario or application we need to consider these comparison results to choose which language will be more suitable for implementation Finally, we are going to address two very well-known pitfalls of MapRe- duce: latency and implementation of complex algorithms. 1 Introduction 1.1 The MapReduce programming model The MapReduce programming model was introduced by Google in 2004[5]. This model allows programmers without any experience in parallel coding to write the highly scalable programs and hence process voluminous data sets. This high level of scalability is reached thanks to the decomposition of the problem into a big number of tasks. MapReduce is a completely different approach to big data analysis and it has been proven effective in large cluster systems. This model is based on two functions that are coded by the user: map and reduce. 
• The Map function produces a set of key/value pairs, taking a single pair key/value as input. 1
  • 2. • The Reduce function takes a key and a set of values related to this key as input and it might also produce a set of values, but commonly it emits only one or zero values as output Using this model, programmers do not have to care about how the data is distributed, how to handle with failures or how to balance the system. However, this model also have important drawbacks: it is complicated to code very simple and common task using this single dataflow, many of the tasks are expensive to perform, the user code is difficult to debug, the absence of schema and indexes and a lot of network bandwidth might be consumed[10]. The purpose of the different high level languages is to address some of those shortcomings. 1.2 Hadoop In current world, the generation of data per day is measured in petabyte scales. These large amounts of data have made the need of data to be stored in more than one system at a time. This means partitioning the data and soring in separate machine. File systems that manage the storage across a network of machines are called distributed file system. The challenging aspect of DFS is in the fault tolerance which leads to data loss. Hadoop becomes the solution for this problem. Doug Cutting, the creator of Hadoop, named it after his son’s toy elephant. Hadoop has two layers such as storage and execution layer. The storage layer is Hadoop Distributed File System (HDFS) designed for storing very large files with streaming data access patterns, running on clusters of commodity hard- ware. The fixed block size of Hadoop is 64MB by default. The execution layer is Hadoop Map Reduce and it is responsible for running a job in parallel on multiple servers at the same time. 
Since the nodes of a Hadoop cluster are commodity hardware, the cost is low and the architecture is simple. A typical Hadoop cluster has a single master server, the name node, that manages the file system, while the job tracker process manages the jobs. This node should be high-quality hardware with high processing speed. The cluster also has multiple slave nodes, called data nodes, that run the tasks on their local servers. In general, the data is replicated (three times by default) on different nodes to support fault tolerance. If one of these data nodes fails, the master node detects it, immediately replaces it with an active one and updates the data [24].

2 High level languages

After MapReduce was publicly announced and the Hadoop framework was created, multiple high-level languages have been created specifically to deal with some of the problems of the model mentioned before. Some already existing languages have also been integrated to work over this model, like R (MapR, Ricardo [4]) and SQL.
Concerning the selection of languages for our comparison, we have chosen the three programming languages that are present in all the comparisons (Pig Latin, HiveQL and Jaql) for the sake of consistency, but we have also considered it important to mention some other interesting high-level query languages (Meteor and DryadLINQ). Additionally, we also study how each query is processed by the system that translates it into a MapReduce workflow.

2.1 Pig Latin

Pig Latin is executed over Hadoop, an open-source implementation of the MapReduce programming model, and was formerly developed at Yahoo! Research [14]. It is a high-level procedural language that implements high-level operations similar to those we can find in SQL, as well as some other interesting operators listed below:

• FOREACH processes a transformation over every tuple of the set. To make it possible to parallelise this operation, the transformation of one row must not depend on any other.

• COGROUP groups related tuples of multiple datasets. It is similar to the first step of a join.

• LOAD loads the input data and its structure, and STORE saves data to a file.

The main goal of Pig Latin is to reduce development time. For this purpose, it includes features like a nested data model, user-defined functions and the possibility of executing analytic queries over text files without loading the data.

Unlike SQL, the procedural nature of the language also allows programmers to have more control over the execution plan, meaning that they can speed performance up without relying on the query optimiser for this task.

2.2 HiveQL

Hive is an open-source project initially developed by Facebook. It is a system built on top of Hadoop that uses MapReduce for execution and HDFS for storage, and keeps its metadata in an RDBMS. In simple terms, Hive can be described as a data warehouse built on top of Hadoop.
The main advantage of Hive is familiarity: it extends the functionality of SQL and its queries look similar to SQL. Scalability and performance are its other two key features. Hive tables can be defined directly on HDFS, while schemas are stored in an RDBMS. In addition to the basic types, Hive supports complex column types such as map, array and struct [23]. Hive supports most transactional SQL constructs, such as:

• Subqueries
• Different kinds of joins: inner, left outer, right outer and full outer joins

• Cartesian products, group-bys and aggregations

• UNION ALL

• CREATE TABLE AS SELECT

Hive uses a traditional RDBMS to store the metadata. Metadata storage is accessed frequently, so it is preferable to keep metadata in a random-access store rather than a sequential one. As HDFS is not well suited for random access, Hive stores the metadata in databases like MySQL and Oracle. It is also worth noting that latency is low whenever HiveQL accesses metadata. In spite of this impedance, Hive maintains the consistency between metadata and data [23].

2.3 Jaql

Jaql [3] is a declarative scripting language built on top of Hadoop and used in some IBM products (InfoSphere BigInsights and Cognos Consumer Insight). This language was developed after Pig and Hive, and hence it has been designed with the purpose of making it more scalable, flexible and reusable than the alternatives that existed at the time.

Simplicity is one of the key goals of the Jaql data model, which is clearly inspired by JSON: values are always trees, there are no references and the textual representation is very similar. This simplicity has two advantages: it facilitates development and also makes the distribution of programs between nodes easier. The other main goal of the data model is adaptability. Jaql can handle semistructured documents (data without a schema) but also structured records validated against a schema. In consequence, programs written in Jaql can read and write information in different sources, from relational databases to delimited plain files.

The flexibility of the language relies on the data model but also on the control over the evaluation plan, because the programmer can work at different levels of abstraction using Jaql's syntax:

• Full definition of the execution plan.

• Use of hints to indicate some evaluation features to the optimizer.
This feature is present in most of the database engines that use SQL as their query language.

• Declarative programming, without any control over the flow.
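Jaql's pipeline style over JSON-like values can be approximated in Python, where dictionaries and lists play the role of Jaql's trees. This is purely illustrative: the operator names (`filter_op`, `transform_op`, `pipeline`) are ours, not Jaql syntax.

```python
import json

# JSON-like records: every value is a tree (dicts/lists), no references.
records = json.loads("""[
  {"user": "ann", "visits": 3},
  {"user": "bob", "visits": 7},
  {"user": "cat", "visits": 5}
]""")

def filter_op(pred):
    # Keep only the records satisfying the predicate.
    return lambda data: [r for r in data if pred(r)]

def transform_op(fn):
    # Apply a function to every record.
    return lambda data: [fn(r) for r in data]

def pipeline(data, *ops):
    # Feed the output of each operator into the next, as in a Jaql '->' chain.
    for op in ops:
        data = op(data)
    return data

result = pipeline(records,
                  filter_op(lambda r: r["visits"] > 4),
                  transform_op(lambda r: r["user"]))
print(result)  # ['bob', 'cat']
```

Because every value is a self-contained tree, any stage of such a pipeline can be shipped to another node together with its data, which is the distribution advantage described above.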
2.4 Other languages

2.4.1 Meteor

Stratosphere [2] is a system designed to process massive datasets, and one of its main components is the PACT programming model. PACT [1] is an extension of the MapReduce programming model, also inspired by functional programming. One of the limitations of MapReduce is that it is based on only two simple second-order functions; this new model addresses the problem by including new operators (called contracts) to perform such analyses more easily and more efficiently:

• Cross: performs the Cartesian product over the input sets.

• CoGroup: groups all the pairs with the same key and processes them with a user-defined function.

• Match: it also matches key/value pairs from the input data, but pairs with the same key might be processed separately by the user function.

In addition to this new programming model, the Stratosphere stack includes Meteor, a query language, and Sopremo, the operator model used by Meteor [8]. Meteor, like the other high-level languages presented before, was designed to facilitate the task of developing parallel programs using the underlying programming model (in this case PACT). In consequence, Meteor programs are ultimately translated into PACT programs. Sopremo helps manage collections of semantically rich operators that can be extended and that are grouped into packages.

To speed up the execution of the code, the optimization is applied in two steps: first the logical plan, which consists of Sopremo operators, is optimized, and then PACT's compiler applies physical optimizations to the resulting program.

2.4.2 DryadLINQ

Dryad [25] is another execution engine for large-scale analysis over datasets. DryadLINQ gained the community's interest because it is embedded in .NET programming languages, and a large number of programmers are already familiar with this development platform and with LINQ in particular.
The designers of this system made a big effort to support almost all the operators available in LINQ, plus some extra operators interesting for parallel programming. The framework also allows developers to include their own implementations of the operators [9].

After the DryadLINQ code is extracted from the program, it is translated into a Dryad plan and then optimized. The optimizer mainly performs four tasks: it pipelines operations that can be executed by a single machine, removes redundancy, pushes aggregations and reduces the network traffic.
3 Comparing high level languages

The design motivations of the languages are diverse, and therefore the differences between them are multiple. To compare these three high-level languages we have chosen four criteria that are interesting from our point of view and that are well described in the literature we have reviewed.

First of all, we have analysed the expressiveness and the general performance of the languages. In general, developers prefer a language that allows them to write concise code (expressive power) and that is efficient (performance). After that we dive into two criteria that also have an important impact on performance: the join implementation and the query processing. Join algorithms are a well-known burden on performance when working with sets, and therefore we analyse the different algorithms implemented by these languages.

We consider these criteria sufficient to choose the solution that best suits our needs; however, there are many other aspects that are also mentioned or studied in the literature written so far, like the programming paradigm, code size or scalability. Scalability is a very relevant criterion, since it motivated the creation of MapReduce, but it is not easy to find consistent literature covering all the topics that we consider significant.

3.1 Expressive power

Robert Stewart [20] classifies the high-level languages into three categories according to their computational power, from less to more powerful:

• Relational complete: a language is considered relational complete if it includes the primitive operations of the relational algebra: selection, rename, projection, set union, set difference and the cross (Cartesian) product. The different kinds of joins implemented in each language are compared in a separate section.

• SQL equivalent: SQL is a standard language for querying data stored in relational database management systems.
It provides all the operations of the relational algebra plus aggregate functions, which are not part of the relational algebra although they are a common extension for data computation.

• Turing complete: a Turing complete language must allow conditional branching, indefinite iteration by means of recursion and the emulation of an infinite memory model.

Pig Latin and HiveQL are considered SQL equivalent because they are more powerful than relational algebra (they include numeric aggregation functions). Hence we can consider their expressive power equivalent, but there are evident differences. SQL is an industry standard that has been developed for almost 40 years, including several extensions, and in consequence it is not easy to develop a SQL-compliant system.
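The relational-algebra primitives that define relational completeness are small enough to sketch directly; a toy implementation over lists of dict-rows (entirely our own illustration, not any engine's code) also shows how an equijoin falls out of cross product plus selection:

```python
def select(rel, pred):            # selection: keep rows satisfying a predicate
    return [r for r in rel if pred(r)]

def project(rel, attrs):          # projection: keep only the named attributes
    return [{a: r[a] for a in attrs} for r in rel]

def rename(rel, old, new):        # rename: change an attribute name
    return [{(new if a == old else a): v for a, v in r.items()} for r in rel]

def union(r1, r2):                # set union (duplicate rows dropped)
    return r1 + [r for r in r2 if r not in r1]

def difference(r1, r2):           # set difference
    return [r for r in r1 if r not in r2]

def cross(r1, r2):                # cross (Cartesian) product
    return [{**a, **b} for a in r1 for b in r2]

emp = [{"name": "ann", "dept": 1}, {"name": "bob", "dept": 2}]
dept = [{"dept_id": 1, "dname": "sales"}]

# An equijoin derived from the primitives: rename, cross, then select.
joined = select(cross(emp, rename(dept, "dept_id", "did")),
                lambda r: r["dept"] == r["did"])
print(project(joined, ["name", "dname"]))  # [{'name': 'ann', 'dname': 'sales'}]
```

Aggregate functions such as SUM or COUNT cannot be expressed with these six operators alone, which is exactly the gap between "relational complete" and "SQL equivalent" in the classification above.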
HiveQL is inspired by SQL but it does not support the full repertoire included in the SQL-92 specification. On the other hand, HiveQL has also included some features, notably inspired by MySQL and MapReduce, that are not part of this specification. A comparison between SQL and HiveQL by Tom White in 2009 revealed some limitations of HiveQL, such as the lack of indexes, transactions, subqueries outside the FROM clause, etc. [24]. In the case of Pig Latin, its syntax and functionality are not inspired by SQL and the differences are more obvious. For instance, Pig Latin does not have an OVER clause and includes the COGROUP operator, which is not present in SQL.

Jaql includes basic flow control using if-else structures and recursion in higher-order functions, and hence it is considered Turing complete. However, taking into account that Pig and HiveQL programs can be extended using user-defined functions, Pig Latin and HiveQL might also be considered Turing complete.

Finally, Jaql [3] makes it possible to compile high-level declarative expressions into lower-level function calls. As a result, the low-level functions can be extended. Neither Pig Latin nor HiveQL includes this feature, called source-to-source compilation, which could increase the expressiveness of the language.

3.2 Performance

The usual benchmark to measure Pig's performance is PigMix, a set of queries that tests scalability and latency [18]. The Hive performance benchmark is mainly based on the queries specified by Pavlo et al. [15], which cover a selection task, an aggregation task and a join task.
There are also a Pig Latin implementation and a HiveQL implementation of the TPC-H queries [19].

Even though the objective of each of these languages is to generate equivalent MapReduce jobs for its input script, the runtime measurements show different results for the same kind of benchmark applications, as experimented in [21]. The paper first describes scale-up, scale-out, scale-in and runtime as its performance metrics. In the scale-up experiments the size of the cluster is fixed and the computation is increased, which means the number of nodes in the Hadoop environment is kept constant. The performance of all three languages, interestingly, varied based on the distribution of the data: for skewed data, Pig and Hive seem to be more effective than Jaql.

In the scale-out experiments the computation is fixed, in the sense that there is no increase in computation for a given experiment as the number of nodes increases. Again, the paper argues that Pig and Hive are better at utilizing an increase in cluster size than Jaql, although at some point adding nodes yields no further performance improvement. Moreover, Pig and Hive allow the user to explicitly specify the number of reduce tasks, and it has been argued that this feature has a significant influence on performance.
3.3 Query processing

In order to make a good comparison we should have a basic knowledge of how these HLQLs work. In this section we focus on answering the following question: how is the abstract user representation of the query or script converted into MapReduce jobs? For data-intensive parallel computations, the choice of high-level language mainly depends on the specific application scenario [17]. Taking this into account, we can see the importance of understanding the query compilation methods implemented by these languages.

3.3.1 Pig Latin

The structure of Pig Latin is similar in style to SQL. The goal of writing a Pig Latin script is to produce equivalent MapReduce jobs that can be executed in the Hadoop environment. Pig has the basic characteristics of a query language, hence the initial steps of compilation are similar to SQL query processing [6]. Pig programs are first passed to a parser component, which checks the syntactic correctness of the Pig Latin script. The result of the parser is a complete logical plan. Unlike SQL, where parsing produces a parse tree, the parsing phase of Pig compilation produces a directed acyclic graph (DAG). The logical plan is then passed to the logical optimizer component, where classical optimization operations such as pushing down projections are carried out. The result of the logical optimizer is passed to the MapReduce compiler to compute a sequence of MapReduce jobs, which is then passed to an optimization phase and finally submitted to Hadoop for execution [6].

The following example (Figure 1) describes the generation of the logical plan for a simple word-count program in Pig Latin. The output of each operator is shown next to the rectangles.

Figure 1: Compilation of Pig Latin to Logical Plan

Pig then translates the logical plan into a physical plan, replacing the logical operators with physical operators in the MapReduce jobs.
In most cases a logical operator becomes the equivalent physical operator; here LOAD, FOREACH and STORE remain the same. In Pig, the GROUP operator is translated into LOCAL REARRANGE, GLOBAL REARRANGE and PACKAGE in the physical plan.
Rearranging means either hashing or sorting by key. The combination of local and global rearranges produces the result in such a way that tuples having the same group key are moved to the same machine [6].

The input data is broken down into chunks, and the map tasks all run independently in parallel to process the input data; this is handled by the job tracker of the MapReduce framework. Map takes the input data and constructs the key/value pairs. It also sorts the data by key. The shuffle phase is managed by Hadoop: it fetches the corresponding partition of data from the map phase, merges it into a single sorted list and then groups by key. This is the input for the reduce phase, which usually performs the aggregation part of the query (in our case, a count). The equivalent MapReduce jobs for our example Pig Latin script are shown in Figure 2.

Figure 2: Physical Plan to Map Reduce Jobs

3.3.2 HiveQL

In a DBMS, the query processor transforms user queries into a sequence of database operations and executes those operations. Initially the query is turned into a parse tree structure, which is then transformed into relational algebraic notation; this is termed the logical query plan [Garcia]. As HiveQL has a SQL-like declarative structure, its query processing is similar to SQL processing in a traditional database engine. The following steps describe, in brief, query processing in HiveQL [23]:

• Hive receives the SQL string from the client.

• The parser phase converts it into a parse tree representation.

• The semantic analyser component converts the parse tree into a block-based internal query format.
• The logical query plan generator converts it into a logical query representation and then optimizes it, pruning columns early and pushing predicates closer to the tables.

• Finally, the logical plan is converted into a physical plan and then into MapReduce jobs.

3.3.3 Jaql

A Jaql script is first evaluated by the compiler. MapReduce jobs can be called directly from a Jaql script, but usually the user relies on the Jaql compiler to convert the script into MapReduce jobs. Jaql includes two higher-order functions, mapReduceFn and mrAggregate, to execute map-reduce and aggregate operations respectively. The rewriter engine generates calls to mapReduceFn or mrAggregate by identifying the relevant parts of the script and moving them into the map, reduce and aggregate function parameters. Based on a set of rules, the rewriter converts them into an Expr tree. Finally it checks for the presence of algebraic aggregates; if they are present, it invokes the mrAggregate function, in other words completing the task with a single MapReduce job (Figure 3).

Figure 3: Jaql-Query processing stages

Each language has its own implementation of query processing. During the review it was noted that:

• Pig currently misses out on optimized storage structures like indexes and column groups. HiveQL provides more optimization functionality, such as performing joins in the map phase instead of the reduce phase and, in the case of sampling queries, pruning the buckets that are not needed.

• Jaql's physical transparency is an added-value feature because it allows the user to add new runtime operators without affecting Jaql's internals.
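The LOCAL REARRANGE / GLOBAL REARRANGE / PACKAGE decomposition of Pig's GROUP operator (Section 3.3.1) can be simulated on plain Python lists. The function names mirror the physical operators; everything else is our own toy model, not Pig internals:

```python
from collections import defaultdict

def local_rearrange(partition, key_fn):
    # Each map task tags its tuples with the group key and sorts locally.
    return sorted(((key_fn(t), t) for t in partition), key=lambda kv: kv[0])

def global_rearrange(partitions, n_reducers):
    # Route every (key, tuple) pair to a reducer chosen by hashing the key,
    # so all tuples with the same key end up on the same machine.
    reducers = defaultdict(list)
    for part in partitions:
        for key, t in part:
            reducers[hash(key) % n_reducers].append((key, t))
    return reducers

def package(reducer_input):
    # Collect the tuples of each key into one bag, as PACKAGE does.
    bags = defaultdict(list)
    for key, t in sorted(reducer_input):
        bags[key].append(t)
    return dict(bags)

# Group words by their first letter across two map partitions.
parts = [local_rearrange(p, key_fn=lambda w: w[0])
         for p in [["ant", "bee"], ["ape", "bug"]]]
grouped = {k: v
           for r in global_rearrange(parts, 2).values()
           for k, v in package(r).items()}
print(grouped)  # {'a': ['ant', 'ape'], 'b': ['bee', 'bug']}
```

The hash-based routing is why all tuples sharing a group key meet on one reducer regardless of which map partition produced them.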
3.4 JOIN implementation

Join is an essential operation in relational database models. The basic need for joins comes from the fact that relations are in normalized form; so, in the computation of aggregations, and in many kinds of OLAP operations, the join becomes a necessary step to compute the expected results.

3.4.1 Pig Latin

Pig Latin supports inner joins, equijoins and outer joins. The JOIN operator always performs an inner join. Pig executes joins in two flavours: first, a join can be achieved by a COGROUP operation followed by FLATTEN [4]. Second, the inner join can be extended to three specialized joins [16]:

• Skewed joins: the basic idea is to compute a histogram of the key space and use this data to allocate reducers for a given key. Currently Pig allows a skewed join of only two tables. The join is performed in the reduce phase.

• Merge joins: Pig allows a merge join only if the input relations are already sorted.

• Fragment-replicate joins: this is only possible if one of the two relations is small enough to fit into memory. In this case, the big relation is distributed across the Hadoop nodes and the smaller relation is replicated on each node. Here the entire join operation is performed in the map phase. Of course, this is the trivial case.

The join strategy can be specified by the user while writing the script. An example join operation in Pig Latin is shown in Figure 4.

Figure 4: Join code in Pig Latin

3.4.2 HiveQL

In its early stages, HiveQL only supported the common join operation. In this join, during the map phase the tables being joined are read and pairs of join key and value are written into an intermediate file, which is passed to the shuffle phase handled by Hadoop. In the shuffle phase, Hadoop sorts and combines these key/value pairs and sends the tuples having the same key to the reducers, which perform the actual join operation.
Here the shuffle and reduce phases are the expensive part, since they involve sorting.
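The common (reduce-side) join described above can be sketched as follows: the map phase tags each row with its source table and join key, the shuffle groups by key, and the reduce phase pairs the rows. A simplified single-process sketch, not Hive internals:

```python
from collections import defaultdict

def common_join(left, right, left_key, right_key):
    # Map phase: tag every row with its origin and emit (join_key, row).
    tagged = [(row[left_key], ("L", row)) for row in left]
    tagged += [(row[right_key], ("R", row)) for row in right]

    # Shuffle phase: Hadoop would sort and route; a dict of lists suffices here.
    groups = defaultdict(list)
    for key, payload in tagged:
        groups[key].append(payload)

    # Reduce phase: within each key, pair every left row with every right row.
    out = []
    for payloads in groups.values():
        lefts = [r for tag, r in payloads if tag == "L"]
        rights = [r for tag, r in payloads if tag == "R"]
        out += [{**l, **r} for l in lefts for r in rights]
    return out

users = [{"uid": 1, "name": "ann"}, {"uid": 2, "name": "bob"}]
orders = [{"uid": 1, "item": "book"}, {"uid": 1, "item": "pen"}]
print(common_join(users, orders, "uid", "uid"))
# ann joined with both of her orders; bob has no matching order
```

The cost noted above is visible even in this sketch: every row of both tables must be tagged, shipped and grouped before any output can be produced, which is what the map-side alternative avoids.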
To overcome this, the map-side join was introduced; it is only possible when one of the joined tables fits entirely into memory. It is similar to the fragment-replicate join in Pig Latin.

3.4.3 Jaql

Jaql supports only equijoins. JOIN is expressed between two or more input arrays. It supports multiple types of joins, including natural, left-outer, right-outer and full outer joins. One of the advantages of Jaql is that physical transparency allows its function support to add new join operators and use them in queries without modifying anything in the query compiler.

The following points summarize the join implementations:

• Both Pig and Hive can perform joins in the map phase instead of the reduce phase.

• For skewed data distributions, the join performance of Jaql is not comparable to that of the other two languages.

4 Future work

4.1 Interactive queries

One of the main problems of MapReduce and all the languages built on top of this framework (Pig, Hive, etc.) is latency. As a complement to those technologies, some new frameworks that allow programmers to query large datasets interactively have been developed, like Dremel [12] or the open-source project Apache Drill. In order to reduce query latency compared to other tools for large-dataset analysis, Dremel stores the information as nested columns, uses a multi-level tree architecture for query execution and balances the load by means of a query dispatcher. We do not have many details of Dremel's query language, but we know that it is based on SQL and includes the usual operations (selection, projection, etc.) and features (user-defined functions or nested subqueries) of SQL-like languages. The characteristic that distinguishes this language is that it operates with nested tables as inputs and outputs.
4.2 Machine learning

MapReduce is a way to process big data, and it obviously performs well for basic operations such as selection; on the other hand, it is more complicated to address complex queries with this processing technique. The challenging aspect of machine learning algorithms is that they do not simply compute aggregates over datasets, but identify patterns hidden in the given data. An example of such a question is: which page will the visitor visit next?
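One learning workload does fit the model well: counting the occurrences of each (feature, label) pair in labelled text, the core of a simple Bayes classifier. A toy single-process sketch (plain Python, function names ours):

```python
from collections import Counter

def map_fn(label, text):
    # Emit ((feature, label), 1) for every token in a labelled document.
    for feature in text.split():
        yield ((feature, label), 1)

def count_feature_labels(training_data):
    # Shuffle + reduce collapse into one summation per (feature, label) key;
    # in a real job each reducer would sum the counts for its share of keys.
    counts = Counter()
    for label, text in training_data:
        for key, one in map_fn(label, text):
            counts[key] += one
    return counts

data = [("spam", "buy now"), ("spam", "buy cheap"), ("ham", "meeting now")]
counts = count_feature_labels(data)
print(counts[("buy", "spam")])  # 2
print(counts[("now", "ham")])   # 1
```

Counting is embarrassingly parallel; the hard cases for MapReduce are the iterative algorithms (clustering, gradient-based training) that need many passes over the data, as discussed below.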
Some ML algorithms and the general approach to processing the data with MapReduce are discussed in [7]. A Bayes classifier requires counting the occurrences in the training data. On a large data set the extraction of features is intensive, and at least the reduce task should be configured to compute the sum for each (feature, label) pair.

Mahout is an Apache project for building scalable machine learning algorithms. These algorithms include clustering, classification, collaborative filtering and frequent-itemset mining, which in turn are predominantly used in recommendations. The collaborative filtering supports both user-user and item-item based similarity [22].

• Pig Latin has extensions to deal with predictive analytics capabilities: Twitter has implemented learning algorithms by placing them in Pig storage functions [11]. The storage functions are called in the final reduce stage of the overall dataflow.

• There is recent work on extensive support for ML in Hive [13]. The author follows the approach that Twitter implemented for Pig; here the machine learning is treated as UDFs.

• Ricardo, a new data analytics platform, combines the functionality of R and Jaql. It takes advantage of the statistical computing features provided by R together with a high-level language that generates MapReduce jobs using Jaql [4].

5 Conclusions

In this literature review we first introduced the MapReduce programming model, paying attention to its main drawbacks and its main open-source implementation (Hadoop). After that we briefly described some high-level languages that try to address the problems mentioned from different perspectives, focusing on those that are popular in the literature available at the time of writing (Pig Latin, HiveQL and Jaql) and some interesting alternatives (DryadLINQ and Meteor).
Based on the consistent and relevant studies reviewed, it is clear that there is no single language that beats all the other options. Jaql was created after the other two languages, and that probably gave it some advantages in its design. Based on the first criterion analysed, we can state that Jaql is expressively more powerful, since it includes basic flow control using if-else structures, while with the other two this is only possible using UDFs. However, we have seen that Jaql also shows the worst performance in the benchmarks described before. Pig and Hive probably perform better in those benchmarks because they support map-phase joins. Hive also adopts advanced optimization techniques for query processing that certainly speed up the resulting code.

Finally, we have seen how high-level languages for big data analytics are addressing some of the problems of this paradigm. Real-time processing demands
a very low latency of response, and this is one of the main disadvantages of the MapReduce model. In consequence, some new languages for large-dataset analytics that do not use this model have been designed. Additionally, some machine learning algorithms are difficult to implement using this model. Some alternatives have shown up in the last years; for instance, the Apache Software Foundation is developing Mahout, a library that implements scalable machine learning algorithms using the map/reduce paradigm.

References

[1] Alexander Alexandrov, Stephan Ewen, Max Heimel, Fabian Hueske, Odej Kao, Volker Markl, Erik Nijkamp, and Daniel Warneke. MapReduce and PACT - comparing data parallel programming models. In Proceedings of the 14th Conference on Database Systems for Business, Technology, and Web (BTW), BTW 2011, pages 25–44, Bonn, Germany, 2011. GI.

[2] Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and Daniel Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, pages 119–130, New York, NY, USA, 2010. ACM.

[3] Kevin S. Beyer, Vuk Ercegovac, Rainer Gemulla, Andrey Balmin, Mohamed Eltabakh, Carl-Christian Kanne, Fatma Ozcan, and Eugene J. Shekita. Jaql: A scripting language for large scale semistructured data analysis. In Proceedings of the VLDB Conference, 2011.

[4] Sudipto Das, Yannis Sismanis, Kevin S. Beyer, Rainer Gemulla, Peter J. Haas, and John McPherson. Ricardo: integrating R and Hadoop. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, pages 987–998. ACM, 2010.

[5] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.

[6] Alan F. Gates, Olga Natkovich, Shubham Chopra, Pradeep Kamath, Shravan M. Narayanamurthy, Christopher Olston, Benjamin Reed, Santhosh Srinivasan, and Utkarsh Srivastava.
Building a high-level dataflow system on top of map-reduce: the Pig experience. Proceedings of the VLDB Endowment, 2(2):1414–1425, 2009.

[7] Dan Gillick, Arlo Faria, and John DeNero. MapReduce: Distributed computing for machine learning. Berkeley (December 18, 2006), 2006.

[8] Arvid Heise, Astrid Rheinländer, Marcus Leich, Ulf Leser, and Felix Naumann. Meteor/Sopremo: An extensible query language and operator model. In Proceedings of the International Workshop on End-to-end Management of Big Data (BigData) in conjunction with VLDB 2012, 2012.
[9] Michael Isard and Yuan Yu. Distributed data-parallel computing using a high-level programming language. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, pages 987–994. ACM, 2009.

[10] Kyong-Ha Lee, Yoon-Joon Lee, Hyunsik Choi, Yon Dohn Chung, and Bongki Moon. Parallel data processing with MapReduce: a survey. ACM SIGMOD Record, 40(4):11–20, 2012.

[11] Jimmy Lin and Alek Kolcz. Large-scale machine learning at Twitter. In Proceedings of the 2012 International Conference on Management of Data, pages 793–804. ACM, 2012.

[12] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2):330–339, 2010.

[13] Extension of Hive to support Machine Learning. Hiveql.

[14] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pages 1099–1110. ACM, 2008.

[15] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. A comparison of approaches to large-scale data analysis. In Proceedings of the 35th SIGMOD International Conference on Management of Data, pages 165–178. ACM, 2009.

[16] Pig. Fragment replicate join.

[17] Caetano Sauer and Theo Härder. Compilation of query languages into MapReduce. Datenbank-Spektrum, pages 1–11, 2013.

[18] Benchmarking standards. PigMix.

[19] Benchmarking standards Hive. Hiveql.

[20] Robert Stewart. Performance and programmability of high level data parallel processing languages: Pig, Hive, Jaql & Java-MapReduce, 2010. Heriot-Watt University.

[21] Robert J. Stewart, Phil W. Trinder, and Hans-Wolfgang Loidl. Comparing high level MapReduce query languages. In Advanced Parallel Processing Technologies, pages 58–72.
Springer, 2011.

[22] Ronald C. Taylor. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics. BMC Bioinformatics, 11(Suppl 12):S1, 2010.
[23] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629, 2009.

[24] Tom White. Hadoop: The Definitive Guide. O'Reilly Media, 2012.

[25] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation, pages 1–14, 2008.