
FastR+Apache Flink

During the past few years, R has become an important language for data analysis, data representation and visualization. R is a very expressive language that combines functional and dynamic aspects with laziness and object-oriented programming. However, the default R implementation is neither fast nor distributed, both of which are crucial for "big data" processing.

This talk presents FastR-Flink, a compiler based on Oracle's R implementation FastR with support for some operations of Apache Flink, a Java/Scala framework for distributed data processing. Apache Flink constructs such as map, reduce and filter are integrated at the compiler level to allow the execution of distributed stream and batch data processing applications directly from the R programming language.

Published in: Software


  1. FastR + Apache Flink: Enabling distributed data processing in R <juan.fumero@ed.ac.uk>
  2. Disclaimer: this is a work in progress.
  3. Outline:
     1. Introduction and motivation: Flink and R; Oracle Truffle + Graal
     2. FastR and Flink: execution model; memory management
     3. Preliminary results
     4. Conclusions and future work
     5. [DEMO] within a distributed configuration
  4. Introduction and motivation
  5. Why R? R is one of the most popular languages for data analytics (http://spectrum.ieee.org). R is: functional, lazy, object-oriented, dynamic.
  6. But R is slow. [Chart: slowdown compared to C, per benchmark]
  7. Problem: R is neither fast nor distributed.
  9. Apache Flink:
     ● Framework for distributed stream and batch data processing
     ● Custom scheduler
     ● Custom optimizer for operations
     ● Hadoop/Amazon plugins
     ● YARN connector for cluster execution
     ● It runs on laptops, clusters and supercomputers with the same code
     ● High-level APIs: Java, Scala and Python
  10. Flink example, word count:
      public class LineSplit ... {
          public void flatMap(...) {
              for (String token : value.split("\\W+"))
                  if (token.length() > 0) {
                      out.collect(new Tuple2<String, Integer>(token, 1));
                  }
          }
          ...
      }
      DataSet<String> textInput = env.readTextFile(input);
      DataSet<Tuple2<String, Integer>> counts = textInput.flatMap(new LineSplit())
          .groupBy(0)
          .sum(1);
      [Diagram: Flink client and optimizer, JobManager, TaskManagers]
  11. Our solution. How can we enable distributed data processing in R exploiting Flink? We propose a compilation approach built within Truffle/Graal for distributed computing on top of Apache Flink.
  12. FastR on top of Flink (Flink material)
  13. Truffle/Graal overview
  14. How to implement your own language? (Oracle Labs material)
  15. Truffle/Graal infrastructure (Oracle Labs material)
  16. Truffle node specialization (Oracle Labs material)
  17. Deoptimization and rewriting (Oracle Labs material)
  18. More details about Truffle/Graal: https://wiki.openjdk.java.net/display/Graal/Publications+and+Presentations
      Thomas Würthinger, Christian Wimmer, Andreas Wöß, Lukas Stadler, Gilles Duboscq, Christian Humer, Gregor Richards, Doug Simon, Mario Wolczko. One VM to Rule Them All. In Proceedings of Onward!, 2013.
  19. OpenJDK Graal, a new JIT for Java:
      ● The Graal compiler is a JIT compiler for Java which is itself written in Java
      ● The Graal VM is a modification of the HotSpot VM: it replaces the client and server compilers with the Graal compiler
      ● Open source: http://openjdk.java.net/projects/graal/
  20. FastR + Flink: implementation
  21. Our solution: the FastR-Flink compiler.
      ● Our goal: run R data processing applications in a distributed system as easily as possible
      ● Approach: custom FastR-Flink compiler builtins (library); minimal R/Flink setup and configuration; "offload" heavy computation when a cluster is available
  22. FastR-Flink API example, word count:
      # Word Count in R
      bigText <- flink.readTextFile("hdfs://")
      createTuples <- function(text) {
          words <- strsplit(text, " ")[[1]]
          tuples <- list()
          for (w in words) {
              tuples[[length(tuples) + 1]] <- list(w, 1)
          }
          return (tuples)
      }
      splitText <- flink.flatMap(bigText, createTuples)
      groupBy <- flink.groupBy(splitText, 0)
      count <- flink.sum(groupBy, 1)
      hdfsPath <- "hdfs://"
      flink.writeAsText(count, hdfsPath)
  23. Supported operations:
      flink.sapply: local and remote apply (blocking)
      flink.execute: execute the previous operations (blocking)
      flink.collect: execute the operations (blocking)
      flink.map: apply a local/remote map (non-blocking)
      flink.sum: sum arrays (non-blocking)
      flink.groupBy: group tuples (non-blocking)
  24. Supported operations (continued):
      flink.reduce: reduction operation (blocking and non-blocking versions)
      flink.readTextFile: read from HDFS (non-blocking)
      flink.filter: filter data according to the function (blocking and non-blocking)
      flink.arrayMap: map with bigger chunks per Flink thread
      flink.connectToJobManager: establish the connection with the main server
      flink.setParallelism: set the parallelism degree for future Flink operations
  25. Assumptions:
      ● Distributed code with no side effects allowed → better for parallelism
      ● R is mostly a functional programming language
      alpha <- 0.01652
      g <- function(x) {
          ...
          if (x < 10) {
              alpha <- 123.46543   # local assignment, does not modify the global alpha
          }
          ...
      }
  26. Restrictions: R functions that change the data type, f: Integer → (String | Double)
      f <- function(x) {
          if (x < 10) {
              return("YES")
          } else {
              return(0.123)
          }
      }
      > f(2)
      [1] "YES"   # String
      > f(100)
      [1] 0.123   # Double
      In FastR + Flink the data type is inferred from the first execution; the type cannot change afterwards with Flink.
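The restriction above can be sketched in plain Java. This is a hypothetical illustration (the class and method names are ours, not FastR's): the result type is fixed by the first execution of the user function, and any later call that produces a different type is rejected.

```java
import java.util.function.Function;

// Sketch of first-execution type inference: the first call fixes the
// return type; subsequent calls must keep producing that type.
public class FirstRunTypeInference {
    private Class<?> inferred;
    private final Function<Object, Object> userFunction;

    public FirstRunTypeInference(Function<Object, Object> f) {
        this.userFunction = f;
    }

    public Object apply(Object x) {
        Object result = userFunction.apply(x);
        if (inferred == null) {
            inferred = result.getClass();        // first execution fixes the type
        } else if (!inferred.isInstance(result)) {
            throw new IllegalStateException("return type changed: "
                    + inferred.getSimpleName() + " -> "
                    + result.getClass().getSimpleName());
        }
        return result;
    }
}
```

Applied to the f above, f(2) fixes the type to String, so a later f(100) returning a Double is rejected instead of silently changing the Flink DataSet's element type.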
  27. FastR + Flink: execution model
  28. Cool, so how does it work? Where R you?
      ipJobManager <- ""
      flink.connectToJobManager(ipJobManager)
      flink.setParallelism(2048)
      userFunction <- function(x) {
          x * x
      }
      result <- flink.sapply(1:10000, userFunction)
  29. VM cluster organization: R source code enters the JobManager (FastR + GraalVM), which drives the TaskManagers (each running FastR + GraalVM).
  30. FastR-Flink execution workflow
  31. 1a) Identify the skeleton.
      R user program:
      map <- flink.sapply(input, f)
      FastR-Flink compiler (parallel map node):
      switch (builtin) {
          ...
          case "flink.sapply":
              return new FlinkFastRMap();
          ...
      }
  32. 1b) Create the JAR file. JAR compilation for the user-defined function:
      public static ArrayList<String> compileDefaultUserFunctions() {
          for (String javaSource : ClassesToCompile) {
              JavaCompileUtils.compileJavaClass(...);
              String jarFileName = ...
              String jarPath = FastRFlinkUtils.buildTmpPathForJar(jarFileName);
              jarFiles.add(jarPath);
              JavaCompileUtils.createJarFile(...);
          }
          jarFiles.add("fastr.jar");
          return jarFiles;
      }
  33. 2-3) Data type inference and Flink objects:
      private static <T> MapOperator<?, ?> remoteApply(Object input, RFunction function,
              RAbstractVector[] args, String[] scopeVars, String returns) {
          Object[] argsPackage = FastRFlinkUtils.getArgsPackage(nArgs, function,
                  (RAbstractVector) input, args, argsName);
          firstOutput = function.getTarget().call(argsPackage);
          returnsType = inferOutput(firstOutput);
          ArrayList<T> arrayInput = prepareArrayList(returnsType, input, args);
          MapOperator<?, ?> mapOperator = computation(arrayInput, codeSource, args,
                  argsName, returnsType, scopeVars, frame);
          MetadataPipeline mapPending = new MetadataPipeline(input, firstOutput, returns);
          mapPending.setIntermediateReturnType(returnsType);
          FlinkPromises.insertOperation(Operation.REMOTE_MAP, mapPending);
          return mapOperator;
      }
  34. 4) Code distribution. Send the JAR files:
      ArrayList<String> compileUDF = FlinkUDFCompilation.compileDefaultUserFunctions();
      RemoteEnvironment fastREnv = new RemoteEnvironment();
      fastREnv.createRemoteEnvironment(FlinkJobManager.jobManagerServer, FLINK_PORT, jarFiles);
      The Flink UDF class also receives the String which represents the R user code:
      inputSet.map(new FastRFlinkMap(codeSource)).setParallelism(parallelism).returns(returns);
  35. 5) Data transformation: creating a view for the data. [Diagram: RVector → Flink DataSet preparation]
      RVectors in FastR are not serializable → boxing and unboxing are needed.
      Paper: "Runtime Code Generation and Data Management for Heterogeneous Computing in Java"
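The boxing/unboxing step above can be sketched as follows. This is an illustrative helper of our own, not FastR's actual code: a primitive R-style vector is boxed into a serializable List before Flink can ship it as a DataSet, and unboxed back into a primitive array for the R side.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: box a primitive vector into a serializable List view, and
// unbox the Flink result back into a primitive array.
public class VectorView {
    static List<Double> box(double[] rVector) {
        List<Double> view = new ArrayList<>(rVector.length);
        for (double d : rVector) {
            view.add(d);                         // boxing: double -> Double
        }
        return view;
    }

    static double[] unbox(List<Double> dataSet) {
        double[] out = new double[dataSet.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = dataSet.get(i);             // unboxing: Double -> double
        }
        return out;
    }
}
```

The round trip is lossless but not free, which is why the cited paper treats data management between primitive arrays and boxed views as a performance concern.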
  36. 6) FastR-Flink execution, JIT compilation:
      @Override
      public R map(T t) throws Exception {
          if (!setup) {
              setup();
          }
          Object[] argsFunction = composeArgsForRFunction(t);
          Object[] argumentsPackage = RArguments.create(userFunction, null, null, 0,
                  argsFunction, rArguments, null);
          Object result = target.call(argumentsPackage);
          return (R) new Tuple2<>(t.getField(0), result);
      }
  37. 6) FastR-Flink execution, JIT compilation:
      ● Each Flink execution thread has a copy of the R code
      ● It builds the AST and starts interpretation using Truffle
      ● After warm-up, execution reaches the JIT-compiled code
      ● The language context (RContext) is shared among the threads
      ● Scope variables are passed by broadcast
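The threading scheme above can be sketched in plain Java (no Truffle involved; all names here are ours): each worker task runs its own copy of the user function, while a single shared map stands in for the language context that all threads see.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.IntUnaryOperator;

// Sketch: per-task copies of the user code, one shared context object.
public class WorkerModel {
    static final ConcurrentHashMap<String, Object> sharedContext = new ConcurrentHashMap<>();

    public static List<Integer> run(List<Integer> input, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (int x : input) {
                futures.add(pool.submit(() -> {
                    IntUnaryOperator ast = v -> v * v;          // per-task copy of the user code
                    sharedContext.putIfAbsent("context", "up"); // shared context, set once
                    return ast.applyAsInt(x);
                }));
            }
            List<Integer> out = new ArrayList<>();
            for (Future<Integer> f : futures) {
                out.add(f.get());
            }
            return out;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In the real system each thread's copy warms up independently in the Truffle interpreter before reaching JIT-compiled code, which is one source of the distributed-memory startup cost discussed later.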
  38. 7) Data transformation: get the result. [Diagram: Flink DataSets → RVector]
      Build RVectors: the R user gets the final results in the proper R data representation.
  39. Flink web interface:
      ● It works with FastR!
      ● Track jobs, check history, check configuration
      ● But it shows Java traces, which may not be very useful for R users
  40. Memory management
  41. And memory management on the TaskManager? By default Apache Flink reserves 70% of the heap for hashing. Solution: set taskmanager.memory.fraction: 0.5 in the config file. The user space needs room for:
      ● Graal compilation threads
      ● Truffle compilation threads
      ● The R interpreter
      ● R memory space
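The heap split above goes into Flink's standard configuration file; a minimal sketch (the 0.5 value is the one from the slide, and the right fraction is workload-dependent):

```yaml
# flink-conf.yaml: shrink Flink's managed-memory share so the Graal/Truffle
# compilation threads and the R interpreter have more user space on the heap
taskmanager.memory.fraction: 0.5
```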
  42. Some questions:
      ● Is there any heuristic to know the ideal heap distribution?
      ● Is there any way to format the names of the operations for the Flink web interface? R function names? Custom names for easy tracking?
      ● Is there any way to attach custom initialization code to the TaskManager?
  43. Preliminary results
  44. Evaluation setup:
      ● Shared memory: big server machine (Xeon(R), 32 real cores; 12 GB heap; FastR compiled with OpenJDK 1.8_60)
      ● 4 benchmarks: PI approximation, Binary Trees, Binomial, Nested-Loops (high computation)
  45. Preliminary results [chart]
  46. Preliminary results [chart]
  47. And distributed memory?
      ● It works (I will show you in the demo in a bit)
      ● Some remaining performance issues: multi-threaded context issues; each Truffle language initializes a context within the application; it will need pre-initialization within the Flink TaskManager
      ● This is work in progress
  48. Conclusions and future work
  49. Conclusions:
      ● Prototype: an R implementation with some Flink operations included in the compiler (FastR-Flink)
      ● Easy installation and easy programmability
      ● Good speedups in shared memory for high computation
      ● Investigating distributed memory to increase speedups
  50. Future work, a world to explore:
      ● Solve distributed performance issues
      ● More realistic benchmarks (big data)
      ● Distributed benchmarks
      ● Support more Flink operations
      ● Support Flink for other Truffle languages using the same approach
      ● Define a stable API for high-level programming languages
      ● Optimisations (caching)
      ● GPU support based on our previous work [1]
      [1] Juan Fumero, Toomas Remmelg, Michel Steuwer and Christophe Dubach. Runtime Code Generation and Data Management for Heterogeneous Computing in Java. 2015 International Conference on Principles and Practices of Programming on the Java Platform.
  51. Check it out! It is open source.
      Clone and compile:
      $ mx sclone https://bitbucket.org/allr/fastr-flink
      $ hg update -r rflink-0.1
      $ mx build
      Run:
      $ mx R   # start the interpreter with a Flink local environment
  52. Thanks for your attention. AND DEMO!
      > flink.ask(questions)
      Contact: Juan Fumero <juan.fumero@ed.ac.uk>
      Thanks to Oracle Labs, TU Berlin and The University of Edinburgh