
20190705 py data_paris_meetup

Warp 10 is a time series database with optional geo support, written in Java. One of its key differentiating factors is WarpScript, which is not only a query language but also a fully fledged programming language tailored to ease time series processing. In this presentation, we explain how Python and WarpScript can interoperate efficiently, using the bridges that libraries such as Py4J, Pyrolite and PySpark build between the Python and Java ecosystems. We show the benefits of doing so through examples and Jupyter notebooks.

Published in: Data & Analytics

  1. 1. Combining the Strengths of Python and Java to leverage Time Series and Geo Time Series™
  2. 2. Agenda: Part I, Presentation of Time Series Databases and Warp 10; Part II, Leverage the Strengths of both the Java and Python Worlds; Part III, Incentives to combine Python and WarpScript; Conclusion
  3. 3. Part I: Presentation of Time Series Databases and Warp 10™
  4. 4. What are Time Series?
  5. 5. Domains and use cases
  8. 8. Why it’s not easy to build a TSDB. Storage: scalability; ingestion / fetch performance; security; deployment (e.g. standalone vs edge vs distributed). Analytics: complex queries, at least w.r.t. time; concurrent access; interoperability with other programs / languages / libraries; parallelizable when storage is distributed
  9. 9. Here comes Warp 10™!
  12. 12. The Geo Time Series™ data model. Metadata: a classname and labels (key1: value1, key2: value2, ...; immutable), which together identify a GTS, plus attributes (mutable). Datapoints: timestamps, values (Long, Double, String, Bytes, multi-values, nested GTS, ...) and optional geostamps
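The data model on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `GeoTimeSeries`, `Datapoint` and `selector` are hypothetical and do not come from any Warp 10 client library, and the sample values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple, Union

Value = Union[int, float, str, bytes]


@dataclass
class Datapoint:
    timestamp: int                                   # tick, in platform time units
    value: Value
    location: Optional[Tuple[float, float]] = None   # optional (lat, lon)
    elevation: Optional[int] = None                   # optional elevation


@dataclass
class GeoTimeSeries:
    classname: str
    labels: Dict[str, str] = field(default_factory=dict)       # treated as immutable
    attributes: Dict[str, str] = field(default_factory=dict)   # mutable
    datapoints: List[Datapoint] = field(default_factory=list)

    def selector(self) -> str:
        """Render classname and labels in a 'class{k=v,...}' form."""
        inner = ",".join(f"{k}={v}" for k, v in sorted(self.labels.items()))
        return f"{self.classname}{{{inner}}}"


# Hypothetical sample series with one geolocated datapoint.
gts = GeoTimeSeries("temperature", labels={"sensor": "s1", "city": "Paris"})
gts.datapoints.append(Datapoint(1562284800000000, 21.5, location=(48.85, 2.35)))
gts.attributes["unit"] = "celsius"
print(gts.selector())
```

Keeping the metadata separate from the datapoints mirrors the slide: the classname and labels name the series, while attributes can change without touching the data.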
  17. 17. Warp 10 Storage Engine: Geo Time Series™ data model support; built for performance and scalability; secure by design, with strong authentication/authorization; use of standard protocols and formats; scales from a single computer to a large cluster
  22. 22. Warp 10 Analytics Engine: a library of 900+ functions, chained as $data FUNC1 FUNC2 FUNC3 ...
  23. 23. 80% of effort Scope of Warp 10™
  28. 28. Advantages of Warp 10™ vs other TSDBs: broader scope, from storage to analytics; more complex queries and analytics; optional support for Geo; both storage and analytics are distributable; strongly interoperable with other tools
  29. 29. Get your hands on Warp 10™ in no time https://sandbox.senx.io
  30. 30. Part II: Leverage the Strengths of both the Java and Python Worlds
  35. 35. Py4J: a bridge between Python and Java. Through the gateway, Python creates a WarpScript execution environment (called a stack) and interacts with it
  38. 38. Convert, pickle and unpickle data exchanged with the WarpScript execution environment (called stack). Pickle: fast and straightforward conversions; much faster and more compact than JSON; secured unpickling under the hood (Pyrolite)
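The compactness claim can be checked on the Python side with a small sketch like the one below. The sample data (1000 microsecond ticks and float values) is hypothetical, only the standard `pickle` and `json` modules are used, and the sketch measures size only, not speed or the Java-side unpickling.

```python
import json
import math
import pickle

# Hypothetical sample: 1000 microsecond ticks and float values.
ticks = [1562284800000000 + i * 1000000 for i in range(1000)]
values = [math.sin(i) for i in range(1000)]
payload = {"ticks": ticks, "values": values}

as_pickle = pickle.dumps(payload)
as_json = json.dumps(payload).encode("utf-8")

# Both encodings restore the same Python object.
restored = pickle.loads(as_pickle)
assert restored == payload
assert json.loads(as_json) == payload

# Compare the encoded sizes in bytes.
print(len(as_pickle), len(as_json))
```

Binary pickle stores each float and large integer in a fixed number of bytes, whereas JSON writes every digit as text, which is why the pickled payload is smaller for numeric series like this one.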
  39. 39. How to combine Python with WarpScript
  43. 43. a) Use the cell magic: %%warpscript
      Install: pip install warp10-jupyter
      Load: %load_ext warpscript
      Usage: %%warpscript [--stack STACK] [--address ADDRESS] [--port PORT]
      Output: Starting connection with ADDRESS:PORT. Creating a new WarpScript stack accessible under variable "STACK".
  46. 46. b) Use the gateway of a Warp 10™ instance
      from py4j.java_gateway import JavaGateway, GatewayParameters
      # 'ADDRESS' and PORT are placeholders; PORT must be an integer
      params = GatewayParameters('ADDRESS', PORT, auto_convert=True)
      gateway = JavaGateway(gateway_parameters=params)
      import warpscript  # pip install warp10-jupyter
      stack = gateway.entry_point.newStack()
      # the inner quotes make 'hello world' a WarpScript string
      stack.exec("'hello world'")
      print(stack)  # prints: top: 'hello world'
  49. 49. c) Launch a local gateway from python
      %%bash
      wget https://dl.bintray.com/senx/generic/io/warp10/warp10/2.0.3/warp10-2.0.3.tar.gz
      tar xvzf warp10-2.0.3.tar.gz

      from py4j.java_gateway import JavaGateway, GatewayParameters, launch_gateway
      # launch_gateway starts a JVM and returns the port it listens on
      port = launch_gateway(classpath='warp10-2.0.3/bin/warp10-2.0.3.jar')
      # auto_convert=True lets Py4J pass the Python dict below as a Java Map
      gateway = JavaGateway(gateway_parameters=GatewayParameters(port=port, auto_convert=True))
      conf = {'warp.timeunits':'us'}
      entry_point = gateway.jvm.io.warp10.Py4JEntryPoint(conf)
  51. 51. Part III: Incentives to combine Python and WarpScript™
  58. 58. Rationales for combining WarpScript and Python
  63. 63. When working with data coming from a TSDB. Example: %%warpscript [ $gts [ 'key1' 'key2' ] reducer.mean ] REDUCE
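For readers unfamiliar with REDUCE, the one-liner above groups series that share the same values for the labels 'key1' and 'key2', then applies the reducer tick by tick within each group. A rough pure-Python analogue, assuming a hypothetical in-memory representation (a list of dicts with `labels` and `points`) that is not Warp 10's own format, could look like this:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical input: each series has labels and a {tick: value} mapping.
series = [
    {"labels": {"key1": "a", "key2": "x", "host": "h1"}, "points": {0: 1.0, 10: 3.0}},
    {"labels": {"key1": "a", "key2": "x", "host": "h2"}, "points": {0: 3.0, 10: 5.0}},
    {"labels": {"key1": "b", "key2": "x", "host": "h3"}, "points": {0: 10.0}},
]


def reduce_mean(series_list, label_names):
    """Group series by the given label values, then average tick by tick."""
    groups = defaultdict(lambda: defaultdict(list))
    for s in series_list:
        group_key = tuple(s["labels"].get(name) for name in label_names)
        for tick, value in s["points"].items():
            groups[group_key][tick].append(value)
    return {
        key: {tick: mean(vals) for tick, vals in sorted(ticks.items())}
        for key, ticks in groups.items()
    }


result = reduce_mean(series, ["key1", "key2"])
print(result)
```

The WarpScript version expresses the same grouping and aggregation in a single statement, which is the point the slide makes.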
  64. 64. When working with any Series data, do not reinvent the wheel...
  70. 70. … Use the WarpScript Library! Ex: many more missing values in data frames (due to unaligned ticks). Focus on your core business
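The remark about missing values can be illustrated with a minimal sketch. It assumes two hypothetical series whose ticks are offset by one unit, and uses plain tuples as a stand-in for a data frame:

```python
# Hypothetical series sampled at slightly different ticks.
series_a = {0: 1.0, 10: 2.0, 20: 3.0}
series_b = {1: 5.0, 11: 6.0, 21: 7.0}

# A data-frame-like table needs one row per tick found in any column.
all_ticks = sorted(set(series_a) | set(series_b))
table = [(t, series_a.get(t), series_b.get(t)) for t in all_ticks]

# Count the cells that have no value.
missing = sum(row[1:].count(None) for row in table)
for row in table:
    print(row)
print("missing cells:", missing)
```

Because no tick is shared, every row has one empty cell. Keeping each series with its own ticks, as the GTS model does, avoids this padding until an explicit alignment step is applied.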
  71. 71. Example: use a WarpScript function
      . . .
      stack = entry_point.newStack()
      stack.push(pickle.dumps(ticks))
      stack.push(pickle.dumps(values))

      %%warpscript --stack stack
      [ 'ticks' 'values' ] STORE
      $ticks PICKLE-> [] [] [] $values PICKLE-> MAKEGTS
      1 d 2 'piece' TIMESPLIT
      VALUES ->PICKLE

      result = pickle.loads(stack.pop())
      . . .
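To make the role of `1 d 2 'piece' TIMESPLIT` concrete, here is a rough pure-Python analogue: it cuts a series wherever no point occurs for a quiet period, and keeps only pieces with a minimum number of points. The function name `timesplit`, the time unit (seconds) and the exact boundary convention (a gap equal to the quiet period triggers a cut) are assumptions of this sketch, not a description of the WarpScript implementation.

```python
def timesplit(ticks, values, quiet, minimum):
    """Split a series wherever two consecutive ticks are at least `quiet` apart.

    Pieces with fewer than `minimum` points are dropped.
    """
    pieces, current = [], []
    previous = None
    for tick, value in sorted(zip(ticks, values)):
        if previous is not None and tick - previous >= quiet and current:
            pieces.append(current)
            current = []
        current.append((tick, value))
        previous = tick
    if current:
        pieces.append(current)
    return [p for p in pieces if len(p) >= minimum]


DAY = 24 * 3600  # seconds, a hypothetical time unit for this sketch
ticks = [0, 60, 120, 2 * DAY, 2 * DAY + 60, 5 * DAY]
values = [1.0, 1.1, 1.2, 2.0, 2.1, 9.9]
pieces = timesplit(ticks, values, quiet=DAY, minimum=2)
print([[v for _, v in piece] for piece in pieces])
```

In the slide, the WarpScript call replaces this whole function, and the 'piece' argument names the label that tags each resulting series.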
  75. 75. Example: draw graphical content from GTS, using the native Processing* support in WarpScript. * “Processing is a language for learning how to code within the context of the visual arts. Since 2001, Processing has promoted software literacy within the visual arts and visual literacy within technology. There are tens of thousands of students, artists, designers, researchers, and hobbyists who use Processing for learning and prototyping”
  78. 78. Some overlap with Pandas (but with differences): pandas .resample() corresponds to WarpScript BUCKETIZE; pandas .rolling() corresponds to WarpScript MAP
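The two correspondences can be sketched in plain Python to show what each operation does conceptually. Both helpers below (`bucketize_mean`, `sliding_mean`) are hypothetical names written for this illustration; in particular, this sketch labels each bucket by its start tick, which may differ from the convention WarpScript or pandas uses.

```python
from statistics import mean


def bucketize_mean(points, span):
    """Downsample (tick, value) pairs into fixed-width buckets of `span` ticks."""
    buckets = {}
    for tick, value in points:
        start = (tick // span) * span
        buckets.setdefault(start, []).append(value)
    return [(start, mean(vals)) for start, vals in sorted(buckets.items())]


def sliding_mean(points, before, after):
    """Apply a mean over a window of `before` and `after` neighbours per point."""
    out = []
    for i, (tick, _) in enumerate(points):
        window = points[max(0, i - before): i + after + 1]
        out.append((tick, mean(v for _, v in window)))
    return out


points = [(0, 1.0), (4, 3.0), (10, 5.0), (12, 7.0), (25, 9.0)]
print(bucketize_mean(points, span=10))
print(sliding_mean(points, before=1, after=0))
```

The first helper plays the role of resampling (one value per regular bucket), the second the role of a rolling window (one value per original point).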
  81. 81. Share your work
      Macro published at http://MY/MACRO/REPOSITORY/macros/hello.mc2:
      <% 'hello world' %>

      Local python script:
      . . .
      conf['warpfleet.macros.repos'] = 'http://MY/MACRO/REPOSITORY'
      entry_point = gateway.jvm.io.warp10.Py4JEntryPoint(conf)
      # create the stack from the entry point configured just above
      stack = entry_point.newStack()
      stack.exec('@macros/hello.mc2')
      print(stack)  # prints: top: 'hello world'
  84. 84. Distribute and Parallelize using PySpark
      # read the input and apply a WarpScript to each record while loading
      conf = {}
      conf['warpscript.inputformat.class'] = 'org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat'
      conf['warpscript.inputformat.script'] = '%warpscript.mc2'
      ...
      rdd = sc.newAPIHadoopRDD(inputFormatClass='io.warp10.spark.SparkWarpScriptInputFormat',
                               keyClass='org.apache.hadoop.io.Text',
                               valueClass='org.apache.hadoop.io.BytesWritable',
                               conf=conf)

      # call a WarpScript from Spark SQL through a registered UDF
      df = rdd.toDF()
      df.createOrReplaceTempView('NAMED')
      df = sqlContext.sql("SELECT func('%warpscript.mc2',_1) AS result FROM NAMED")
  85. 85. Conclusion
  90. 90. Takeaways
      ➢ Use a time series database and the GTS data model!
      ➢ The scope of Warp 10 is much broader, from storage to analytics
      ➢ Try it for free on https://sandbox.senx.io
      ➢ Py4J is an efficient bridge between the Python and Java worlds
      ➢ Combine WarpScript and Python
        - When working with TSDB data
        - Add WarpScript's extensive libraries of functions to Python
        - Leverage WarpScript extensibility and shareability
        - Write once, deploy a WarpScript at any scale
        - Use WarpScript to read and process records on-the-fly at any scale
  91. 91. Thank you!
  92. 92. Supplementary slides
  93. 93. Rationales for using Geo Time Series Some features ● Store raw data ● Inner relations: time (and optionally geo) ● Outer relations: group by classname, group by key/value Some benefits ● Chunkable / Parallelizable ● Easy manipulation ● Easier implementation of analytics
  94. 94. WarpScript has over 900 functions: String Function (32), Maths (74), Geo Time Series® (145), Stack (66), Composite Types (52), Processing (94), Platform (39), Logic (10), Time Related (26), Cryptographic (16), Logic Structure & Flow Control (21), Constants (9), Quaternions (8), Mappers (93), Reducers (37), Bucketizers (23), Operations (18), Filters (12), Conversions (24), Geo (19)
  95. 95. Mode and ecosystem interoperability
  96. 96. APIs. Interacting with the storage engine: Ingress, Fetch, Find, Meta, Delete, Stream update, Plasma, ... Interacting with the analytics engine: Egress, REPL, Py4J gateway, Mobius, ...
  97. 97. Interaction example. A client program (e.g. Python scripts) uses the API to push data to the storage engine through Ingress (GTS input format), fetch raw data through Fetch (GTS output format), and send a complex query to the analytics engine through Egress (JSON response)
  98. 98. How the analytics engine works. An incoming WarpScript (op 1, op 2, ...) arrives through Egress. The analytics engine creates a stack, executes op 1 (which can push data from the storage engine onto the stack), executes op 2, ..., serializes the result, and closes the stack. The JSON response looks like { “c”: “name” “l”: {k=v} “v”: [[...]] }
  99. 99. Advanced interaction using Py4J in Python. Python connects to the Py4J gateway of the analytics engine and creates a WarpScript stack. Each Python statement (exec op 1, exec op 2, ...) then runs on that stack, with variable conversions between the two sides
  100. 100. Benefits. Workflow: direct interaction with a WarpScript stack; can keep WarpScript data in memory between Python statements; storage engine is optional. Conversions: no need for JSON serialization / deserialization; can use Py4J automatic conversion; WarpScript supports Pickle
  101. 101. WarpScript basics
      args... FUNCTION : syntax
      1 1 + : example
      1 'a' STORE : assign value
      $a : use variable
      <% 'some operations' %> 'macro' STORE : define a macro (i.e. a custom function)
      args... @macro : evaluate macro
      args... @trusted/repo/macro : evaluate macro from trusted repository
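The postfix, stack-based evaluation behind these basics can be mimicked with a toy interpreter. The sketch below is not WarpScript: it is a hypothetical `run` function that supports only integers, single-word quoted strings, `+`, `STORE`, `$variable` and user-supplied functions, purely to show how arguments are pushed before the function that consumes them.

```python
def run(script, functions=None):
    """Evaluate a whitespace-separated postfix script on a stack (toy subset)."""
    functions = functions or {}
    stack, variables = [], {}
    for token in script.split():
        if token == "+":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif token == "STORE":
            # The variable name is on top, the value just below it.
            name, value = stack.pop(), stack.pop()
            variables[name] = value
        elif token.startswith("$"):
            stack.append(variables[token[1:]])
        elif token.startswith("'") and token.endswith("'"):
            stack.append(token[1:-1])
        elif token in functions:
            functions[token](stack)
        else:
            stack.append(int(token))
    return stack


print(run("1 1 +"))
print(run("1 'a' STORE $a $a +"))
print(run("3 DOUBLE", {"DOUBLE": lambda s: s.append(s.pop() * 2)}))
```

Each function reads its arguments from the stack and leaves its result there, which is what lets calls be chained one after another without intermediate variables.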
  103. 103. Example of WarpScript: [ args ] FETCH [ args ] BUCKETIZE [ args ] REDUCE
  104. 104. Shareability / Extensibility
      Easily share macros (no installation required)
      Configuration file: warpfleet.macros.repos = http://MY/MACRO/REPOSITORY
      WarpScript: @my/macro
      Retrieve and publish plugins, extensions, macros
      Command line: $ wf get --conf my/conf/file group artifact
  105. 105. Initializing PySpark
      A PySpark job needs to create a SparkSession. A SQLContext instance is also needed for UDF registration.
      from pyspark.sql import SparkSession
      from pyspark.sql import SQLContext
      spark = SparkSession.builder.appName("NAME YOUR PySpark JOB").getOrCreate()
      sc = spark.sparkContext
      sqlContext = SQLContext(sc)
  106. 106. Registering WarpScript related UDFs
      WarpScript related UDFs are registered using the SQLContext instance. The value of x is the number of arguments of the declared function, from 1 to 22. The type of the return value must be declared explicitly. This declaration can become complex: do not hesitate to return a STRING and use SNAPSHOT in the WarpScript code if that makes sense.
      from pyspark.sql.types import StringType
      sqlContext.registerJavaFunction("func", "io.warp10.spark.WarpScriptUDFx", StringType())
  107. 107. Use Case Examples: IT Monitoring
  108. 108. Use Case Examples: Aeronautics
  109. 109. Use Case Examples: Industrial equipment
