As part of the 2018 HPCC Systems Summit Community Day event:
The latest version of the platform contains improvements to functionality, usability and interoperability. This talk gives an overview of the changes and explains how you might find them useful.
Gavin Halliday's primary focus is on the code generator, which converts ECL into the queries which run on the platform. Gavin enjoys working on problems together with the development team and the varied nature of the work keeps him engaged. Gavin shares how the platform compares with competitive platforms, including scalability and coding simplicity. He enjoys working on the platform and the elegant solutions the development team is able to implement. Gavin encourages people to give it a try!
4. ECL Watch
Goals
• Highlight important information
• Make it easier to understand queries
• Improve support for very large queries
Examples:
• Gantt
• Graph Viewer
• Timings
• Log data visualizer
HPCC 7.0 4
9. Visualization Framework
• Version 2.0 now available
• https://github.com/hpcc-systems/Visualization
• Rebranded as hpcc-js in the node npm repository
• New documentation, demos and gallery
• Includes non-visualization items like the ESP comms layer
• Dashy beta
• Not tied to HPCC Systems
• Visualizer Bundle 1.1
10. ECL libraries
• ECL library extensions
• Date – timestamps, time zones, formatting
• Unicode – words, prefixes and suffixes
• Maths – infinity, fmod
• Bundles
• Data Patterns
• ML – Gradient boosted trees, boosted forests
• Visualizer
11. ESP improvements
• DESDL improvements
• Custom mappings
• Fully integrated into ESP
• Mixing DESDL and ESDL in one service
• Allow disconnection from Dali
• Support for persistent connections.
21. User Security
• Session management
• Avoid resending credentials
• Users can log out
• Allow sessions to lock and time out
• Minimize time passwords retained
22. System security
• Spark
• File access rights
• Dafilesrv authentication of requests
• The cloud
• Verifying components
• Encryption in transit
• ROXIE HTTPS support
26. Index improvements
• Example database containing 250M unique items with 1,000 updates each minute:
• Hourly: 60K rows (0.02% of total)
• Daily: 1.4M rows (0.6% of total)
• Weekly: 10M rows (4% of total)
• Monthly: 43M rows (17% of total)
• Historical: 520M rows (100% of total)
27. Index improvements
• Bloom filters
• Supports multiple filters per index
• User configurable probability
• Automatically created.
• Richard’s blog post hpccsystems.com/blog/bloom-filters
• Hash distributed keys.
• When distribution fields are filtered with equalities
• Easier to create co-distributed keys
• Lower overhead calculating the part containing a match
28. Finally
• WsSQL – now part of the core
• Over 1,000 pull requests since 6.4
29. Talk to us!
• Bloom filters - Richard Chapman
• DESDL - Yanrui Ma
• ELK - Rodrigo Pastrana
• Thor - Jake Cobbett-Smith
• Visualizations - Gordon Smith
• Security - Tony Fishbeck
• Spark - Rodrigo Pastrana
• Config Manager - Ken Rowland
Editor's notes
Good afternoon. In this presentation I am going to guide you through some of the main changes in the new version of the platform. If something catches your eye and you want to find out more, please come and chat afterwards in one of the breaks. Hopefully by the end you’ll all be dying to try it out for yourselves.
[20]
So, each major version of the platform is a chance for us to make significant changes to some of the foundations. The changes in 7.0 have enabled us to introduce various new features, but just as importantly they provide the scope for improvements in future releases.
Let’s take the first of these as an example. The file changes came about through a combination of different requirements:
First of all, we wanted to make it easier for ECL developers when file formats change. Previously, if the format of a file changed, you needed to update your own copy of the ECL definition before you could read it. It would be much better if you could continue to use the old definition until it was convenient for you to update your sources.
Secondly, it can be slow reading files and indexes between clusters, because the network capacity between them is often much smaller than within a cluster. If the data being transferred could be reduced by filtering and projecting remotely, it should progress much faster.
Thirdly, there was a need to improve integration with other platforms, particularly Spark.
So we revamped the file processing code to make it more flexible. As a bonus, in future versions it will make it easier to read other file formats, and even reduce the size of the generated C++ code.
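The remote filtering and projection idea can be illustrated with a tiny sketch (Python used purely for illustration; the record layout and helper name are invented, and this is not the platform's actual implementation):

```python
# Conceptual sketch: push the filter and the projection to the side
# holding the data, so only matching rows and needed fields cross the
# network. Records and field names are invented for illustration.

def remote_read(rows, predicate, fields):
    """Runs where the data lives: filter first, then project down to
    the requested fields, before anything is transferred."""
    for row in rows:
        if predicate(row):
            yield {f: row[f] for f in fields}

# The full remote dataset: three wide rows.
remote_rows = [
    {"id": 1, "state": "FL", "name": "Ann", "payload": "x" * 100},
    {"id": 2, "state": "GA", "name": "Bob", "payload": "x" * 100},
    {"id": 3, "state": "FL", "name": "Cat", "payload": "x" * 100},
]

# Only the FL rows, and only two of the four fields, are sent back.
transferred = list(remote_read(remote_rows,
                               lambda r: r["state"] == "FL",
                               ["id", "name"]))
print(transferred)  # [{'id': 1, 'name': 'Ann'}, {'id': 3, 'name': 'Cat'}]
```

The wide `payload` field and the non-matching row never leave the remote cluster, which is where the network saving comes from.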
I’ll return to some of the other items in this list later, but for the rest of this presentation I’m going to group the changes into four main areas.
[1:40]
The first area is changes that improve your day to day experience as a developer.
[10]
ECL Watch is something that all ECL developers spend quite a lot of time using – whether directly in a browser web page, or embedded within the ECL IDE.
We wanted to bring important information to your attention. For instance if something is wrong with your query or with the system it should be clearly presented to you, ideally on a dashboard, rather than needing to go and hunt for it.
We also wanted to give you better tools to understand your queries, to dig into the detail, for example where is the time going, and what was happening at a particular point in your query.
Let’s look at a few of the changes in more detail.
[50]
The workunit timings and graph pages have gained a Gantt chart at the top. It includes all the events in a workunit’s lifetime; tooltips provide extra details, and you can zoom in on any part of the chart.
Here are 3 different examples.
The first example comes from a system that is busy. It isn’t always obvious why your job took a long time to run. Was the compiler slow, was Thor busy, or was it just a slow job? Here you can quickly see that although the workunit took about 80 seconds to execute, almost one minute of that time was taken up waiting for a Thor to become available before the graph could run.
The second example is that same chart zoomed in to highlight the time taken compiling a query, with a tooltip highlighting details from one of the stages.
The final example is from a workunit with multiple workflow actions like persists, or independents. You can quickly see where the time has gone, and the order the graphs and subgraphs were executed in.
[1:00]
A new JavaScript graph viewer was introduced in 6.0, and in 7.0 it has been fully integrated into Gordon’s visualisation framework. As well as meaning it is available for anyone to use in their visualisations, it also allows other components of the visualisation framework to be easily included in the graph. For the moment Gordon has used that to add little tweaks like icons for the activity types, but I suspect he has many other ideas.
[30]
One problem with large queries is that the graphs can be unmanageable and take forever to display. One significant change is the graph viewer can now request a much smaller subset – for instance clicking on a subgraph in the timings list brings you to this view – which can be rendered much more quickly.
[20]
Our goal for improving the timings tab is simple enough – to make it easy to examine the performance of your query. Unfortunately it isn’t immediately obvious the best way to present all the information that is available, but hopefully the changes we have made will be a step in the right direction.
This example shows 4 different timings for a graph that reads from disk, sorts, and then writes to disk. The purple bars represent the total time within that activity, and the other coloured bars represent times for different tasks within the activity. It helps give a better idea of where the time is going and why. Again, this is another area I expect to change and improve in future versions. So please let us know what sorts of comparisons would be useful to you, and how you would like them displayed.
[50]
Many of these changes in eclwatch rely on the improvements to the visualisation framework, which I think is worth highlighting in its own right. If you are producing any visualisations – with or without HPCC – it would be well worth your time investigating it further.
For those who don’t know, the visualisation framework is a separate open source project, held in its own GitHub repository. It provides visualisations that can pull data from various sources, especially big data. It is designed to work well with all common JavaScript frameworks, and is published in the node npm repository, which makes it trivial to include in any project.
There are really two different components to the library – visualisations and communications. The visualisation side provides great functionality – like the Gantt charts and graph viewer that you saw earlier. But the framework really comes into its own when it is used in combination with HPCC. For instance, you can directly render the results of your Roxie query to a chart embedded on a web page. If you are including visualizations in your ECL queries, then go along to the breakout session that Gordon is hosting later, which will cover the new version of the Visualizer bundle in much more detail.
[1:20]
I am not going to delve into any detail on the changes within the ECL library. What I want to bring to your attention is that there are improvements in each of these areas. So whether you need to split Unicode strings into words, or process dates in different time zones, there may well be changes in 7.0 that make your job easier.
We have already heard details from Dan and Roger about some of the bundle changes, and more about the visualizer is coming up in the following breakout.
[30]
The ESP improvements really help those who are developing web services.
Dynamic ESDL has been around since 5.0, allowing service definitions to be deployed directly to ESP. But until now quite a few services could not take advantage of it, because the query received from ESP needed to be modified before being passed on to Roxie – and that modification required the use of custom C++.
In 7.0 a big improvement is the introduction of custom transforms. Along with the ESDL definition you can include a specification in an XML file that takes inputs like the request, security values, etc. and uses them to modify the query that gets sent to Roxie.
What it means to the web service developer is that custom C++ code can now be replaced with an XML definition. That is probably worthwhile in itself – reducing the scope for mistakes. Even better, it means the vast majority of services can now use DESDL and be deployed directly from the command line without having to compile C++. Perhaps most significantly, you avoid the need to bring ESP down, deploy the compiled mapping code, and then bring it up again every time a new service definition is required.
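To make the idea concrete, here is a conceptual sketch of replacing hand-written transformation code with a declarative specification. This is Python with invented field names and rule formats, not real ESDL transform syntax; the mapping list plays the role of the XML file that modifies a request before it is forwarded to Roxie.

```python
# Conceptual sketch: a declarative spec (standing in for the XML
# transform file) rewrites an incoming request instead of custom
# compiled code doing it. All names and rule shapes are invented.

def apply_transform(request, security, spec):
    """Apply declarative rules: rename fields, set constants, and
    inject values taken from the security context."""
    out = dict(request)
    for rule in spec:
        if rule["op"] == "rename":
            out[rule["to"]] = out.pop(rule["from"])
        elif rule["op"] == "set":
            out[rule["field"]] = rule["value"]
        elif rule["op"] == "inject_security":
            out[rule["field"]] = security[rule["key"]]
    return out

# The "transform file": data, not code, so no recompile or restart.
spec = [
    {"op": "rename", "from": "Zip", "to": "ZipCode"},
    {"op": "set", "field": "MaxResults", "value": 10},
    {"op": "inject_security", "field": "UserId", "key": "user"},
]

query = apply_transform({"Zip": "33487"}, {"user": "jsmith"}, spec)
print(query)  # {'ZipCode': '33487', 'MaxResults': 10, 'UserId': 'jsmith'}
```

Because the transformation is data rather than compiled code, changing it means redeploying a definition, not rebuilding and restarting the service.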
DESDL is now fully integrated into ESP – it is really more like an ESP v2. It is now just another way of configuring ESP services.
A few other improvements to ESP allow greater control when it is acting as a standalone web server. For instance, being able to connect to and disconnect from Dali means that operations controls when service definitions are updated, and can isolate ESP from other parts of the system.
[2:00]
Version 6 added support for embedded languages like Python or MySQL, but their use was a bit restricted. For example, there was no EMBED equivalent of an output statement that takes a stream of input records and is executed in parallel over all the nodes. The new activity attribute on an EMBED now allows you to achieve that.
Other changes in the compiler focus on improving working with a local repository. Some examples include speeding up local syntax checking and generating the archives that are sent to eclccserver, and providing support for auto completion in editors.
[40]
We don’t have the resources (or the skills) to solve every problem within the HPCC code base. Instead, Richard’s team concentrates on improving and extending our core functionality, while also providing you with the ability to integrate other open source projects into your solutions.
Allowing other languages to create activities is part of those improvements. What else have we done?
[30]
You have probably heard of it, but what is Spark? According to Wikipedia it is “An open source distributed general-purpose cluster-computing framework”. That sounds awfully like HPCC, so why would you want to use it?
They are similar, but HPCC and Spark have different strengths and development communities. For example, Spark is particularly strong in the machine learning community, and many researchers use it to develop new machine learning algorithms. If you want to apply that work to your data you will be much more successful running those algorithms on Spark, rather than trying to port them to HPCC.
Another reason to use Spark might be familiarity. If your data analysts are already using Spark, with a development environment they are familiar with, then they will want to continue using it. But if a group wants to use Spark, and all your data is on HPCC, you have a problem.
Well no longer. Version 7 allows Spark to read both files and indexes from HPCC. This allows you to use HPCC for the bulk of your data processing, and use Spark for the areas that particularly suit it. You can then export your results back to HPCC ready to be processed along with the rest of your data.
If you want to experiment, then to make life even easier there will also be an optional package which will install and configure a Spark cluster on the same nodes that are used to run HPCC.
Of course, in five years’ time there may well be a new trendy platform. If so, we will make sure that HPCC can also integrate with that platform, whatever it may be.
[1:45]
The log files generated by the system contain really useful information, but it can be a real pain in the neck to get at. Version 7 makes it easy to integrate an ELK stack with the system, including the ability to add Kibana dashboards into ECL Watch.
This integration is highly configurable, and can be useful for many different roles. For example operations can track system health, segfaults, and many other significant events. Developers can search log entries and identify problems.
Here, for example, is a dashboard that shows the summary status of a complete cluster.
[40]
This example on the other hand provides details about a single machine within the cluster.
[10]
And this dashboard item can track the number of transactions per minute going through ESP.
If you want to know more, there is a blog post to get you started that contains various recipes for extracting different pieces of information from the logs and then visualising them within ECL Watch.
[20]
A bit of a change of focus. What is VS Code and why do I care? Well, if you’re writing ECL on a Windows machine then the ECL IDE provides a good development environment. If you’re not, then what can you do? VS Code provides the cross-platform equivalent.
For those who haven’t heard of it, VS Code is a lightweight source code editor which is gaining widespread adoption. It is designed from the start to be highly customizable and extensible. It has numerous downloadable extensions for different languages, different source control systems, spell checkers, and much, much more.
Gordon has developed an ECL extension which allows you to use VS Code in a very similar way to the ECL IDE. It is fully functional, even including auto completion, and he is actively developing it. A few brave souls might even be tempted to swap from the ECL IDE to VS Code – especially if you are writing code in multiple languages, or particularly value its customizability.
[60]
Here is an example of what it looks like when you are editing ECL code. You can see a tree of attributes on the right, the syntax colouring in the editor, and the integration of compiler errors, just like the ECL IDE.
If you want to find out more then go to Arjuna’s breakout session later today.
[20]
Improving security is a continual task. It was improved in 6.0, and I’m sure it will be in the list of improvements for 8.0, and the foreseeable future. So what has changed?
[15]
Previously there were a couple of potential problems with the way that browsers connect to eclwatch. The scheme used for authenticating users meant the user name and password were sent with each request, and because the browser sends them automatically there wasn’t a natural way to log out or connect as a different user.
This has now changed: the username and password are authenticated once, and after that the connection continues using a session cookie. What practical difference will it make?
You now see a different dialog to request the username and password, and once logged in there are options in the top right corner to log out and lock your session, and sessions will lock automatically after a period of inactivity.
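The session scheme described above can be sketched roughly as follows. This is an illustrative toy in Python; the class, token format, and 600-second timeout are all assumptions, not the actual ESP implementation.

```python
# Toy sketch of session-cookie authentication with an inactivity
# timeout. Not the real ESP implementation: names, the token format,
# and the default 600-second timeout are invented for illustration.
import secrets
import time

class SessionManager:
    def __init__(self, timeout_seconds=600):
        self.timeout = timeout_seconds
        self.sessions = {}  # token -> (user, last_activity)

    def login(self, user, password):
        # Credentials are checked once (verification elided here);
        # afterwards only the session token travels with each request,
        # minimising how long the password needs to be retained.
        token = secrets.token_hex(16)
        self.sessions[token] = (user, time.monotonic())
        return token

    def check(self, token):
        entry = self.sessions.get(token)
        if entry is None:
            return None
        user, last = entry
        if time.monotonic() - last > self.timeout:
            del self.sessions[token]  # locked after inactivity
            return None
        self.sessions[token] = (user, time.monotonic())  # refresh
        return user

    def logout(self, token):
        self.sessions.pop(token, None)

mgr = SessionManager(timeout_seconds=600)
tok = mgr.login("alice", "secret")
print(mgr.check(tok))  # 'alice' - the password is never resent
mgr.logout(tok)
print(mgr.check(tok))  # None - logging out invalidates the session
```

The key point is that the credential check happens once at login; every later request carries only the revocable token, which is why log out, session locking, and timeouts become possible.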
[45]
Adding the capability for Spark to read Thor files is great, but it raises some security issues. There is no point verifying that ECL users have the rights to access files if Spark users can read any file they want. So, along with the Spark integration, work needed to be done to ensure that access rights are checked and enforced consistently.
And the move to host environments in the cloud also poses extra security challenges. Depending on your level of paranoia, you may want the system to:
• Verify that you are really talking to the server you think you are.
• Sign messages, to verify the source of a message is who they claim to be.
• Encrypt data in transit, to ensure that no one can read the data being sent between components.
Version 7 contains several changes to improve this situation – for instance, Roxie now supports HTTPS, which allows end-to-end encryption for Roxie queries in the cloud.
[55]
Finally of the four, performance is another long term goal that is always going to be on the improvements list. Here are a few areas that are worth highlighting:
[10]
Thor has historically been very good at performing standard joins, but not so good at keyed joins. Indeed, sometimes it has been quicker to perform a full join against an index than a keyed join.
To tackle this, Jake has completely reimplemented keyed joins in Thor. To give you some idea of the improvement, here is a graph of the timings from the performance suite. As you can see, it is fairly dramatic! There are more details in the Jira issue if you are interested. Obviously your mileage is going to vary, but I would be very surprised if you did not see a fairly dramatic improvement in your own examples.
[40]
Some of the extensions to the ML library have really stretched (and sometimes broken) the LOOP activity. As a result there are fixes to the code generator and improvements to Thor, particularly reducing the synchronization between the slave nodes.
The other entries on this slide are all examples of improvements to performance, which have come about in response to issues that have been reported. Hopefully they will benefit many users.
[35]
The final performance improvement involves indexes. Indexes are used by roxie queries to provide quick access to data. They are however read only and do not support incremental updates, and if they are large they can be slow to build. That causes a problem if the data you are storing is constantly being updated.
The common solution to this problem is to use a superindex. This is where a collection of indexes with the same structure is treated as a single index. Those sub-indexes are updated at different frequencies – for example, on this diagram hourly, daily, weekly, monthly, yearly. [I have also included some typical figures for numbers of rows.] This scheme retains the quick access to the data, but also allows quick updates, since the hourly index takes a fraction of the time to build because it is much smaller.
This approach does though have a disadvantage. Now, instead of searching a single index file for a match, the system has to search all five of the sub-indexes. And since only a small proportion of the records are changed each hour, most of those searches are not going to find any matches.
[1:20]
This is where Bloom filters help. They allow the system to quickly exclude indexes from consideration. That means that most of the time the five-index lookup will be reduced to two or three. If you want to understand how they work, and how you use them from ECL, then Richard has written a great blog post for you to read.
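For a feel of how a Bloom filter lets a lookup skip a sub-index entirely, here is a minimal sketch. The bit-array size, hash construction, and key names are illustrative choices, not what the platform actually uses.

```python
# Minimal Bloom filter sketch: a compact, probabilistic summary of
# which keys a sub-index contains. A "no" answer is definite, so the
# sub-index can be skipped; a "yes" may rarely be a false positive.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # the whole filter packed into one big int

    def _positions(self, key):
        # Derive several bit positions from the key (illustrative).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means probably present.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

# One filter per sub-index: only search the parts that might match.
hourly = BloomFilter()
for key in ["item42", "item99"]:
    hourly.add(key)

print(hourly.might_contain("item42"))     # True: this part must be searched
print(hourly.might_contain("item12345"))  # almost certainly False: skip it
```

Because a Bloom filter never produces a false negative, skipping a sub-index on a "no" answer is always safe; the user-configurable probability mentioned on the slide controls how often a wasted "yes" occurs.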
Hash distributed keys are linked because they will help you to build those incremental updates. They provide a simpler way to build distributed keys that are consistently distributed, and don’t develop problems with skew over time.
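The payoff of consistent hash distribution can be shown in a few lines. The hash function and part count here are arbitrary illustrations, not the platform's actual distribution scheme.

```python
# Sketch of why hash-distributed keys make equality lookups cheap:
# the part holding any match can be computed directly from the key,
# instead of probing every part. crc32 here is illustrative only.
import zlib

NUM_PARTS = 5

def part_for(key):
    # The same deterministic hash is used at build time and at query
    # time, so two keys built this way are co-distributed.
    return zlib.crc32(key.encode()) % NUM_PARTS

# Build: each record lands in the part chosen by its key's hash.
parts = [[] for _ in range(NUM_PARTS)]
for key in ["alpha", "beta", "gamma", "delta"]:
    parts[part_for(key)].append(key)

# Query with an equality filter: only one part needs to be read.
target = part_for("gamma")
print("gamma" in parts[target])  # True - found without scanning other parts
```

Because the part number is a pure function of the key, the distribution never drifts as updates arrive, which is what avoids the skew problems mentioned above.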
[35]
There have been a lot of bug fixes, improvements and new features. When I last looked there were more than 1,000 changes that were not part of the 6.x series.
[40]
So, while you are at the conference please make the most of your opportunity to talk to the developers. Come and ask us questions, give us feedback and suggest your crazy new ideas. If you want to know who to talk to, here are some suggestions to get you started.
[15]