As part of the 2018 HPCC Systems Summit Community Day event:
The latest version of the platform contains improvements to functionality, usability and interoperability. This talk gives an overview of the changes and explains how you might find them useful.
Gavin Halliday's primary focus is on the code generator, which converts ECL into the queries which run on the platform. Gavin enjoys working on problems together with the development team and the varied nature of the work keeps him engaged. Gavin shares how the platform compares with competitive platforms, including scalability and coding simplicity. He enjoys working on the platform and the elegant solutions the development team is able to implement. Gavin encourages people to give it a try!
4. ECL Watch
Goals
• Highlight important information
• Make it easier to understand queries
• Improve support for very large queries
Examples:
• Gantt
• Graph Viewer
• Timings
• Log data visualizer
HPCC 7.0 4
9. Visualization Framework
• Version 2.0 now available
• https://github.com/hpcc-systems/Visualization
• Rebranded as hpcc-js in the node npm repository
• New documentation, demos and gallery
• Includes non-visualization items like the ESP comms layer
• Dashy beta
• Not tied to HPCC Systems
• Visualizer Bundle 1.1
10. ECL libraries
• ECL library extensions
• Date – timestamps, time zones, formatting
• Unicode – words, prefixes and suffixes
• Maths – infinity, fmod
• Bundles
• Data Patterns
• ML – Gradient boosted trees, boosted forests
• Visualizer
11. ESP improvements
• DESDL improvements
• Custom mappings
• Fully integrated into ESP
• Mixing DESDL and ESDL in one service
• Allow disconnection from Dali
• Support for persistent connections.
21. User Security
• Session management
• Avoid resending credentials
• Users can log out
• Allow sessions to lock and time out
• Minimize time passwords retained
22. System security
• Spark
• File access rights
• Dafilesrv authentication of requests
• The cloud
• Verifying components
• Encryption in transit
• ROXIE HTTPS support
26. Index improvements
• Example database containing 250M unique items with 1,000 updates each minute:
• Hourly: 60K rows (0.02% of total)
• Daily: 1.4M rows (0.6% of total)
• Weekly: 10M rows (4% of total)
• Monthly: 43M rows (17% of total)
• Historical: 520M rows (100% of total)
27. Index improvements
• Bloom filters
• Supports multiple filters per index
• User configurable probability
• Automatically created.
• Richard’s blog post hpccsystems.com/blog/bloom-filters
• Hash distributed keys.
• When distribution fields are filtered with equalities
• Easier to create co-distributed keys
• Lower overhead calculating the part containing a match
28. Finally
• WsSQL – now part of the core
• Over 1,000 pull requests since 6.4
29. Talk to us!
• Bloom filters - Richard Chapman
• DESDL - Yanrui Ma
• ELK - Rodrigo Pastrana
• Thor - Jake Cobbett-Smith
• Visualizations - Gordon Smith
• Security - Tony Fishbeck
• Spark - Rodrigo Pastrana
• Config Manager - Ken Rowland
Editor's notes
Good afternoon. In this presentation I am going to guide you through some of the main changes in the new version of the platform. If something catches your eye and you want to find out more, please come and chat afterwards in one of the breaks. Hopefully by the end you’ll all be dying to try it out for yourselves.
[20]
So, each major version of the platform is a chance for us to make significant changes to some of the foundations. The changes in 7.0 have enabled us to introduce various new features, but just as importantly they provide the scope for improvements in future releases.
Let’s take the first of these as an example. The file changes came about through a combination of different requirements:
First of all, we wanted to make it easier for ECL developers when file formats change. Previously, if the format of a file changed, you needed to update your own copy of the ECL definition before you could read it. It would be much better if you could continue to use the old definition until it was convenient for you to update your sources.
Secondly, it can be slow reading files and indexes between clusters, because the network capacity between them is often much smaller than within a cluster. If the data being transferred could be reduced by filtering and projecting remotely, it should progress much faster.
Thirdly, there was a need to improve integration with other platforms, particularly Spark.
So we revamped the file processing code to make it more flexible. As a bonus, in future versions it will make it easier to read other file formats, and even reduce the size of the generated C++ code.
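The remote filtering and projection idea can be illustrated with a tiny sketch (Python used purely for illustration; the record layout and helper name are invented, and this is not the platform's actual implementation):

```python
# Conceptual sketch: push the filter and the projection to the side
# holding the data, so only matching rows and needed fields cross the
# network. Records and field names are invented for illustration.

def remote_read(rows, predicate, fields):
    """Runs where the data lives: filter first, then project down to
    the requested fields, before anything is transferred."""
    for row in rows:
        if predicate(row):
            yield {f: row[f] for f in fields}

# The full remote dataset: three wide rows.
remote_rows = [
    {"id": 1, "state": "FL", "name": "Ann", "payload": "x" * 100},
    {"id": 2, "state": "GA", "name": "Bob", "payload": "x" * 100},
    {"id": 3, "state": "FL", "name": "Cat", "payload": "x" * 100},
]

# Only the FL rows, and only two of the four fields, are sent back.
transferred = list(remote_read(remote_rows,
                               lambda r: r["state"] == "FL",
                               ["id", "name"]))
print(transferred)  # [{'id': 1, 'name': 'Ann'}, {'id': 3, 'name': 'Cat'}]
```

The wide `payload` field and the non-matching row never leave the remote cluster, which is where the network saving comes from.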
I’ll return to some of the other items in this list later, but for the rest of this presentation I’m going to group the changes into four main areas.
[1:40]
The first area is changes that improve your day to day experience as a developer.
[10]
ECL Watch is something that all ECL developers spend quite a lot of time using – whether directly in a browser web page, or embedded within the ECL IDE.
We wanted to bring important information to your attention. For instance if something is wrong with your query or with the system it should be clearly presented to you, ideally on a dashboard, rather than needing to go and hunt for it.
We also wanted to give you better tools to understand your queries, to dig into the detail, for example where is the time going, and what was happening at a particular point in your query.
Let’s look at a few of the changes in more detail.
[50]
The workunit timings and graph pages have gained a Gantt chart at the top. It includes all the events in a workunit’s lifetime; tooltips provide extra details, and you can zoom in on any part of the chart.
Here are 3 different examples.
The first example comes from a system that is busy. It isn’t always obvious why your job took a long time to run. Was the compiler slow, was Thor busy, or was it just a slow job? Here you can quickly see that although the workunit took about 80 seconds to execute, almost one minute of that time was taken up waiting for a Thor to become available before the graph could run.
The second example is that same chart zoomed in to highlight the time taken compiling a query, with a tooltip highlighting details from one of the stages.
The final example is from a workunit with multiple workflow actions like persists, or independents. You can quickly see where the time has gone, and the order the graphs and subgraphs were executed in.
[1:00]
A new JavaScript graph viewer was introduced in 6.0, and in 7.0 it has been fully integrated into Gordon’s visualisation framework. As well as meaning it is available for anyone to use in their visualisations, it also allows other components of the visualisation framework to be easily included in the graph. For the moment Gordon has used that to add little tweaks like icons for the activity types, but I suspect he has many other ideas.
[30]
One problem with large queries is that the graphs can be unmanageable and take forever to display. One significant change is the graph viewer can now request a much smaller subset – for instance clicking on a subgraph in the timings list brings you to this view – which can be rendered much more quickly.
[20]
Our goal for improving the timings tab is simple enough – to make it easy to examine the performance of your query. Unfortunately it isn’t immediately obvious the best way to present all the information that is available, but hopefully the changes we have made will be a step in the right direction.
This example shows 4 different timings for a graph that reads from disk, sorts, and then writes to disk. The purple bars represent the total time within that activity, and the other coloured bars represent times for different tasks within the activity. It helps give a better idea of where the time is going and why. Again, this is another area I expect to change and improve in future versions. So please let us know what sorts of comparisons would be useful to you, and how you would like them displayed.
[50]
Many of these changes in eclwatch rely on the improvements to the visualisation framework, which I think is worth highlighting in its own right. If you are producing any visualisations – with or without HPCC – it would be well worth your time investigating it further.
For those who don’t know, the visualisation framework is a separate open source project, held in its own GitHub repository. It provides visualisations that can pull data from various sources, especially big data. It is designed to work well with all common JavaScript frameworks, and is published in the node npm repository, which makes it trivial to include in any project.
There are really two different components to the library – visualisations and communications. The visualisation side provides great functionality – like the Gantt charts and graph viewer that you saw earlier. But the framework really comes into its own when it is used in combination with HPCC. For instance, you can directly render the results of your Roxie query to a chart embedded on a web page. If you are including visualizations in your ECL queries, then go along to the breakout session that Gordon is hosting later, which will cover the new version of the Visualizer bundle in much more detail.
[1:20]
I am not going to delve into any detail on the changes within the ECL library. What I want to bring to your attention is that there are improvements in each of these areas. So whether you need to split Unicode strings into words, or process dates in different time zones, there may well be changes in 7.0 that make your job easier.
We have already heard details from Dan and Roger about some of the bundle changes, and more about the visualizer is coming up in the following breakout.
[30]
The ESP improvements really help those who are developing web services.
Dynamic ESDL has been around since 5.0, allowing service definitions to be deployed directly to ESP. But until now quite a few services could not take advantage of it, because the query received from ESP needed to be modified before being passed on to Roxie – and that modification required the use of custom C++.
In 7.0 a big improvement is the introduction of custom transforms. Along with the ESDL definition you can include a specification in an XML file that takes inputs like the request, security values, etc. and uses them to modify the query that gets sent to Roxie.
What it means to the web service developer is that custom C++ code can now be replaced with an XML definition. That is probably worthwhile in itself – reducing the scope for mistakes. Even better, it means the vast majority of services can now use DESDL and be deployed directly from the command line without having to compile C++. Perhaps most significantly, you avoid the need to bring ESP down, deploy the compiled mapping code, and then bring it up again every time a new service definition is required.
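To make the idea concrete, here is a conceptual sketch of replacing hand-written transformation code with a declarative specification. This is Python with invented field names and rule formats, not real ESDL transform syntax; the mapping list plays the role of the XML file that modifies a request before it is forwarded to Roxie.

```python
# Conceptual sketch: a declarative spec (standing in for the XML
# transform file) rewrites an incoming request instead of custom
# compiled code doing it. All names and rule shapes are invented.

def apply_transform(request, security, spec):
    """Apply declarative rules: rename fields, set constants, and
    inject values taken from the security context."""
    out = dict(request)
    for rule in spec:
        if rule["op"] == "rename":
            out[rule["to"]] = out.pop(rule["from"])
        elif rule["op"] == "set":
            out[rule["field"]] = rule["value"]
        elif rule["op"] == "inject_security":
            out[rule["field"]] = security[rule["key"]]
    return out

# The "transform file": data, not code, so no recompile or restart.
spec = [
    {"op": "rename", "from": "Zip", "to": "ZipCode"},
    {"op": "set", "field": "MaxResults", "value": 10},
    {"op": "inject_security", "field": "UserId", "key": "user"},
]

query = apply_transform({"Zip": "33487"}, {"user": "jsmith"}, spec)
print(query)  # {'ZipCode': '33487', 'MaxResults': 10, 'UserId': 'jsmith'}
```

Because the transformation is data rather than compiled code, changing it means redeploying a definition, not rebuilding and restarting the service.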
DESDL is now fully integrated into ESP – it is really more like an ESP v2. It is now just another way of configuring ESP services.
A few other improvements to ESP allow greater control when it is acting as a standalone web server. For instance, being able to connect to and disconnect from Dali means that operations controls when service definitions are updated, and can isolate ESP from other parts of the system.
[2:00]
Version 6 added support for embedded languages like Python or MySQL, but their use was a bit restricted. For example, there was no EMBED equivalent of an output statement that takes a stream of input records and is executed in parallel over all the nodes. The new activity attribute on an EMBED now allows you to achieve that.
Other changes in the compiler focus on improving working with a local repository. Some examples include speeding up local syntax checking and generating the archives that are sent to eclccserver, and providing support for auto completion in editors.
[40]
We don’t have the resources (or the skills) to solve every problem within the HPCC code base. Instead, Richard’s team concentrates on improving and extending our core functionality, while also providing you with the ability to integrate other open source projects into your solutions.
Allowing other languages to create activities is part of those improvements. What else have we done?
[30]
You have probably heard of it, but what is Spark? According to Wikipedia it is “An open source distributed general-purpose cluster-computing framework”. That sounds awfully like HPCC, so why would you want to use it?
They are similar, but HPCC and Spark have different strengths and development communities. For example, Spark is particularly strong in the machine learning community, and many researchers use it to develop new machine learning algorithms. If you want to apply that work to your data you will be much more successful running those algorithms on Spark, rather than trying to port them to HPCC.
Another reason to use Spark might be familiarity. If your data analysts are already using Spark, with a development environment they are familiar with, then they will want to continue using it. But if a group wants to use Spark, and all your data is on HPCC, you have a problem.
Well no longer. Version 7 allows Spark to read both files and indexes from HPCC. This allows you to use HPCC for the bulk of your data processing, and use Spark for the areas that particularly suit it. You can then export your results back to HPCC ready to be processed along with the rest of your data.
If you want to experiment, then to make life even easier there will also be an optional package which will install and configure a Spark cluster on the same nodes that are used to run HPCC.
Of course, in five years’ time there may well be a new trendy platform. If so, we will make sure that HPCC can also integrate with that platform, whatever it may be.
[1:45]
The log files generated by the system contain really useful information, but it can be a real pain in the neck to get at. Version 7 makes it easy to integrate an ELK stack with the system, including the ability to add Kibana dashboards into ECL Watch.
This integration is highly configurable, and can be useful for many different roles. For example operations can track system health, segfaults, and many other significant events. Developers can search log entries and identify problems.
Here, for example, is a dashboard that shows the summary status of a complete cluster.
[40]
This example on the other hand provides details about a single machine within the cluster.
[10]
And this dashboard item can track the number of transactions per minute going through ESP.
If you want to know more, there is a blog post to get you started that contains various recipes for extracting different pieces of information from the logs and then visualising them within ECL Watch.
[20]
A bit of a change of focus. What is VS Code and why do I care? Well, if you’re writing ECL on a Windows machine then the ECL IDE provides a good development environment. If you’re not, then what can you do? VS Code provides the cross-platform equivalent.
For those who haven’t heard of it, VS Code is a lightweight source code editor which is gaining widespread adoption. It is designed from the start to be highly customizable and extensible. It has numerous downloadable extensions for different languages, different source control systems, spell checkers, and much, much more.
Gordon has developed an ECL extension which allows you to use VS Code in a very similar way to the ECL IDE. It is fully functional, even including auto completion, and he is actively developing it. A few brave souls might even be tempted to swap from the ECL IDE to VS Code – especially if you are writing code in multiple languages, or particularly value its customizability.
[60]
Here is an example of what it looks like when you are editing ECL code. You can see a tree of attributes on the right, the syntax colouring in the editor, and the integration of compiler errors, just like the ECL IDE.
If you want to find out more then go to Arjuna’s breakout session later today.
[20]
Improving security is a continual task. It was improved in 6.0, and I’m sure it will be in the list of improvements for 8.0, and the foreseeable future. So what has changed?
[15]
Previously there were a couple of potential problems with the way that browsers connect to eclwatch. The scheme used for authenticating users meant the user name and password were sent with each request, and because the browser sends them automatically there wasn’t a natural way to log out or connect as a different user.
This has now changed: the username and password are authenticated once, and after that the connection continues using a session cookie. What practical difference will it make?
You now see a different dialog to request the username and password, and once logged in there are options in the top right corner to log out and lock your session, and sessions will lock automatically after a period of inactivity.
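The session scheme described above can be sketched roughly as follows. This is an illustrative toy in Python; the class, token format, and 600-second timeout are all assumptions, not the actual ESP implementation.

```python
# Toy sketch of session-cookie authentication with an inactivity
# timeout. Not the real ESP implementation: names, the token format,
# and the default 600-second timeout are invented for illustration.
import secrets
import time

class SessionManager:
    def __init__(self, timeout_seconds=600):
        self.timeout = timeout_seconds
        self.sessions = {}  # token -> (user, last_activity)

    def login(self, user, password):
        # Credentials are checked once (verification elided here);
        # afterwards only the session token travels with each request,
        # minimising how long the password needs to be retained.
        token = secrets.token_hex(16)
        self.sessions[token] = (user, time.monotonic())
        return token

    def check(self, token):
        entry = self.sessions.get(token)
        if entry is None:
            return None
        user, last = entry
        if time.monotonic() - last > self.timeout:
            del self.sessions[token]  # locked after inactivity
            return None
        self.sessions[token] = (user, time.monotonic())  # refresh
        return user

    def logout(self, token):
        self.sessions.pop(token, None)

mgr = SessionManager(timeout_seconds=600)
tok = mgr.login("alice", "secret")
print(mgr.check(tok))  # 'alice' - the password is never resent
mgr.logout(tok)
print(mgr.check(tok))  # None - logging out invalidates the session
```

The key point is that the credential check happens once at login; every later request carries only the revocable token, which is why log out, session locking, and timeouts become possible.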
[45]
Adding the capability for Spark to read Thor files is great, but it raises some security issues. There is no point verifying that ECL users have the rights to access files if Spark users can read any file they want. So, along with the Spark integration, work needed to be done to ensure that access rights are checked and enforced consistently.
And the move to host environments in the cloud also poses extra security challenges. Depending on your level of paranoia, you may want the system to:
• Verify that you are really talking to the server you think you are.
• Sign messages, to verify the source of a message is who they claim to be.
• Encrypt data in transit, to ensure that no one can read the data being sent between components.
Version 7 contains several changes to improve this situation – for instance, Roxie now supports HTTPS, which allows end-to-end encryption for Roxie queries in the cloud.
[55]
Finally of the four, performance is another long term goal that is always going to be on the improvements list. Here are a few areas that are worth highlighting:
[10]
Thor has historically been very good at performing standard joins, but not so good at keyed joins. Indeed, sometimes it has been quicker to perform a full join against an index than a keyed join.
To tackle this, Jake has completely reimplemented keyed joins in Thor. To give you some idea of the improvement, here is a graph of the timings from the performance suite. As you can see, it is fairly dramatic! There are more details in the Jira issue if you are interested. Obviously your mileage is going to vary, but I would be very surprised if you did not see a fairly dramatic improvement in your own examples.
[40]
Some of the extensions to the ML library have really stretched (and sometimes broken) the LOOP activity. As a result there are fixes to the code generator and improvements to Thor, particularly reducing the synchronization between the slave nodes.
The other entries on this slide are all examples of improvements to performance, which have come about in response to issues that have been reported. Hopefully they will benefit many users.
[35]
The final performance improvement involves indexes. Indexes are used by roxie queries to provide quick access to data. They are however read only and do not support incremental updates, and if they are large they can be slow to build. That causes a problem if the data you are storing is constantly being updated.
The common solution to this problem is to use a superindex. This is where a collection of indexes with the same structure is treated as a single index. Those sub-indexes are updated at different frequencies – for example, on this diagram hourly, daily, weekly, monthly, yearly. [I have also included some typical figures for numbers of rows.] This scheme retains the quick access to the data, but also allows quick updates, since the hourly index takes a fraction of the time to build because it is much smaller.
This approach does though have a disadvantage. Now, instead of searching a single index file for a match, the system has to search all five of the sub-indexes. And since only a small proportion of the records are changed each hour, most of those searches are not going to find any matches.
[1:20]
This is where Bloom filters help. They allow the system to quickly exclude indexes from consideration. That means that most of the time the five-index lookup will be reduced to two or three. If you want to understand how they work, and how you use them from ECL, then Richard has written a great blog post for you to read.
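For a feel of how a Bloom filter lets a lookup skip a sub-index entirely, here is a minimal sketch. The bit-array size, hash construction, and key names are illustrative choices, not what the platform actually uses.

```python
# Minimal Bloom filter sketch: a compact, probabilistic summary of
# which keys a sub-index contains. A "no" answer is definite, so the
# sub-index can be skipped; a "yes" may rarely be a false positive.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # the whole filter packed into one big int

    def _positions(self, key):
        # Derive several bit positions from the key (illustrative).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means probably present.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

# One filter per sub-index: only search the parts that might match.
hourly = BloomFilter()
for key in ["item42", "item99"]:
    hourly.add(key)

print(hourly.might_contain("item42"))     # True: this part must be searched
print(hourly.might_contain("item12345"))  # almost certainly False: skip it
```

Because a Bloom filter never produces a false negative, skipping a sub-index on a "no" answer is always safe; the user-configurable probability mentioned on the slide controls how often a wasted "yes" occurs.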
Hash distributed keys are linked because they will help you to build those incremental updates. They provide a simpler way to build distributed keys that are consistently distributed, and don’t develop problems with skew over time.
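The payoff of consistent hash distribution can be shown in a few lines. The hash function and part count here are arbitrary illustrations, not the platform's actual distribution scheme.

```python
# Sketch of why hash-distributed keys make equality lookups cheap:
# the part holding any match can be computed directly from the key,
# instead of probing every part. crc32 here is illustrative only.
import zlib

NUM_PARTS = 5

def part_for(key):
    # The same deterministic hash is used at build time and at query
    # time, so two keys built this way are co-distributed.
    return zlib.crc32(key.encode()) % NUM_PARTS

# Build: each record lands in the part chosen by its key's hash.
parts = [[] for _ in range(NUM_PARTS)]
for key in ["alpha", "beta", "gamma", "delta"]:
    parts[part_for(key)].append(key)

# Query with an equality filter: only one part needs to be read.
target = part_for("gamma")
print("gamma" in parts[target])  # True - found without scanning other parts
```

Because the part number is a pure function of the key, the distribution never drifts as updates arrive, which is what avoids the skew problems mentioned above.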
[35]
There have been a lot of bug fixes, improvements and new features. When I last looked there were more than 1,000 changes that were not part of the 6.x series.
[40]
So, while you are at the conference please make the most of your opportunity to talk to the developers. Come and ask us questions, give us feedback and suggest your crazy new ideas. If you want to know who to talk to, here are some suggestions to get you started.
[15]