TOPIC
You will have
KUSTO and
nothing else!
A new way to interpret the term "Data
Engineer" is emerging. Is a frictionless
approach to this new mantra possible?
Thanks for collaboration
Who I am
@RiccardoZamana (personal)
@ZamanaRiccardo (work)
zama202
https://www.linkedin.com/in/riccardozamana/
RICCARDO ZAMANA
Summary
1. Understand KUSTO Engine (ADX Pro and Cons, Query
Processing and Concurrency)
2. How to use ADX to Kusto-mize data pipeline (Trigger2fill
& Rewrite Patterns)
3. The real role of the Data Engineer (CD/CI for the Data
Engineer, Git-ize Kusto statements, External Data integration)
“No more sessions
starting from 2022
Why this session?
1. Understand KUSTO Engine
ADX Pro and Cons, Query Processing and Concurrency
Kusto driven Customer base
The problem is CONFIDENCE!
Data engineers today want SQL because that is all they know!
But… after some KUSTO TUTORING…:
Customers start with historical analysis and then move to more and more real-time analysis
as their teams get comfortable with the service.
• LOG ANALYSIS: use Kusto to analyze unified logs, i.e. logs from on-premises systems and
different clouds
• IOT TELEMETRY ANALYSIS: mine telemetry data to find anomalies in asset utilization
• SALES INSIGHTS: understand customer behaviours, predict trends or spikes, and optimize
the go-to-market strategy
Azure Data Explorer overview
1. Capability for many data types,
formats, and sources
Structured (numbers), semi-structured
(JSON, XML), and free text
2. Batch or streaming ingestion
Use managed ingestion pipeline or
queue a request for pull ingestion
3. Compute and storage isolation
• Independent scale out / scale in
• Persistent data in Azure Blob Storage
• Caching for low-latency on compute
4. Multiple options to support
data consumption
Use out-of-the box tools and connectors
or use APIs/SDKs for custom solution
Data Lake
/ Blob
IoT
Ingested Data
Engine
Data
Management
Azure Data Explorer
Azure Storage
Event Hub
IoT Hub
Customer Data
Lake
Kafka Sink
Logstash Plugin
Event Grid
Azure Portal
Power BI
ADX Web UI
ODBC / JDBC Apps
Apps (Via API)
Logstash Plugin
Apps (Via API)
Create,
Manage
Stream
Batch
Grafana
Query,
Control Commands
Azure OSS Applications
Active Data
Connections
The role of ADX
Raw data DWH
Refined data
Real time
derived data
Data
comparison
and fast kpi
ADX
THREE KEY USERS IN ONE TOOL:
• IoT Developer (data check, rule engine for insights)
• Data engineer (data exploration/enrichment/manipulation.. Like
‘’Smandruppation’’?)
• Data scientist (data selection and … what else?)
How ADX is Organized
INSTANCE DATABASE SOURCES
DB Users/Apps
Ingestion URL
Querying URL
Cache storage
Blob storage
EXTERNAL
SOURCES
EXTERNAL
DESTINATIONS
IotHUB
EventHub
Storage
ADLS
SQL Server
many more…
Cluster
|___database 1
|   |___table 1
|   |   |___extent data
|   |   |   |___column 0
|   |   |   |   |___data blocks
|   |   |   |   |___policy: authorization; data retention...
|   |   |   |___column 1
|   |   |___schema, ordered list of fields
|   |   |___policy objects: authorization; data retention...
|   |___table 2
|   |___policy objects: authorization; data retention...
|___database 2
Why Kusto Is Fast, in a Nutshell
WHY IS KUSTO SO FAST?
• distributed structure
• stores the data in columnar form
• node cluster
• designed for data that is read-only,
rarely deleted, and never updated
Compared with SQL Server, Kusto's high query speed does not come from magic: it is a
tradeoff in data processing, gaining some features by giving up others.
Remember the old (but good) pricing calculator… and
now?
How is it composed inside?
1) Admin Node
2) Query Head
3) Data Node
4) Gateway Node
The four elements of a Kusto Table
1. Table Metadata
2. Extent Directory
3. Extent
4. Column Index
Data extent & Kusto Index
Data Extent (aka Data Shard)
• A Kusto data extent is kind of like a "mini Kusto table"
• columnar data subdivided into segments
• a Kusto query only needs to parse the columns
listed in its project section
• A project section is a must
Kusto Index
Three kinds of indexes:
 String column index: inverted term index stored as a
B-tree. This index grants Kusto a powerful
text-processing capability (similar to
ElasticSearch); the "contains" operator is way
faster than "like" in T-SQL.
 Numeric column (including DateTime and
TimeSpan) index: range-based forward index.
 Dynamic column index: inverted term index as a
B-tree; during data ingestion, the engine
enumerates all elements within the dynamic value
and forwards them to the index builder.
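As a small illustration of the string index at work (the Logs table and its columns here are hypothetical), term operators such as has hit the inverted index directly, while contains does a broader substring match yet still outruns T-SQL's like:

```kusto
// Hypothetical Logs table, for illustration only.
Logs
| where Message has "timeout"   // whole-term lookup via the inverted index
| summarize Timeouts = count() by Level
```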
Data Shards (Extents) and Column Store
When you ingest a small amount of data into a table twice, you will see the following 2
extents after ingestion.
.show table StormEvents extents
After a while, these extents will be merged into a single extent.
Merge Policy
This merge policy (settings) can be seen by
running the following command.
.show database db01 policy merge
{
  "PolicyName": "ExtentsMergePolicy",
  "EntityName": "[db01]",
  "Policy": {
    "RowCountUpperBoundForMerge": 16000000,
    "OriginalSizeMBUpperBoundForMerge": 0,
    "MaxExtentsToMerge": 100,
    "LoopPeriod": "01:00:00",
    "MaxRangeInHours": 24,
    "AllowRebuild": true,
    "AllowMerge": true,
    "Lookback": {
      "Kind": "Default",
      "CustomPeriod": null
    }
  },
  "ChildEntities": [
    "StormEvents"
  ]
}
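These settings can also be changed. As an illustrative sketch (the values simply restate the defaults shown above), a subset of the merge policy can be overridden with an alter command:

```kusto
// Illustrative: override part of db01's extents-merge policy.
.alter database db01 policy merge @'{"MaxExtentsToMerge": 100, "LoopPeriod": "01:00:00"}'
```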
A journey of Data Ingestion
Imagine you have a CSV log
file in hand and want to load
it to Kusto.
1. The ingest command arrives at the ADMIN NODE.
2. The admin node finds an available Data node and forwards the command.
3. The extent is created, and the new info is sent to the admin node.
4. The admin node adds the shard reference to the metadata and commits a new snapshot to the db data.
Data deletion
1. What happens when a data shard is deleted?
2. What if I am querying the data just before the
deletion command is executed?
3. Can I recover the deleted data by reverting metadata to a
previous version?
The only exception is the “data purge” command.
Remember: “With no regrets”.
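These questions hint at the answer: deletion is a metadata operation on the extent directory. A hedged sketch (the table name and retention window are hypothetical):

```kusto
// Detach all extents of MyTable older than 30 days.
// Queries that already hold a snapshot of the extent
// directory keep seeing the old data until they finish.
.drop extents <| .show table MyTable extents | where CreatedOn < ago(30d)
```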
Query processing
When you submit a query written in the
Kusto Query Language (KQL), the query
analyzer parses it into an Abstract
Syntax Tree (AST) and builds an initial
Relational Operators tree (RelOp tree).
It then finally builds a query plan as
follows.
The generated plan is eventually
translated into the distributed query
plan, which is a shard-level access
tree.
Kusto query execution
Script A, put where before the aggregation
UsageDaily
| where DateKey > 20190101
| summarize DailyUsage_sum = sum(DailyUsage) by DateKey
| order by DateKey desc
| take 10
Script B, put where after the aggregation
UsageDaily
| summarize DailyUsage_sum = sum(DailyUsage) by DateKey
| where DateKey > 20190101
| order by DateKey desc
| take 10
Which script will return its result first? Will script A take less time than script B? The result is almost the same. How
can that be?! Let's go deep and find out why.
Kusto query execution
Abstract Syntax Tree(AST) and Relational Operators
Tree(RelOp Tree)
• Parse the incoming script into an Abstract Syntax Tree (AST), and
perform a semantic pass over the AST.
• Check names: see if the referenced tables, functions, and pre-defined
variables exist in the database and query context.
• Verify the user has the permissions to access the relevant
entities.
• Check data types and references, e.g. is an int function dealing
with a string?
• After the semantic pass, the query engine will build an initial
Relational Operators tree (RelOp tree) based on the AST.
• Next, the Kusto engine will further attempt to optimize the
query by applying one or more predefined rewriting rules.
PAY ATTENTION:
• Aggregation ops are split down to the "leaf" extents.
• Top-n operators are replicated to each data extent.
After optimization, both Script A and Script B will
share a common RelOp tree like this:
Join or summarize internal strategy
What does ADX do when we ask for a join or a summarize?
Broadcast join strategy :
[If one of join sides is
significantly smaller than the
other]
Shuffled join strategy :
[If both join sides are large, it will
apply same partitioning scheme for
both sides]
Other :
[Both join sides are not so large]
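You can also steer the strategy yourself with join hints; a sketch against hypothetical tables:

```kusto
// Small right side: replicate it to every node.
FactSales
| join hint.strategy=broadcast (DimCity) on CityId

// Both sides large: shuffle both on the join key.
// FactSales | join hint.strategy=shuffle (FactReturns) on SaleId
```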
Partition
By default (when partitioning policy is not assigned),
extents are partitioned by ingest-time based
partitioning.
When you change the partitioning policy
for an existing table, clear the data and re-
ingest it all under the new partitioning
policy.
.alter table SalesLogs policy partitioning ```{
"PartitionKeys": [
{
"ColumnName": "City",
"Kind":"Hash",
"Properties": {
"Function": "XxHash64",
"MaxPartitionCount":128,
"Seed": 1,
"PartitionAssignmentMode":"Default"
}
}
]
}```
By setting this custom policy, the extents in this table will be re-
partitioned by the hash of City. This will be run in the
background process after data ingestion.
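To verify that the policy was assigned (table name as in the example above):

```kusto
.show table SalesLogs policy partitioning
```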
Other topics for data sharing and distributions
Querying a materialized view is more performant
than querying the source table, where the
aggregation would be performed on every query.
The result of a materialized view is always up-to-
date.
After a while, the background process will process
the "delta" and merge it into the "materialized part".
MATERIALIZED VIEW
MV is made of two components:
• A materialized part - an Azure Data Explorer table
holding aggregated records from the source table,
which have already been processed. This table
always holds a single record per the aggregation's
group-by combination.
• A delta - the newly ingested records in the source
table that haven't yet been processed.
.show materialized-view MaterializedViewName
.show materialized-view MaterializedViewName failures
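A minimal sketch of creating such a view, assuming a hypothetical Telemetry source table:

```kusto
// One row per DeviceId, always the latest record:
// the "materialized part" plus the unprocessed "delta".
.create materialized-view LastTelemetry on table Telemetry
{
    Telemetry
    | summarize arg_max(Timestamp, *) by DeviceId
}
```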
Other topics for data sharing and distributions
In Data Explorer, you can also use the leader-follower
pattern to distribute query workloads across
multiple clusters.
When a follower database in a different cluster is
attached to the original, "leader",
database, the follower database synchronizes
changes from the leader database. With a read-only
follower database, you can view the data of the leader
database from a different cluster. (The followers must
be in the same region as the leader.)
You can use this pattern for scale-out purposes in
large systems.
You can also specify different SKUs and caching
policies on follower clusters. You can distribute the
read query workload across multiple clusters, especially
when heavy ingestion workloads occur in the leader
database.
LEADER AND FOLLOWER
Kusto Limitations
1) Limit on query concurrency
You can estimate the max concurrent
number by
[Cores per node] x 10
You can also view the actual number by
running this Kusto command if you have
permission to run it.
.show cluster policy querythrottling
2) Limit on the node memory
Your Kusto administrator may cap the maximum
memory usage; you can override it per query with a set option.
set max_memory_consumption_per_query_per_node=68719476736;
MyTable | ...
.show queries
| where StartedOn > ago(1d)
| extend MemoryPeak = tolong(ResourcesUtilization.MemoryPeak)
| project StartedOn, CommandType, ClientActivityId, TotalCpu,
MemoryPeak
| top 10 by MemoryPeak
Kusto Limitations
3) Limit on memory per iterator
Whenever there is a join or summarize, the
Kusto engine uses a pull iterator to fulfill the
request (the limit is set to 5 GB).
You can increase this value up to half of the
physical memory of the node.
set maxmemoryconsumptionperiterator=68719476736;
MyTable | ...
If your query hits this limitation, you may see an
error message “…exceeded memory budget…”.
4) Limit on result set size
You will hit this limitation when your query's result
exceeds 500,000 rows or 64 MB of data.
If your script hits this limitation, you will see an error
message containing "partial query failure".
To solve or avoid this limitation, you can:
• summarize the data to output only interesting
results
• use a take operator to see a small sample of the
result
• use the project operator to output only the columns
you need
What is MILLIBYTE?
BUT… if you insist on outputting the data and copying it
to Excel, you can use these commands to raise the limits:
set truncationmaxsize=1048576;
set truncationmaxrecords=100000;
MyTable | where User=="UserId1"
Kusto Limitations
5) Limit on query complexity
Usually, you won't hit this limitation unless your Kusto query is extremely complex: for example, you have 5,000
conditions in the where clause.
T
| where Column == "value1" or
Column == "value2" or
.... or
Column == "valueN"
Each query is transformed into a RelOp tree; if the tree depth exceeds the threshold, you hit the limitation. You
can rewrite the script logic to solve it.
T
| where Column in ("value1", "value2".... "valueN")
What ADX isn’t optimal for / stretch scenarios
Since we do not own the hardware the workloads are running on, we do not have to get married to one technology and
run everything on it to amortise the cost of said hardware / licence. We can use the best tool for the job.
Scenario: Data warehouse
Why: It isn't transactional, doesn't have log journals, etc. This is part of the reason it is so fast, but also part of the reason it is a poor fit for a data warehouse.
Azure PaaS alternatives: Azure Synapse & Power BI Premium

Scenario: Application back end
Why: ADX isn't built as a transactional workload.
Azure PaaS alternatives: Cosmos DB, Azure SQL DB, Azure PostgreSQL, Azure MySQL, Azure MariaDB

Scenario: Machine Learning (ML) training
Why: Even if ADX supports some built-in ML algorithms, it isn't an ML training platform.
Azure PaaS alternatives: Azure ML, Spark (Azure Databricks or Azure HDInsight), Azure Batch & Data Science Virtual Machine (DSVM)

Scenario: Sub-second streaming
Why: ADX can go as low as seconds of latency in ingesting data and still do analytics. Most "near real time" scenarios fall comfortably within that window.
Azure PaaS alternatives: Structured Streaming in Continuous Mode in Spark (Azure Databricks or Azure HDInsight), Kafka Streams on Azure HDInsight, Flink on Azure HDInsight
Some ADX/SDX consideration
The way they imagine our data-world
1. How many languages do I need?
SQL, PYTHON, KUSTO?
2. How many services are using
KUSTO?
3. How can you use Kusto to
manage /troubleshoot caveats
within Azure Solutions?
Key Differences with SYNAPSE DATA EXPLORER POOL
Category | Capability | Azure Data Explorer | Synapse Data Explorer | Winner
Security | VNET | Supports VNet injection and Azure Private Link | Azure Private Link support automatically integrated as part of the Synapse managed VNet | TIE
Security | CMK | ✓ | Automatically inherited from the Synapse workspace configuration | TIE
Security | Firewall | ✗ | Automatically inherited from the Synapse workspace configuration | TIE
Business continuity | Availability Zones | Optional | Enabled by default where Availability Zones are available | ADX
SKU | Compute options | 22+ Azure VM SKUs to choose from | Simplified to Synapse workload-type SKUs | ADX
Integrations | Built-in ingestion pipelines | Event Hub, Event Grid, IoT Hub | Event Hub, Event Grid, and IoT Hub supported via the Azure portal for non-managed VNet | ADX
Integrations | Spark integration | Azure Data Explorer linked service: built-in Kusto Spark integration with support for Azure Active Directory pass-through authentication, Synapse Workspace MSI, and Service Principal | Built-in Kusto Spark connector integration with support for Azure Active Directory pass-through authentication, Synapse Workspace MSI, and Service Principal | TIE
Integrations | KQL artifacts management | ✗ | Save KQL queries and integrate with Git | SYN?
Integrations | Metadata sync | ✗ | ✗ | TIE
Features | KQL queries | ✓ | ✓ | TIE
Features | API and SDKs | ✓ | ✓ | TIE
Features | Connectors | ✓ | ✓ | TIE
Features | Query tools | ✓ | ✓ | TIE
Pricing | Business model | Cost-plus billing model | VCore billing model with two meters: VCore and Storage | ADX
Delta Kusto - CI/CD for Azure Data Explorer (ADX)
WHAT IS DELTA KUSTO?
A command-line interface (CLI) enabling CI/CD automation with Kusto objects (e.g.
tables, functions, policies, security roles, etc.)
It can work on a single database, multiple databases, or an entire cluster. It also
supports multi-tenant scenarios.
• single-file executable available on both Windows and Linux
• accepts the path to a parameter YAML file instructing Delta
Kusto on what job to perform.
• A single call to Delta Kusto can run multiple jobs.
• enables change management on multi-tenant solutions
within Azure Data Explorer.
Delta Kusto - CI/CD for Azure Data Explorer (ADX)
HOW DOES DELTA KUSTO WORK?
Delta Kusto parses scripts and/or loads a database configuration
into a database model.
It can then compare two models to compute a delta.
This approach might seem overkill when considering functions,
for instance, where a simple create-or-alter can overwrite a function.
It does offer some advantages though:
• Computes a minimal set of delta commands, since it
doesn't need to create-or-alter everything just in case
• Detects drops (e.g. table columns) and can treat them as such
• Can do an offline delta, i.e. compare two scripts without any
Kusto runtime involved.
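A minimal sketch of a Delta Kusto parameter file, assuming one job whose current state is a live ADX database and whose target is a KQL script (the cluster URI, database, and file paths are hypothetical):

```yaml
# Illustrative parameter file: compute the delta needed to bring
# db01 in line with the scripted target state.
jobs:
  sync-db01:
    current:
      adx:
        clusterUri: https://mycluster.westeurope.kusto.windows.net
        database: db01
    target:
      scripts:
        - filePath: kql/target-state.kql
    action:
      filePath: delta.kql   # generated delta commands are written here
```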
GIT-IZE KUSTO Statements
REQUIREMENT
We hit issues where a developer would make a mistake directly
editing a function and mess up our production
assets.
SOLUTION
• Sync Kusto lets the user pick either the local file system or a
Kusto database as either the source or the target.
• The Compare button checks both schemas and determines
the delta between the source and the target.
• After viewing the differences, the user can put a checkmark
next to the ones they want to publish and then press the
Update button.
• Visualize the differences between the source and the
target before updating the target.
This tool is now available for everyone on GitHub: https://github.com/microsoft/synckusto.
How to use ADX to Kusto-
mize data pipeline
Trigger2fill & Rewrite Patterns
Trigger2Fill and ReWrite Pattern
You can:
 Send daily reports containing tables and charts.
 Set notifications based on query results.
 Schedule control commands on clusters.
 Export and import data between Azure Data Explorer and other databases.
[Diagram] Trigger2Fill pattern: new data stream, stream ingestion, RawTables, Logic App running Kusto queries, batch ingestion, Blob Storage.
[Diagram] ReWrite pattern: Refined Tables, continuous export, Blob Storage, batch ingestion.
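The ReWrite pattern's export leg can be sketched with a continuous-export definition (the external table, table names, and interval are hypothetical; continuous export requires the external table to exist first):

```kusto
// Illustrative: periodically export refined rows to an
// external table backed by Blob Storage.
.create-or-alter continuous-export RefinedExport
over (RefinedTable)
to table ExternalBlobTable
with (intervalBetweenRuns=1h)
<| RefinedTable
```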
DEMO
NO EXCUSES… NOW IT’S FREE!
• Microsoft account or an
Azure Active Directory
user
• No Azure subscription or
a credit card needed!
Setting | Suggested value | Description
Cluster display name | MyFreeCluster | The display name for your cluster. A unique cluster name is generated as part of the deployment, and the domain name [region].kusto.windows.net is appended to it.
Database name | MyDatabase | The name of the database to create. The name must be unique within the cluster.
Select location | Europe | The location where the cluster will be created.
FREE CLUSTER FEATURES
With FREE
Item Value
Storage (uncompressed) ~100 GB
Databases Up to 10
Tables per database Up to 100
Columns per table Up to 200
Materialized views per database Up to 5
Only with FULL
• External tables
• Continuous export
• Workload groups
• Purge
• Follower clusters
• Partitioning policy
• Streaming ingestion
• Python and R plugins
• Enterprise readiness (Customer managed keys, VNet,
disk encryption, managed identities)
• Autoscale
• Azure Monitor and Insights
• Event Hub and Event Grid connectors
The real role of the Data
Engineer
… and some fun @work
The real Role of Data Engineer
What is Data Engineering?
Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale.
What does a data engineer do?
Data engineers work in a variety of settings to build systems that collect, manage, and convert raw data into usable information for data scientists
and business analysts to interpret. Their goal is to make data accessible so that organizations can use it to evaluate and optimize their performance.
What are some common tasks of the Data Engineer?
• Acquire datasets that align with business needs
• Develop algorithms to transform data into useful, actionable information
• Build, test, and maintain database pipeline architectures
• Collaborate with management to understand company objectives
• Create new data validation methods and data analysis tools
• Ensure compliance with data governance and security policies
What's the difference between a data analyst and a data engineer?
Data scientists and data analysts analyze data sets to glean knowledge and insights; data engineers build the systems that collect and prepare that data.
Data Engineer career path – 1 of 4
Learn the fundamentals of cloud computing, coding skills, and database design as a starting
point for a career in data science.
 Coding: Proficiency in coding languages is essential to this role
 Relational and non-relational databases
 ETL (extract, transform, and load) systems
 Data storage: data lake or DWH?
 Automation and scripting: You should be able to write scripts to automate repetitive tasks.
 Machine learning: it can be helpful to have a grasp of the basic concepts to better
understand the needs of data scientists on your team.
 Big data tools: Data engineers are often tasked with managing big data (Hadoop,
MongoDB, and Kafka).
 Cloud computing. You’ll need to understand cloud storage and cloud computing as
companies increasingly trade physical servers for cloud services.
 Data security: many data engineers are still tasked with securely managing and storing data
to protect it from loss.
1. Develop your data engineering skills
Data Engineer career path – 2 of 4
2. Get certified.
A certification can validate your skills to potential employers
and preparing for a certification exam is an excellent way to
develop your skills and knowledge.
If you notice a particular certification is frequently listed as
required or recommended, that might be a good place to
start.
Data Engineer career path – 3 of 4
3. Build a portfolio of data engineering projects.
You can add data engineering projects you've completed independently or as part of
coursework to a portfolio website.
Alternatively, post your work to the Projects section of your LinkedIn profile or to a site
like GitHub.
Brush up on your big data skills with a portfolio-ready Guided Project that you can
complete in under two hours.
Data Engineer career path – 4 of 4
4. Start with an entry-level position.
Many data engineers start off in entry-level roles, such as business intelligence
analyst or database administrator.
As you gain experience, you can pick up new skills and qualify for more
advanced roles.
ADX ‘’WOW’’ PLUGINS – COSMOS DB CALLOUT
Enrich Telemetry with Cosmos DB
cosmosdb_sql_request plugin
Why does this plugin exist?
The cosmosdb_sql_request plugin sends a SQL query
to a Cosmos DB SQL network endpoint and returns the
results of the query. This plugin is primarily designed
for querying small datasets, for example, enriching
data with reference data stored in Azure Cosmos DB.
The plugin is invoked with the evaluate operator.
Syntax
evaluate cosmosdb_sql_request ( ConnectionString ,
SqlQuery [, SqlParameters [, Options]] )
Argument name | Description | Required/optional
ConnectionString | A string literal indicating the connection string that points to the Cosmos DB collection to query. It must include AccountEndpoint, Database, and Collection. It may include AccountKey if a master key is used for authentication. Example: 'AccountEndpoint=https://cosmosdbacc.documents.azure.com/;Database=MyDatabase;Collection=MyCollection;AccountKey=' h'R8PM...;' | Required
SqlQuery | A string literal indicating the query to execute. | Required
SqlParameters | A constant value of type dynamic that holds key-value pairs to pass as parameters along with the query. Parameter names must begin with @. | Optional
Options | A constant value of type dynamic that holds more advanced settings as key-value pairs. | Optional

Supported options:
armResourceId | Retrieve the API key from the Azure Resource Manager. Example: /subscriptions/a0cd6542-7eaf-43d2-bbdd-b678a869aad1/resourceGroups/cosmoddbresourcegrouput/providers/Microsoft.DocumentDb/databaseAccounts/cosmosdbacc
token | Provide the Azure AD access token used to authenticate with the Azure Resource Manager.
preferredLocations | Control which region the data is queried from. Example: ['East US']
IMPORTANT: Set the callout policy !!
[
{
"CalloutType": "CosmosDB",
"CalloutUriRegex":
"my_endpoint1.documents.azure.com",
"CanCall": true
},
{
"CalloutType": "CosmosDB",
"CalloutUriRegex":
"my_endpoint2.documents.azure.com",
"CanCall": true
}
]
.alter cluster policy callout @'[{"CalloutType": "cosmosdb",
"CalloutUriRegex": ".documents.azure.com", "CanCall":
true}]'
Example: Query Cosmos DB
The following example uses the cosmosdb_sql_request plugin to send a SQL query to
fetch data from Cosmos DB using its SQL API.
evaluate cosmosdb_sql_request(
'AccountEndpoint=https://cosmosdbacc.documents.azure.com/;Database=MyDatabase;Collection=MyCollection;AccountKey=' h'R8PM...;',
'SELECT * from c')
Example: Query Cosmos DB with parameters
The following example uses SQL query parameters and queries the data from an
alternate region. For more information, see preferredLocations.
evaluate cosmosdb_sql_request(
'AccountEndpoint=https://cosmosdbacc.documents.azure.com/;Database=MyDatabase;Collection=MyCollection;AccountKey=' h'R8PM...;',
"SELECT c.id, c.lastName, @param0 as Column0 FROM c WHERE c.dob >= '1970-01-01T00:00:00Z'",
dynamic({'@param0': datetime(2019-04-16 16:47:26.7423305)}),
dynamic({'preferredLocations': ['East US']}))
| where lastName == 'Smith'
ADX ‘’WOW’’ PLUGINS – HTTPS CALL
MAKE INFERENCES WITH HTTPS
PLUGIN
http_request plugin / http_request_post plugin
Why do these plugins exist?
The http_request (GET) and http_request_post
(POST) plugins send an HTTP request and convert
the response into a table, so you can retrieve an
external elaboration and merge it with your dataset.
Syntax
evaluate http_request ( Uri [, RequestHeaders [,
Options]] )
evaluate http_request_post ( Uri [, RequestHeaders
[, Options [, Content]]] )
Name | Type | Required | Description
Uri | string | ✓ | The destination URI for the HTTP or HTTPS request.
RequestHeaders | dynamic | | A property bag containing HTTP headers to send with the request.
Options | dynamic | | A property bag containing additional properties of the request.
Content | string | | The body content to send with the request. The content is encoded in UTF-8, and the media type for the Content-Type attribute is application/json.
WHY IS … SO DIFFICULT?
Returns
Both plugins return a table that has a single record with the following dynamic columns:
• ResponseHeaders: A property bag with the response header.
• ResponseBody: The response body parsed as a value of type dynamic.
Prerequisites
1. CALLOUT POLICY
2. USE HTTPS
Authentication
Argument Description
Uri The URI to authenticate with.
RequestHeaders Using the HTTP standard Authorization header or any custom header supported by the web service.
Options Using the HTTP standard Authorization header.
If you want to use Azure Active Directory (Azure AD) authentication, you must use an HTTPS URI for the request and set the following
values:
* azure_active_directory to Active Directory Integrated
* AadResourceId to the Azure AD ResourceId value of the target web service.
WARNING, WARNING, WARNING !!!!
SECRET INFORMATION MUST BE REALLY SECRET!!!
• Be extra careful not to send secret information, such as authentication tokens, over HTTP connections.
• If the query includes confidential information, make sure that the relevant parts of the query text are obfuscated so that
they'll be omitted from any tracing.
• Use obfuscated string literals !!!
HEADERS vs HEADACHE
The RequestHeaders argument can be used to add custom headers to the outgoing HTTP request. In addition to the standard
HTTP request headers and the user-provided custom headers, the plugin also adds the following custom headers:
Name Description
x-ms-client-request-id A correlation ID that identifies the request.
x-ms-readonly A flag indicating that the processor of this request shouldn't make any persistent changes.
READ <> READWRITE PERMISSION
The x-ms-readonly flag is set for every HTTP request sent by the plugin that was triggered by a query and not a control
command.
HTTPS PLUGIN: An Example
EXAMPLE NO.1
evaluate
http_request('http://services.groupkt.com/country/get/all')
| project CC=ResponseBody.RestResponse.result
| mv-expand CC limit 10000
| project
name = tostring(CC.name),
alpha2_code = tostring(CC.alpha2_code),
alpha3_code = tostring(CC.alpha3_code)
| where name startswith 'b'
EXAMPLE NO.2
let uri='https://example.com/node/js/on/eniac';
let headers=dynamic({'x-ms-correlation-vector':'abc.0.1.0'});
let options=dynamic({'Authentication':'Active Directory Integrated',
'AadResourceId':'https://eniac.to.the.max.example.com/'});
evaluate http_request_post(uri, headers, options)
Etc etc etc
evaluate http_request_post ( Uri [, RequestHeaders [, Options [, Content]]] )
RESULT
name alpha2_code alpha3_code
Bahamas BS BHS
Bahrain BH BHR
Bangladesh BD BGD
WHERE IS ADX.. IN THIS TYPICAL USE CASE?
“Let your data drive.
But.. Sir… Data driven or data informed?
Thanks
Questions?
zama202 @RiccardoZamana
@ZamanaRiccardo
https://www.linkedin.
com/in/riccardozama
na/

Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevAltinity Ltd
 
PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...
PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...
PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...Dave Stokes
 
MySQL 8.0 Featured for Developers
MySQL 8.0 Featured for DevelopersMySQL 8.0 Featured for Developers
MySQL 8.0 Featured for DevelopersDave Stokes
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksGrega Kespret
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Codemotion
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudAmazon Web Services
 
cPanel now supports MySQL 8.0 - My Top Seven Features
cPanel now supports MySQL 8.0 - My Top Seven FeaturescPanel now supports MySQL 8.0 - My Top Seven Features
cPanel now supports MySQL 8.0 - My Top Seven FeaturesDave Stokes
 
MongoDB What's new in 3.2 version
MongoDB What's new in 3.2 versionMongoDB What's new in 3.2 version
MongoDB What's new in 3.2 versionHéliot PERROQUIN
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Sudhir Mallem
 
Steps towards business intelligence
Steps towards business intelligenceSteps towards business intelligence
Steps towards business intelligenceAhsan Kabir
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptxAndrew Lamb
 
DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...
DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...
DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...DataStax
 
Data Warehousing AWS 12345
Data Warehousing AWS 12345Data Warehousing AWS 12345
Data Warehousing AWS 12345AkhilSinghal21
 
Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Rajesh Kumar
 
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...Karthik K Iyengar
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarinn5712036
 

Similaire à KUSTO and the New Role of Data Engineer (20)

Msbi Architecture
Msbi ArchitectureMsbi Architecture
Msbi Architecture
 
Cassandra data modelling best practices
Cassandra data modelling best practicesCassandra data modelling best practices
Cassandra data modelling best practices
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data Analytics
 
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander ZaitsevMigration to ClickHouse. Practical guide, by Alexander Zaitsev
Migration to ClickHouse. Practical guide, by Alexander Zaitsev
 
PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...
PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...
PHP UK 2020 Tutorial: MySQL Indexes, Histograms And other ways To Speed Up Yo...
 
MySQL 8.0 Featured for Developers
MySQL 8.0 Featured for DevelopersMySQL 8.0 Featured for Developers
MySQL 8.0 Featured for Developers
 
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and DatabricksSelf-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
Self-serve analytics journey at Celtra: Snowflake, Spark, and Databricks
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
cPanel now supports MySQL 8.0 - My Top Seven Features
cPanel now supports MySQL 8.0 - My Top Seven FeaturescPanel now supports MySQL 8.0 - My Top Seven Features
cPanel now supports MySQL 8.0 - My Top Seven Features
 
MongoDB What's new in 3.2 version
MongoDB What's new in 3.2 versionMongoDB What's new in 3.2 version
MongoDB What's new in 3.2 version
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 
Steps towards business intelligence
Steps towards business intelligenceSteps towards business intelligence
Steps towards business intelligence
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...
DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...
DataStax | DSE Search 5.0 and Beyond (Nick Panahi & Ariel Weisberg) | Cassand...
 
MCT Virtual Summit 2021
MCT Virtual Summit 2021MCT Virtual Summit 2021
MCT Virtual Summit 2021
 
Data Warehousing AWS 12345
Data Warehousing AWS 12345Data Warehousing AWS 12345
Data Warehousing AWS 12345
 
Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture
 
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
World2016_T1_S8_How to upgrade your cubes from 9.x to 10 and turn on optimize...
 
Presentation_BigData_NenaMarin
Presentation_BigData_NenaMarinPresentation_BigData_NenaMarin
Presentation_BigData_NenaMarin
 

Plus de Riccardo Zamana

Copilot Prompting Toolkit_All Resources.pdf
Copilot Prompting Toolkit_All Resources.pdfCopilot Prompting Toolkit_All Resources.pdf
Copilot Prompting Toolkit_All Resources.pdfRiccardo Zamana
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewRiccardo Zamana
 
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...Riccardo Zamana
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Riccardo Zamana
 
Azure Industrial Iot Edge
Azure Industrial Iot EdgeAzure Industrial Iot Edge
Azure Industrial Iot EdgeRiccardo Zamana
 
Time Series Analytics Azure ADX
Time Series Analytics Azure ADXTime Series Analytics Azure ADX
Time Series Analytics Azure ADXRiccardo Zamana
 
Azure satpn19 time series analytics with azure adx
Azure satpn19   time series analytics with azure adxAzure satpn19   time series analytics with azure adx
Azure satpn19 time series analytics with azure adxRiccardo Zamana
 
Industrial iot: dalle parole ai fatti
Industrial iot: dalle parole ai fatti Industrial iot: dalle parole ai fatti
Industrial iot: dalle parole ai fatti Riccardo Zamana
 
Azure dayroma java, il lato oscuro del cloud
Azure dayroma   java, il lato oscuro del cloudAzure dayroma   java, il lato oscuro del cloud
Azure dayroma java, il lato oscuro del cloudRiccardo Zamana
 
Industrial Iot - IotSaturday
Industrial Iot - IotSaturday Industrial Iot - IotSaturday
Industrial Iot - IotSaturday Riccardo Zamana
 

Plus de Riccardo Zamana (12)

Copilot Prompting Toolkit_All Resources.pdf
Copilot Prompting Toolkit_All Resources.pdfCopilot Prompting Toolkit_All Resources.pdf
Copilot Prompting Toolkit_All Resources.pdf
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overview
 
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
Time series Analytics - a deep dive into ADX Azure Data Explorer @Data Saturd...
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
 
Azure Industrial Iot Edge
Azure Industrial Iot EdgeAzure Industrial Iot Edge
Azure Industrial Iot Edge
 
Time Series Analytics Azure ADX
Time Series Analytics Azure ADXTime Series Analytics Azure ADX
Time Series Analytics Azure ADX
 
Azure satpn19 time series analytics with azure adx
Azure satpn19   time series analytics with azure adxAzure satpn19   time series analytics with azure adx
Azure satpn19 time series analytics with azure adx
 
Industrial iot: dalle parole ai fatti
Industrial iot: dalle parole ai fatti Industrial iot: dalle parole ai fatti
Industrial iot: dalle parole ai fatti
 
Azure dayroma java, il lato oscuro del cloud
Azure dayroma   java, il lato oscuro del cloudAzure dayroma   java, il lato oscuro del cloud
Azure dayroma java, il lato oscuro del cloud
 
Industrial Iot - IotSaturday
Industrial Iot - IotSaturday Industrial Iot - IotSaturday
Industrial Iot - IotSaturday
 
Azure reactive systems
Azure reactive systemsAzure reactive systems
Azure reactive systems
 
Industrial IoT on azure
Industrial IoT on azureIndustrial IoT on azure
Industrial IoT on azure
 

Dernier

NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 

Dernier (20)

NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 

KUSTO and the New Role of Data Engineer

  • 1. 1 TOPIC You will have KUSTO and nothing else! A new way to interpret the word "Data Engineer" is rising up. Is a frictionless approach to this new mantra possible?
  • 3. Who I am @RiccardoZamana (personal) @ZamanaRiccardo (work) zama202 https://www.linkedin.com/in/riccardozamana/ RICCARDO ZAMANA
  • 4. Summary 1. Understand KUSTO Engine (ADX Pro and Cons, Query Processing and Concurrency) 2. How to use ADX to Kusto-mize data pipeline (Trigger2fill & Rewrite Patterns) 3. The real role of the Data Engineer (CD/CI for the Data Engineer, Git-ize Kusto statements, External Data integration)
  • 5. “No more sessions starting from 2022.” So why is there this session?
  • 6. 1. Understand KUSTO Engine ADX Pro and Cons, Query Processing and Concurrency
  • 7. Kusto driven Customer base The problem is CONFIDENCE! Data Engineers today want SQL because that is all they know! But… after some KUSTO TUTORING… customers start with historical analysis and then move to more and more real-time analysis as their teams get comfortable with the service. • LOG ANALYSIS: use Kusto to analyze unified logs, i.e. logs from on-premises systems and different clouds • IOT TELEMETRY ANALYSIS: mine telemetry data to find anomalies in asset utilization • SALES INSIGHTS: understand customer behaviours, predict trends or spikes, and optimize go-to-market strategy
  • 8. Azure Data Explorer overview 1. Capability for many data types, formats, and sources Structured (numbers), semi-structured (JSON/XML), and free text 2. Batch or streaming ingestion Use the managed ingestion pipeline or queue a request for pull ingestion 3. Compute and storage isolation • Independent scale out / scale in • Persistent data in Azure Blob Storage • Caching for low latency on compute 4. Multiple options to support data consumption Use out-of-the-box tools and connectors or use APIs/SDKs for custom solutions Data Lake / Blob IoT Ingested Data Engine Data Management Azure Data Explorer Azure Storage Event Hub IoT Hub Customer Data Lake Kafka Sink Logstash Plugin Event Grid Azure Portal Power BI ADX Web UI ODBC / JDBC Apps Apps (Via API) Logstash Plugin Apps (Via API) Create, Manage Stream Batch Grafana Query, Control Commands Azure OSS Applications Active Data Connections
  • 9. 9 The role of ADX Raw data DWH Refined data Real-time derived data Data comparison and fast KPIs ADX THREE KEY USERS IN ONE TOOL: • IoT Developer (data check, rule engine for insights) • Data engineer (data exploration/enrichment/manipulation… like ‘’Smandruppation’’?) • Data scientist (data selection and … what else?)
  • 10. 10 How ADX is Organized INSTANCE DATABASE SOURCES DB Users/Apps Ingestion URL Querying URL Cache storage Blob storage EXTERNAL SOURCES EXTERNAL DESTINATIONS IotHUB EventHub Storage ADLS Sql Server MANY.. Cluster |___database 1 | |___table 1 | | |___extent data | | | |___column 0 | | | | |___data blocks | | | | |___policy:authorization;data retention... | | | |___column 1 | | |___schema,ordered list of fields | | |___policy objects:authorization;data retention... | |___table 2 | |___policy objects:authorization;data retention... |___database 2
  • 11. Why Kusto is Fast in a Nutshell WHY IS KUSTO SO FAST? • distributed structure • data stored in columnar form • node cluster • designed for data that is read-only, rarely deleted, and never updated. Compared with SQL Server, Kusto’s high query speed is not magic: it is a tradeoff in data processing, gaining some features by giving up others. Remember the old (but good) pricing calculator… and now?
  • 12. How is it composed inside? 1) Admin Node 2) Query Head 3) Data Node 4) Gateway Node
  • 13. The four elements of a Kusto Table 1. Table Metadata 2. Extent Directory 3. Extent 4. Column Index
  • 14. Data extent & Kusto Index Data Extent (aka Data Shard) • a Kusto data extent is somewhat like a ‘’mini Kusto table’’ • columnar data subdivided into segments • a query only needs to parse the columns referenced in its project section • a project section is a must Kusto Index Three kinds of indexes:  String column index: inverted term index stored as a B-tree. This index gives Kusto powerful text-processing capability (similar to Elasticsearch); the “contains” operator is way faster than “like” in T-SQL.  Numeric column (including DateTime and TimeSpan) index: range-based forward index.  Dynamic column index: inverted term index stored as a B-tree; during data ingestion, the engine enumerates all elements within the dynamic value and forwards them to the index builder.
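To make the string-index point concrete, here is a hedged sketch. It assumes the StormEvents sample table (with its EventNarrative column) already mentioned in these slides; adjust the names to your own schema:

```kusto
// 'has' matches whole terms, so it can fully exploit the inverted term index
StormEvents
| where EventNarrative has "tornado"
| count

// 'contains' matches arbitrary substrings; still index-assisted and far
// cheaper than a T-SQL LIKE '%...%' table scan, but slower than 'has'
StormEvents
| where EventNarrative contains "torn"
| count
```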
  • 15. Data Shards (Extents) and Column Store When you ingest some small data twice into a table, you will see the following 2 extents after ingestion. .show table StormEvents extents After a while, these extents will be merged into a single extent. Merge Policy The merge policy settings can be inspected by running the following command. .show database db01 policy merge "PolicyName": ExtentsMergePolicy, "EntityName": [db01], "Policy": { "RowCountUpperBoundForMerge": 16000000, "OriginalSizeMBUpperBoundForMerge": 0, "MaxExtentsToMerge": 100, "LoopPeriod": "01:00:00", "MaxRangeInHours": 24, "AllowRebuild": true, "AllowMerge": true, "Lookback": { "Kind": "Default", "CustomPeriod": null } }, "ChildEntities": [ "StormEvents" ],
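The merge policy can also be changed, not just shown. A minimal sketch, assuming the db01 database from the slide and reusing the default values above (the exact values you pick are workload-dependent):

```kusto
// override the database-level extents merge policy
.alter database db01 policy merge '{"RowCountUpperBoundForMerge": 16000000, "MaxExtentsToMerge": 100, "LoopPeriod": "01:00:00", "AllowRebuild": true, "AllowMerge": true}'
```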
  • 16. A journey of Data Ingestion Imagine you have a CSV log file in hand and want to load it into Kusto. 1. The ingest command arrives at the ADMIN NODE 2. The admin finds an available Data node and forwards the command 3. An extent is created, and the new info is sent to the admin 4. The admin adds the shard reference to metadata and commits a new snapshot to the db data
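The journey above can be triggered by hand with a one-shot pull-ingestion command. A sketch only: the table name reuses StormEvents from these slides, and the blob URL and SAS token are placeholders:

```kusto
// queue a pull ingestion of a CSV blob (URL and SAS token are hypothetical)
.ingest into table StormEvents (
    'https://mystorage.blob.core.windows.net/logs/file.csv;<SAS-token>'
) with (format = 'csv', ignoreFirstRecord = true)
```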
  • 17. Data deletion 1. What happens when a data shard is deleted? 2. What if I am querying the data being deleted just before the deletion command executes? 3. Can I recover the deleted data by reverting metadata to a previous version? The only exception is the “data purge” command. Remember: “With no regrets”.
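Because deletion is a metadata operation, dropping whole extents is cheap. A hedged example that ages out extents older than 30 days from the StormEvents sample table (the 30-day cutoff is illustrative):

```kusto
// preview which extents would be affected
.show table StormEvents extents
| where MaxCreatedOn < ago(30d)

// drop them: a metadata update; the underlying blobs are collected later
.drop extents <| .show table StormEvents extents | where MaxCreatedOn < ago(30d)
```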
  • 18. Query processing When you submit a query written in the Kusto Query Language (KQL), the query analyzer parses it into an Abstract Syntax Tree (AST) and builds an initial Relational Operators Tree (RelOp tree). It then finally builds a query plan as follows. The generated plan is eventually translated into the distributed query plan, which is a shard-level access tree.
  • 19. Kusto query execution Script A, put where before the aggregation (summarize): UsageDaily | where DateKey > 20190101 | summarize DailyUsage_sum = sum(DailyUsage) by DateKey | order by DateKey desc | take 10 Script B, put where after the aggregation: UsageDaily | summarize DailyUsage_sum = sum(DailyUsage) by DateKey | where DateKey > 20190101 | order by DateKey desc | take 10 Which script returns first? Will script A take less time than script B? The result is almost the same. How can that be?! Let’s go deep and find out why.
  • 20. Kusto query execution Abstract Syntax Tree (AST) and Relational Operators Tree (RelOp Tree) • Parse the incoming script into an Abstract Syntax Tree (AST) and perform a semantic pass over it. • Check names: do the referenced tables, functions, and pre-defined variables exist in the database and query context? • Verify the user has permission to access the relevant entities. • Check data types and references, e.g. is an int function being applied to a string? • After the semantic pass, the query engine builds an initial Relational Operators Tree (RelOp Tree) based on the AST. • Next, the Kusto engine further attempts to optimize the query by applying one or more predefined rewriting rules. PAY ATTENTION: • Aggregation ops are pushed down to the “leaf”. • Top-n operators are replicated to each data extent. After optimization, both Script A and Script B share a common RelOp tree like this:
• 21. Join or summarize internal strategy What does ADX do when we ask for a Join or a Summarize? Broadcast join strategy: [if one of the join sides is significantly smaller than the other]. Shuffled join strategy: [if both join sides are large, the same partitioning scheme is applied to both sides]. Other: [both join sides are not so large].
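When the optimizer’s automatic choice is not ideal, KQL lets you force a strategy with query hints. A minimal sketch, assuming hypothetical table and column names (FactSales, DimCity, etc.):

```kusto
// Force a broadcast join when the right side is known to be small
FactSales
| join hint.strategy=broadcast (DimCity) on City

// Force a shuffle join when both sides are large;
// the engine repartitions both sides on the join key
FactSales
| join hint.strategy=shuffle (FactReturns) on OrderId

// summarize can also be shuffled on a high-cardinality key
FactSales
| summarize hint.shufflekey=City Total=sum(Amount) by City
```

These hints exist in KQL; measure before and after, since forcing the wrong strategy can make the query slower.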
• 22. Partition By default (when no partitioning policy is assigned), extents are partitioned by ingest-time-based partitioning. When you change the partitioning policy of an existing table, please clear the data and re-ingest all data under the new partitioning policy. .alter table SalesLogs policy partitioning ```{ "PartitionKeys": [ { "ColumnName": "City", "Kind": "Hash", "Properties": { "Function": "XxHash64", "MaxPartitionCount": 128, "Seed": 1, "PartitionAssignmentMode": "Default" } } ] }``` By setting this custom policy, the extents in this table will be re-partitioned by the hash of City. This runs as a background process after data ingestion.
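To check which policy is currently in effect, or to revert to the default ingest-time partitioning, you can use the corresponding control commands (the table name is the hypothetical one from the slide):

```kusto
// Read back the partitioning policy of a table
.show table SalesLogs policy partitioning

// Remove the custom policy and fall back to the default
// ingest-time-based partitioning for new extents
.delete table SalesLogs policy partitioning
```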
• 23. Other topics for data sharing and distributions MATERIALIZED VIEW Querying a materialized view is more performant than querying the source table, where the aggregation would be performed on each query. The result of a materialized view is always up-to-date. After a while, the background process processes the “delta” and merges it into the “materialized part”. An MV is made of two components: • A materialized part — an Azure Data Explorer table holding aggregated records from the source table which have already been processed. This table always holds a single record per the aggregation’s group-by combination. • A delta — the newly ingested records in the source table that haven’t yet been processed. .show materialized-view MaterializedViewName .show materialized-view MaterializedViewName failures
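The two components above are created for you when you define the view. A minimal sketch, assuming a hypothetical Telemetry source table with a DeviceId key:

```kusto
// Define a materialized view that keeps the latest record per device.
// The engine maintains the materialized part in the background.
.create materialized-view DeviceLastKnown on table Telemetry
{
    Telemetry
    | summarize arg_max(Timestamp, *) by DeviceId
}

// Query it like a table; at query time Kusto combines the
// materialized part with the not-yet-processed delta
DeviceLastKnown
| where Timestamp > ago(1h)
```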
• 24. Other topics for data sharing and distributions LEADER AND FOLLOWER In Data Explorer, you can also use the leader-follower pattern to distribute query workloads across multiple clusters. When a follower database in a different cluster is attached to the original database, called the “leader” database, the follower database synchronizes changes from the leader. With a read-only follower database, you can view data of the leader database from a different cluster. (Followers must be in the same region as the leader.) You can use this pattern for scale-out purposes in large systems. You can also specify different SKUs and caching policies on follower clusters. You can distribute read query workloads across multiple clusters, especially when heavy ingestion workloads hit the leader database.
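The attachment itself is done through ARM (portal, CLI, or templates), but once it is in place you can inspect the followed database from the follower cluster with control commands. A sketch, assuming a hypothetical database name:

```kusto
// On the follower cluster: show the state of a followed database,
// including its caching-policy override if one is set
.show follower database MyDatabase
```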
• 25. Kusto Limitations 1) Limit on query concurrency You can estimate the max concurrent number as [Cores per node] x 10. You can also view the actual number by running this Kusto command, if you have permission to run it: .show cluster policy querythrottling 2) Limit on the node memory Your Kusto administrator may set the maximum memory usage; you can override it with an option: set max_memory_consumption_per_query_per_node=68719476736; MyTable | ... .show queries | where StartedOn > ago(1d) | extend MemoryPeak = tolong(ResourcesUtilization.MemoryPeak) | project StartedOn, CommandType, ClientActivityId, TotalCpu, MemoryPeak | top 10 by MemoryPeak
• 26. Kusto Limitations 3) Limit on memory per iterator Whenever there is a join or summarize, the Kusto engine uses a pull iterator to fulfill the request (the limit is set to 5 GB). You can increase this value up to half of the physical memory of the node: set maxmemoryconsumptionperiterator=68719476736; MyTable | ... If your query hits this limit, you may see an error message like “…exceeded memory budget…”. 4) Limit on result set size You will hit this limit when your query’s result exceeds 500,000 rows or 64 MB of data. If your script hits this limit, you will see an error message containing “partial query failure”. To solve or avoid this limit you can: • summarize the data to output only interesting results • use a take operator to see a small sample of the result • use the project operator to output only the columns you need. What is MILLIBYTE? BUT… if you insist that you want to output all the data and copy it to Excel, you can use these options to relax the limit: set truncationmaxsize=1048576; set truncationmaxrecords=100000; MyTable | where User=="UserId1"
• 27. Kusto Limitations 5) Limit on query complexity Usually, you won’t hit this limit unless your Kusto query is extremely complex, for example 5,000 conditions in the where clause: T | where Column == "value1" or Column == "value2" or .... or Column == "valueN" Each query is transformed into a RelOp tree; if the tree depth exceeds the threshold, you hit the limit. You can rewrite the script logic to solve it: T | where Column in ("value1", "value2", .... "valueN")
• 28. What ADX isn’t optimal for / stretch scenarios Since we do not own the hardware the workloads are running on, we do not have to get married to one technology and run everything on it to amortise the cost of said hardware / licence. We can use the best tool for the job.
• Data warehouse — Why: it isn’t transactional, doesn’t have log journals, etc. This is part of the reason it is so fast, but also part of the reason it is a poor fit for a data warehouse. — Azure PaaS alternatives: Azure Synapse & Power BI Premium.
• Application back end — Why: ADX isn’t built for transactional workloads. — Alternatives: Cosmos DB, Azure SQL DB, Azure PostgreSQL, Azure MySQL, Azure MariaDB.
• Machine Learning (ML) training — Why: even if ADX supports some built-in ML algorithms, it isn’t an ML training platform. — Alternatives: Azure ML, Spark (Azure Databricks or Azure HDInsight), Azure Batch & Data Science Virtual Machine (DSVM).
• Sub-second streaming — Why: ADX can go as low as seconds of latency in ingesting data and still do analytics; most “near real time” scenarios fall comfortably within that window. — Alternatives: Structured Streaming in Continuous Mode in Spark (Azure Databricks or Azure HDInsight), Kafka Streams on Azure HDInsight, Flink on Azure HDInsight.
• 31. The way they imagine our data-world 1. How many languages do I need: SQL, Python, Kusto? 2. How many services use Kusto? 3. How can you use Kusto to manage / troubleshoot caveats within Azure solutions?
• 32. Key Differences with SYNAPSE DATA EXPLORER POOL (Category / Capability — Azure Data Explorer vs Synapse Data Explorer):
• Security / VNET — ADX: supports VNet injection and Azure Private Link; Synapse: support for Azure Private Link automatically integrated as part of the Synapse managed VNET. [PARI]
• Security / CMK — ADX: ✓; Synapse: automatically inherited from the Synapse workspace configuration. [PARI]
• Security / Firewall — ADX: ✗; Synapse: automatically inherited from the Synapse workspace configuration. [PARI]
• Business Continuity / Availability Zones — ADX: optional; Synapse: enabled by default where Availability Zones are available. [ADX]
• SKU / Compute options — ADX: 22+ Azure VM SKUs to choose from; Synapse: simplified to Synapse workload-type SKUs. [ADX]
• Integrations / Built-in ingestion pipelines — ADX: Event Hub, Event Grid, IoT Hub; Synapse: Event Hub, Event Grid, and IoT Hub supported via the Azure portal for non-managed VNet. [ADX]
• Integrations / Spark integration — ADX: Azure Data Explorer linked service, built-in Kusto Spark integration with support for Azure Active Directory pass-through authentication, Synapse Workspace MSI, and Service Principal; Synapse: built-in Kusto Spark connector integration with the same authentication support. [PARI]
• Integrations / KQL artifacts management — ADX: ✗; Synapse: save KQL queries and integrate with Git. [SYN?]
• Integrations / Metadata sync — ADX: ✗; Synapse: ✗. [PARI]
• Features / KQL queries, API and SDKs, connectors, query tools — ✓ on both. [PARI]
• Pricing / Business model — ADX: cost-plus billing model; Synapse: vCore billing model with two meters, vCore and Storage. [ADX]
• 33. Delta Kusto - CI/CD for Azure Data Explorer (ADX) WHAT IS DELTA KUSTO? A command-line interface (CLI) enabling CI/CD automation with Kusto objects (e.g. tables, functions, policies, security roles, etc.). It can work on a single database, multiple databases, or an entire cluster. It also supports multi-tenant scenarios. • Single-file executable available on both Windows and Linux • Accepts the path to a parameter YAML file instructing Delta Kusto on what job to perform • A single call to Delta Kusto can run multiple jobs • Enables change management on multi-tenant solutions within Azure Data Explorer.
• 34. Delta Kusto - CI/CD for Azure Data Explorer (ADX) HOW DOES DELTA KUSTO WORK? Delta Kusto parses scripts and / or loads a database configuration into a database model. It can then compare two models to compute a delta. This approach might seem overkill when considering functions, for instance, where a simple create-or-alter can overwrite a function. It does offer some advantages though: • Computes a minimal set of delta commands, since it doesn’t need to create-or-alter everything just in case • Detects drops (e.g. table columns) and can treat them as such • Can do an offline delta, i.e. compare two scripts without any Kusto runtime involved.
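To make the YAML-driven workflow concrete, here is a hypothetical sketch of a parameter file comparing a live database (the “current” model) against scripts in source control (the “target” model) and writing the delta to a file. The exact key names and schema are an assumption — check the Delta Kusto documentation for the authoritative format; the cluster URI, database, and paths are placeholders:

```yaml
# Hypothetical Delta Kusto parameter file (schema keys are assumptions)
jobs:
  sync-dev-db:
    current:            # state as it exists in the cluster
      adx:
        clusterUri: https://mycluster.kusto.windows.net
        database: MyDatabase
    target:             # desired state, kept in Git
      scripts:
        - filePath: kql/schema.kql
    action:             # where to put the computed delta commands
      filePath: out/delta.kql
```

A CI pipeline can then review or apply the generated delta script as a deployment step.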
• 35. GIT-IZE KUSTO Statements REQUIREMENT We hit issues where a developer would make a mistake directly editing a function, and it would mess up our production assets. SOLUTION • Sync Kusto lets the user pick either the local file system or a Kusto database as either the source or the target. • The Compare button checks both schemas and determines the delta between the source and the target. • After viewing the differences, the user can put a checkmark next to the ones they want to publish and then press the Update button. • Visualize the differences between the source and the target before updating the target. This tool is now available for everyone on GitHub: https://github.com/microsoft/synckusto.
• 36. How to use ADX to Kusto-mize data pipelines Trigger2fill & Rewrite Patterns
• 37. Trigger2Fill and ReWrite Pattern You can:  Send daily reports containing tables and charts.  Set notifications based on query results.  Schedule control commands on clusters.  Export and import data between Azure Data Explorer and other databases. [Diagram: Trigger2Fill pattern — a new data stream is stream-ingested into raw tables, a Logic App runs Kusto queries, and the results fill refined tables. ReWrite pattern — continuous export of refined tables to Blob Storage, followed by batch ingestion.]
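The ReWrite half of the pattern rests on the continuous-export command. A minimal sketch, assuming a hypothetical refined table and an external table already defined over Blob Storage:

```kusto
// Continuously export "good" refined rows to an external table
// backed by Blob Storage; the export job runs on a schedule.
.create-or-alter continuous-export RefinedExport
over (RefinedTable)
to table ExternalBlobTable
with (intervalBetweenRuns=1h)
<| RefinedTable
   | where Quality == "good"
```

The exported blobs can then be batch-ingested elsewhere, closing the ReWrite loop shown in the diagram.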
  • 38. DEMO
• 39. NO EXCUSES… NOW IT’S FREE! • Microsoft account or an Azure Active Directory user • No Azure subscription or credit card needed!
Setting — Suggested value — Description:
• Cluster display name — MyFreeCluster — the display name for your cluster. A unique cluster name is generated as part of the deployment, and the domain name [region].kusto.windows.net is appended to it.
• Database name — MyDatabase — the name of the database to create. The name must be unique within the cluster.
• Select location — Europe — the location where the cluster will be created.
• 40. FREE CLUSTER FEATURES With FREE:
• Storage (uncompressed): ~100 GB
• Databases: up to 10
• Tables per database: up to 100
• Columns per table: up to 200
• Materialized views per database: up to 5
Only with FULL: • External tables • Continuous export • Workload groups • Purge • Follower clusters • Partitioning policy • Streaming ingestion • Python and R plugins • Enterprise readiness (customer-managed keys, VNet, disk encryption, managed identities) • Autoscale • Azure Monitor and Insights • Event Hub and Event Grid connectors
  • 41. The real role of the Data Engineer … and some fun @work
• 42. The real Role of Data Engineer What is Data Engineering? Data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale. What does a data engineer do? Data engineers work in a variety of settings to build systems that collect, manage, and convert raw data into usable information for data scientists and business analysts to interpret. Their goal is to make data accessible so that organizations can use it to evaluate and optimize their performance. What are some common tasks of the Data Engineer? • Acquire datasets that align with business needs • Develop algorithms to transform data into useful, actionable information • Build, test, and maintain database pipeline architectures • Collaborate with management to understand company objectives • Create new data validation methods and data analysis tools • Ensure compliance with data governance and security policies What’s the difference between a data analyst and a data engineer? Data scientists and data analysts analyze data sets to glean knowledge and insights; data engineers build and maintain the systems that make that data available to them.
• 43. Data Engineer career path – 1 of 4 Learn the fundamentals of cloud computing, coding skills, and database design as a starting point for a career in data engineering.  Coding: proficiency in coding languages is essential to this role  Relational and non-relational databases  ETL (extract, transform, and load) systems  Data storage: data lake or DWH?  Automation and scripting: you should be able to write scripts to automate repetitive tasks.  Machine learning: it can be helpful to have a grasp of the basic concepts to better understand the needs of the data scientists on your team.  Big data tools: data engineers are often tasked with managing big data (Hadoop, MongoDB, and Kafka).  Cloud computing: you’ll need to understand cloud storage and cloud computing as companies increasingly trade physical servers for cloud services.  Data security: many data engineers are still tasked with securely managing and storing data to protect it from loss. 1. Develop your data engineering skills
• 44. Data Engineer career path – 2 of 4 2. Get certified. A certification can validate your skills to potential employers, and preparing for a certification exam is an excellent way to develop your skills and knowledge. If you notice a particular certification is frequently listed as required or recommended, that might be a good place to start.
• 45. Data Engineer career path – 3 of 4 3. Build a portfolio of data engineering projects. You can add data engineering projects you've completed independently or as part of coursework to a portfolio website. Alternatively, post your work to the Projects section of your LinkedIn profile or to a site like GitHub. Brush up on your big data skills with a portfolio-ready Guided Project that you can complete in under two hours.
• 46. Data Engineer career path – 4 of 4 4. Start with an entry-level position. Many data engineers start off in entry-level roles, such as business intelligence analyst or database administrator. As you gain experience, you can pick up new skills and qualify for more advanced roles.
• 47. ADX ‘’WOW’’ PLUGINS – COSMOS DB CALLOUT Enrich telemetry with Cosmos DB: the cosmosdb_sql_request plugin. Why does this plugin exist? The cosmosdb_sql_request plugin sends a SQL query to a Cosmos DB SQL network endpoint and returns the results of the query. This plugin is primarily designed for querying small datasets, for example enriching data with reference data stored in Azure Cosmos DB. The plugin is invoked with the evaluate operator. Syntax: evaluate cosmosdb_sql_request ( ConnectionString , SqlQuery [, SqlParameters [, Options]] )
Arguments:
• ConnectionString (required) — a string literal with the connection string that points to the Cosmos DB collection to query. It must include AccountEndpoint, Database, and Collection. It may include AccountKey if a master key is used for authentication. Example: 'AccountEndpoint=https://cosmosdbacc.documents.azure.com/;Database=MyDatabase;Collection=MyCollection;AccountKey=' h'R8PM...;'
• SqlQuery (required) — a string literal with the query to execute.
• SqlParameters (optional) — a constant value of type dynamic that holds key-value pairs to pass as parameters along with the query. Parameter names must begin with @.
• Options (optional) — a constant value of type dynamic that holds more advanced settings as key-value pairs:
  • armResourceId — retrieve the API key from Azure Resource Manager. Example: /subscriptions/a0cd6542-7eaf-43d2-bbdd-b678a869aad1/resourceGroups/cosmoddbresourcegrouput/providers/Microsoft.DocumentDb/databaseAccounts/cosmosdbacc
  • token — provide the Azure AD access token used to authenticate with Azure Resource Manager.
  • preferredLocations — control which region the data is queried from. Example: ['East US']
• 48. IMPORTANT: Set the callout policy!! [ { "CalloutType": "CosmosDB", "CalloutUriRegex": "my_endpoint1.documents.azure.com", "CanCall": true }, { "CalloutType": "CosmosDB", "CalloutUriRegex": "my_endpoint2.documents.azure.com", "CanCall": true } ] .alter cluster policy callout @'[{"CalloutType": "cosmosdb", "CalloutUriRegex": ".documents.azure.com", "CanCall": true}]' Example: Query Cosmos DB The following example uses the cosmosdb_sql_request plugin to send a SQL query to fetch data from Cosmos DB using its SQL API. evaluate cosmosdb_sql_request( 'AccountEndpoint=https://cosmosdbacc.documents.azure.com/;Database=MyDatabase;Collection=MyCollection;AccountKey=' h'R8PM...;', 'SELECT * from c') Example: Query Cosmos DB with parameters The following example uses SQL query parameters and queries the data from an alternate region. For more information, see preferredLocations. evaluate cosmosdb_sql_request( 'AccountEndpoint=https://cosmosdbacc.documents.azure.com/;Database=MyDatabase;Collection=MyCollection;AccountKey=' h'R8PM...;', "SELECT c.id, c.lastName, @param0 as Column0 FROM c WHERE c.dob >= '1970-01-01T00:00:00Z'", dynamic({'@param0': datetime(2019-04-16 16:47:26.7423305)}), dynamic({'preferredLocations': ['East US']})) | where lastName == 'Smith'
• 49. ADX ‘’WOW’’ PLUGINS – HTTPS CALL MAKE INFERENCES WITH HTTPS PLUGIN: http_request plugin / http_request_post plugin. Why do these plugins exist? The http_request (GET) and http_request_post (POST) plugins send an HTTP request and convert the response into a table, so you can retrieve the result of an external computation and merge it with your dataset. Syntax: evaluate http_request ( Uri [, RequestHeaders [, Options]] ) evaluate http_request_post ( Uri [, RequestHeaders [, Options [, Content]]] )
Arguments:
• Uri (string, required) — the destination URI for the HTTP or HTTPS request.
• RequestHeaders (dynamic) — a property bag containing HTTP headers to send with the request.
• Options (dynamic) — a property bag containing additional properties of the request.
• Content (string) — the body content to send with the request. The content is encoded in UTF-8 and the media type for the Content-Type attribute is application/json.
• 50. WHY IS … SO DIFFICULT? Returns Both plugins return a table that has a single record with the following dynamic columns: • ResponseHeaders: a property bag with the response headers. • ResponseBody: the response body parsed as a value of type dynamic. Prerequisites 1. CALLOUT POLICY 2. USE HTTPS Authentication • Uri — the URI to authenticate with. • RequestHeaders — use the HTTP standard Authorization header or any custom header supported by the web service. • Options — use the HTTP standard Authorization header. If you want to use Azure Active Directory (Azure AD) authentication, you must use an HTTPS URI for the request and set the following values: * azure_active_directory to Active Directory Integrated * AadResourceId to the Azure AD ResourceId value of the target web service.
• 51. WARNING, WARNING, WARNING !!!! SECRET INFORMATION MUST BE REALLY SECRET!!! • Be extra careful not to send secret information, such as authentication tokens, over HTTP connections. • If the query includes confidential information, make sure that the relevant parts of the query text are obfuscated so that they'll be omitted from any tracing. • Use obfuscated string literals!!! HEADERS vs HEADACHE The RequestHeaders argument can be used to add custom headers to the outgoing HTTP request. In addition to the standard HTTP request headers and the user-provided custom headers, the plugin also adds the following custom headers: • x-ms-client-request-id — a correlation ID that identifies the request. • x-ms-readonly — a flag indicating that the processor of this request shouldn't make any persistent changes. READ <> READWRITE PERMISSION The x-ms-readonly flag is set for every HTTP request sent by the plugin that was triggered by a query and not a control command.
• 52. HTTPS PLUGIN: An Example EXAMPLE NO.1 evaluate http_request('http://services.groupkt.com/country/get/all') | project CC=ResponseBody.RestResponse.result | mv-expand CC limit 10000 | project name = tostring(CC.name), alpha2_code = tostring(CC.alpha2_code), alpha3_code = tostring(CC.alpha3_code) | where name startswith 'b' RESULT (name / alpha2_code / alpha3_code): Bahamas BS BHS • Bahrain BH BHR • Bangladesh BD BGD EXAMPLE NO.2 let uri='https://example.com/node/js/on/eniac'; let headers=dynamic({'x-ms-correlation-vector':'abc.0.1.0'}); let options=dynamic({'Authentication':'Active Directory Integrated', 'AadResourceId':'https://eniac.to.the.max.example.com/'}); evaluate http_request_post(uri, headers, options) Etc etc etc: evaluate http_request_post ( Uri [, RequestHeaders [, Options [, Content]]] ) WHERE IS ADX.. IN THIS TYPICAL USE CASE?
  • 53. “Let your data drive. But.. Sir… Data driven or data informed?