Complex event analytics solutions require a massive architecture and the know-how to build a fast real-time computing system. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google’s infrastructure. In this presentation we will see how BigQuery meets our ultimate goal: store everything, accessible by SQL immediately, at petabyte scale. We will discuss some common use cases: funnels, user retention, affiliate metrics.
Complex realtime event analytics using BigQuery @Crunch Warmup
1. Complex Realtime Event Analytics using BigQuery
Márton Kodok
Senior Software Engineer at REEA
twitter: martonkodok stackoverflow: pentium10 github: pentium10
Crunch Warm Up - October 2015 - Budapest
2. Agenda
1. Big Data movement
2. Analytics Project - Background
3. Challenges - Why is it so hard?
4. Approach - Strategy - Application
5. Use Cases - Implementations
6. Exploring Big Data (GDELT, Hackernews, Reddit)
Complex Realtime Event Analytics using BigQuery @martonkodok
3. Big data analyses movement
Every scientist who needs
big data analytics to save millions of lives
should have that power.
4. Challenging experience
The simple fact is that
you are brilliant
but your brilliant ideas require
complex big data analytics.
5. Project: One-size-fits-all problem
Need a backend to store, query, extract for deep analytics:
● Events (product, app, site, email events)
● Achievements (“tag” users on the go, retention)
● Entities (split tests, user profiles, business entities)
● Metrics (app profiler data, custom)
● Email activity (click-map, engagement, ISP, Spam)
● 3rd party Analytics (good to have: Google Analytics)
● Systems generated data (log file entries, unstructured)
6. Desired system/platform
● Terabyte scalable storage
● Real-time event ingestion
● Ask sophisticated queries (optional: without Dev)
● Query-performance
● Low-maintenance
● Cost effective
● Wire them up easily
Goal: Store everything accessible by SQL immediately.
7. Equipment strategy
● In-House
● Hosted
● Managed
* people still required
Services:
❏ ELK Stack (Elastic-Logstash-Kibana)...
❏ Cassandra, Hive, Hadoop...
❏ Amazon RedShift, Google BigQuery...
9. What is BigQuery?
● Analytics-as-a-Service - Data Warehouse in the Cloud
● Fully-Managed
● Scales into Petabytes
● Ridiculously fast
● Decent pricing (queries $5/TB, storage: $20/TB)
● 100,000 rows/sec Streaming API
* October 2015 pricing
10. BigQuery: Big Data Analytics in the Cloud
● Convenience of SQL
● Familiar DB Structure (table, column, views, JSON)
● Open Interfaces (REST, Web UI, ODBC)
● Fast atomic imports JSON/CSV (file size up to 5TB)
● Simple data ingest from GCS or Hadoop
● Web UI + bq CLI
● Connectors: Hadoop, Tableau, R, Talend, Logstash
● US or EU zone
11. BigQuery: Convenience of SQL/JSON/JS
● Append-only tables
● Batch load file size limits: 5TB (CSV or JSON)
● ACL - row level locking (individual or group based)
● Columnar storage (max 10,000 columns in table)
● Rich SQL: JSON,IP,Math,RegExp,Window functions
● Datatypes: String 2MB, Record, Nested …
● UDF (User defined functions): Javascript
Note: Store what you can in columns, the rest in JSON.
12. BigQuery Costs - October 2015
* 1 Petabyte storage, 100 TB rows insert, 100 TB queries => 26,000 USD
Queries:
➔ 1 TB per month free
➔ 5 USD per TB
➔ only pay for the columns you use in your query
Storage:
➔ 20 USD per TB
Ingestion:
➔ Batch load free (CSV/JSON)
➔ Exporting free
➔ Table copy free
➔ 1 USD per 20TB data
Estimate 1
- Storage 5 TB
- Streaming Inserts 5TB
- Queries 3 TB
Monthly total: 110 USD
Estimate 2
- Storage 20 TB
- Streaming Inserts 10TB
- Queries 10 TB
Monthly total: 455 USD
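The arithmetic behind such an estimate can be sketched in a few lines of Python. This is a minimal sketch, not an official calculator: it uses only the query and storage rates listed on the slide, assumes ingestion goes through free batch loads unless a charge is passed in, and the function name and the `ingest_usd` parameter are hypothetical.

```python
# Rates as listed on the slide (October 2015); treat them as inputs, not current prices.
QUERY_USD_PER_TB = 5.0
QUERY_FREE_TB_PER_MONTH = 1.0
STORAGE_USD_PER_TB = 20.0


def monthly_cost(storage_tb, query_tb, ingest_usd=0.0):
    """Estimate a monthly bill in USD.

    ingest_usd is a hypothetical parameter: batch loads are free, so it
    defaults to 0; pass any streaming-insert charge separately if it applies.
    """
    # The first TB of queries each month is free, so only the remainder is billed.
    billable_query_tb = max(0.0, query_tb - QUERY_FREE_TB_PER_MONTH)
    return (storage_tb * STORAGE_USD_PER_TB
            + billable_query_tb * QUERY_USD_PER_TB
            + ingest_usd)


# Estimate 1 from the slide: 5 TB stored, 3 TB queried.
estimate_1 = monthly_cost(storage_tb=5, query_tb=3)
print(f"Estimate 1: {estimate_1:.0f} USD")  # prints "Estimate 1: 110 USD"
```

Keeping the rates as named constants makes it easy to rerun the estimate when prices change.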
13. UDF - Power of Javascript
● for logic that is hard or impossible to express in SQL: loops, complex conditionals, string parsing or transformations
● UDFs are similar to map functions in MapReduce
● inline JS or from GCS (gs://some-bucket/js/lib.js)
Some UDF use cases:
● take one row and emit zero or more rows
● decoding URL-encoded strings
● text readability
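The "one row in, zero or more rows out" shape of a UDF can be illustrated with a short sketch. BigQuery UDFs are written in JavaScript; the Python below is only an analogy for the map-like behaviour and does not use the BigQuery UDF API. The row layout (`user_id`, `query`) and the function name are hypothetical.

```python
from urllib.parse import unquote


def explode_params(row):
    """Take one row and yield zero or more rows, one per decoded URL parameter.

    The row shape ({"user_id", "query"}) is a hypothetical example.
    """
    query = row.get("query") or ""
    for pair in query.split("&"):
        if "=" not in pair:
            continue  # skip empty or malformed pairs, so a row may emit nothing
        key, value = pair.split("=", 1)
        # Decode URL-encoded text such as %20 (space) or %2F (slash).
        yield {"user_id": row["user_id"], "key": unquote(key), "value": unquote(value)}


rows = [
    {"user_id": 1, "query": "utm_source=news%20letter&ref=a%2Fb"},
    {"user_id": 2, "query": ""},  # emits no rows
]
for row in rows:
    for out in explode_params(row):
        print(out)
```

The first input row produces two output rows and the second produces none, which is the behaviour a plain SQL expression cannot easily provide.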
14. Append only tables - Get last value
1. Use aggregation MIN/MAX on timestamp to find first/last and join back to the same table.
2. Use analytic functions FIRST_VALUE and LAST_VALUE.
SELECT LAST_VALUE(email) OVER(
  PARTITION BY user_id
  ORDER BY timestamp ASC
  ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS email_last ...
3. Use the ROW_NUMBER window function and keep the first row per user.
SELECT email, firstname, lastname
FROM
  (SELECT email, firstname, lastname,
     ROW_NUMBER() OVER (PARTITION BY user_id
                        ORDER BY timestamp DESC) seqnum
   FROM [profile_event]
  )
WHERE seqnum=1
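The idea common to all three approaches, keeping only the newest record per user in an append-only log, can be sketched outside SQL as well. This is a minimal in-memory illustration, not how BigQuery executes the query; the event fields and sample values are hypothetical.

```python
def latest_profiles(events):
    """Return the most recent event per user_id from an append-only list."""
    latest = {}
    for event in events:
        current = latest.get(event["user_id"])
        # Keep the event with the greatest timestamp seen so far for this user.
        if current is None or event["timestamp"] > current["timestamp"]:
            latest[event["user_id"]] = event
    return latest


events = [
    {"user_id": 1, "timestamp": 100, "email": "old@example.com"},
    {"user_id": 2, "timestamp": 110, "email": "b@example.com"},
    {"user_id": 1, "timestamp": 120, "email": "new@example.com"},
]
for user_id, event in sorted(latest_profiles(events).items()):
    print(user_id, event["email"])
```

Because rows are never updated in place, every profile change is a new event and the "current" profile is always derived at query time.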
15. Table wildcard functions
This example assumes the following tables exist:
● mydata.people20140323
● mydata.people20140324
● mydata.people20140325
SELECT
name
FROM
(TABLE_DATE_RANGE(mydata.people,
DATE_ADD(CURRENT_TIMESTAMP(), -2, 'DAY'),
CURRENT_TIMESTAMP()))
WHERE
age >= 35
#... another example with RegExp ...
FROM
(TABLE_QUERY(mydata,
'REGEXP_MATCH(table_id, r"^boo[d]{3,5}")'))
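To make the two wildcard styles concrete, the sketch below shows which table names a date range or a regular expression would select. It is only an illustration of the selection logic under the naming scheme from the example (prefix plus YYYYMMDD); the helper names are hypothetical and nothing here calls BigQuery.

```python
import re
from datetime import date, timedelta


def tables_in_date_range(prefix, start, end):
    """List table names <prefix>YYYYMMDD for every day from start to end inclusive."""
    days = (end - start).days
    return [f"{prefix}{(start + timedelta(days=i)):%Y%m%d}" for i in range(days + 1)]


def tables_matching(table_ids, pattern):
    """Keep table ids the pattern matches, similar to a REGEXP_MATCH filter on table_id."""
    return [t for t in table_ids if re.search(pattern, t)]


print(tables_in_date_range("mydata.people", date(2014, 3, 23), date(2014, 3, 25)))
# "boo" must be followed by three to five "d" characters to match.
print(tables_matching(["booddd", "boodd", "people20140323"], r"^boo[d]{3,5}"))
```

Sharding tables by day this way keeps each query limited to the date range it actually needs.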
20. Attribute orders to first article visited
Example:
● article1 -> page2 -> page3 -> page4 -> orderpage1 -> thankyoupage1
● page1 -> article2 -> page3 -> orderpage2 -> ...
Problem: When an order is made, attribute a credit to the first article visited by that user!
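The attribution rule can be sketched in plain Python to make the logic explicit. This is an in-memory illustration under stated assumptions, not the query from the talk: events are `(user_id, timestamp, page)` tuples, and pages are classified by the `article` and `orderpage` name prefixes used in the example above.

```python
def attribute_orders(events):
    """Credit each order to the first article its user visited before ordering.

    events: iterable of (user_id, timestamp, page) tuples. Classifying pages
    by the "article" / "orderpage" prefixes is an assumption for illustration.
    """
    first_article = {}
    credits = {}
    # Process each user's events in time order.
    for user_id, _, page in sorted(events, key=lambda e: (e[0], e[1])):
        if page.startswith("article"):
            first_article.setdefault(user_id, page)  # remember only the first one
        elif page.startswith("orderpage") and user_id in first_article:
            article = first_article[user_id]
            credits[article] = credits.get(article, 0) + 1
    return credits


events = [
    ("a", 1, "article1"), ("a", 2, "page2"), ("a", 3, "page3"),
    ("a", 4, "page4"), ("a", 5, "orderpage1"), ("a", 6, "thankyoupage1"),
    ("b", 1, "page1"), ("b", 2, "article2"), ("b", 3, "page3"),
    ("b", 4, "orderpage2"),
]
print(attribute_orders(events))
```

An order placed by a user who never visited an article earns no credit in this sketch.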
22. Email URL clicks map (79GB in 2.4sec)
23. Achievements Continued
● Funnel Analysis
● Email URL click heatmap
● Email Dashboard (Trends, SPAM, ISP deferral)
● Split tests (by content, region, device, during the day)
● Ability for advanced segmentation as all raw data is stored
● Behavioral analytics (engaged users, recommendations)
24. Our benefits
● no provisioning/deploy
● no running out of resources
● no more focus on large-scale execution plans
● no need to re-implement tricky concepts
(time windows / join streams)
● pay only for the columns we use in our queries
● run raw ad-hoc queries (either by analysts/sales or Devs)
● no more throwing away, expiring, or aggregating old data
25. BigQuery: Sample projects to try out
1. githubarchive.org: 20+ event types available since 2012
a. pull request latency
b. expressions, emotions in commit messages
2. httparchive.org: Trends in web technology
a. popular scripts
b. website performance
3. raw Google Analytics data (*only Premium Customers)
4. GDELT - Global Database of Events, Language, and Tone
GKG - Global Knowledge Graph
5. GSOD - samples of weather (rainfall, temp…)
6. 1.6 billion Reddit comments
7. Hackernews data
8. Wikipedia edits