2. About Yahoo! JAPAN
The Largest Portal Site in Japan
65 billion pageviews / month
2.1 billion pageviews / day
3. YDN Report
What is YDN Report?
• Report for Yahoo! Display Ad Network (YDN)
Batch Reporting over Massive Dataset
• 13 months, 800B+ rows of data
• Adding 3.3B+ rows of data per day
Highly Parallel Workload
• 100K reports per hour
4. YDN Report Query
Typical Query
• Query is Relatively Simple
• Answer “How many clicks did I get last week?”
SELECT account, yyyymmdd,
sum(total_imps),
sum(total_click),
...
FROM table_x
WHERE yyyymmdd >= xxx
AND yyyymmdd < xxx
AND account = xxx
...
GROUP BY account, yyyymmdd, ...;
6. Hive Performance Recap
Hive is fast: interactive response
• ORC columnar file format
• Cost based optimizer (CBO)
• Vectorized SQL engine
• Tez execution engine (replacing MapReduce)
Hive 0.10 (batch processing) -> Hive 1.2 (human interactive, 5 seconds): 100-150x query speedup
7. Hive on Tez Query Execution
A query execution essentially is put together from
• Client execution [ 0s if done correctly ]
• Optimization [HiveServer2] [~ 0.1s]
• Metadata lookups [HCatalog, Metastore] [ very fast in Hive 0.14 ]
• Application Master creation [4-5s]
• Container Allocation [3-5s]
• Tez task execution on YARN
[Diagram: test architecture. A client running the testing tool opens N connections to each of HiveServer2 Server #1 and Server #2. Other components shown: Metastore, Metastore DB, Tez AM and Tez containers, all on YARN and HDFS.]
8. Mini Test
Mini Setup Tested
• 50 nodes
• 450B rows dataset
• Achieved 15K queries per hour
So, can we get 100K qph on 700 nodes?
We thought it should be easy, but…
9. The Bottlenecks at Scale
Challenges at Scale
• Hive Metastore Server
• YARN Resource Manager
• Datanode Hotspot
• YARN ATS
10. Hive Metastore Server
Use Local Metastore
• Before: HS2 -> Metastore Server -> Metastore DB
• After: HS2 (local metastore) -> Metastore DB
46. Web UI (HIVE-11526)
• LLAP daemon exposes basic metrics on port 15002 (default)
• Included in Hive 2.1
• Contributed by Yahoo! JAPAN
47. Web UI (HIVE-14030)
• HIVE-11526 is just for each daemon
• HIVE-14030 provides an aggregated view of an LLAP cluster (not yet in master)
• Contributed by Yahoo! JAPAN
50. Direct Access to HDFS breaks everything
[Diagram: HS2 and LLAP on top of YARN and HDFS, with Storage Based Authorization at the HDFS layer. M/R, Pig and Spark break SQL Standard Based ACLs.]
• But accessing HDFS directly (not through Hive) breaks the security model.
• Other solutions (not only Hive) are necessary
Thank you for coming today.
My name is Yohei Abe, from Yahoo! JAPAN.
This presentation is given by two people, not only me.
The talk consists of two parts,
and both parts are related to Yahoo! JAPAN.
The first part is from Mr. Jiang, about the Hive on Tez use case at Yahoo! JAPAN.
Mr. Jiang ….
The last part is from me, about the LLAP use case for the same dataset and queries.
First, allow me to introduce myself.
I’m an engineer at Yahoo! JAPAN, working on Hadoop infrastructure systems and supporting Hive and Hadoop.
Yahoo! JAPAN is the largest portal site in Japan, providing many services such as weather, auctions, news and so on.
Our site reaches 81% of all Japanese internet users,
so it provides advertising space for advertisers.
We offer a variety of advertising solutions.
YDN, the Yahoo! Display Ad Network, is one of these solutions.
It uses Hive to generate the YDN report, which shows how many impressions were served, clicked and viewed over a certain period of time.
It contains useful information for advertisers.
The point is that the data is massive.
The source table from which the report is generated has 800 billion rows covering a 13-month period.
The report generation job is a parallel batch workload, not interactive querying.
We need to generate 100,000 reports per hour; this is our business and customer requirement.
From a single client machine, we run 60K queries and calculate queries per hour (qph) from the result.
Throughput (qph) = 60,000 queries * (60 / minutes taken to process 60K queries)
For each cluster configuration change, we make several attempts with different concurrencies: 32, 64, 96, 128, 256 and higher.
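The throughput formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic; the function name and the 36-minute run time are made up for the example and are not measurements from the talk.

```python
def queries_per_hour(num_queries: int, minutes_taken: float) -> float:
    """Convert a benchmark run into queries per hour (qph).

    Implements: throughput = num_queries * (60 / minutes_taken)
    """
    if minutes_taken <= 0:
        raise ValueError("minutes_taken must be positive")
    return num_queries * (60 / minutes_taken)


if __name__ == "__main__":
    # Hypothetical run: 60K queries that finish in 36 minutes.
    print(queries_per_hour(60_000, 36))
```

Put differently, a 60K-query run has to finish in 36 minutes or less to meet the 100K qph requirement.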
Our queries were already highly optimized, so we focused on other parts. A query execution is essentially put together from
– Client execution [ 0s if done correctly ]
– Optimization [HiveServer2] [~ 0.1s]
– HCatalog lookups [HCatalog, Metastore] [ very fast in Hive 0.14 ]
– Application Master creation [4-5s]
– Container Allocation [3-5s]
– Query Execution
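To see why the focus moved away from the queries themselves, the per-phase figures in the list above can be added up. The sketch below assumes the phases run one after another and that nothing is reused between queries, which is a simplification. The metadata lookup is left out because the list gives no number for it.

```python
# (phase, min seconds, max seconds), using the figures from the list above.
PHASES = [
    ("client execution", 0.0, 0.0),
    ("optimization (HiveServer2)", 0.1, 0.1),
    ("application master creation", 4.0, 5.0),
    ("container allocation", 3.0, 5.0),
]


def fixed_overhead(phases):
    """Return (low, high) total seconds spent before any Tez task runs."""
    low = sum(p[1] for p in phases)
    high = sum(p[2] for p in phases)
    return low, high


if __name__ == "__main__":
    low, high = fixed_overhead(PHASES)
    print(f"fixed overhead per query: {low:.1f}s to {high:.1f}s")
```

Under these assumptions, application master creation and container allocation account for almost all of the time spent before actual query execution starts.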
Pending apps decreased, but
we did not gain much throughput.
Increased tez.am.am-rm.heartbeat.interval-ms.max from 250 ms to 1000 ms
Increased the replication factor for a specific directory from 3 to 10
OK, so LLAP. We are going to use LLAP for the YDN report.
As Jiang said in his talk, Hive on Tez can produce 100K reports per hour.
Our engineers found some bottlenecks and fixed them to meet the requirement, basically by tuning some parameters.
The next step is LLAP.
LLAP is a new Hive feature, available from Hive 2.0,
so we did some technical investigation; mostly we needed to know how LLAP can process the YDN report.
How many servers are necessary?
What parameters need to be changed?
Is it possible to generate 100K reports per hour?
What is LLAP?
I think LLAP has already been introduced in detail in other sessions and at previous Hadoop Summits.
So here I’m going to talk just briefly about what LLAP does.
LLAP is for sub-second query processing, and its main component is the persistent daemon.
Let me compare it with the Tez processing model; I think that is the easiest way to understand what LLAP does differently.
In the case of Tez, when the client submits a SQL query, an application master is created. This behavior is the same with LLAP.
Then the application master creates child Tez containers for computation. These are created dynamically and are not persistent.
On the other hand, an LLAP daemon is persistent. “Persistent” means it can be shared by multiple queries and multiple users, as long as the query does not use private data.
Persistence provides benefits such as avoiding startup cost, intelligent caching, and letting the JVM JIT-compile code effectively.
Let me say again that I won’t go through the internals.
So if you’re interested in that, please make sure to catch the talks by the core engineers from Hortonworks.
From here, I’m going to talk about some tuning points and performance results.
This is our LLAP cluster, used just for evaluation purposes.
The important point here is that there are 45 nodes for LLAP, which means 45 daemons are running.
We also set up HiveServer2 so that it does not become a bottleneck.
LLAP can be configured through XML files, just like Hive.
These XML files have many parameters; some of them are basic ones that you need to change for performance.
Some default values are not suitable for your system, so you need to change them.
These basic parameters are related to thread pool sizes.
This is a very simplified threading model of LLAP. LLAP has two main components, the executor and the IO layer.
The IO layer reads data from HDFS, decodes the ORC data, converts it to vectorized data, and passes it to the executor.
The executor gets data from the IO layer, computes, and generates results.
This data passing is completely asynchronous.
The number of executors is specified by hive.llap.daemon.num.executors.
The default is 4.
You need to set this value according to your number of CPU vcores.
In our case, it’s 40.
This chart shows the performance results.
The vertical axis is the number of queries per second; higher is better.
The horizontal axis is the number of executor threads.
The leftmost bar, the default of 4, is very slow, and the CPU is almost idle.
The second bar is 40, which is our number of CPU vcores.
No further improvement is observed when the size is larger than the number of vcores.
So it’s good to set this value to the number of CPU vcores.
Next, the IO thread pool size can be specified by hive.llap.io.threadpool.size.
The default is 10, which is also too small in our case.
This chart shows the performance results when changing the IO thread pool size.
The default does not perform well; it is not suitable for our CPUs.
Performance is better when the size equals the number of vcores.
Following these executor and IO results, I set both values to the number of CPU vcores for the performance tests on the later slides.
Memory.
When it comes to memory, there are two parameters.
One is for the executors, the other is for the IO layer.
One thing to note is that the executors use JVM on-heap memory, but the IO layer uses off-heap memory.
The executor memory value is adjusted slightly by an internal process and passed to the JVM as the -Xmx command-line parameter.
There is no clear guideline on what values are effective here; in our case, we split the physical memory size equally between the two.
If an LLAP daemon runs out of memory, you can find it through the LLAP Web UI. I’m going to talk about that later in this talk.
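The sizing rules described so far (both thread pools set to the number of vcores, memory split equally between executors and the IO layer) can be written down as a small helper. This is a sketch of the rule of thumb from this talk, not an official sizing tool. The two thread properties are the ones named above. The talk does not name the memory properties, so the memory entries use descriptive placeholder keys, and the 128 GB figure in the example is hypothetical.

```python
def suggest_llap_settings(vcores: int, llap_memory_mb: int) -> dict:
    """Apply the talk's rule of thumb to one LLAP node.

    vcores:          number of CPU vcores on the node
    llap_memory_mb:  memory you are willing to give to the LLAP daemon
    """
    if vcores <= 0 or llap_memory_mb <= 0:
        raise ValueError("vcores and llap_memory_mb must be positive")
    executor_mb = llap_memory_mb // 2        # on-heap, ends up in -Xmx
    io_mb = llap_memory_mb - executor_mb     # off-heap, used by the IO layer
    return {
        # Property names taken from the talk.
        "hive.llap.daemon.num.executors": vcores,
        "hive.llap.io.threadpool.size": vcores,
        # Placeholder keys: the memory property names are not given in the talk.
        "executor_memory_mb (on-heap)": executor_mb,
        "io_memory_mb (off-heap)": io_mb,
    }


if __name__ == "__main__":
    # 40 vcores as in the talk; 128 GB is a made-up memory size.
    for key, value in suggest_llap_settings(40, 128 * 1024).items():
        print(f"{key} = {value}")
```

Treat the output as a starting point only; as noted above, there is no clear guideline for the memory values, and the Web UI is the place to check whether a daemon is running short.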
Performance.
In this chart, the blue line is LLAP (Tez + LLAP) and the red line is Tez.
The vertical axis is queries per second; higher is better.
The horizontal axis is the number of clients; more clients means more concurrent queries.
This chart indicates that LLAP always performs better than Tez, even for batch processing rather than interactive queries.
(Note: align the scales of the graphs.)
Does the previous chart mean 100K per hour? We need 100K reports per hour for our ad reports.
From the chart, the maximum is about 24 queries per second, which is about 87,000 queries per hour using 45 LLAP daemons.
We were almost there, so close, but the 45 nodes in our test environment are not enough.
We calculated that if LLAP scales almost linearly, 70 nodes are enough for 100K reports per hour.
That is far smaller than the Tez system. LLAP provides us with really good performance.
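The arithmetic behind these numbers can be sketched as follows. The function names and the `efficiency` parameter are introduced here only for illustration; the talk states just the measured figures (about 24 qps, about 87,000 qph, 45 daemons) and the conclusion that 70 nodes are enough.

```python
import math


def qps_to_qph(qps: float) -> float:
    """Convert queries per second to queries per hour."""
    return qps * 3600


def nodes_for_target(target_qph: float, measured_qph: float,
                     measured_nodes: int, efficiency: float = 1.0) -> int:
    """Estimate nodes needed for target_qph from one measurement.

    efficiency is a hypothetical knob: 1.0 means perfectly linear scaling,
    lower values mean each added node contributes less than measured.
    """
    per_node_qph = measured_qph / measured_nodes * efficiency
    return math.ceil(target_qph / per_node_qph)


if __name__ == "__main__":
    print(qps_to_qph(24))                             # 24 qps expressed per hour
    print(nodes_for_target(100_000, 87_000, 45))      # perfectly linear scaling
    print(nodes_for_target(100_000, 87_000, 45, 0.75))  # with some scaling loss
```

With perfectly linear scaling the estimate is lower than 70 nodes, so the 70-node figure leaves headroom for scaling that is only "almost" linear.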
More tuning.
We found one more parameter that can be effective in our case.
The parameter is called client consistent splits.
It takes a boolean value, and the default is false.
The difference is whether or not the work given to LLAP daemons follows data locality,
that is, whether the data is on the same machine as the LLAP daemon.
Computation may be faster when an LLAP daemon uses local data instead of remote data.
With the default, false, the Tez application master distributes computation based on file locality.
With true, the Tez application master uses a kind of hash distribution to select the LLAP daemon. That means file locality is ignored,
and computation is distributed evenly across the LLAP cluster.
Recap: each node runs both an LLAP daemon and a DataNode daemon.
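The two placement strategies can be contrasted with a toy sketch. This is not Hive's or Tez's actual implementation; the function names, the use of MD5, and the host names are all assumptions made for illustration. It only shows the idea of choosing a daemon from a hash of the split versus choosing one that holds a replica of the data.

```python
import hashlib


def pick_daemon_by_hash(file_path: str, start_offset: int, daemons: list) -> str:
    """Hash-style placement: ignore locality, spread splits over all daemons.

    The same split always maps to the same daemon while the daemon list
    stays unchanged.
    """
    key = f"{file_path}:{start_offset}".encode("utf-8")
    digest = hashlib.md5(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(daemons)
    return daemons[index]


def pick_daemon_by_locality(replica_hosts: list, daemons: list):
    """Locality-style placement: prefer a daemon on a host holding a replica.

    Returns None when no replica host runs a daemon.
    """
    local = [host for host in replica_hosts if host in daemons]
    return local[0] if local else None


if __name__ == "__main__":
    daemons = [f"llap-node-{i:02d}" for i in range(45)]  # hypothetical hosts
    split = ("/warehouse/table_x/part-00000.orc", 0)     # hypothetical split
    print(pick_daemon_by_hash(*split, daemons))
    print(pick_daemon_by_locality(["llap-node-03", "llap-node-17"], daemons))
```

With hash placement, a skewed replica layout cannot concentrate work on a few nodes, which is one possible reason it can help even though reads become remote.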
Here is the result. It is somewhat the opposite of what I expected.
Ignoring file locality is faster than the default setting.
But it depends on data size, table size, and so on. We think this result cannot be generalized, but in our case it was faster when I changed the value from the default.
We have two items of future work.
We are now investigating and verifying these LLAP features.
The first is the Web UI.
The LLAP daemon exposes some basic metrics, such as memory footprint, CPU usage and cache hit ratio,
on a specific port. This feature is in Hive 2.1 and was contributed by Yahoo! JAPAN.
Thank you to my co-worker.
This feature is really useful for cluster administrators.
For example, when you cannot get good performance even though you have modern machines, there may be some misconfiguration of LLAP.
In that case, you can use this UI to see how the daemon is working and what the cache hit rate is.
In my case, I found through the UI that the number of executors was too small. The CPU was almost idle.
And in another JIRA ticket, this UI will be improved. I think that is not yet included in the master branch.
That ticket provides an aggregated view of the previous UI.
You can easily check the status of all cluster machines and all daemons.
Column-level ACLs are really important for us, and I think for other companies as well.
Of course, Hive can do this using HiveServer2.
HiveServer2 and the metastore know which data should be exposed to which user.
But in our environment, we use not only Hive but also other products, such as MapReduce.
They break the ACLs, because they can read HDFS directly, without going through HiveServer2.
So when you need column-level ACLs, you should use only Hive.
But we need other solutions; they are a must.
LLAP provides a solution for this issue.
It exposes LLAP as a storage layer, so products other than Hive can access the data while keeping the ACLs.
If you are interested, please see the JIRA ticket and LlapDump.java in the Hive repository on GitHub.