1. Analyzing Small Files in HDFS Cluster
Presenters: Rohit Jangid, Raman Goyal
HDFS Analysis for Small Files
2. Outline
▪ What are small files and their problems?
▪ Small Files Analysis
▪ Architecture
▪ FsImage Processing and Aggregation
▪ Implementation and tool
▪ Dashboards and Results
▪ Dashboards
▪ Results
▪ Future Work
▪ Conclusions
11. ARCHITECTURE
Diagram: HDFS Cluster → Raw FsImage → Interpreted FsImage → Attributed Files and Directory information → Aggregated Files and Directory information → Storage → Dashboard
12. LSR
FsIMAGE PROCESSING
Processed a 20 GB FsImage in ~20 minutes
Custom OIV interpreter for reduced memory usage
Fetched from NameNode; OIV to LSR interpreter
Diagram: HDFS Cluster → Raw FsImage → Interpreted FsImage
14. Attribution and Aggregation
Attributes Found Directly
▪ Owner Name
▪ Group Name
▪ Size of File
▪ Replication Factor
▪ Number of Direct File objects
▪ Last Modified Date
▪ Level of File
▪ Is File or Is Directory?
Aggregated Attributes (If Directory)
▪ Number of Small File objects
▪ Number of Namespace objects
▪ Smallest, Largest, Avg File size
▪ Difference in Size since Last run
15. Attribution and Aggregation
▪ Generate Small Files / Total Files metrics
▪ Roll up attributes to parent directories
▪ Custom UDFs and Pig scripts
▪ Using Sqoop
▪ Stored in HDFS
Diagram: Attributed Files and Directory information → Aggregated Files and Directory information → Storage
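The roll-up of file attributes to parent directories could be sketched as below. The deck does this with Pig scripts and custom UDFs; this is a plain Python stand-in for illustration only. The 10 MB small-file threshold and the names `DirStats`, `ancestors`, and `roll_up` are assumptions.

```python
import posixpath
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

SMALL_FILE_BYTES = 10 * 1024 * 1024  # hypothetical small-file threshold


@dataclass
class DirStats:
    total_files: int = 0
    small_files: int = 0
    total_bytes: int = 0
    smallest: Optional[int] = None
    largest: int = 0

    @property
    def avg_size(self) -> float:
        return self.total_bytes / self.total_files if self.total_files else 0.0

    @property
    def small_ratio(self) -> float:
        """Small files divided by total files under this directory."""
        return self.small_files / self.total_files if self.total_files else 0.0


def ancestors(path):
    """Yield every ancestor directory of a file path, nearest first, up to '/'."""
    parent = posixpath.dirname(path)
    while True:
        yield parent
        if parent == "/":
            break
        parent = posixpath.dirname(parent)


def roll_up(files):
    """Aggregate (path, size) pairs into statistics for every ancestor directory."""
    stats = defaultdict(DirStats)
    for path, size in files:
        for directory in ancestors(path):
            s = stats[directory]
            s.total_files += 1
            s.total_bytes += size
            if size < SMALL_FILE_BYTES:
                s.small_files += 1
            s.smallest = size if s.smallest is None else min(s.smallest, size)
            s.largest = max(s.largest, size)
    return stats


files = [
    ("/data/logs/a.log", 1024),
    ("/data/logs/b.log", 200 * 1024 * 1024),
    ("/data/c.csv", 5 * 1024 * 1024),
]
stats = roll_up(files)
```

Each file contributes to every directory above it, so a directory's numbers cover its whole subtree, not only its direct children.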
18. Implementation and Tool
▪ Download and Interpret: HDFS NameNode, using OIV Interpreter
▪ Files and Directories Attributed: by splitting FsImage rows
▪ Small file & Directory information: at Directory level; statistics like Smallest File calculated
▪ Storage, REST API and Dashboards: can easily add new Clusters in Tool
20. Dashboards Information
▪ 3 possible bucketing models: file size less than 10 MB, between 10 MB and 70 MB, and between 70 MB and ~100 MB
▪ Goes up to all levels in HDFS
▪ Distribution of owners of small files
▪ Top 10 directories to be investigated for deletion, re-partition, or compaction
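A minimal Python sketch of the dashboard calculations, treating the three models as size ranges. The exact boundary handling, the 100 MB upper limit standing in for "~100 MB", and the names `bucket`, `owner_distribution`, and `top_directories` are assumptions, not the tool's actual code.

```python
import posixpath
from collections import Counter

MB = 1024 * 1024


def bucket(size):
    """Return the size bucket label for a file, or None if it is not small."""
    if size < 10 * MB:
        return "<10 MB"
    if size < 70 * MB:
        return "10-70 MB"
    if size < 100 * MB:  # assumed cut-off for "~100 MB"
        return "70-100 MB"
    return None


def owner_distribution(files):
    """Count small files per owner from (path, owner, size) tuples."""
    return Counter(owner for _, owner, size in files if bucket(size) is not None)


def top_directories(files, n=10):
    """Rank parent directories by the number of small files they hold directly."""
    counts = Counter(
        posixpath.dirname(path)
        for path, _, size in files
        if bucket(size) is not None
    )
    return counts.most_common(n)


files = [
    ("/data/logs/a.log", "hdfs", 1 * MB),
    ("/data/logs/b.log", "hdfs", 20 * MB),
    ("/data/logs/c.log", "etl", 150 * MB),
    ("/data/raw/d.csv", "etl", 80 * MB),
]
owners = owner_distribution(files)
ranking = top_directories(files)
```

Directories at the top of such a ranking are the natural candidates for deletion, re-partitioning, or compaction.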
26. Future Work
1. Doesn’t have real-time analysis with alerting! The cluster has 200+ million namespace objects, which we get as a memory dump from the Hadoop server. Translating and attributing each directory and file is a time-consuming process.
2. Developing a customisable compaction utility