4. LSF 2010 participants/companies
Company     # of participants   Key background
Intel       5                   Kernel performance, SSD, mem mgmt
EMC         5                   Storage, file system
Fujitsu     4                   IO controller, btrfs
Taobao      3                   Distributed storage, Taobao server
Novell      2                   SUSE server, HA
Oracle      2                   OCFS2 dev/test
Baidu       2                   Baidu kernel optimization
Canonical   2
Red Hat     1                   Network driver
5. Key topics
Topic                     Slides?   Description
Page writeback            Y         Dirty page ratio limit;
                                    control process to write pages
CFQ, IO controller        Y         CFQ introduction and further features
BTRFS                     N         Memory consumption, fsck speed
SSD/Block layer           Y         Block layer issues with SSD
Kernel tracing            N         ftrace
VFS scalability           N         Multi-core challenges
Kernel testing            Y         Intel kernel auto test framework
Industrial talk: Taobao   Y         TFS, Tair
Industrial talk: Baidu    N         The architecture of the Baidu search system
Industrial talk: EMC      N         FSCK
6. Writeback - Wu Fengguang
• vmscan is a bottleneck
Decrease the dirty ratio under memory pressure, so vmscan is less likely to find
dirty pages during page allocation.
• pageout(page) calls writepage() to write to disk, which is a performance killer
since it does random writes
Let the flusher write instead, expanding a single 4K write to a 4MB write, so more
dirty pages are flushed and reclaimed.
• balance_dirty_pages() should not write: random writes kill performance
Let the flusher write and ask the process to sleep. Three proposals:
a). wait for IO completion: NFS completions are bumpy, so a smoother sleep method is needed.
b). sleep (dirtied * 3/2 / write_bandwidth)
c). sleep (dirtied / throttle_bandwidth)
• Flusher default write size (4MB -> 128MB); will be dynamic in the future.
Baidu's practice: SSD random write is really bad. For sequential writes, increasing
the writeback size (4MB -> 40MB) yields 120% SSD performance
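Proposals b) and c) above come down to simple arithmetic. The following is a minimal sketch, assuming "dirtied" is measured in bytes and the bandwidths in bytes per second; the function names and the sample numbers are illustrative and are not taken from the kernel patches:

```python
def pause_by_write_bandwidth(dirtied_bytes, write_bandwidth):
    """Proposal b): sleep for dirtied * 3/2 / write_bandwidth seconds."""
    return dirtied_bytes * 3 / 2 / write_bandwidth


def pause_by_throttle_bandwidth(dirtied_bytes, throttle_bandwidth):
    """Proposal c): sleep for dirtied / throttle_bandwidth seconds."""
    return dirtied_bytes / throttle_bandwidth


MiB = 1024 * 1024

# Hypothetical numbers: a task has dirtied 4 MiB, the device writes
# 80 MiB/s, and the task is to be throttled to 40 MiB/s.
pause_b = pause_by_write_bandwidth(4 * MiB, 80 * MiB)
pause_c = pause_by_throttle_bandwidth(4 * MiB, 40 * MiB)
print(f"b) sleep {pause_b:.3f} s, c) sleep {pause_c:.3f} s")
```

In both variants the dirtying process only sleeps; the actual writing is left to the flusher.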
7. Btrfs - Coly Li
• Has too much love from the Linux community
two years ago > now
• Used in MeeGo Project
• Taobao plans to push industrial deployment in 2-3 years
10T per data server in TFS cluster
Use on SSD and SATA hybrid data server
Metadata will be allocated on SSD and data on SATA.
• Dynamic data relocation with hot data tracking patch.
For generic fs usage, it needs to work with device mapper to get device speed
information.
• FSCK
Difficult, but a must. Currently assigned to Fujitsu.
8. SSD challenges - Li Shaohua
• Throughput: same issue as network
• Disk controller gap and big locks (queue locks & scsi locks)
• Interrupt related:
a. SMP affinity: with a single queue, one CPU handles all the IRQs
b. blk_iopoll: poll more requests per interrupt
• Need hardware multiqueue
• CFQ needs to be changed to fit multiqueue (e.g. CFQ per queue)
• Queue lock contention vs. cache lock contention
See Andi Kleen's talk in Tokyo Linux Conf
• Intel is building a next-generation PCIe SSD with many fancy features. Stay tuned
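For the SMP affinity point above, Linux exposes a per-IRQ CPU bitmask in /proc/irq/<N>/smp_affinity. The slide does not say which knob was discussed, so treating this file as the relevant one is an assumption. A minimal sketch of how such a hex mask is built (the helper name is hypothetical; for systems with more than 32 CPUs the kernel uses comma-separated 32-bit groups, which this sketch does not produce):

```python
def cpus_to_affinity_mask(cpus):
    """Return the hex bitmask string with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        if cpu < 0:
            raise ValueError("CPU numbers must be non-negative")
        mask |= 1 << cpu  # bit N set means CPU N may handle the IRQ
    return format(mask, "x")


single_cpu = cpus_to_affinity_mask([0])         # one CPU takes every IRQ
spread = cpus_to_affinity_mask([0, 1, 2, 3])    # IRQ may land on CPUs 0-3
print(single_cpu, spread)
```

With a single hardware queue there is only one interrupt source to steer, which is why the slide goes on to ask for hardware multiqueue.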
9. VFS scalability - Ma Tao
• With multi-cores, all global locks suck
• Global icache/dcache can be adapted to per-CPU
• CFQ can be adapted to per-queue
• The fewer global locks, the better
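The idea of replacing one global lock with per-CPU (or per-queue) state can be illustrated in user space. This is only an analogy sketched with threads, not kernel code; the class and function names are hypothetical. Each shard has its own lock, so updates on different shards never contend, and only the rare "sum everything" path touches all of them:

```python
import threading


class ShardedCounter:
    """A counter split into shards, each protected by its own lock."""

    def __init__(self, shards):
        self._locks = [threading.Lock() for _ in range(shards)]
        self._counts = [0] * shards

    def add(self, shard, n=1):
        # Fast path: only the caller's shard lock is taken.
        with self._locks[shard]:
            self._counts[shard] += n

    def total(self):
        # Slow path: visit every shard, one lock at a time.
        result = 0
        for i, lock in enumerate(self._locks):
            with lock:
                result += self._counts[i]
        return result


def worker(counter, shard, iterations):
    for _ in range(iterations):
        counter.add(shard)


counter = ShardedCounter(shards=4)
threads = [threading.Thread(target=worker, args=(counter, s, 1000))
           for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
grand_total = counter.total()
print(grand_total)
```

The trade-off is the same one the bullets hint at: updates get cheaper, while anything that needs a global view has to walk all shards.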
10. Industry talk – Baidu (Cont.)
• Service types
a). Indexing: random read, high IOPS, small IO size, read-only. 80M records per data
node; each node processes 8-9K queries per second.
b). Distributed system: large files, sequential read/write.
c). cache/KV storage: between a and b
d). Web Server: CPU bound.
• For a), read() sucks. Use mmap() to read blocks ahead and avoid the
kernel/userspace memory copy.
mmap() cannot use the page cache LRU. Call readahead() after each mmap() to mark
pages as read.
mmap() page faults are expensive because of the mm->mmap_sem lock. Use sync_readahead()
and sync_readaheadv()
With the above, memory is now the bottleneck. Doing 10G+ MB read.
• Google patch for reducing mm->mmap_sem hold time
In do_page_fault(), drop mm->mmap_sem if the page is not found, then read it and take the lock
again.
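The mmap()-instead-of-read() approach for index lookups can be sketched in user space as follows. This is an illustration under stated assumptions, not Baidu's code: the 64-byte record layout and the helper names are hypothetical, and os.posix_fadvise with POSIX_FADV_WILLNEED stands in for the readahead() call mentioned above, since Python does not wrap readahead() directly.

```python
import mmap
import os
import struct
import tempfile

RECORD_SIZE = 64  # hypothetical fixed-size index record


def open_index(path):
    """Map the whole index file read-only and hint the kernel to prefetch it."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        if hasattr(os, "posix_fadvise"):
            # Stand-in for readahead(): ask for the pages to be read in early.
            os.posix_fadvise(fd, 0, size, os.POSIX_FADV_WILLNEED)
        return mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # the mapping keeps its own duplicate of the descriptor


def record_key(mm, index):
    """Decode the 4-byte key of a record in place, without a read() call."""
    return struct.unpack_from("<I", mm, index * RECORD_SIZE)[0]


# Build a small throwaway index file with 100 records, then look one up.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    for i in range(100):
        f.write(struct.pack("<I", i).ljust(RECORD_SIZE, b"\0"))

mm = open_index(path)
try:
    key = record_key(mm, 42)
finally:
    mm.close()
    os.unlink(path)
print(key)
```

Each lookup touches the mapped pages directly, so a random read costs at most a page fault rather than a system call plus a copy into a user buffer.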
11. Industry talk – Baidu (Cont.)
• 8K filesystem block size (ext2) with 8K page size. Allocate two contiguous pages
each time
Gets better performance for sequential IO
OCFS2 uses a 1MB fs block size and a 4K page size
• PCIe compression card + ECC
12. Industry Talk - Taobao
• TFS
• Tair uses an update server to record updates; the updates are applied to the production
system at midnight
• A config server is used to minimize meta server workload
A versioned bucket table is maintained by the config server and stored in each data
server. Clients can work out data locations from the bucket table returned by the config
server.
• Both TFS and Tair are open source projects now
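The versioned bucket table described above can be sketched as a small lookup structure. Everything beyond "versioned table handed out by the config server" is an assumption made for illustration: the CRC32 hash, the class names, and the refresh rule are hypothetical and are not taken from the Tair sources.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BucketTable:
    version: int
    servers: tuple  # servers[i] is the data server that owns bucket i

    def locate(self, key: bytes) -> str:
        # Hash the key to a bucket, then map the bucket to its server.
        bucket = zlib.crc32(key) % len(self.servers)
        return self.servers[bucket]


class Client:
    """Caches a table and replaces it only when a newer version appears."""

    def __init__(self, table):
        self.table = table

    def refresh_if_stale(self, latest):
        if latest.version > self.table.version:
            self.table = latest
            return True
        return False


v1 = BucketTable(version=1, servers=("ds1", "ds2", "ds1", "ds2"))
v2 = BucketTable(version=2, servers=("ds1", "ds2", "ds3", "ds2"))

client = Client(v1)
before = client.table.locate(b"user:1001")
refreshed = client.refresh_if_stale(v2)
after = client.table.locate(b"user:1001")
print(before, refreshed, after)
```

Because clients resolve locations locally from the cached table, the config server is contacted only when the table version changes.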
13. Industry Talk - EMC
• Introduced the recovery and check methods used in file systems
from ext2 to btrfs
• Emphasized the importance of FSCK; introduced the issues FSCK runs into when
checking a huge file system; collected proposals to solve this problem
• pNFS learning notes