Concurrent Programming: Practice and Reflections
http://www.cnblogs.com/promise6522/
PromisE_ 谢

Preface: The Art of Trade-offs
 "Premature optimisation is the root of all evil"
 Donald Knuth

 "Intuition is frequently wrong; be data intensive"
 "Real-World Concurrency", ACM Queue

 Buy the largest gain at the smallest cost, and keep complexity out

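How do you optimize the concurrent parts of a program?
 "Multi-threading is easy. Correct synchronization is hard."
 In descending order of priority:
   Design level: share less data
   Coding level: make locks finer-grained
   Tool level: use lightweight synchronization primitives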

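Share Less Data (1)
 Case study: counter design in the Linux kernel
 A mutex is out of the question (in practice most counters are an atomic_t)
 Use per-CPU counters; incrementing one needs no lock
 When a per-CPU counter exceeds a threshold, its value is added to the global counter (under a spinlock) and the per-CPU counter is reset to zero

 Pro: much higher concurrency
 Con: precision. The more CPU cores, the larger the possible error

 In practice, most statistics do not need exact values!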

01/13/14
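The batching scheme of the kernel counter can be sketched in user space. This is only an illustration under stated assumptions: it uses one shard per thread rather than per CPU, a `std::mutex` stands in for the kernel spinlock, and every name (`approx_counter`, `kThreshold`, `run`) is hypothetical.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

namespace approx_counter {

const long kThreshold = 64;      // assumed batch size before a flush

std::mutex g_lock;               // stands in for the kernel spinlock
long g_count = 0;                // global counter, guarded by g_lock
thread_local long t_local = 0;   // per-thread shard, never shared

// Hot path: touches only the calling thread's shard.
inline void add(long n) {
    t_local += n;
    if (t_local >= kThreshold) {             // fold the batch into the global
        std::lock_guard<std::mutex> guard(g_lock);
        g_count += t_local;
        t_local = 0;
    }
}

// Approximate read: misses whatever is still parked in the shards,
// which is less than kThreshold per thread.
inline long read() {
    std::lock_guard<std::mutex> guard(g_lock);
    return g_count;
}

// Helper for experiments: `threads` workers, `per_thread` increments each.
inline long run(int threads, int per_thread) {
    std::vector<std::thread> workers;
    for (int i = 0; i < threads; ++i)
        workers.emplace_back([per_thread] {
            for (int j = 0; j < per_thread; ++j) add(1);
        });
    for (auto& w : workers) w.join();
    return read();
}

}  // namespace approx_counter
```

The error bound is the trade-off named on the slide: a read can be short by up to `kThreshold` per thread, so it grows with the number of shards.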

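Share Less Data (2)
 Case study: TCMalloc, a multithread-friendly memory allocator

 Small objects are allocated and freed in the per-thread Thread Cache
 Large objects (>32K) are allocated and freed in the Central Heap
 When a Thread Cache grows past 2M (the more threads, the lower this limit), memory is returned to the Central Heap

 TCMalloc and its kin seem to have become standard equipment for MySQL servers (memory-intensive applications)

 "Github got 30% better performance using TCMalloc with MySQL"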

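Share Less Data (3)
 Personal case: a program profiler using thread-local storage
 It measures the cost of each part of the system (per function / per scope / per phase)
 Macros (one switch to turn everything on or off) + RAII: when a measured scope exits, its time + context are sent to a statistics container

 To keep the profiler's own impact on the system small:
   The statistics container is thread-local, so recording needs no lock
   When a thread exits, its local container is merged into the global one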
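A minimal sketch of such a profiler follows. It is not the author's code: the names `prof`, `ScopeTimer`, `PROFILE_SCOPE` and `PROFILE_OFF` are invented for illustration, and the "context" is reduced to a scope name.

```cpp
#include <cassert>
#include <chrono>
#include <map>
#include <mutex>
#include <string>
#include <thread>

namespace prof {

struct Stat { long calls = 0; long long total_ns = 0; };
typedef std::map<std::string, Stat> StatMap;

std::mutex g_mutex;   // taken only when a thread merges, once per thread
StatMap g_stats;      // global container

// Per-thread container; its destructor runs at thread exit and merges.
struct LocalStats {
    StatMap stats;
    ~LocalStats() {
        std::lock_guard<std::mutex> guard(g_mutex);
        for (const auto& kv : stats) {
            g_stats[kv.first].calls += kv.second.calls;
            g_stats[kv.first].total_ns += kv.second.total_ns;
        }
    }
};

inline LocalStats& local() {
    thread_local LocalStats instance;
    return instance;
}

// RAII probe: records the elapsed time of the enclosing scope on exit.
class ScopeTimer {
public:
    explicit ScopeTimer(const char* name)
        : name_(name), start_(std::chrono::steady_clock::now()) {}
    ~ScopeTimer() {
        long long ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                           std::chrono::steady_clock::now() - start_).count();
        Stat& s = local().stats[name_];   // thread-local: no lock needed
        s.calls += 1;
        s.total_ns += ns;
    }
private:
    const char* name_;
    std::chrono::steady_clock::time_point start_;
};

}  // namespace prof

// One switch for the whole build: define PROFILE_OFF to compile probes out.
#ifndef PROFILE_OFF
#define PROFILE_SCOPE(name) prof::ScopeTimer profile_scope_timer_(name)
#else
#define PROFILE_SCOPE(name) ((void)0)
#endif
```

Placing `PROFILE_SCOPE("parse");` at the top of a function or block is enough to measure it. The only lock in the design is taken once per thread, at exit.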

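Pitfall: Is Nothing Really Shared?
 Scenario: multithreaded counting, with an int array result as the result set
 Threads are assigned consecutive ids at startup
 Thread 1 updates result[1], thread 2 updates result[2], and so on
 To read the count, take the sum of result[]

 No contention between threads? Not so.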


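False Sharing:
 Cause: the CPU loads memory into its cache one "line" at a time, so neighbouring array elements share a line and a write by one core invalidates that line in every other core's cache
 cat /proc/cpuinfo | grep cache_alignment (64 bytes)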

False Sharing Is Everywhere
 Beyond contiguous arrays:
 The linker lays out global or static data close together in memory
 Struct and C++ object layouts are compact
 Two separate objects on the heap can happen to be adjacent (especially objects of the same kind allocated from their own memory pool or slab allocator)

Solving False Sharing
 How to reduce false sharing as far as possible:
 Use thread-local storage
   GCC: __thread
   Boost: thread_specific_ptr<T>

 Alignment + padding (kernel slab allocator: SLAB_HWCACHE_ALIGN)
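The alignment + padding fix applied to the per-thread result array might look like the sketch below. The 64-byte line size is an assumption taken from the `cache_alignment` value quoted earlier, and `PaddedCounter`, `kThreads` and `count_in_parallel` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

const std::size_t kCacheLine = 64;   // assumed; check cache_alignment on the target
const int kThreads = 4;

// alignas pads each slot out to a full cache line, so two threads
// never write to the same line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Each thread updates only its own slot; the total is the sum of all slots.
inline long count_in_parallel(long per_thread) {
    PaddedCounter result[kThreads];
    std::vector<std::thread> workers;
    for (int id = 0; id < kThreads; ++id)
        workers.emplace_back([&result, id, per_thread] {
            for (long i = 0; i < per_thread; ++i) result[id].value++;
        });
    for (auto& w : workers) w.join();

    long sum = 0;
    for (const PaddedCounter& c : result) sum += c.value;
    return sum;
}
```

The cost is memory: each `long` now occupies 64 bytes. The array lives on the stack here because, before C++17, containers such as `std::vector` are not guaranteed to honour over-aligned element types.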

Reducing Lock Granularity
 When is finer-grained locking worth it?
 Know your cold paths from your hot paths
 For a cold path, one coarse-grained lock is enough

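How Do You Reduce Lock Granularity?
 Within one module, use different locks for data that is not always accessed together
   Acquire them in a fixed order to prevent deadlock

 Use locks (critical sections) to protect data, not operations
   Move potentially slow operations out of the critical section (IO in particular)
   Avoid calling unknown code inside a critical section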


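Unlock at the Right Time
 Case: PlayerSessionManager looks up a player's info by player id
 1) lock(), check the map<id, session> table, and return the entry if it is there
 2) unlock(), then read the info from Memcache (network + disk IO must not happen while holding the lock)
 3) lock(), check the map<id, session> table again: if the id is now present, the table's current entry wins; otherwise insert the loaded entry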
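The three steps can be sketched as follows. The class name comes from the slide; everything else (`Session`, `load_from_backend`, the method names) is assumed for illustration, and the Memcache read is replaced by a stub.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <utility>

struct Session { std::string name; };

// Stand-in for the Memcache read; the real call is slow network + disk IO.
inline Session load_from_backend(int id) {
    return Session{"player-" + std::to_string(id)};
}

class PlayerSessionManager {
public:
    Session get(int id) {
        {   // 1) lock and check the table
            std::lock_guard<std::mutex> guard(mutex_);
            std::map<int, Session>::iterator it = sessions_.find(id);
            if (it != sessions_.end()) return it->second;
        }   // 2) the lock is released here, before the slow read
        Session loaded = load_from_backend(id);

        // 3) lock again and re-check: another thread may have won the race.
        //    map::insert leaves an existing entry untouched, so the table's
        //    current content wins.
        std::lock_guard<std::mutex> guard(mutex_);
        return sessions_.insert(std::make_pair(id, loaded)).first->second;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return sessions_.size();
    }

private:
    mutable std::mutex mutex_;
    std::map<int, Session> sessions_;
};
```

Two threads that miss at the same time will both hit the backend. That duplicated read is the price of never holding the lock across IO.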

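Re-locking: Version Number Validation
 A more general approach for flows like this one:
 On the first lock, cache the current version number: cached_gen
 On re-acquiring the lock, compare the current version number gen with cached_gen
   Same: perform the update as usual and increment gen
   Different: depending on business requirements, either abandon the operation or retry from the start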
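A compact sketch of the generation check, with `gen` and `cached_gen` named as on the slide. The `VersionedValue` class and its `update_with` method are hypothetical, and the "abandon" policy is chosen arbitrarily: a stale update simply returns false and the caller decides whether to retry.

```cpp
#include <cassert>
#include <mutex>

class VersionedValue {
public:
    // Runs slow_compute without the lock. Returns false if another
    // update landed in the meantime.
    template <typename SlowFn>
    bool update_with(SlowFn slow_compute) {
        int snapshot;
        unsigned long cached_gen;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            snapshot = value_;
            cached_gen = gen_;               // remember the generation we saw
        }
        int next = slow_compute(snapshot);   // lock is not held here

        std::lock_guard<std::mutex> guard(mutex_);
        if (gen_ != cached_gen) return false;   // stale: drop or retry
        value_ = next;
        ++gen_;                                  // publish a new generation
        return true;
    }

    int value() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    int value_ = 0;
    unsigned long gen_ = 0;
};
```

A retrying caller would wrap the call in `while (!v.update_with(fn)) {}`. Whether that is acceptable depends on how expensive `fn` is and how often updates collide.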

Read-Write Locks to Reduce Reader/Writer Contention?
 The boost::shared_mutex implementation

 Why is it so complex?
 Notify queue
 boost::upgrade_lock
 Starvation prevention


Use Read-Write Locks with Caution
Complex to implement, and slow
 The C++0x committee rejected boost::shared_mutex
 "lock cost is higher than plain mutex even for readers"
 pthread_rwlock in the POSIX library is likewise built on mutex + cond_var

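Where Do Read-Write Locks Fit?
 Reads are frequent: there is contention to relieve
 Reads are relatively slow: this amortizes the lock's own complexity overhead
 Writes are rare: little reader/writer contention
 Discussion: where do read-write locks come up in day-to-day development?

 In databases: lock table xxx in shared/exclusive mode
   In most workloads reads far outnumber writes
   A read can be slow (IO on a cache miss)
   Yet in practice many databases handle read/write contention internally with MVCC (PostgreSQL, for example)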

To Lock or Not to Lock?
 Basic data types: operations the CPU guarantees to be atomic:
The Intel486 processor:
• Reading or writing a byte
• Reading or writing a word aligned on a 16-bit boundary
• Reading or writing a doubleword aligned on a 32-bit boundary
The Pentium processor (and newer processors since):
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus
The P6 family processors (and newer processors since):
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
"Intel 64 and IA-32 Architectures Software Developer's Manual", Volume 3

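Limited "Atomic" Semantics
 What atomicity gives you:
   If one thread executes x = a while another executes x = b,
   any thread reading x at any time sees a or b, never a torn mixture of the two

 It does nothing for read-modify-write / read-check-modify sequences:
   i++, a = a + n
   if (x == a) x = b

 It does not guarantee visibility to other threads:
   Compiler optimization and CPU out-of-order execution: the actual execution order differs from the expected one
   The CPU write buffer: a store to an address is not reflected in memory immediately
   Register allocation, memory barriers, and so on
 (These are deep waters and my own understanding is limited, so no details here)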
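For contrast, the two compound patterns listed above become single indivisible steps with C++11 `std::atomic`. This is a small sketch; `g_counter`, `g_state`, `transition` and `hammer` are illustrative names.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

std::atomic<int> g_counter(0);
std::atomic<int> g_state(0);

// i++ as one indivisible read-modify-write.
inline void increment() {
    g_counter.fetch_add(1, std::memory_order_relaxed);
}

// "if (x == a) x = b" as one indivisible read-check-modify.
// Returns true only for the caller that actually made the transition.
inline bool transition(int a, int b) {
    int expected = a;
    return g_state.compare_exchange_strong(expected, b);
}

// Helper: several threads increment concurrently; no update is lost.
inline int hammer(int threads, int per_thread) {
    g_counter.store(0);
    std::vector<std::thread> workers;
    for (int i = 0; i < threads; ++i)
        workers.emplace_back([per_thread] {
            for (int j = 0; j < per_thread; ++j) increment();
        });
    for (auto& w : workers) w.join();
    return g_counter.load();
}
```

With a plain `int` and `++`, the same test loses increments and is a data race. The relaxed ordering used in `increment` is enough for a pure counter but would not be enough to publish other data.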


Visibility Example (1)
 Example: using a bool to control a loop in another thread:
Thread 1:
    while(A)
    {
        read(B) // at this point B is not guaranteed to equal some_value
        // do something
    }
Thread 2:
    B = some_value()
    A = 1 // tell Thread 1 that B is ready
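One way to make this handshake well-defined in C++11 is a release store paired with an acquire load on the flag. The sketch keeps the names A and B from the slide but assumes the reader waits for the flag; `some_value`, `producer`, `consumer` and `run_once` are placeholders.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int B = 0;                    // plain data, published through A
std::atomic<bool> A(false);   // "B is ready" flag

inline int some_value() { return 42; }   // placeholder for the real producer

inline void producer() {
    B = some_value();
    // Release: the write to B above cannot be reordered past this store.
    A.store(true, std::memory_order_release);
}

inline int consumer() {
    // Acquire: once this load sees true, the write to B is visible too.
    while (!A.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return B;
}

// Helper that runs one reader and one writer and returns what was read.
inline int run_once() {
    int seen = 0;
    std::thread reader([&seen] { seen = consumer(); });
    std::thread writer(producer);
    writer.join();
    reader.join();
    return seen;
}
```

The release/acquire pair constrains both the compiler and the CPU, which is what a plain `bool` or `int` flag cannot do.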

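Visibility Example (2)
 Example: a singleton implemented with double-checked locking
class Foo {
    static Foo* instance() {
        if (m_pFoo == NULL) {
            scoped_lock lock(m_mutex);  // the guard must be a named object
            if (m_pFoo == NULL) {
                m_pFoo = new Foo;
                // The store to m_pFoo may become visible before Foo is fully
                // constructed. Another thread then skips the lock and uses a
                // half-built object.
            }
        }
        return m_pFoo;
    }
    // omitted for brevity ...
};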
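For comparison, double-checked locking can be made correct with C++11 atomics. The following is a sketch under that assumption, with member names (`s_instance`, `s_mutex`, `value_`) invented here; the instance is deliberately never deleted.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>

class Foo {
public:
    static Foo* instance() {
        // Fast path: an acquire load pairs with the release store below.
        Foo* p = s_instance.load(std::memory_order_acquire);
        if (p == nullptr) {
            std::lock_guard<std::mutex> guard(s_mutex);
            p = s_instance.load(std::memory_order_relaxed);   // second check
            if (p == nullptr) {
                p = new Foo;   // construct completely first...
                s_instance.store(p, std::memory_order_release);   // ...then publish
            }
        }
        return p;
    }

    int value() const { return value_; }

private:
    Foo() : value_(7) {}
    int value_;

    static std::atomic<Foo*> s_instance;
    static std::mutex s_mutex;
};

std::atomic<Foo*> Foo::s_instance(nullptr);
std::mutex Foo::s_mutex;
```

A thread that sees a non-null pointer through the acquire load is guaranteed to see the constructed object as well, which is the guarantee the plain-pointer version lacks.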

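Aside: A Thread-Safe Singleton in C++
 GCC automatically guarantees thread-safe initialization of local statics (C++11 makes this guarantee part of the standard)
 -fno-threadsafe-statics turns the feature off

class Foo {
public:
    static Foo* instance() {
        static Foo instance;  // initialized exactly once, thread-safely
        return &instance;
    }
    // omitted for brevity
};

 The generated assembly:
 (objdump -d obj)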

The Right Approach: Atomic Operations
 For basic data types they solve:
   Visibility/ordering problems
   Compound operations such as read-modify-write / read-check-modify

 Tools:
   GCC built-in atomic operations: __sync_fetch_and_add(...) / __sync_compare_and_swap(...) and others
   C++11: atomic<T>
 Use the atomic reference count of shared_ptr<T> to manage object lifetime


Atomic Operations Taken to the Extreme: Lock-Free Structures?
 Intel Threading Building Blocks (TBB)
   concurrent_hash_map

 Boost.Lockfree
   queue / stack / spsc_queue

 Hand-written lock-free queues


The Core of Lock-Free Algorithms: the CAS Loop
 Case studies:
 How is weak_ptr::lock() implemented?

 Implementing a lock-free stack: Pop()

 Implementing a lock-free stack: Push()
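The three cases share one shape: read the current value, compute the new one, and retry if the compare-and-swap fails. The sketch below is not taken from any library: `increment_if_not_zero` mimics the idea behind `weak_ptr::lock()`, and `LockFreeStack` is a textbook Treiber stack simplified by never freeing popped nodes.

```cpp
#include <atomic>
#include <cassert>

// Idea behind weak_ptr::lock(): take a strong reference only if at least
// one still exists.
inline bool increment_if_not_zero(std::atomic<long>& use_count) {
    long seen = use_count.load(std::memory_order_relaxed);
    while (seen != 0) {
        // On failure, compare_exchange_weak reloads `seen`, so the next
        // iteration checks against the freshest value.
        if (use_count.compare_exchange_weak(seen, seen + 1,
                                            std::memory_order_acq_rel,
                                            std::memory_order_relaxed))
            return true;
    }
    return false;   // the object has already expired
}

template <typename T>
class LockFreeStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> head_;

public:
    LockFreeStack() : head_(nullptr) {}

    void push(const T& value) {
        Node* node = new Node{value, head_.load(std::memory_order_relaxed)};
        // Retry until head_ still equals node->next at the moment of the swap.
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    // Simplification: popped nodes are intentionally leaked. Freeing them
    // here would expose the ABA and use-after-free problems that real
    // implementations solve with hazard pointers or similar schemes.
    bool pop(T& out) {
        Node* node = head_.load(std::memory_order_acquire);
        while (node != nullptr &&
               !head_.compare_exchange_weak(node, node->next,
                                            std::memory_order_acquire,
                                            std::memory_order_acquire)) {
        }
        if (node == nullptr) return false;
        out = node->value;
        return true;
    }
};
```

The leak in `pop` is the honest part of the sketch: safe memory reclamation is where most of the real complexity of lock-free containers lives.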

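Lock-Free Structures: Really Needed?
 Unless there is no other way, do not reach for special-purpose lock-free / concurrent structures

   Not proven in production, with assorted hidden hazards (the ABA problem and livelock, for example)
   A small performance gain paid for with high maintenance and debugging cost
   Let the data decide

 Field experience:

 In a highly concurrent Actor system, we tried replacing the id-to-actor table with a TBB concurrent container; the effect was not noticeable.
 Benchmarking TBB's concurrent map gave these results:
   Single-threaded insert performance is far below that of the STL container
   Only with 8 threads doing dense inserts does it draw level with a mutex-protected map
 (Perhaps the code needs to be specially optimized with ICC?)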


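A Heavy Sword Needs No Edge: a Fresh Look at mutex
 Linux futex: fast userspace mutex
   A kernel wait queue is tied to a user-space mutex (in essence an aligned integer)
   The application only manipulates the user-space integer (with atomic operations)

 Hybrid behaviour of a futex-based mutex:
   Try in user space first (an atomic operation, optionally a short spin)
   Enter the kernel only if the lock is still unavailable after that

 "When in doubt, use mutex!"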


Re-entrant Mutex?
 pthread_mutex_t is not re-entrant by default
   Locking the same mutex twice from one thread deadlocks
   It becomes re-entrant only when PTHREAD_MUTEX_RECURSIVE is set explicitly

 The mutex in the Redfox library is not recursive
   Locking the same mutex twice throws an exception

 boost::mutex and boost::recursive_mutex


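Avoid Re-entrant Mutexes (1)
 "Hide something serious from you"
   The seemingly convenient "fuzziness" actually weakens your control over the code
   It becomes hard to tell where a critical section begins and ends from the lock()/unlock() calls

 Personal take: every accidental recursive use of a mutex signals an incomplete understanding of the code's concurrent flow
   Redfox's fail-fast behaviour helps track such mistakes down quickly

 Efficiency: every lock operation needs a synchronized check, even when the current thread is re-locking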

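Avoid Re-entrant Mutexes (2)
 Conflict with the semantics of condition variables (cond_var):
Consumer:
    lock(mutex);
    while(!some_check()) {
        cond_wait(cond, mutex);
        // Atomically releases mutex and blocks the calling thread.
        // With a recursive mutex held more than once, only one level is
        // released, so the mutex may stay locked and the producer never gets in.
    }
    unlock(mutex);

Producer:
    lock(mutex);
    cond_signal(cond);
    unlock(mutex);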
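The same consumer/producer shape in C++11, as a sketch with a hypothetical `Mailbox` class. `std::condition_variable` only accepts `std::unique_lock<std::mutex>`, so the non-recursive requirement is enforced by the type system.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

class Mailbox {
public:
    // Producer side.
    void put(int item) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            items_.push(item);
        }   // unlock before waking the consumer
        ready_.notify_one();
    }

    // Consumer side: blocks until an item is available.
    int take() {
        std::unique_lock<std::mutex> lock(mutex_);
        // wait() releases mutex_ while blocked and re-acquires it before
        // returning; the loop guards against spurious wake-ups.
        while (items_.empty()) ready_.wait(lock);
        int item = items_.front();
        items_.pop();
        return item;
    }

private:
    std::mutex mutex_;
    std::condition_variable ready_;
    std::queue<int> items_;
};
```

Waiting on other lock types requires `std::condition_variable_any`, which makes the departure from the plain-mutex model explicit.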

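About Redfox::FiberMutex
 Mutex behaviour:

   Lock acquisition fails: the current thread enters a kernel wait queue
   Another thread releases the lock: the waiting thread is woken

 FiberMutex behaviour:

   Lock acquisition fails: the current Fiber joins the scheduler's wait queue, and the current thread goes on to schedule other Fibers
   Another Fiber releases the lock: the waiting Fibers are woken

 Pro: makes full use of CPU time and keeps mutex.lock() from blocking the whole thread
 Con: adapts poorly and is too heavyweight
   For a lightweight critical section, mutex.lock() itself does not block for long (futex keeps it to a user-space spinlock without entering the kernel)
   A Fiber switch can cost more than the entire wait for the critical section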


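FiberMutex vs Mutex: Comparison Test
 Test setup:
 Create 1000 fibers on a scheduler with 2 threads
 Each fiber increments a global counter 100 times
 Print the global counter at the end to verify correctness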


Comparison Test Results
 Environment 1: home laptop, dual-core 2.0GHz CPU, Ubuntu on VirtualBox
   Mutex type     real   user   sys
   Fiber Mutex    16s    19s    6s
   Posix Mutex    4s     5s     1s

 Environment 2: company development machine, 8-core 2.1GHz CPU (QEMU Virtual CPU)
   Mutex type     real   user   sys
   Fiber Mutex    20s    14s    7s     (CPU 110%)
   Posix Mutex    1s     1s     0.4s

 Environment 2 with the scheduler switched to 4 threads:
   Mutex type     real   user   sys
   Fiber Mutex    22s    20s    10s    (CPU 130%)
   Posix Mutex    2s     1.5s   1.5s


Appendix: Switching Efficiency of the Redfox Scheduler
