Alexander Lyamin. HOWTO. High packet rate on x86-64: clearing the 14.88 Mpps bar
3. What's in fashion?
• UDP Flood and amplification.
• TCP ( SYN ( open|closed|firewalled) | ACK )
• ICMP Flood ( smurf )
L7 is out of style
5. Those damned aliens
static unsigned int tcp_timeouts[TCP_CONNTRACK_TIMEOUT_MAX] __read_mostly = {
[TCP_CONNTRACK_SYN_SENT] = 2 MINS,
[TCP_CONNTRACK_SYN_RECV] = 60 SECS,
[TCP_CONNTRACK_ESTABLISHED] = 5 DAYS,
[TCP_CONNTRACK_FIN_WAIT] = 2 MINS,
[TCP_CONNTRACK_CLOSE_WAIT] = 60 SECS,
[TCP_CONNTRACK_LAST_ACK] = 30 SECS,
[TCP_CONNTRACK_TIME_WAIT] = 2 MINS,
[TCP_CONNTRACK_CLOSE] = 10 SECS,
[TCP_CONNTRACK_SYN_SENT2] = 2 MINS,
/* RFC1122 says the R2 limit should be at least 100 seconds.
Linux uses 15 packets as limit, which corresponds
to ~13-30min depending on RTO. */
[TCP_CONNTRACK_RETRANS] = 5 MINS,
[TCP_CONNTRACK_UNACK] = 5 MINS,
};
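The table above stores its defaults as multiples of SECS, MINS and DAYS, so an ESTABLISHED entry lives for 5 days unless tuned. A minimal sketch that converts those units to seconds and prints (does not run) the sysctl command for reading the live value; the helper names to_secs and show_cmd are made up for illustration, and the sysctl key assumes a kernel with the nf_conntrack module loaded:

```shell
#!/bin/sh
# Convert "<n> <unit>", as written in the kernel table, into seconds.
to_secs() {
    case "$2" in
        SECS) echo "$1" ;;
        MINS) echo $(( $1 * 60 )) ;;
        DAYS) echo $(( $1 * 86400 )) ;;
        *)    echo "unknown unit: $2" >&2; return 1 ;;
    esac
}

# Print the sysctl command that would show the live timeout for a state.
# Assumption: the key follows the nf_conntrack_tcp_timeout_<state> pattern.
show_cmd() {
    echo "sysctl net.netfilter.nf_conntrack_tcp_timeout_$1"
}

to_secs 5 DAYS        # ESTABLISHED default
to_secs 2 MINS        # SYN_SENT default
show_cmd established
```

Under a flood, every tracked entry is held for its full timeout, which is why these defaults matter.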
6. Who else is to blame?
top - 08:16:23 up 39 min, 1 user, load average: 0.44, 0.16, 0.79
Tasks: 158 total, 2 running, 156 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 89.3%id, 0.0%wa, 0.0%hi, 10.7%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 0.0%us, 0.0%sy, 0.0%ni, 49.8%id, 0.0%wa, 0.0%hi, 50.2%si, 0.0%st
Cpu9 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.0%us, 1.0%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32921100k total, 4598792k used, 28322308k free, 15496k buffers
Swap: 0k total, 0k used, 0k free, 83252k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
39 root 20 0 0 0 0 R 100 0.0 0:27.91 [ksoftirqd/8]
1401 root 20 0 0 0 0 S 8 0.0 0:03.05 [kpktgend_8]
5346 root 20 0 0 0 0 S 2 0.0 0:00.34 [kworker/8:0]
5740 root 20 0 19356 1472 1076 R 1 0.0 0:00.12 top
9. What to do?
AFFINITY > BALANCER
%/etc/init.d/irqbalancer stop
%grep eth8 /proc/interrupts
123: 19 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 PCI-MSI-edge eth8-TxRx-0
124: 0 15 0 0 0 0 0 0 1 0 0 0 0 0 0 0 PCI-MSI-edge eth8-TxRx-1
125: 0 0 15 0 0 0 0 0 1 0 0 0 0 0 0 0 PCI-MSI-edge eth8-TxRx-2
126: 0 0 0 15 0 0 0 0 1 0 0 0 0 0 0 0 PCI-MSI-edge eth8-TxRx-3
127: 0 0 0 0 15 0 0 0 1 0 0 0 0 0 0 0 PCI-MSI-edge eth8-TxRx-4
128: 0 0 0 0 0 15 0 0 1 0 0 0 0 0 0 0 PCI-MSI-edge eth8-TxRx-5
129: 0 0 0 0 0 0 17 0 1 0 0 0 0 0 0 0 PCI-MSI-edge eth8-TxRx-6
130: 0 0 0 0 0 0 0 15 1 0 0 0 0 0 0 0 PCI-MSI-edge eth8-TxRx-7
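The /proc/interrupts listing above shows one IRQ per queue (123 to 130), each serviced by its own CPU. A minimal sketch of how such a one-to-one pinning could be generated; it only prints the commands instead of writing to /proc, and the function names cpu_mask and pin_cmds are hypothetical. It assumes the queue IRQs are numbered consecutively, as they are in the listing:

```shell
#!/bin/sh
# Hex CPU bitmask for a single CPU index, the format smp_affinity expects.
cpu_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

# Print one "echo MASK > /proc/irq/IRQ/smp_affinity" line per queue,
# mapping the first IRQ to CPU 0, the next to CPU 1, and so on.
pin_cmds() {
    first_irq=$1; nqueues=$2; cpu=0
    while [ "$cpu" -lt "$nqueues" ]; do
        echo "echo $(cpu_mask "$cpu") > /proc/irq/$(( first_irq + cpu ))/smp_affinity"
        cpu=$(( cpu + 1 ))
    done
}

pin_cmds 123 8
```

Review the printed lines before running them as root; the balancer has to stay stopped, otherwise it may rewrite the masks.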
10. Better?
top - 07:40:25 up 3 min, 1 user, load average: 4.61, 1.29, 0.44
Tasks: 164 total, 9 running, 155 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 49.8%id, 0.0%wa, 0.0%hi, 50.2%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 49.8%id, 0.0%wa, 0.0%hi, 50.2%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 49.8%id, 0.0%wa, 0.0%hi, 50.2%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni, 49.8%id, 0.0%wa, 0.0%hi, 50.2%si, 0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni, 49.8%id, 0.0%wa, 0.0%hi, 50.2%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni, 49.8%id, 0.0%wa, 0.0%hi, 50.2%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni, 49.8%id, 0.0%wa, 0.0%hi, 50.2%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni, 49.8%id, 0.0%wa, 0.0%hi, 50.2%si, 0.0%st
Cpu8 : 0.0%us, 0.0%sy, 0.0%ni, 90.2%id, 0.0%wa, 0.0%hi, 9.8%si, 0.0%st
Cpu9 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 0.0%us, 1.0%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32921100k total, 4597288k used, 28323812k free, 15340k buffers
Swap: 0k total, 0k used, 0k free, 83240k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15 root 20 0 0 0 0 R 96 0.0 0:46.06 [ksoftirqd/2]
23 root 20 0 0 0 0 R 96 0.0 0:46.04 [ksoftirqd/4]
11 root 20 0 0 0 0 R 95 0.0 0:46.04 [ksoftirqd/1]
19 root 20 0 0 0 0 R 95 0.0 0:46.03 [ksoftirqd/3]
27 root 20 0 0 0 0 R 95 0.0 0:46.02 [ksoftirqd/5]
31 root 20 0 0 0 0 R 95 0.0 0:46.08 [ksoftirqd/6]
35 root 20 0 0 0 0 R 95 0.0 0:46.04 [ksoftirqd/7]
3 root 20 0 0 0 0 R 93 0.0 0:45.23 [ksoftirqd/0]
11. Even better?
# ethtool -K eth8 ntuple on
# ethtool -U eth8 flow-type udp4 action -1
Added rule with ID 8189
# ethtool -u eth8
8 RX rings available
Total 1 rules
Filter: 8189
Rule Type: UDP over IPv4
Src IP addr: 0.0.0.0 mask: 255.255.255.255
Dest IP addr: 0.0.0.0 mask: 255.255.255.255
TOS: 0x0 mask: 0xff
Src port: 0 mask: 0xffff
Dest port: 0 mask: 0xffff
VLAN EtherType: 0x0 mask: 0xffff
VLAN: 0x0 mask: 0xffff
User-defined: 0x0 mask: 0xffffffffffffffff
Action: Drop
12. Even better!
(here: ~14.88 Mpps of UDP)
Tasks: 163 total, 1 running, 162 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 1.0%us, 0.0%sy, 0.0%ni, 99.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32921100k total, 4374344k used, 28546756k free, 7700k buffers
Swap: 0k total, 0k used, 0k free, 24036k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4348 root 20 0 19356 1476 1076 R 1 0.0 0:00.03 top
1 root 20 0 4120 688 588 S 0 0.0 0:01.22 init [3]
2 root 20 0 0 0 0 S 0 0.0 0:00.00 [kthreadd]
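The 14.88 Mpps figure is the packet rate of minimum-size frames at line rate. A small worked calculation, assuming a 10 Gbit/s link (the 82599 is a 10 GbE controller) and 64-byte frames, each of which also occupies 20 bytes of preamble, start-of-frame delimiter and inter-frame gap on the wire; the function name line_rate_pps is made up for this sketch:

```shell
#!/bin/sh
# Maximum packets per second on an Ethernet link for a given frame size.
# Per-frame wire overhead: 7 B preamble + 1 B SFD + 12 B inter-frame gap = 20 B.
line_rate_pps() {
    link_bps=$1; frame_bytes=$2
    echo $(( link_bps / ((frame_bytes + 20) * 8) ))
}

line_rate_pps 10000000000 64    # prints 14880952, i.e. ~14.88 Mpps
```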
13. Say hello to Flow Director
The flow director filters identify specific flows or
sets of flows and routes them to specific queues.
The flow director filters are programmed by
FDIRCTRL and all other FDIR registers. The 82599
shares the Rx packet buffer for the storage of
these filters.
14. What Flow Director can do
• Perfect match filters — The hardware checks a
match between the masked fields of the received
packets and the programmed filters. Masked
fields should be programmed as zeros in the filter
context. The 82599 supports up to 8 K - 2 perfect
match filters.
• Signature filters — The hardware checks a match
between a hash-based signature of the masked
fields of the received packet. The 82599 supports
up to 32 K - 2 signature filters.
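For reference, the two capacities quoted above work out as follows. This is plain arithmetic on the figures from the slide, reading "K" as 1024; the function name fdir_capacity is hypothetical:

```shell
#!/bin/sh
# Filter capacity as stated on the slide: N K minus 2 entries.
fdir_capacity() {
    echo $(( $1 * 1024 - 2 ))
}

fdir_capacity 8     # perfect match filters: 8190
fdir_capacity 32    # signature filters: 32766
```

A table of 8190 perfect filters has 8189 as its highest zero-based index, which is consistent with the rule ID that ethtool reported earlier for the drop rule.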
16. Not so perfect
(Flow Director's ugly offspring)
• They consume RX buffer memory (256/512)
• No IF-THEN logic
• Masks are GLOBAL for signature filters
• 64b is annoyingly little
• Supported by ethtool (perfect filters, buggy) and
PF_RING (signature only)
Still, thanks to Intel even for that!
17. Flex Filters
(Ugly offspring of the RSS implementation)
• 128b of the packet (FRAME!)
• 6 filters
• Briefly disabled on access (R|W)
• No publicly available userland
configurator.
18. What about TCP SYN?
• SYN without a Seq Number
• SYN without MSS
• … and other blunders where a signature can be
derived from the first 128b
19. What about a perfect TCP SYN?
It hurts to die at 400 kpps…
20. Post mortem
# ========
#
# Samples: 19K of event 'cycles'
# Event count (approx.): 12923232073
#
# Overhead Command Shared Object Symbol
# ........ ........... ................. .....................................
#
78.74% ksoftirqd/0 [kernel.kallsyms] [k] _raw_spin_lock
|
--- _raw_spin_lock
|
|--98.84%-- tcp_v4_rcv
| ip_local_deliver_finish
| ip_local_deliver
| ip_rcv_finish
| ip_rcv
| __netif_receive_skb
| netif_receive_skb
| napi_skb_finish
| napi_gro_receive
| 0xffffffffa005c134
| net_rx_action
| __do_softirq
| run_ksoftirqd
| smpboot_thread_fn
| kthread
| ret_from_fork
21. net/ipv4/tcp_ipv4.c
process:
if (sk->sk_state == TCP_TIME_WAIT)
goto do_time_wait;
if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
goto discard_and_relse;
}
if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
goto discard_and_relse;
nf_reset(skb);
if (sk_filter(sk, skb))
goto discard_and_relse;
skb->dev = NULL;
bh_lock_sock_nested(sk);
ret = 0;
if (!sock_owned_by_user(sk)) {
[...]
}
bh_unlock_sock(sk);
sock_put(sk);
return ret;
23. Let's hack together an Innovative Kludge!
*rawpost
:POSTROUTING ACCEPT [15:1548]
-A POSTROUTING -s 10.1.0.0/24 -o eth8 -j RAWSNAT --to-source 10.10.40.3/32
COMMIT
# Completed on Mon May 20 04:47:30 2013
# Generated by iptables-save v1.4.16.3 on Mon May 20 04:47:30 2013
*raw
:PREROUTING ACCEPT [28:2128]
:OUTPUT ACCEPT [18:2056]
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 0 -j RAWDNAT --to-destination 10.1.0.1/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 1 -j RAWDNAT --to-destination 10.1.0.2/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 2 -j RAWDNAT --to-destination 10.1.0.3/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 3 -j RAWDNAT --to-destination 10.1.0.4/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 4 -j RAWDNAT --to-destination 10.1.0.5/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 5 -j RAWDNAT --to-destination 10.1.0.6/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 6 -j RAWDNAT --to-destination 10.1.0.7/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 7 -j RAWDNAT --to-destination 10.1.0.8/32
COMMIT
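The raw table above repeats one RAWDNAT rule per CPU, rewriting the public address to a distinct internal address depending on which CPU received the packet, apparently so that each CPU ends up at a different socket rather than contending for one socket lock. A minimal sketch that generates the same eight lines; the function name gen_rawdnat is made up, and the output is meant to be pasted into an iptables-restore file, not executed directly:

```shell
#!/bin/sh
# Emit one "-m cpu" RAWDNAT rule per CPU: CPU n is mapped to <prefix>.<n+1>.
gen_rawdnat() {
    vip=$1; net_prefix=$2; ncpus=$3; cpu=0
    while [ "$cpu" -lt "$ncpus" ]; do
        echo "-A PREROUTING -d $vip/32 -m cpu --cpu $cpu -j RAWDNAT --to-destination $net_prefix.$(( cpu + 1 ))/32"
        cpu=$(( cpu + 1 ))
    done
}

gen_rawdnat 10.10.40.3 10.1.0 8
```

This only makes sense together with the IRQ pinning shown earlier: the cpu match keys on the CPU that processes the packet, so a queue-to-CPU mapping that drifts would also move flows between destination addresses.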
26. For dessert
What happens if you send a packet to a port
nobody is listening on?
And what if you send lots and lots of packets?
28. net/ipv4/ip_output.c
bh_lock_sock(sk);
inet->tos = arg->tos;
sk->sk_priority = skb->priority;
sk->sk_protocol = ip_hdr(skb)->protocol;
sk->sk_bound_dev_if = arg->bound_dev_if;
ip_append_data(sk, &fl4, ip_reply_glue_bits, arg->iov->iov_base, len, 0,
&ipc, &rt, MSG_DONTWAIT);
if ((skb = skb_peek(&sk->sk_write_queue)) != NULL) {
}
if (arg->csumoffset >= 0)
*((__sum16 *)skb_transport_header(skb) +
arg->csumoffset) = csum_fold(csum_add(skb->csum,
arg->csum));
skb->ip_summed = CHECKSUM_NONE;
ip_push_pending_frames(sk, &fl4);
bh_unlock_sock(sk);
29. Thanks, Eric!
commit be9f4a44e7d41cee50ddb5f038fc2391cbbb4046
Author: Eric Dumazet <edumazet@google.com>
Date: Thu Jul 19 07:34:03 2012 +0000
ipv4: tcp: remove per net tcp_sock
tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
per network namespace.
This leads to bad behavior on multiqueue NICS, because many cpus
contend for the socket lock and once socket lock is acquired, extra
false sharing on various socket fields slow down the operations.
To better resist to attacks, we use a percpu socket. Each cpu can
run without contention, using appropriate memory (local node)