
Improving GStreamer performance on large pipelines: from profiling to optimization

When using GStreamer to create media middleware and media infrastructures, performance becomes critical for achieving the appropriate scalability without degrading end-user QoE. However, GStreamer does not provide off-the-shelf tools for that purpose.

In this talk, we present the efforts carried out over the last year to improve the performance of the Kurento Media Server. We start from our main principle: “you cannot improve what you cannot measure”. Building on it, we introduce different techniques for benchmarking large GStreamer pipelines, including callgrind, time profiling, gst-meta profiling, and chain profiling. We present results for different pipeline configurations and topologies. After that, we introduce some evolutions of GStreamer which could help optimize performance, such as the pervasive use of buffer lists, the introduction of thread pools, and the appropriate management of queues.

To conclude, we present some preliminary work carried out in the GStreamer community to implement such optimizations, and we discuss their advantages and drawbacks.


Slide transcript

  1. Improving GStreamer performance on large pipelines: from profiling to optimization. GStreamer Conference 2015, 8-9 October 2015, Dublin, Ireland. Miguel París, mparisdiaz@gmail.com
  2. Who I am
     ● Miguel París
     ● Software Engineer
     ● Telematic Systems Master's
     ● Researcher at Universidad Rey Juan Carlos (Madrid, Spain)
     ● Kurento real-time manager
     ● mparisdiaz@gmail.com
     ● Twitter: @mparisdiaz
  3. Overview
     ● GStreamer is quite good for developing multimedia apps, tools, etc. in an easy way, but it could be more efficient.
     ● The first step is measuring/profiling. Main principle: “you cannot improve what you cannot measure”.
       – Detecting bottlenecks
       – Measuring the gain of possible solutions
       – Comparing different solutions
     ● In large pipelines, a “small” performance improvement can make a “big” difference. The same holds when many pipelines run on the same machine.
  4. Profiling levels
     ● Different levels of detail: the more detailed, the more overhead (typically).
     ● High level
       – Number of threads: ps -o nlwp <pid>
       – CPU: top, perf stat -p <pid>
     ● Medium level
       – time-profiling: how much time is spent in each GstElement (using GstTracer)
         ● An easy way to determine which elements are the bottlenecks.
         ● Hooks: do_push_buffer_(pre|post), do_push_buffer_list_(pre|post)
         ● Reducing the overhead as much as possible:
           – Avoid memory alloc/free: all timestamps are stored in statically pre-allocated memory.
           – Avoid logging on the hot path: all entries are logged at the end of the execution.
           – Post-processing: logs are written in CSV format so they can be processed by an R script.
       – latency-profiling: latency added by each Kurento element (using GstMeta)
     ● Low level: which functions consume the most CPU (using callgrind)
  5. Applying solutions
     ● Top-down, function by function. Repeat this process:
       1) Remove unnecessary code
       2) Reduce calls
          a) Is the call needed more than once?
          b) Reuse results (trading memory for CPU)
       3) Go into lower-level functions
     ● GstElements
       1) Remove unnecessary elements
       2) Reduce/reuse elements
  6. Study case I
     ● The one2many case
     ● What do we want to improve?
       – Increase the number of senders in a machine.
       – Reduce the resources consumed with a fixed number of viewers.
  7. Study case II: the pipeline
     [Full state dump of the one2many WebRTC pipeline, listing every element and pad: two KmsWebrtcEndpoint bins with their KmsWebrtcSession, GstDtlsSrtpEnc/GstDtlsSrtpDec (GstSrtpEnc, GstSrtpDec, GstDtlsEnc, GstDtlsDec, GstDtlsSrtpDemux), GstNiceSink/GstNiceSrc, GstRtpBin with GstRtpSession, GstRtpSsrcDemux, GstRtpPtDemux and GstRtpJitterBuffer instances, GstRtpVP8Pay/Depay, GstRtpOPUSPay/Depay, GstRTPRtxQueue, GstFunnel, KmsAgnosticBin2, KmsParseTreeBin, tees, queues and fakesinks. Legend: element states [~] void-pending, [0] null, [-] ready, [=] paused, [>] playing; pad activation [-] none, [>] push, [<] pull; pad flags [b]locked, [f]lushing, [b]locking (upper-case is set); pad task [T] started, [t] paused. Raw dump omitted here for readability.]
  8. Study case III
     ● Analyzing the sender part of the pipeline, we detected that:
       – funnel is quite inefficient: https://bugzilla.gnome.org/show_bug.cgi?id=749315
       – srtpenc does unnecessary work: https://bugzilla.gnome.org/show_bug.cgi?id=752774
  9. funnel: time-profiling (nanoseconds)
         pad                        mean (accumulative)  e_mean      e_min
     1   dtlssrtpenc1:src           163034.5             163034.478  49478
     2   funnel:src                 170207.5             7173        2029
     3   srtp-encoder:rtp_src_1     317373.9             147166.435  57318
     4   :proxypad40                716469.7             399095.739  105379
     5   rtpbin1:send_rtp_src_1     781019               64371.783   1832
     6   rtpsession3:send_rtp_src   784436               3417        859
     7   :proxypad35                802532               18096       5632
     8   rtprtxqueue3:src           806016.1             3484.174    1245
     9   rtpvp8pay1:src             834627.3             28611.217   8957
     10  :proxypad46                905171.5             69938.136   21206
     11  kmswebrtcep0:video_src_1   912607               7435.455    2126
     12  kmsagnosticbin2-1:src_0    918833.2             6226.227    2283
     13  queue3:src                 925268.2             6434.955    2486
  10. funnel: callgrind profiling
     ● IDEA: look at the chain functions to see the accumulated CPU usage of the downstream flow.
     ● CPU percentages (downstream, ordered by Incl. in kcachegrind):
       100   - gst_rtp_base_payload_chain
       93.99 - gst_rtp_rtx_queue_chain + gst_rtp_rtx_queue_chain_list
       90.90 - gst_rtp_session_chain_send_rtp_common
       80.13 - gst_srtp_enc_chain + gst_srtp_enc_chain_list
       53.35 - srtp_protect
       19.51 - gst_funnel_sink_chain_object
       9.82  - gst_pad_sticky_events_foreach
       8.79  - gst_base_sink_chain_main
  11. funnel: callgrind graph
  12. funnel: solution
     ● Applying solution type 2.a): send sticky events only once.
     ● Add a property to the funnel element (“forward-sticky-events”): if set to FALSE, sticky events are not forwarded on sink pad changes.
     ● Results: time before: 147166 ns; time after: 5829 ns; CPU improvement: ~100%.
  13. srtpenc
     ● Applying solution type 1): srtpenc: remove unnecessary rtp/rtcp checks.
       – https://bugzilla.gnome.org/show_bug.cgi?id=752774
       – CPU improvement: 2.89 / (100 – 58.90) = 7%
  14. Other examples
     ● g_socket_receive_message: most of its CPU usage is wasted in error management.
       – https://bugzilla.gnome.org/show_bug.cgi?id=752769
  15. latency-profiling
     ● Mark buffers with timestamps using GstMeta.
     ● This adds considerable overhead, so:
       – Use sampling (do not profile every buffer).
       – A GstMeta pool?
     ● DEMO (WebRtcEp + FaceOverlay)
       – Real-time profiling
       – WebRTC, decoding, video processing, encoding...
  16. General remarks (BufferLists)
     ● Use BufferLists whenever you can.
       – Pushing buffers through pads is not free; this really matters in large pipelines.
       – Pushing a BufferList through pads costs about the same CPU as pushing a single buffer.
       – Pushing a BufferList through some elements (e.g. tee, queue) costs about the same CPU as pushing a single buffer.
       – Kurento has funded and participated in the BufferList support of many elements.
     ● Open discussion: queue: add a property to allow pushing all queued buffers together.
       – https://bugzilla.gnome.org/show_bug.cgi?id=746524
  17. General remarks (BufferPool)
     ● Extend the usage of BufferPool.
       – A significant percentage of CPU is spent allocating/freeing buffers.
       – Nowadays memory is much cheaper than CPU; let's take advantage of this.
     ● Example
       – Buffers of different sizes, but always smaller than 1500 bytes, are allocated.
       – Configure a BufferPool to generate buffers of 1500 bytes and reuse them in a BaseSrc, Queue, RtpPayloader, etc.
  18. General remarks (Threading)
     ● GStreamer could improve a lot in its threading:
       – Each GstTask has its own thread, which is idle most of the time.
       – Many threads → too many context switches → wasted CPU.
     ● The Kurento team proposes using thread pools and avoiding blocked threads.
       – Kurento has funded the development of the first implementation of TaskPool (thanks Sebastian ;) ):
         http://cgit.freedesktop.org/~slomo/gstreamer/log/?h=task-pool
       – It is not finished; let's try to push it forward.
     ● An ambitious architecture change:
       – Sync vs async
       – Move to a reactive architecture
  19. Conclusion / future work
     ● Take performance into account: it can be as important as a feature working properly.
       – Processing-time restrictions
       – Embedded devices
     ● Automatic profiling
       – Reduce manual work
       – Continuous integration: pass criteria to accept a commit
       – Warnings
  20. Thank you
     Miguel París, mparisdiaz@gmail.com
     http://www.kurento.org | http://www.github.com/kurento | info@kurento.org | Twitter: @kurentoms
     http://www.nubomedia.eu | http://www.fi-ware.org | http://ec.europa.eu