An examination of video processing applications and their memory bandwidth challenges, followed by a discussion of the technical details of the multiport front end memory controller and how users can use it to improve the efficiency of their external memory system.
Remove the External Memory Bottleneck in Your Video Design
1. Online Demo: Accelerate Your Video Format Conversion Using the 1080p Video Framework 2009 Video Framework Online Series Part 1: Upgrade Your Broadcast System to PCIe Gen2 Part 2: Remove the External Memory Bottleneck in Your Video Design
4. Current Broadcast Video Solution (UDX3.0) Memory Subsystem
Two-channel format conversion, OSD, Nios® II processor
DDR3 SDRAM peak bandwidth: 400 MHz * 64 bits * 2 = 51.2 Gbps
Mem ctrl, 256-bit wide, serving Master1, Master2, Master3 ... Master18
UDX3.0 memory bandwidth: 42.35 Gbps
UDX3.0 memory access efficiency is > 42.35/51.2 = 82.7%
System memory breakdown:
MA deinterlacer1: 10.83 Gbps
MA deinterlacer2: 10.83 Gbps
Frame buffer1: 7.64 Gbps
Frame buffer2: 7.64 Gbps
OSD: 4.9 Gbps
Nios II processor: 0.5 Gbps
Total: 42.35 Gbps
6. Video Framework With Multi-Port Front End [Block diagram. Labels: input and output interfaces (SDI, DVI/HDMI, BT656, DisplayPort, audio CODEC, Ethernet, PCIe); format conversion with deinterlacer (VIP) and scaler (VIP); audio sample rate converter; audio/video delay; frame sync; TS multiplexer; video over IP; MPEG2, H.264, JPEG2000; custom video functions; DDR2/3 memory control; DMA control. Legend: Altera IP, custom IP, third-party IP.]
7. Video Framework: SOPC/Avalon-ST Video Avalon ® -ST Video SOPC-ready function SOPC Builder ready SDI/HDMI/DP Clocked video input AV-ST-V vip1 AV-ST-V vip2 AV-ST-V Clocked video output SDI/HDMI/DP
9. Devices' Memory Bandwidth for Video Processing
Device | DDR3 | DDR2 | DDR | Max. memory bandwidth per side*
Arria® II GX FPGA | 600 Mbps (300 MHz) | 600 Mbps (300 MHz) | 400 Mbps (200 MHz) | 57 Gbps
Stratix® IV GX FPGA | 1,067 Mbps (533 MHz) | 800 Mbps (400 MHz) | 400 Mbps (200 MHz) | 153 Gbps
HardCopy® IV GX ASIC | 1,067 Mbps (533 MHz) | 667 Mbps (333 MHz) | 400 Mbps (200 MHz) | 153 Gbps
* Maximum memory bandwidth per side using the DDR3 interface on the largest FPGA in the family
** Altera FPGAs also support RLDRAM and QDR II
10. Example: Format Conversion Using Video Framework and FPGAs [Block diagram of two identical channels, SD/HD/3G-SDI in and SD/HD/3G-SDI out. Labels per channel: SDI, CVI (clocked video in), CLIP, CRS 4:2:2 to 4:4:4, motion adaptive deinterlacer 4:2:2, SCL polyphase scaler (6x6 taps, 4:4:4 mode), frame buffer with frame rate conversion, CRS 4:4:4 to 4:2:2, interlacer 4:2:2, CVO (clocked video out), SDI. Shared: DDR2 HP memory, Avalon-ST video connections, Nios II processor for run-time configuration. Resolutions: 480i to 1080p.]
26. Thank You For more information, visit www.altera.com
Editor's notes
Hello and welcome to this webcast on how to remove the external memory bottleneck in your video design, brought to you by Altera Corporation. My name is Girish Malipeddi, and I am a technical marketing manager at Altera focused on broadcast and video applications. Today I will be discussing the second part of the video framework series, Remove the External Memory Bottleneck in Your Video Design. Please refer to the earlier webcast, Upgrade Your Broadcast System to PCIe Gen2, on Altera.com for the first part of this series.
First, we will look at video processing applications and their memory bandwidth challenges. Next, we will discuss Altera's video framework, which simplifies hardware design for video systems. Then, Pete Brookes, Senior Engineering Manager of the Multimedia System Solutions Group, will discuss the technical details of the multiport front end memory controller and how users can use it to improve the efficiency of their external memory system. Finally, we will look at the next steps you can take to find out more and evaluate this solution.
Altera's video framework targets a variety of broadcast applications, including video servers, switchers, video input and output cards, video effects cards, multiviewers, and broadcast monitoring applications. All these applications have a common requirement: format conversion handling data at up to full HD data rates. Multichannel applications such as switchers, multiviewers, and video effects cards also need to implement multiple channels of video processing, handling full HD and channel sync. All these video functions are memory hungry because they need to cache multiple frames of video for manipulation. At the same time, the data per frame increases significantly as you go from SD to HD to full HD, as you can see in the table here. Hence the need for an efficient memory arbiter and controller to get the most out of the external memory available in the system. Having an efficient memory controller with high-performance interfaces can reduce the external memory requirements and costs, and hence the system costs.
Here we show the memory bandwidth requirement for two-channel video format conversion. The estimates on this slide are based on Altera's two-channel format conversion reference design, which includes OSD and an on-chip processor and does basic frame rate conversion. One version of the design has 18 masters accessing the external memory and uses Altera's high-performance DDR3 memory controller. The design is implemented on a Stratix IV GX FPGA development kit. The memory interface is 64-bit wide DDR3 running at 400 MHz, for a maximum bandwidth of 51.2 Gbps. The estimated memory bandwidth requirement for the design, as you see in the table here, is about 42.35 Gbps. So in effect, the memory interface has to be at least 82.7% efficient to ensure successful implementation of the video system. The newer version of this design requires even higher memory bandwidth efficiency, in excess of 90%. This is a significant challenge, which led us to build a memory arbiter in which the user can set priorities and weights on the masters in order to avoid starving critical masters. This memory controller and arbiter are targeted at broadcast systems but can be leveraged in other applications.
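The arithmetic behind these figures can be reproduced with a short Python sketch. The function and variable names are illustrative only, and the per-master numbers are simply the estimates quoted on the slide (the slide lists "Frame buffer1" twice; the second entry is assumed here to be the second channel's frame buffer).

```python
def peak_bandwidth_gbps(clock_mhz: float, bus_width_bits: int,
                        transfers_per_clock: int = 2) -> float:
    """Theoretical peak bandwidth of a double-data-rate interface, in Gbps."""
    return clock_mhz * 1e6 * bus_width_bits * transfers_per_clock / 1e9


# Per-master bandwidth estimates quoted on the slide, in Gbps.
masters_gbps = {
    "MA deinterlacer 1": 10.83,
    "MA deinterlacer 2": 10.83,
    "Frame buffer 1": 7.64,
    "Frame buffer 2": 7.64,   # assumed label, see note above
    "OSD": 4.9,
    "Nios II processor": 0.5,
}

peak = peak_bandwidth_gbps(clock_mhz=400, bus_width_bits=64)
summed = sum(masters_gbps.values())   # close to the quoted total, up to rounding
required = 42.35                      # total quoted on the slide
efficiency = required / peak          # fraction of peak the design must sustain

print(f"peak bandwidth:      {peak:.1f} Gbps")
print(f"sum of masters:      {summed:.2f} Gbps")
print(f"required efficiency: {efficiency:.1%}")
```

The same calculation can be repeated for any other clock rate or bus width to see how much headroom a memory interface leaves for a given set of masters.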
Here's an overview of Altera's multiport front end memory arbiter solution, which addresses the aforementioned challenges of building a video system. The multi-port front end for the memory controller today supports 16 masters but can easily be expanded to more, and users can assign masters as time-critical or non-critical. It also allows weighted round-robin arbitration, in which masters share the available bandwidth at user-set ratios. The Altera reference design UDX3.0 shows that you can achieve more than 90% efficiency and support various memory types such as DDR2, DDR3, and RLDRAM. Finally, the multi-port front end is available upon request as part of the UDX3.0 reference design.
This diagram shows a typical broadcast system and the solutions that exist from Altera and its third-party network. SDI is the primary I/O standard for broadcast, but other standards coexist as well. For example, many multiviewers have DVI or HDMI outputs today and may add DisplayPort in the future. Today Altera hardware supports SDI, DVI, and DisplayPort interfaces, and Altera has IP to enable support for these interfaces. More and more, studios want all their equipment to support any format; thus, format conversion is moving from an optional feature to a standard requirement on all I/O. To implement format conversion, Altera offers a comprehensive set of image processing functions such as scaler, deinterlacer, color space conversion, and others. More details on Altera's Video and Image Processing Suite of functions can be found on Altera.com. Video codecs are implemented in FPGAs in video servers, contribution encoders, distribution encoders, and IRDs. Altera has a network of companies offering solutions in that space. A number of solutions for audio exist, including audio de-embed and sample rate conversion. Altera has been a leader in enabling video over IP in studio and headend applications, introducing its first video over IP reference design in 2006. As we discussed in earlier slides, all video systems require a high-performance multi-master memory controller. This webcast primarily focuses on the high-performance multi-port front end memory controller.
Altera has various video and image processing functions, as seen here in the green box. They include the scaler, deinterlacer, and other functions required to implement format conversion. Altera also introduced an open standard called Avalon Streaming Video. It defines connectivity and a lightweight protocol to enable interoperability between functions developed by different design teams. Avalon Streaming Video is the basis of a video framework that speeds video design. Altera also offers two key I/O blocks, namely CVI and CVO, as seen at the bottom of the green box. These functions make it easy to get any external video interface such as SDI, HDMI, or DisplayPort into the Avalon Streaming Video format. There are also a number of third-party video processing cores that support Avalon Streaming Video. Altera has a tool, SOPC Builder, that recognizes this standard and enables integration of these blocks through a graphical interface. This tool creates an interconnect that is correct by construction, which simplifies FPGA design for video systems.
Using off-the-shelf ASSPs to implement format conversion or multi-channel video processing has many limitations. They typically don't scale well: more channels lead to more components on the board, and very soon the board can get ugly. Also, system designers are stuck with very limited memory type support, such as DDR1 or DDR2, and not necessarily the latest or the most cost-effective memory solutions. Lastly, as the memory cannot be shared between channels, the result can be a very inefficient and costly memory subsystem. On the other hand, using the video framework and FPGAs, you can easily integrate a lot of external components, such as the SDI serializer and deserializer, into the FPGA. Multiple channels of format conversion can be implemented in one FPGA with a single external memory subsystem that is shared across the channels.
There are several different choices of FPGAs for video processing applications. As an example, Altera's 40-nm Stratix IV GX and Arria II GX devices support various memory interface standards such as DDR3, DDR2, DDR, RLDRAM, and QDR II. You can see that the performance of the external interfaces is the best in the industry. For example, 533 MHz DDR3 external memory performance is the highest in the FPGA industry. The Stratix IV GX family can effectively support external memory bandwidth in excess of 150 Gbps per side. Altera's 40-nm Arria II GX family delivers cost- and power-optimized silicon for multi-channel video processing, featuring up to 16 3.75 Gbps transceivers and 300 MHz external memory performance. The Arria II GX device can effectively support external memory bandwidth in excess of 50 Gbps. The high-performance external memory interfaces make it easy to implement memory-hungry video processing applications such as format conversion.
Format conversion is a design that Altera developed which leverages this video framework plus internally developed video processing blocks to implement up/down/cross conversion. This design features two channels with support from 480i up to 1080p with full motion adaptive deinterlacing. A RISC processor embedded in the FPGA, called Nios, allows run-time configuration of all of the functions. The design integrates SDI functionality, eliminating the need for SDI receivers and transmitters. This design also offers an efficient memory arbiter that enables memory sharing and scalability to multiple channels, eliminating the need for separate memories for separate channels. For example, using the MPFE and the video framework, you can implement this design in an Arria II GX device, making it very cost effective versus other solutions. Also, using the video framework you can easily scale this solution to multiple channels in a larger FPGA without increasing the board complexity linearly. Next, Pete Brookes will provide more details on Altera's multi-port front end memory controller.
Thank you Girish. Hello, my name is Peter Brookes, and I’m going to describe how to design video processing systems which efficiently share access to external memory.
As Girish mentioned, a typical multi-channel SD/HD or 3G video format conversion system will contain multiple bus masters sharing access to a single external memory, such as DDR2 or DDR3. This is often because of the temporal nature of the video processing functions or to synchronize multiple video channels. By using system level design tools and off the shelf parameterizable IP you can easily and efficiently utilise available external memory bandwidth and focus an increased amount of engineering development time on *your* differentiating technology. I will be describing two approaches to designing a video memory system, showing how each solution works, their best use cases and their limitations. The first approach uses the system level tool, SOPC Builder, to automatically generate an application specific bus interconnect. The second approach shows how the SOPC Builder bus interconnect can easily be replaced by a Multi-Port Front End component to improve performance and support more complex memory access patterns.
Let's start with the first approach, where the interconnect is generated by SOPC Builder. To demonstrate the solution we've developed a reference design, called V2, which performs high-quality up, down, and cross conversion of two SD, HD, and 3G video streams in interlaced or progressive format. The V2 design uses the Altera Video and Image Processing Suite of IP functions as well as triple-rate SDI and the High Performance DDR2 Memory Controller. The design provides an easy starting point for further customisation. For example, you could replace one of Altera's video processing functions with your own differentiating IP and re-use the memory infrastructure and video interfaces. In V2 there are 14 bus masters which read and write video data from and to a single 64-bit wide DDR2 external memory. The external memory is clocked at 266 MHz, providing a theoretical bandwidth of 34 Gbps. In this design, deinterlacing two 1080i60 video sources requires a bandwidth of 26 Gbps, which equates to a memory access efficiency requirement of 77.4%. Each master also has a memory access latency requirement which must be satisfied to ensure video is always streamed at the required rate.
In V2, the MPFE is constructed from the switch fabric generated by SOPC Builder, which connects the video store bus masters to the memory controller and bridge components in the system. The generated interconnect includes the address decode logic and the logic to arbitrate between multiple masters requesting access to a single external memory component. The slave-side arbitration scheme is round robin, so when multiple masters contend for access to the memory slave, the arbiter grants shares in round-robin order. The pipeline bridge components are inserted into the design for two reasons. First, each bridge inserts registers in the path between its master and slave, which can help reduce register-to-register delay and improve system FMax. Second, the pipeline bridges are used to control the topology of the generated interconnect and the arbitration. Specifically, they provide the control to group masters that access a particular memory bank. This allows the arbitration algorithm to visit each pipeline bridge in turn, servicing a different bank each time. The DDR2 memory controller provides a simplified interface to industry-standard DDR2 memory and instantiates the external memory PHY interface. The goal of the MPFE is really to reduce the impact of memory data management commands, such as activate and pre-charge cycles. We can achieve this in a number of ways.
In this design we configure the bus masters associated with the frame buffer and deinterlacer to perform large burst transfers of length 64 with the initial transfer aligned to the start of a memory row. By making the burst size a factor of the row size you ensure that a burst doesn't cross a row boundary, which minimizes the penalty for switching rows. This requires large on-chip buffers (inside the deinterlacer and frame buffer) to create and receive the burst. By making the on-chip buffers twice the burst size, the video function can process a burst whilst transferring another to/from memory. This allows the video processing path to cope with the longer latency caused by waiting for access to the memory controller. Also, using a separate clock domain for the memory system, with sufficient on-chip buffering, allows the memory to be run at a higher rate without being limited by the speed of the datapath; this allows more masters to be handled efficiently.
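The row-boundary argument can be checked with a small Python sketch. The row size of 1,024 words is a made-up value for illustration; the real row size depends on the memory device and controller configuration.

```python
def crosses_row(start_word: int, burst_len: int, row_words: int) -> bool:
    """True if a burst of burst_len words starting at start_word spans two rows."""
    first_row = start_word // row_words
    last_row = (start_word + burst_len - 1) // row_words
    return first_row != last_row


ROW_WORDS = 1024   # hypothetical row size, in words
BURST_LEN = 64     # burst length used by the frame buffer and deinterlacer

# Bursts that start on a multiple of the burst length never straddle a row,
# because the burst length divides the row size exactly.
aligned_crossings = [
    crosses_row(start, BURST_LEN, ROW_WORDS)
    for start in range(0, 4 * ROW_WORDS, BURST_LEN)
]

# An unaligned burst near the end of a row does straddle the boundary.
unaligned_crossing = crosses_row(1000, BURST_LEN, ROW_WORDS)

# Double buffering: one burst is processed while the other is in flight.
on_chip_buffer_words = 2 * BURST_LEN

print(any(aligned_crossings), unaligned_crossing, on_chip_buffer_words)
```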
We can further minimize the penalty for switching rows in a bank, by overlapping the bank management commands of one bank with the data transfer to/from another bank. We achieve this by setting the base address, at compile time, of each frame buffer and deinterlacer to a different memory bank, and by grouping the masters that are on the same bank together using a pipeline bridge. The round robin arbitration scheme of SOPC Builder will then switch between the pipeline bridges enabling bank interleaving. The Altera DDR2 High Performance memory controller user guide describes which bits, set in a local address, map to a particular memory bank. Efficient DDR High Performance Memory Controllers improve this further by providing predictive bank management, which allows the bank management commands to be issued earlier.
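As a rough illustration of the base-address assignment, the Python sketch below places each frame store in its own bank and confirms that a round-robin visit order never hits the same bank twice in a row. The address layout (bank bits above the row and column bits) and the bit widths are assumptions made for this example; the actual mapping is the one given in the memory controller user guide.

```python
from itertools import cycle, islice

# Hypothetical local address layout: [ bank | row | column ].
COL_BITS = 10
ROW_BITS = 13
BANK_BITS = 3


def base_address(bank: int) -> int:
    """Lowest local address that falls in the given bank."""
    return bank << (ROW_BITS + COL_BITS)


def bank_of(addr: int) -> int:
    """Bank selected by a local address under the assumed layout."""
    return (addr >> (ROW_BITS + COL_BITS)) & ((1 << BANK_BITS) - 1)


# One group of masters (one pipeline bridge) per bank; names are illustrative.
bank_assignment = {
    "deinterlacer1": 0,
    "frame_buffer1": 1,
    "deinterlacer2": 2,
    "frame_buffer2": 3,
}
bases = {name: base_address(bank) for name, bank in bank_assignment.items()}

# Round-robin arbitration visits the groups in turn.
visit_order = list(islice(cycle(bases), 8))
banks_visited = [bank_of(bases[name]) for name in visit_order]

# Consecutive bursts land in different banks, so one bank's activate and
# pre-charge commands can overlap with another bank's data transfer.
interleaved = all(a != b for a, b in zip(banks_visited, banks_visited[1:]))
print(banks_visited, interleaved)
```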
This solution works well for video processing systems when the memory bandwidth requirements for each master or group of masters are equal. This is because of the round robin arbitration logic in the system interconnect. However, when the bandwidth requirements of the masters or groups of masters are not balanced, or when frame store functions share memory banks, it becomes difficult to satisfy both the bandwidth and latency requirements of all the masters in the system. The solution is also well suited to systems where the read and write data transfers can be performed in large bursts such as streaming video data. This system does not provide efficient memory access for bus masters with random memory access patterns such as a processor performing a cache line fill or a scatter gather DMA controller copying graphic sprites around in memory. Typically, for video systems with 14 or more masters, the auto-generated switch fabric can achieve an FMax of up to 160MHz on a Stratix III or Stratix IV device. For the V2 reference design, reducing the external memory clock frequency shows the method can achieve up to 88.5% memory access efficiency.
A second system design method to efficiently share memory bandwidth is to use a dedicated Multi-Port Front End IP component. The MPFE is optimized for high data rate video processing applications, as in the previous approach, but also includes support for small, random address accesses. The component has been designed to specifically target video processing applications and has been proven using a representative reference design, UDX3. The MPFE includes a multi-class weighted arbitration scheme that allows you to control the traffic flow to and from the external memory interface. This enables efficient video systems to be designed when bus masters have *different* bandwidth requirements. This also simplifies the design methodology by often removing the need for pipeline bridge components to control the topology of the system. Also, by defining which ports are critical, you can ensure that the time-critical masters in your system, such as video functions, have priority over other less time-sensitive blocks. Another important consideration is the maximum clock rate that can be achieved by the system interconnect, because the system critical path can often be in the switch fabric. The MPFE component supports clock rates over 200 MHz on a Stratix IV GX C3 speed grade device, which allows you to connect it directly to, say, a DDR3 memory controller in half-rate mode with a memory clock frequency of more than 400 MHz.
The MPFE component replaces the auto-generated SOPC Builder interconnect logic, including the round robin arbitration scheme. In SOPC Builder the component is selectable from the Component Library, and can be parameterised to suit the requirements of the application. For example, the component can be configured with the number and width of the slave ports, and the maximum supported burst size. Because the component is available in the SOPC Builder environment, it allows easy connection to the Altera Video and Image Processing Suite, DMA controllers and the Altera High Performance Memory Controller. However, if you want to use the MPFE component outside of the SOPC Builder environment, you can. You can parameterise, instantiate, and connect the component directly in your HDL.
The MPFE can be configured with up to 16 Avalon Memory Mapped read or write ports. Each slave port can then be connected to the system bus masters. The slave port data width is configurable at compile time, up to 512 bits. When developing complex video systems it is very important that the user can monitor the system performance; this is particularly true when considering multi-master access to external memory. The MPFE provides visibility of the behaviour of the arbitration at run time by exposing useful count and wait data. Examples include the number of times a slave has been granted, the number of words of read data a slave has received, and the worst-case number of cycles a slave has had to wait between requesting and being granted. This data allows the user to tune the performance of their system. The MPFE, which connects directly to the Altera High Performance Memory Controller, is available as clear-text RTL as part of the UDX3 reference design.
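To show how such counters might be used for tuning, here is a minimal Python model. The class, field names, trace values, and latency budget are all hypothetical; the talk does not describe the actual register map or how the data is read back.

```python
from dataclasses import dataclass


@dataclass
class PortStats:
    """Hypothetical per-port counters, modeled on the three examples above."""
    grants: int = 0        # number of times the port has been granted
    read_words: int = 0    # words of read data the port has received
    worst_wait: int = 0    # worst-case cycles between request and grant

    def record(self, wait_cycles: int, read_words: int = 0) -> None:
        self.grants += 1
        self.read_words += read_words
        self.worst_wait = max(self.worst_wait, wait_cycles)


# Made-up trace of (port, cycles waited, words read) for illustration.
trace = [
    ("deinterlacer", 12, 64),
    ("osd_dma", 40, 8),
    ("deinterlacer", 30, 64),
    ("osd_dma", 95, 8),
]

stats = {}
for port, wait, words in trace:
    stats.setdefault(port, PortStats()).record(wait, words)

# Flag ports whose worst-case wait exceeds a (hypothetical) latency budget.
LATENCY_BUDGET = 64
over_budget = [name for name, s in stats.items() if s.worst_wait > LATENCY_BUDGET]
print(stats, over_budget)
```

A port that shows up in `over_budget` is a candidate for a larger weight, a critical classification, or more on-chip buffering.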
Each of the slave ports can be configured to be either time critical or non critical. By defining which ports are critical, you can ensure that the time-critical masters in your system, such as video blocks, have priority over other less time sensitive blocks. It also allows these non-critical masters to use any available bandwidth in times when the critical masters are not requesting access. The video processing function masters will typically be defined as critical bus masters. The processor instruction and data masters as well as the scatter gather DMA masters for On Screen Display are considered non-critical.
Whenever there are no time-critical requests outstanding, the arbiter will accept the next pending transaction from the non-critical ring. Once that has been serviced, the arbiter will check to see if there are any new requests on the time-critical ring and will return to service that request. If more than one request is present on the time-critical ring, the arbiter will continue to service them. Once all the pending time-critical requests have been serviced, the arbiter will switch back to servicing the non-critical ring. The MPFE uses an enhancement of a weighted round-robin scheme to share the available bandwidth between the slave ports at the ratios set by the user. The Bandwidth Settings tab, in the parameterization GUI, allows you to control the ratio of bandwidth that each port is given. By setting weights for each port, you can restrict how often a port is allowed access to the external memory interface. If you assign larger numbers to a port, it will be allowed a larger proportion of the external memory bandwidth, while smaller numbers will allow a port less bandwidth. The arbiter will cycle around the slave ports, granting each the ability to issue a read or write burst to the memory controller. The arbiter will continue to go around the ring, servicing slave ports that have not exceeded their bandwidth allowance. The MPFE uses a sliding window bandwidth allocation system to distribute the accesses more evenly to reduce the worst case latency. This in turn means less buffering in the user’s design.
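The two-ring behaviour described above can be approximated in a few lines of Python. This is a simplified sketch, not the MPFE implementation: it uses plain per-round credits rather than the sliding-window allocation mentioned in the talk, and the port names, weights, and request counts are invented for the example.

```python
class Port:
    def __init__(self, name, weight, pending):
        self.name = name
        self.weight = weight      # bursts allowed per round (user-set ratio)
        self.credit = weight      # bursts still allowed in the current round
        self.pending = pending    # queued burst requests


class Ring:
    """Weighted round robin over one class of ports."""

    def __init__(self, ports):
        self.ports = ports
        self.start = 0            # index where the next search begins

    def has_request(self):
        return any(p.pending for p in self.ports)

    def _find(self):
        n = len(self.ports)
        for i in range(n):
            idx = (self.start + i) % n
            port = self.ports[idx]
            if port.pending and port.credit:
                self.start = (idx + 1) % n
                return port
        return None

    def grant(self):
        port = self._find()
        if port is None:
            # Every requester has used its allowance: start a new round.
            for p in self.ports:
                p.credit = p.weight
            port = self._find()
        if port is not None:
            port.pending -= 1
            port.credit -= 1
        return port


def arbitrate(critical, non_critical):
    """Serve the time-critical ring first; fall back to the non-critical ring."""
    if critical.has_request():
        return critical.grant()
    if non_critical.has_request():
        return non_critical.grant()
    return None


critical = Ring([Port("video_a", weight=3, pending=6),
                 Port("video_b", weight=1, pending=2)])
non_critical = Ring([Port("cpu", weight=1, pending=3)])

grant_order = []
while (port := arbitrate(critical, non_critical)) is not None:
    grant_order.append(port.name)
print(grant_order)
```

In this run the two time-critical ports drain first, sharing grants 3:1 within each round, and the non-critical port is served only once no time-critical requests remain.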
The UDX3 reference design is implemented on a Stratix IV GX230 device, and demonstrates both high-quality up, down, cross conversion and on screen display. There are a total of 19 bus masters sharing one 64-bit DDR3 Memory. The Altera High Performance DDR3 memory controller is running in half rate mode with a memory clock frequency of 400 MHz. As you can see in the diagram, the bus masters connected to the MPFE perform a combination of video frame buffering, processor instruction and data bus accesses and DMA reads and writes for on-screen display. The performance capability of the MPFE is demonstrated by increasing the rate of the OSD layer written to external memory. To write the OSD layer to external memory at 60 frames per second, a memory access efficiency of 92% is required. Memory access efficiency is again defined as the ratio between the required bandwidth and the theoretical maximum memory bandwidth. The MPFE satisfies this requirement and ensures that all the critical masters satisfy their read and write latency requirements.
So, in summary, a multi-port front end to the Altera High Performance Memory Controller has been designed targeting video processing applications. This solution enables memory access efficiency of over 90% even when combining buffering video data in large bursts and performing random memory accesses. Reference designs demonstrate how to use the solutions today and provide a great starting point for further system development. The V2 design shows how to use SOPC Builder to generate an efficient interconnect for high data rate video streaming applications. The MPFE component, including parameterization GUI and HDL source code, is available as part of the UDX3 reference design. The UDX3 design shows how to use this component in a complex, multi-channel video system. I’ll now hand you back to Girish for the remainder of the presentation….
You can go to Altera's website to get access to the design for evaluation. You can also download user guides and purchase audio video development kits at the links provided here. As mentioned during the webcast, this design runs on the Stratix IV GX development kit, which can be purchased at the link provided. There are also similar designs targeting the Arria II GX audio video kit.