Load testing streaming video at scale - Part 2: Optimizing remote object storage-based VOD

July 2, 2021

In this second blog of our multi-part series on load testing video streaming at scale, we evaluate different Video On Demand (VOD) setups that rely on remote object-based cloud storage services like S3. As setups like these have become the norm for streaming services that work with (very) large content libraries, the goal of this blog is to determine which configuration performs best.

In the first part of this series we introduced the testbed used for our evaluation. This testbed is based on MPEG’s Network Based Media Processing (NBMP) specification, which aims to standardize typical cloud setups where different parts of the workflow are executed at different points in the network, otherwise known as Distributed Media Processing Workflows (DMPWs).

As explained in the first part of this series, the Media Processing Function (MPF) sits at the center of a DMPW. It is an origin function that can be responsible for performing tasks in real time, such as media packaging, manifest generation, segment (re-)encryption, et cetera.

For the purpose of the evaluation in this blog, the MPF is performed by Unified Origin, Unified Streaming’s dynamic packager that is capable of all of the aforementioned tasks and runs as a plugin for the Apache web server.

Through the various tests presented in this blog we will determine:

How to optimize Unified Origin’s configuration when using a remote storage backend
The benefits of adding a caching layer between Origin and the remote storage backend that caches the media source’s metadata
How much an optimal remote storage-based VOD configuration benefits from a bigger instance type

Configurations tested

To determine which Unified Origin setup with remote storage works best, we compared the following:

Configuration (A): Origin uses its built-in cURL library to make HTTP requests for media content to remote object storage.
Configuration (B): Origin uses configuration (A) and an additional caching layer to cache dref MP4s and the server manifests (.ism).
Configuration (C): Origin uses Apache’s native subrequests and a Proxy to retrieve media content.
Configuration (D): Origin uses configuration (C) and an additional caching layer to cache dref MP4s and the server manifests (.ism).
Configuration (E): Origin uses configuration (D), but with CacheIgnoreQueryString disabled on the caching layer and ProxyRemoteMatch enabled on Origin, so that only relevant requests (i.e., for dref MP4s and server manifests) are directed to the cache and all others go to the remote object storage directly.

Each of these five configurations was deployed on a c5a.large AWS EC2 cloud instance with the Apache MPM module set to the ‘worker’ model, with the following configuration:

# worker MPM
<IfModule mpm_worker_module>

    ServerLimit               20
    StartServers               2
    MinSpareThreads           50
    MaxSpareThreads          150
    ThreadsPerChild           50
    MaxRequestWorkers       1000
    MaxConnectionsPerChild    0

</IfModule>
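
As a quick sanity check on these values: with the worker MPM, the number of request threads Apache can run is bounded by ServerLimit × ThreadsPerChild, and MaxRequestWorkers should not exceed that product. The Python sketch below only reproduces that arithmetic for the settings shown; the helper name is ours and purely illustrative.

```python
def worker_mpm_capacity(server_limit: int, threads_per_child: int) -> int:
    """Upper bound on concurrent request threads for Apache's worker MPM."""
    return server_limit * threads_per_child


# Values copied from the worker MPM block above.
settings = {"ServerLimit": 20, "ThreadsPerChild": 50, "MaxRequestWorkers": 1000}

capacity = worker_mpm_capacity(settings["ServerLimit"], settings["ThreadsPerChild"])
fits = settings["MaxRequestWorkers"] <= capacity
print(f"thread capacity: {capacity}, MaxRequestWorkers fits: {fits}")
```

With these settings the limit of 1000 workers is reached exactly when all 20 child processes run 50 threads each.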

In addition to the MPM worker configuration, we increased the range of IPv4 ports that a network connection can use on our Linux host by running the following:

#!/bin/bash

# The configuration file is located in '/proc/sys/net/ipv4/ip_local_port_range'
sudo sysctl -w net.ipv4.ip_local_port_range="2000 65535" # from 32768 60999
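
To get a feel for what this change buys, the following Python sketch counts the local ports in the default and the widened range. It assumes the two-integer format that the kernel exposes in ip_local_port_range; the function names are illustrative. Since every outgoing connection to the same destination address and port needs its own local port, a wider range raises the ceiling on concurrent backend connections.

```python
from typing import Tuple


def parse_port_range(text: str) -> Tuple[int, int]:
    """Parse the two whitespace-separated integers of ip_local_port_range."""
    low, high = (int(value) for value in text.split())
    return low, high


def port_count(low: int, high: int) -> int:
    """Number of local ports in an inclusive range."""
    return high - low + 1


# Sample strings in the format of /proc/sys/net/ipv4/ip_local_port_range.
default_ports = port_count(*parse_port_range("32768\t60999\n"))
tuned_ports = port_count(*parse_port_range("2000\t65535\n"))
print(f"default: {default_ports} ports, tuned: {tuned_ports} ports")
```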

Testbed setup

Media source

For this VOD use case with remote object storage, we packaged the media source Sita Sings the Blues from progressive MP4 to CMAF using two-second segments for audio and video. We also created an ‘index’ dref MP4 file from each CMAF media file, with a server manifest that references the dref MP4 files (instead of the CMAF source directly). Lastly, we uploaded the dref MP4s, server manifest, and CMAF files to an AWS S3 bucket located in the Frankfurt region.

If you want to know more about the encoding specifications of the media source test content, please refer to part one of this series. Information on how to create dref MP4s is available in our documentation, and we presented more background on them at Demuxed 2020.

Workload Generator

In the tests we’re presenting here, the Media Sink (or client in this case) of the testbed generates a load with a gradual increase of emulated workers from zero up to 50. Each worker emulates MPEG-DASH media requests as follows:

1. Request MPD and set media segment index ‘k’ = 0
2. Request audio segment ‘k’ with a bitrate of 132kbps
3. Request video segment ‘k’ with a bitrate of 3894kbps
4. If ‘k’ >= total number of available media segments ‘N’, go to step 1; otherwise increment ‘k’ by 1 and go to step 2
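
The loop above can be sketched in Python as a generator that yields the order of requests for a single emulated worker. This is a minimal illustration, not the testbed's actual workload generator: the tuple format is an assumption, no HTTP requests are made, and the steps are followed literally, so indices run from 0 up to and including N. Whether the real generator treats N as a count or as the last index is not specified here.

```python
from itertools import islice
from typing import Iterator, Optional, Tuple

Request = Tuple[str, Optional[int]]


def worker_requests(n: int) -> Iterator[Request]:
    """Yield the endless request sequence of one emulated worker."""
    while True:
        yield ("mpd", None)      # step 1: request the MPD and reset k
        k = 0
        while True:
            yield ("audio", k)   # step 2: audio segment k
            yield ("video", k)   # step 3: video segment k
            if k >= n:           # step 4: restart from the MPD
                break
            k += 1               # step 4: otherwise move to the next segment


# First requests of a worker for a (hypothetical) stream with N = 1.
first_requests = list(islice(worker_requests(1), 6))
print(first_requests)
```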

Test results for different configurations

Figures 1 and 2 illustrate the comparison of the five Unified Origin configurations tested. Figure 1 shows the average response times and figure 2 shows the request rate during the workload.

It’s important to point out that these results consider ‘long tail’ content requests (content that has not been cached by the CDN). In a production setup popular (‘hot’) content will be served straight from the CDN and therefore won’t generate load on the Media Processing Function (e.g., Unified Origin).

Figures 1 and 2. Media Processing Function configuration

To put this into perspective with an example: based on the performance metrics of one of our customers we know that with an average outgoing throughput of 2.6Gbps from Unified Origin, ~3.3 million viewers can be served on average depending on the CDN provider.

Table 1 summarizes the incoming (IN) throughput from the S3 bucket and the outgoing (OUT) throughput towards the workload generator. We also added a conversion efficiency factor (OUT divided by IN), which shows how much data needs to be ingested to generate a certain amount of output.

Configuration	(A)	(B)	(C)	(D)	(E)
IN throughput (Gbps)	3.065	1.256	4.288	1.201	1.327
OUT throughput (Gbps)	0.902	1.225	1.256	1.174	1.297
Conversion efficiency (OUT/IN)	0.294	0.975	0.293	0.977	0.977

Table 1: Average throughput IN, OUT, and the conversion efficiency factor.

The results indicate that within the context of our test setup, configurations (D) and (E) achieve the lowest response time, highest media request rate, and highest conversion efficiency factor. However, with an efficiency equal to that of configuration (D), configuration (E) provided roughly 10% more outgoing throughput, which makes it the better choice.
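
For readers who want to reproduce the derived numbers, the Python sketch below recomputes the conversion efficiency (OUT divided by IN) from the throughput values in table 1, along with the relative gain in outgoing throughput of configuration (E) over (D). Because it starts from the rounded table values, the results can differ from the table in the last digit.

```python
# (IN, OUT) average throughput in Gbps per configuration, copied from table 1.
throughput_gbps = {
    "A": (3.065, 0.902),
    "B": (1.256, 1.225),
    "C": (4.288, 1.256),
    "D": (1.201, 1.174),
    "E": (1.327, 1.297),
}


def conversion_efficiency(in_gbps: float, out_gbps: float) -> float:
    """Outgoing throughput produced per unit of ingested throughput."""
    return out_gbps / in_gbps


efficiency = {
    name: conversion_efficiency(in_gbps, out_gbps)
    for name, (in_gbps, out_gbps) in throughput_gbps.items()
}

# Relative gain in outgoing throughput of (E) compared with (D).
gain_e_over_d = throughput_gbps["E"][1] / throughput_gbps["D"][1] - 1

for name, value in efficiency.items():
    print(f"({name}) efficiency: {value:.3f}")
print(f"(E) vs (D) outgoing throughput: {gain_e_over_d:+.1%}")
```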

The best performing configuration

The crucial difference between configuration (D) and (E) is the use of Apache’s ProxyRemoteMatch directive to make sure only relevant requests pass through the caching layer (i.e., those for server manifests and dref MP4s). The rest of the requests (i.e., for media data) skip the caching layer and go to the remote object storage directly. This is visualized in figure 3:

Figure 3. Overview of configuration (E) with a backend caching layer using ProxyRemoteMatch.

The benefit of this configuration is that the backend traffic between the remote storage and Unified Origin is reduced by up to 71%. In addition, the use of Apache’s native subrequests allows for advanced load-balancing mechanisms across multiple remote storage endpoints, an approach that we describe in more detail in our documentation.

For reference, the Apache virtual host configurations for Origin and the caching layer of configuration (E) are included in full below (but be sure to keep scrolling, because there is more to this blog post):

# Origin Apache VirtualHost
<VirtualHost *:80>
    ServerAdmin webmaster@localhost
    ServerName origin
    DocumentRoot /var/www/origin
    
    Options -Indexes

    HostnameLookups Off
    UseCanonicalName On
    ServerSignature On
    LimitRequestBody 0
    
    Header set Access-Control-Allow-Headers "origin, range"
    Header set Access-Control-Allow-Methods "GET, HEAD, OPTIONS"
    Header set Access-Control-Allow-Origin "*"

    # Use REGEX ProxyRemoteMatch to request and cache server manifest (.ism)
    # and dref mp4 files. 
    # The cache layer is located in port 81.
    ProxyRemoteMatch "\.(?i:ism|mp4)$" "http://localhost:81/"
    
    SSLProxyEngine on
    
    <Location />
     UspHandleIsm on
     UspHandleF4f on
     # Enable Apache subrequests (Apache does not allow a comment on the
     # same line as a directive)
     UspEnableSubreq on
     IsmProxyPass http://${YOUR_OBJECT_STORAGE_URI}
    </Location>

    <Proxy "http://${YOUR_OBJECT_STORAGE_URI}/">
     ProxySet connectiontimeout=5 enablereuse=on keepalive=on retry=0 timeout=30 ttl=300
    </Proxy>

    ErrorLog /var/log/apache2/origin-error.log
    CustomLog /var/log/apache2/origin-access.log combined
    LogLevel warn

</VirtualHost>

# Apache cache VirtualHost
<VirtualHost  *:81>
  LogLevel warn

  ProxyPreserveHost On
  ProxyPass / http://${YOUR_OBJECT_STORAGE_URI}/ connectiontimeout=5 enablereuse=on keepalive=on retry=0 timeout=30 ttl=300
  ProxyPassReverse / http://${YOUR_OBJECT_STORAGE_URI}/

  <Proxy "http://${YOUR_OBJECT_STORAGE_URI}/">
    ProxySet connectiontimeout=5 enablereuse=on keepalive=on retry=0 timeout=30 ttl=300
  </Proxy>

  CacheRoot /var/cache/apache2
  CacheEnable disk /
  CacheDirLevels 5
  CacheDirLength 3
  CacheDefaultExpire 7200
  CacheIgnoreNoLastMod On
  CacheIgnoreCacheControl On
  CacheIgnoreQueryString Off 
  # The max size of your index files
  CacheMaxFileSize 1000000000

  # Run cache as a normal handler
  CacheQuickHandler off

  <Location />
    CacheEnable disk /
  </Location>
  
  # Unset range to enable cache
  <LocationMatch ".*\.(?i:ism|mp4)$">
    RequestHeader unset Range
  </LocationMatch>

  ErrorLog /var/log/apache2/cache-error.log
  LogFormat "%{%FT%T}t.%{usec_frac}t%{%z}t %v:%p %h:%{remote}p \"%r\"  %>s %I %O %{us}T: %{cache-status}e" cachelog
  CustomLog /var/log/apache2/cache-access.log cachelog

</VirtualHost>
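
To illustrate which requests the ProxyRemoteMatch pattern in the Origin virtual host above sends to the caching layer, the Python sketch below applies the same regular expression to a few example URLs. The host name and file names are hypothetical, as is the assumption that the CMAF media files use extensions such as .cmfv. Apache evaluates the pattern with PCRE rather than Python's re module, so treat this as an approximation that happens to behave the same for this simple pattern.

```python
import re

# Same pattern as in the ProxyRemoteMatch directive: a case-insensitive
# ".ism" or ".mp4" at the very end of the URL.
CACHE_PATTERN = re.compile(r"\.(?i:ism|mp4)$")


def goes_through_cache(url: str) -> bool:
    """Return True if a backend request would be routed via the cache."""
    return CACHE_PATTERN.search(url) is not None


# Hypothetical backend URLs, for illustration only.
examples = [
    "http://storage.example/vod/sita.ism",
    "http://storage.example/vod/sita-video.MP4",
    "http://storage.example/vod/sita-video.cmfv",
]
for url in examples:
    target = "cache" if goes_through_cache(url) else "object storage"
    print(f"{url} -> {target}")
```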

Bigger is better?

Now that we have determined which configuration performs best for our use case, can we push performance even further by using a much bigger instance type? Or does a relatively small instance like the c5a.large we tested with deliver significantly better bang for buck?

The c5n family uses an Intel Xeon Platinum processor and provides network bandwidth of up to 100Gbps for the biggest size model. In contrast, the c5a compute optimized family that we used for our previous tests comes with 2nd generation AMD EPYC 7002 series processors and up to 20Gbps of network bandwidth for the biggest size model.

Instance type	# vCPU	# Phys. Cores	Memory (GiB)	Baseline Bandwidth (Gbps)	$/hr
c5a.large	2	1	4	0.75	0.087
c5n.large	2	1	5.25	3.0	0.123
c5a.xlarge	4	2	8	1.25	0.174
c5n.xlarge	4	2	10.5	5.0	0.246
c5a.2xlarge	8	4	16	2.5	0.348
c5n.2xlarge	8	4	21	10	0.492

Table 2. AWS EC2 instances tested. Details provided by AWS EC2 for the Frankfurt region datacenter, checked on the 10th of May 2021.

To answer our ‘bigger is better?’ question, figure 3 provides a comparison of the average outgoing (OUT) throughput of configuration (E) deployed on the different tested cloud instances from table 2. Also, to push the throughput of the instances further we tested two additional variants of the Media Source, using a segment length of four and eight seconds, to see if that would differentiate the results.

Figure 3. Outgoing throughput by tested cloud instance and different media segment lengths.

First of all, the results show that within our test conditions the AMD-based c5a instances obtain a higher throughput than the Intel-based c5n types. Also, as expected the larger segment sizes (in this case because of their length, but higher bit rates will have a comparable impact) affect the outgoing throughput generated by Unified Origin, but not in a way that further differentiates the results between the instance types.

Lastly, what is striking about these results is the linear scaling, so let’s take a closer look at that.

Linear scaling

Figure 4 presents the average outgoing throughput of the c5a.large cloud instance (the smallest instance size). As can be seen, Origin’s outgoing throughput is directly correlated with the video segment bitrate and the duration of the segment.

Figure 4. Average outgoing throughput based on video bitrate and segment length.

Overall, Origin’s outgoing throughput can be linearly modeled regardless of the tested instance type, based on:

Outgoing throughput [Mbps]
R: Request rate [requests/second]
S: Segment duration [seconds]
B: Average media bitrate [Mbps]

Where outgoing throughput ~ R × S × B.
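
As a minimal sketch of how this model can be used for capacity planning, the Python functions below compute the estimated outgoing throughput from R, S and B, and the request rate needed to reach a target throughput. The input values are hypothetical round numbers chosen for readability, not measurements from our tests.

```python
def outgoing_throughput_mbps(
    request_rate: float, segment_duration_s: float, avg_bitrate_mbps: float
) -> float:
    """Estimate outgoing throughput as R x S x B, in Mbps."""
    return request_rate * segment_duration_s * avg_bitrate_mbps


def required_request_rate(
    target_mbps: float, segment_duration_s: float, avg_bitrate_mbps: float
) -> float:
    """Invert the model: requests/second needed for a target throughput."""
    return target_mbps / (segment_duration_s * avg_bitrate_mbps)


# Hypothetical inputs: 100 requests/s for 4 Mbps media.
two_second_segments = outgoing_throughput_mbps(100, 2, 4.0)
four_second_segments = outgoing_throughput_mbps(100, 4, 4.0)
print(f"2 s segments: {two_second_segments} Mbps")
print(f"4 s segments: {four_second_segments} Mbps")
```

At a constant request rate, doubling the segment duration doubles the estimated throughput, which is the linear behavior described above.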

Therefore, scaling horizontally instead of vertically will likely work best for most use cases, especially when cost is taken into consideration.
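
To put the cost argument into numbers, the Python sketch below compares four c5a.large instances with a single c5a.2xlarge, using only the vCPU count, baseline bandwidth and hourly price listed in table 2. It deliberately says nothing about measured throughput, and it ignores any extra cost of load balancing across instances, so it is a rough illustration rather than a full cost model.

```python
# name: (vCPUs, baseline bandwidth in Gbps, price in $/hr), copied from table 2.
INSTANCES = {
    "c5a.large": (2, 0.75, 0.087),
    "c5a.2xlarge": (8, 2.5, 0.348),
}


def fleet(name: str, count: int) -> dict:
    """Aggregate vCPUs, baseline bandwidth and hourly price for a fleet."""
    vcpus, gbps, usd_per_hour = INSTANCES[name]
    return {
        "vcpus": vcpus * count,
        "baseline_gbps": gbps * count,
        "usd_per_hour": usd_per_hour * count,
    }


horizontal = fleet("c5a.large", 4)   # scale out with small instances
vertical = fleet("c5a.2xlarge", 1)   # scale up to one bigger instance

for label, totals in (("4 x c5a.large", horizontal), ("1 x c5a.2xlarge", vertical)):
    print(
        f"{label}: {totals['vcpus']} vCPUs, "
        f"{totals['baseline_gbps']:.2f} Gbps baseline, "
        f"${totals['usd_per_hour']:.3f}/hr"
    )
```

Both options offer the same number of vCPUs at the same hourly price, while the four smaller instances add up to more baseline network bandwidth.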

Conclusion

In this second part of our series on load testing streaming video at scale, we tested various optimizations for a remote storage-based VOD setup. Based on our tests we can conclude the following:

When using remote object storage to store VOD content for streaming with Origin, caching the server manifest and the media metadata (dref MP4s) results in optimal performance (and a significant reduction of backend traffic to the remote storage)
Enabling Apache subrequests further improves performance and provides more opportunities to implement backend load-balancing mechanisms such as those described in our Object Storage High Availability (load-balancing) documentation
Selecting cloud instances that fit your media workload and the type of scaling (vertical or horizontal) will reduce your overall costs

In part 3 of this series we will present how the testbed can be used to deploy different Media Processing Functions in different parts of the network, with a focus on achieving the best results when streaming linear Live channels.
