A REALTIME SOFTWARE SOLUTION FOR RESYNCHRONIZING FILTERED MPEG2
TRANSPORT STREAM
Bin Yu, Klara Nahrstedt
Department of Computer Science
University of Illinois at Urbana-Champaign
DCL, 1304 W. Springfield, Urbana IL 61801
binyu,
ABSTRACT
With the increasing demand for and popularity of multimedia
streaming applications over the current Internet, manipulating MPEG streams in software in real time is gaining more and more importance. In this work, we study the synchronization problem that arises when a gateway
changes the data content carried in an MPEG2 Transport
stream. In short, the distances between the original time stamps
change non-uniformly when the video frames are
resized, and decoders fail to reconstruct the encoding
clock from the resulting stream. We propose a cheap, real-time software approach to this problem, which reuses the original time stamp packets and adapts their
spacing to accommodate the changes in bit rate. Experimental results from a real-time HDTV stream filter show
that our approach is correct and efficient.
1. INTRODUCTION
Video streaming is gaining more and more attention from
both academia and industry, and primarily three things
are behind this popularity: a widely accepted video compression standard, MPEG2 [4]; a widely available Internet with high bandwidth becoming commonplace; and
ever-growing user demand for the more easily understood
visual presentation of information. Beyond simply sending
the video content, people are working on adapting the content at intermediate gateways before it reaches the client, either to tackle heterogeneity in resource availability or to increase client customization and interaction. Example prototype systems include ProxiNet [1], the IBM Transcoding proxy
[8], UC-Berkeley TranSend [5] and the Content Services Network [10]. There could be many kinds of video editing
services, such as watermarking, frequency-domain low-pass
filtering, frame/color dropping, external content embedding
[11] and so on.
As we focus on the case of HDTV streaming, the problem of streaming vs. decoding becomes obvious. On the
one hand, the Internet is bringing to end hosts video streams
above 10 Mbps, thanks to technologies like IP multicast
on the MBone, Fast Ethernet [9] and Gigabit Ethernet [3]
in office buildings, and xDSL [2] and cable modems [6] at
home. On the other hand, PCs and even gateway servers
are still not able to decode or perform non-trivial video manipulation on such high volume HD streams in real time, for lack
of computing power and real-time support. For example, with an ordinary 30-frames-per-second HDTV stream at
18 Mbps, a 100 Mbps local area network could afford 4 or 5 high
definition video conference sessions in an office building,
but even the most advanced desktop computer can only
decode and render two frames per second. Also, the PC
monitor can never match the viewing experience offered
by TV screens and big video walls, and in many situations
the high definition video needs to be shown on large screens
for a large audience.
In such a situation, we propose to combine the software
video delivery channel with hardware decoding/rendering
interfaces by using desktop PCs to receive and process the HD
video streams and then feed the resulting streams into a
hardware decoding board. For example, in [11] we presented how we implemented software real-time Picture-in-Picture for HDTV streams in this way. However, one key
problem we face is that hardware decoding boards rely
on the time stamps contained in MPEG2 Transport Layer
streams to maintain their hardware clock, while almost all
software editing operations compromise these time stamps. This problem has to be solved before any similar
software video manipulation can be applied to HD video
streams, and in this paper we present our solutions to
it. Our solutions are cheap in the sense that they are simple
and easy to implement, and no hardware real-time support
is necessary. This way, they can be adopted by desktop PCs
or intermediate gateway servers with minimal extra cost.
This paper is organized as follows: in section 2, we
briefly introduce how the synchronization between the MPEG
encoder and decoder works according to the MPEG2 standard and the re-synchronization problem that arises after
video editing operations. Our solution is then discussed in
detail in section 3, and experimental results follow in section
4. Finally, we discuss some related work and conclude this
paper in sections 5 and 6.

Figure 1: Synchronization between the encoder and decoder
2. THE SYNCHRONIZATION PROBLEM
In this section, we first briefly review how the time stamps encoded in an MPEG2 Transport stream are used by the
decoder to reconstruct the encoder's clock, and then we introduce the kind of video editing system we focus
on and how it affects the synchronization between the encoder and the decoder.
2.1. The MPEG2 Transport Layer Stream Timestamps

Figure 1 shows how MPEG2 Transport streams maintain
synchronization between the sender, which encodes
the stream, and the receiver, which decodes it. As the elementary streams carrying video and audio content are packetized, their target Decoding Time Stamp (DTS) and Presentation Time Stamp (PTS) are determined based on the
current sender clock and inserted into the packet headers.
For video streams, the access unit is a frame, and both DTS
and PTS are given only for the first bit of each frame after its picture header. These time stamps are later used by
the decoder to control the timing at which it starts
decoding and presentation. For example, suppose an encoded frame comes to the multiplexing stage at the
sending side at time t s, and the encoder believes (based on calculation
using predefined parameters) that the decoder should begin
to decode this frame d1 s after it receives it and output the
decoded frame d2 s thereafter. Assuming the decoder can
reconstruct the encoder clock, and that the time it receives this
frame is also t s, then the DTS should be set to t + d1 s and the PTS
to t + d1 + d2 s. After that, as all of these packetized
elementary stream packets are further multiplexed together,
the final stream is time-stamped with the Program Clock Reference (PCR), which is obtained by periodically sampling the
encoder clock. The resulting transport layer stream is then
sent over the network to the receiver, or stored on storage
devices for the decoder to read in the future. As long as the
delay the whole stream experiences remains constant from
the receiver's point of view, the receiver should be able to
reconstruct the sender's clock that was used when the
stream was encoded. The accuracy and stability of this recovered clock is very important, since the decoder will
match the PTS and DTS against this clock to guide its
decoding and displaying activities.
Figure 2: MPEG2 Transport Stream Syntax
Knowing the general idea of the timing, we now introduce
how the Transport Layer syntax works, as shown in Figure
2. All sub-streams (video, audio, data and time stamps) are
segmented into small packets of constant size (188 bytes),
and the Packet ID (PID) field in the 4-byte header of each
packet tells which sub-stream the packet belongs to. The
PCR packets are placed at constant intervals, and they form
a running time line along which all other packets are positioned at their target time points. On this time line, each 188-byte packet occupies one time slot, and the exact time stamp
of each packet/slot can be interpolated from the neighboring
PCR packets. Data packets arrive and are read into the decoder buffer at a constant rate, and this rate can be calculated
by dividing the number of bits between any two consecutive PCR packets by the time difference between their time
stamps. In other words, if the number of packets between
any two PCR packets remains constant, then the difference
between their time stamps should also be constant. In the
ideal state, packets are read into the decoder at the constant
bit rate, and whenever a new PCR packet arrives, its time
stamp matches exactly with the receiver clock, which
confirms to the decoder that so far it has successfully reconstructed the same clock as the encoder. However, since PCR
packets may have experienced jitter in network transmission
or storage device access before they arrive at the receiver,
we cannot simply set the receiver's local clock to the
time stamp carried by the next incoming PCR packet no
matter when it comes. To smooth out the jitter and maintain
a stable clock with a limited buffer size at the receiver, the receiver generally resorts to a smoothing technique
like the Phase-Locked Loop (PLL) [7] to generate a stable
clock from the jittered PCR packets. A PLL is a feedback
loop that uses an external signal (the incoming PCR packets in our case) to tune a local signal source (generated by
a local oscillator in our case) into a relatively more
stable output signal (the receiver's reconstructed local clock
in our case). So long as the timing relation between PCR
packets is correct, the jitter can be smoothed out with the PLL.
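To make this arithmetic concrete, the following is a minimal C sketch of how a receiver (or our filter) might extract PCR values and derive the read-in rate and per-slot time stamps by interpolation. The helper names are our own, but the bit layout and the 27 MHz PCR clock follow the MPEG2 Systems syntax.

#include <stdint.h>

#define TS_PKT 188
#define PCR_HZ 27000000.0            /* PCR runs on a 27 MHz clock */

/* Return the 27 MHz PCR value if this packet carries one, else -1.
 * Field layout follows the MPEG2 Systems adaptation field syntax. */
static int64_t get_pcr(const uint8_t p[TS_PKT]) {
    if (p[0] != 0x47) return -1;               /* sync byte                */
    if (!(p[3] & 0x20) || p[4] < 7) return -1; /* no/short adaptation field */
    if (!(p[5] & 0x10)) return -1;             /* PCR_flag not set         */
    int64_t base = ((int64_t)p[6] << 25) | ((int64_t)p[7] << 17) |
                   ((int64_t)p[8] << 9)  | ((int64_t)p[9] << 1)  |
                   (p[10] >> 7);               /* 33-bit base, 90 kHz      */
    int64_t ext  = ((p[10] & 0x01) << 8) | p[11]; /* 9-bit extension       */
    return base * 300 + ext;                   /* 27 MHz units             */
}

/* With two consecutive PCRs n_slots packets apart, the constant read-in
 * rate and the interpolated time of slot i (0 <= i <= n_slots) follow. */
static double mux_rate_bps(int64_t pcr1, int64_t pcr2, int n_slots) {
    return (double)n_slots * TS_PKT * 8 * PCR_HZ / (double)(pcr2 - pcr1);
}
static double slot_time_s(int64_t pcr1, int64_t pcr2, int i, int n_slots) {
    return (pcr1 + (double)(pcr2 - pcr1) * i / n_slots) / PCR_HZ;
}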
Figure 3: Layered coding scheme of MPEG-2 Transport Stream

2.2. HDTV Stream Editing/Streaming Test Bed

In the following, we base our discussion on the
video editing/streaming test bed shown in Figure 3. A live
high definition digital TV stream from the satellite or an
HD storage device is fed into the server PC, which then
encodes it into an MPEG2 Transport stream and multicasts this
stream over the high speed Local Area Network. Players on
the client PCs join this multicast group to receive the HD
stream, and then feed this stream into the decoding board.
The decoded analogue signal is then sent to the wide-screen
TV for display. Our filter receives this stream in the same
way as a normal player, and performs various kinds of video
editing operations on it in real time, such as low
pass filtering, frame/color dropping and visual information
embedding [11]. Multiple editing operations can be applied
to the same stream in a chain, and the resulting streams
at all stages are available to clients through other multicast
groups.
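As a rough sketch of the filter's main loop in this test bed, the C fragment below joins the input multicast group, runs each received datagram of TS packets through an editing operation, and re-multicasts the result. The group addresses, port numbers, and the filter_packets() hook are placeholders of our choosing, not values from our deployment.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int in = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5000);                    /* placeholder port  */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    bind(in, (struct sockaddr *)&addr, sizeof addr);

    struct ip_mreq mreq;                            /* join input group  */
    mreq.imr_multiaddr.s_addr = inet_addr("239.0.0.1");
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    setsockopt(in, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof mreq);

    int out = socket(AF_INET, SOCK_DGRAM, 0);       /* output group      */
    struct sockaddr_in dst;
    memset(&dst, 0, sizeof dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(5001);
    dst.sin_addr.s_addr = inet_addr("239.0.0.2");

    uint8_t buf[7 * 188];   /* 7 TS packets fit in a typical UDP datagram */
    for (;;) {
        ssize_t n = recv(in, buf, sizeof buf, 0);
        if (n <= 0) break;
        /* filter_packets(buf, n);   <-- editing operation goes here     */
        sendto(out, buf, n, 0, (struct sockaddr *)&dst, sizeof dst);
    }
    close(in);
    close(out);
    return 0;
}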
2.3. How Video Editing Affects Clock Reconstruction
Since the timing and spacing of PCR packets are critical for clock reconstruction, it is obvious that video
editing operations will cause malfunctions, since they change
both.

First, all intermediate operations a video stream goes
through before it reaches the decoder contribute to the delay and jitter of the PCR sub-stream. Different filtering operations, such as low pass filtering and Picture-in-Picture,
and even the same operation applied to different frames or different parts of the same frame, take varying processing time. To compensate, traditional
solutions would either try to adjust the resulting stream at
each intermediate point or push all the trouble to the final
client. The former suffers from the fact that
processing times for different operations and frames tend to be
quite different and variable, which makes it very hard to find
a locally optimal answer. The latter implies that the
client needs a very large buffer and a long waiting
time because of the unpredictable delay and jitter of the incoming stream. We will see later how our solutions solve
this problem by utilizing the inherent PCR time stamps of
the streams.
The second problem, the changed spacing between PCR
packets, is even more intractable. As we said above, each access unit (video frame or audio packet) should be positioned
on the time line formed by the PCR sub-stream. If a
video frame arrives at the receiver at its destined time point,
the decoder is able to correctly schedule where and
how long to buffer it before decoding it. However, after a filtering operation a video frame normally becomes smaller
or larger. It takes fewer or more packets to carry, and so the
following frames are dragged earlier or pushed later along
the time line. In such circumstances, if we keep both the
time stamps and the spacing of the PCR packets unchanged,
then the receiver's clock can still be correctly reconstructed,
but the arrival time of each frame will be skewed along
the time line. For example, if the stream is low pass filtered, then every frame becomes shorter, and the following
frames are dragged forward to fill the vacated space. If the decoder still reads in data at the original speed, it
sees more and more future frames arriving earlier
and earlier. Since they are all buffered until their stamped
decoding times, the buffer will eventually overflow no matter how large it is. The fundamental problem is
that after the filtering, the actual bit rate becomes lower or
higher, but the data is still read in by the decoder at the original rate, since the timing and spacing of PCR packets have
not changed. So if the new rate is lower, more and more
future frames are read in by the decoder, causing the receiving buffer for the network connection to empty while
the decoder's decoding buffer overflows; on the other
hand, if the new rate is higher, then at some point in the future, data will remain in the receiving buffer and not be
read in by the decoder even at its decoding time.
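To make the drift concrete, the small sketch below estimates how long a decoder buffer survives in the rate-reducing case. The buffer size is a hypothetical value of our choosing; the 13.71 Mbps content rate is borrowed from Table 1.

#include <stdio.h>

int main(void) {
    /* The decoder keeps reading at the rate dictated by PCR spacing,
     * while the filtered stream carries content at a lower rate. */
    double read_rate    = 18.00e6;    /* bits/s, original mux rate        */
    double content_rate = 13.71e6;    /* bits/s, after low pass filtering */
    double buffer_bits  = 8.0 * 8e6;  /* hypothetical 8 MB decoder buffer */

    /* "Early" future frames accumulate at the difference of the rates. */
    double drift = read_rate - content_rate;
    printf("buffer overflows after %.1f s\n", buffer_bits / drift);
    return 0;   /* with these numbers: roughly 15 s, however large-sounding
                   the buffer is, overflow is only a matter of time */
}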
3. OUR SOLUTIONS
To solve the problems described above, an immediate thought
would be to perform the same kind of clock reconstruction at the filter as
the decoder does, and then re-generate the PCR
packets to reflect the changes at the filter output. However,
smoothing mechanisms like the PLL are implemented in hardware circuits containing a voltage controlled oscillator that generates high frequency signals to be
tuned with the incoming PCR time stamps. This is difficult,
if not impossible, to do in software on computers without hardware real-time support. Therefore, a pure software mechanism that does not require hardware real-time
support would enable us to distribute the video editing service across the network to any point on the streaming path.
Another goal is to achieve a cheap and efficient solution that
can easily be implemented and carried out by any computer with modest CPU and memory resources available.

The key idea behind our solution comes from the observation that the DTS and PTS are associated only with
the beginning bit of each frame. Consequently, so long as
we manage to fix that point to the correct position on the
time line, the decoder should work fine even if the remaining bits of that frame following the starting point are
stretched shorter or longer.
3.1. Simple Solution: Padding
Following the discussion above, we have designed a simple
solution that works for bit-rate-reducing video editing operations. We do not change the time stamp or the position
of any PCR packet along the time line within the stream,
and we also preserve the position of the frame header, and
hence that of the beginning bit, of every frame. What changes
is the size of each frame in number of bits,
and we simply pack the filtered bits of a frame closely following the picture header. Since each frame takes fewer 188-byte
packets to carry, yet the frame headers are still positioned at
their original time points, there will
be some "white space" left between the last bit of one frame
and the first bit of the header of the next frame. The capacity of this space equals the reduction in the
number of bits used to encode the frame as a result of the
video editing operation, and we can simply pad this space
with empty packets (NULL packets).
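A minimal sketch of this padding step, assuming the filtered frame has already been repacketized into 188-byte packets; the function and buffer names are ours, but the NULL packet PID 0x1FFF is defined by the MPEG2 standard.

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TS_PKT 188

/* Write one MPEG2 NULL packet (PID 0x1FFF, stuffing payload). */
static void write_null_packet(uint8_t pkt[TS_PKT]) {
    memset(pkt, 0xFF, TS_PKT);
    pkt[0] = 0x47;   /* sync byte                                */
    pkt[1] = 0x1F;   /* PID high bits: 0x1FFF is the NULL PID    */
    pkt[2] = 0xFF;   /* PID low bits                             */
    pkt[3] = 0x10;   /* payload only; decoders ignore its data   */
}

/* Place a filtered frame back into the slots its original version
 * occupied: filtered packets first, then NULL packets as padding. */
static void emit_padded_frame(uint8_t out[][TS_PKT], int orig_slots,
                              uint8_t filtered[][TS_PKT], int n_filtered) {
    assert(n_filtered <= orig_slots);   /* bit-rate-reducing ops only */
    for (int i = 0; i < n_filtered; i++)
        memcpy(out[i], filtered[i], TS_PKT);
    for (int i = n_filtered; i < orig_slots; i++)
        write_null_packet(out[i]);
}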
This solution is very simple to understand and implement, and it preserves the timing synchronization, since we
only need to pack the filtered bits of each frame continuously after the picture header and then insert NULL packets
until the header of the next frame. No new time stamps need
to be generated in real time, and the bit rate remains stable
at the original rate. However, it inevitably has some drawbacks. First, it can only handle bit-rate-reducing operations.
We only fix the header of each frame to its original
position on the time line, which means a changed frame
must not occupy more bits than fit in the distance between the current frame header and the next. This property
does not always hold, since some filtering operations like
information embedding and watermarking may increase the
frame size in bits. Secondly, the saved bits are padded with
NULL packets to maintain the original constant bit rate and
the starting point of each frame, and this ironically runs
counter to our initial goal of bit rate reduction for some operations like low pass filtering and color/frame dropping.
The resulting stream contains the same number of packets
as the original one; the only difference is that the number of bits representing each frame has shrunk, yet this
saving is spent immediately by padding NULL packets at
the end of each frame.
Here we want to mention that there does exist another
approach that bypasses the second problem. Up to now we have
been using a filter model that is transparent to the client
player, which confines us strictly to the MPEG2 standard
syntax. However, if some of the filtering intelligence is exported to the end hosts, then some savings can be expected.
For example, instead of inserting the NULL packets themselves, we may
compress them into a single special packet indicating how many
NULL packets should follow. At the end host, a
stub proxy watches the incoming stream and, on
seeing this packet, replaces it with the indicated
amount of padding packets before sending the stream to the
client player. Note that this padding is important for maintaining
correct timing, especially if the client is using a standard hardware decoding board. This way, the bandwidth
is indeed saved, but at the price of relying on a non-standard
protocol outside MPEG2. Of course, this in turn introduces the problems associated with non-standardized solutions,
such as difficulty in software maintenance and upgrading.
Therefore, we only consider this a secondary choice, and
not a major solution.
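As a sketch of this non-standard variant, the padding run could be collapsed into one marker packet on a private PID whose payload carries a run length. The PID value and payload layout below are purely hypothetical and lie outside the MPEG2 syntax.

#include <stdint.h>
#include <string.h>

#define TS_PKT 188
#define NULL_RUN_PID 0x1FFE   /* hypothetical private PID, NOT standard */

/* Filter side: emit one marker packet instead of `count` NULL packets. */
static void write_null_run_marker(uint8_t pkt[TS_PKT], uint16_t count) {
    memset(pkt, 0xFF, TS_PKT);
    pkt[0] = 0x47;
    pkt[1] = (NULL_RUN_PID >> 8) & 0x1F;
    pkt[2] = NULL_RUN_PID & 0xFF;
    pkt[3] = 0x10;                 /* payload only                    */
    pkt[4] = count >> 8;           /* payload: 16-bit run length      */
    pkt[5] = count & 0xFF;
}
/* The stub proxy, on seeing PID 0x1FFE, emits `count` NULL packets
 * (as in write_null_packet above) before forwarding to the player. */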
3.2. Enhanced Solution: Time-Invariant Bitrate Scaling

To ultimately solve the synchronization problem, a more
general algorithm has been designed. The key insight behind it is that we can change the bit rate to another constant
value while preserving the PCR time stamps by changing
the number of packets between any PCR pair to another
constant value. This way, we can scale the PCR packets'
distance and achieve a fixed new bit rate, as if the time line
were scaled looser or tighter to carry more or fewer packets, yet
we do not need to re-generate new PCR time stamps, which
would rely on hardware real-time support. All non-video
packets can simply be mapped to the new position on the
scaled output time line that corresponds to the same time
point as on the original input time line. In case no exact
mapping is available because the packets are aligned in units
of 188 bytes, we simply use the nearest time point on
the new time line without introducing any serious problem.
For the video stream, the same kind of picture header fixing and
frame data packing is conducted as in the first solution, but
in a scaled way.

Figure 4: Example: 2/3 shrinking

An example of shrinking the stream to 2/3 of its bandwidth is given in Figure 4. All non-video packets and video
packets that carry picture headers are mapped to their corresponding positions on the new time line, so their distances
are also shrunk to 2/3 of the original. After the video editing
operations, the resulting video packets are packed closely
and as early as possible within the new stream following the
frame header. Intuitively, the filtered video data is squeezed
into the remaining space between the non-video packets and
picture header packets. For example, suppose that in the input stream,
packet 6 is a frame header, packet 9 is an audio packet,
and packets 7, 8, and 10 through 24 are video data from the
frame of packet 6. After 2/3 shrinking, packet 6 is positioned in slot 4, and packet 9 goes to slot 6. The other
video data packets are processed by the video editing filter,
and the resulting bits are packed again into packets of 188
bytes each. Therefore, all empty slots, such as slot 5 and slots 7
through 16, are used to carry the resulting bits. If the filter
shrinks the video frame to occupy less than 2/3 of its original number of bits, then the new slots are enough to
carry the resulting frame.

This algorithm is also very simple to implement. For
each non-video packet, its distance (in number of packets)
from the last PCR packet is multiplied by a scaling factor
s, and the result is used to set the distance between this
packet and the last PCR packet in the output stream. For
video frames, the header containing DTS and PTS is scaled
and positioned in the same way, and the remaining bits are
closely appended to the header in the result stream. Note
that when s is set to 1, this reduces to the simple
solution above.
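A sketch of this mapping (the names are ours), where dist is a packet's slot distance from the most recent PCR packet in the input stream:

/* Map a packet's slot distance from the last PCR packet in the input
 * to its distance in the output under scale factor s (nearest slot);
 * with s = 1 this degenerates to the simple padding solution. */
static int scale_slot(int dist_from_last_pcr, double s) {
    return (int)(dist_from_last_pcr * s + 0.5);
}

/* Per input packet, one filter pass then works as follows:
 *  - PCR packet:       copied as-is; it starts a new output interval
 *                      whose length is s times the input interval.
 *  - non-video packet: placed at scale_slot(dist, s) after the last PCR.
 *  - picture header:   placed at scale_slot(dist, s); DTS/PTS untouched.
 *  - video data:       filtered bits are repacketized and packed into
 *                      the earliest free slots after their frame header.
 * Any slots still free at the end of a frame are padded with NULL packets. */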
Now the only problem is how to determine s for a specific streaming path. If we shrink the time line too much
and for some frames the bit-rate-reducing operation does
not have a significant effect, then again we will not have
enough space to squeeze in such a frame, which pushes the
beginning bit of the next frame behind schedule. On the
other hand, if we shrink the time line too little, or expand
it (s > 1) too much, then more space will be padded with
NULL packets to preserve the important time points, leading to
a waste of bandwidth. There exists one optimal scale factor s that balances these two forces; it fulfills the
conditions that

- the filtered frame data can always be squeezed into
the scaled stream;

- the number of NULL packets needed for padding is
minimal.
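In other words (our formalization, assuming frame f originally occupies $b_f$ packets and $\tilde{b}_f$ packets after filtering, and ignoring the slots taken by non-video packets), the smallest factor that always leaves enough room, and hence the one minimizing padding, is the worst-case shrink ratio:

$$ s^{*} = \max_{f} \frac{\tilde{b}_f}{b_f} $$

Any s above this value satisfies the first condition; s* itself also satisfies the second.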
However, this optimal scale factor is hard to estimate in advance, since different operations with various parameters
have quite different effects on distinct video clips in terms of
bit rate change. Therefore, in our current implementation,
we simply use a slightly exaggerated scale factor based on
the operation type and parameters. For example, for low
pass filtering with a threshold of 5, a scaling factor of 0.80
works for almost all streams. Even if we meet a frame
that still occupies more than 0.9 of its original number of packets after
the filtering, only the next few frames may be slightly affected. Since a smaller-than-average frame is expected to
follow shortly, this local skew can easily be absorbed by the decoder and does not have any chain effect.
Our next step will be to look into how to "learn" this
optimal scale factor by analyzing the history of a stream's bit
rate changes and adjusting s on the fly. The MPEG standard does not specify how a decoder, especially a hardware decoding board, should react if the incoming stream
changes from one constant bit rate to another, and it is also
an open question how quickly it would adapt to the new rate.
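One plausible on-the-fly estimator, sketched below, would track recent shrink ratios and keep s slightly above their maximum. This is our own illustration of the idea, not part of the current implementation.

/* Estimate the scale factor from the last n frames' shrink ratios
 * (filtered bits / original bits), plus a small safety margin.
 * A hypothetical scheme for the future work described above. */
static double adapt_scale(const double ratio[], int n, double margin) {
    double worst = 0.0;
    for (int i = 0; i < n; i++)
        if (ratio[i] > worst)
            worst = ratio[i];
    return worst + margin;   /* e.g., margin = 0.05 */
}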
4. EXPERIMENTAL RESULTS

Figure 5: Result of time line scaling

Figure 5 shows the effect of the time line scaling approach
for a low pass filter with threshold 5. Each point on the
x axis represents an occurrence of a PCR packet, and the y
axis shows, in three colors, how many video packets, NULL
packets, or packets of other data streams lie between each
pair of PCR packets. We can see that the distribution of the
three areas is kept almost constant for the original stream,
except for more NULL packets at the end of each frame. However, without scaling, the number of video packets varies
across different PCR intervals and a lot of extra space is
padded with NULL packets, as shown in the upper right subfigure. On the other hand, if we scale with a scaling
factor of 80%, then the padding occurs mostly at the
end of frames and the stream contains mostly useful data.
                              LP (10)   LP (5)   PIP
Original BR (Mbps)             18.0      18.0     18.0
Average resulting BR (Mbps)    15.45     13.71    19.04
Average relative change        0.86      0.76     1.06
Suggested s                    0.90      0.80     1.10

Table 1: Final Statistics
One thing we need to point out here is that the skew of
video access units along the time line still exists with this
scaling approach. What happens is that after the filtering
operation, each frame shrinks to a size mostly less than 80%
of its original size. If we mask out all other packets, we can
see that in the video stream, frames are packed closely one
after another. If one frame takes more space than its share,
then the next frame may be pushed behind its time point, but
this skew will be compensated later by another frame with
a larger shrink effect. As we said before, this kind of small
jitter around the exact time points on the scaled time line
is acceptable, and it is the change in the bit rate at which
the decoder reads in the data that fundamentally enables our
scaling algorithm to solve the problem.
Another experiment investigates how to determine the
scaling factor s for a particular kind of video editing operation. Three kinds of operations are tested: low pass filtering with thresholds of 10 and 5, and Picture-in-Picture.
The original stream is an HD stream, "stars1.mpg", with a bit
rate of 18 Mbps in MPEG2 Transport Layer format. The embedded frame used for Picture-in-Picture is a shrunk version of another HD stream, "football1.mpg". Since the content of this stream is denser than the background
stream (i.e., more DCT coefficients are used to describe
each block), the bit rate is expected to increase
after the Picture-in-Picture operation. The final statistics
are shown in Table 1. The results show that
with the suggested scaling factor based on real-world statistics, our time-invariant scaling algorithm successfully
solves the synchronization problem.
5. CONCLUSION
In this paper, we focus on the scenario of streaming HD
video in MPEG2 Transport Layer syntax with software streaming/processing and hardware decoding, a combination that will remain commonplace
until the processing power of personal computers
becomes strong enough to cope with high-bandwidth, high-definition
video streams. We have studied an important problem:
decoders may lose synchronization and fail to reconstruct
the encoder's clock because of video editing operations on
the streaming path. We have proposed two solutions to
this problem, both based on the idea of reusing the original
time stamp packets (PCR packets) and adjusting the number of packets between them to reflect the changes in bit
rate caused by video editing operations. Experimental results have shown that our solutions are efficient and work
fine without any requirement for real-time support from the
system.
As far as we know, our work is among the first efforts
in promoting real-time software filtering of High Definition
MPEG2 streams, and can be beneficial to many real-time
applications that work with MPEG2 system streams like
HDTV broadcast.
6. ACKNOWLEDGMENT
This work was supported by NASA under contract
number NASA NAG 2-1406 and by the National Science Foundation
under contract numbers NSF CCR-9988199, NSF CCR-0086094, NSF EIA 99-72884 EQ, and NSF EIA 98-70736.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and
do not necessarily reflect the views of the National Science
Foundation or NASA.
7. REFERENCES
[1] ProxiNet.
[2] Emerging high-speed xDSL access services: architectures, issues, insights, and implications. IEEE Communications Magazine, Vol. 37, No. 11, Nov. 1999, pp. 106-114.
[3] Gigabit Ethernet. Tutorial Guide: ISCAS 2001, The IEEE International Symposium on Circuits and Systems, pp. 9.4.1-9.4.16, 2001.
[4] ISO/IEC International Standard 13818. Generic coding of moving pictures and associated audio information. 1994.
[5] A. Fox, S. D. Gribble, Y. Chawathe, and E. Brewer. Adapting to network and client variation using active proxies: lessons and perspectives. IEEE Personal Communications, Vol. 5, No. 4, pp. 10-19, 1998.
[6] A. Dutta-Roy. An overview of cable modem technology and market perspectives. IEEE Communications Magazine, Vol. 39, No. 6, June 2001, pp. 81-88.
[7] C. E. Holborow. Simulation of Phase-Locked Loop for processing jittered PCRs. ISO/IEC JTC1/SC29/WG11, MPEG94/071, 1994.
[8] J. R. Smith, R. Mohan, and C.-S. Li. Scalable multimedia delivery for pervasive computing. ACM Multimedia 1999, 1999.
[9] J. Spragins. Fast Ethernet: Dawn of a New Network [New Books and Multimedia]. IEEE Network, Vol. 10, No. 2, March-April 1996, p. 4.
[10] W.-Y. Ma, B. Shen, and J. Brassil. Content Services Network: The Architecture and Protocols. Proceedings of the Sixth International Workshop on Web Caching and Content Distribution, 2001.
[11] B. Yu and K. Nahrstedt. A Compressed-Domain Visual Information Embedding Algorithm for MPEG2 HDTV Streams. ICME 2002, 2002.