Understanding Media in SIP Session Description Protocol (SDP)
In many SIP networks, calls fail or conference bridges misbehave when clients offering a diversity of codecs cause failed negotiations or one-way audio. In addition, the increased use of VPNs, especially over traditional SD-WAN implementations, can cause incorrect mapping of IP addresses and add latency. Video and voice codec negotiations fail when the endpoints have no codec in common, or when there is an implementation mismatch between the two vendors. This post aims to give a high-level understanding of how the Session Description Protocol (SDP), a protocol carried within SIP messages, represents media capabilities, and how VoIP endpoints negotiate them to ensure smooth multimedia communication and a better user experience.
A codec encodes the analog audio or video carrying our speech and compresses it into the bandwidth available on the IP network. One measure of a codec's sophistication is its ability to carry high-definition, high-fidelity audio while requiring only a small amount of network bandwidth. For example, the Opus and SILK codecs provide hi-fi speech quality, i.e. up to the limit of the human ear (20 kHz, or a bit lower than that for folks like me), at bitrates down to around 30 kbit/s.
Both endpoints, i.e. the IP phones or mobiles involved in the conversation, must agree on which codec will be used for a particular call, to ensure interoperability and correct decoding of the audio sent. This process is called codec negotiation and takes place while the SIP signalling is setting up the call. The media descriptions are offered in the codec list within the Session Description Protocol body of the SIP INVITE message, and the terminating endpoint replies with its own selection during the negotiation.
If the negotiation fails, the switch or SBC notices that the parties have no common codec offered or agreed upon, and generates a SIP response 488 "Not Acceptable Here". In this case the call fails and setup never completes.
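The offer/answer logic described above can be sketched as an ordered intersection of the two codec lists. This is an illustrative snippet, not a real SIP stack; the function name `negotiate` and the example codec lists are assumptions for demonstration.

```python
# Illustrative sketch of codec negotiation: the answerer keeps only the
# offered codecs it also supports, preserving the offerer's preference order.

def negotiate(offered, supported):
    """Return codecs common to both sides, in the offerer's preference
    order, or None when negotiation fails (no common codec)."""
    common = [codec for codec in offered if codec in supported]
    return common or None

offer = ["PCMU", "G729", "opus"]      # codecs listed in the SDP offer
answerer = {"G722", "opus", "PCMA"}   # codecs the far end supports

result = negotiate(offer, answerer)
if result is None:
    print("488 Not Acceptable Here")  # no common codec: call setup fails
else:
    print("Selected codec:", result[0])
```

With the lists above, "opus" is the only common codec, so it would be selected; with disjoint lists the sketch reports the 488 failure case.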
So that we can troubleshoot such codec negotiations, let's first take a quick peek at the Session Description Protocol.
RFC 4566 (which obsoletes RFC 2327, and has itself since been obsoleted by RFC 8866) defines SDP in detail. It is intended for describing multimedia sessions for the purposes of session announcement, session invitation, and other forms of multimedia session initiation such as conference calls.
The SDP session description consists of several lines of text in the form <type>=<value>. The session description starts with a "v=" line (protocol version, always zero), and each media-level section starts with an "m=" line. We will focus on the media section and media attributes in more detail:
m=<media> <port> <proto> <fmt> ...
Where <proto> is the transport protocol dependent on the connection type, and <fmt> is the media format description.
For the <proto> subfield, the common values for RTP media are RTP/AVP and RTP/SAVP. RTP/AVP is the RTP Audio/Video Profile carried over UDP, whereas RTP/SAVP indicates Secure RTP (SRTP, encrypted media) running over UDP.
The "c=" field (connection data) indicates the IP address to which the other end should send the RTP media.
The <fmt> field indicates the payload type; details of common payload types are shown in table 1.1 below. For non-RTP media, the proto field can be set to udp, as in the example below, e.g. transport of the whiteboard (wb) media type over UDP.
m=application 32416 udp wb
The "m=" line in the example above indicates the media type of the payload (audio, video, or, as here, application) and the port number that the media, such as RTP over UDP, will use.
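The field layout of the "m=" line can be made concrete with a small parser. This is a hypothetical helper for illustration only (the name `parse_m_line` is my own), not part of any SIP library.

```python
# Hypothetical helper: split an SDP "m=" line into its four parts,
# m=<media> <port> <proto> <fmt> ...

def parse_m_line(line):
    assert line.startswith("m="), "not a media description line"
    media, port, proto, *fmts = line[2:].split()
    return {"media": media, "port": int(port), "proto": proto, "fmts": fmts}

print(parse_m_line("m=audio 49230 RTP/AVP 18 96 97 98"))
# {'media': 'audio', 'port': 49230, 'proto': 'RTP/AVP', 'fmts': ['18', '96', '97', '98']}
```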
a=rtpmap:<payload type> <encoding name>/<clock rate> [/<encoding parameters>]
This attribute provides more details, such as the encoding, clock rate, and encoding parameters of a payload type used in the "m=" line. At most one rtpmap attribute can be defined for each media format specified.

a=fmtp:<format> <format specific parameters>

Codec-specific parameters can be included through the fmtp attribute. Thus, we might well have the following:

m=audio 49230 RTP/AVP 18 96 97 98
a=rtpmap:18 G729/8000/1
a=rtpmap:96 L8/8000
a=rtpmap:97 L16/8000
a=rtpmap:98 L16/11025/2
a=fmtp:18 annexb=yes
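The structure of the rtpmap attribute can likewise be sketched in code. Again, `parse_rtpmap` is a hypothetical name for illustration; note how a missing channel count defaults to 1, as in the L8 and L16 mono lines above.

```python
# Sketch: extract payload type, encoding name, clock rate and channel count
# from an "a=rtpmap:" attribute of the form
# a=rtpmap:<payload type> <encoding name>/<clock rate>[/<encoding parameters>]

def parse_rtpmap(line):
    payload, encoding = line[len("a=rtpmap:"):].split(None, 1)
    parts = encoding.split("/")
    name, clock = parts[0], int(parts[1])
    channels = int(parts[2]) if len(parts) > 2 else 1  # default: mono
    return int(payload), {"name": name, "clock": clock, "channels": channels}

pt, info = parse_rtpmap("a=rtpmap:98 L16/11025/2")
print(pt, info)  # 98 {'name': 'L16', 'clock': 11025, 'channels': 2}
```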
During the early stages of RTP development, statically assigned payload types (0-34) were used to bind encodings to payload types. Since the payload type number space is limited and relatively small, it cannot accommodate static assignments for all existing and future encodings. Payload type numbers in the range 96-127 are therefore used exclusively for dynamic assignment of a codec for a particular call, while the remaining types in the range 35-95 are unassigned or reserved.
So the "m=" line gives a summary of all codecs offered. The "a=" lines go into detail and provide the specific codec definitions; for dynamically assigned codecs, a payload type number from the 96-127 range is allocated for that call.
For static payload types, the a=rtpmap attribute may be omitted when the payload details match the RTP audio/video static profile for that payload type exactly.
For example, 16-bit linear encoded stereo audio sampled at 16 kHz, using dynamic RTP/AVP payload type 98 for this stream:
m=audio 49232 RTP/AVP 98
a=rtpmap:98 L16/16000/2
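For uncompressed linear PCM such as L16, the bandwidth the stream needs follows directly from the rtpmap parameters: bits per sample times sampling rate times channels. A quick back-of-the-envelope check (the helper name is mine):

```python
# Raw (payload-only) bitrate of linear PCM: bits/sample x rate x channels.
# This ignores RTP/UDP/IP header overhead, which adds to the wire rate.

def pcm_bitrate_kbps(bits_per_sample, rate_hz, channels):
    return bits_per_sample * rate_hz * channels / 1000

print(pcm_bitrate_kbps(16, 16000, 2))   # L16/16000/2 -> 512.0 kbit/s
print(pcm_bitrate_kbps(16, 44100, 2))   # payload type 10 -> 1411.2 kbit/s
```

The second call reproduces the 1411.2 kbit/s figure quoted for static payload type 10 in the table below.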
RFC 3551 specifies an initial set of payload types. The table below lists some of the common payload types defined in RFC 3551, extended for easy reference:
| Payload ID | Encoding Name | Audio/Video | Clock Rate (Hz) | Channels | Reference | Description |
|---|---|---|---|---|---|---|
| 0 | PCMU | A | 8000 | 1 | RFC 3551 | ITU-T G.711 PCM μ-law audio, 64 kbit/s |
| 3 | GSM | A | 8000 | 1 | RFC 3551 | European GSM Full Rate audio, 13 kbit/s (GSM 06.10) |
| 4 | G723 | A | 8000 | 1 | RFC 3551 | ITU-T G.723.1 audio |
| 8 | PCMA | A | 8000 | 1 | RFC 3551 | ITU-T G.711 PCM A-law audio, 64 kbit/s |
| 9 | G722 | A | 8000 | 1 | RFC 3551 | ITU-T G.722 audio, 64 kbit/s |
| 10 | L16 | A | 44100 | 2 | RFC 3551 | Linear PCM 16-bit stereo audio, 1411.2 kbit/s, uncompressed |
| 11 | L16 | A | 44100 | 1 | RFC 3551 | Linear PCM 16-bit mono audio, 705.6 kbit/s, uncompressed |
| 12 | QCELP | A | 8000 | 1 | RFC 3551 | Qualcomm Code Excited Linear Prediction |
| 15 | G728 | A | 8000 | 1 | RFC 3551 | ITU-T G.728 audio, 16 kbit/s |
| 18 | G729 | A | 8000 | 1 | RFC 3551 | ITU-T G.729 and G.729a audio, 8 kbit/s; Annex B is implied unless the annexb=no parameter is used |
| 26 | JPEG | V | 90000 | N/A | RFC 2435 | JPEG video |
| 31 | H261 | V | 90000 | N/A | RFC 4587 | ITU-T H.261 video |
| 32 | MPV | V | 90000 | N/A | RFC 2250 | MPEG-1 and MPEG-2 video |
| 33 | MP2T | AV | 90000 | N/A | RFC 2250 | MPEG-2 transport stream |
| 34 | H263 | V | 90000 | N/A | RFC 3551/2190 | ITU-T H.263 video |
| 96-127 | dynamic | — | — | — | RFC 3551 | Payloads defined dynamically during a session |
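When reading captures, it helps to classify a payload type number quickly: static types can be decoded from the table alone, while dynamic types need an accompanying a=rtpmap line. A minimal sketch, covering only a handful of the static entries above (the dictionary and function names are illustrative, not a complete RFC 3551 registry):

```python
# Partial lookup of static payload types (payload ID -> encoding, clock rate),
# plus the dynamic-range check for 96-127. Illustrative only.

STATIC_PAYLOADS = {
    0: ("PCMU", 8000), 3: ("GSM", 8000), 4: ("G723", 8000),
    8: ("PCMA", 8000), 9: ("G722", 8000), 18: ("G729", 8000),
}

def describe(pt):
    if pt in STATIC_PAYLOADS:
        name, clock = STATIC_PAYLOADS[pt]
        return f"static: {name}/{clock}"
    if 96 <= pt <= 127:
        return "dynamic: needs an a=rtpmap line"
    return "unassigned or reserved"

print(describe(0))    # static: PCMU/8000
print(describe(101))  # dynamic: needs an a=rtpmap line
```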