RFC 3557 - RTP Payload Format for European Telecommunications Standards Institute (ETSI) European Standard ES 201 108 Distributed Speech Recognition Encoding

Original URL: https://datatracker.ietf.org/doc/html/rfc3557

[Summary] RFC 3557 defines a payload format for carrying distributed speech recognition (DSR) data, based on the ETSI standard, over RTP (Real-time Transport Protocol). It specifies the encapsulation procedure for efficiently transmitting speech features extracted on the client side to a server. Its aim is to provide a standard technical foundation for low-overhead, accurate speech recognition services on mobile terminals and similar devices.

Network Working Group                                        Q. Xie, Ed.
Request for Comments: 3557                                Motorola, Inc.
Category: Standards Track                                      July 2003

RTP Payload Format for European Telecommunications Standards Institute (ETSI) European Standard ES 201 108 Distributed Speech Recognition Encoding

Status of this Memo

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

This document specifies an RTP payload format for encapsulating European Telecommunications Standards Institute (ETSI) European Standard (ES) 201 108 front-end signal processing feature streams for distributed speech recognition (DSR) systems.

Table of Contents

   1.  Conventions and Acronyms
   2.  Introduction
       2.1.  ETSI ES 201 108 DSR Front-end Codec
       2.2.  Typical Scenarios for Using DSR Payload Format
   3.  ES 201 108 DSR RTP Payload Format
       3.1.  Consideration on Number of FPs in Each RTP Packet
       3.2.  Support for Discontinuous Transmission
   4.  Frame Pair Formats
       4.1.  Format of Speech and Non-speech FPs
       4.2.  Format of Null FP
       4.3.  RTP header usage
   5.  IANA Considerations
       5.1.  Mapping MIME Parameters into SDP
   6.  Security Considerations
   7.  Contributors
   8.  Acknowledgments
   9.  References
       9.1.  Normative References
       9.2.  Informative References
   10. IPR Notices
   11. Authors' Addresses
   12. Editor's Address
   13. Full Copyright Statement

1. Conventions and Acronyms

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].

The following acronyms are used in this document:

DSR - Distributed Speech Recognition

ETSI - the European Telecommunications Standards Institute

FP - Frame Pair

DTX - Discontinuous Transmission

2. Introduction

Motivated by technology advances in the field of speech recognition, voice interfaces to services (such as airline information systems, unified messaging) are becoming more prevalent. In parallel, the popularity of mobile devices has also increased dramatically.

However, the voice codecs typically employed in mobile devices were designed to optimize audible voice quality and not speech recognition accuracy, and using these codecs with speech recognizers can result in poor recognition performance. For systems that can be accessed from heterogeneous networks using multiple speech codecs, recognition system designers are further challenged to accommodate the characteristics of these differences in a robust manner. Channel errors and lost data packets in these networks result in further degradation of the speech signal.

In traditional systems as described above, the entire speech recognizer lies on the server. It is forced to use incoming speech in whatever condition it arrives after the network decodes the vocoded speech. To address this problem, we use a distributed speech recognition (DSR) architecture. In such a system, the remote device acts as a thin client, also known as the front-end, in communication with a speech recognition server, also called a speech engine. The remote device processes the speech, compresses the data, and adds error protection to the bitstream in a manner optimal for speech recognition. The speech engine then uses this representation directly, minimizing the signal processing necessary and benefiting from enhanced error concealment.

To achieve interoperability with different client devices and speech engines, a common format is needed. Within the "Aurora" DSR working group of the European Telecommunications Standards Institute (ETSI), a payload has been defined and was published as a standard [ES201108] in February 2000.

For voice dialogues between a caller and a voice service, low latency is a high priority along with accurate speech recognition. While jitter in the speech recognizer input is not particularly important, many issues related to speech interaction over an IP-based connection are still relevant. Therefore, it is desirable to use the DSR payload in an RTP-based session.

2.1 ETSI ES 201 108 DSR Front-end Codec

The ETSI Standard ES 201 108 for DSR [ES201108] defines a signal processing front-end and compression scheme for speech input to a speech recognition system. Some relevant characteristics of this ETSI DSR front-end codec are summarized below.

The coding algorithm, a standard mel-cepstral technique common to many speech recognition systems, supports three raw sampling rates: 8 kHz, 11 kHz, and 16 kHz. The mel-cepstral calculation is a frame-based scheme that produces an output vector every 10 ms.

After calculation of the mel-cepstral representation, the representation is first quantized via split-vector quantization to reduce the data rate of the encoded stream. Then, the quantized vectors from two consecutive frames are put into an FP, as described in more detail in Section 4.1.

2.2 Typical Scenarios for Using DSR Payload Format

The diagrams in Figure 1 show some typical use scenarios of the ES 201 108 DSR RTP payload format.

   +--------+                     +----------+
   |IP USER |  IP/UDP/RTP/DSR     |IP SPEECH |
   |TERMINAL|-------------------->|  ENGINE  |
   |        |                     |          |
   +--------+                     +----------+

a) IP user terminal to IP speech engine

   +--------+  DSR over      +-------+                +----------+
   | Non-IP |  Circuit link  |       | IP/UDP/RTP/DSR |IP SPEECH |
   |  USER  |:::::::::::::::>|GATEWAY|--------------->|  ENGINE  |
   |TERMINAL|  ETSI payload  |       |                |          |
   +--------+  format        +-------+                +----------+

b) non-IP user terminal to IP speech engine via a gateway

   +--------+                  +-------+  DSR over       +----------+
   |IP USER |  IP/UDP/RTP/DSR  |       |  circuit link   |  Non-IP  |
   |TERMINAL|----------------->|GATEWAY|::::::::::::::::>|  SPEECH  |
   |        |                  |       |  ETSI payload   |  ENGINE  |
   +--------+                  +-------+  format         +----------+

c) IP user terminal to non-IP speech engine via a gateway

Figure 1: Typical Scenarios for Using DSR Payload Format.

For the different scenarios in Figure 1, the speech recognizer always resides in the speech engine. A DSR front-end encoder inside the User Terminal performs front-end speech processing and sends the resultant data to the speech engine in the form of "frame pairs" (FPs). Each FP contains two sets of encoded speech vectors representing 20ms of original speech.

3. ES 201 108 DSR RTP Payload Format

An ES 201 108 DSR RTP payload datagram consists of a standard RTP header [RFC3550] followed by a DSR payload. The DSR payload itself is formed by concatenating a series of ES 201 108 DSR FPs (defined in Section 4).

FPs are always packed bit-contiguously into the payload octets beginning with the most significant bit. For ES 201 108 front-end, the size of each FP is 96 bits or 12 octets (see Sections 4.1 and 4.2). This ensures that a DSR payload will always end on an octet boundary.

The following example shows a DSR RTP datagram carrying a DSR payload containing three 96-bit-long FPs (bit 0 is the MSB):

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   \                                                               \
   /                    RTP header in [RFC3550]                    /
   \                                                               \
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |                                                               |
   +                                                               +
   |                         FP #1 (96 bits)                       |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                                                               +
   |                         FP #2 (96 bits)                       |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +                                                               +
   |                         FP #3 (96 bits)                       |
   +                                                               +
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Figure 2. An example of an ES 201 108 DSR RTP payload.
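
Because every FP occupies exactly 12 octets, a receiver can recover the individual FPs from a DSR payload by slicing it at fixed offsets. The following is a minimal sketch of that step, not a definitive implementation; the function name `split_frame_pairs` and the dummy payload contents are assumptions made for illustration.

```python
FP_SIZE = 12  # octets per frame pair (96 bits), as stated in Section 3


def split_frame_pairs(payload: bytes) -> list[bytes]:
    """Split a DSR payload (RTP header already removed) into 12-octet FPs."""
    if len(payload) % FP_SIZE != 0:
        # A well-formed DSR payload always ends on an FP boundary.
        raise ValueError("DSR payload length must be a multiple of 12 octets")
    return [payload[i:i + FP_SIZE] for i in range(0, len(payload), FP_SIZE)]


# Three concatenated FPs, laid out as in Figure 2 (dummy contents).
payload = bytes([1] * 12 + [2] * 12 + [3] * 12)
fps = split_frame_pairs(payload)
print(len(fps), [fp[0] for fp in fps])
```

Rejecting payloads whose length is not a multiple of 12 is a choice made in this sketch; a deployed receiver might prefer to log and discard such packets.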

3.1 Consideration on Number of FPs in Each RTP Packet

The number of FPs per payload packet should be determined by the latency and bandwidth requirements of the DSR application using this payload format. In particular, using a smaller number of FPs per payload packet in a session will result in lowered bandwidth efficiency due to the RTP/UDP/IP header overhead, while using a larger number of FPs per packet will cause longer end-to-end delay and hence increased recognition latency. Furthermore, carrying a larger number of FPs per packet will increase the possibility of catastrophic packet loss; the loss of a large number of consecutive FPs is a situation most speech recognizers have difficulty dealing with.

It is therefore RECOMMENDED that the number of FPs per DSR payload packet be minimized, subject to meeting the application's requirements on network bandwidth efficiency. RTP header compression techniques, such as those defined in [RFC2508] and [RFC3095], should be considered to improve network bandwidth efficiency.
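
The trade-off described above can be made concrete with a little arithmetic. The sketch below assumes uncompressed IPv4 (20 octets), UDP (8 octets), and RTP (12 octets, no CSRC entries) headers; these header sizes and the function name `packet_tradeoff` are assumptions for illustration and are not part of this specification.

```python
HEADER_OCTETS = 20 + 8 + 12  # IPv4 + UDP + RTP, no options or CSRC (assumption)
FP_OCTETS = 12               # size of one FP
FP_MS = 20                   # speech duration covered by one FP


def packet_tradeoff(n_fps: int) -> tuple[float, int, float]:
    """Return (payload efficiency, packetization delay in ms, bit rate in kbit/s)."""
    payload = n_fps * FP_OCTETS
    total = HEADER_OCTETS + payload
    efficiency = payload / total      # fraction of the packet that is DSR data
    delay_ms = n_fps * FP_MS          # speech that must be buffered before sending
    kbps = total * 8 / delay_ms       # bits per millisecond equals kbit/s
    return efficiency, delay_ms, kbps


for n in (1, 2, 4):
    eff, delay, kbps = packet_tradeoff(n)
    print(f"{n} FP/packet: efficiency {eff:.0%}, delay {delay} ms, {kbps:.1f} kbit/s")
```

With header compression such as [RFC2508] or [RFC3095], the header term shrinks, which would make small FP counts considerably cheaper than this uncompressed estimate suggests.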

3.2 Support for Discontinuous Transmission

The DSR RTP payloads may be used to support discontinuous transmission (DTX) of speech, which allows DSR FPs to be sent only when speech has been detected at the terminal equipment.

In DTX a set of DSR frames coding an unbroken speech segment transmitted from the terminal to the server is called a transmission segment. A DSR frame inside such a transmission segment can be either a speech frame or a non-speech frame, depending on the nature of the section of the speech signal it represents.

The end of a transmission segment is determined at the sending end equipment when the number of consecutive non-speech frames exceeds a pre-set threshold, called the hangover time. A typical value used for the hangover time is 1.5 seconds.

After all FPs in a transmission segment are sent, the front-end SHOULD indicate the end of the current transmission segment by sending one or more Null FPs (defined in Section 4.2).
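
As an illustration of the hangover rule, the following sketch scans a sequence of per-frame speech/non-speech decisions and reports where each transmission segment would end. It assumes a hypothetical voice activity detector that yields one boolean per 10 ms frame; the function name `segment_ends` and the treatment of a segment as starting at the first speech frame are assumptions, not requirements of this document.

```python
HANGOVER_FRAMES = 150  # 1.5 s expressed in 10 ms frames (the typical value above)


def segment_ends(vad_flags, hangover=HANGOVER_FRAMES):
    """Yield the index of the frame at which each transmission segment ends.

    vad_flags: iterable of booleans, True for a speech frame.
    A segment ends once the run of consecutive non-speech frames
    exceeds `hangover`.
    """
    in_segment = False
    silent_run = 0
    for i, is_speech in enumerate(vad_flags):
        if is_speech:
            in_segment = True
            silent_run = 0
        elif in_segment:
            silent_run += 1
            if silent_run > hangover:
                in_segment = False
                silent_run = 0
                yield i


# Short hangover so the example stays small.
flags = [True] * 3 + [False] * 5 + [True] * 2 + [False] * 2
print(list(segment_ends(flags, hangover=3)))
```

At each reported index, a sender following the text above would flush the remaining FPs of the segment and then send one or more Null FPs.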

4. Frame Pair Formats

4.1 Format of Speech and Non-speech FPs

The following mel-cepstral frame MUST be used, as defined in [ES201108]:

As defined in [ES201108], pairs of the quantized 10ms mel-cepstral frames MUST be grouped together and protected with a 4-bit CRC, forming a 92-bit long FP:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Frame #1  (44 bits)                      |
   +                       +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       |          Frame #2 (44 bits)           |
   +-+-+-+-+-+-+-+-+-+-+-+-+                       +-+-+-+-+-+-+-+-+
   |                                               | CRC   |0|0|0|0|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The length of each frame is 44 bits representing 10ms of voice. The following mel-cepstral frame formats MUST be used when forming an FP:

   Frame #1 in FP:
   ===============
       (MSB)                                     (LSB)
         0     1     2     3     4     5     6     7
      +-----+-----+-----+-----+-----+-----+-----+-----+
      :  idx(2,3) |            idx(0,1)               |    Octet 1
      +-----+-----+-----+-----+-----+-----+-----+-----+
      :       idx(4,5)        |     idx(2,3) (cont)   :    Octet 2
      +-----+-----+-----+-----+-----+-----+-----+-----+
      |             idx(6,7)              |idx(4,5)(cont)  Octet 3
      +-----+-----+-----+-----+-----+-----+-----+-----+
      :idx(10,11) |              idx(8,9)             |    Octet 4
      +-----+-----+-----+-----+-----+-----+-----+-----+
      :       idx(12,13)      |   idx(10,11) (cont)   :    Octet 5
      +-----+-----+-----+-----+-----+-----+-----+-----+
                              |   idx(12,13) (cont)   :    Octet 6/1
                              +-----+-----+-----+-----+

   Frame #2 in FP:
   ===============
       (MSB)                                     (LSB)
         0     1     2     3     4     5     6     7
      +-----+-----+-----+-----+
      :        idx(0,1)       |                            Octet 6/2
      +-----+-----+-----+-----+-----+-----+-----+-----+
      |              idx(2,3)             |idx(0,1)(cont)  Octet 7
      +-----+-----+-----+-----+-----+-----+-----+-----+
      :  idx(6,7) |              idx(4,5)             |    Octet 8
      +-----+-----+-----+-----+-----+-----+-----+-----+
      :        idx(8,9)       |      idx(6,7) (cont)  :    Octet 9
      +-----+-----+-----+-----+-----+-----+-----+-----+
      |          idx(10,11)               |idx(8,9)(cont)  Octet 10
      +-----+-----+-----+-----+-----+-----+-----+-----+
      |                   idx(12,13)                  |    Octet 11
      +-----+-----+-----+-----+-----+-----+-----+-----+

Therefore, each FP represents 20ms of original speech. Note that, as shown above, each FP MUST be padded with 4 zero bits at the end so that it aligns to a 32-bit word boundary. This makes the size of an FP 96 bits, or 12 octets. Note that this padding is separate from the padding indicated by the P bit in the RTP header.

The 4-bit CRC MUST be calculated using the formula defined in 6.2.4 in [ES201108]. The definition of the indices and the determination of their value are also described in [ES201108].
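
The top-level FP layout (44-bit frame, 44-bit frame, 4-bit CRC, 4 zero bits) can be sketched as follows. This is an illustration only: each frame is treated as an opaque 44-bit integer, so the placement of the individual idx fields inside a frame is not modeled, and the CRC is taken as an input because its formula lives in [ES201108] and is not reproduced here. The names `pack_fp` and `unpack_fp` are assumptions.

```python
FRAME_BITS = 44
FRAME_MASK = (1 << FRAME_BITS) - 1


def pack_fp(frame1: int, frame2: int, crc: int) -> bytes:
    """Pack two 44-bit frames, a 4-bit CRC, and 4 zero pad bits into 12 octets."""
    if not (0 <= frame1 <= FRAME_MASK and 0 <= frame2 <= FRAME_MASK):
        raise ValueError("each frame must fit in 44 bits")
    if not 0 <= crc <= 0xF:
        raise ValueError("CRC must fit in 4 bits")
    # Bit 0 of the FP is the MSB; the low 4 bits are left as zero padding.
    value = (frame1 << 52) | (frame2 << 8) | (crc << 4)
    return value.to_bytes(12, "big")


def unpack_fp(fp: bytes) -> tuple[int, int, int]:
    """Return (frame1, frame2, crc) from a 12-octet FP."""
    if len(fp) != 12:
        raise ValueError("an FP is exactly 12 octets")
    value = int.from_bytes(fp, "big")
    return (value >> 52) & FRAME_MASK, (value >> 8) & FRAME_MASK, (value >> 4) & 0xF


fp = pack_fp(0xABCDEF01234, 0x123456789AB, 0x5)
print(fp.hex())
```

Working on one 96-bit integer keeps the sketch short; an implementation that fills the idx fields directly would more likely write into a 12-octet buffer field by field.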

4.2 Format of Null FP

A Null FP for the ES 201 108 front-end codec is defined by setting the content of the first and second frame in the FP to null (i.e., filling the first 88 bits of the FP with 0's). The 4-bit CRC MUST be calculated the same way as described in 6.2.4 in [ES201108], and 4 zeros MUST be padded to the end of the Null FP to make it 32-bit word aligned.
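
A Null FP can be built and recognized with very little code. In the sketch below the CRC value is passed in rather than computed, since the formula is defined in [ES201108] and is not reproduced in this document; the function names `make_null_fp` and `is_null_fp` are assumptions for illustration.

```python
def make_null_fp(crc: int) -> bytes:
    """Build a Null FP: 88 zero bits, the 4-bit CRC, then 4 zero pad bits."""
    if not 0 <= crc <= 0xF:
        raise ValueError("CRC must fit in 4 bits")
    # 11 zero octets cover both 44-bit frames; the last octet is CRC + padding.
    return bytes(11) + bytes([crc << 4])


def is_null_fp(fp: bytes) -> bool:
    """Return True if both frames (the first 88 bits) of the FP are all zero."""
    return len(fp) == 12 and fp[:11] == bytes(11)


null_fp = make_null_fp(0xA)
print(null_fp.hex(), is_null_fp(null_fp))
```

This check looks only at the frame contents; whether a receiver should also verify the CRC before treating an FP as the end of a transmission segment is left open here.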

4.3 RTP header usage

The format of the RTP header is specified in [RFC3550]. This payload format uses the fields of the header in a manner consistent with that specification.

The RTP timestamp corresponds to the sampling instant of the first sample encoded for the first FP in the packet. The timestamp clock frequency is the same as the sampling frequency, so the timestamp unit is in samples.

As defined by ES 201 108 front-end codec, the duration of one FP is 20 ms, corresponding to 160, 220, or 320 encoded samples with sampling rate of 8, 11, or 16 kHz being used at the front-end, respectively. Thus, the timestamp is increased by 160, 220, or 320 for each consecutive FP, respectively.
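
The timestamp rule can be checked with a short calculation. The sketch below assumes continuous transmission (no DTX gaps) and a fixed number of FPs per packet; the function name `packet_timestamps` is an assumption for illustration. The wrap at 2**32 reflects the 32-bit RTP timestamp field.

```python
# Samples covered by one 20 ms FP at each supported sampling rate.
SAMPLES_PER_FP = {8000: 160, 11000: 220, 16000: 320}


def packet_timestamps(first_ts: int, rate: int, fps_per_packet: int, n_packets: int) -> list[int]:
    """Return the RTP timestamps of n_packets consecutive DSR packets."""
    step = SAMPLES_PER_FP[rate] * fps_per_packet
    return [(first_ts + k * step) % 2**32 for k in range(n_packets)]


print(packet_timestamps(1000, 8000, 2, 3))
```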

The DSR payload for the ES 201 108 front-end codec is always an integral number of octets. If additional padding is required for some other purpose, then the P bit in the RTP header may be set and padding appended as specified in [RFC3550].

The RTP header marker bit (M) should be set following the general rules defined in [RFC3551].

The assignment of an RTP payload type for this new packet format is outside the scope of this document, and will not be specified here. It is expected that the RTP profile under which this payload format is being used will assign a payload type for this encoding or specify that the payload type is to be bound dynamically.
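
To show how the header fields fit together with the payload, the following sketch prepends a fixed 12-octet RTP header to a list of FPs. It assumes no CSRC entries, no header extension, and no RTP padding; the function name `build_rtp_packet` and the example payload type 101 (a dynamically assigned value) are assumptions for illustration.

```python
import struct


def build_rtp_packet(payload_type: int, seq: int, timestamp: int, ssrc: int,
                     fps: list[bytes], marker: bool = False) -> bytes:
    """Return a fixed 12-octet RTP header followed by the concatenated FPs."""
    if not 0 <= payload_type <= 127:
        raise ValueError("payload type must fit in 7 bits")
    if any(len(fp) != 12 for fp in fps):
        raise ValueError("every FP must be exactly 12 octets")
    byte0 = 2 << 6                              # version 2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | payload_type   # marker bit, then payload type
    header = struct.pack("!BBHII", byte0, byte1,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + b"".join(fps)


packet = build_rtp_packet(101, 1, 160, 0x12345678, [bytes(12)])
print(len(packet), packet[:12].hex())
```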

5. IANA Considerations

One new MIME subtype registration is required for this payload type, as defined below.

This section also defines the optional parameters that may be used to describe a DSR session. The parameters are defined here as part of the MIME subtype registration. A mapping of the parameters into the Session Description Protocol (SDP) [RFC2327] is also provided in 5.1 for those applications that use SDP.

Media Type name: audio

Media subtype name: dsr-es201108

Required parameters: none

Optional parameters:

rate: Indicates the sample rate of the speech. Valid values include: 8000, 11000, and 16000. If this parameter is not present, a sample rate of 8000 is assumed.

maxptime: The maximum amount of media which can be encapsulated in each packet, expressed as time in milliseconds. The time shall be calculated as the sum of the time the media present in the packet represents. The time SHOULD be a multiple of the frame pair size (i.e., one FP <-> 20ms).

If this parameter is not present, maxptime is assumed to be 80ms.

Note that, since the performance of most speech recognizers is extremely sensitive to consecutive FP losses, a user of the payload format who expects a high packet loss ratio for the session MAY consider explicitly choosing a maxptime value for the session that is shorter than the default value.

ptime: see RFC2327 [RFC2327].

Encoding considerations : This type is defined for transfer via RTP [RFC3550] as described in Sections 3 and 4 of RFC 3557.

Security considerations : See Section 6 of RFC 3557.

Person & email address to contact for further information: Qiaobing.Xie@motorola.com

Intended usage: COMMON. It is expected that many VoIP applications (as well as mobile applications) will use this type.

Author/Change controller: Qiaobing.Xie@motorola.com IETF Audio/Video transport working group

5.1 Mapping MIME Parameters into SDP

The information carried in the MIME media type specification has a specific mapping to fields in the Session Description Protocol (SDP) [RFC2327], which is commonly used to describe RTP sessions. When SDP is used to specify sessions employing the ES 201 108 DSR codec, the mapping is as follows:

o The MIME type ("audio") goes in SDP "m=" as the media name.

o The MIME subtype ("dsr-es201108") goes in SDP "a=rtpmap" as the encoding name.

o The optional parameter "rate" also goes in "a=rtpmap" as clock rate.

o The optional parameters "ptime" and "maxptime" go in the SDP "a=ptime" and "a=maxptime" attributes, respectively.

Example of usage of ES 201 108 DSR:

      m=audio 49120 RTP/AVP 101
      a=rtpmap:101 dsr-es201108/8000
      a=maxptime:40
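
A receiver could read these attributes back out of the session description along the following lines. This is a deliberately small sketch, not a general SDP parser: it handles only the "a=rtpmap" and "a=maxptime" lines shown above, applies the defaults given in the registration (8000 and 80 ms), and derives the largest number of FPs per packet from maxptime. The function name `parse_dsr_sdp` and the returned dictionary keys are assumptions.

```python
FP_MS = 20  # one FP carries 20 ms of speech


def parse_dsr_sdp(sdp: str) -> dict:
    """Extract rate and maxptime for a dsr-es201108 stream from SDP text."""
    params = {"rate": 8000, "maxptime": 80}  # defaults from the registration
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("a=rtpmap:"):
            # Form: a=rtpmap:<payload type> <encoding name>/<clock rate>
            _pt, _, encoding = line[len("a=rtpmap:"):].partition(" ")
            parts = encoding.split("/")
            if parts[0].lower() == "dsr-es201108" and len(parts) > 1:
                params["rate"] = int(parts[1])
        elif line.startswith("a=maxptime:"):
            params["maxptime"] = int(line[len("a=maxptime:"):])
    params["max_fps_per_packet"] = params["maxptime"] // FP_MS
    return params


sdp = """m=audio 49120 RTP/AVP 101
a=rtpmap:101 dsr-es201108/8000
a=maxptime:40"""
info = parse_dsr_sdp(sdp)
print(info)
```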

6. Security Considerations

Implementations using the payload defined in this specification are subject to the security considerations discussed in the RTP specification [RFC3550] and the RTP profile [RFC3551]. This payload does not specify any different security services.

7. Contributors

The following individuals contributed to the design of this payload format and the writing of this document: Q. Xie (Motorola), D. Pearce (Motorola), S. Balasuriya (Motorola), Y. Kim (VerbalTek), S. H. Maes (IBM), and H. Garudadri (Qualcomm).

8. Acknowledgments

The design presented here benefits greatly from an earlier work on DSR RTP payload design by Jeff Meunier and Priscilla Walther. The authors also wish to thank Brian Eberman, John Lazzaro, Magnus Westerlund, Rainu Pierce, Priscilla Walther, and others for their review and valuable comments on this document.

9. References

9.1 Normative References

[ES201108] European Telecommunications Standards Institute (ETSI) Standard ES 201 108, "Speech Processing, Transmission and Quality Aspects (STQ); Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithms," Ver. 1.1.2, April 11, 2000.

[RFC3550] Schulzrinne, H., Casner, S., Jacobson, V. and R. Frederick, "RTP: A Transport Protocol for Real-Time Applications", RFC 3550, July 2003.

[RFC2026] Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, October 1996.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[RFC2327] Handley, M. and V. Jacobson, "SDP: Session Description Protocol", RFC 2327, April 1998.

9.2 Informative References

[RFC3551] Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video Conferences with Minimal Control", RFC 3551, July 2003.

[RFC2508] Casner, S. and V. Jacobson, "Compressing IP/UDP/RTP Headers for Low-Speed Serial Links", RFC 2508, February 1999.

[RFC3095] Bormann, C., Burmeister, C., Degermark, M., Fukushima, H., Hannu, H., Jonsson, L-E, Hakenberg, R., Koren, T., Le, K., Liu, Z., Martensson, A., Miyazaki, A., Svanbro, K., Wiebke, T., Yoshimura, T. and H. Zheng, "RObust Header Compression (ROHC): Framework and four profiles", RFC 3095, July 2001.

10. IPR Notices

The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.

11. Authors' Addresses

   David Pearce
   Motorola Labs
   UK Research Laboratory
   Jays Close
   Viables Industrial Estate
   Basingstoke, HANTS, RG22 4PD

   Phone: +44 (0)1256 484 436
   EMail: bdp003@motorola.com

   Senaka Balasuriya
   Motorola, Inc.
   600 U.S Highway 45
   Libertyville, IL 60048, USA

   Phone: +1-847-523-0440
   EMail: Senaka.Balasuriya@motorola.com

   Yoon Kim
   VerbalTek, Inc.
   2921 Copper Rd.
   Santa Clara, CA 95051

   Phone: +1-408-768-4974
   EMail: yoonie@verbaltek.com

   Stephane H. Maes, PhD,
   Oracle
   500 Oracle Parkway, M/S 4op634
   Redwood City, CA 94065 USA

   Phone: +1-650-607-6296
   EMail: stephane.maes@oracle.com

   Hari Garudadri
   Qualcomm Inc.
   5775, Morehouse Dr.
   San Diego, CA 92121-1714, USA

   Phone: +1-858-651-6383
   EMail: hgarudad@qualcomm.com

12. Editor's Address

   Qiaobing Xie
   Motorola, Inc.
   1501 W. Shure Drive, 2-F9
   Arlington Heights, IL 60004, USA

   Phone: +1-847-632-3028
   EMail: Qiaobing.Xie@motorola.com

13. Full Copyright Statement

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.
