RFC 9839 - Unicode Character Repertoire Subsets

Source URL : https://www.rfc-editor.org/rfc/rfc9839.html
Title : RFC 9839 - Unicode Character Repertoire Subsets

Internet Engineering Task Force (IETF)                           T. Bray
Request for Comments: 9839                           Textuality Services
Category: Standards Track                                     P. Hoffman
ISSN: 2070-1721                                                    ICANN
                                                             August 2025

Unicode Character Repertoire Subsets

Abstract

This document discusses subsets of the Unicode character repertoire for use in protocols and data formats and specifies three subsets recommended for use in IETF specifications.

Status of This Memo

This is an Internet Standards Track document.

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 7841.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at https://www.rfc-editor.org/info/rfc9839.

Copyright Notice

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.

Table of Contents

   1.  Introduction
     1.1.  Notation
   2.  Characters and Code Points
     2.1.  Encoding Forms
     2.2.  Problematic Code Points
       2.2.1.  Surrogates
       2.2.2.  Control Codes
       2.2.3.  Noncharacters
   3.  Dealing with Problematic Code Points
   4.  Subsets
     4.1.  Unicode Scalars
     4.2.  XML Characters
     4.3.  Unicode Assignables
   5.  Using Subsets
   6.  IANA Considerations
   7.  Security Considerations
   8.  References
     8.1.  Normative References
     8.2.  Informative References
   Acknowledgements
   Authors' Addresses

1. Introduction

Protocols and data formats frequently contain or are made up of textual data. Such text is normally composed of Unicode [UNICODE] characters, to support use by speakers of many languages. Unicode characters are represented by numeric code points, and the "set of all Unicode code points" is generally not a good choice for use in text fields. Unicode recognizes different types of code points, not all of which are appropriate in protocols or even associated with characters. Therefore, even if the desire is to support "all Unicode characters", a subset of the Unicode code point repertoire should be specified. Subsets such as those discussed in this document are appropriate choices when more-specific limitations do not apply.

In this document, "subset" means a subset of the Unicode character repertoire. This document specifies subsets that exclude some or all of the code points that are "problematic" as defined in Section 2.2. Authors should have a way to concisely and exactly reference a stable specification that identifies which subset a protocol or data format accepts.

This document discusses issues that apply in choosing subsets, names two subsets that have been popular in practice, and suggests one new subset. The intended use is to serve as a convenient target for cross-reference from other specifications whose authors wish to exclude problematic code points from the data format or protocol being specified.

Note that this document only provides guidance on avoiding the use of code points that cannot be used for interoperable interchange of Unicode textual data. Dealing with strings, particularly in the context of user interfaces, requires addressing language, text rendering direction, alternate representations of the same abstract character, and so on. These issues, among many others, led to efforts by the Unicode Consortium, efforts by the IETF such as [IDN] and [PRECIS], and internationalization efforts by W3C such as [W3C-CHAR]. The results of these efforts should be consulted by anyone engaging in such work.

1.1. Notation

In this document, the numeric values assigned to Unicode characters are provided in hexadecimal. This document uses Unicode's standard notation of "U+" followed by four or more hexadecimal digits. For example, "A", decimal 65, is expressed as U+0041, and "🖤" (Black Heart), decimal 128,420, is U+1F5A4.

Groups of numeric values described in Section 4 are given in ABNF [RFC5234]. In ABNF, hexadecimal values are preceded by "%x" rather than "U+".

All the numeric ranges in this document are inclusive.

The subsets are described in ABNF.
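
The two notations described above can be illustrated with a minimal Python sketch. The helper names `u_plus` and `abnf_hex` are hypothetical and chosen only for this illustration; they are not part of this specification.

```python
def u_plus(ch: str) -> str:
    """Format one character in "U+" notation, using at least four hex digits."""
    return f"U+{ord(ch):04X}"


def abnf_hex(ch: str) -> str:
    """Format the same numeric value the way ABNF writes it, with a "%x" prefix."""
    return f"%x{ord(ch):X}"


# The two examples from the text: "A" and the Black Heart character.
for ch in ("A", "\U0001F5A4"):
    print(u_plus(ch), abnf_hex(ch), ord(ch))
```

For "A" this prints `U+0041 %x41 65`, and for the Black Heart it prints `U+1F5A4 %x1F5A4 128420`.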

2. Characters and Code Points

Definition D9 in Section 3.4 of [UNICODE] defines "Unicode codespace" as "a range of integers from 0 to 10FFFF_16". Definition D10 defines "code point" as "Any value in the Unicode codespace".

The Unicode Standard's definition of "Unicode character" is conceptual. However, each Unicode character is assigned a code point, used to represent the characters in computer memory and storage systems and to specify allowed subsets in specifications.

There are 1,114,112 (17 * 2^16) code points; as of Unicode 16.0 (2024), about 155,000 have been assigned to characters. Since unassigned code points regularly become assigned when new characters are added to Unicode, it is usually not a good practice to specify that unassigned code points should be avoided.

2.1. Encoding Forms

Unicode describes a variety of encoding forms that can be used to marshal code points into byte sequences. A survey of these is beyond the scope of this document. However, it is useful to note that "UTF-16" represents each code point with one or two 16-bit chunks, while "UTF-8" uses variable-length byte sequences [RFC3629].
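
As a rough illustration of the difference between the two encoding forms named above, the following Python sketch counts how many UTF-8 bytes and how many 16-bit UTF-16 units a single code point needs. The function names are hypothetical; the codecs used are those of the Python standard library.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes UTF-8 uses for one code point."""
    return len(ch.encode("utf-8"))


def utf16_units(ch: str) -> int:
    """Number of 16-bit chunks UTF-16 uses for one code point."""
    # "utf-16-be" emits no byte order mark, so bytes / 2 is the unit count.
    return len(ch.encode("utf-16-be")) // 2


for ch in ("A", "\u00e9", "\u65e5", "\U0001F5A4"):
    print(f"U+{ord(ch):04X}: UTF-8 {utf8_length(ch)} byte(s), "
          f"UTF-16 {utf16_units(ch)} unit(s)")
```

Only the last sample, which lies above U+FFFF, needs two 16-bit units in UTF-16, while its UTF-8 form is four bytes long.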

The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277], says "Protocols MUST be able to use the UTF-8 charset", which becomes a mandate to use UTF-8 for any protocol or data format that specifies a single encoding form. UTF-8 is widely used for interoperable data formats such as JSON, YAML, CBOR, and XML.

2.2. Problematic Code Points

This section classifies all the code points that can never represent useful text and, in some cases, can lead to software misbehavior as "problematic". This is a low bar; the PRECIS [RFC8264] framework's "IdentifierClass" and "FreeformClass" exclude many more code points that can cause problems when displayed to humans, in some cases presenting security risks. Specifications of fields in protocols and data formats whose contents are designed for display to and interactions with humans would benefit from careful consideration of the issues described by PRECIS; its more-restrictive subsets might be better choices than those specified in this document.

Definition D10a in Section 3.4 of [UNICODE] defines seven code point types. Three types of code points are assigned to entities that are not actually characters or whose value as Unicode characters in text fields is questionable: "Surrogate", "Control", and "Noncharacter". In this document, "problematic" refers to code points whose type is "Surrogate" or "Noncharacter" and to "legacy controls" as defined in Section 2.2.2.2 below.

Definition D49 in [UNICODE] concerns the "private-use" type, and Section 3.5.10 states that they "are considered to be assigned characters". Section 23.5 further states that these characters' "use may be determined by private agreement among cooperating users". Because private-use code points may have uses based on private agreements, this document does not classify them as "problematic".

2.2.1. Surrogates

A total of 2,048 code points, in the range U+D800-U+DFFF, are divided into two blocks called "high surrogates" and "low surrogates"; collectively, the 2,048 code points are referred to as "surrogates". Section 23.6 of [UNICODE] specifies how surrogates may be used in Unicode texts encoded in UTF-16, where a high-surrogate/low-surrogate pair represents a code point greater than U+FFFF.
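
The pairing arithmetic can be sketched as follows. This is an illustration only: the function names are hypothetical, and the split of the surrogate range into high surrogates (U+D800-U+DBFF) and low surrogates (U+DC00-U+DFFF) is taken from the UTF-16 definition rather than from the text above.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a high/low surrogate pair to the code point it represents."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: U+{high:04X}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: U+{low:04X}")
    # Each surrogate carries 10 bits of the offset above U+FFFF.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_to_surrogates(cp: int) -> tuple[int, int]:
    """Map a code point greater than U+FFFF to its surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"not above U+FFFF: U+{cp:04X}")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


high, low = split_to_surrogates(0x1F5A4)
print(f"U+{high:04X} U+{low:04X}")  # U+D83D U+DDA4
```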

A surrogate that occurs in text encoded in any encoding form other than UTF-16 has no meaning. In particular, Section 3.9.3 of [UNICODE] forbids representing a surrogate in UTF-8.
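
One way to observe this rule in practice, assuming CPython's built-in strict "utf-8" codec, is to try encoding a string that holds an unpaired surrogate. The helper name `can_encode_utf8` is hypothetical.

```python
def can_encode_utf8(text: str) -> bool:
    """Return True if the strict UTF-8 codec accepts every code point in text."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        # Raised for surrogate code points, which UTF-8 must not represent.
        return False
    return True


print(can_encode_utf8("plain text"))  # True
print(can_encode_utf8("\ud800"))      # False: a lone high surrogate
```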

2.2.2. Control Codes

Section 23.1 of [UNICODE] introduces the control codes for compatibility with legacy pre-Unicode standards. They comprise 65 code points in the ranges U+0000-U+001F ("C0 controls") and U+0080-U+009F ("C1 controls"), plus U+007F, "DEL".

2.2.2.1. Useful Controls

The C0 controls include newline (U+000A), carriage return (U+000D), and tab (U+0009); this document refers to these three characters as the "useful controls".

2.2.2.2. Legacy Controls

Aside from the useful controls, both the C0 and C1 control codes are mostly obsolete and generally lack interoperable semantics. This document uses the phrase "legacy controls" to describe control codes that are not useful controls.

Because the code points for C0 controls include the 32 smallest integers including zero, they are likely to occur in data as a result of programming errors.
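
The classification of control codes in Section 2.2.2 can be written as a small Python sketch. The names below are hypothetical; the ranges are the ones given in the text.

```python
# Tab, newline, and carriage return: the "useful controls".
USEFUL_CONTROLS = frozenset({0x09, 0x0A, 0x0D})


def is_control(cp: int) -> bool:
    """True for C0 controls, DEL, and C1 controls."""
    return 0x00 <= cp <= 0x1F or cp == 0x7F or 0x80 <= cp <= 0x9F


def is_legacy_control(cp: int) -> bool:
    """True for control codes that are not useful controls."""
    return is_control(cp) and cp not in USEFUL_CONTROLS


# All control codes lie below U+0100, so a short scan is enough.
controls = [cp for cp in range(0x100) if is_control(cp)]
legacy = [cp for cp in controls if is_legacy_control(cp)]
print(len(controls), len(legacy))  # 65 62
```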

2.2.3. Noncharacters

Certain code points are classified as "noncharacters", and [UNICODE] asserts repeatedly that they are not designed or used for open interchange.

Code points are organized into 17 "planes", each containing 2^16 code points. The last two code points in each plane are noncharacters: U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to U+10FFFE, U+10FFFF.

The code points in the range U+FDD0-U+FDEF are noncharacters.
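
Both groups of noncharacters can be tested with one predicate. The sketch below is illustrative and the names are hypothetical; the total of 66 follows from 2 noncharacters in each of the 17 planes plus the 32 code points in U+FDD0-U+FDEF.

```python
PLANES = 17
PLANE_SIZE = 2 ** 16


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0-U+FDEF and for the last two code points of each plane."""
    # (cp & 0xFFFE) == 0xFFFE holds when the low 16 bits are FFFE or FFFF.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


noncharacters = [cp for cp in range(PLANES * PLANE_SIZE) if is_noncharacter(cp)]
print(PLANES * PLANE_SIZE)   # 1114112 code points in total
print(len(noncharacters))    # 66
```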

3. Dealing with Problematic Code Points

"Maintaining Robust Protocols" [RFC9413] provides a thorough discussion of strategies for dealing with issues in input data.

Different types of problematic code points cause different issues. Noncharacters and legacy controls are unlikely to cause software failures, but they cannot usefully be displayed to humans, and they can be used in attacks based on attempting to display text that includes them.

The behavior of software that encounters surrogates is unpredictable and differs among programming-language implementations, even between different API calls in the same language.

Section 3.9 of [UNICODE] makes it clear that a UTF-8 byte sequence that would map to a surrogate is ill-formed. If a specification requires that input data be encoded with UTF-8, and if all input were well-formed, implementors would never have to concern themselves with surrogates.
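
A decoder that enforces well-formedness rejects such byte sequences. The sketch below assumes CPython's strict "utf-8" codec and uses a hypothetical helper name; the bytes ED A0 80 are the three-byte pattern that would map to U+D800.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if the strict UTF-8 decoder accepts the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_well_formed_utf8(b"\xed\x9f\xbf"))  # True: U+D7FF, just below the surrogates
print(is_well_formed_utf8(b"\xed\xa0\x80"))  # False: would map to U+D800
print(is_well_formed_utf8(b"\xee\x80\x80"))  # True: U+E000, just above the surrogates
```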

Unfortunately, industry experience teaches that problematic code points, including surrogates, can and do occur in program input where the source of input data is not controlled by the implementor. In particular, the specification of JSON allows any code point to appear in object member names and string values [RFC8259].

For example, the following is a conforming JSON text:

   {"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}

The value of the "example" field contains the C0 control NUL, the C1 control "CHARACTER TABULATION WITH JUSTIFICATION", an unpaired surrogate, and the noncharacter U+7FFFF encoded per JSON rules as two escaped UTF-16 surrogate code points as described in Section 7 of [RFC8259]. It is unlikely to be useful as the value of a text field. That value cannot be serialized into well-formed UTF-8, but the behavior of libraries asked to parse the sample is unpredictable; some will silently parse this and generate an ill-formed UTF-8 string.
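
As one data point on library behavior, the sketch below parses the sample with the `json` module from the Python standard library. It shows only what CPython's parser does with this input; other libraries and languages may behave differently, as noted above. The helper names are hypothetical.

```python
import json

# The conforming JSON text from the example, kept as a raw string so that
# the \u escapes reach the JSON parser unchanged.
SAMPLE = r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'


def parse_example(text: str = SAMPLE) -> list[int]:
    """Parse the JSON text and return the code points of the "example" value."""
    value = json.loads(text)["example"]
    return [ord(ch) for ch in value]


def serializes_to_utf8(text: str = SAMPLE) -> bool:
    """Report whether the parsed value can be encoded as well-formed UTF-8."""
    value = json.loads(text)["example"]
    try:
        value.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


print([f"U+{cp:04X}" for cp in parse_example()])
print(serializes_to_utf8())
```

Here the parser accepts the text without complaint: the escaped pair becomes U+7FFFF, the unpaired surrogate U+DEAD is kept in the resulting string, and the failure appears only later, when the value is encoded as UTF-8.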

Two reasonable options for dealing with problematic input are either rejecting text containing problematic code points or replacing the problematic code points with placeholders.

Silently deleting an ill-formed part of a string is a known security risk. Responding to that risk, Section 3.2 of [UNICODE] recommends dealing with ill-formed byte sequences by signaling an error or replacing problematic code points, ideally with "�" (U+FFFD, REPLACEMENT CHARACTER).
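
The two options, rejecting or replacing, might look like the following in Python. This is a sketch under stated assumptions rather than a prescribed implementation: `is_problematic`, `reject_problematic`, and `replace_problematic` are hypothetical names, and "problematic" follows the definition in Section 2.2 (surrogates, legacy controls, and noncharacters).

```python
REPLACEMENT_CHARACTER = "\uFFFD"


def is_problematic(cp: int) -> bool:
    """True for surrogates, legacy controls, and noncharacters."""
    surrogate = 0xD800 <= cp <= 0xDFFF
    legacy_control = (cp <= 0x1F and cp not in (0x09, 0x0A, 0x0D)) or 0x7F <= cp <= 0x9F
    noncharacter = 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
    return surrogate or legacy_control or noncharacter


def reject_problematic(text: str) -> str:
    """Option 1: signal an error at the first problematic code point."""
    for index, ch in enumerate(text):
        if is_problematic(ord(ch)):
            raise ValueError(f"problematic code point U+{ord(ch):04X} at index {index}")
    return text


def replace_problematic(text: str) -> str:
    """Option 2: substitute U+FFFD for each problematic code point."""
    return "".join(
        REPLACEMENT_CHARACTER if is_problematic(ord(ch)) else ch for ch in text
    )


print(replace_problematic("a\x00b\udeadc"))
```

Neither function deletes anything: the replacing variant returns a string of the same length as its input, which avoids the risk associated with silent deletion.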

4. Subsets

This section describes three increasingly restrictive subsets that can be used in specifying acceptable content for text fields in protocols and data types. Specifications can refer to these subsets by the names "Unicode Scalars", "XML Characters", and "Unicode Assignables".

4.1. Unicode Scalars

Definition D76 in Section 3.9 of [UNICODE] defines the term "Unicode scalar value" as "Any Unicode code point except high-surrogate and low-surrogate code points".

The "Unicode Scalars" subset can be expressed as an ABNF production:

   unicode-scalar =
      %x0-D7FF /    ; exclude surrogates
      %xE000-10FFFF

This subset is the default for Concise Binary Object Representation (CBOR) [RFC8949] and has the advantage of excluding surrogates. However, it includes legacy controls and noncharacters.
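
The `unicode-scalar` production translates directly into a range check. The following Python sketch uses hypothetical function names and simply mirrors the two ABNF ranges.

```python
def is_unicode_scalar(cp: int) -> bool:
    """Mirror of the unicode-scalar production: everything except surrogates."""
    return 0x0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def all_unicode_scalars(text: str) -> bool:
    """True if every code point in the string is in the Unicode Scalars subset."""
    return all(is_unicode_scalar(ord(ch)) for ch in text)


print(all_unicode_scalars("\x00\uffff"))  # True: legacy controls and noncharacters pass
print(all_unicode_scalars("\ud800"))      # False: surrogates are excluded
```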

4.2. XML Characters

The XML 1.0 Specification (Fifth Edition) [XML], in its grammar production labeled "Char", specifies a subset of Unicode code points that excludes surrogates, legacy C0 controls, and the noncharacters U+FFFE and U+FFFF.

The "XML Characters" subset can be expressed as an ABNF production:

   xml-character =
      %x9 / %xA / %xD /   ; useful controls
      %x20-D7FF /         ; exclude surrogates
      %xE000-FFFD /       ; exclude FFFE and FFFF nonchars
      %x10000-10FFFF

While this subset does not exclude all the problematic code points, the C1 controls are less likely than the C0 controls to appear erroneously in data and have not been observed to be a frequent source of problems. Also, the noncharacters greater in value than U+FFFF are rarely encountered.
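
Because the `xml-character` production is a short list of ranges, it can also be written as a regular-expression character class. The sketch below uses Python's `re` module; the names `XML_CHARACTERS` and `is_xml_character_string` are hypothetical.

```python
import re

# One character class per ABNF alternative: the useful controls, then the
# three ranges that skip the surrogates and the noncharacters FFFE and FFFF.
XML_CHARACTERS = re.compile(
    r"[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*"
)


def is_xml_character_string(text: str) -> bool:
    """True if every code point in the string is in the XML Characters subset."""
    return XML_CHARACTERS.fullmatch(text) is not None


print(is_xml_character_string("tab\tand newline\n"))  # True
print(is_xml_character_string("\x00"))                # False: legacy C0 control
print(is_xml_character_string("\x85"))                # True: C1 controls are not excluded
print(is_xml_character_string("\U0001FFFF"))          # True: noncharacter above U+FFFF
```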

4.3. Unicode Assignables

This document defines the "Unicode Assignables" subset as all the Unicode code points that are not problematic. This, a proper subset of each of the others, comprises all code points that are currently assigned, excluding legacy control codes, or that might be assigned in the future.

Unicode Assignables can be expressed as an ABNF production:

   unicode-assignable =
      %x9 / %xA / %xD /               ; useful controls
      %x20-7E /                       ; exclude C1 controls and DEL
      %xA0-D7FF /                     ; exclude surrogates
      %xE000-FDCF /                   ; exclude FDD0 nonchars
      %xFDF0-FFFD /                   ; exclude FFFE and FFFF nonchars
      %x10000-1FFFD / %x20000-2FFFD / ; (repeat per plane)
      %x30000-3FFFD / %x40000-4FFFD /
      %x50000-5FFFD / %x60000-6FFFD /
      %x70000-7FFFD / %x80000-8FFFD /
      %x90000-9FFFD / %xA0000-AFFFD /
      %xB0000-BFFFD / %xC0000-CFFFD /
      %xD0000-DFFFD / %xE0000-EFFFD /
      %xF0000-FFFFD / %x100000-10FFFD
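
The production above can be transcribed into a table of inclusive ranges. The Python sketch below is illustrative, with hypothetical names; the per-plane ranges are generated instead of being listed one by one, and the subset size is computed from the table.

```python
# Inclusive (low, high) pairs transcribed from the unicode-assignable production.
UNICODE_ASSIGNABLE_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),  # useful controls
    (0x20, 0x7E),                        # exclude C1 controls and DEL
    (0xA0, 0xD7FF),                      # exclude surrogates
    (0xE000, 0xFDCF),                    # exclude FDD0 nonchars
    (0xFDF0, 0xFFFD),                    # exclude FFFE and FFFF nonchars
] + [
    # Planes 1 through 16, each without its last two code points.
    (plane << 16, (plane << 16) | 0xFFFD) for plane in range(1, 17)
]


def is_unicode_assignable(cp: int) -> bool:
    """True if the code point falls inside one of the inclusive ranges."""
    return any(low <= cp <= high for low, high in UNICODE_ASSIGNABLE_RANGES)


def subset_size() -> int:
    """Number of code points in the subset."""
    return sum(high - low + 1 for low, high in UNICODE_ASSIGNABLE_RANGES)


print(subset_size())  # 1111936
```

The size agrees with the counts given earlier: 1,114,112 code points minus 2,048 surrogates, 62 legacy controls, and 66 noncharacters.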

5. Using Subsets

Many IETF specifications rely on well-known data formats such as JSON, Internet JSON (I-JSON), CBOR, YAML, and XML. These formats specify default subsets. For example, JSON allows object member names and string values to include any Unicode code point, including all the problematic types.

A protocol based on JSON can be made more robust and implementor-friendly by restricting the contents of object member names and string values to one of the subsets described in Section 4. Equivalent restrictions are possible for other packaging formats such as I-JSON, XML, YAML, and CBOR.

Note that escaping techniques such as those in the JSON example in Section 3 cannot be used to circumvent this sort of restriction, which applies to data content, not textual representation in packaging formats. If a specification restricted a JSON field value to the Unicode Assignables, the example would remain a conforming JSON text but the data it represents would not constitute Unicode Assignable code points.
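
A sketch of how such a restriction might be enforced after parsing is shown below. It assumes Python's standard `json` module and a hypothetical protocol that restricts all member names and string values to the Unicode Assignables; the function names and the path format are inventions of this illustration.

```python
import json


def is_unicode_assignable(cp: int) -> bool:
    """Compact form of the unicode-assignable production."""
    if cp in (0x9, 0xA, 0xD):
        return True
    if cp < 0x20 or 0x7F <= cp <= 0x9F:
        return False                      # legacy controls
    if 0xD800 <= cp <= 0xDFFF:
        return False                      # surrogates
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return False                      # noncharacters
    return cp <= 0x10FFFF


def in_subset(text: str) -> bool:
    return all(is_unicode_assignable(ord(ch)) for ch in text)


def find_violations(node, path="$"):
    """Yield the paths of member names and string values outside the subset."""
    if isinstance(node, str):
        if not in_subset(node):
            yield path
    elif isinstance(node, dict):
        for name, child in node.items():
            if not in_subset(name):
                yield f"{path} (member name)"
            yield from find_violations(child, f"{path}.{name}")
    elif isinstance(node, list):
        for index, child in enumerate(node):
            yield from find_violations(child, f"{path}[{index}]")


def violations(json_text: str) -> list[str]:
    # The check runs on the parsed data, so escapes in the text do not hide anything.
    return list(find_violations(json.loads(json_text)))


print(violations(r'{"example": "\u0000\u0089\uDEAD\uD9BF\uDFFF"}'))  # ['$.example']
```

The check is applied to the data content produced by the parser, which is why writing a problematic code point as a `\u` escape does not get it past the restriction.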

6. IANA Considerations

This document has no IANA actions.

7. Security Considerations

Section 3 of this document discusses security issues.

Unicode Security Considerations [TR36] is a wide-ranging survey of the issues implementors should consider while writing software to process Unicode text. Unicode Source Code Handling [TR55] discusses use of Unicode in programming languages, with a focus on security issues. Many of the attacks they discuss are aimed at deceiving human readers, but vulnerabilities involving issues such as surrogates and noncharacters are also covered and, in fact, can contribute to human-deceiving exploits.

The security considerations in Section 12 of [RFC8264] generally apply to this document as well.

Note that the Unicode-character subsets specified in this document are increasingly restrictive, omitting more and more problematic code points, and thus should be less and less susceptible to many of these exploits. The subset in Section 4.3, "Unicode Assignables", excludes all of these code points.

8. References

8.1. Normative References

   [RFC5234]  Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", STD 68, RFC 5234,
              DOI 10.17487/RFC5234, January 2008,
              <https://www.rfc-editor.org/info/rfc5234>.

   [TR36]     Davis, M., Ed. and M. Suignard, Ed., "Unicode Security
              Considerations", <https://www.unicode.org/reports/tr36/>.

   [TR55]     Leroy, R., Ed. and M. Davis, Ed., "Unicode Source Code
              Handling", <https://www.unicode.org/reports/tr55/>.

   [UNICODE]  The Unicode Consortium, "The Unicode Standard",
              <http://www.unicode.org/versions/latest/>.  Note that this
              reference is to the latest version of Unicode, rather than
              to a specific release.  It is not expected that future
              changes in the Unicode Standard will affect the referenced
              definitions.

8.2. Informative References

   [IDN]      "Internationalized Domain Name Working Group",
              <https://datatracker.ietf.org/group/idn/>.

   [PRECIS]   "PRECIS Working Group",
              <https://datatracker.ietf.org/group/precis/>.

   [RFC2277]  Alvestrand, H., "IETF Policy on Character Sets and
              Languages", BCP 18, RFC 2277, DOI 10.17487/RFC2277,
              January 1998, <https://www.rfc-editor.org/info/rfc2277>.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, DOI 10.17487/RFC3629, November
              2003, <https://www.rfc-editor.org/info/rfc3629>.

   [RFC8259]  Bray, T., Ed., "The JavaScript Object Notation (JSON) Data
              Interchange Format", STD 90, RFC 8259,
              DOI 10.17487/RFC8259, December 2017,
              <https://www.rfc-editor.org/info/rfc8259>.

   [RFC8264]  Saint-Andre, P. and M. Blanchet, "PRECIS Framework:
              Preparation, Enforcement, and Comparison of
              Internationalized Strings in Application Protocols",
              RFC 8264, DOI 10.17487/RFC8264, October 2017,
              <https://www.rfc-editor.org/info/rfc8264>.

   [RFC8949]  Bormann, C. and P. Hoffman, "Concise Binary Object
              Representation (CBOR)", STD 94, RFC 8949,
              DOI 10.17487/RFC8949, December 2020,
              <https://www.rfc-editor.org/info/rfc8949>.

   [RFC9413]  Thomson, M. and D. Schinazi, "Maintaining Robust
              Protocols", RFC 9413, DOI 10.17487/RFC9413, June 2023,
              <https://www.rfc-editor.org/info/rfc9413>.

   [W3C-CHAR] W3C, "Character encodings: Essential concepts",
              <https://www.w3.org/International/articles/definitions-
              characters/>.

   [XML]      Bray, T., Ed., Paoli, J., Ed., McQueen, C.M., Ed., Maler,
              E., Ed., and F. Yergeau, Ed., "Extensible Markup Language
              (XML) 1.0 (Fifth Edition)", W3C Recommendation, 26
              November 2008,
              <http://www.w3.org/TR/2008/REC-xml-20081126/>.

Acknowledgements

Thanks are due to Guillaume Fortin-Debigaré, who filed an errata report against RFC 8259, "The JavaScript Object Notation (JSON) Data Interchange Format", noting frequent references to "Unicode characters", when in fact the RFC formally specifies the use of Unicode code points.

Thanks also to Asmus Freytag for careful review and many constructive suggestions aimed at making the language more consistent with the structure of the Unicode Standard.

Thanks also to James Manger for the correctness of the ABNF and JSON samples.

Thanks also to Addison Phillips and the W3C Internationalization Working Group for helpful suggestions on language and references.

Thoughtful comments during the many draft versions of this document, which helped tighten up wording and make difficult points clearer, were contributed by Harald Alvestrand, Martin J. Dürst, Donald E. Eastlake, John Klensin, Barry Leiba, Glyn Normington, Peter Saint-Andre, and Rob Sayre.

Authors' Addresses

   Tim Bray
   Textuality Services
   Email: tbray@textuality.com

   Paul Hoffman
   ICANN
   Email: paul.hoffman@icann.org
