Intl.Segmenter: Unicode segmentation in JavaScript

Translator's preface



This is a translation of the proposal's explainer. Intl.Segmenter will most likely be added to the next ECMAScript specification.



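The proposal is already implemented in V8 and is available without a flag in version 8.7 (more precisely, in 8.7.38 and later). It can therefore be tested in Google Chrome Canary (starting from version 87.0.4252.0) or in Node.js V8 Canary (starting from version v15.0.0-v8-canary202009025a2ca762b8; Windows binaries are available starting from v15.0.0-v8-canary202009173b56586162).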



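If you test in earlier versions with the --harmony-intl-segmenter flag, be careful: the specification has changed, and the implementation behind the flag may be outdated. Check your results against the output shown in the code examples.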



After the translation there is a list of links to material on the problems this proposal addresses.






Intl.Segmenter: Unicode segmentation in JavaScript



The proposal is at Stage 3; its champion is Richard Gibson.



Motivation



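A Unicode code point is not a "letter" or a unit of text as displayed on screen. That role belongs to the grapheme, which can consist of several code points (for example, when it includes accent marks or conjoining Korean characters). Unicode defines a grapheme segmentation algorithm that finds the boundaries between graphemes. This is useful when building advanced editors, input methods, or other forms of text processing.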



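Unicode also defines algorithms for finding the boundaries between words and sentences, which CLDR (Common Locale Data Repository) tailors per locale. These boundaries help, for example, when building a text editor that supports commands for moving between words or sentences and for highlighting them.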



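Grapheme, word, and sentence segmentation is defined in UAX 29. Web browsers need an implementation of this kind of segmentation to function, and exposing it to JavaScript saves memory and network bandwidth compared with expecting developers to implement it themselves in JavaScript.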



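Chrome has shipped its own nonstandard segmentation API, Intl.v8BreakIterator, for several years. However, for a few reasons, that API does not seem suitable for standardization. This explainer outlines a new API that tries to be more in line with modern, post-ES2015 JavaScript API design.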







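Objects returned by the segment() method of an Intl.Segmenter instance find boundaries and expose the segments between them through the Iterable interface.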



// Create a locale-specific word segmenter.
let segmenter = new Intl.Segmenter("fr", {granularity: "word"});

// Use it to get an iterable over the segments of a string.
let input = "Moi?  N'est-ce pas.";
let segments = segmenter.segment(input);

// Use that for segmentation!
for (let {segment, index, isWordLike} of segments) {
  console.log("segment at code units [%d, %d): «%s»%s",
    index, index + segment.length,
    segment,
    isWordLike ? " (word-like)" : ""
  );
}

// console.log output:
// segment at code units [0, 3): «Moi» (word-like)
// segment at code units [3, 4): «?»
// segment at code units [4, 6): «  »
// segment at code units [6, 11): «N'est» (word-like)
// segment at code units [11, 12): «-»
// segment at code units [12, 14): «ce» (word-like)
// segment at code units [14, 15): « »
// segment at code units [15, 18): «pas» (word-like)
// segment at code units [18, 19): «.»


In addition, the API supports random access to a segment by code unit index.



// ┃0 1 2 3 4 5┃6┃7┃8┃9
// ┃A l l o n s┃-┃y┃!┃
let input = "Allons-y!";

let segmenter = new Intl.Segmenter("fr", {granularity: "word"});
let segments = segmenter.segment(input);
let current = undefined;

current = segments.containing(0)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segments.containing(5)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segments.containing(6)
// → { index: 6, segment: "-", isWordLike: false }

current = segments.containing(current.index + current.segment.length)
// → { index: 7, segment: "y", isWordLike: true }

current = segments.containing(current.index + current.segment.length)
// → { index: 8, segment: "!", isWordLike: false }

current = segments.containing(current.index + current.segment.length)
// → undefined


API



.



new Intl.Segmenter(locale, options)



Creates a new locale-dependent segmenter.



If options is provided, it is treated as an object, and its granularity property specifies the segmenter granularity ("grapheme", "word", or "sentence"; the default is "grapheme").
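To make the three granularities concrete, here is a minimal sketch that segments one string in each mode. The sample text and the helper name split are illustrative assumptions, not part of the proposal; resolvedOptions() is used only to read back the default granularity.

```javascript
// Illustrative sample text: each "é" is written as "e" + U+0301 (combining acute).
const text = "Le cafe\u0301 est ferme\u0301. Revenez demain.";

// Hypothetical helper: returns the segment strings for a given granularity.
function split(locale, granularity, input) {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  return Array.from(segmenter.segment(input), (data) => data.segment);
}

const graphemes = split("fr", "grapheme", text);
const words = split("fr", "word", text);
const sentences = split("fr", "sentence", text);

// Two combining marks are folded into their base letters,
// so there are two fewer graphemes than code units.
console.log(text.length, graphemes.length);
console.log(words);
console.log(sentences);

// Omitting the granularity option falls back to the default.
const defaultGranularity = new Intl.Segmenter("fr").resolvedOptions().granularity;
console.log(defaultGranularity); // "grapheme"
```

Whatever the granularity, the segments always cover the whole input, so joining them back together reproduces the original string.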



Intl.Segmenter.prototype.segment(string)



Creates a new Iterable %Segments% instance for the input string, using the segmenter's locale and granularity.
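Because the returned object is Iterable, it works with spread syntax, Array.from, and for...of. The sketch below, with an illustrative input string, counts user-perceived characters and shows that the same object can be iterated more than once.

```javascript
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });

// Illustrative input: U+1F499 (blue heart) occupies two UTF-16 code units.
const input = "I \u{1F499} JS";
const segments = segmenter.segment(input);

// Spread the iterable into an array of segment strings.
const graphemes = [...segments].map((data) => data.segment);

console.log(input.length);     // 7 code units
console.log(graphemes.length); // 6 graphemes

// Each for...of loop requests a fresh iterator, so the object is reusable.
let count = 0;
for (const data of segments) {
  count += 1;
}
console.log(count); // 6
```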





Segments are described by plain objects with the following data properties:



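  • segment: the string segment.
  • index: the code unit index in the string at which the segment begins.
  • input: the string being segmented.
  • isWordLike: true when the granularity is "word" and the segment is word-like (consisting of letters, numbers, ideographs, and so on); false when the granularity is "word" and the segment is not word-like (spaces, punctuation, and so on); undefined when the granularity is not "word".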
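The properties listed above can be inspected directly. This is a small sketch with an illustrative input string; it prints each word-level segment data object and then checks isWordLike under the default grapheme granularity.

```javascript
// Illustrative input.
const input = "Bonjour, monde";

const wordSegmenter = new Intl.Segmenter("fr", { granularity: "word" });
const wordData = [...wordSegmenter.segment(input)];

for (const data of wordData) {
  // Every segment data object carries the same input string.
  console.log(data.index, JSON.stringify(data.segment), data.isWordLike, data.input === input);
}
// 0 "Bonjour" true true
// 7 "," false true
// 8 " " false true
// 9 "monde" true true

// With a granularity other than "word", isWordLike is undefined.
const graphemeData = [...new Intl.Segmenter("fr").segment(input)];
console.log(graphemeData[0].isWordLike); // undefined
```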


Methods of %Segments%.prototype:



%Segments%.prototype.containing(index)



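Returns a segment data object describing the segment of the string that includes the code unit at the specified index, or undefined if the index is out of bounds.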
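One possible use of containing() is finding the word under a cursor, as an editor's double-click selection might. The sketch below is only an illustration: the function name wordAt, its return shape, and the sample text are assumptions, not part of the proposal.

```javascript
// Hypothetical helper: returns the bounds of the word at a code unit position,
// or undefined if the position is out of range or not inside a word.
function wordAt(text, position, locale = "en") {
  const segments = new Intl.Segmenter(locale, { granularity: "word" }).segment(text);
  const data = segments.containing(position);
  if (data === undefined || !data.isWordLike) {
    return undefined;
  }
  return {
    word: data.segment,
    start: data.index,
    end: data.index + data.segment.length, // exclusive end
  };
}

const sample = "Select this word";

console.log(wordAt(sample, 9));   // { word: "this", start: 7, end: 11 }
console.log(wordAt(sample, 6));   // undefined (position 6 is a space)
console.log(wordAt(sample, 100)); // undefined (out of bounds)
```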



%Segments%.prototype[Symbol.iterator]



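Creates a new %SegmentIterator% instance that finds segments in the input string lazily (on demand), using the segmenter's locale and granularity and keeping track of its position within the string.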



Methods of %SegmentIterator%.prototype:



%SegmentIterator%.prototype.next()



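The next() method implements the Iterator interface: it finds the next segment and returns the corresponding IteratorResult object, whose value property is a segment data object as described above.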
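The iterator protocol can also be driven by hand. This sketch, with an illustrative two-word input, calls next() directly to show the shape of each IteratorResult, including the final one.

```javascript
const segments = new Intl.Segmenter("en", { granularity: "word" }).segment("Hi there");

// Ask the Segments object for a fresh iterator.
const iterator = segments[Symbol.iterator]();

const first = iterator.next();
console.log(first.done, first.value.segment, first.value.index); // false "Hi" 0

const second = iterator.next();
console.log(second.value.segment === " "); // true

const third = iterator.next();
console.log(third.value.segment, third.value.index); // "there" 3

// Once the string is exhausted, the result has no value.
const last = iterator.next();
console.log(last); // { value: undefined, done: true }
```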



FAQ



? ?



— , . . . CLDR. , CLDR/ICU , .



Shouldn't we be putting new APIs in built-in modules?



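If built-in modules had arrived before this proposal reached Stage 3, that would have been a good option. So far, however, the approach in TC39 has been not to block concrete features on built-in modules. Built-in modules still have some big open questions; for example, how, and whether, polyfills should interact with them.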



Why is line breaking not included?



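Line breaking was included in an earlier version of this API, but it was removed because a line-breaking API on its own would sit at the wrong level: finding break opportunities is not enough to lay out lines of text. A better home for such an API would be CSS Houdini.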



Why is hyphenation not included?



Hyphenation is expected to need a differently shaped API, for several reasons:



  • .
  • .
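  • Hyphenation plays into line layout and font rendering in a more complex way, and it might be better exposed at that level (that is, as a Web API in the Web Platform rather than in ECMAScript).
  • Hyphenation is simply less mature in the internationalization world. CLDR and ICU do not support it yet. Some web browsers are only now getting support for it in CSS, and it is often not done perfectly. It could use more time to mature. By contrast, word, grapheme, sentence, and line breaks have been in the Unicode specification for a long time; that work is ready to use.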


?



%SegmentIterator%.prototype, (, seek([inclusiveStartIndex = thisIterator.index + 1]) seekBefore([exclusiveLastIndex = thisIterator.index]), . ECMA-262 ( ). , , .



Why is this an Intl API instead of String methods?



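All of these break types are actually locale-dependent, and some allow complex options. The result of the segment() method is a SegmentIterator. For many non-trivial cases like this, the analogous APIs live in the Intl object of ECMA-402. This allows the work done on each instantiation to be shared, which improves performance. Convenience methods on String could be added in a follow-on proposal if they turn out to be useful.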



What exactly does the index refer to?



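An index n refers to a code unit index in the string at which a segment may begin. For example, when the string "Hello, world\u{1F499}" is segmented by words in English, the segments begin at indices 0, 5, 6, 7, and 12. In other words, the string is segmented as ┃Hello┃,┃ ┃world┃\u{1F499}┃, where the final segment consists of two code units (a surrogate pair) that encode a single code point.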
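The example string from this answer can be checked directly. The sketch below collects the starting index of every word-level segment and then looks at the final segment, which is two code units long but a single code point.

```javascript
const input = "Hello, world\u{1F499}";
const segments = new Intl.Segmenter("en", { granularity: "word" }).segment(input);

// Starting code unit index of each segment.
const starts = Array.from(segments, (data) => data.index);
console.log(starts); // [0, 5, 6, 7, 12]

// The last segment is a surrogate pair: two code units, one code point.
const heart = segments.containing(12);
console.log(heart.segment.length);      // 2
console.log([...heart.segment].length); // 1

// Index 13 points at the second half of the pair and maps to the same segment.
console.log(segments.containing(13).index); // 12
```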



What happens when segmenting an empty string?



In that case no segments are found, and the iterator completes on the first call to next().
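A short sketch of the empty-string case, using nothing beyond the methods described above:

```javascript
const emptySegments = new Intl.Segmenter("en", { granularity: "word" }).segment("");

// The very first call to next() already reports completion.
const result = emptySegments[Symbol.iterator]().next();
console.log(result); // { value: undefined, done: true }

// Consequently, spreading yields an empty array.
const all = [...emptySegments];
console.log(all.length); // 0

// Every index is out of bounds for an empty string.
console.log(emptySegments.containing(0)); // undefined
```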



What happens when I try to use random access with non-integer values?



It looks like you have a bright future in QA ;)



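In any case, the argument is converted to an integer Number: null becomes 0, booleans become 0 or 1, strings are parsed as numeric literals, objects are converted to primitives, Symbol and BigInt values fail, and so do undefined and NaN *. Any fractional part is then truncated.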



* Translator's note: it is not entirely clear what "fail" means here. In Chrome Canary, Symbol and BigInt throw a TypeError, while undefined and NaN are treated as 0.
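The conversions described above can be observed with a few calls to containing(). This is a sketch of the behavior as reported in the translator's note, reusing the "Allons-y!" example; the helper names show and throwsTypeError are illustrative assumptions.

```javascript
const segments = new Intl.Segmenter("fr", { granularity: "word" }).segment("Allons-y!");

// Hypothetical helper: the segment text at an index, or undefined.
const show = (index) => segments.containing(index)?.segment;

console.log(show(null));      // "Allons" (null becomes 0)
console.log(show(true));      // "Allons" (true becomes 1)
console.log(show("6"));       // "-"      (string parsed as a number)
console.log(show(7.9));       // "y"      (fraction truncated to 7)
console.log(show(undefined)); // "Allons" (treated as 0)
console.log(show(NaN));       // "Allons" (treated as 0)
console.log(show({ valueOf() { return 8; } })); // "!" (object converted to a primitive)

// Hypothetical helper: reports whether a call throws a TypeError.
function throwsTypeError(fn) {
  try {
    fn();
    return false;
  } catch (error) {
    return error instanceof TypeError;
  }
}

console.log(throwsTypeError(() => segments.containing(Symbol()))); // true
console.log(throwsTypeError(() => segments.containing(1n)));       // true
```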








Further reading on Unicode in JavaScript.



  1. Joel Spolsky. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
  2. Dmitri Pavlutin. What every JavaScript developer should know about Unicode
  3. Dr. Axel Rauschmayer. JavaScript for impatient programmers: 17. Unicode – a brief introduction
  4. Dr. Axel Rauschmayer. JavaScript for impatient programmers: 18.6. Atoms of text: Unicode characters, JavaScript characters, grapheme clusters
  5. Jonathan New. "\u{1F4A9}".length === 2
  6. Nicolás Bevacqua. ES6 Strings (and Unicode, ❤) in Depth
  7. Mathias Bynens. JavaScript has a Unicode problem
  8. Mathias Bynens. Unicode-aware regular expressions in ECMAScript 6
  9. Mathias Bynens. Unicode property escapes in JavaScript regular expressions
  10. Mathias Bynens. Unicode sequence property escapes
  11. Awesome Unicode: a curated list of delightful Unicode tidbits, packages and resources


