PHP 8.4: New grapheme_str_split function

The Intl extension in PHP 8.4 adds a new function named grapheme_str_split, which splits a given string into an array of graphemes.

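A grapheme is the smallest meaningful, functional unit of a writing system. The mb_str_split function from the Mbstring extension has similar semantics. The key difference is that mb_str_split splits a string into Unicode code points (multi-byte characters), whereas grapheme_str_split splits it into the functional units of the writing system.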

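The distinction between Unicode code points and graphemes matters when handling characters in some complex scripts and emoji with modifiers. Every individual Unicode code point is a valid character, but in complex scripts and emoji, splitting a string with mb_str_split can break a character apart and detach modifiers such as vowel signs.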

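For example, the Sinhala word අයේෂ් (pronounced "Ayesh") consists of three units in Sinhala script: අ + යේ + ෂ්. අ is a standalone character, but යේ and ෂ් use additional Unicode code points as vowel modifiers. grapheme_str_split correctly splits the word into the individual characters of the Sinhala writing system, whereas mb_str_split splits it into individual Unicode code points: අ + ය + ේ + ෂ + ්.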
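
The code-point view of that word can be reproduced with functions that predate PHP 8.4. The following is a minimal sketch, assuming the mbstring extension is loaded; `code_points_hex` is a hypothetical helper name used only for illustration, and the grapheme count is printed only when the intl extension is available.

```php
<?php
/**
 * Hypothetical helper: returns the code points of $string as
 * zero-padded uppercase hexadecimal strings, e.g. "0D85".
 */
function code_points_hex(string $string): array {
    return array_map(
        static fn(string $char): string => sprintf('%04X', mb_ord($char, 'UTF-8')),
        mb_str_split($string, 1, 'UTF-8')
    );
}

// The Sinhala word from the paragraph above, written as escapes
// so the exact code points are unambiguous.
$word = "\u{0D85}\u{0DBA}\u{0DDA}\u{0DC2}\u{0DCA}";

// One entry per code point: 0D85 0DBA 0DDA 0DC2 0DCA
echo implode(' ', code_points_hex($word)), PHP_EOL;

// mb_strlen() counts code points, so this prints 5.
echo mb_strlen($word, 'UTF-8'), PHP_EOL;

if (function_exists('grapheme_strlen')) {
    // grapheme_strlen() counts grapheme clusters instead of code points.
    echo grapheme_strlen($word), PHP_EOL;
}
```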

Here are more examples from various languages and emoji:

| String (code points) | `grapheme_str_split` output (code points) | `mb_str_split` output (code points) |
| --- | --- | --- |
| `PHP` (`0050 0048 0050`) | `P` + `H` + `P` (`0050` + `0048` + `0050`) | `P` + `H` + `P` (`0050` + `0048` + `0050`) |
| `你好` (`4F60 597D`) | `你` + `好` (`4F60` + `597D`) | `你` + `好` (`4F60` + `597D`) |
| `අයේෂ්` (`0D85 0DBA 0DDA 0DC2 0DCA`) | `අ` + `යේ` + `ෂ්` (`0D85` + `0DBA 0DDA` + `0DC2 0DCA`) | `අ` + `ය` + `ේ` + `ෂ` + `්` (`0D85` + `0DBA` + `0DDA` + `0DC2` + `0DCA`) |
| `สวัสดี` (`0E2A 0E27 0E31 0E2A 0E14 0E35`) | `ส` + `วั` + `ส` + `ดี` (`0E2A` + `0E27 0E31` + `0E2A` + `0E14 0E35`) | `ส` + `ว` + `ั` + `ส` + `ด` + `ี` (`0E2A` + `0E27` + `0E31` + `0E2A` + `0E14` + `0E35`) |
| `👭🏻👰🏿‍♂️` (`1F46D 1F3FB 1F470 1F3FF 200D 2642 FE0F`) | `👭🏻` + `👰🏿‍♂️` (`1F46D 1F3FB` + `1F470 1F3FF 200D 2642 FE0F`) | `👭` + `🏻` + `👰` + `🏿` + `‍` + `♂` + `️` (`1F46D` + `1F3FB` + `1F470` + `1F3FF` + `200D` + `2642` + `FE0F`) |

`grapheme_str_split` synopsis

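Like mb_str_split, grapheme_str_split accepts an int $length parameter that sets how many graphemes each chunk contains. If $length exceeds the number of graphemes available, the remaining graphemes (or the whole string) are returned as a single chunk.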

Passing an empty string returns an empty array.

/**
 * Splits a string into an array of individual or chunks of graphemes.
 *
 * @param string $string The string to split into individual graphemes
 *  or chunks of graphemes.
 * @param int $length If specified, each element of the returned array
 *  will be composed of multiple graphemes instead of a single
 *  grapheme.
 *
 * @return array|false
 */
function grapheme_str_split(string $string, int $length = 1): array|false {}

`grapheme_str_split` usage examples

grapheme_str_split("PHP");
// ["P", "H", "P"]

grapheme_str_split("你好");
// ["你", "好"]

grapheme_str_split("你好", length: 4);
// ["你好"]

grapheme_str_split("สวัสดี");
// ["ส", "วั", "ส", "ดี"]

grapheme_str_split("අයේෂ්");
// ["අ", "යේ", "ෂ්"]

grapheme_str_split("👭🏻👰🏿‍♂️");
// ["👭🏻", "👰🏿‍♂️"]
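
Beyond the calls above, the $length parameter, the empty-string case, and a typical use can be sketched as follows. This is an illustrative sketch that assumes PHP 8.4+ with the intl extension; `reverse_graphemes` is a hypothetical helper, not part of PHP, and the demonstration is skipped when grapheme_str_split is unavailable.

```php
<?php
/**
 * Hypothetical helper: reverses a string grapheme by grapheme so that
 * combining marks stay attached to their base characters.
 */
function reverse_graphemes(string $string): string {
    $graphemes = grapheme_str_split($string);
    if ($graphemes === false) {
        throw new InvalidArgumentException('Input could not be split into graphemes.');
    }
    return implode('', array_reverse($graphemes));
}

if (function_exists('grapheme_str_split')) {
    // Chunks of two graphemes each.
    var_dump(grapheme_str_split("สวัสดี", 2));

    // An empty string yields an empty array.
    var_dump(grapheme_str_split(""));

    // Reverse without separating vowel signs from their consonants.
    echo reverse_graphemes("สวัสดี"), PHP_EOL;

    try {
        // A non-positive length is rejected.
        grapheme_str_split("PHP", 0);
    } catch (ValueError $e) {
        echo $e->getMessage(), PHP_EOL;
    }
}
```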

Backward compatibility impact

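grapheme_str_split is a new function in the Intl extension, declared in the global namespace. Unless an application already declares its own function with exactly the same name, this change introduces no backward-compatibility issues.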

`grapheme_str_split` polyfill

The grapheme_str_split function can be polyfilled with a Unicode-aware regular expression. The \X escape sequence matches one complete grapheme and can serve as the basis of the polyfill.

Note, however, that on PCRE2 library versions <= 10.43, \X does not correctly split complex emoji, such as emoji with skin tone modifiers, and the polyfill below inherits that limitation.
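
Whether a given runtime is affected can be checked before relying on the regular expression. The snippet below is a rough sketch: the 10.43 threshold is taken from the note above rather than verified independently, and the variable names are illustrative.

```php
<?php
// PCRE_VERSION looks like "10.42 2022-12-11"; keep only the version number.
$pcreVersion = explode(' ', PCRE_VERSION)[0];

// Assumption from the note above: versions up to and including 10.43 are affected.
$splitsEmojiCorrectly = version_compare($pcreVersion, '10.43', '>');

echo 'PCRE2 version: ', $pcreVersion, PHP_EOL;
echo 'Expected to keep emoji modifiers attached: ',
    $splitsEmojiCorrectly ? 'yes' : 'no', PHP_EOL;

// Empirical check: an emoji followed by a skin tone modifier.
// A count of 1 means the modifier stayed attached to its base emoji;
// a count of 2 means \X split them apart.
preg_match_all('/\X/u', "\u{1F46D}\u{1F3FB}", $matches);
echo 'Graphemes matched: ', count($matches[0]), PHP_EOL;
```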

/**
 * Splits a string into an array of individual or chunks of graphemes.
 *
 * @param string $string The string to split into individual graphemes
 *  or chunks of graphemes.
 * @param int $length If specified, each element of the returned array
 *  will be composed of multiple graphemes instead of a single
 *  grapheme.
 *
 * @return array|false
 */
function grapheme_str_split(string $string, int $length = 1): array|false {
    if ($length < 1 || $length > 1073741823) {
        throw new \ValueError('grapheme_str_split(): Argument #2 ($length) must be greater than 0 and less than or equal to 1073741823.');
    }
    if ($string === '') {
        return [];
    }

    preg_match_all('/\X/u', $string, $matches);

    if (empty($matches[0])) {
        return false;
    }

    if ($length === 1) {
        return $matches[0];
    }

    $chunks = array_chunk($matches[0], $length);

    array_walk($chunks, static function(&$value) {
        $value = implode('', $value);
    });

    return $chunks;
}
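
Declaring a function named grapheme_str_split unconditionally would fail with a redeclaration error on PHP 8.4 when the intl extension is loaded, so code that must run on both old and new versions would usually guard the declaration. A compact sketch of that pattern, assuming PHP 8.0 or later for the union return type and ValueError:

```php
<?php
// Declare the polyfill only when the native function is missing.
if (!function_exists('grapheme_str_split')) {
    function grapheme_str_split(string $string, int $length = 1): array|false {
        if ($length < 1 || $length > 1073741823) {
            throw new \ValueError('grapheme_str_split(): Argument #2 ($length) must be greater than 0 and less than or equal to 1073741823.');
        }
        if ($string === '') {
            return [];
        }
        // preg_match_all() returns false on invalid UTF-8 input.
        if (!preg_match_all('/\X/u', $string, $matches)) {
            return false;
        }
        // Group graphemes into chunks of $length and join each chunk.
        return array_map(
            static fn(array $chunk): string => implode('', $chunk),
            array_chunk($matches[0], $length)
        );
    }
}

var_dump(grapheme_str_split('abcde', 2)); // ["ab", "cd", "e"]
```

With the guard in place, the same call resolves to the native implementation where it exists and to the regular-expression fallback elsewhere.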
