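Unicode character properties
Since PHP 4.4.0 and 5.1.0, three additional escape sequences for matching generic character types are available when UTF-8 mode is selected. They are: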
- \p{xx}: a character with the xx property
- \P{xx}: a character without the xx property
- \X: an extended Unicode sequence
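The property names represented by xx above are limited to the Unicode general category properties.
Each character has exactly one such property, specified by a two-letter abbreviation.
For compatibility with Perl, negation can be specified by placing a circumflex (^) immediately after the opening brace. For example, \p{^Lu} is the same as \P{Lu}.
If only one letter is specified with \p or \P, it includes all the properties that start with that letter.
In this case the braces in the escape sequence are optional, so \p{L} and \pL are equivalent.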
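The rules above can be sketched in a few lines of PHP. This is only an illustration: the sample strings are made up, the source file is assumed to be saved as UTF-8, and the u modifier is used to select UTF-8 mode.

```php
<?php
// The "u" modifier selects UTF-8 mode, so the subject is read as code points.
$upper    = preg_match('/^\p{Lu}$/u', 'É');   // É has the Lu property
$notUpper = preg_match('/^\P{Lu}$/u', 'é');   // é does not have the Lu property
$negated  = preg_match('/^\p{^Lu}$/u', 'é');  // \p{^Lu} behaves like \P{Lu}

// A single letter covers every property that starts with it (Ll, Lm, Lo, Lt, Lu),
// and the braces may be omitted in that case.
preg_match_all('/\p{L}/u', 'Ab1é', $withBraces);
preg_match_all('/\pL/u', 'Ab1é', $withoutBraces);
```

Both `$withBraces[0]` and `$withoutBraces[0]` should hold the same three letters, since the digit 1 belongs to category Nd rather than L.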
Supported Unicode properties
Property | Matches | Notes
--- | --- | ---
C | Other |
Cc | Control |
Cf | Format |
Cn | Unassigned |
Co | Private use |
Cs | Surrogate |
L | Letter | Includes the following properties: Ll, Lm, Lo, Lt and Lu.
Ll | Lower case letter |
Lm | Modifier letter |
Lo | Other letter |
Lt | Title case letter |
Lu | Upper case letter |
M | Mark |
Mc | Spacing mark |
Me | Enclosing mark |
Mn | Non-spacing mark |
N | Number |
Nd | Decimal number |
Nl | Letter number |
No | Other number |
P | Punctuation |
Pc | Connector punctuation |
Pd | Dash punctuation |
Pe | Close punctuation |
Pf | Final punctuation |
Pi | Initial punctuation |
Po | Other punctuation |
Ps | Open punctuation |
S | Symbol |
Sc | Currency symbol |
Sk | Modifier symbol |
Sm | Mathematical symbol |
So | Other symbol |
Z | Separator |
Zl | Line separator |
Zp | Paragraph separator |
Zs | Space separator |
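Extended properties such as InMusicalSymbols are not supported by PCRE.
Specifying case-insensitive matching does not affect these escape sequences. For example, \p{Lu} always matches only upper case letters.
Sets of Unicode characters are defined as belonging to certain scripts. A character from one of these sets can be matched using the script name. For example, \p{Greek} matches a Greek character and \P{Han} matches a character that is not in the Han script.
Characters that are not part of an identified script are lumped together as Common. The current list of scripts is: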
Supported scripts
Arabic | Armenian | Avestan | Balinese | Bamum
Batak | Bengali | Bopomofo | Brahmi | Braille
Buginese | Buhid | Canadian_Aboriginal | Carian | Chakma
Cham | Cherokee | Common | Coptic | Cuneiform
Cypriot | Cyrillic | Deseret | Devanagari | Egyptian_Hieroglyphs
Ethiopic | Georgian | Glagolitic | Gothic | Greek
Gujarati | Gurmukhi | Han | Hangul | Hanunoo
Hebrew | Hiragana | Imperial_Aramaic | Inherited | Inscriptional_Pahlavi
Inscriptional_Parthian | Javanese | Kaithi | Kannada | Katakana
Kayah_Li | Kharoshthi | Khmer | Lao | Latin
Lepcha | Limbu | Linear_B | Lisu | Lycian
Lydian | Malayalam | Mandaic | Meetei_Mayek | Meroitic_Cursive
Meroitic_Hieroglyphs | Miao | Mongolian | Myanmar | New_Tai_Lue
Nko | Ogham | Old_Italic | Old_Persian | Old_South_Arabian
Old_Turkic | Ol_Chiki | Oriya | Osmanya | Phags_Pa
Phoenician | Rejang | Runic | Samaritan | Saurashtra
Sharada | Shavian | Sinhala | Sora_Sompeng | Sundanese
Syloti_Nagri | Syriac | Tagalog | Tagbanwa | Tai_Le
Tai_Tham | Tai_Viet | Takri | Tamil | Telugu
Thaana | Thai | Tibetan | Tifinagh | Ugaritic
Vai | Yi | | |
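As a rough illustration of matching by script name, the sketch below picks Greek and Han runs out of a made-up string. The sample text and variable names are hypothetical, and the source file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical sample text mixing Latin, Greek and Han characters.
$text = 'abc αβγ 漢字';

// Collect runs of characters that belong to one script.
preg_match_all('/\p{Greek}+/u', $text, $greek);
preg_match_all('/\p{Han}+/u', $text, $han);

// Remove every Han character; everything else, including spaces, is kept.
$withoutHan = preg_replace('/\p{Han}/u', '', $text);
```

The spaces survive the replacement because they belong to the Common script, not to Han.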
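The \X escape matches any number of Unicode characters that together form an extended Unicode sequence.
\X is equivalent to (?>\PM\pM*)
That is, it matches a character without the "mark" property, followed by zero or more characters with the "mark" property,
and treats the sequence as an atomic group (see below).
Characters with the "mark" property are typically accents that affect the preceding character.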
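A small sketch of what \X does in practice, assuming PHP 7 or later for the `\u{...}` string escape. The sample word is made up: it spells "noel" with the accent stored as a separate combining mark.

```php
<?php
// "e" followed by U+0301 COMBINING ACUTE ACCENT (category Mn) displays as one glyph.
$word = "noe\u{0301}l";

// "." in UTF-8 mode consumes one code point at a time.
$codePoints = preg_match_all('/./u', $word);

// \X consumes a base character together with the marks that follow it.
$sequences = preg_match_all('/\X/u', $word, $parts);
```

Here `$codePoints` counts the accent separately, while `$sequences` counts the letter and its accent as one unit, which is usually closer to what a reader perceives as a character.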
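Matching characters by Unicode property is not fast,
because PCRE has to search a structure that contains data for over fifteen thousand characters.
That is why the traditional escape sequences such as \d and
\w do not use Unicode properties in PCRE.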
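The difference between a traditional escape and a property escape can be seen with non-ASCII digits. This sketch assumes PHP 7 or later for the `\u{...}` escape and deliberately compiles the \d pattern without the u modifier; how \d behaves together with that modifier may depend on how PHP and PCRE were built, so it is left out here.

```php
<?php
// U+0661 U+0662 U+0663: ARABIC-INDIC DIGITS ONE, TWO and THREE (category Nd).
$digits = "\u{0661}\u{0662}\u{0663}";

// Traditional escape, no "u" modifier: only the ASCII digits 0-9 qualify.
$traditional = preg_match('/^\d+$/', $digits);

// Property escape in UTF-8 mode: any decimal number in Unicode qualifies.
$byProperty = preg_match('/^\p{Nd}+$/u', $digits);
```

When the input is known to be ASCII, the traditional escape avoids the property lookup entirely.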
User Contributed Notes
php at lnx-bsp dot net
25-Sep-2017 05:53
Not made clear in the top of page explanation, but these escaped character classes can be included within square brackets to make a broader character class. For example:
<?php preg_match('/[\p{N}\p{L}]+/u', $data); ?>
Will match any combination of letters and numbers.
huhwatnouDONTspamPLEASE at hotmail dot com
20-Jan-2016 09:00
To select UTF-8 mode for the additional escape sequences (\p{xx}, \P{xx}, and \X), use the "u" modifier (see http://php.net/manual/en/reference.pcre.pattern.modifiers.php).
I wondered why a German sharp S (ß) was marked as a control character by \p{Cc} and it took me a while to properly read the first sentence: "Since 5.1.0, three additional escape sequences to match generic character types are available when UTF-8 mode is selected." :-$ and then to find out how to do so.
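The effect described in this note can be reproduced with a short sketch. The variable names are made up, and the byte-wise result assumes a PCRE build with Unicode property support, where code points below U+0100 are still tested in non-UTF mode.

```php
<?php
// UTF-8 encoding of U+00DF LATIN SMALL LETTER SHARP S: bytes 0xC3 0x9F.
$sharpS = "\xC3\x9F";

// Without "u" the subject is read byte by byte, and byte 0x9F is a control code.
$asBytes = preg_match('/\p{Cc}/', $sharpS);

// With "u" the two bytes are decoded as a single lower case letter.
$asUtf8  = preg_match('/\p{Cc}/u', $sharpS);
$isLower = preg_match('/^\p{Ll}$/u', $sharpS);
```

Only the patterns compiled with the u modifier see the sharp S as a letter, which matches the behaviour the note describes.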
Yzmir Ramirez
11-Oct-2013 07:51
If you are working with older environments you will need to first check to see if the version of PCRE will work with unicode directives described above:
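<?php
// Assume Unicode directives are unavailable until the PCRE version is checked.
$allowInternational = false;
if (defined('PCRE_VERSION')) {
    // PCRE_VERSION starts with the major version number, which intval() extracts.
    if (intval(PCRE_VERSION) >= 7) {
        $allowInternational = true;
    }
}
?>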
Now you can do a fallback regex (e.g. use "/[a-z]/i"), when the PCRE library version is too old or not available.
o_shes01 at uni-muenster dot de
22-Jan-2011 06:23
For those who wonder: 'letter_titlecase' applies to digraphs/trigraphs, where capitalization involves only the first letter.
For example, there are three codepoints for the "LJ" digraph in Unicode:
(*) uppercase "LJ": U+01C7
(*) titlecase "Lj": U+01C8
(*) lowercase "lj": U+01C9
suit at rebell dot at
01-Mar-2010 05:13
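These properties are usually only available if PCRE is compiled with "--enable-unicode-properties".
If you want to match any word but want to provide a fallback, you can do something like this:
<?php
// preg_match_all() returns false on error, for example when the pattern
// cannot be compiled because Unicode properties are not supported.
if (@preg_match_all('/\p{L}+/u', $str, $arr) === false) {
    // Assumed fallback (not in the original note): a less accurate match.
    preg_match_all('/\w+/', $str, $arr);
}
?>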