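[indent]Originally posted by 凯文君 on 2005-12-28 18:22
有女同车, I also know that the purpose of Unicode is to give every character that could possibly occur a unique, definite code. But your statement that "the requirement for extensibility of the encoding is far higher than the requirements of any one particular application" seems open to debate. I don't know which particular application you have in mind. Speaking as a programmer, though, I understand Unicode first and foremost as an encoding scheme for computer characters. Like every earlier character encoding, such as ASCII, GB, BIG5 and JIS, it exists to serve the processing of linguistic text in computers. My point that characters should be encoded by category as far as possible comes from this fundamental perspective of text processing. To a program, a character is nothing more than binary data (that is, the character's code). Character classification is also a basic building block of text processing. A simple analogy: if a program cannot even tell whether a character is an English letter or a Chinese character, what can you expect it to do? And what basic technique does character classification rely on? Surely the encoding properties of the character, that is, which category's code range the character's code falls in. If Unicode ignores this need from real applications, it can never realize its ambition of a "unified character set". …[/indent]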
Posted on 2005-12-28 21:06:34 by 有女同车
As for which particular application: since I did not name one, I meant it in the generic sense.
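Compared with GB, ASCII, BIG5 and JIS, Unicode's greatest advantage is that it establishes the groups <- planes <- rows <- cells model, which provides the greatest extensibility to date. By comparison, GB, BIG5 and the like are "dead" encodings, or rather rigid encoding systems.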
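To make the groups/planes/rows/cells model concrete, here is a minimal Python sketch. It assumes the four-octet (UCS-4) layout of ISO/IEC 10646, in which the octets of a code value are, from high to low, group, plane, row and cell. The function name `decompose` is only illustrative.

```python
def decompose(code_point):
    """Split a UCS-4 code value into its (group, plane, row, cell) octets."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 code space")
    group = (code_point >> 24) & 0x7F  # highest octet (only 7 bits are used)
    plane = (code_point >> 16) & 0xFF
    row = (code_point >> 8) & 0xFF
    cell = code_point & 0xFF           # lowest octet
    return group, plane, row, cell
```

For example, `decompose(0x4E00)` gives `(0, 0, 0x4E, 0x00)`: group 0, plane 0 (the BMP), row 0x4E, cell 0x00.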
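Classifying characters is not, and should not be, Unicode's responsibility, because classification can be done from many different angles, and countless specific applications call for countless classification schemes. The GB family of encodings is itself a self-contradictory product: GB2312 level 1 is ordered by Mandarin pronunciation, level 2 by Xinhua Zidian radical plus stroke count, GBK uses a different page layout from GB2312, and so on. If the classification methods contradict one another, I would rather have no classification at all. What do you think?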
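On the questions 凯君 raised, my view is as follows. First, Unicode's block ranges are very clearly delimited and do not interleave, so it cannot happen that a program is unable to tell which code is a Chinese character and which is a Latin letter (see Table 1 below). Unicode's code point assignment is not arbitrary: the practical encoding property is the character's linear coordinate. For the unified CJK ideographs, the radical (one of the 214) and the residual stroke count can be obtained with a simple function; see Table 2 below.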
[indent]
Table 1
Plane,Block name,First code point,Last code point,Code points in block,Assigned characters,Unicode version added
0,Basic Latin,000000,00007F,128,128,1.0.0
0,Latin-1 Supplement,000080,0000FF,128,128,1.0.0
0,Latin Extended-A,000100,00017F,128,128,1.0.0
0,Latin Extended-B,000180,00024F,208,194,1.0.0
0,IPA Extensions,000250,0002AF,96,96,1.0.0
0,Spacing Modifier Letters,0002B0,0002FF,80,80,1.0.0
0,Combining Diacritical Marks,000300,00036F,112,112,1.0.0
0,Greek and Coptic,000370,0003FF,144,124,1.0.0
0,Cyrillic,000400,0004FF,256,248,1.0.0
0,Cyrillic Supplement,000500,00052F,48,16,3.2
0,Armenian,000530,00058F,96,86,1.0.0
0,Hebrew,000590,0005FF,112,86,1.0.0
0,Arabic,000600,0006FF,256,235,1.0.0
0,Syriac,000700,00074F,80,77,3.0
0,Arabic Supplement,000750,00077F,48,30,4.1
0,Thaana,000780,0007BF,64,50,3.0
0,Devanagari,000900,00097F,128,106,1.0.0
0,Bengali,000980,0009FF,128,91,1.0.0
0,Gurmukhi,000A00,000A7F,128,77,1.0.0
0,Gujarati,000A80,000AFF,128,83,1.0.0
0,Oriya,000B00,000B7F,128,81,1.0.0
0,Tamil,000B80,000BFF,128,71,1.0.0
0,Telugu,000C00,000C7F,128,80,1.0.0
0,Kannada,000C80,000CFF,128,82,1.0.0
0,Malayalam,000D00,000D7F,128,78,1.0.0
0,Sinhala,000D80,000DFF,128,80,3.0
0,Thai,000E00,000E7F,128,87,1.0.0
0,Lao,000E80,000EFF,128,65,1.0.0
0,Tibetan,000F00,000FFF,256,195,2.0
0,Myanmar,001000,00109F,160,78,3.0
0,Georgian,0010A0,0010FF,96,83,1.0.0
0,Hangul Jamo,001100,0011FF,256,240,1.1
0,Ethiopic,001200,00137F,384,356,3.0
0,Ethiopic Supplement,001380,00139F,32,26,4.1
0,Cherokee,0013A0,0013FF,96,85,3.0
0,Unified Canadian Aboriginal Syllabics,001400,00167F,640,630,3.0
0,Ogham,001680,00169F,32,29,3.0
0,Runic,0016A0,0016FF,96,81,3.0
0,Tagalog,001700,00171F,32,20,3.2
0,Hanunoo,001720,00173F,32,23,3.2
0,Buhid,001740,00175F,32,20,3.2
0,Tagbanwa,001760,00177F,32,18,3.2
0,Khmer,001780,0017FF,128,114,3.0
0,Mongolian,001800,0018AF,176,155,3.0
0,Limbu,001900,00194F,80,66,4.0
0,Tai Le,001950,00197F,48,35,4.0
0,New Tai Lue,001980,0019DF,96,80,4.1
0,Khmer Symbols,0019E0,0019FF,32,32,4.0
0,Buginese,001A00,001A1F,32,30,4.1
0,Phonetic Extensions,001D00,001D7F,128,128,4.0
0,Phonetic Extensions Supplement,001D80,001DBF,64,64,4.1
0,Combining Diacritical Marks Supplement,001DC0,001DFF,64,4,4.1
0,Latin Extended Additional,001E00,001EFF,256,246,1.1
0,Greek Extended,001F00,001FFF,256,233,1.1
0,General Punctuation,002000,00206F,112,106,1.0.0
0,Superscripts and Subscripts,002070,00209F,48,34,1.0.0
0,Currency Symbols,0020A0,0020CF,48,22,1.0.0
0,Combining Diacritical Marks for Symbols,0020D0,0020FF,48,28,1.0.0
0,Letterlike Symbols,002100,00214F,80,77,1.0.0
0,Number Forms,002150,00218F,64,49,1.0.0
0,Arrows,002190,0021FF,112,112,1.0.0
0,Mathematical Operators,002200,0022FF,256,256,1.0.0
0,Miscellaneous Technical,002300,0023FF,256,220,1.0.0
0,Control Pictures,002400,00243F,64,39,1.0.0
0,Optical Character Recognition,002440,00245F,32,11,1.0.0
0,Enclosed Alphanumerics,002460,0024FF,160,160,1.0.0
0,Box Drawing,002500,00257F,128,128,1.0.0
0,Block Elements,002580,00259F,32,32,1.0.0
0,Geometric Shapes,0025A0,0025FF,96,96,1.0.0
0,Miscellaneous Symbols,002600,0026FF,256,175,1.0.0
0,Dingbats,002700,0027BF,192,174,1.0.0
0,Miscellaneous Mathematical Symbols-A,0027C0,0027EF,48,35,3.2
0,Supplemental Arrows-A,0027F0,0027FF,16,16,3.2
0,Braille Patterns,002800,0028FF,256,256,3.0
0,Supplemental Arrows-B,002900,00297F,128,128,3.2
0,Miscellaneous Mathematical Symbols-B,002980,0029FF,128,128,3.2
0,Supplemental Mathematical Operators,002A00,002AFF,256,256,3.2
0,Miscellaneous Symbols and Arrows,002B00,002BFF,256,20,4.0
0,Glagolitic,002C00,002C5F,96,94,4.1
0,Coptic,002C80,002CFF,128,114,4.1
0,Georgian Supplement,002D00,002D2F,48,38,4.1
0,Tifinagh,002D30,002D7F,80,55,4.1
0,Ethiopic Extended,002D80,002DDF,96,79,4.1
0,Supplemental Punctuation,002E00,002E7F,128,26,4.1
0,CJK Radicals Supplement,002E80,002EFF,128,115,3.0
0,Kangxi Radicals,002F00,002FDF,224,214,3.0
0,Ideographic Description Characters,002FF0,002FFF,16,12,3.0
0,CJK Symbols and Punctuation,003000,00303F,64,64,1.0.0
0,Hiragana,003040,00309F,96,93,1.0.0
0,Katakana,0030A0,0030FF,96,96,1.0.0
0,Bopomofo,003100,00312F,48,40,1.0.0
0,Hangul Compatibility Jamo,003130,00318F,96,94,1.0.0
0,Kanbun,003190,00319F,16,16,1.0.0
0,Bopomofo Extended,0031A0,0031BF,32,24,3.0
0,CJK Strokes,0031C0,0031EF,48,16,4.1
0,Katakana Phonetic Extensions,0031F0,0031FF,16,16,3.2
0,Enclosed CJK Letters and Months,003200,0032FF,256,242,1.0.0
0,CJK Compatibility,003300,0033FF,256,256,1.0.0
0,CJK Unified Ideographs Extension A,003400,004DBF,6592,6582,3.0
0,Yijing Hexagram Symbols,004DC0,004DFF,64,64,4.0
0,CJK Unified Ideographs,004E00,009FFF,20992,20924,1.0.1
0,Yi Syllables,00A000,00A48F,1168,1165,3.0
0,Yi Radicals,00A490,00A4CF,64,55,3.0
0,Modifier Tone Letters,00A700,00A71F,32,23,4.1
0,Syloti Nagri,00A800,00A82F,48,44,4.1
0,Hangul Syllables,00AC00,00D7AF,11184,11172,2.0
0,Private Use Area,00E000,00F8FF,6400,6400,1.0.0
0,CJK Compatibility Ideographs,00F900,00FAFF,512,467,1.0.1
0,Alphabetic Presentation Forms,00FB00,00FB4F,80,58,1.1
0,Arabic Presentation Forms-A,00FB50,00FDFF,688,595,1.1
0,Variation Selectors,00FE00,00FE0F,16,16,3.2
0,Vertical Forms,00FE10,00FE1F,16,10,4.1
0,Combining Half Marks,00FE20,00FE2F,16,4,1.1
0,CJK Compatibility Forms,00FE30,00FE4F,32,32,1.0.0
0,Small Form Variants,00FE50,00FE6F,32,26,1.0.0
0,Arabic Presentation Forms-B,00FE70,00FEFF,144,141,1.0.0
0,Halfwidth and Fullwidth Forms,00FF00,00FFEF,240,225,1.0.0
0,Specials,00FFF0,00FFFF,16,5,1.0.0
1,Linear B Syllabary,010000,01007F,128,88,4.0
1,Linear B Ideograms,010080,0100FF,128,123,4.0
1,Aegean Numbers,010100,01013F,64,57,4.0
1,Ancient Greek Numbers,010140,01018F,80,75,4.1
1,Old Italic,010300,01032F,48,35,3.1
1,Gothic,010330,01034F,32,27,3.1
1,Ugaritic,010380,01039F,32,31,4.0
1,Old Persian,0103A0,0103DF,64,50,4.1
1,Deseret,010400,01044F,80,80,3.1
1,Shavian,010450,01047F,48,48,4.0
1,Osmanya,010480,0104AF,48,40,4.0
1,Cypriot Syllabary,010800,01083F,64,55,4.0
1,Kharoshthi,010A00,010A5F,96,65,4.1
1,Byzantine Musical Symbols,01D000,01D0FF,256,246,3.1
1,Musical Symbols,01D100,01D1FF,256,219,3.1
1,Ancient Greek Musical Notation,01D200,01D24F,80,70,4.1
1,Tai Xuan Jing Symbols,01D300,01D35F,96,87,4.0
1,Mathematical Alphanumeric Symbols,01D400,01D7FF,1024,994,3.1
2,CJK Unified Ideographs Extension B,020000,02A6DF,42720,42711,3.1
2,CJK Compatibility Ideographs Supplement,02F800,02FA1F,544,542,3.1
14,Tags,0E0000,0E007F,128,97,3.1
14,Variation Selectors Supplement,0E0100,0E01EF,240,240,4.0
15,Supplementary Private Use Area-A,0F0000,0FFFFF,65536,65534,2.0
16,Supplementary Private Use Area-B,100000,10FFFF,65536,65534,2.0
[/indent]
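As a rough illustration of how Table 1 answers the "letter or Chinese character" question, here is a small Python sketch. It copies only a handful of rows from Table 1 (a real lookup would load all of them), and the names `BLOCKS`, `STARTS` and `block_of` are made up for this example.

```python
from bisect import bisect_right

# (first, last, block name) rows copied from Table 1, sorted by first code point.
# Only a few rows are included here; this is not the full table.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
]
STARTS = [first for first, _, _ in BLOCKS]

def block_of(ch):
    """Return the block name for one character, or None if no listed block contains it."""
    cp = ord(ch)
    i = bisect_right(STARTS, cp) - 1  # last block starting at or before cp
    if i >= 0 and cp <= BLOCKS[i][1]:
        return BLOCKS[i][2]
    return None
```

Because the ranges do not overlap, a binary search on the starting code points is enough: `block_of("A")` returns "Basic Latin" and `block_of("汉")` returns "CJK Unified Ideographs".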
[indent]
Table 2 (table: radical)
index A1 A2 B1 B2 C1 C2
1 一 丧 㐀 㐂 𠀀 𠁠
2 丨 丵 㐃 㐄 𠁡 𠁻
3 丶 举 丶 举 𠁼 𠂅
4 丿 乘 㐅 㐆 𠂆 𠃈
5 乙 亄 㐇 㐦 𠃉 𠄋
6 亅 事 㐧 㐨 𠄌 𠄝
7 二 亟 㐩 㐩 𠄞 𠅀
8 亠 亹 㐪 㐯 𠅁 𠆡
9 人 儾 㐰 㒪 𠆢 𠑵
10 儿 兤 㒫 㒯 𠑶 𠓚
11 入 兪 㒰 㒴 𠓛 𠓿
12 八 冁 㒵 㒹 𠔀 𠔻
13 冂 冕 㒺 㒿 𠔼 𠕲
14 冖 冪 㓀 㓄 𠕳 𠖫
15 冫 凟 㓅 㓗 𠖬 𠘦
16 几 凴 㓘 㓘 𠘧 𠙳
17 凵 凿 㓙 㓙 𠙴 𠚢
18 刀 劚 㓚 㔒 𠚣 𠠱
19 力 勸 㔓 㔧 𠠲 𠣋
20 勹 匔 㔨 㔪 𠣌 𠤍
21 匕 匙 㔫 㔮 𠤎 𠤫
22 匚 匷 㔯 㔶 𠤬 𠥬
23 匸 區 㔷 㔸 𠥭 𠥺
24 十 卛 㔹 㔼 𠥻 𠧑
25 卜 卨 㔽 㔽 𠧒 𠨌
26 卩 厁 㔾 㕁 𠨍 𠨫
27 厂 厵 㕂 㕔 𠨬 𠫒
28 厶 叇 㕕 㕙 𠫓 𠬙
29 又 叢 㕚 㕢 𠬚 𠮘
30 口 囖 㕣 㘜 𠮙 𡆟
31 囗 圞 㘝 㘥 𡆠 𡈻
32 土 壪 㘦 㚂 𡈼 𡔚
33 士 夁 㚃 㚄 𡔛 𡕑
34 夂 変 㚅 㚅 𡕒 𡕝
35 夊 夔 㚆 㚇 𡕞 𡖃
36 夕 夦 㚈 㚍 𡖄 𡗑
37 大 奲 㚎 㚡 𡗒 𡚥
38 女 孏 㚢 㜼 𡚦 𡤻
39 子 孿 㜽 㝈 𡤼 𡦸
40 宀 寷 㝉 㝲 𡦹 𡬜
41 寸 導 㝳 㝷 𡬝 𡭓
42 小 尡 㝸 㝻 𡭔 𡯀
43 尢 尷 㝼 㞊 𡯁 𡰢
44 尸 屭 㞋 㞡 𡰣 𡳽
45 屮 屰 㞢 㞥 𡳾 𡴬
46 山 巚 㞦 㠨 𡴭 𡿥
47 巛 巤 㠩 㠩 𡿦 𢀐
48 工 巰 㠪 㠮 𢀑 𢀲
49 己 巽 㠯 㠱 𢀳 𢁑
50 巾 幱 㠲 㡪 𢁒 𢆈
51 干 幹 干 幹 𢆉 𢆮
52 幺 幾 㡫 㡮 𢆯 𢇖
53 广 廳 㡯 㢞 𢇗 𢌖
54 廴 廽 㢟 㢠 𢌗 𢌫
55 廾 弊 㢡 㢣 𢌬 𢍹
56 弋 弒 㢤 㢦 𢍺 𢎖
57 弓 彏 㢧 㣆 𢎗 𢑎
58 彐 彠 㣇 㣈 𢑏 𢑿
59 彡 彲 㣉 㣓 𢒀 𢒻
60 彳 忂 㣔 㣹 𢒼 𢖨
61 心 戇 㣺 㦭 𢖩 𢦋
62 戈 戵 㦮 㦽 𢦌 𢨣
63 戶 扊 㦾 㧂 𢨤 𢩤
64 手 攮 㧃 㩹 𢩥 𢺴
65 支 攳 㩺 㩾 𢺵 𢻪
66 攴 斆 㩿 㪮 𢻫 𣁀
67 文 斖 㪯 㪱 𣁁 𣁫
68 斗 斣 㪲 㪻 𣁬 𣂐
69 斤 斸 㪼 㫂 𣂑 𣃖
70 方 旟 㫃 㫏 𣃗 𣄬
71 无 旤 无 旤 𣄭 𣄺
72 日 曯 㫐 㬭 𣄻 𣌠
73 曰 朇 㬮 㬲 𣌡 𣍜
74 月 朧 㬳 㭀 𣍝 𣎲
75 木 欟 㭁 㰜 𣎳 𣡿
76 欠 歡 㰝 㱎 𣢀 𣥁
77 止 歸 㱏 㱘 𣥂 𣦴
78 歹 殲 㱙 㱻 𣦵 𣪁
79 殳 毊 㱼 㲊 𣪂 𣫫
80 毋 毓 毋 毓 𣫬 𣬁
81 比 毚 㲋 㲋 𣬂 𣬚
82 毛 氎 㲌 㲲 𣬛 𣱄
83 氏 氓 㲳 㲳 𣱅 𣱔
84 气 氳 㲴 㲷 𣱕 𣱰
85 水 灪 㲸 㶠 𣱱 𤆁
86 火 爩 㶡 㸑 𤆂 𤓮
87 爪 爵 㸒 㸕 𤓯 𤕍
88 父 爺 㸖 㸙 𤕎 𤕛
89 爻 爾 㸚 㸚 𤕜 𤕩
90 爿 牆 㸛 㸜 𤕪 𤖧
91 片 牘 㸝 㸥 𤖨 𤘄
92 牙 牚 㸦 㸧 𤘅 𤘓
93 牛 犫 㸨 㹛 𤘔 𤜙
94 犬 玃 㹜 㺧 𤜚 𤣤
95 玄 玈 玄 玈 𤣥 𤣨
96 玉 瓛 㺨 㼈 𤣩 𤫩
97 瓜 瓥 㼉 㼖 𤫪 𤬥
98 瓦 甗 㼗 㽍 𤬦 𤮹
99 甘 甞 㽎 㽑 𤮺 𤯒
100 生 甧 㽒 㽔 𤯓 𤰂
101 用 甯 用 甯 𤰃 𤰑
102 田 疊 㽕 㽯 𤰒 𤴒
103 疋 疑 㽰 㽰 𤴓 𤴤
104 疒 癵 㽱 㿜 𤴥 𤼤
105 癶 發 癶 發 𤼥 𤼼
106 白 皭 㿝 㿩 𤼽 𤿅
107 皮 皾 㿪 㿺 𤿆 𥀾
108 皿 盭 㿻 䀍 𥀿 𥃣
109 目 矚 䀎 䂅 𥃤 𥍜
110 矛 矡 䂆 䂎 𥍝 𥎥
111 矢 矲 䂏 䂕 𥎦 𥐔
112 石 礹 䂖 䃻 𥐕 𥘄
113 示 禷 䃼 䄥 𥘅 𥜺
114 禸 禽 禸 禽 𥜻 𥝋
115 禾 穳 䄦 䆐 𥝌 𥤡
116 穴 竊 䆑 䇁 𥤢 𥩔
117 立 竸 䇂 䇕 𥩕 𥫖
118 竹 籲 䇖 䉹 𥫗 𥸤
119 米 糷 䉺 䊴 𥸥 𥾄
120 糸 缵 䊵 䍁 𥾅 𦈡
121 缶 罐 䍂 䍎 𦈢 𦉩
12
[/indent]
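[indent]SQL implementation of the radical lookup based on Table 2 (x is the character being looked up; index is quoted because it is a reserved word in many SQL dialects)
select "index" from radical where (x>=A1 and x<=A2) or (x>=B1 and x<=B2) or (x>=C1 and x<=C2)
A single statement is enough.[/indent]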
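The same range test can be sketched outside a database. The Python below is only an illustration: it copies three rows from Table 2 (radicals 1, 2 and 9), and the names `RADICAL_RANGES` and `radical_index` are invented for this example. It relies on Python comparing single-character strings by code point, which mirrors the `>=` / `<=` comparisons in the SQL only if the database column uses a binary (code point) collation.

```python
# index -> [(A1, A2), (B1, B2), (C1, C2)], copied from three rows of Table 2.
# A = CJK Unified Ideographs, B = Extension A, C = Extension B.
RADICAL_RANGES = {
    1: [("一", "丧"), ("㐀", "㐂"), ("𠀀", "𠁠")],
    2: [("丨", "丵"), ("㐃", "㐄"), ("𠁡", "𠁻")],
    9: [("人", "儾"), ("㐰", "㒪"), ("𠆢", "𠑵")],
}

def radical_index(ch):
    """Return the radical index whose ranges contain ch, or None if no listed row matches."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    for index, ranges in RADICAL_RANGES.items():
        # Single-character strings compare by code point in Python.
        if any(first <= ch <= last for first, last in ranges):
            return index
    return None
```

With these rows, `radical_index("丁")` returns 1 and `radical_index("你")` returns 9. This covers only the radical lookup; the residual stroke count would need a further table that is not shown in this post.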
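Second, many of the components in the 1997 standard component table are themselves Chinese characters. The ordering rule Unicode uses for CJK ideographs is the radical-and-stroke order of the Kangxi Dictionary, and Extensions A and B both follow this rule strictly. In fact it is the only rule jointly accepted by the four CJKV countries and the wider Han-character region, and I do not know whether Unicode's CJKV would still have been feasible had this rule been violated. So: creating a block for components is fine, but stripping components out of the unified ideographs is not (especially components that are characters in their own right)!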