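In HanLP, both Chinese-to-pinyin conversion and Simplified/Traditional Chinese conversion rely on the double-array trie (Double-array Trie) and on the Aho-Corasick DoubleArrayTrie algorithm (ACDAT), an AC automaton built on top of a double-array trie. It helps to be familiar with both before reading on.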
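Converting the sentence 重载不是重任 to pinyin gives the result below. The labels read, in order: original text, pinyin with tone numbers, pinyin with tone marks, pinyin without tones, tone, initial (shengmu), final (yunmu), and input-method head.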
    原文:重载不是重任
    拼音(数字音调):chong2,zai3,bu2,shi4,zhong4,ren4,
    拼音(符号音调):chóng,zǎi,bú,shì,zhòng,rèn,
    拼音(无音调):chong,zai,bu,shi,zhong,ren,
    声调:2,3,2,4,4,4,
    声母:ch,z,b,sh,zh,r,
    韵母:ong,ai,u,i,ong,en,
    输入法头:ch,z,b,sh,zh,r,
pinyin.txt
    一丁点儿=yi1,ding1,dian3,er5
    一不小心=yi1,bu4,xiao3,xin1
    一丘之貉=yi1,qiu1,zhi1,he2
    一丝不差=yi4,si1,bu4,cha1
    一丝不苟=yi1,si1,bu4,gou3
    一个=yi1,ge4
    一个半个=yi1,ge4,ban4,ge4
    一个巴掌拍不响=yi1,ge4,ba1,zhang3,pai1,bu4,xiang3
    一个萝卜一个坑=yi1,ge4,luo2,bo5,yi1,ge4,keng1
    一举两得=yi1,ju3,liang3,de2
    一之为甚=yi1,zhi1,wei2,shen4
Training: generating pinyin.txt.bin
HanLP-1.7.5\src\main\java\com\hankcs\hanlp\corpus\dictionary\SimpleDictionary.java
Load the corpus, read it line by line, split each line at `=`, and put the resulting entries into the dictionary `trie`.
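The loading step can be sketched roughly as follows. This is a minimal illustration, not HanLP's code: the class `PinyinLineParser` and its `parse` method are hypothetical names, and a plain `TreeMap` stands in for the dictionary trie.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of HanLP: parses "word=py1,py2,..." dictionary lines.
class PinyinLineParser {
    /** Splits each line at the first '=' and maps the word to its comma-separated syllables. */
    static TreeMap<String, String[]> parse(Iterable<String> lines) {
        TreeMap<String, String[]> map = new TreeMap<>();
        for (String line : lines) {
            int idx = line.indexOf('=');
            if (idx <= 0) continue; // skip blank or malformed lines
            String word = line.substring(0, idx).trim();
            String[] syllables = line.substring(idx + 1).trim().split(",");
            map.put(word, syllables);
        }
        return map;
    }

    public static void main(String[] args) {
        TreeMap<String, String[]> map =
                parse(Arrays.asList("一个=yi1,ge4", "一举两得=yi1,ju3,liang3,de2"));
        for (Map.Entry<String, String[]> e : map.entrySet()) {
            System.out.println(e.getKey() + " -> " + String.join(" ", e.getValue()));
        }
    }
}
```

A sorted map is used here because the `build` method shown later in this post takes a `TreeMap`.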
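Each syllable on the right-hand side of `=` is then resolved with a call such as Pinyin.valueOf("yi1"). The returned enum constant carries the initial (shengmu), the final (yunmu), the tone, the string form with a tone mark, and the string form without a tone.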
    public enum Pinyin
    {
        a1(Shengmu.none, Yunmu.a, 1, "ā", "a", Head.a, 'a'),
        a2(Shengmu.none, Yunmu.a, 2, "á", "a", Head.a, 'a'),
        a3(Shengmu.none, Yunmu.a, 3, "ǎ", "a", Head.a, 'a'),
        a4(Shengmu.none, Yunmu.a, 4, "à", "a", Head.a, 'a'),
        a5(Shengmu.none, Yunmu.a, 5, "a", "a", Head.a, 'a'),
        ai1(Shengmu.none, Yunmu.ai, 1, "āi", "ai", Head.a, 'a'),
        ai2(Shengmu.none, Yunmu.ai, 2, "ái", "ai", Head.a, 'a'),
        ai3(Shengmu.none, Yunmu.ai, 3, "ǎi", "ai", Head.a, 'a'),
        ai4(Shengmu.none, Yunmu.ai, 4, "ài", "ai", Head.a, 'a'),
        ......
    }
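To make the lookup concrete, here is a cut-down stand-in for such an enum. `MiniPinyin`, its five fields, and the `parse` fallback are assumptions made for illustration; the real `Pinyin` enum has more constants and fields, as the excerpt above shows.

```java
// Hypothetical, cut-down stand-in for HanLP's Pinyin enum: three constants, five fields.
enum MiniPinyin {
    yi1("y", "i", 1, "yī", "yi"),
    ge4("g", "e", 4, "gè", "ge"),
    none5("none", "none", 5, "none", "none"); // fallback for unknown syllables

    final String shengmu;      // initial
    final String yunmu;        // final
    final int tone;            // 1-4, or 5 for the neutral tone
    final String withToneMark; // e.g. "yī"
    final String withoutTone;  // e.g. "yi"

    MiniPinyin(String shengmu, String yunmu, int tone, String withToneMark, String withoutTone) {
        this.shengmu = shengmu;
        this.yunmu = yunmu;
        this.tone = tone;
        this.withToneMark = withToneMark;
        this.withoutTone = withoutTone;
    }

    /** Resolves a token such as "yi1"; unknown names map to none5 instead of throwing. */
    static MiniPinyin parse(String token) {
        try {
            return valueOf(token);
        } catch (IllegalArgumentException e) {
            return none5;
        }
    }

    public static void main(String[] args) {
        MiniPinyin p = parse("yi1");
        System.out.println(p + " " + p.shengmu + " " + p.yunmu + " " + p.tone + " " + p.withToneMark);
    }
}
```

Because `valueOf` on any Java enum throws `IllegalArgumentException` for a name that is not declared, a wrapper like `parse` is one way to keep a single bad dictionary token from aborting the whole load.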
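The map is then built into a double-array trie with `trie.build(map)`. For background, see the article "HanLP: how the double-array trie (Double-array Trie) is implemented, with code and diagrams (if you still can't follow it, come and hit me)".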
    public void build(TreeMap<String, V> map)
    {
        // keep the values
        v = (V[]) map.values().toArray();
        l = new int[v.length];
        Set<String> keySet = map.keySet();
        // build the binary trie
        addAllKeyword(keySet);
        // build the double-array trie on top of the binary trie
        buildDoubleArrayTrie(keySet);
        used = null;
        // build the failure table and merge the output table
        constructFailureStates();
        rootState = null;
        loseWeight();
    }
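The "failure table" and "merge output" steps are the core of any Aho-Corasick automaton. The sketch below shows them on an ordinary trie whose transitions live in hash maps. It is not HanLP's implementation: there is no double array here, and `TinyAhoCorasick` and `findAll` are hypothetical names chosen for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical teaching version: hash-map transitions instead of base/check arrays.
class TinyAhoCorasick {
    private final List<Map<Character, Integer>> next = new ArrayList<>();
    private final List<Integer> fail = new ArrayList<>();
    private final List<List<String>> output = new ArrayList<>();

    TinyAhoCorasick(Collection<String> keywords) {
        newState(); // state 0 is the root
        for (String k : keywords) { // step 1: insert every keyword into the trie
            int s = 0;
            for (char c : k.toCharArray()) {
                Integer t = next.get(s).get(c);
                if (t == null) { t = newState(); next.get(s).put(c, t); }
                s = t;
            }
            output.get(s).add(k);
        }
        // step 2: breadth-first pass that fills the failure table and merges outputs
        Deque<Integer> queue = new ArrayDeque<>(next.get(0).values()); // depth-1 states fail to root
        while (!queue.isEmpty()) {
            int s = queue.poll();
            for (Map.Entry<Character, Integer> e : next.get(s).entrySet()) {
                char c = e.getKey();
                int t = e.getValue();
                int f = fail.get(s);
                while (f != 0 && !next.get(f).containsKey(c)) f = fail.get(f);
                Integer target = next.get(f).get(c);
                int ft = (target == null) ? 0 : target;
                fail.set(t, ft);
                output.get(t).addAll(output.get(ft)); // inherit matches ending at the fail state
                queue.add(t);
            }
        }
    }

    private int newState() {
        next.add(new HashMap<>());
        fail.add(0);
        output.add(new ArrayList<>());
        return next.size() - 1;
    }

    /** Scans the text once and reports every hit as "keyword@startIndex". */
    List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        int s = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (s != 0 && !next.get(s).containsKey(c)) s = fail.get(s);
            Integer t = next.get(s).get(c);
            s = (t == null) ? 0 : t;
            for (String k : output.get(s)) hits.add(k + "@" + (i - k.length() + 1));
        }
        return hits;
    }

    public static void main(String[] args) {
        TinyAhoCorasick ac = new TinyAhoCorasick(Arrays.asList("重", "重载", "不是", "重任"));
        System.out.println(ac.findAll("重载不是重任"));
    }
}
```

The double-array layout replaces the per-state hash maps with two integer arrays, which is what makes the structure compact and cheap to dump to disk.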
The model file is then generated by calling saveDat(path, trie, map.entrySet());
    static boolean saveDat(String path, AhoCorasickDoubleArrayTrie<Pinyin[]> trie, Set<Map.Entry<String, Pinyin[]>> entrySet)
    {
        try
        {
            DataOutputStream out = new DataOutputStream(new BufferedOutputStream(IOUtil.newOutputStream(path + Predefine.BIN_EXT)));
            // number of entries, then each entry's pinyin array as enum ordinals
            out.writeInt(entrySet.size());
            for (Map.Entry<String, Pinyin[]> entry : entrySet)
            {
                Pinyin[] value = entry.getValue();
                out.writeInt(value.length);
                for (Pinyin pinyin : value)
                {
                    out.writeInt(pinyin.ordinal());
                }
            }
            // the trie itself follows the value section
            trie.save(out);
            out.close();
        }
        catch (Exception e)
        {
            // log message: "failed to cache values to dat <path>"
            logger.warning("缓存值dat" + path + "失败");
            return false;
        }
        return true;
    }
    /**
     * Persist the trie
     *
     * @param out a DataOutputStream
     * @throws Exception possible IO exceptions and the like
     */
    public void save(DataOutputStream out) throws Exception
    {
        out.writeInt(size);
        for (int i = 0; i < size; i++)
        {
            out.writeInt(base[i]);
            out.writeInt(check[i]);
            out.writeInt(fail[i]);
            int output[] = this.output[i];
            if (output == null)
            {
                out.writeInt(0);
            }
            else
            {
                out.writeInt(output.length);
                for (int o : output)
                {
                    out.writeInt(o);
                }
            }
        }
        out.writeInt(l.length);
        for (int length : l)
        {
            out.writeInt(length);
        }
    }
    // path = data/dictionary/pinyin/pinyin.txt
    static boolean loadDat(String path)
    {
        ByteArray byteArray = ByteArray.createByteArray(path + Predefine.BIN_EXT);
        if (byteArray == null) return false;
        int size = byteArray.nextInt();
        Pinyin[][] valueArray = new Pinyin[size][];
        for (int i = 0; i < valueArray.length; ++i)
        {
            int length = byteArray.nextInt();
            valueArray[i] = new Pinyin[length];
            for (int j = 0; j < length; ++j)
            {
                valueArray[i][j] = pinyins[byteArray.nextInt()];
            }
        }
        if (!trie.load(byteArray, valueArray)) return false;
        return true;
    }

    public boolean load(ByteArray byteArray, V[] value)
    {
        if (byteArray == null) return false;
        size = byteArray.nextInt();
        base = new int[size + 65535];   // leave extra room to avoid out-of-bounds access
        check = new int[size + 65535];
        fail = new int[size + 65535];
        output = new int[size + 65535][];
        int length;
        for (int i = 0; i < size; ++i)
        {
            base[i] = byteArray.nextInt();
            check[i] = byteArray.nextInt();
            fail[i] = byteArray.nextInt();
            length = byteArray.nextInt();
            if (length == 0) continue;
            output[i] = new int[length];
            for (int j = 0; j < output[i].length; ++j)
            {
                output[i][j] = byteArray.nextInt();
            }
        }
        length = byteArray.nextInt();
        l = new int[length];
        for (int i = 0; i < l.length; ++i)
        {
            l[i] = byteArray.nextInt();
        }
        v = value;
        return true;
    }
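The value section of the file is just a count followed by length-prefixed arrays of enum ordinals. The round trip below reproduces that layout in memory with `java.io` streams. `OrdinalCodec` and its tiny `Syllable` enum are hypothetical stand-ins, and `DataInputStream` is used in place of HanLP's `ByteArray` reader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical illustration of the "count, then length-prefixed ordinals" layout.
class OrdinalCodec {
    enum Syllable { yi1, ge4, ju3 } // stand-in for the Pinyin enum

    /** Writes the entry count, then for each entry its length and the ordinal of each constant. */
    static byte[] write(Syllable[][] values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(values.length);
            for (Syllable[] value : values) {
                out.writeInt(value.length);
                for (Syllable s : value) {
                    out.writeInt(s.ordinal());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the same layout back, turning each ordinal into a constant via values(). */
    static Syllable[][] read(byte[] data) {
        Syllable[] all = Syllable.values();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            Syllable[][] values = new Syllable[in.readInt()][];
            for (int i = 0; i < values.length; i++) {
                values[i] = new Syllable[in.readInt()];
                for (int j = 0; j < values[i].length; j++) {
                    values[i][j] = all[in.readInt()];
                }
            }
            return values;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Syllable[][] original = { { Syllable.yi1, Syllable.ge4 }, { Syllable.ju3 } };
        Syllable[][] restored = read(write(original));
        System.out.println(Arrays.deepToString(restored));
    }
}
```

One consequence of storing ordinals is that the binary file is only valid for the enum declaration order it was written with; if constants are added or reordered, the cache has to be regenerated from the text dictionary.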
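The pinyin of each character is then looked up with the ACDAT; see the article "HanLP: the Aho-Corasick DoubleArrayTrie algorithm (ACDAT), an AC automaton based on a double-array trie".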
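Why 重 comes out as chong2 in 重载 but zhong4 in 重任 is easiest to see with a small matcher that always prefers the longest dictionary word at each position. The sketch below assumes that greedy longest-match behaviour and uses plain `HashMap` lookups instead of the ACDAT; the class name, the `"none"` placeholder, and the sample entries are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simplification: substring lookups in a HashMap instead of one ACDAT scan.
class LongestMatchPinyin {
    /** At each position takes the longest dictionary word; unknown characters yield "none". */
    static List<String> convert(String text, Map<String, String[]> dict) {
        int maxLen = 0;
        for (String key : dict.keySet()) {
            maxLen = Math.max(maxLen, key.length());
        }
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String[] hit = null;
            int hitLen = 0;
            // try the longest candidate first, then shorter ones
            for (int len = Math.min(maxLen, text.length() - i); len >= 1; len--) {
                String[] value = dict.get(text.substring(i, i + len));
                if (value != null) {
                    hit = value;
                    hitLen = len;
                    break;
                }
            }
            if (hit == null) {
                result.add("none"); // placeholder for characters without an entry
                i++;
            } else {
                result.addAll(Arrays.asList(hit));
                i += hitLen;
            }
        }
        return result;
    }

    /** Sample entries made up for this example, not copied from pinyin.txt. */
    static Map<String, String[]> sampleDict() {
        Map<String, String[]> dict = new HashMap<>();
        dict.put("重", new String[]{"zhong4"});
        dict.put("不", new String[]{"bu4"});
        dict.put("重载", new String[]{"chong2", "zai3"});
        dict.put("不是", new String[]{"bu2", "shi4"});
        dict.put("重任", new String[]{"zhong4", "ren4"});
        return dict;
    }

    public static void main(String[] args) {
        System.out.println(convert("重载不是重任", sampleDict()));
    }
}
```

The automaton reaches the same kind of result in a single pass over the text, without re-slicing substrings at every position.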
    import com.hankcs.hanlp.HanLP;
    import com.hankcs.hanlp.dictionary.py.Pinyin;

    import java.util.List;

    public class DemoPinyin
    {
        public static void main(String[] args)
        {
            String text = "重载不是重任";
            List<Pinyin> pinyinList = HanLP.convertToPinyinList(text);

            // original text
            System.out.print("原文:");
            for (char c : text.toCharArray())
            {
                System.out.printf("%c", c);
            }
            System.out.println();

            // pinyin with tone numbers
            System.out.print("拼音(数字音调):");
            for (Pinyin pinyin : pinyinList)
            {
                System.out.printf("%s,", pinyin);
            }
            System.out.println();

            // pinyin with tone marks
            System.out.print("拼音(符号音调):");
            for (Pinyin pinyin : pinyinList)
            {
                System.out.printf("%s,", pinyin.getPinyinWithToneMark());
            }
            System.out.println();

            // pinyin without tones
            System.out.print("拼音(无音调):");
            for (Pinyin pinyin : pinyinList)
            {
                System.out.printf("%s,", pinyin.getPinyinWithoutTone());
            }
            System.out.println();

            // tone
            System.out.print("声调:");
            for (Pinyin pinyin : pinyinList)
            {
                System.out.printf("%s,", pinyin.getTone());
            }
            System.out.println();

            // initial (shengmu)
            System.out.print("声母:");
            for (Pinyin pinyin : pinyinList)
            {
                System.out.printf("%s,", pinyin.getShengmu());
            }
            System.out.println();

            // final (yunmu)
            System.out.print("韵母:");
            for (Pinyin pinyin : pinyinList)
            {
                System.out.printf("%s,", pinyin.getYunmu());
            }
            System.out.println();

            // input-method head
            System.out.print("输入法头:");
            for (Pinyin pinyin : pinyinList)
            {
                System.out.printf("%s,", pinyin.getHead());
            }
            System.out.println();
        }
    }
Output:
    原文:重载不是重任
    拼音(数字音调):chong2,zai3,bu2,shi4,zhong4,ren4,
    拼音(符号音调):chóng,zǎi,bú,shì,zhòng,rèn,
    拼音(无音调):chong,zai,bu,shi,zhong,ren,
    声调:2,3,2,4,4,4,
    声母:ch,z,b,sh,zh,r,
    韵母:ong,ai,u,i,ong,en,
    输入法头:ch,z,b,sh,zh,r,
Data download: http://download.hanlp.com/data-for-1.7.5.zip