Please credit the source when reposting: http://blog.csdn.net/l1028386804/article/details/48897589
Notes on installing and configuring Coreseek + Mmseg for Chinese word segmentation.
First install the required packages:

    yum install make gcc gcc-c++ libtool autoconf automake imake libxml2-devel expat-devel

Download, build, and install mmseg:

    wget http://www.coreseek.cn/uploads/csft/4.0/coreseek-4.1-beta.tar.gz
    tar -zxvf coreseek-4.1-beta.tar.gz
    cd coreseek-4.1-beta/mmseg-3.2.14
    ./bootstrap
    ./configure --prefix=/usr/local/mmseg3
    make && make install
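If building mmseg fails with "cannot find input file: src/Makefile.in", regenerate the build files and rebuild:

    aclocal
    libtoolize --force
    automake --add-missing
    autoconf
    autoheader
    make clean
    ./configure --prefix=/usr/local/mmseg3
    make && make install

Build and install Coreseek, which is essentially a Chinese-language modification of Sphinx: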
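    # The 4.1-beta tarball names this directory csft-4.1 (it is csft-3.2.14 in the 3.2.14 release).
    cd ../csft-4.1/
    sh buildconf.sh
    ./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg \
        --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ \
        --with-mmseg-libs=/usr/local/mmseg3/lib/ \
        --with-mysql=/usr/local/mysql/
    make && make install

If the build still reports that libiconv cannot be found, edit src/Makefile (vi src/Makefile), find the line that starts with "LIBS =", and change

    LIBS = -lm -lz -lexpat -L/usr/local/lib -lpthread

to

    LIBS = -lm -lz -lexpat -liconv -L/usr/local/lib -lpthread

then run make && make install again.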
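The same Makefile edit can be scripted instead of done by hand in vi. The following is only a sketch: `add_iconv_to_libs` is a made-up helper name, and it assumes the LIBS line contains `-lexpat` as shown above. If it does not, the file is left unchanged.

```shell
#!/bin/sh
# Add -liconv to the LIBS line of a Makefile unless it is already present.
# add_iconv_to_libs is a hypothetical helper name, not part of Coreseek.
add_iconv_to_libs() {
    makefile=$1
    tmp="$makefile.tmp.$$"
    # Only lines starting with "LIBS = " that do not yet mention -liconv
    # are modified; everything else is copied through untouched.
    sed '/^LIBS = /{
/-liconv/!s/-lexpat/-lexpat -liconv/
}' "$makefile" > "$tmp" && mv "$tmp" "$makefile"
}

# Usage, from inside the csft source directory:
#   add_iconv_to_libs src/Makefile
```

Running the helper twice is harmless, because the second pass sees `-liconv` and skips the line.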
That completes the Coreseek + Mmseg installation, so now we can try out Chinese word segmentation. The Coreseek package ships a test database, example.sql; import it to use as test data.
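The post does not say where example.sql sits inside the package, so one cautious way to import it is to search the unpacked tree first. This is a sketch under assumptions: `find_example_sql` is a hypothetical helper, the tarball is assumed to be unpacked into ./coreseek-4.1-beta, and the target database is assumed to be `test` (the database name used in the configuration in this post).

```shell
#!/bin/sh
# Print the path of the first example.sql found under the given directory.
# find_example_sql is a hypothetical helper; it returns non-zero if the
# file is not found.
find_example_sql() {
    found=$(find "$1" -type f -name example.sql 2>/dev/null | head -n 1)
    [ -n "$found" ] && printf '%s\n' "$found"
}

# Usage (assumed unpack directory and database name):
#   sql_file=$(find_example_sql ./coreseek-4.1-beta) &&
#       mysql -u root -p test < "$sql_file"
```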
For a simple configuration, create a file named sphinx.conf under /usr/local/coreseek/etc with the following content:
    source src1
    {
        type               = mysql
        sql_host           = localhost
        sql_user           = root
        sql_pass           = yuansir
        sql_db             = test
        sql_port           = 3306    # optional, default is 3306
        sql_query          = \
            SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
            FROM documents
        sql_attr_uint      = group_id
        sql_attr_timestamp = date_added
        sql_query_info     = SELECT * FROM documents WHERE id=$id
    }

    index test1
    {
        source           = src1
        path             = /usr/local/coreseek/var/data/test1
        docinfo          = extern
        charset_type     = zh_cn.utf-8
        charset_dictpath = /usr/local/mmseg3/etc/
        ngram_len        = 0
    }

    indexer
    {
        mem_limit = 32M
    }

    searchd
    {
        port            = 9312
        log             = /usr/local/coreseek/var/log/searchd.log
        query_log       = /usr/local/coreseek/var/log/query.log
        read_timeout    = 5
        max_children    = 30
        pid_file        = /usr/local/coreseek/var/log/searchd.pid
        max_matches     = 1000
        seamless_rotate = 1
        preopen_indexes = 0
        unlink_old      = 1
    }

Edit mmseg.ini under /usr/local/mmseg3/etc as follows:
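    [mmseg]
    merge_number_and_ascii=0;
    ;merge letters and digits, e.g. abc123/x
    number_and_ascii_joint=-;
    ;characters that may join letters and digits
    compress_space=1;
    ;not supported yet
    seperate_number_ascii=0;
    ;split letters and digits apart

That completes a simple configuration. See the official Coreseek documentation for a thorough description of every option; only one data source and one index are configured here.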
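The configuration writes index files to /usr/local/coreseek/var/data and logs to /usr/local/coreseek/var/log. If either directory is missing, indexer and searchd have nowhere to write, so it can be worth making sure they exist first. A minimal sketch, where `prepare_coreseek_dirs` is a hypothetical helper name and the default prefix is the one used throughout this post:

```shell
#!/bin/sh
# Create the data and log directories that sphinx.conf refers to.
# prepare_coreseek_dirs is a hypothetical helper name; mkdir -p leaves
# existing directories untouched.
prepare_coreseek_dirs() {
    prefix=${1:-/usr/local/coreseek}
    mkdir -p "$prefix/var/data" "$prefix/var/log"
}

# Usage:
#   prepare_coreseek_dirs /usr/local/coreseek
```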
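Build the index, then start searchd:

    /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf --all
    /usr/local/coreseek/bin/searchd -c /usr/local/coreseek/etc/sphinx.conf

To rebuild the index later while searchd is running, add --rotate:

    /usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/sphinx.conf --rotate --all

See the official documentation for the full list of searchd and indexer options.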
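Before pointing a client at searchd, it helps to confirm that the daemon is actually up. One simple check, sketched below, reads the pid_file configured above and tests whether that process is alive. `searchd_running` is a hypothetical helper name, and `kill -0` only succeeds if the current user is allowed to signal the process.

```shell
#!/bin/sh
# Succeed if the PID stored in the given pid file belongs to a live process.
# searchd_running is a hypothetical helper name; the default path is the
# pid_file value from the sphinx.conf in this post.
searchd_running() {
    pid_file=${1:-/usr/local/coreseek/var/log/searchd.pid}
    [ -r "$pid_file" ] || return 1
    pid=$(cat "$pid_file")
    case $pid in
        ''|*[!0-9]*) return 1 ;;   # empty or non-numeric content
    esac
    # Signal 0 performs the existence and permission check without
    # delivering a signal.
    kill -0 "$pid" 2>/dev/null
}

# Usage:
#   searchd_running && echo "searchd is up" || echo "searchd is not running"
```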
    <?php
    /**
     * Sphinx test script
     * @author liuyazhuang
     */
    $keyword = '我叫刘亚壮';

    $sp = new SphinxClient();             // provided by the PHP Sphinx extension
    $sp->setServer('192.168.1.100', 9312) or die("can't connect to sphinx"); // choose the server
    $sp->setConnectTimeout(1);            // connection timeout in seconds
    $sp->setMatchMode(SPH_MATCH_ANY);     // full-text matching mode
    $sp->setArrayResult(true);            // return matches as a plain array
    // Query index test1 (several indexes may be comma-separated; "*" means all indexes).
    $res = $sp->query($keyword, 'test1');

    // mysqli is used because the mysql_* functions no longer exist in PHP 7.
    $conn = mysqli_connect('192.168.106.131', 'root', 'root', 'test', 3306)
        or die("can't connect to the mysql");
    mysqli_set_charset($conn, 'utf8');

    if (!empty($res['matches'])) {
        // Collect the matched ids into a list such as "1,2" for the SQL IN clause.
        $ids = array();
        foreach ($res['matches'] as $item) {
            $ids[] = (int) $item['id'];
        }
        $query = mysqli_query($conn,
            'SELECT title, content FROM documents WHERE id IN (' . join(',', $ids) . ')');
        $options = array(
            'before_match'    => " <font color='red'>",
            'after_match'     => '</font>',
            'chunk_separator' => '',
            'limit'           => 300,
        );
        while ($row = mysqli_fetch_assoc($query)) {
            // Build excerpts of the title and content with the keyword highlighted.
            $result = $sp->buildExcerpts(array($row['title'], $row['content']), 'test1', $keyword, $options);
            echo $result[0] . "<br/><br/>" . $result[1] . "<br/><br/><br/>";
        }
    }

See the official PHP manual for the full Sphinx API. The buildExcerpts method used here highlights the matched keywords in the results. With that, the setup is complete.