UXDB支持包括中文在内的多语言全文检索,提供基于自然语言的文本查询能力,内置缺省的分词解析器解析英文,采用zhparser 插件调用 SCWS引擎进行中文分词,实现全文检索,并可按查询相关度排序,支持对短语的搜索和模糊搜索(ux-trgm)。
888,A019,000000428601,0000,00000,"",0000,青岛业务支持中心,20200410,165004,22
888,A019,000000010632,0000,00001,"",0000,电信广场5楼,20200623,083204,23
888,A019,000000428609,0000,00000,"",0000,南京管线维护站,20200410,165004,24
888,A019,000000428637,0000,00000,"",0000,北海市农业办公楼,20200410,165004,25
~100万条
zhparser是基于Simple Chinese Word Segmentation(SCWS)中文分词库实现的一个扩展
create extension zhparser;
其中simple为一个字典模板
mandt varchar(20) default '000' not null,
bukrs varchar(20) default '' not null,
anln1 varchar(50) default '' not null,
luntn varchar(20) default '',
lanep varchar(20) default '00000',
anupd varchar(20) default '',
funtn varchar(20) default '',
anlhtxt text default '',
zdate varchar(32) default '00000000',
ztime varchar(32) default '000000',
mykey serial primary key);
\copy anlh_tmp from '/home/uxdb/tmp.txt' DELIMITER ',' csv;
create table anlh_ts(mykey int,anlhtxt tsvector);
insert into anlh_ts select mykey,to_tsvector('testzhcfg',anlhtxt) from anlh_tmp;
select count(*) from anlh_tmp where anlhtxt like '%办公室%';
select count(*) from anlh_tmp t join anlh_ts s on t.mykey=s.mykey where s.anlhtxt @@ to_tsquery('testzhcfg','办公室');
explain select count(*) from anlh_tmp where anlhtxt like '%办公室%';
explain select count(*) from anlh_tmp t join anlh_ts s on t.mykey=s.mykey where s.anlhtxt @@ to_tsquery('testzhcfg','办公室');
SELECT anlhtxt, ts_rank_cd(to_tsvector('testzhcfg',anlhtxt),to_tsquery('testzhcfg','办公室') ) AS rank
FROM anlh_tmp
ORDER BY rank DESC
limit 5;
anlhtxt | rank
----------------------------------------------------------------+------
北滘君兰机场办公室外墙光分箱-机场办公室ZHX | 0.2
办公室,二车间办公室 | 0.2
均安鸿达服装公司办公室至鸿达服装办公室基站 | 0.2
凤凰南路59号香洲图书馆3F办公室墙壁 | 0.1
市信息大厦12楼办公室 | 0.1
(5 rows)
最后,把以上SQL引用到用户搜索页即可完成。
-FIN-