
UDH Search: Deduplication and Grouping

Author: Fei Yinglin (费英林)

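1. Deduplicating index data

Deduplication prevents duplicate or near-duplicate documents from entering the index.

1.1. Modify solrconfig.xml

Below is a sample configuration. The signatureID field serves as each document's unique ID. It is a derived field: its value is a signature (hash) computed from the combination of data_f1 and data_f2, which are real fields in the documents. The combined value of data_f1 and data_f2 must therefore be unique. When a new document repeats an existing combination, it overwrites the document already in the index. The overwrite happens because signatureID is also the uniqueKey (see 1.2), so overwriteDupes can stay false.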

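<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- field that receives the computed signature -->
    <str name="signatureField">signatureID</str>
    <bool name="overwriteDupes">false</bool>
    <!-- real fields whose combined value is hashed -->
    <str name="fields">data_f1,data_f2</str>
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>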

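Enable deduplication by setting update.chain in the defaults of the request handler used for indexing (the handler's own name and class stay as they are in your configuration):

<requestHandler name="..." class="...">
  <lst name="defaults">
    <str name="update.chain">dedupe</str>
  </lst>
</requestHandler>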

1.2. Modify schema.xml

Set uniqueKey to signatureID, replacing the original value ID:

<uniqueKey>signatureID</uniqueKey>

Add a signatureID field to the fields section.
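A definition along the following lines is a minimal sketch, not the original schema entry. The string type and the indexed, stored, and multiValued settings are assumptions borrowed from the usual Solr deduplication setup.

```xml
<!-- Assumed definition: holds the signature computed by the dedupe chain -->
<field name="signatureID" type="string" indexed="true" stored="true" multiValued="false" />
```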

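In the fields section, change the required attribute of the ID field from true to false. ID is no longer the uniqueKey, so it no longer has to be present in every document.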
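For illustration only, the ID field might then look like the line below. Only required="false" comes from the step above. The type and the other attributes are placeholders and should stay whatever the existing schema already uses.

```xml
<!-- Keep the existing type and attributes; only "required" changes -->
<field name="ID" type="string" indexed="true" stored="true" required="false" />
```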

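1.3. Verification

Rebuild the index and inspect the results to confirm that:

- the results contain no duplicate data;

- a later duplicate overwrites the earlier document.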

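2. Grouping index data

2.1. Modify schema.xml

The grouping field must not be tokenized; it should be defined as a string type. Suppose we want to group by other_articles_title. That field is defined with a Chinese word-segmentation (tokenized) type, so grouping on it directly does not produce the intended groups. We therefore create a separate field, other_articles_title_nt, for grouping.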
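One way to set this up, sketched under assumptions: declare other_articles_title_nt as a string field and fill it from other_articles_title with a copyField, which copies the raw value before any analysis is applied. The copyField and the attribute values are assumptions, since the original field definition is not shown here.

```xml
<!-- Untokenized copy of the title, used only for grouping (assumed attributes) -->
<field name="other_articles_title_nt" type="string" indexed="true" stored="true" />

<!-- Assumed: populate the grouping field from the tokenized title field -->
<copyField source="other_articles_title" dest="other_articles_title_nt" />
```

After a schema change like this the index has to be rebuilt so that existing documents get a value in the new field.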

2.2. Grouping examples

select?q=*%3A*&group=true&group.field=other_articles_title_nt

The query matches all documents (*:*), sets group to true, and groups by other_articles_title_nt. By default each group shows only its top-scoring document, and 10 groups are returned.
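The shape of a grouped response can be sketched as follows, assuming the XML response writer (wt=xml). The counts and the group value are made-up placeholders, and the "..." parts are elided.

```xml
<response>
  <lst name="responseHeader">...</lst>
  <lst name="grouped">
    <lst name="other_articles_title_nt">
      <!-- total number of documents matching q (placeholder value) -->
      <int name="matches">250</int>
      <arr name="groups">
        <lst>
          <!-- the shared field value that defines this group (placeholder) -->
          <str name="groupValue">Example title A</str>
          <!-- up to group.limit documents; numFound is the size of the group -->
          <result name="doclist" numFound="12" start="0">
            <doc>...</doc>
          </result>
        </lst>
        <!-- further groups, up to rows of them -->
      </arr>
    </lst>
  </lst>
</response>
```

Note that rows limits the number of groups, while group.limit limits the number of documents inside each group.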

select?q=*%3A*&group=true&group.field=other_articles_title_nt&group.limit=3

Same as above, but each group shows its 3 top-scoring documents.

select?q=*%3A*&group=true&group.field=other_articles_title_nt&rows=100

Same as the first example, but 100 groups are returned. Each group shows only its top-scoring document.

select?q=*%3A*&group=true&group.field=other_articles_title_nt&group.limit=3&rows=100

Combines the two previous examples: each group shows its 3 top-scoring documents, and 100 groups are returned.

select?q=other_articles_sitename:"sina"&group=true&group.field=other_articles_title_nt&group.limit=3&rows=100

The query matches documents whose site name is sina, sets group to true, and groups by other_articles_title_nt. Each group shows its 3 top-scoring documents, and 100 groups are returned.

select?q=other_articles_sitename:"sina"&group=true&group.field=other_articles_title_nt&group.limit=3&rows=100&group.ngroups=true

Same as above, and the response also reports the total number of groups (group.ngroups=true).
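With group.ngroups=true the grouped section of the response gains an ngroups entry next to matches. The fragment below is a sketch assuming the XML response writer; both numbers are placeholders.

```xml
<lst name="other_articles_title_nt">
  <!-- documents matching the query (placeholder value) -->
  <int name="matches">250</int>
  <!-- distinct groups among those documents; present only with group.ngroups=true -->
  <int name="ngroups">40</int>
  <arr name="groups">...</arr>
</lst>
```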

select?q=other_articles_sitename:"sina" AND other_articles_source:"新浪科技"&group=true&group.field=other_articles_title_nt&group.limit=3&rows=100

or, URL-encoded:

select?q=other_articles_sitename:%22sina%22%20AND%20other_articles_source:%22%E6%96%B0%E6%B5%AA%E7%A7%91%E6%8A%80%22&group=true&group.field=other_articles_title_nt&group.limit=3&rows=100

The query matches documents whose site name is sina and whose source is 新浪科技 (Sina Tech), sets group to true, and groups by other_articles_title_nt. Each group shows its 3 top-scoring documents, and 100 groups are returned.
