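# Information Retrieval: Boolean Retrieval - Intersection and Union (1)

By 大脸猪, written 2017-01-03 20:08, updated 2020-10-19 13:06

## Preface

Boolean retrieval answers a query by applying Boolean operations to a document collection. For example, take the following three documents (already normalized; the first element of each list is the document ID):

```python
doc1 = ["1", "hello", "word", "i", "love", "dazhu"]
doc2 = ["2", "hi", "i", "can", "speak", "love"]
doc3 = ["3", "can", "i", "say", "hello", "make", "dazhu", "hi"]
```

Suppose we want the documents in this collection that contain **both** "i" and "can". Given the input:

```
"i" AND "can"
```

the result should be `[2, 3]`: evaluating the query shows that `doc2` and `doc3` satisfy the condition.

Implementing Boolean retrieval comes down to two things: building an `inverted index`, and computing the intersection and union of N sets. This post starts with a simple algorithm for the intersection and union of two sets.

## Intersection and Union

Boolean retrieval first needs the intersection or union of two sets. If both sets are stored as lists sorted in ascending order with no duplicates, a two-pointer merge computes either one in `O(x + y)` time, where `x` and `y` are the lengths of the two lists. Reference code:

```python
def arr_and(arr1, arr2):
    """Intersection of two ascending, duplicate-free lists."""
    p1 = 0
    p2 = 0
    result = []
    while p1 < len(arr1) and p2 < len(arr2):
        if arr1[p1] == arr2[p2]:
            # Present in both lists: keep it and advance both pointers.
            result.append(arr1[p1])
            p1 += 1
            p2 += 1
        elif arr1[p1] < arr2[p2]:
            # The smaller value cannot appear later in the other list.
            p1 += 1
        else:
            p2 += 1
    return result


def arr_or(arr1, arr2):
    """Union of two ascending, duplicate-free lists."""
    p1 = 0
    p2 = 0
    result = []
    while p1 < len(arr1) and p2 < len(arr2):
        if arr1[p1] == arr2[p2]:
            # Present in both lists: emit it once.
            result.append(arr1[p1])
            p1 += 1
            p2 += 1
        elif arr1[p1] < arr2[p2]:
            result.append(arr1[p1])
            p1 += 1
        else:
            result.append(arr2[p2])
            p2 += 1
    # At most one list still has elements left; append them unchanged.
    if p1 < len(arr1):
        result += arr1[p1:]
    if p2 < len(arr2):
        result += arr2[p2:]
    return result


# Test
arr1 = [1, 3, 5, 7, 8, 12]
arr2 = [1, 4, 5, 6, 7, 8]
print(arr_and(arr1, arr2))  # [1, 5, 7, 8]
print(arr_or(arr1, arr2))   # [1, 3, 4, 5, 6, 7, 8, 12]
```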
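To connect the two-pointer intersection back to the `"i" AND "can"` query from the preface, here is a minimal sketch of how an inverted index could feed it. It is only an illustration under stated assumptions, not the author's implementation: the helper names `build_inverted_index` and `intersect` are hypothetical, the first element of each document is assumed to be its ID, and `intersect` repeats the same idea as `arr_and` so that the block runs on its own.

```python
from collections import defaultdict

# The three example documents; the first element is assumed to be the document ID.
docs = [
    ["1", "hello", "word", "i", "love", "dazhu"],
    ["2", "hi", "i", "can", "speak", "love"],
    ["3", "can", "i", "say", "hello", "make", "dazhu", "hi"],
]


def build_inverted_index(docs):
    """Hypothetical helper: map each term to an ascending list of document IDs."""
    postings = defaultdict(set)
    for doc in docs:
        doc_id = int(doc[0])
        for term in doc[1:]:
            postings[term].add(doc_id)
    # Sorting gives the ascending, duplicate-free lists the merge relies on.
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a, b):
    """Two-pointer intersection of two ascending lists (same idea as arr_and)."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result


index = build_inverted_index(docs)
# "i" AND "can": intersect the two posting lists.
print(intersect(index.get("i", []), index.get("can", [])))  # [2, 3]
```

Looking terms up with `index.get(term, [])` means a term that appears in no document simply yields an empty result instead of raising a `KeyError`.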