According to the Lucene javadoc: "No norms means that index-time field and document boosting and field length normalization are disabled. The benefit is less memory usage as norms take up one byte of RAM per indexed field for every document in the index, during searching. Note that once you index a given field with norms enabled, disabling norms will have no effect." In other words, with no norms, document boost, field boost, and field length normalization are all disabled at indexing time. The benefit is lower memory usage: during searching, Lucene no longer has to hold one byte of norms per indexed field for every document in the index. However, disabling norms is all-or-nothing: as soon as one field keeps its norms, the fields that disabled theirs still get a default norms value stored for them. The reason is lookup speed: Lucene locates a document's norms by multiplying the document number by the per-document size of the norms entry, and if a document in the middle were missing, that offset could not be computed. So norms are either stored for every document or not stored at all.
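To make the offset arithmetic concrete, here is a minimal sketch against the Lucene 3.x API (an illustration, not part of the original tests; the directory name reuses the first experiment below). IndexReader.norms(String) returns one byte per document for a field that has norms, and Similarity.decodeNorm turns that byte back into a float:

public void dumpNorms() throws Exception {
  IndexReader reader = IndexReader.open(FSDirectory.open(new File("testNormsDocBoost")));
  byte[] norms = reader.norms("contents");
  for (int docID = 0; docID < reader.maxDoc(); docID++) {
    // The norm of document docID sits at offset docID * 1 byte,
    // which is why no document in the middle may be skipped.
    System.out.println("doc " + docID + " norm: " + Similarity.decodeNorm(norms[docID]));
  }
  reader.close();
}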
The following experiments illustrate what the norms information does:
Experiment 1: the effect of Document Boost
public void testNormsDocBoost() throws Exception {
  File indexDir = new File("testNormsDocBoost");
  IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
  writer.setUseCompoundFile(false);
  Document doc1 = new Document();
  Field f1 = new Field("contents", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
  doc1.add(f1);
  // Boost the whole document; the boost is folded into the norms of its fields.
  doc1.setBoost(100);
  writer.addDocument(doc1);
  Document doc2 = new Document();
  Field f2 = new Field("contents", "common common hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
  doc2.add(f2);
  writer.addDocument(doc2);
  Document doc3 = new Document();
  Field f3 = new Field("contents", "common common common", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
  doc3.add(f3);
  writer.addDocument(doc3);
  writer.close();
}
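The search side of this experiment is missing from the extract; a minimal sketch of what it presumably looks like (the query term and output format are assumptions). Searching for "common", doc1 should come out first: its boost of 100 outweighs the higher term frequency of doc2 and doc3, whose NO_NORMS fields cannot record any boost at all.

public void searchNormsDocBoost() throws Exception {
  IndexSearcher searcher = new IndexSearcher(IndexReader.open(FSDirectory.open(new File("testNormsDocBoost"))));
  TopDocs docs = searcher.search(new TermQuery(new Term("contents", "common")), 10);
  for (ScoreDoc sd : docs.scoreDocs) {
    System.out.println("docid : " + sd.doc + " score : " + sd.score);
  }
  searcher.close();
}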
Experiment 2: the effect of Field Boost

public void testNormsFieldBoost() throws Exception {
  File indexDir = new File("testNormsFieldBoost");
  IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
  writer.setUseCompoundFile(false);
  Document doc1 = new Document();
  Field f1 = new Field("title", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
  // Boost a single field; like document boost, this is stored in the norms.
  f1.setBoost(100);
  doc1.add(f1);
  writer.addDocument(doc1);
  Document doc2 = new Document();
  Field f2 = new Field("contents", "common common hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
  doc2.add(f2);
  writer.addDocument(doc2);
  writer.close();
}

Searching across both fields for "common" (e.g. title:common contents:common), doc1 should rank first because of the boost on its title field.
Experiment 3: the effect of document length in norms on scoring

public void testNormsLength() throws Exception {
  File indexDir = new File("testNormsLength");
  IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
  writer.setUseCompoundFile(false);
  Document doc1 = new Document();
  Field f1 = new Field("contents", "common hello hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
  doc1.add(f1);
  writer.addDocument(doc1);
  Document doc2 = new Document();
  Field f2 = new Field("contents", "common common hello hello hello hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
  doc2.add(f2);
  writer.addDocument(doc2);
  writer.close();
}

With norms disabled on both documents, length normalization cannot penalize the longer doc2, so its higher frequency of "common" makes it rank first when searching for "common".
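For comparison, with norms enabled, DefaultSimilarity computes lengthNorm as 1/sqrt(numTerms) and tf as sqrt(freq), so the longer document's term-frequency advantage is largely cancelled out. A small worked sketch (not part of the original tests):

public static void main(String[] args) {
  // doc1: freq(common) = 1, field length 3; doc2: freq(common) = 2, field length 6
  double score1 = Math.sqrt(1) * (1.0 / Math.sqrt(3)); // ≈ 0.577
  double score2 = Math.sqrt(2) * (1.0 / Math.sqrt(6)); // ≈ 0.577
  // With norms, tf · lengthNorm comes out (nearly) equal for both documents;
  // the one-byte norm encoding loses precision, so real scores differ slightly.
  System.out.println(score1 + " vs " + score2);
}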
Experiment 4: norms are stored either for every document or for none

public void testOmitNorms() throws Exception {
  File indexDir = new File("testOmitNorms");
  IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
  writer.setUseCompoundFile(false);
  // One document whose "title" field keeps its norms...
  Document doc1 = new Document();
  Field f1 = new Field("title", "common hello hello", Field.Store.NO, Field.Index.ANALYZED);
  doc1.add(f1);
  writer.addDocument(doc1);
  // ...followed by 10000 documents that disable norms on "contents".
  for (int i = 0; i < 10000; i++) {
    Document doc2 = new Document();
    Field f2 = new Field("contents", "common common hello hello hello hello", Field.Store.NO, Field.Index.ANALYZED_NO_NORMS);
    doc2.add(f2);
    writer.addDocument(doc2);
  }
  writer.close();
}

Because one field keeps its norms, the norms (.nrm) file still ends up with a default entry for every one of the 10001 documents.
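A hedged way to verify this (an illustration, not part of the original test): the norms array of the "title" field spans every document in the index, including the 10000 documents that do not even contain that field.

public void checkOmitNorms() throws Exception {
  IndexReader reader = IndexReader.open(FSDirectory.open(new File("testOmitNorms")));
  System.out.println("maxDoc = " + reader.maxDoc());                   // 10001
  System.out.println("title norms = " + reader.norms("title").length); // also 10001
  reader.close();
}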
Setting a Query Boost on the search query

At search time, individual terms of the query can also be boosted, which raises the score of documents containing them:

public void testQueryBoost() throws Exception {
  File indexDir = new File("TestQueryBoost");
  IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
  Document doc1 = new Document();
  Field f1 = new Field("contents", "common1 hello hello", Field.Store.NO, Field.Index.ANALYZED);
  doc1.add(f1);
  writer.addDocument(doc1);
  Document doc2 = new Document();
  Field f2 = new Field("contents", "common2 common2 hello", Field.Store.NO, Field.Index.ANALYZED);
  doc2.add(f2);
  writer.addDocument(doc2);
  writer.close();
}
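The search side is missing from the extract; a minimal sketch, assuming the query boosts common1 (e.g. common1^100 common2) so that doc1 outranks doc2 even though common2 occurs twice:

public void searchQueryBoost() throws Exception {
  IndexSearcher searcher = new IndexSearcher(IndexReader.open(FSDirectory.open(new File("TestQueryBoost"))));
  QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", new StandardAnalyzer(Version.LUCENE_CURRENT));
  // "^100" multiplies the weight of common1 via t.getBoost() in the scoring formula.
  Query query = parser.parse("common1^100 common2");
  TopDocs docs = searcher.search(query, 10);
  for (ScoreDoc sd : docs.scoreDocs) {
    System.out.println("docid : " + sd.doc + " score : " + sd.score);
  }
  searcher.close();
}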
So how does a Query Boost affect document scoring? Recall Lucene's scoring formula:

score(q,d) = coord(q,d) · queryNorm(q) · ∑_{t in q} ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )

Note: q.getBoost() also appears inside the queryNorm part, but it belongs to the normalization of the query vector (see "The Vector Space Model and Lucene's scoring mechanism" [http://forfuture1978.javaeye.com/blog/588721]).
Subclassing and implementing your own Similarity
Similarity is the central class in Lucene's score computation; overriding its methods lets you intervene in the scoring process:
(1) float computeNorm(String field, FieldInvertState state)
(2) float lengthNorm(String fieldName, int numTokens)
(3) float queryNorm(float sumOfSquaredWeights)
(4) float tf(float freq)
(5) float idf(int docFreq, int numDocs)
(6) float coord(int overlap, int maxOverlap)
(7) float scorePayload(int docId, String fieldName, int start, int end, byte[] payload, int offset, int length)
They affect the following parts of the score computation:

score(q,d) = (6)coord(q,d) · (3)queryNorm(q) · ∑_{t in q} ( (4)tf(t in d) · (5)idf(t)² · t.getBoost() · (1)norm(t,d) )

norm(t,d) = doc.getBoost() · (2)lengthNorm(field) · ∏_{field f in d named t} f.getBoost()
(6) float coord(int overlap, int maxOverlap): a query usually contains several terms, and a document may contain several of them; this factor raises the score of documents that contain more of the query's terms.
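The TestCoord example below instantiates a MySimilarity whose definition does not appear in this extract; a minimal sketch, assuming it extends DefaultSimilarity and overrides coord() (for instance returning a constant 1, which switches off the reward for matching more terms so its effect can be observed by comparison):

public class MySimilarity extends DefaultSimilarity {
  @Override
  public float coord(int overlap, int maxOverlap) {
    // DefaultSimilarity returns overlap / (float) maxOverlap; returning 1
    // disables the reward for matching more of the query's terms.
    return 1;
  }
}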
public void TestCoord() throws Exception {
MySimilarity sim = new MySimilarity();
File indexDir = new File("TestCoord");
IndexWriter writer = new IndexWriter(FSDirectory.open(indexDir), new StandardAnalyzer(Version.LUCENE_CURRENT), true, IndexWriter.MaxFieldLength.LIMITED);
Document doc1 = new Document();
Field f1 = new Field("contents", "common hello world", Field.Store.NO, Field.Index.ANALYZED);
doc1.add(f1);
writer.addDocument(doc1);
Document doc2 = new Document();
Field f2 = new Field("contents", "common common common", Field.Store.NO, Field.Index.ANALYZED);
doc2.add(f2);
writer.addDocument(doc2);
for (int i = 0; i < 10; i++) {
Document doc3 = new Document();
Field f3 = new Field("contents", "world", Field.Store.NO, Field.Index.ANALYZED);