Topic categories with confidence level. For example, an article about the Apache Lucene project might be categorized with Topic/Apache Software Foundation (0.34), Topic/Apache Lucene (0.95) and Topic/Information Retrieval (0.84). An article about Apache Nutch might be categorized with Topic/Apache Software Foundation (0.22), Topic/Apache Nutch (0.93) and Topic/Distributed Crawler (0.87). If you index the articles with those facets and ignore the confidence level during facet aggregation, you will get Apache Software Foundation as the top category (with count=2). This might give the user the false impression that the result set focuses mainly on the Apache Software Foundation topic, while the documents don't discuss general ASF issues at all. However, if you take the confidence level into account, you will get Apache Lucene and Apache Nutch as the top categories, while Apache Software Foundation would come last.
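To make the difference between the two aggregation strategies concrete, here is a minimal stand-alone sketch in plain Java. It does not use the Lucene facet API at all; the class and method names (`ConfidenceAggregationSketch`, `countByCategory`, `sumByCategory`) are made up for illustration, and the articles are modeled simply as maps from category to confidence level.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not the Lucene facet API.
class ConfidenceAggregationSketch {

    /** Builds the two example articles as category -> confidence maps. */
    static List<Map<String, Float>> sampleArticles() {
        Map<String, Float> lucene = new LinkedHashMap<>();
        lucene.put("Apache Software Foundation", 0.34f);
        lucene.put("Apache Lucene", 0.95f);
        lucene.put("Information Retrieval", 0.84f);

        Map<String, Float> nutch = new LinkedHashMap<>();
        nutch.put("Apache Software Foundation", 0.22f);
        nutch.put("Apache Nutch", 0.93f);
        nutch.put("Distributed Crawler", 0.87f);

        List<Map<String, Float>> articles = new ArrayList<>();
        articles.add(lucene);
        articles.add(nutch);
        return articles;
    }

    /** Counting aggregation: every occurrence of a category adds 1, confidence is ignored. */
    static Map<String, Integer> countByCategory(List<Map<String, Float>> articles) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map<String, Float> article : articles) {
            for (String category : article.keySet()) {
                counts.merge(category, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Confidence aggregation: every occurrence adds its confidence level. */
    static Map<String, Float> sumByCategory(List<Map<String, Float>> articles) {
        Map<String, Float> sums = new LinkedHashMap<>();
        for (Map<String, Float> article : articles) {
            for (Map.Entry<String, Float> e : article.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Float::sum);
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map<String, Float>> articles = sampleArticles();
        System.out.println("counts: " + countByCategory(articles));
        System.out.println("sums:   " + sumByCategory(articles));
    }
}
```

With counting, Apache Software Foundation is the only category that appears twice, so it ranks first. With summed confidence, its weight is roughly 0.34 + 0.22 = 0.56, which is lower than that of every other category.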
Facet Associations
Lucene Facets let you index categories with confidence level very easily, by assigning a CategoryAssociation to each category. The following short code snippet demonstrates how you would index the first article's categories:

    FacetFields facetFields = new AssociationsFacetFields(taxoWriter);

    // first article's categories with confidence level
    CategoryAssociationsContainer article1 = new CategoryAssociationsContainer();
    article1.setAssociation(
        new CategoryPath("Topic", "Apache Software Foundation"), new CategoryFloatAssociation(0.34f));
    article1.setAssociation(
        new CategoryPath("Topic", "Apache Lucene"), new CategoryFloatAssociation(0.95f));
    article1.setAssociation(
        new CategoryPath("Topic", "Information Retrieval"), new CategoryFloatAssociation(0.84f));

    Document doc = new Document();
    // add the facets (categories and their associations) to the document
    facetFields.addFields(doc, article1);
    // index the document
    indexWriter.addDocument(doc);

Let's take a closer look at the code:
- CategoryAssociationsContainer holds a mapping from a category to its association.
- AssociationsFacetFields adds the needed fields (drill-down terms and category list payload) to the document, along with the association values.
- Finally, the document is indexed with IndexWriter.
To aggregate the categories by their confidence level at search time, request the top Topic categories with an AssociationFloatSumFacetRequest:

    CategoryPath cp = new CategoryPath("Topic");
    // top 10 children of Topic, weighted by the sum of their float associations
    FacetRequest topic = new AssociationFloatSumFacetRequest(cp, 10);
    FacetSearchParams fsp = new FacetSearchParams(topic);
    FacetsCollector fc = new FacetsCollector(fsp, indexReader, taxoReader);
    searcher.search(new MatchAllDocsQuery(), fc);

If you print the top categories (using e.g. the code snippet from here), you will get the following output. Note how the top categories are now those with the higher confidence level, rather than those that appear in more documents:
    Topic (0.0)
    Apache Lucene (0.95)
    Apache Nutch (0.93)
    Distributed Crawler (0.87)
    Information Retrieval (0.84)
    Apache Software Foundation (0.56)

Lucene provides two CategoryAssociation implementations, for integer and float values, as well as two matching FacetRequests which set the weight of a category to the sum of its association values (AssociationFloatSumFacetRequest is the one used in the above code sample). You can extend facet associations either by implementing a CategoryAssociation (and a matching FacetRequest), or by implementing a FacetRequest which computes a different function over the integer/float association values.
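As a rough illustration of what "a different function over the association values" could mean, the following plain-Java sketch aggregates the same confidence values with a pluggable combining function. It deliberately stays outside the Lucene API: `AssociationFunctionSketch`, `aggregate` and `sampleDocs` are hypothetical names, and a real extension would have to be written as a FacetRequest against the facet module of your Lucene version.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical illustration only; a real extension would be a Lucene FacetRequest.
class AssociationFunctionSketch {

    /** Two documents that both carry the Apache Software Foundation category. */
    static List<Map<String, Float>> sampleDocs() {
        Map<String, Float> doc1 = new LinkedHashMap<>();
        doc1.put("Apache Software Foundation", 0.34f);
        doc1.put("Apache Lucene", 0.95f);

        Map<String, Float> doc2 = new LinkedHashMap<>();
        doc2.put("Apache Software Foundation", 0.22f);
        doc2.put("Apache Nutch", 0.93f);

        List<Map<String, Float>> docs = new ArrayList<>();
        docs.add(doc1);
        docs.add(doc2);
        return docs;
    }

    /** Folds the association values of each category with the given function. */
    static Map<String, Float> aggregate(List<Map<String, Float>> docs,
                                        BinaryOperator<Float> combine) {
        Map<String, Float> weights = new LinkedHashMap<>();
        for (Map<String, Float> doc : docs) {
            for (Map.Entry<String, Float> e : doc.entrySet()) {
                // first occurrence stores the value, later ones are combined with it
                weights.merge(e.getKey(), e.getValue(), combine);
            }
        }
        return weights;
    }

    public static void main(String[] args) {
        List<Map<String, Float>> docs = sampleDocs();
        System.out.println("sum: " + aggregate(docs, Float::sum));
        System.out.println("max: " + aggregate(docs, Float::max));
    }
}
```

Swapping `Float::sum` for `Float::max` changes the meaning of a category's weight from "total confidence across matching documents" to "highest confidence in any single matching document"; under the latter, Apache Software Foundation would weigh 0.34 instead of about 0.56.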
The code given in the link for printing the output does not print the categories. It throws an exception: "this FacetRequest does not support this type of Aggregator anymore". I am using Lucene 4.4.