How does a custom Lucene Analyzer work for full-text index search?

In the example given here, we have implemented a custom Analyzer that supports 'case insensitive exact matches' by combining KeywordTokenizerFactory and LowerCaseFilterFactory.

Implementation:

package com.test.nosql.neo4j;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordTokenizerFactory;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.neo4j.annotations.service.ServiceProvider;
import org.neo4j.graphdb.schema.AnalyzerProvider;

import java.io.IOException;

@ServiceProvider
public class KeywordLowerAnalyzerProvider extends AnalyzerProvider {

    public static final String DESCRIPTION = "same as keyword analyzer, but additionally applies a lower case filter to all tokens";
    public static final String ANALYZER_NAME = "keyword_lower";

    public KeywordLowerAnalyzerProvider() {
        super(ANALYZER_NAME);
    }

    @Override
    public Analyzer createAnalyzer() {
        try {
            return CustomAnalyzer.builder()
                    .withTokenizer(KeywordTokenizerFactory.class)
                    .addTokenFilter(LowerCaseFilterFactory.class)
                    .build();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public String description() {
        return DESCRIPTION;
    }
}

We can now:

  • Create an index with the custom analyzer

    CALL db.index.fulltext.createNodeIndex("acc_idx_1",["Account"],["nativeId"], { analyzer: "keyword_lower",eventually_consistent: "true" });

  • Load some data, for example by creating two nodes

CREATE (n:Account) set n.nativeId = 'JOhN DoE' return n;

CREATE (n:Account) set n.nativeId = 'John Doe' return n;

  • In the background, the nativeId property is indexed and stored in some index table.
    • Question 1: How is the custom analyzer applied? In our example, are KeywordTokenizer and LowerCaseFilter applied sequentially?
      • "JOhN DoE" --apply keyword tokenizer--> "JOhN DoE" --apply lowercase filter--> "john doe"
      • "John Doe" --> "John Doe" --> "john doe"
    • Question 2: What data is actually stored in the index table? Is it the final state ('john doe' in our example) or an intermediate state?
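To make the sequence in Question 1 concrete, here is a minimal plain-Java sketch of the chain as I understand it. It does not call Lucene. The class and method names (AnalyzerChainSketch, keywordTokenize, lowerCaseFilter, analyze) are made up for illustration, and it assumes the token filter runs on the tokenizer's output in the order the builder declares them.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical model of the analyzer chain; not Lucene code.
public class AnalyzerChainSketch {

    // Models the keyword tokenizer: the whole input becomes a single token.
    public static List<String> keywordTokenize(String input) {
        return List.of(input);
    }

    // Models the lower-case filter: every token is lower-cased.
    // Locale.ROOT is used here as an approximation of Lucene's behaviour.
    public static List<String> lowerCaseFilter(List<String> tokens) {
        return tokens.stream()
                .map(t -> t.toLowerCase(Locale.ROOT))
                .collect(Collectors.toList());
    }

    // Tokenizer first, then the filter, mirroring the builder order.
    public static List<String> analyze(String input) {
        return lowerCaseFilter(keywordTokenize(input));
    }

    public static void main(String[] args) {
        System.out.println(analyze("JOhN DoE")); // [john doe]
        System.out.println(analyze("John Doe")); // [john doe]
    }
}
```

If this model is right, both nodes would end up under the same single term, which is what Question 2 is asking about.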

Now query the graph using the index:

CALL db.index.fulltext.queryNodes('acc_idx_1', 'john DOE') yield node as n return n

  • Question 3: How is the indexed value compared to user input at runtime? Is the input value (from the user) tokenized, converted to lowercase, and then compared?
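The behaviour I am guessing at in Question 3 can be sketched as a toy term index in plain Java. This is only a model of my hypothesis, not Neo4j or Lucene code. ToyTermIndex and its methods are hypothetical names, and it assumes the whole query string reaches the analyzer as one unit. The real query parser may interpret whitespace or special characters as query syntax first, which I have not verified.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical toy index; illustrates the hypothesis only.
public class ToyTermIndex {

    // term -> ids of nodes whose analyzed property value produced that term
    private final Map<String, List<Long>> postings = new HashMap<>();

    // Stand-in for the keyword + lower-case chain: one lower-cased token per input.
    public static String analyze(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    // Index time: only the analyzed term is stored, not the original casing.
    public void add(long nodeId, String propertyValue) {
        postings.computeIfAbsent(analyze(propertyValue), k -> new ArrayList<>()).add(nodeId);
    }

    // Query time (assumed): the user input goes through the same analysis,
    // then the lookup is an exact match on the resulting term.
    public List<Long> query(String userInput) {
        return postings.getOrDefault(analyze(userInput), List.of());
    }

    public static void main(String[] args) {
        ToyTermIndex index = new ToyTermIndex();
        index.add(1L, "JOhN DoE");
        index.add(2L, "John Doe");
        System.out.println(index.query("john DOE")); // [1, 2]
        System.out.println(index.query("john"));     // []
    }
}
```

In this model a partial value such as 'john' finds nothing, because the stored term is the whole lower-cased string.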

Overall, how does the custom Analyzer work? What values are stored, and how are they compared at runtime?