Hello all,
I've been using Neo4j for a few weeks and I think it's awesome.
I'm building an NLP application, and I'm using Neo4j to store the dependency graphs generated by a semantic parser. Each node stores a single word from a sentence, and the words are connected by relationships of a number of different types.
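To make this concrete, here's a simplified sketch of how one parsed sentence is stored (the :token label and text property match my schema; relationship types like NSUBJ and DOBJ come from the parser):

// Simplified sketch: storing the dependencies of "John eats an apple"
CREATE (s:token {text: "John"})
CREATE (v:token {text: "eat"})
CREATE (o:token {text: "apple"})
CREATE (v)-[:NSUBJ]->(s)
CREATE (v)-[:DOBJ]->(o)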
My application needs to find all the nodes whose text matches a given word, which means searching across the entire graph for that word. Of course, I've already created an index on the word text field.
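For reference, the index was created with the schema-index syntax, something like:

CREATE INDEX ON :token(text)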
I'm working with a very large dataset (by the way, the CSV importer is a great thing).
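The import looks roughly like this (a sketch using LOAD CSV; the file and column names here are hypothetical):

// Sketch: bulk-loading the parser output, one node per word occurrence
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///tokens.csv" AS row
CREATE (:token {text: row.text})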
Here are the details of the graph.db:
- 47,108,544 nodes
- 45,442,034 relationships
- 13.39 GiB db size
- Index created on the token.text field
Profiling the lookup query gives:

PROFILE MATCH (t:token) WHERE t.text="switch" RETURN t.text

NodeIndexSeek: 251,679 db hits
Projection: 251,678 db hits
ProduceResults: 251,678 db hits

So the index is definitely being used, but a common word still produces over 250,000 hits.
I was in doubt whether indexing such a large number of nodes was good practice. In the first prototype db, I created a new node for each word occurrence I encountered in the text, even when its text was the same as that of an existing node.
Then I re-implemented the db structure using unique words/nodes: the number of nodes dropped from 47,108,544 to 1,934,049, and the db size to 3.5 GB.
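Conceptually, the unique-word import looks something like this sketch (again with hypothetical file and column names; the real relationship type varies per row, so in practice it isn't a single hard-coded NSUBJ):

// Sketch: one node per distinct word, deduplicated via MERGE
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///deps.csv" AS row
MERGE (h:token {text: row.head})
MERGE (d:token {text: row.dependent})
CREATE (h)-[:NSUBJ]->(d)  // actual type comes from the parser output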
I still have a huge number of relationships (45,442,034) that now point to the unique nodes, and I'm not sure whether this is a good architecture.
My end goal is to find specific patterns in sentence structures, like the following example:
(John)<-[NSUBJ]-(eat)-[DOBJ]->(apple)
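In Cypher, I'd express that pattern as something like:

// Find sentences where "John" is the subject of "eat" and "apple" the object
MATCH (s:token {text: "John"})<-[:NSUBJ]-(v:token {text: "eat"})-[:DOBJ]->(o:token {text: "apple"})
RETURN s, v, o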
Could you please help me with a suggestion or best practice to adopt for this specific case? I think that Neo4j is a great piece of software and I'd like to make the most out of it :-)
Thank you very much.