Best practices for storing 40 million nodes containing words and 45 million relationships


(Johncarlof1976) #1

Hello all,

I've been using Neo4j for some weeks and I think it's awesome.

I'm building an NLP application, and basically, I'm using Neo4j for storing the dependency graph generated by a semantic parser, something like this:

In the nodes, I store the single words contained in the sentences, and I connect them through relations with a number of different types.

For my application, I have the requirement to find all the nodes that contain a given word, so basically I have to search through all the nodes, finding those that contain the input word. Of course, I've already created an index on the word text field.

I'm working on a very big dataset (by the way, the CSV importer is a great thing).

Here are the details of the graph.db:

  • 47.108.544 nodes

  • 45.442.034 relationships

  • 13.39 GiB db size

  • Index created on token.text field
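
For reference, an index like the one listed above might be created as follows. This is only a sketch: the `token` label and `text` property come from the post, but the Neo4j version is not stated there, and the syntax differs between versions.

```cypher
// Neo4j 3.x syntax (assumed version; not stated in the post):
CREATE INDEX ON :token(text);

// Neo4j 4.x and later use this form instead:
// CREATE INDEX FOR (t:token) ON (t.text);
```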

PROFILE MATCH (t:token) WHERE t.text="switch" RETURN t.text


NodeIndexSeek: 251,679 db hits
Projection: 251,678 db hits
ProduceResults: 251,678 db hits

I doubted whether indexing such a large number of nodes was good practice. In the first prototype db, I created a new node for each word I encountered in the text, even if its text was the same as that of other nodes.

I then re-implemented the db structure using unique word nodes: the number of nodes dropped from 47.108.544 to 1.934.049, and the db size to 3.5 gigabytes.

I still have a huge number of relationships, 45.442.034 that now point to the unique nodes, and I'm not sure if this is a good architecture.
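
One way to judge whether the unique-node design is a problem is to look for very dense nodes, since a frequent word now collects the relationships of every one of its occurrences. A minimal sketch, assuming the `token` label and `text` property from the post; the `LIMIT` value is arbitrary:

```cypher
// List the words with the most incoming relationships.
MATCH (t:token)<-[r]-()
RETURN t.text AS word, count(r) AS incoming
ORDER BY incoming DESC
LIMIT 25;
```

This touches every relationship in the graph, so it may take a while on a database of this size.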

My end goal is to find specific patterns in sentence structures, like the following example

(John)<-[NSUBJ]-(eat)-[DOBJ]->(apple)

Could you please help me with a suggestion or best practice to adopt for this specific case? I think that Neo4j is a great piece of software and I'd like to make the most out of it :-)

thank you very much


(Michael Hunger) #2

I think it's better to continue from your original question rather than create new posts :-)

Perhaps @Christophe_Willemsen has some suggestions.

What do your current queries look like and what's their PROFILE output?

PROFILE
MATCH path = (:token {text:"John"})<-[:NSUBJ]-(:token {text:"eat"})-[:DOBJ]->(:token {text:"apple"})
RETURN path

(Christophe Willemsen) #3

In neo4j-nlp ( https://github.com/graphaware/neo4j-nlp ), we store unique lemmas and keep each occurrence of a word in a TagOccurrence node, which means the database can grow quickly when you want to keep the syntactic dependency graph in Neo4j. We also store the NER on the TagOccurrence and use indexes for the occurrence token value. 47 million nodes is really nothing for Neo4j. What you do need to take care of is having a good list of stopwords, because they are generally useless and have a very high degree of incoming relationships, so avoid storing words like "the", "if", ...
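
The unique-lemma plus occurrence layout described above could be sketched roughly like this. Only `TagOccurrence`, `NSUBJ` and `DOBJ` appear in this thread; the `Lemma` label, the `value` property and the `OCCURRENCE_OF` relationship type are hypothetical names chosen for illustration and are not taken from the actual neo4j-nlp schema.

```cypher
// One shared node per distinct lemma (hypothetical label and property).
MERGE (john:Lemma {value: "John"})
MERGE (eat:Lemma {value: "eat"})
MERGE (apple:Lemma {value: "apple"})
// One node per occurrence of a word in a sentence.
CREATE (o1:TagOccurrence {value: "John"})-[:OCCURRENCE_OF]->(john)
CREATE (o2:TagOccurrence {value: "eats"})-[:OCCURRENCE_OF]->(eat)
CREATE (o3:TagOccurrence {value: "apples"})-[:OCCURRENCE_OF]->(apple)
// Dependency relationships connect occurrences, not lemmas.
CREATE (o2)-[:NSUBJ]->(o1)
CREATE (o2)-[:DOBJ]->(o3);

// Find the sentence pattern by starting from the lemmas.
MATCH (s:TagOccurrence)-[:OCCURRENCE_OF]->(:Lemma {value: "John"}),
      (v:TagOccurrence)-[:OCCURRENCE_OF]->(:Lemma {value: "eat"}),
      (o:TagOccurrence)-[:OCCURRENCE_OF]->(:Lemma {value: "apple"}),
      (s)<-[:NSUBJ]-(v)-[:DOBJ]->(o)
RETURN s, v, o;
```

Because the dependency relationships hang off the occurrences, a match stays within one sentence. With only unique word nodes, an NSUBJ relationship from one sentence and a DOBJ relationship from another could combine into a match unless the relationships carry something like a sentence identifier.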


(Johncarlof1976) #4

Thank you Christophe, this is really helpful!


(Johncarlof1976) #5

Thank you Michael. Actually, the PROFILE query freezes, but I got a great suggestion from Christophe's reply.