Best practices for storing 40 millions of nodes, containing words, and 45 millions of relations

johncarlof1976 · October 10, 2018, 9:53pm

Hello all,

I've been using Neo4j for some weeks and I think it's awesome.

I'm building an NLP application, and basically, I'm using Neo4j for storing the dependency graph generated by a semantic parser, something like this:

In the nodes, I store the single words contained in the sentences, and I connect them through relations with a number of different types.

For my application, I have the requirement to find all the nodes that contain a given word, so basically I have to search through all the nodes, finding those that contain the input word. Of course, I've already created an index on the word text field.
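As a point of reference, an index like the one mentioned here could be created roughly as follows. This is only a sketch: the first statement uses the Neo4j 3.x syntax that was current when this thread was written, the commented-out statement shows the newer named-index form, and the index name token_text is made up for illustration.

```cypher
// Neo4j 3.x syntax: index the text property of nodes labelled token
CREATE INDEX ON :token(text);

// Newer Neo4j versions use named indexes instead
// (the name token_text is hypothetical):
// CREATE INDEX token_text IF NOT EXISTS FOR (t:token) ON (t.text);
```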

I'm working on a very big dataset (by the way, the CSV importer is a great thing).

Here are the details of the graph.db:

47,108,544 nodes
45,442,034 relationships
13.39 GiB db size
Index created on token.text field

PROFILE MATCH (t:token) WHERE t.text="switch" RETURN t.text

Operator          db hits
NodeIndexSeek     251,679
Projection        251,678
ProduceResults    251,678

I wasn't sure whether indexing that many nodes was good practice. In the first prototype db, I created a new node for each word I encountered in the text, even if its text was the same as that of other nodes.

Then I re-implemented the db structure using unique words/nodes: the number of nodes dropped from 47,108,544 to 1,934,049, and the db size to 3.5 gigabytes.

I still have a huge number of relationships (45,442,034) that now point to the unique nodes, and I'm not sure if this is a good architecture.
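One way to see how those relationships are spread over the unique nodes is to list the words with the highest degree. The query below is a rough sketch that assumes the token label and text property used earlier in this post; it touches every relationship, so it is meant as a one-off check rather than something to run often.

```cypher
// Count relationships (in either direction) per unique word node
// and show the 25 most connected words.
MATCH (t:token)-[r]-()
RETURN t.text AS word, count(r) AS degree
ORDER BY degree DESC
LIMIT 25;
```

Words that dominate this list are the ones every pattern query through them will have to fan out over.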

My end goal is to find specific patterns in sentence structures, like the following example:

(John)<-[NSUBJ]-(eat)-[DOBJ]->(apple)

Could you please help me with a suggestion or best practice to adopt for this specific case? I think that Neo4j is a great piece of software and I'd like to make the most out of it :-)

thank you very much

michael.hunger · October 12, 2018, 12:24am

I think it's better to continue from your original question rather than starting new posts.

Perhaps @Christophe_Willemsen has some suggestions.

What do your current queries look like and what's their PROFILE output?

PROFILE
MATCH path = (:token {text:"John"})<-[:NSUBJ]-(:token {text:"eat"})-[:DOBJ]->(:token {text:"apple"})
RETURN path

Christophe_Willemsen · October 15, 2018, 7:05am

In neo4j-nlp (GitHub: graphaware/neo4j-nlp, NLP Capabilities in Neo4j), we store unique lemmas and keep each occurrence of a word in a TagOccurrence node, which means the database can grow quickly when you want to keep the syntactic dependency graph in Neo4j. We also store the NER on the TagOccurrence and use indexes for the occurrence token value. 47 million nodes is really nothing for Neo4j. What you do need to take care of is having a good list of stopwords: they are generally useless and have a very high degree of incoming relationships, so avoid storing words like "the, if, ...".
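The "unique lemma plus occurrence node" idea can be sketched in Cypher roughly as follows. The labels Lemma and Occurrence, the OF_LEMMA relationship, and the properties value, sentenceId and position are hypothetical names chosen for illustration; they are not neo4j-nlp's actual schema. The dependency relationships (NSUBJ, DOBJ) sit between occurrences, so a match stays within the sentence the occurrences belong to.

```cypher
// One node per unique lemma (hypothetical label and property names).
MERGE (john:Lemma {value: "John"})
MERGE (eat:Lemma {value: "eat"})
MERGE (apple:Lemma {value: "apple"})
// One node per token occurrence in a sentence, linked to its lemma.
CREATE (o1:Occurrence {value: "John", sentenceId: 1, position: 0})-[:OF_LEMMA]->(john)
CREATE (o2:Occurrence {value: "eats", sentenceId: 1, position: 1})-[:OF_LEMMA]->(eat)
CREATE (o3:Occurrence {value: "apples", sentenceId: 1, position: 2})-[:OF_LEMMA]->(apple)
// Dependency relationships connect occurrences, not lemmas.
CREATE (o2)-[:NSUBJ]->(o1)
CREATE (o2)-[:DOBJ]->(o3);

// Find sentences where "eat" has "John" as subject and "apple" as object.
MATCH (:Lemma {value: "John"})<-[:OF_LEMMA]-(s)<-[:NSUBJ]-(v)-[:DOBJ]->(o)-[:OF_LEMMA]->(:Lemma {value: "apple"}),
      (v)-[:OF_LEMMA]->(:Lemma {value: "eat"})
RETURN s.sentenceId AS sentence, s.value AS subject, v.value AS verb, o.value AS object;
```

The trade-off is the one described above: the lemma nodes stay unique and indexable, while the number of occurrence nodes grows with the size of the corpus.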

johncarlof1976 · October 15, 2018, 9:32am

Thank you Christophe, this is really helpful!

johncarlof1976 · October 15, 2018, 9:32am

Thank you Michael. Actually, the PROFILE query freezes, but I got a great suggestion from Christophe's reply.
