We would love to use some of the more advanced features from the Graph Data Science Library for doing things like training node embedding models using GraphSAGE. However, right now it looks like there is a model catalog into which you can save the trained model, but it must be in-memory like the Graph Catalog, since restarting the database wipes it out.
So my question is: how do we get these trained models (especially inductive ones like graphSAGE that can be reused when new data are put into the graph) into persistent storage so restarting the database doesn't cause them to be lost?
One idea (that might not be as elegant as you were hoping for) would be to write your GraphSAGE embeddings as node properties within the named graph. Then you would use something like gds.beta.graph.export.csv() to write that graph to CSV.
Another thought is that if you are planning on using these embeddings in an outside program written in something like Python, you could write the embeddings as node properties in the full database and then query that with your connector/package of choice.
Thanks for the idea Clair! I'm not sure this gets at what I need however. Wouldn't exporting the embeddings just provide me with node-specific results? Ideally I'd want to archive a version of the inductive model itself so I could load it up (if I lost the model from the catalog) and use it for inference on different graphs, new nodes in the same graph, etc. (so presumably in the form of its weights and biases, much like saving out a checkpoint file from PyTorch works for example).
There is alpha functionality in the Enterprise edition of GDS 1.5 that allows for storing models to disk and loading them back. You can read about it here. This will not allow you to, say, transfer that model computer-to-computer, but it will save you the time and frustration of having to completely redo your models in the catalog on a Neo4j restart.
Thanks! I saw that announcement right after replying to you, LOL. Great minds think alike I guess! Given that we are still experimenting with resource needs for our Neo4j instance (and thus spinning up and down instances to host them on the cloud), this is unfortunately still not an ideal fit, but definitely helpful to know about.
How does one get access to GDSL Enterprise? Is it activated automatically when using Enterprise Edition of the database? And is there any literature that shows the difference between Community and Enterprise GDSL?
The main differences between GDS community edition (CE) and enterprise edition (EE) are:
Maximum of 4 cores for computation in CE vs. unlimited in EE
Fine grained security support & integration in EE only
Model catalog only stores 1 model in CE vs. unlimited in EE
Model persistence and publishing (share models between users) in EE only
Low memory analytics graph format (up to 75% less heap consumption) in EE only
GDS Enterprise Edition is a separate product from the core database EE license, so having a Neo4j Enterprise license alone doesn't unlock the GDS EE features. Enterprise features are tagged in the docs with an "Enterprise" label, and also described here (although I noticed that hasn't been updated in several releases, so the list is incomplete; I've filed a card to fix that!)
If you want to send me a note at alicia.frame@neo4j.com, I'm happy to chat about access to a license.
'you could write the embeddings as node properties in the full database and then query that with your connector/package of choice.'
Hi @clair.sullivan, by 'the full database', did you mean the Neo4j DB or some external relational DB, e.g. MySQL? Adding embeddings to node properties significantly increases the size of the graph. Would this slow down the query performance of the graph? What about storing the embedding vectors in MySQL, so that a query can get results from Neo4j together with embeddings from MySQL?
Back when I wrote that I meant the Neo4j DB. But one thing to note here is that queries (traversals) do not access the properties themselves, such as your GraphSAGE embeddings if you have written them to the nodes. Instead, they work on an index, so the node properties don't have to be fetched upon lookup. If you are filtering on properties, then you might experience a slowdown.
Another place you might get a slowdown is in reading that data out of the database into something else like, for example, Python. I didn't quite follow the idea of using MySQL to handle the embeddings, so perhaps you can clarify your idea a bit more?
I also want to mention that the model catalog functionality I mentioned when we first discussed this has now moved from alpha to beta. So you might want to check back in on it and see if it might help.
@clair.sullivan Regarding using MySQL to handle embeddings, I mean storing the nodes' embeddings in MySQL and then storing, in Neo4j, each embedding's MySQL identifier as a node property, so downstream applications can pull a node's info primarily from Neo4j, plus one additional piece of info from MySQL. But if the embeddings need to be used in Neo4j at runtime, then they would have to be read from MySQL and written back to Neo4j.
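A minimal sketch of that split, for illustration only. It assumes a relational table keyed by an embedding identifier and uses Python's built-in sqlite3 as a stand-in for MySQL so that it runs without a server. The table name, column names, helper functions, and the "emb-42" identifier are all hypothetical; on the Neo4j side only the identifier would be stored as a node property.

```python
import json
import sqlite3


def create_store(conn):
    # One row per embedding, keyed by the identifier kept on the Neo4j node.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node_embeddings ("
        "embedding_id TEXT PRIMARY KEY, vector TEXT NOT NULL)"
    )


def save_embedding(conn, embedding_id, vector):
    # The vector is serialized as JSON text; a real setup might use a BLOB.
    conn.execute(
        "INSERT OR REPLACE INTO node_embeddings (embedding_id, vector) "
        "VALUES (?, ?)",
        (embedding_id, json.dumps(list(vector))),
    )
    conn.commit()


def load_embedding(conn, embedding_id):
    # Returns the vector as a list of floats, or None if the id is unknown.
    row = conn.execute(
        "SELECT vector FROM node_embeddings WHERE embedding_id = ?",
        (embedding_id,),
    ).fetchone()
    return None if row is None else json.loads(row[0])


conn = sqlite3.connect(":memory:")  # stand-in for a MySQL connection
create_store(conn)
save_embedding(conn, "emb-42", [0.1, -0.3, 0.7])

# Hypothetical record as it might come back from a Neo4j query:
node_from_neo4j = {"name": "example", "embedding_id": "emb-42"}
vector = load_embedding(conn, node_from_neo4j["embedding_id"])
```

The cost of this design is the extra round trip per lookup, and any algorithm that needs the vectors inside Neo4j would first have to copy them back onto the nodes.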
All those considerations are for scalability. Regarding embedding training, e.g. GraphSAGE, is there any plan to support single/multiple GPUs, or is there already a way to use GPUs that I am not aware of?
If you don't need your embeddings in SQL, you can also store them directly in Neo4j by using write mode with the embedding procedure call, and then querying the results individually with Cypher. That will increase the size of your Neo4j database, but it should not slow down or impede query performance.
We do not support GPUs for GraphSAGE, and it's not currently on our roadmap. We do offer a concurrency parameter, however, that lets you take advantage of parallel processing with CPUs.
@alicia_frame1 Good to know that storing embeddings in Neo4j won't slow down queries, and good to know about the export option. I will try the biggest graph I can train with GraphSAGE in 3 days on a single machine with 40 cores. I currently have a graph with 4 million nodes and 30 million relationships. Hopefully I can train it within 3 days.
For speed, you might want to train GraphSAGE on a subgraph (e.g. run community detection and choose a single large community) and then use the model to generate embeddings for the remaining nodes. It's not super fast for training on a huge graph.
GraphSAGE is also faster if you run it unweighted, and use the mean aggregator instead of pool.
For really big graphs, we usually recommend using FastRPExtended - like GraphSAGE, it can encode properties as well as graph structure, but it uses linear algebra to encode the embedding instead of sampling and aggregation.
@alicia_frame1 For FastRPExtended, I checked that it can use node properties. However, does it only work for homogeneous graphs, as FastRP does? My graph has multiple node and relationship types.
As for training GraphSAGE on a subgraph, I have multiple node and relationship types, so do I need to first perform a proper sampling of all the nodes and relationships? If the training data doesn't cover all the node types and relationship types, the model won't be good. To run community detection, do I need to do a similar sampling of all the data?
For FastRPExtended, you'll need to add a feature property that encodes the labels, and also pad any missing properties when you create your graph projection.
For example, if you have People and Instrument nodes, when you project your graph, you'd want to create an IsPerson feature (0/1) and an IsInstrument feature - or just a labelEncoding property with dimension 2; and then if you have an age property for people - but not instruments, you'd want to set a defaultValue on loading your graph (probably to 0). GraphSAGE automatically does all this - in our next release, we'll be adding that extension for FastRP, but for now you have to do it manually.
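To make the manual step concrete, here is a rough sketch of the encoding in plain Python, outside of GDS. It only illustrates the idea of a one-hot label encoding plus default-value padding; the function name, the dictionary layout, and the "Person"/"Instrument" sample nodes are assumptions made for this example, not part of the GDS API.

```python
def build_features(nodes, labels, numeric_props, default_value=0.0):
    """Return {node_id: feature_vector} for a mixed-label node set.

    Each vector is a one-hot encoding of the node's label followed by its
    numeric properties, with missing properties padded by default_value.
    """
    features = {}
    for node in nodes:
        # One 0/1 slot per label, e.g. [IsPerson, IsInstrument].
        one_hot = [1.0 if node["label"] == label else 0.0 for label in labels]
        # Pad properties the node does not have (e.g. age on an instrument).
        padded = [float(node.get(prop, default_value)) for prop in numeric_props]
        features[node["id"]] = one_hot + padded
    return features


# Hypothetical sample data: a person with an age, an instrument without one.
nodes = [
    {"id": 1, "label": "Person", "age": 34},
    {"id": 2, "label": "Instrument"},
]
features = build_features(
    nodes, labels=["Person", "Instrument"], numeric_props=["age"]
)
# features[1] is [1.0, 0.0, 34.0] and features[2] is [0.0, 1.0, 0.0]
```

Every node ends up with a vector of the same length, which is the property the projection needs before an embedding algorithm can consume the features.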
If you want to sample your data using a community detection algorithm, I'd create a graph projection with all the node labels and relationship types you want to run over, and then run the community detection algorithm on that. Select a large community, ensure that it contains all the node and relationship types you want, then train GraphSAGE on that sample.
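The selection step could look something like the sketch below, assuming the community ids have already been written back and read out as (node id, label, community id) rows. The function name and the sample rows are hypothetical, and the sketch only checks node labels; relationship types could be checked the same way from a list of relationships.

```python
from collections import defaultdict


def pick_training_community(rows, required_labels):
    """Pick the largest community that contains every required node label.

    rows: iterable of (node_id, label, community_id) tuples.
    Returns (community_id, member_node_ids), or (None, []) if no community
    covers all required labels.
    """
    members = defaultdict(list)
    labels_seen = defaultdict(set)
    for node_id, label, community_id in rows:
        members[community_id].append(node_id)
        labels_seen[community_id].add(label)

    required = set(required_labels)
    candidates = [c for c in members if required <= labels_seen[c]]
    if not candidates:
        return None, []
    best = max(candidates, key=lambda c: len(members[c]))
    return best, members[best]


# Hypothetical community detection output: community 0 is larger,
# but only community 1 contains both labels.
rows = [
    (1, "Person", 0),
    (2, "Person", 0),
    (3, "Person", 0),
    (4, "Person", 1),
    (5, "Instrument", 1),
]
community, sample = pick_training_community(rows, ["Person", "Instrument"])
```

If no single community covers every label, the function returns no sample, which is a signal to merge several communities or to sample in some other way.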