How to export trained GDSL models from Model Catalog to persistent storage?

emigre459 · February 10, 2021, 9:24pm

We would love to use some of the more advanced features from the Graph Data Science Library for doing things like training node embedding models using graphSAGE. However, right now it looks like there is a model catalog into which you can save the trained model, but it must be in-memory like the Graph Catalog, since restarting the database wipes it out.

So my question is: how do we get these trained models (especially inductive ones like graphSAGE that can be reused when new data are put into the graph) into persistent storage so restarting the database doesn't cause them to be lost?

clair.sullivan · February 12, 2021, 12:11am

One idea (that might not be as elegant as you were hoping for) would be to write your GraphSAGE embeddings as node properties within the named graph. Then you would use something like gds.beta.graph.export.csv() to write that graph to CSV.

Another thought is that if you are planning on using these embeddings in an outside program written in something like Python, you could write the embeddings as node properties in the full database and then query that with your connector/package of choice.

emigre459 · February 12, 2021, 3:11am

Thanks for the idea Clair! I'm not sure this gets at what I need however. Wouldn't exporting the embeddings just provide me with node-specific results? Ideally I'd want to archive a version of the inductive model itself so I could load it up (if I lost the model from the catalog) and use it for inference on different graphs, new nodes in the same graph, etc. (so presumably in the form of its weights and biases, much like saving out a checkpoint file from PyTorch works for example).

clair.sullivan · February 12, 2021, 2:28pm

Ah. My bad. I misunderstood your question.

There is alpha functionality in the Enterprise edition of GDS 1.5 that allows for storing models to disk and loading them back. You can read about it here. This will not allow you to, say, transfer that model computer-to-computer, but it will save you the time and frustration of having to completely redo your models in the catalog on a Neo4j restart.
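For readers who want to see roughly what that looks like from a client, here is a minimal sketch. It assumes the alpha-tier procedure names `gds.alpha.model.store` and `gds.alpha.model.load` as I recall them from the GDS 1.5 documentation; please check them against the docs for your installed version, since alpha procedures can be renamed between releases. The `run` callable, the helper names, and the model name are hypothetical, and a fake `run` is used so the snippet executes without a database.

```python
def store_model(run, model_name):
    """Ask GDS to write a model from the catalog to disk (assumed procedure name)."""
    return run("CALL gds.alpha.model.store($name)", {"name": model_name})


def load_model(run, model_name):
    """Ask GDS to load a previously stored model back into the catalog (assumed procedure name)."""
    return run("CALL gds.alpha.model.load($name)", {"name": model_name})


# Stand-in for a real driver session: it only records what would be sent.
calls = []


def fake_run(query, parameters):
    calls.append((query, parameters))
    return []


store_model(fake_run, "my-graphsage-model")  # hypothetical model name
load_model(fake_run, "my-graphsage-model")
```

With the official Python driver, `run` could be something like `lambda q, p: session.run(q, p)`. Keeping the query execution behind a callable makes it easy to swap in the real session later.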

I hope that helps!

emigre459 · February 13, 2021, 3:29pm

Thanks! I saw that announcement right after replying to you, LOL. Great minds think alike I guess! Given that we are still experimenting with resource needs for our Neo4j instance (and thus spinning up and down instances to host them on the cloud), this is unfortunately still not an ideal fit, but definitely helpful to know about.

How does one get access to GDSL Enterprise? Is it activated automatically when using Enterprise Edition of the database? And is there any literature that shows the difference between Community and Enterprise GDSL?

Thanks for your help again!

alicia_frame1 · February 13, 2021, 10:16pm

Hi @emigre459 !

The main differences between GDS community edition (CE) and enterprise edition (EE) are:

Maximum of 4 cores for computation in CE vs. unlimited in EE

Fine grained security support & integration in EE only

Model catalog only stores 1 model in CE vs. unlimited in EE

Model persistence and publishing (share models between users) in EE only

Low memory analytics graph format (up to 75% less heap consumption) in EE only

GDS Enterprise Edition is a separate product from the core database EE license, so having a Neo4j Enterprise license alone doesn't unlock the GDS EE features. Enterprise features are tagged in the docs with an "Enterprise" label, and also described here (although I noticed that hasn't been updated in several releases, so the list is incomplete - I've filed a card to fix that!)

If you want to send me a note at alicia.frame@neo4j.com, I'm happy to chat about access to a license :slight_smile:

lingvisa · October 8, 2021, 5:21am

'you could write the embeddings as node properties in the full database and then query that with your connector/package of choice.'

Hi @clair.sullivan: by 'the full database', did you mean the Neo4j DB or some external relational DB, e.g. MySQL? Adding embeddings to node properties significantly increases the size of the graph. Would this slow down the query performance of the graph? What about storing the embedding vectors in MySQL, so that a query can get results from Neo4j with embeddings from MySQL?

clair.sullivan · October 13, 2021, 3:00pm

Back when I wrote that I meant the Neo4j DB. But one thing to note here is that queries (traversals) do not access the properties themselves, such as your GraphSAGE embeddings if you have written them to the nodes. Instead, they work on an index for queries, so you don't have to get the node properties upon lookup. If you are filtering on properties, then you might experience a slowdown.

Another place you might get a slowdown is in reading that data out of the database into something else like, for example, Python. I didn't quite follow the idea of using MySQL to handle the embeddings, so perhaps you can clarify your idea a bit more?

I also do want to mention that the model catalog that I mentioned when we first discussed this is now out of the alpha level into the beta level. So you might want to check back in on it and see if it might help.

lingvisa · October 13, 2021, 6:30pm

@clair.sullivan Regarding using MySQL to handle embeddings, I mean storing the nodes' embeddings in MySQL, then storing each embedding's MySQL identifier in Neo4j as a node property, so downstream applications can pull a node's info primarily from Neo4j, plus one additional piece of info from MySQL. But if the embeddings need to be used in Neo4j at runtime, then they need to be read from MySQL and written back to Neo4j.
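To make that layout concrete, here is a minimal sketch of the split, assuming a relational table keyed by an embedding identifier that is also kept as a node property in Neo4j. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server; the table name, column names, and the `embedding_id` property are all hypothetical.

```python
import json
import sqlite3


def create_store(conn):
    """Create the (hypothetical) table that holds one vector per embedding id."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS node_embeddings ("
        "embedding_id TEXT PRIMARY KEY, vector TEXT NOT NULL)"
    )


def save_embedding(conn, embedding_id, vector):
    """Store a vector as JSON text. INSERT OR REPLACE is SQLite syntax;
    MySQL would need its own upsert form."""
    conn.execute(
        "INSERT OR REPLACE INTO node_embeddings (embedding_id, vector) VALUES (?, ?)",
        (embedding_id, json.dumps(list(vector))),
    )


def load_embedding(conn, embedding_id):
    """Return the stored vector, or None if the id is unknown."""
    row = conn.execute(
        "SELECT vector FROM node_embeddings WHERE embedding_id = ?",
        (embedding_id,),
    ).fetchone()
    return None if row is None else json.loads(row[0])


def enrich(node, conn):
    """Attach the externally stored vector to a node record fetched from Neo4j."""
    enriched = dict(node)
    enriched["embedding"] = load_embedding(conn, node["embedding_id"])
    return enriched


conn = sqlite3.connect(":memory:")
create_store(conn)
save_embedding(conn, "emb-42", [0.1, -0.3, 0.7])

# Pretend this dict came back from a Cypher query; only the id lives in the graph.
node_from_neo4j = {"name": "Alice", "embedding_id": "emb-42"}
result = enrich(node_from_neo4j, conn)
```

The cost of this design is the extra round trip per lookup, plus the write-back step whenever the embeddings are needed inside Neo4j itself.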

All those considerations are for scalability. Regarding embedding training, i.e. GraphSage, is there any plan to support single/multiple GPUs, or is there already one way to use GPUs that I am not aware of?

alicia_frame1 · October 13, 2021, 11:56pm

@lingvisa : you can export your embeddings using either stream mode or exporting them from the in memory graph (by way of one of our drivers), or you can write the node embeddings to csv with graph.export.csv . You can export node IDs along with the embeddings.

If you don't need your embeddings in SQL, you can also store them directly in Neo4j by using write mode with the embedding procedure call, and then querying the results with cypher individually. That will increase the size of your Neo4j database, but it should not slow down or impede query performance.
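As a rough illustration of the stream-mode route, the sketch below writes streamed embeddings to CSV on the client side. It assumes each streamed record exposes a `nodeId` and an `embedding` list; those field names and the `dim_i` column naming are assumptions, and the rows here are simulated so the snippet runs without a database.

```python
import csv
import io


def write_embeddings_csv(rows, out):
    """Write (nodeId, embedding) records to CSV, one embedding component per column."""
    writer = csv.writer(out)
    header_written = False
    for row in rows:
        embedding = list(row["embedding"])
        if not header_written:
            # Column count is taken from the first record's embedding length.
            writer.writerow(["nodeId"] + [f"dim_{i}" for i in range(len(embedding))])
            header_written = True
        writer.writerow([row["nodeId"]] + embedding)


# Simulated stream-mode output; with a driver these would be query result records.
rows = [
    {"nodeId": 0, "embedding": [0.5, 0.25]},
    {"nodeId": 1, "embedding": [-1.0, 2.0]},
]

buffer = io.StringIO()
write_embeddings_csv(rows, buffer)
csv_text = buffer.getvalue()
```

Because the function consumes rows one at a time, it can be fed a driver result directly without first collecting every embedding in memory.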

We do not support GPUs for GraphSAGE, and it's not currently on our roadmap. We do offer a concurrency parameter, however, that lets you take advantage of parallel processing with CPUs.

lingvisa · October 14, 2021, 12:13am

@alicia_frame1 Good to know that storing embeddings in Neo4j won't slow things down, and thanks for the export option. I will try the biggest graph I can train with GraphSAGE in 3 days on a single machine with 40 cores. I currently have a graph with 4 million nodes and 30 million relationships. Hopefully I can train it within 3 days.

alicia_frame1 · October 15, 2021, 10:42pm

For speed, you might want to train graphSAGE on a subgraph (eg. run community detection and choose a single large community) and then use the model to generate embeddings for the remaining nodes? It's not super fast for training on a huge graph.

GraphSAGE is also faster if you run it unweighted, and use the mean aggregator instead of pool.

For really big graphs, we usually recommend using FastRPExtended - like GraphSAGE, it can encode properties as well as graph structure, but it uses linear algebra to encode the embedding instead of sampling and aggregation.

lingvisa · October 15, 2021, 11:06pm

@alicia_frame1 For FastRPExtended, I checked that it can use node properties. However, does it only work for homogeneous graphs, as FastRP does? My graph has multiple node and relationship types.

As for training GraphSAGE on a subgraph, I have multiple node and relationship types, so do I need to first perform a proper sampling of all the nodes and relationships? If the training data can't cover all the node types and relationship types, the model won't be good. To train a community detection model, do I need to do a similar sampling of all the data?

alicia_frame1 · October 18, 2021, 10:42pm

For FastRPExtended, you'll need to add a feature property that encodes the labels, and also pad any missing properties when you create your graph projection.

For example, if you have People and Instrument nodes, when you project your graph, you'd want to create an IsPerson feature (0/1) and an IsInstrument feature - or just a labelEncoding property with dimension 2; and then if you have an age property for people - but not instruments, you'd want to set a defaultValue on loading your graph (probably to 0). GraphSAGE automatically does all this - in our next release, we'll be adding that extension for FastRP, but for now you have to do it manually.
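To show what the resulting feature vectors look like, here is a small plain-Python sketch of the label encoding and default-value padding described above. It is only an illustration of the idea, not how GDS does it internally; in practice the padding would be done with `defaultValue` in the graph projection. The `build_features` helper and the node dictionaries are hypothetical.

```python
def build_features(nodes, labels, property_keys, default_value=0.0):
    """Return {node id: feature vector} with a one-hot label encoding
    followed by the numeric properties, padded with default_value when missing."""
    features = {}
    for node in nodes:
        one_hot = [1.0 if label in node["labels"] else 0.0 for label in labels]
        props = [float(node["properties"].get(key, default_value)) for key in property_keys]
        features[node["id"]] = one_hot + props
    return features


nodes = [
    {"id": 1, "labels": {"Person"}, "properties": {"age": 34}},
    {"id": 2, "labels": {"Instrument"}, "properties": {}},  # no age property
]

features = build_features(nodes, labels=["Person", "Instrument"], property_keys=["age"])
# features[1] is [IsPerson, IsInstrument, age] for the Person node
```

Every node ends up with a vector of the same length, which is the point of the padding: the Instrument node gets the default of 0 for `age` instead of a missing value.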

If you want to sample your data using a community detection algorithm, I'd create a graph projection with all the node labels /relationship types you want to run over, and then run the community detection algorithm on that. Select a large community, ensure that it contains all the node / relationship types you want, then train graphSage on that sample.
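The "select a large community and check its coverage" step could be sketched on the client side roughly as follows. This assumes you have already pulled each node's community assignment and labels, plus the relationship list, out of the database; the `communityId` field name, the input shapes, and the `pick_training_community` helper are assumptions made for illustration.

```python
from collections import defaultdict


def pick_training_community(nodes, relationships, required_labels, required_rel_types):
    """Return the id of the largest community that contains every required node
    label and every required relationship type, or None if no community does."""
    members = defaultdict(set)
    labels_seen = defaultdict(set)
    community_of = {}
    for node in nodes:
        community = node["communityId"]
        members[community].add(node["id"])
        labels_seen[community].update(node["labels"])
        community_of[node["id"]] = community

    # Only count relationships whose two endpoints are in the same community.
    rel_types_seen = defaultdict(set)
    for source, rel_type, target in relationships:
        community = community_of.get(source)
        if community is not None and community == community_of.get(target):
            rel_types_seen[community].add(rel_type)

    candidates = [
        c for c in members
        if required_labels <= labels_seen[c] and required_rel_types <= rel_types_seen[c]
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: len(members[c]))


nodes = [
    {"id": 1, "labels": {"Person"}, "communityId": 0},
    {"id": 2, "labels": {"Instrument"}, "communityId": 0},
    {"id": 3, "labels": {"Person"}, "communityId": 0},
    {"id": 4, "labels": {"Person"}, "communityId": 1},
    {"id": 5, "labels": {"Person"}, "communityId": 1},
    {"id": 6, "labels": {"Person"}, "communityId": 1},
    {"id": 7, "labels": {"Person"}, "communityId": 1},
]
relationships = [
    (1, "PLAYS", 2), (1, "KNOWS", 3),
    (4, "KNOWS", 5), (5, "KNOWS", 6), (6, "KNOWS", 7),
]

chosen = pick_training_community(
    nodes, relationships, {"Person", "Instrument"}, {"PLAYS", "KNOWS"}
)
```

In this toy data community 1 is larger, but it has no Instrument nodes or PLAYS relationships, so community 0 is the one selected for training.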
