I am having trouble using Neo4j and the Graph Data Science (GDS) library to analyze contextual text similarity. The goal is to cluster contextually related records based on textual node properties and their relationships.
Problem Context:
Graph Schema:
- Nodes:
  - Records (e.g., `rec-1`), Titles (e.g., `Database Index Optimization Tips`), Links (e.g., `https://stackoverflow.com/q/db-indexes`), and Apps (e.g., `Google Chrome`).
  - Nodes have textual properties such as `name` or `url`.
- Relationships:
  - Types include `USES`, `HAS_TITLE`, `HAS_LINK`, and `NEXT`.
  - Relationships have weights (e.g., `USES: 0.2`, `HAS_TITLE: 0.3`, `HAS_LINK: 0.5`).
Objective:
- Detect and group similar records into sequences using relationships and node properties.
Current Approach:
- Graph Projection: Nodes and relationships projected with GDS.
- Embedding: FastRP used to create embeddings.
- Similarity: kNN algorithm applied for similarity calculations.
Challenges:
- Textual Data Support:
  - GDS algorithms (e.g., FastRP, kNN) cannot natively process textual properties for similarity.
  - Example: Titles like `"Database Index Optimization Tips"` and `"Database Index Discussion"` are treated as dissimilar despite high contextual similarity.
- Embedding Limitations:
  - Numeric embeddings (e.g., from FastRP) do not account for semantic similarities in textual properties.
- Relationship Weights:
  - While weights (`USES: 0.2`, `HAS_TITLE: 0.3`, `HAS_LINK: 0.5`) are considered, they alone cannot bridge the gap caused by textual dissimilarity.
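To make the textual gap concrete, here is a minimal sketch of the kind of score I would expect for the two example titles. It uses a plain bag-of-words cosine similarity as a stand-in for a real NLP embedding model (an assumption on my part, chosen only because it runs without extra dependencies); the helper names `tokenize` and `cosine_similarity` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lower-case the text and count its alphanumeric word tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Titles taken from the example above; "Google Chrome" is the App name.
title_1 = "Database Index Optimization Tips"
title_2 = "Database Index Discussion"
unrelated = "Google Chrome"

score_related = cosine_similarity(tokenize(title_1), tokenize(title_2))
score_unrelated = cosine_similarity(tokenize(title_1), tokenize(unrelated))
print(f"related: {score_related:.3f}, unrelated: {score_unrelated:.3f}")
```

Word overlap only catches shared tokens; a sentence-embedding model would presumably also relate titles that share no words, which is the behaviour I am actually after.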
Questions:
1. Textual Property Handling:
   - Is there a way to directly incorporate textual similarity metrics (e.g., cosine similarity of node property embeddings) into Neo4j GDS workflows?
   - Are there plans to include native NLP support or semantic similarity in Neo4j for such use cases?
2. Workarounds:
   - How can external embeddings (e.g., from NLP models) be integrated into Neo4j, and can they be utilized effectively in GDS pipelines?
3. Algorithm Adaptation:
   - Are there recommended custom similarity metrics that combine textual similarity with relationship-based weights?
   - Can existing algorithms (e.g., kNN, Louvain) be configured to handle textual and relationship data simultaneously?
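To illustrate what I mean by a combined metric in question 3, here is a rough sketch that blends a precomputed text similarity with a weighted Jaccard overlap of each record's weighted neighbours. Everything beyond the weights and `rec-1` is an assumption for illustration: `rec-2`, `title-2`, the `alpha` blending factor, and the text score of 0.58 are all hypothetical.

```python
def weighted_jaccard(a: dict[str, float], b: dict[str, float]) -> float:
    """Weighted Jaccard: sum of per-key minimum weights over sum of maximum weights."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    denominator = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return numerator / denominator if denominator else 0.0


def combined_similarity(text_sim: float, structural_sim: float, alpha: float = 0.5) -> float:
    """Linear blend: alpha weights the text score, (1 - alpha) the structural score."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    return alpha * text_sim + (1 - alpha) * structural_sim


# Neighbour id -> relationship weight. rec-1 follows the example data;
# rec-2 is a hypothetical record sharing the same link and app.
rec_1 = {"title-1": 0.3, "link-1": 0.5, "app-1": 0.2}
rec_2 = {"title-2": 0.3, "link-1": 0.5, "app-1": 0.2}

structural = weighted_jaccard(rec_1, rec_2)
text_sim = 0.58  # hypothetical text similarity between the two titles
score = combined_similarity(text_sim, structural, alpha=0.5)
print(f"structural={structural:.3f}, combined={score:.3f}")
```

A linear blend is only one possible design; I do not know whether GDS can express something like this directly, which is part of the question.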
Example Graph Data:
Nodes:
```json
[
  {"id": "rec-1", "name": "Design Document Overview", "type": "Record"},
  {"id": "title-1", "name": "Database Index Optimization Tips", "type": "Title"},
  {"id": "link-1", "url": "https://stackoverflow.com/q/db-indexes", "type": "Link"},
  {"id": "app-1", "name": "Google Chrome", "type": "App"}
]
```
Relationships:
```json
[
  {"source": "rec-1", "target": "title-1", "type": "HAS_TITLE", "weight": 0.3},
  {"source": "rec-1", "target": "link-1", "type": "HAS_LINK", "weight": 0.5},
  {"source": "rec-1", "target": "app-1", "type": "USES", "weight": 0.2}
]
```
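For reference, this is roughly how the node data could be turned into batched Cypher statements. It is a sketch under assumptions: it treats the `type` field as the node label, builds one `UNWIND ... MERGE` statement per label (since the label is interpolated rather than passed as a parameter), and only prepares the statements without sending them to a database. The names `ALLOWED_LABELS` and `merge_query` are hypothetical.

```python
import json
from collections import defaultdict

nodes_json = """
[
  {"id": "rec-1", "name": "Design Document Overview", "type": "Record"},
  {"id": "title-1", "name": "Database Index Optimization Tips", "type": "Title"},
  {"id": "link-1", "url": "https://stackoverflow.com/q/db-indexes", "type": "Link"},
  {"id": "app-1", "name": "Google Chrome", "type": "App"}
]
"""
nodes = json.loads(nodes_json)

# Group property maps by label; "type" becomes the label, not a property.
rows_by_label = defaultdict(list)
for node in nodes:
    props = {key: value for key, value in node.items() if key != "type"}
    rows_by_label[node["type"]].append(props)

# Labels are interpolated into the query text, so restrict them to a known set.
ALLOWED_LABELS = {"Record", "Title", "Link", "App"}


def merge_query(label: str) -> str:
    """Build a batched MERGE statement for one label (assumed query shape)."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label}")
    return f"UNWIND $rows AS row MERGE (n:{label} {{id: row.id}}) SET n += row"


# (query, parameters) pairs that a driver session could execute one by one.
statements = [(merge_query(label), {"rows": rows}) for label, rows in rows_by_label.items()]
for query, params in statements:
    print(query, params)
```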
Environment:
- Neo4j Version: 5.x
- GDS Version: Latest
- Data Volume: 200 nodes, 600 relationships
- Use Case: Grouping and similarity analysis for nodes with textual and relational data.
Goal:
To group contextually related records into sequences and attach these sequences to tasks based on their similarity. The similarity should consider both textual properties and relationship types/weights.
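As a rough picture of the grouping step, the sketch below links records whose pairwise similarity meets a threshold and returns the resulting connected components. It is a plain union-find stand-in, not a GDS algorithm, and it does not order records within a group. The records `rec-2` to `rec-4`, the scores, and the threshold of 0.6 are hypothetical.

```python
def group_by_threshold(ids, pair_scores, threshold):
    """Group ids into connected components over pairs scoring >= threshold."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two components

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical combined similarity scores between record pairs.
record_ids = ["rec-1", "rec-2", "rec-3", "rec-4"]
pair_scores = {
    ("rec-1", "rec-2"): 0.82,
    ("rec-2", "rec-3"): 0.75,
    ("rec-3", "rec-4"): 0.10,
}

groups = group_by_threshold(record_ids, pair_scores, threshold=0.6)
print(groups)
```

With these made-up scores, `rec-1`, `rec-2`, and `rec-3` end up in one group even though `rec-1` and `rec-3` have no direct score, because both are linked through `rec-2`.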
Request:
- Guidance on how to best handle this scenario in Neo4j.
- Recommendations for incorporating textual similarity and relationship data effectively within GDS workflows.
- Suggestions for enhancing existing workflows to include semantic text processing.
Thank you for your assistance!