Hello Neo4j Community,
I'm facing a puzzling issue and could use your expertise to troubleshoot it. My dataset appears to upload into the Neo4j database successfully, yet no data actually shows up in it. I've followed the steps in the documentation carefully and tested the database with sample applications, all of which worked flawlessly. However, when I run my own preprocessing code, the data never seems to make its way into the database.
Problem:
After setting up a Neo4j Enterprise Evaluation in Docker on an Azure VM, I wrote code to preprocess and load a dataset into Neo4j. The upload runs without errors, but there's no sign of the data in the database.
Key Code Steps:
- Created project and data directories.
- Initialized a logger to track progress.
- Established connection with Neo4j database using the GraphDatabase driver.
- Read data from an Excel file into a Pandas DataFrame.
- Saved a subset of this DataFrame into a CSV file for testing purposes.
- Defined Cypher queries for creating relationships between events and perpetrators.
- Developed a function to execute these Cypher queries with error handling.
- Iterated through the DataFrame rows to create relationships in the Neo4j database.
- Logged the total number of loaded relationships.
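Before touching the database, I exercised the per-row parameter extraction from the steps above on a toy DataFrame (the column names mirror the GTD columns used below; the values in this tiny frame are invented purely for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the GTD columns the loader reads (values invented for illustration)
df = pd.DataFrame({
    "eventid": [1001, 1002],
    "gname": ["Group A", "Group B"],
    "claimed": [1.0, np.nan],
    "gname2": [np.nan, "Group C"],
    "claim2": [np.nan, 0.0],
    "gname3": [np.nan, np.nan],
    "claim3": [np.nan, np.nan],
})

params = []
for _, row in df.iterrows():
    # Primary perpetrator: a NaN 'claimed' becomes None, otherwise a bool
    params.append({
        "eid": row["eventid"],
        "claimed": bool(row["claimed"]) if not pd.isna(row["claimed"]) else None,
        "name_perpetrator": row["gname"],
    })
    # Secondary/tertiary perpetrators only when the gname column is populated
    for i in range(2, 4):
        if not pd.isna(row[f"gname{i}"]):
            params.append({
                "eid": row["eventid"],
                "claimed": bool(row[f"claim{i}"]) if not pd.isna(row[f"claim{i}"]) else None,
                "name_perpetrator": row[f"gname{i}"],
            })

print(len(params))  # 3 parameter dicts: two primaries plus one gname2
```

This at least confirms the NaN handling and the gname2/gname3 iteration produce the parameter dicts I expect, so I believe the problem lies on the database side.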
Code Snippet:
import pandas as pd
from neo4j import GraphDatabase
from loguru import logger
import os

# Define project and data directories
project_dir = os.getcwd()
data_dir = os.path.join(project_dir, "data")
os.makedirs(data_dir, exist_ok=True)

# Initialize logger
logger.add(os.path.join(data_dir, 'Perp_Event_Relation.log'))

# Log dataset loading
logger.info('Dataset loading')

# Read dataset into DataFrame
GTD_df = pd.read_excel(os.path.join(data_dir, "globalterrorismdb_0522dist.xlsx"))
GTD_df_small = GTD_df.head()  # Select only the first 5 rows

logger.info('The dataset is loaded')

# Save filtered DataFrame to CSV
csv_filename = os.path.join(data_dir, "filtered_data.csv")
GTD_df_small.to_csv(csv_filename, index=False)
logger.info(f'Filtered data saved to CSV: {csv_filename}')

# Define Neo4j connection parameters
uri = "neo4j://xxxxxx:7687"
database_name = "aml"

# Connect to Neo4j database
driver = GraphDatabase.driver(uri, auth=None)
logger.info('Connected to database')

# Relationship creation query: both nodes must already exist for MATCH to find them
query_create_rel_att = """
MATCH (e:Event {eventid: $eid}), (p:Perpetrator {name: $name_perpetrator})
MERGE (e)<-[r:IS_RESPONSIBLE_FOR]-(p)
ON CREATE SET r.Claimed = $claimed
"""

# Execute a query inside a transaction, logging (but not re-raising) errors
def execute_query(tx, query, params):
    try:
        tx.run(query, **params)
    except Exception as e:
        logger.error(f"Error executing query: {str(e)}")

# Initialize counters
loaded = 0
nn = 0

# Iterate over DataFrame rows and create relationships
with driver.session(database=database_name) as session:
    for ix, row in GTD_df_small.iterrows():
        nn += 1
        if nn % 1000 == 0:
            logger.info(f"{nn} Loaded")
        # Relationship for the primary perpetrator
        session.write_transaction(execute_query, query_create_rel_att, {
            "eid": row["eventid"],
            "claimed": bool(row["claimed"]) if not pd.isna(row["claimed"]) else None,
            "name_perpetrator": row["gname"]
        })
        # Relationships for secondary and tertiary perpetrators, if present
        for i in range(2, 4):
            gname_col = f"gname{i}"
            claim_col = f"claim{i}"
            if not pd.isna(row[gname_col]):
                session.write_transaction(execute_query, query_create_rel_att, {
                    "eid": row["eventid"],
                    "claimed": bool(row[claim_col]) if not pd.isna(row[claim_col]) else None,
                    "name_perpetrator": row[gname_col]
                })

# Log total rows processed
logger.info('Total loaded: ' + str(nn))
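One detail I only noticed on re-reading: execute_query logs errors but never re-raises them, so a failing query would disappear silently into the log file. A stand-alone sketch with a fake transaction (the FakeTx class below is invented here just for illustration; no real driver is involved) shows the behavior:

```python
# Stand-alone check: execute_query swallows errors instead of re-raising them.
# FakeTx is an invented stand-in for a neo4j transaction object.
class FakeTx:
    def run(self, query, **params):
        raise RuntimeError("simulated Cypher failure")

errors = []

def execute_query(tx, query, params):
    try:
        tx.run(query, **params)
    except Exception as e:
        errors.append(str(e))  # stand-in for logger.error

execute_query(FakeTx(), "MATCH (n) RETURN n", {})
print(errors)  # the error was captured; nothing propagated to the caller
```

So if anything is going wrong inside the transactions, the loop would still finish "successfully", which may explain why the upload appears to work.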
Testing Dataset:
To facilitate diagnostics, I've prepared a small dataset containing just the first 5 rows. You can access it here: first_5_rows.xlsx
Any insights or suggestions you might have to resolve this issue would be much appreciated. Thank you for your assistance.
Thanks,
Ayoola