Dataset Uploads Successfully but No Data Appears in Database

:mag: Hello Neo4j Community,

I'm facing a puzzling issue and could use your expertise to troubleshoot it. My dataset appears to upload to the Neo4j database successfully, yet no data actually shows up in it. I've diligently followed the steps outlined in the documentation and tested the database with sample applications, all of which worked flawlessly. However, when I run my preprocessing code, the data doesn't seem to make its way into the database.

Problem:
After setting up a Neo4j Enterprise Evaluation in Docker on an Azure VM, I wrote code to preprocess and load a dataset into Neo4j. Although my dataset uploads without errors, there's no sign of the data in the database.

Key Code Steps:

  1. Created project and data directories.
  2. Initialized a logger to track progress.
  3. Established connection with Neo4j database using the GraphDatabase driver.
  4. Read data from an Excel file into a Pandas DataFrame.
  5. Saved a subset of this DataFrame into a CSV file for testing purposes.
  6. Defined Cypher queries for creating relationships between events and perpetrators.
  7. Developed a function to execute these Cypher queries with error handling.
  8. Iterated through the DataFrame rows to create relationships in the Neo4j database.
  9. Logged the total number of loaded relationships.

Code Snippet:

import pandas as pd
from neo4j import GraphDatabase
from loguru import logger
import os
from pathlib import Path

# Define project and data directories
project_dir = os.getcwd()
data_dir = os.path.join(project_dir, "data")

# Initialize logger
logger.add(os.path.join(data_dir, 'Perp_Event_Relation.log'))

# Log dataset loading
logger.info('Dataset loading')

# Read dataset into DataFrame
GTD_df = pd.read_excel(os.path.join(data_dir, "globalterrorismdb_0522dist.xlsx"))
GTD_df_small = GTD_df.head()  # Select only the first 5 rows

# Log dataset loaded
logger.info('The dataset is loaded')

# Save filtered DataFrame to CSV
csv_filename = os.path.join(data_dir, "filtered_data.csv")
GTD_df_small.to_csv(csv_filename, index=False)
logger.info(f'Filtered data saved to CSV: {csv_filename}')

# Define Neo4j connection parameters
uri = "neo4j://xxxxxx:7687"
database_name = "aml"

# Connect to Neo4j database
driver = GraphDatabase.driver(uri, auth=None)
logger.info('Connected to database')

# Define relationship creation query
query_create_rel_att = """
MATCH (e:Event {eventid: $eid}), (p:Perpetrator {name: $name_perpetrator})
MERGE (e)<-[r:IS_RESPONSIBLE_FOR]-(p)
ON CREATE SET r.Claimed = $claimed
"""

# Define function to execute query with error handling
def execute_query(tx, query, params):
    try:
        tx.run(query, **params)
    except Exception as e:
        logger.error(f"Error executing query: {str(e)}")

# Initialize counters
loaded = 0
nn = 0

# Iterate over DataFrame rows and create relationships
with driver.session(database=database_name) as session:
    for ix, row in GTD_df_small.iterrows():
        nn += 1
        if nn % 1000 == 0:
            logger.info(f"{nn} Loaded")

        # Execute main relationship creation query
        session.write_transaction(execute_query, query_create_rel_att, {
            "eid": row["eventid"],
            "claimed": bool(row["claimed"]) if not pd.isna(row["claimed"]) else None,
            "name_perpetrator": row["gname"]
        })

        # Check and execute additional relationship creation queries
        for i in range(2, 4):
            gname_col = f"gname{i}"
            claim_col = f"claim{i}"
            if not pd.isna(row[gname_col]):
                session.write_transaction(execute_query, query_create_rel_att, {
                    "eid": row["eventid"],
                    "claimed": bool(row[claim_col]) if not pd.isna(row[claim_col]) else None,
                    "name_perpetrator": row[gname_col]
                })

    # Log total loaded
    logger.info('Total loaded: ' + str(nn))

Testing Dataset:
To facilitate diagnostics, I've prepared a small dataset containing just the first 5 rows, attached here: first_5_rows.xlsx

I'm eager to receive any insights or suggestions you might have to resolve this issue. Thank you immensely for your assistance.

Thanks,
Ayoola

Your query is matching on the two nodes you want to relate. Did you create these before in another script?
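If you're not sure, a quick count will show whether the nodes that MATCH depends on exist at all. This is a hypothetical diagnostic, assuming the same driver and database name as in your snippet:

```python
# Hypothetical diagnostic: if either count is 0, the MATCH in the
# relationship query can never find a pair of nodes to connect,
# so the MERGE on the relationship silently creates nothing.
count_events = "MATCH (e:Event) RETURN count(e) AS n"
count_perps = "MATCH (p:Perpetrator) RETURN count(p) AS n"

# Assuming the driver and database name from the snippet above:
# with driver.session(database="aml") as session:
#     print(session.run(count_events).single()["n"])
#     print(session.run(count_perps).single()["n"])
```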

Can you try consuming your tx.run() results, e.g. with consume()?

Otherwise the tx might not be concluded.
You can also use the newer and simpler driver API: driver.execute_query(query, params, database_=database)
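Putting both suggestions together, a minimal sketch (assuming the neo4j 5.x Python driver) might look like:

```python
# Sketch assuming the neo4j 5.x Python driver API.
# Consuming the result drains the stream so the transaction function
# finishes its work, and the summary counters tell you whether
# anything was actually written.
def create_relationship(tx, query, params):
    result = tx.run(query, **params)
    summary = result.consume()  # conclude the work before the tx commits
    return summary.counters.relationships_created

# The newer one-call API manages sessions and retries for you:
# records, summary, keys = driver.execute_query(
#     query_create_rel_att,
#     {"eid": ..., "claimed": ..., "name_perpetrator": ...},
#     database_="aml",
# )
# print(summary.counters.relationships_created)
```

Logging the returned counter per row would have made it obvious that zero relationships were being created.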

No, I believe they should be created automatically.

If you want them to be created in this same query, then you need to change “match” to “create” or “merge”. I generally like merge because it allows you to run the same query multiple times and still have just one instance of the node. Otherwise a new duplicate node will get created each time, or an exception raised if you have a uniqueness constraint.
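Applied to the query in the original post, that change might look like this (a sketch, not tested against your data):

```python
# Sketch: MERGE the nodes as well as the relationship, so missing
# Event/Perpetrator nodes are created on first use instead of
# causing the MATCH to silently find nothing.
query_create_rel_att = """
MERGE (e:Event {eventid: $eid})
MERGE (p:Perpetrator {name: $name_perpetrator})
MERGE (e)<-[r:IS_RESPONSIBLE_FOR]-(p)
ON CREATE SET r.Claimed = $claimed
"""
```

Note that ON CREATE SET here applies to the relationship MERGE immediately above it, so r.Claimed is only set when the relationship is first created.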

Thank you for the information. I managed to upload successfully. I appreciate your help.

Thanks, I am just beginning to get the hang of cypher. Appreciate your help.
