Dataset Uploads Successfully but No Data Appears in Database

:mag: Hello Neo4j Community,

I'm facing a puzzling issue and could use your expertise to troubleshoot it. My dataset appears to upload to the Neo4j database successfully, yet no data actually shows up in it. I've diligently followed the steps outlined in the documentation and tested the database with sample applications, all of which worked flawlessly. However, when I run my preprocessing code, the data doesn't seem to make its way into the database.

Problem:
After setting up a Neo4j Enterprise Evaluation in Docker on an Azure VM, I wrote code to preprocess and load a dataset into Neo4j. Although my dataset uploads without errors, there's no sign of the data in the database.

Key Code Steps:

  1. Created project and data directories.
  2. Initialized a logger to track progress.
  3. Established connection with Neo4j database using the GraphDatabase driver.
  4. Read data from an Excel file into a Pandas DataFrame.
  5. Saved a subset of this DataFrame into a CSV file for testing purposes.
  6. Defined Cypher queries for creating relationships between events and perpetrators.
  7. Developed a function to execute these Cypher queries with error handling.
  8. Iterated through the DataFrame rows to create relationships in the Neo4j database.
  9. Logged the total number of loaded relationships.

Code Snippet:

import pandas as pd
from neo4j import GraphDatabase
from loguru import logger
import os
from pathlib import Path

# Define project and data directories
project_dir = os.getcwd()
data_dir = os.path.join(project_dir, "data")

# Initialize logger
logger.add(os.path.join(data_dir, 'Perp_Event_Relation.log'))

# Log dataset loading
logger.info('Dataset loading')

# Read dataset into DataFrame
GTD_df = pd.read_excel(os.path.join(data_dir, "globalterrorismdb_0522dist.xlsx"))
GTD_df_small = GTD_df.head()  # Select only the first 5 rows

# Log dataset loaded
logger.info('The dataset is loaded')

# Save filtered DataFrame to CSV
csv_filename = os.path.join(data_dir, "filtered_data.csv")
GTD_df_small.to_csv(csv_filename, index=False)
logger.info(f'Filtered data saved to CSV: {csv_filename}')

# Define Neo4j connection parameters
uri = "neo4j://xxxxxx:7687"
database_name = "aml"

# Connect to Neo4j database
driver = GraphDatabase.driver(uri, auth=None)
logger.info('Connected to database')

# Define relationship creation query
query_create_rel_att = """
MATCH (e:Event {eventid: $eid}), (p:Perpetrator {name: $name_perpetrator})
MERGE (e)<-[r:IS_RESPONSIBLE_FOR]-(p)
ON CREATE SET r.Claimed = $claimed
"""

# Define function to execute query with error handling
def execute_query(tx, query, params):
    try:
        tx.run(query, **params)
    except Exception as e:
        logger.error(f"Error executing query: {str(e)}")

# Initialize counters
loaded = 0
nn = 0

# Iterate over DataFrame rows and create relationships
with driver.session(database=database_name) as session:
    for ix, row in GTD_df_small.iterrows():
        nn += 1
        if nn % 1000 == 0:
            logger.info(f"{nn} Loaded")

        # Execute main relationship creation query
        session.write_transaction(execute_query, query_create_rel_att, {
            "eid": row["eventid"],
            "claimed": bool(row["claimed"]) if not pd.isna(row["claimed"]) else None,
            "name_perpetrator": row["gname"]
        })

        # Check and execute additional relationship creation queries
        for i in range(2, 4):
            gname_col = f"gname{i}"
            claim_col = f"claim{i}"
            if not pd.isna(row[gname_col]):
                session.write_transaction(execute_query, query_create_rel_att, {
                    "eid": row["eventid"],
                    "claimed": bool(row[claim_col]) if not pd.isna(row[claim_col]) else None,
                    "name_perpetrator": row[gname_col]
                })

    # Log total loaded
    logger.info('Total loaded: ' + str(nn))

Testing Dataset:
To facilitate diagnostics, I've prepared a small dataset containing just the first 5 rows, attached here: first_5_rows.xlsx

I'm eager to receive any insights or suggestions you might have to resolve this issue. Thank you immensely for your assistance.

Thanks,
Ayoola

Your query is matching on the two nodes you want to relate. Did you create these before in another script?
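If you're not sure, a quick count will show whether the nodes that MATCH depends on exist at all. This is a hypothetical diagnostic, assuming the same driver and database name as in your snippet:

```python
# Hypothetical diagnostic: if either count is 0, the MATCH in the
# relationship query can never find a pair of nodes to connect,
# so the MERGE on the relationship silently creates nothing.
count_events = "MATCH (e:Event) RETURN count(e) AS n"
count_perps = "MATCH (p:Perpetrator) RETURN count(p) AS n"

# Assuming the driver and database name from the snippet above:
# with driver.session(database="aml") as session:
#     print(session.run(count_events).single()["n"])
#     print(session.run(count_perps).single()["n"])
```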

Can you try consuming your tx.run() results, e.g. with consume()?

Otherwise the tx might not be concluded.
You can also use the newer and simpler driver API: driver.execute_query(query, params, database_=database)
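Putting both suggestions together, a minimal sketch (assuming the neo4j 5.x Python driver) might look like:

```python
# Sketch assuming the neo4j 5.x Python driver API.
# Consuming the result drains the stream so the transaction function
# finishes its work, and the summary counters tell you whether
# anything was actually written.
def create_relationship(tx, query, params):
    result = tx.run(query, **params)
    summary = result.consume()  # conclude the work before the tx commits
    return summary.counters.relationships_created

# The newer one-call API manages sessions and retries for you:
# records, summary, keys = driver.execute_query(
#     query_create_rel_att,
#     {"eid": ..., "claimed": ..., "name_perpetrator": ...},
#     database_="aml",
# )
# print(summary.counters.relationships_created)
```

Logging the returned counter per row would have made it obvious that zero relationships were being created.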

No, I believe they should be created automatically.

If you want them to be created in this same query, then you need to change “match” to “create” or “merge”. I generally like merge because it allows you to run the same query multiple times and still have just one instance of the node. Otherwise a new duplicate node will get created each time, or an exception raised if you have a uniqueness constraint.
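Applied to the query in the original post, that change might look like this (a sketch, not tested against your data):

```python
# Sketch: MERGE the nodes as well as the relationship, so missing
# Event/Perpetrator nodes are created on first use instead of
# causing the MATCH to silently find nothing.
query_create_rel_att = """
MERGE (e:Event {eventid: $eid})
MERGE (p:Perpetrator {name: $name_perpetrator})
MERGE (e)<-[r:IS_RESPONSIBLE_FOR]-(p)
ON CREATE SET r.Claimed = $claimed
"""
```

Note that ON CREATE SET here applies to the relationship MERGE immediately above it, so r.Claimed is only set when the relationship is first created.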

Thank you for the information. I managed to upload successfully. I appreciate your help.

Thanks, I am just beginning to get the hang of cypher. Appreciate your help.
