Data after loading csv

Hi everyone, this my first time in neo4j. I'm in struggle with something after I LOAD data in neo4j (I'm using Neo4j desktop). Because I have to show the power of neo4j at my client, I'm training my self on cypher checking the results by SQL.

I have 3 nodes all linked each others: (:DSO)--(:OPT)--(:OPTION)--(:DSO)
I'm using this cypher query to find a certain number of DSO

< // Number of ORDER without a specific OPTION
Match (option:OPTION)--(d:DSO)--(opt:OPT)
where (option.option_cd<> '094084680')
return opt.opt_cd , count(distinct d.dso_cd) >

But the problem is the follow After i execute the query above I get a different findings compared with the result of the SQL QUERY.
Seems like it creates a relationship with are not related to the rows of the csv I loaded , Any suggestions ? could be the structure of my nodes, the cypher queries I wrote or something else ?

Thank you in advance

First, the pattern in your query does not match the relationships you so in your explanation. That being said, the only what to help is if you provide the data, as I can't tell if there is an error or not from what you provided?

To be more clear, the pattern of relationship is the follow:

Screenshot 2024-05-24 190635

Maybe I can share you a sample of the data I'm using for the training

What would be helpful, as well as what you expect as the result.

1 Like

One thing I can thing of is that your query is not specifying the direction, so you are going to get matches in both directions, which should inflate your counts.

Try this:

Match (option:OPTION)<-[:HAS_OPTION]-(d:DSO)-[:HAS_OPT]->(opt:OPT)
where option.option_cd <> '094084680'
return opt.opt_cd, count(distinct d.dso_cd)
1 Like

Thank you for the cypher you provide me, but I got the same results.
I'll give the data I'm using ( but I don't know how can i share the csv here, can you give some tips to share it?) and also I share with you the result that I got by SQL and by Cypher
On the left the SQL RESULTS and on the right the CYPHER RESULTS

I don't know why with the same source of date in the cypher query each "OPT" has one more "DSO" than SQL query.

You should be able to upload a csv or text file.

Apparently, since I am a new user, I cannot attach txt, or csv.
Therefore I put the following drive link, where you can find two file(txt and csv) of the data I'm using.

https://drive.google.com/drive/folders/1_s7-74Q1eEdMm90yfyzie7_wQFypJqpW

Ok. I was able to reproduce the issue using MySqlWorkbench to investigate the difference. It turns out if you import the data using merge (as below), it will create triplets that are not in the original data. This results in more rows in the neo4j database when the rows are joined and exported.

load csv with headers from "file:///sample_data.csv" as row
merge(option:OPTION{option_cd:row.OPTION})
merge(opt:OPT{opt_cd:row.OPT})
merge(dso:DSO{dso_cd:row.DEALER_SALES_ORDER})
merge(option)<-[:HAS_OPTION]-(dso)-[:HAS_OPT]->(opt)

When the data is joined (as below), there are 633 rows, while the original data file only had 413.

Match (option:OPTION)<-[:HAS_OPTION]-(dso:DSO)-[:HAS_OPT]->(opt:OPT)
return opt.opt_cd, option.option_cd, dso.dso_cd

I imported your data and an export from above query into a MySql database. I then looked at the results of an outer join of the neo4j data to your original data. It showed there were rows in the neo4j data that were not in yours. These obviously changed the results of your aggregation query.

I then changed the neo4j import query to use create instead of merge. The thought was this would create exactly the correct triplet pairs in the database.

load csv with headers from "file:///sample_data.csv" as row
create(option:OPTION{option_cd:row.OPTION})
create(opt:OPT{opt_cd:row.OPT})
create(dso:DSO{dso_cd:row.DEALER_SALES_ORDER})
create(option)<-[:HAS_OPTION]-(dso)-[:HAS_OPT]->(opt)

When I did this and run the aggregation query, I got the correct counts.

What I believe is happening is that the merge is creating the correct relationships between pairs of codes (opt_cd to dso_cd and option_cd to dso_cd), but when joined using the match pattern across all three entities, it created relationships between opt_cd and option_cd that do not exist in your original data. This is not a defect in the queries. I believe it is a defect in your data model. It assumes if there is a relationship between a specific dso_cd and option_cd, and there is a relationship between the same dso_cd and an opt_cd, then there is a relationship between the corresponding option_cd's and opt_cd's, which is obviously not true.

Typically we create nodes in a graph database that correspond to domain entities, so we can investigate complex relationships between these entities. A graph database is great when these relationships result in a network of deep relationships of varies lengths. These scenarios are not handled well by a relational database. In my option, if the entity relationships are only one level deep, as as in one-to-many, many-to-one, or many-to-many, then a relational database is a good choice.

Is there a reason you chose to create these three entities?

Thank you so much for you help , it was useful use you method , using CREATE instead of MERGE, now it works.
But if I use CREATE I don't obtain a "good" results showing the nodes by graph.
Now I don't why MERGE Create data and relationships which don't exist in my csv, but it more I'd like that MERGE statement would work even for my data, because I need to also show the distinct value.

I'm also training on this data set, and I agree with you that this kind or query are more simple and efficient on SQL, but this my data are only the small part of the all data I'm using , I Can say that is The deepest part , and so far i'm using only it.

Thank you very much again, But I'm not sure, and I didn't understand why MERGE has a problem, could be very import to me to understand that

I thought another thing, If I have three different Table as a sources:
1 Table : OPT with opt_cd and opt_descriprion
2 Table : OPTION with option_cd and option_description
3 Table : with dso , opt_cd and option_Cd

In this case could be work?

I will provide example to demonstrate what has occurred. Below is a list of triplets that were in the Neo4j data, but not in your original data set.

The following is a query from your data for rows that have the dso_cd shown in the first row of the erroneous data in the above screenshot.

As you can see, your data has two rows with that dso_cd. When the first one is added with a 'merge', the opt_cd 'CECM' is associated with the dso_cd. When the second line is added with a 'merge', the option_cd '094084680' is associated with the dso_cd. Now when the cypher query below is executed, it will create a row with dso_cd '100006280170001247', opt_cd 'CECM', and option_cd '094084680'. This is a triplet that is not in your original data, explaining the incorrect aggregation counts.

Your example is a many-to-many relationship between option_cd and opt_cd. This would work.

let me be straight , I understood what happens, but I don't understand why this happens. to explain better, why during The MERGE , it add new values which are not in my csv?

If I use CREATE I will have the same numbers of nodes, for each node created which means, taking this case 413 nodes * 3. According to This sample, this method works but If i have a big amount of data is not convenient.

I need the distinct values for each node and not Duplicates.
In this case I have two Distinct OPT, I need to have in Knowledge graph, only two OPTs, then link those two OPTs to another two nodes DSO and OPTION ( where also those nodes must have unique values and not duplicates).

Sorry to remark again this topic, but your solution provided is good, but I also need a good view .
this one:


I don't have the view above by CREATE, but only by MERGE.

this is why i want to understand this strange behaviour of MERGE with my data.

What is happening is expected behavior. A 'merge' looks to match on an existing node. If found, it will use it. If not, it will create a new one. Either way, you end up with a node. 'Create' on the other hand always creates a new node.

I have taken the two rows from the previous post of triplets found in your data set that have the same dso_cd. You get the following structure when using merge:

unwind [
{
    dso_cd: '100006280170001247',
    opt_cd: 'CECM',
    option_cd: '094084683'
},
{
    dso_cd: '100006280170001247',
    opt_cd: 'CICM',
    option_cd: '094084680'
}
] as row
merge(option:OPTION{option_cd:row.option_cd})
merge(opt:OPT{opt_cd:row.opt_cd})
merge(dso:DSO{dso_cd:row.dso_cd})
merge(option)<-[:HAS_OPTION]-(dso)-[:HAS_OPT]->(opt)
return *

As you can see from the above returned result, the two sets of triplets share the same DSO node. This is because the merge on the second line matched the existing DSO node. This is expected behavior.

The side effect of this is then when you executing your query to find all combinations of OPTION, DSO, and OPT nodes that are related, you get four combinations from the above structure, instead of the two that where in your original data that was used to create the relationships. You can see this in the below query. Two of the rows represent triplets that are not in your original data.

When you repeated the exercise using 'create' instead, you end up with only two rows because the DSO node is not shared.

unwind [
{
    dso_cd: '100006280170001247',
    opt_cd: 'CECM',
    option_cd: '094084683'
},
{
    dso_cd: '100006280170001247',
    opt_cd: 'CICM',
    option_cd: '094084680'
}
] as row
create(option:OPTION{option_cd:row.option_cd})
create(opt:OPT{opt_cd:row.opt_cd})
create(dso:DSO{dso_cd:row.dso_cd})
create(option)<-[:HAS_OPTION]-(dso)-[:HAS_OPT]->(opt)
return *

Now there are only two combinations of triplets that match the pattern, resulting in two rows that were in your original data.

The issue is with the data model, which assumes that if an OPTION node is associated with a DSO node and an OPT node is associated with the same DSO node, then the OPTION and OPT nodes are related. This is not true when relating nodes by their option_cd, dso_cd, and opt_cd values.