Load CSV Issue in Neo4j Desktp

Hello guys,

Today I am loading a sample database from Kaggle. My intention is to follow the twitch series "Building Web Applications with Neo4j and Typescript" (@adam_cowley). So, I forked the project from Github, downloaded data from Kaggle, installed my Neo4j Desktop. All good, all set.

However, when loading the 1st dataset into neo4j, using the following commands

:auto
USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM 'file:///movies_metadata.csv' AS row
MERGE (m:Movie {id: toInteger(row.id)})
SET m += row {
    .overview,
    .imdb_id,
    .title,
    .poster_path,
    .backdrop_path,
    .original_title,
    .original_language,
    .tagline,
    .status,
    .homepage,
    runtime: toFloat(row.runtime),
    release_date: date(row.release_date),
    revenue: toFloat(row.revenue),
    popularity: toFloat(row.popularity),
    average_vote: toFloat(row.vote_average),
    vote_count: toInteger(row.vote_count),
    budget: toInteger(row.budget)
}

FOREACH (_ IN CASE WHEN row.original_language IS NOT NULL THEN [1] ELSE [] END |
    MERGE (l:Language {id: row.original_language})
    MERGE (m)-[:ORIGINAL_LANGUAGE]->(l)
)

FOREACH (_ IN CASE WHEN row.video = 'True' THEN [1] ELSE [] END | SET m:Video )
FOREACH (_ IN CASE WHEN row.adult = 'True' THEN [1] ELSE [] END | SET m:Adult )

FOREACH (language IN apoc.convert.fromJsonList(row.spoken_languages) |
    MERGE (l:Language {id: language.iso_639_1}) ON CREATE SET l.name = language.name
    MERGE (m)-[:SPOKEN_IN_LANGUAGE]->(l)
)

FOREACH (country IN apoc.convert.fromJsonList(row.production_countries) |
    MERGE (c:Country {id: country.iso_3166_1}) ON CREATE SET c.name = country.name
    MERGE (m)-[:PRODUCED_IN_COUNTRY]->(c)
)

FOREACH (genre IN apoc.convert.fromJsonList(row.genres) |
    MERGE (g:Genre {id: genre.id}) ON CREATE SET g.name = genre.name
    MERGE (m)-[:IN_GENRE]->(g)
)

FOREACH (company IN apoc.convert.fromJsonList(row.production_companies) |
    MERGE (c:ProductionCompany {id: company.id}) ON CREATE SET c.name = company.name
    MERGE (m)-[:PRODUCED_BY]->(c)
)

FOREACH (collection IN CASE WHEN apoc.convert.fromJsonMap(row.belongs_to_collection) IS NOT NULL THEN [apoc.convert.fromJsonMap(row.belongs_to_collection)] ELSE [] END |
    MERGE (c:Collection {id: collection.id}) ON CREATE SET c += collection
    MERGE (m)-[:IN_COLLECTION]->(c)
);

I get this error message:

Cannot merge node using null property value for id (Failure when processing file '/C:/Users/Renato/.Neo4jDesktop/neo4jDatabases/database-efd6656e-a0a8-4575-b51e-bc8fe30c50c9/installation-4.1.0/import/movies_metadata.csv' on line 19732.)

My approach was to identify line 19732 (and to process it alone). And, for my big surprise, it was loaded smoothly. So I tried once again, the entire dataset and again the error message appeared to me. This time I decided to load from line 19730 to 19735, and they were loaded just fine.

Now I'm stuck, I'm very confused with this unexpected behavior of LOAD CV. It really doesn't make any sense for me.

Thanks in advance.

Hello @renatospaka :slight_smile:

According to the error message, it looks like the id is null, can you check if the line 19732 has an ID?

Regards,
Cobra

Hi @cobra.

Yes, I selected the specific line, plus 3 lines above and 3 below (19729 to 19735) that line, mounted a new file, and loaded it. This smaller file, with 6 lines plus header was loaded smoothly, no issues reported. This is what surprised me the most, after all, if the id is missing in the large file, it would also be missing in the small file.

And, here lies my doubt: how can I be sure I have selected the correct line? I mean, line #19732 could be line #19733 if the header line is included. My impression is the reported error line number is not related to the physical position in the file but to one of the internal cursors. Anyway, it is messy to understand this situation.

Can you open the CSV file in an editor to check if all lines have an id? For example, in Excel you can import a CSV and you will can see quickly if an ID is missing.

Sure, I did that. Not the entire file, because it has 45K lines and Excel would take an entire life to load all of it. But I checked the exact line the message is highlighting and 3 above and 3 below. There is no problem with the id itself.

And another odd thing: the error msg points to line 19732. However, as soon as the error occurs, I execute the below query, and the # of nodes is higher.

// What kind of nodes exist
// Sample some nodes, reporting on property and relationship counts per node.
MATCH (n) 
RETURN
DISTINCT labels(n),
count(*) AS SampleSize
labels(n) SampleSize
["Movie"] 19492

The difference is 240. Is a huge difference. Nevertheless, I loaded both lines plus 3 above and 3 below of lines #19493, #19732, and, again, the smaller file loads with zero errors.

Can you share the CSV file?

1 Like

Hi Renato,

I've been getting the same error message when I attempt to merge a set of new nodes and relationships. Do you also have constraints set on your database before using "LOAD CSV"?

Hello @cobra.

The csv file is huge - 45K, but you may download it from Kaggle. There are 8 files, summarizing 900M. The script file to load data is below:
load.cypher.txt (4.0 KB)

Anyway, I figured out what was wrong and it took me a lot of time. The problem, you are correct, is that id field is null for 180 lines, with layout missing or broken, due to description fields have at least one linefeed character in the text, so the lines got split in two or more (there is a case with six new lines for the same id) and caused the misread of the procedure.

It was tricky to locate these lines among 44892 lines, only 180 caused an issue. And due to the size of the file, Excel wasn't able to filter those lines, forcing me to open the entire file in Textpad and search manually. If you figure a better way, do tell me, the way I did isn't either effective neither fast.

Best regards

Hello Eric

Yes, the 1st thing the script does is to create constraints in the database. However, this is not my decision, I just fork it from Github.

Good job @renatospaka, next time, you could create a little Python script and use Pandas, it will be more easier :slight_smile:

Yeah, I need to get my studies back to the road again. But, it would be too much. Currently I am learning nodes/nest, react & neo4j.

Abração.