Index Not Being Used

I am working on importing data through a CSV.

I am trying to insert a large number of rows (more than 1 million) from a CSV file into the graph database. The treasure table needs a relationship to a city, since every treasure is found in a city. city_id is formatted like this: 0x6234F00E0E994A6D8A7C5040FCA8CC1B.

I have created an index, yet Neo4j doesn't seem to be using it:

// Create index to speed up insertion
CREATE INDEX IF NOT EXISTS FOR (ci:City) ON (ci.id);

// Load treasures and link to cities
:auto LOAD CSV FROM 'file:///treasure.csv' AS row_treasure
CALL {
    WITH row_treasure
    CREATE (t:Treasure {id: row_treasure[0], difficulty: toInteger(row_treasure[1]), terrain: toInteger(row_treasure[2]), city_id: row_treasure[3]})
    WITH t, row_treasure
    MATCH (ci:City {id: row_treasure[3]})
    CREATE (t)-[:HIDDEN_IN]->(ci)
} IN TRANSACTIONS OF 500 ROWS;

The index not being used leads to a greater than 10-minute wait for 500 entries to be inserted. Elsewhere in the same execution I am also using an index, and that one is used correctly:

// Create index to speed up insertion
CREATE INDEX IF NOT EXISTS FOR (co:Country) ON (co.code);

// Load cities and link to countries
:auto LOAD CSV FROM 'file:///city.csv' AS row_city
CALL {
    WITH row_city
    CREATE (ci:City {id: row_city[0], city_name: row_city[1], latitude: row_city[2], longitude: row_city[3], postal_code: row_city[4], country_code: row_city[5]})
    WITH ci, row_city
    MATCH (co:Country {code: row_city[5]})
    CREATE (ci)-[:LOCATED_IN]->(co)
} IN TRANSACTIONS OF 500 ROWS;

I have tried looking up different types of indexes and how to use them, but I have been unable to find a solution that works and actually uses the indexes.

Please insert EXPLAIN at the beginning of the query to create a query plan, and share it here.
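That is, something like this (EXPLAIN produces the plan without running the query):

```cypher
EXPLAIN
LOAD CSV FROM 'file:///treasure.csv' AS row_treasure
CALL {
    WITH row_treasure
    CREATE (t:Treasure {id: row_treasure[0], difficulty: toInteger(row_treasure[1]),
                        terrain: toInteger(row_treasure[2]), city_id: row_treasure[3]})
    WITH t, row_treasure
    MATCH (ci:City {id: row_treasure[3]})
    CREATE (t)-[:HIDDEN_IN]->(ci)
} IN TRANSACTIONS OF 500 ROWS;
```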

Here's the query plan

What didn't make it into the image in the "Result" step

You can see the index is getting used via the node index scan. Try increasing the number of rows to 10,000.

You're right, the index does seem to be used, even though SHOW INDEXES doesn't report it ever being read. Increasing the batch to 10,000 rows still didn't help with execution time; it still takes forever, even longer than before, to insert 10,000 rows.

And the query execution plan didn't change at all with the row increase.

The query is pretty straightforward, so I don't see many optimizations. I would just move your "match" before the "create" so you can eliminate the "with" clause.

:auto 
LOAD CSV FROM 'file:///city.csv' AS row_city
CALL {
    WITH row_city
    MATCH (co:Country {code: row_city[5]})
    CREATE (ci:City {
        id: row_city[0], 
        city_name: row_city[1], 
        latitude: row_city[2], 
        longitude: row_city[3], 
        postal_code: row_city[4], 
        country_code: row_city[5]
    })
    CREATE (ci)-[:LOCATED_IN]->(co)
} IN TRANSACTIONS OF 10000 ROWS;

If this is a test database, maybe change the "match" to a "create" and see if it runs faster, so you can know the "match" is the issue.
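That test might look like this (an untested sketch, and only for a throwaway database, since it creates a duplicate Country node per row):

```cypher
:auto
LOAD CSV FROM 'file:///city.csv' AS row_city
CALL {
    WITH row_city
    // CREATE in place of the original MATCH, purely to time the difference
    CREATE (co:Country {code: row_city[5]})
    CREATE (ci:City {id: row_city[0], city_name: row_city[1],
                     latitude: row_city[2], longitude: row_city[3],
                     postal_code: row_city[4], country_code: row_city[5]})
    CREATE (ci)-[:LOCATED_IN]->(co)
} IN TRANSACTIONS OF 10000 ROWS;
```

If this version runs much faster, the Country lookup is the bottleneck.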

You mentioned the "result" step did not make it into the image. There is no result - correct? The plan shows an "EmptyResult". A return value would add time to the query.

There is no return value, that's right. The issue is with the first query in my question (loading from treasure.csv); rearranging the query didn't provide any improvement in speed, and it still takes forever to insert.

I think the issue is that City has so many entries that processing any of the treasure entries takes forever, since it has to search through the cities to create the relationships. I couldn't find a more efficient way to create the relationships, though.

For the first query, you need an index on Country(code). Do you have that?

Why would I need an index on country code when I am not using it there? I do create an index on country code earlier, though. This is the part of the code that is slow:

// Create index to speed up insertion
CREATE INDEX IF NOT EXISTS FOR (ci:City) ON (ci.id);

// Load treasures and link to cities
:auto LOAD CSV FROM 'file:///treasure.csv' AS row_treasure
CALL {
    WITH row_treasure
    MATCH (ci:City {id: row_treasure[3]})
    CREATE (t:Treasure {id: row_treasure[0], difficulty: toInteger(row_treasure[1]), terrain: toInteger(row_treasure[2]), city_id: row_treasure[3]})
    CREATE (t)-[:HIDDEN_IN]->(ci)
} IN TRANSACTIONS OF 10000 ROWS;

@roman.gordon

Do you have any details on the Neo4j version in play here?

I believe I am using version 5.12.0


Edit: I have since tried on 5.19.0 and it has not changed anything about how long it takes

You have two versions of the query. The first has a match on City(id) and the next on Country(code).

They're two different queries; one has the index that works properly and the other doesn't. I guess the city table having a lot more entries than the country table is why it takes almost an hour to insert 10k entries.

Is there a better way than what I am doing to insert such large amounts of data?

Is it better to create the relationships after inserting all the data, maybe? And if so, how could that be achieved?

So, essentially, what I am trying to create by importing CSVs is something like this, where TreasureLog is simply the relationship table, and Hunter and Treasure should be directly linked.

This is the full code I have importing it, the issue arises when importing the treasures as they are each located in 1 city.

// Load countries
LOAD CSV FROM 'file:///country.csv' AS row_country
MERGE (co:Country {code: row_country[0], code3: row_country[1], country_name: row_country[2]});

// Create index to speed up insertion
CREATE INDEX IF NOT EXISTS FOR (co:Country) ON (co.code);

// Load cities and link to countries
:auto LOAD CSV FROM 'file:///city.csv' AS row_city
CALL {
    WITH row_city
    CREATE (ci:City {id: row_city[0], city_name: row_city[1], latitude: row_city[2], longitude: row_city[3], postal_code: row_city[4], country_code: row_city[5]})
    WITH ci, row_city
    MATCH (co:Country {code: row_city[5]})
    CREATE (ci)-[:LOCATED_IN]->(co)
} IN TRANSACTIONS OF 10000 ROWS;

// Load users
:auto LOAD CSV FROM 'file:///user_table.csv' AS row_user
CALL {
    WITH row_user
    CREATE (h:Hunter {id: row_user[0], first_name: row_user[1], last_name: row_user[2], email: row_user[3], street_num: row_user[4], street_name: row_user[5], city_id: row_user[6]})
} IN TRANSACTIONS OF 10000 ROWS;

// Create index to speed up insertion
CREATE INDEX IF NOT EXISTS FOR (ci:City) ON (ci.id);

// Load treasures and link to cities
:auto LOAD CSV FROM 'file:///treasure.csv' AS row_treasure
CALL {
    WITH row_treasure
    MATCH (ci:City {id: row_treasure[3]})
    CREATE (t:Treasure {id: row_treasure[0], difficulty: toInteger(row_treasure[1]), terrain: toInteger(row_treasure[2]), city_id: row_treasure[3]})
    CREATE (t)-[:HIDDEN_IN]->(ci)
} IN TRANSACTIONS OF 10000 ROWS;

// Create FOUND relationships
:auto LOAD CSV FROM 'file:///treasure_log.csv' AS row
CALL {
    WITH row
    MATCH (h:Hunter {id: row[5]})
    MATCH (t:Treasure {id: row[6]})
    MERGE (h)-[r:FOUND]->(t)
} IN TRANSACTIONS OF 10000 ROWS;

First, you want to make sure your indexes are created up front. In your current ordering, some indexes are created after queries have already run. They only need to be created once, so you can do it before all of the load steps.

From your script, I see two queries (cities and treasures) that need to match an existing node in order to create the relationship. This approach is probably matching the same Country or City node many times. We can reduce that by grouping, so we need to match only once per group. The drawback is that it may require more memory, but you can try it and see. Note that I was not able to test it, since I don't have any test data.

// Create indexes to speed up insertion
CREATE INDEX IF NOT EXISTS FOR (co:Country) ON (co.code);
CREATE INDEX IF NOT EXISTS FOR (ci:City) ON (ci.id);

// Load countries
LOAD CSV FROM 'file:///country.csv' AS row_country
MERGE (co:Country {
    code: row_country[0], 
    code3: row_country[1], 
    country_name: row_country[2]
});

:auto 
LOAD CSV FROM 'file:///city.csv' AS row
CALL {
    WITH row
    WITH row[5] as code, collect(row) as row_city
    MATCH (co:Country {code: code})
    FOREACH(i in row_city |
        CREATE (ci:City {
            id: i[0], 
            city_name: i[1], 
            latitude: i[2], 
            longitude: i[3], 
            postal_code: i[4], 
            country_code: i[5]
        })
        CREATE (ci)-[:LOCATED_IN]->(co)
    )
} IN TRANSACTIONS OF 10000 ROWS;

// Load users
:auto 
LOAD CSV FROM 'file:///user_table.csv' AS row_user
CALL {
    WITH row_user
    CREATE (h:Hunter {
        id: row_user[0], 
        first_name: row_user[1], 
        last_name: row_user[2], 
        email: row_user[3], 
        street_num: row_user[4], 
        street_name: row_user[5], 
        city_id: row_user[6]
    })
} IN TRANSACTIONS OF 10000 ROWS;

// Load treasures and link to cities
:auto 
LOAD CSV FROM 'file:///treasure.csv' AS row
CALL {
    WITH row
    WITH row[3] as id, collect(row) as row_treasure
    MATCH (ci:City {id: id})
    FOREACH (i in row_treasure | 
        CREATE (t:Treasure {
            id: i[0],
            difficulty: toInteger(i[1]), 
            terrain: toInteger(i[2]), 
            city_id: i[3]
        })
        CREATE (t)-[:HIDDEN_IN]->(ci)
    )
} IN TRANSACTIONS OF 10000 ROWS;

// Create FOUND relationships
:auto 
LOAD CSV FROM 'file:///treasure_log.csv' AS row
CALL {
    WITH row
    MATCH (h:Hunter {id: row[5]})
    MATCH (t:Treasure {id: row[6]})
    MERGE (h)-[r:FOUND]->(t)
} IN TRANSACTIONS OF 10000 ROWS;

You may be able to refactor the treasure_log query the same way, grouping on whichever id has the smaller number of distinct values.
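An untested sketch of that refactor, assuming the treasure id (row[6]) is the one with fewer distinct values. UNWIND is used instead of FOREACH here because FOREACH cannot contain a MATCH:

```cypher
// Group the log rows by treasure id so each Treasure node
// is matched only once per group.
:auto
LOAD CSV FROM 'file:///treasure_log.csv' AS row
CALL {
    WITH row
    WITH row[6] AS treasure_id, collect(row) AS rows
    MATCH (t:Treasure {id: treasure_id})
    UNWIND rows AS i
    MATCH (h:Hunter {id: i[5]})
    MERGE (h)-[:FOUND]->(t)
} IN TRANSACTIONS OF 10000 ROWS;
```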

Thank you, that has definitely improved the performance of the treasure inserts. Now it's the treasure_log query that's slower, but I imagine I can refactor it to group on the treasure table, as that has fewer entries.

Ultimately I solved the treasure_log by using two indexes and keeping the code as is.
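The two indexes cover the properties matched in that query (Hunter.id and Treasure.id), i.e.:

```cypher
// Indexes backing the two MATCH clauses in the treasure_log query
CREATE INDEX IF NOT EXISTS FOR (h:Hunter) ON (h.id);
CREATE INDEX IF NOT EXISTS FOR (t:Treasure) ON (t.id);
```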


Instead of a "forEach", you can "unwind", so you can also "match" inside the loop.
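For example, the grouped city subquery above could be rewritten like this (untested sketch, same behavior as the FOREACH version):

```cypher
// Same grouping idea, but UNWIND turns the collected rows back into
// individual rows, where MATCH would also be allowed if needed.
:auto
LOAD CSV FROM 'file:///city.csv' AS row
CALL {
    WITH row
    WITH row[5] AS code, collect(row) AS row_city
    MATCH (co:Country {code: code})
    UNWIND row_city AS i
    CREATE (ci:City {id: i[0], city_name: i[1], latitude: i[2],
                     longitude: i[3], postal_code: i[4], country_code: i[5]})
    CREATE (ci)-[:LOCATED_IN]->(co)
} IN TRANSACTIONS OF 10000 ROWS;
```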