Index Not Being Used

I am working on importing data through a CSV.

I am trying to insert a large number of rows (more than 1 million) from a CSV file into the graph database. The treasure table needs a relationship to a city, since every treasure is found in a city. city_id is formatted like this: 0x6234F00E0E994A6D8A7C5040FCA8CC1B.

I have created an index, yet Neo4j doesn't seem to be using it:

// Create index to speed up insertion
CREATE INDEX IF NOT EXISTS FOR (ci:City) ON (ci.id);

// Load treasures and link to cities
:auto LOAD CSV FROM 'file:///treasure.csv' AS row_treasure
CALL {
    WITH row_treasure
    CREATE (t:Treasure {id: row_treasure[0], difficulty: toInteger(row_treasure[1]), terrain: toInteger(row_treasure[2]), city_id: row_treasure[3]})
    WITH t, row_treasure
    MATCH (ci:City {id: row_treasure[3]})
    CREATE (t)-[:HIDDEN_IN]->(ci)
} IN TRANSACTIONS OF 500 ROWS;

The index not being used leads to a greater than 10-minute wait for 500 entries to be inserted. Elsewhere in the same execution I am also using an index, and that one is used correctly:

// Create index to speed up insertion
CREATE INDEX IF NOT EXISTS FOR (co:Country) ON (co.code);

// Load cities and link to countries
:auto LOAD CSV FROM 'file:///city.csv' AS row_city
CALL {
    WITH row_city
    CREATE (ci:City {id: row_city[0], city_name: row_city[1], latitude: row_city[2], longitude: row_city[3], postal_code: row_city[4], country_code: row_city[5]})
    WITH ci, row_city
    MATCH (co:Country {code: row_city[5]})
    CREATE (ci)-[:LOCATED_IN]->(co)
} IN TRANSACTIONS OF 500 ROWS;

I have tried looking up different types of indexes and how to use them, but I have been unable to find a solution that works and actually uses the indexes.

Please insert EXPLAIN at the beginning of the query to create a query plan, and share it here.
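That is, something like this (EXPLAIN produces the plan without running the query):

```cypher
EXPLAIN
LOAD CSV FROM 'file:///treasure.csv' AS row_treasure
CALL {
    WITH row_treasure
    CREATE (t:Treasure {id: row_treasure[0], difficulty: toInteger(row_treasure[1]),
                        terrain: toInteger(row_treasure[2]), city_id: row_treasure[3]})
    WITH t, row_treasure
    MATCH (ci:City {id: row_treasure[3]})
    CREATE (t)-[:HIDDEN_IN]->(ci)
} IN TRANSACTIONS OF 500 ROWS;
```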

Here's the query plan

What didn't make it into the image in the "Result" step

You can see the index is getting used via the node index scan. Try increasing the number of rows to 10,000.

You're right, the index does seem to be used, even though SHOW INDEXES doesn't report it ever being read. Increasing the batch to 10,000 rows still didn't help with execution time; it still takes forever, even longer than before, to insert 10,000 rows.

And the query execution plan didn't change at all with the row increase.

The query is pretty straightforward, so I don't see many optimizations. I would just move your "match" before the "create" so you can eliminate the "with" clause.

:auto 
LOAD CSV FROM 'file:///city.csv' AS row_city
CALL {
    WITH row_city
    MATCH (co:Country {code: row_city[5]})
    CREATE (ci:City {
        id: row_city[0], 
        city_name: row_city[1], 
        latitude: row_city[2], 
        longitude: row_city[3], 
        postal_code: row_city[4], 
        country_code: row_city[5]
    })
    CREATE (ci)-[:LOCATED_IN]->(co)
} IN TRANSACTIONS OF 10000 ROWS;

If this is a test database, maybe change the "match" to a "create" and see if it runs faster, so you can know the "match" is the issue.
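That test might look like this (an untested sketch, and only for a throwaway database, since it creates a duplicate Country node per row):

```cypher
:auto
LOAD CSV FROM 'file:///city.csv' AS row_city
CALL {
    WITH row_city
    // CREATE in place of the original MATCH, purely to time the difference
    CREATE (co:Country {code: row_city[5]})
    CREATE (ci:City {id: row_city[0], city_name: row_city[1],
                     latitude: row_city[2], longitude: row_city[3],
                     postal_code: row_city[4], country_code: row_city[5]})
    CREATE (ci)-[:LOCATED_IN]->(co)
} IN TRANSACTIONS OF 10000 ROWS;
```

If this version runs much faster, the Country lookup is the bottleneck.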

You mentioned the "result" step did not make it into the image. There is no result - correct? The plan shows an "EmptyResult". A return value would add time to the query.

There is no return value, that's right. The issue is with the first query in my question (loading from treasure.csv); rearranging the query didn't provide any improvement in speed, and it still takes forever to insert.

I think the issue is that City has so many entries that processing any of the treasure entries takes forever, since it has to search through the cities to create the relationships. I couldn't find a more efficient way to create the relationships, though.

For the first query, you need an index on Country(code). Do you have that?

Why would I need an index on country code when I am not using it there? I do create an index on country code earlier, though. This is the part of the code that is slow:

// Create index to speed up insertion
CREATE INDEX IF NOT EXISTS FOR (ci:City) ON (ci.id);

// Load treasures and link to cities
:auto LOAD CSV FROM 'file:///treasure.csv' AS row_treasure
CALL {
    WITH row_treasure
    MATCH (ci:City {id: row_treasure[3]})
    CREATE (t:Treasure {id: row_treasure[0], difficulty: toInteger(row_treasure[1]), terrain: toInteger(row_treasure[2]), city_id: row_treasure[3]})
    CREATE (t)-[:HIDDEN_IN]->(ci)
} IN TRANSACTIONS OF 10000 ROWS;

@roman.gordon

Do you have any details on the Neo4j version in play here?

I believe I am using version 5.12.0


Edit: I have since tried on 5.19.0 and it has not changed anything about how long it takes

You have two versions of the query. The first has a match on City(id) and the next on Country(code).

They're two different queries; one has the index that works properly and the other doesn't. I guess the city table having a lot more entries than the country table is why it takes almost an hour to insert 10k entries.

Is there a better way than what I am doing to insert such large amounts of data?

Is it better to create the relationships after inserting all the data, maybe? And if so, how could that be achieved?

So, essentially, what I am trying to create by importing CSVs is something like this, where TreasureLog is simply the relationship table, and Hunter and Treasure should be directly linked.

This is the full code I have importing it, the issue arises when importing the treasures as they are each located in 1 city.

// Load countries
LOAD CSV FROM 'file:///country.csv' AS row_country
MERGE (co:Country {code: row_country[0], code3: row_country[1], country_name: row_country[2]});

// Create index to speed up insertion
CREATE INDEX IF NOT EXISTS FOR (co:Country) ON (co.code);

// Load cities and link to countries
:auto LOAD CSV FROM 'file:///city.csv' AS row_city
CALL {
    WITH row_city
    CREATE (ci:City {id: row_city[0], city_name: row_city[1], latitude: row_city[2], longitude: row_city[3], postal_code: row_city[4], country_code: row_city[5]})
    WITH ci, row_city
    MATCH (co:Country {code: row_city[5]})
    CREATE (ci)-[:LOCATED_IN]->(co)
} IN TRANSACTIONS OF 10000 ROWS;

// Load users
:auto LOAD CSV FROM 'file:///user_table.csv' AS row_user
CALL {
    WITH row_user
    CREATE (h:Hunter {id: row_user[0], first_name: row_user[1], last_name: row_user[2], email: row_user[3], street_num: row_user[4], street_name: row_user[5], city_id: row_user[6]})
} IN TRANSACTIONS OF 10000 ROWS;

// Create index to speed up insertion
CREATE INDEX IF NOT EXISTS FOR (ci:City) ON (ci.id);

// Load treasures and link to cities
:auto LOAD CSV FROM 'file:///treasure.csv' AS row_treasure
CALL {
    WITH row_treasure
    MATCH (ci:City {id: row_treasure[3]})
    CREATE (t:Treasure {id: row_treasure[0], difficulty: toInteger(row_treasure[1]), terrain: toInteger(row_treasure[2]), city_id: row_treasure[3]})
    CREATE (t)-[:HIDDEN_IN]->(ci)
} IN TRANSACTIONS OF 10000 ROWS;

// Create FOUND relationships
:auto LOAD CSV FROM 'file:///treasure_log.csv' AS row
CALL {
    WITH row
    MATCH (h:Hunter {id: row[5]})
    MATCH (t:Treasure {id: row[6]})
    MERGE (h)-[r:FOUND]->(t)
} IN TRANSACTIONS OF 10000 ROWS;

First, you want to make sure your indexes are created up front. In your current ordering, some indexes are created after queries have already run. They only need to be created once, so you can do it before all of the load steps.

From your script, I see two queries (cities and treasures) that need to match an existing node in order to create the relationship. This approach is probably matching the same Country or City node many times. We can reduce that by grouping, so we need to match only once per group. The drawback is that it may require more memory, but you can try it and see. Note that I was not able to test it, since I don't have any test data.

// Create indexes to speed up insertion
CREATE INDEX IF NOT EXISTS FOR (co:Country) ON (co.code);
CREATE INDEX IF NOT EXISTS FOR (ci:City) ON (ci.id);

// Load countries
LOAD CSV FROM 'file:///country.csv' AS row_country
MERGE (co:Country {
    code: row_country[0], 
    code3: row_country[1], 
    country_name: row_country[2]
});

:auto 
LOAD CSV FROM 'file:///city.csv' AS row
CALL {
    WITH row
    WITH row[5] as code, collect(row) as row_city
    MATCH (co:Country {code: code})
    FOREACH(i in row_city |
        CREATE (ci:City {
            id: i[0], 
            city_name: i[1], 
            latitude: i[2], 
            longitude: i[3], 
            postal_code: i[4], 
            country_code: i[5]
        })
        CREATE (ci)-[:LOCATED_IN]->(co)
    )
} IN TRANSACTIONS OF 10000 ROWS;

// Load users
:auto 
LOAD CSV FROM 'file:///user_table.csv' AS row_user
CALL {
    WITH row_user
    CREATE (h:Hunter {
        id: row_user[0], 
        first_name: row_user[1], 
        last_name: row_user[2], 
        email: row_user[3], 
        street_num: row_user[4], 
        street_name: row_user[5], 
        city_id: row_user[6]
    })
} IN TRANSACTIONS OF 10000 ROWS;

// Load treasures and link to cities
:auto 
LOAD CSV FROM 'file:///treasure.csv' AS row
CALL {
    WITH row
    WITH row[3] as id, collect(row) as row_treasure
    MATCH (ci:City {id: id})
    FOREACH (i in row_treasure | 
        CREATE (t:Treasure {
            id: i[0],
            difficulty: toInteger(i[1]), 
            terrain: toInteger(i[2]), 
            city_id: i[3]
        })
        CREATE (t)-[:HIDDEN_IN]->(ci)
    )
} IN TRANSACTIONS OF 10000 ROWS;

// Create FOUND relationships
:auto 
LOAD CSV FROM 'file:///treasure_log.csv' AS row
CALL {
    WITH row
    MATCH (h:Hunter {id: row[5]})
    MATCH (t:Treasure {id: row[6]})
    MERGE (h)-[r:FOUND]->(t)
} IN TRANSACTIONS OF 10000 ROWS;

You may be able to refactor the treasure_log query the same way, grouping on whichever id has the smaller number of distinct values.
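An untested sketch of that refactor, assuming the treasure id (row[6]) is the one with fewer distinct values. UNWIND is used instead of FOREACH here because FOREACH cannot contain a MATCH:

```cypher
// Group the log rows by treasure id so each Treasure node
// is matched only once per group.
:auto
LOAD CSV FROM 'file:///treasure_log.csv' AS row
CALL {
    WITH row
    WITH row[6] AS treasure_id, collect(row) AS rows
    MATCH (t:Treasure {id: treasure_id})
    UNWIND rows AS i
    MATCH (h:Hunter {id: i[5]})
    MERGE (h)-[:FOUND]->(t)
} IN TRANSACTIONS OF 10000 ROWS;
```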

Thank you, that has definitely improved the performance of the treasure inserts. Now it's the treasure_log query that's slower, but I imagine I can refactor it to group on the treasure table, as that has fewer entries.

Ultimately I solved the treasure_log by using two indexes and keeping the code as is.
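The two indexes cover the properties matched in that query (Hunter.id and Treasure.id), i.e.:

```cypher
// Indexes backing the two MATCH clauses in the treasure_log query
CREATE INDEX IF NOT EXISTS FOR (h:Hunter) ON (h.id);
CREATE INDEX IF NOT EXISTS FOR (t:Treasure) ON (t.id);
```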


Instead of a "forEach", you can "unwind", so you can also "match" inside the loop.
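For example, the grouped city subquery above could be rewritten like this (untested sketch, same behavior as the FOREACH version):

```cypher
// Same grouping idea, but UNWIND turns the collected rows back into
// individual rows, where MATCH would also be allowed if needed.
:auto
LOAD CSV FROM 'file:///city.csv' AS row
CALL {
    WITH row
    WITH row[5] AS code, collect(row) AS row_city
    MATCH (co:Country {code: code})
    UNWIND row_city AS i
    CREATE (ci:City {id: i[0], city_name: i[1], latitude: i[2],
                     longitude: i[3], postal_code: i[4], country_code: i[5]})
    CREATE (ci)-[:LOCATED_IN]->(co)
} IN TRANSACTIONS OF 10000 ROWS;
```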