Scrapy Pipeline to Neo4j Bulk Store Very Slow

I need help speeding up the process of inserting items from a Scrapy pipeline into Neo4j. I am currently working on a project where I am scraping data for about a million patents and storing their information and connections with Neo4j. Each patent has on average 10 different connections, including assignees, inventors, classifications and, most importantly, connections to other patents.

Neo4j Server version: 4.0.4 (community)
Neo4j Browser version: 4.0.8
Py2Neo Version: 5.0b1

I am storing these items in Neo4j from Python using py2neo and UNWIND queries, but it takes WAY too long (several seconds) per item. Any suggestions on how to speed this up? Here's an example snippet from my code:

def assignee(item):
    user = item.get("user")
    for assignee in user['assignees']:
        assignee_user = parse_user(assignee)

        # .get() avoids KeyError, and drops the stray trailing commas
        # that previously turned these values into one-element tuples
        fullname = assignee_user.get('fullname', '')
        first_name = assignee_user.get('first_name', '')
        last_name = assignee_user.get('last_name', '')

        assignee = {
            "fullname": fullname,
            "first_name": first_name,
            "last_name": last_name
        }

        if assignee_user['status'] == 3:
            location = {
                "city": assignee_user['city'],
                "state": assignee_user['state'],
                "country": assignee_user['country']
            }

        elif assignee_user['status'] == 2:
            location = {
                "city": assignee_user['city'],
                "state": None,
                "country": assignee_user['country']
            }

        else:
            # status 0 or anything unexpected: no location data,
            # so `location` is always bound before the yield
            location = {
                "city": None,
                "state": None,
                "country": None
            }

        yield assignee, location



params = []
for assignee_props, location in assignee(item):
    params.append({
        'fullname': assignee_props['fullname'],
        'first_name': assignee_props['first_name'],
        'last_name': assignee_props['last_name'],
        'city': location['city'],
        'state': location['state'],
        'country': location['country']
    })

# Pass document_number as a parameter instead of concatenating strings,
# and use the Neo4j 4.x parameter syntax ($datas, not {$datas})
q = """
    MATCH (patent:Patent) WHERE patent.document_number = $document_number
    UNWIND $datas AS data
    MERGE (assignee:User {fullname: data.fullname})
    SET assignee.first_name = data.first_name,
        assignee.last_name = data.last_name
    MERGE (city:City {name: data.city})
    MERGE (patent)-[:ASSIGNEE]->(assignee)
    MERGE (assignee)-[:LOCATED_IN]->(city)
"""
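For a workload like this, the two usual fixes are (1) creating an index on every property you MERGE or MATCH on, since without one each MERGE scans all nodes of that label and gets slower as the graph grows, and (2) batching many rows into one transaction instead of one round trip per patent. A rough sketch of both, assuming py2neo's `Graph.run` API (the `batches` and `store` helpers are illustrative names, not part of py2neo):

```python
# Index creation statements (Neo4j 4.0 syntax); run each once with
# graph.run(stmt). MERGE without an index on the merged property
# does a full label scan on every row.
INDEX_STATEMENTS = [
    "CREATE INDEX FOR (p:Patent) ON (p.document_number)",
    "CREATE INDEX FOR (u:User) ON (u.fullname)",
    "CREATE INDEX FOR (c:City) ON (c.name)",
]


def batches(rows, size=1000):
    """Split accumulated parameter dicts into chunks so one UNWIND
    transaction carries many rows instead of one per item."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]


def store(graph, query, rows, size=1000):
    # graph is a py2neo Graph; keyword arguments to run() become
    # Cypher parameters, so $datas receives each chunk
    for chunk in batches(rows, size):
        graph.run(query, datas=chunk)
```

One caveat: `MERGE (city:City {name: data.city})` will fail for rows where the city is null, so you may need to filter those rows out (or split the query) before batching.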
