I have code that reads data from the web, cleans it up, and stores it in Neo4j. I'm wondering how to "parallelize" this process, since getting the data from the web can be slow. My current setup is something like this:
In `config.py`:

```python
from neo4j import GraphDatabase

class cfg_holder:
    '''Container for global variables.'''
    def __init__(self, params):
        self.params = params
        self.uri = "bolt://localhost:7687"
        self.driver = GraphDatabase.driver(self.uri, auth=("user", "pass"))
        self.db = self.driver.session()

def init(param):
    return cfg_holder(param)
```
In `main.py`:

```python
import concurrent.futures
import config

def func(h):
    # get data
    # build queries
    # when enough data has been collected:
    with h.db.begin_transaction() as tx:
        tx.run(q)
        tx.success = True

if __name__ == "__main__":
    holders = []
    for i in [10, 20]:
        holders.append(config.init(i))
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        executor.map(func, holders)
```
So each `cfg_holder` has access to its own database connection. I'm not sure this is the correct way to set things up. It's possible that my design pattern is entirely off here. What's the right way to set this kind of thing up? Do I need locks somewhere? Are threads even the right tool for this job? Looking for some general advice here...
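For context, a stripped-down version of just the fetch step I'm trying to parallelize, with no Neo4j involved (here `fetch` is a hypothetical stand-in for the slow web request):

```python
import concurrent.futures
import time

def fetch(url):
    # Hypothetical stand-in for the slow web request;
    # a real version would use requests.get(url) or similar.
    time.sleep(0.1)
    return f"data from {url}"

urls = [f"http://example.com/{i}" for i in range(4)]

# Threads suit I/O-bound work like web requests: while one thread
# waits on the network, others can run. executor.map returns the
# results in the same order as the inputs.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(fetch, urls))

print(results[0])  # data from http://example.com/0
```

This part seems straightforward; my uncertainty is about how the database writes should fit around it.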