I have code that reads data from the web, cleans it up, and stores it in Neo4j. I'm wondering how to "parallelize" this process, since fetching the data from the web can be slow. My current setup looks something like this:
```python
# config.py
from neo4j import GraphDatabase

class cfg_holder():
    '''Container for global variables.'''
    def __init__(self, params):
        self.params = params
        self.uri = "bolt://localhost:7687"
        self.driver = GraphDatabase.driver(self.uri, auth=("user", "pass"))
        self.db = self.driver.session()

def init(param):
    return cfg_holder(param)
```
```python
import concurrent.futures
import config

def func(h):
    # get data
    # build queries
    # when enough data has been collected:
    with h.db.begin_transaction() as tx:
        tx.run(q)
        tx.success = True

if __name__ == "__main__":
    holders = []
    for i in [10, 20]:
        holders.append(config.init(i))
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
        executor.map(func, holders)
```
Each cfg_holder instance holds its own driver and session, so each thread works with its own db connection. I'm not sure this is the correct way to set things up.
It's possible that my design pattern is entirely off here. What's the right way to set this kind of thing up? Do I need locks anywhere? Are threads even the right tool for this? Looking for some general advice here...
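For context, here's a minimal sketch of the per-thread-session pattern I've been considering as an alternative: instead of building a session per cfg_holder up front, each worker thread lazily creates its own session via `threading.local()`. `FakeSession` is a hypothetical stand-in for a real Neo4j session, just so the sketch runs without a database:

```python
import concurrent.futures
import threading

class FakeSession:
    """Hypothetical stand-in for a Neo4j session (real sessions are not thread-safe)."""
    def run(self, query):
        return query

thread_local = threading.local()

def get_session():
    # Lazily create one session per worker thread instead of sharing one.
    if not hasattr(thread_local, "session"):
        thread_local.session = FakeSession()
    return thread_local.session

def worker(param):
    session = get_session()  # this thread's own session
    return session.run(f"CREATE (n {{value: {param}}})")

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(worker, [10, 20]))
```

The idea is that the slow web fetches overlap across threads, while each thread only ever touches its own session. Is this closer to the intended pattern?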