Another "speed up the load" question from a relatively inexperienced Neo4j user

Hello @bill_dickenson :slight_smile:

You can change my code above a bit and it will work with JSON; you just don't need to convert each element of your JSON to a dict in the merge functions, since it's already a list of dicts :slight_smile:

All changed over, and it does look better, but I'm having two odd problems.

Here are the two json files

{
	"0": {
		"EIEO": true,
		"FILECOUNT": 1,
		"KDM": "data:Reads",
		"changed": false,
		"ctx": "1482759651",
		"level": "code",
		"location": [
			22540,
			53,
			22540,
			53
		],
		"node": "54914",
		"quvioDensity": 1.0,
		"quviolations": 2,
		"szAFP": "",
		"szaep": 8,
		"szlocs": 2,
		"text": "UserService",
		"type": "typeName"
	},
	"1": {
		"EIEO": true,
		"FILECOUNT": 1,
		"KDM": "data:Reads",
		"changed": false,
		"ctx": "1482759651",
		"level": "code",
		"location": [
			22540,
			53,
			22540,
			53
		],
		"node": "54914",
		"quvioDensity": 1.0,
		"quviolations": 2,
		"szAFP": "",
		"szaep": 8,
		"szlocs": 2,
		"text": "UserService",
		"type": "typeName"
	},

and the relationships

{
	"0": {
		"compile": "webgoat.combined.source",
		"from": "0",
		"to": "54690"
	},
	"1": {
		"compile": "webgoat.combined.source",
		"from": "1",
		"to": "2"
	},
	"100": {
		"compile": "webgoat.combined.source",
		"from": "100",
		"to": "101"
	},

Code - I dropped some housekeeping

def merge_relation(args):
    """
    Function to create relations from a batch.
    """
    if len(BATCH['batch']) > 1000:
        with graphDB_Driver.session() as ses:
            ses.run("UNWIND $batch AS row MATCH (a:ProgNode{inode:row.a}) MATCH (b:ProgNode{inode:row.b}) CALL apoc.merge.relationship(a, 'PROGRAM', {}, apoc.map.removeKeys(properties(row), ['a', 'b']), b) YIELD rel RETURN 1", batch=BATCH["batch"])
        reset_batch()
    BATCH['batch'].append(args.to_dict())


def merge_node(args):
    """
    Function to create nodes from a batch.
    """
    if len(BATCH['batch']) > 1000:
        with graphDB_Driver.session() as ses:
            ses.run("UNWIND $batch AS row CALL apoc.merge.node(['ProgNode', row.nodetype], {inode:row.inode}, apoc.map.removeKeys(properties(row), ['nodetype', 'inode'])) YIELD node RETURN 1", batch=BATCH["batch"])
        reset_batch()
    BATCH['batch'].append(args.to_dict())



def main(fname):
    print("Starting load of %s - nodes \n" % filenode)
    nodes = pd.read_json(filenode, encoding='utf-8')
    print("Starting load of %s - connections \n" % filematch)
    relations = pd.read_json(filematch, encoding='utf-8')
    print("Files loaded %s - connections \n" % filematch)
    nodes.apply(lambda h: merge_node(h), axis=1)
    reset_batch()
    relations.apply(lambda h: merge_relation(h), axis=1)    
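
The housekeeping I dropped includes the batching helpers; they are roughly this (a sketch, not the exact code I omitted):

BATCH = {'batch': []}

def reset_batch():
    """Clear the accumulated batch after it has been sent to Neo4j."""
    BATCH['batch'] = []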

Two issues.

Nodes load correctly; relationships do not, and the error is a bit obscure.

 File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\json\_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value

The call is the same as before.

And I assume the names have to change but I hesitate.

Did you check in Neo4j browser if the nodes were loaded correctly?

Can you print the content of the batch of relations to check what is in it?

Nothing loaded into Neo4j at all. The whole JSON file is about 54K nodes in this example; I confirmed that in a few places. Now when it loads into pandas, it does look like the whole file loaded, but with the fields pivoted first. (APL strikes back - lol)

    nodes = pd.read_json(filenode, encoding='utf-8')
    print(nodes)

So this is working....

Now we run the APOC call after we pivot the nodes file with nodes.apply using the lambda. I added the rest of the section.

    start_time = time.time()    
    print("Starting load of %s - nodes \n" % filenode)
    nodes = pd.read_json(filenode, encoding='utf-8')
    print(nodes)
    print("Starting load of %s - connections \n" % filematch)
#    relations = pd.read_json(filematch, encoding='utf-8')
    print("Files loaded %s - connections \n" % filematch)
    nodes.apply(lambda h: merge_node(h), axis=1)
    reset_batch()
#    relations.apply(lambda h: merge_relation(h), axis=1)    

but it doesn't look like it pivoted. No error, just silence. I think I am actually seeing each field (15 of them), not each row.


I also commented out the relationship load, as that wasn't loading into pandas. That's odd, as it is simple compared to the other file.

I do feel guilty about asking, but if you do have a rate and are up to the consulting (or even codementor) I am willing to pay to solve this.

At any rate, thank you for the help so far.

Hello, my boss will contact you :slight_smile:

Can you try this? :slight_smile:

def merge_node(args):
    """
    Function to create nodes from a batch.
    """
    if len(BATCH['batch']) > 1000:
        with graphDB_Driver.session() as ses:
            ses.run("UNWIND $batch AS row CALL apoc.merge.node(['ProgNode', row.nodetype], {inode:row.inode}, apoc.map.removeKeys(properties(row), ['nodetype', 'inode'])) YIELD node RETURN 1", batch=BATCH["batch"])
        reset_batch()
    BATCH['batch'].append(args.to_dict())
nodes = pd.read_json(filenode, encoding='utf-8')
nodes = nodes.T
nodes['inode'] = nodes.index
nodes.apply(lambda h: merge_node(h), axis=1)
reset_batch()
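
The transpose is the key part: pd.read_json on a top-level object of objects gives one column per record ("0", "1", ...) with the field names as the index, so .T flips it to one row per record. A tiny illustration with made-up data (not your files):

from io import StringIO
import pandas as pd

# Toy example: two records keyed by "0" and "1", three fields each.
raw = '{"0": {"node": "54914", "level": "code", "type": "typeName"}, "1": {"node": "54915", "level": "code", "type": "block"}}'
df = pd.read_json(StringIO(raw))
print(df.shape)         # (3, 2) -> fields as rows, records as columns
df = df.T
df['inode'] = df.index  # the record key becomes the inode column
print(df.shape)         # (2, 4) -> one row per record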

I have some good news and some bad, but we are close.

It's all working as far as the code goes. I can't see what's being sent, but it is running. Minor changes (inode is now node) and I added some last-record logic. But as soon as it ends, it sends back a pair of error messages and nothing shows up in Neo4j. However, I think it's the same issue.

I did make some minor changes to the code:

def merge_node(args):
    global INNODE, NODECOUNT
    """
    Function to create nodes from a batch.
    """
    INNODE += 1
    if (INNODE % 10000) == 0:
        print("...Sent %s of %s for processig" % (INNODE, NODECOUNT))
    if (len(BATCH['batch']) > 1000) or (INNODE == NODECOUNT):
        if INNODE == NODECOUNT:
            print("...Final Record (%s) added and transmitted" % INNODE)
            BATCH['batch'].append(args.to_dict())            
        with graphDB_Driver.session() as ses:
            ses.run("UNWIND $batch AS row CALL apoc.merge.node(['ProgNode', row.nodetype], {node:row.inode}, apoc.map.removeKeys(properties(row), ['nodetype', 'node'])) YIELD node RETURN 1", batch=BATCH["batch"])
        reset_batch()
    BATCH['batch'].append(args.to_dict())

This is the content of the batch:

Load Neo4j file webgoat
Sections : ['Neo4J', 'SourceMachine']
GraphDatabase.driver(bolt://dev.Veriprism.net:7687
webgoat.combined.source
files: webgoat.combined.source.neo-n webgoat.combined.sourceneo-c
Starting load of webgoat.combined.source.neo-n - nodes

{'batch': [{'EIEO': True, 'FILECOUNT': 1, 'KDM': 'data:Reads', 'changed': False, 'ctx': '1033320531', 'level': 'code', 'location': [4835, 30, 4835, 30], 'node': 10001, 'quvioDensity': 0.5, 'quviolations': 1, 'szAFP': '', 'szaep': 17, 'szlocs': 2, 'text': 'user', 'type': 'typeName'},

This does look reasonable. 

Here is the code that made the connection

uri             = configur.get("Neo4J", "host")
userName        = configur.get("Neo4J", "id")
password        = configur.get("Neo4J", "pw")
print("GraphDatabase.driver(" + uri)
graphDB_Driver  = GraphDatabase.driver(uri, auth=(userName, password))
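
In case it matters, configur is just a ConfigParser set up earlier, something like this (the file name here is a placeholder, not the real one):

from configparser import ConfigParser

# Rough sketch of the setup behind 'configur'; the actual file name and
# credentials are not shown here.
configur = ConfigParser()
configur.read('loader.ini')   # has a [Neo4J] section with host, id and pw keys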

Can you try the code on a local database (build one in Neo4j Desktop) and not a remote one?
Which version of Neo4j are you using? (I advise you to use the latest one, 4.1.)

Loaded up a local database, added APOC.

Ran the code. This is what I found

Starting load of webgoat.combined.source.neo-n - nodes 

Traceback (most recent call last):
  File "F:/ClientSide/current/testload1.py", line 125, in <module>
    main(fname)
  File "F:/ClientSide/current/testload1.py", line 98, in main
    nodes.apply(lambda h: merge_node(h), axis=1)
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\frame.py", line 6878, in apply
    return op.get_result()
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\apply.py", line 186, in get_result
    return self.apply_standard()
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\apply.py", line 296, in apply_standard
    values, self.f, axis=self.axis, dummy=dummy, labels=labels
  File "pandas\_libs\reduction.pyx", line 620, in pandas._libs.reduction.compute_reduction
  File "pandas\_libs\reduction.pyx", line 128, in pandas._libs.reduction.Reducer.get_result
  File "F:/ClientSide/current/testload1.py", line 98, in <lambda>
    nodes.apply(lambda h: merge_node(h), axis=1)
  File "F:/ClientSide/current/testload1.py", line 54, in merge_node
    ses.run("UNWIND $batch AS row CALL apoc.merge.node(['ProgNode', row.nodetype], {node:row.inode}, apoc.map.removeKeys(properties(row), ['nodetype', 'node'])) YIELD node RETURN 1", batch=BATCH["batch"])
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\neo4j\__init__.py", line 499, in run
    self._connection.fetch()
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\neobolt\direct.py", line 422, in fetch
    return self._fetch()
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\neobolt\direct.py", line 464, in _fetch
    response.on_failure(summary_metadata or {})
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\neobolt\direct.py", line 759, in on_failure
    raise CypherError.hydrate(**metadata)
neobolt.exceptions.ClientError: Failed to invoke procedure `apoc.merge.node`: Caused by: java.lang.NullPointerException
>>> 

Did you upgrade the Python Neo4j driver too?

pip install --upgrade neo4j

I did. So Neo4j 4.1, new APOC, New driver. Same issue. Rebooted. Restarted - same issue.

thanks

To be honest, I don't know where this error is coming from.

Can you print the content of the batch before the send to the database?

This was a few posts back, but here are the contents of the batch. Only the first line is shown, but the rest have the same format.

Do you have an example in the batch where a record has one or several null values? I think the problem is coming from there :)

When you have your DataFrame, try replacing all NaN and null values with an empty string, for example.
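
Something like this (a sketch, using the nodes DataFrame from earlier):

# Sketch: replace NaN/None values with an empty string so no nulls reach apoc.merge.node
nodes = nodes.fillna('')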

Embarrassed to say, I found it.

My naming conventions were done in a hurry and I had introduced some inconsistencies. Someone noted that two referenced variables were not there, and when that was fixed, it worked fine. So everything is now working mechanically. The relationships are not being created correctly, but since the nodes are, I think I can puzzle it out.

Thank you again, this was way more complicated than it should have been and you solved it.


No problem, I'm happy to hear this :slight_smile:

Regards,
Cobra

I do need just a tad more help.

So the content of batch is:

[
 {"child": "54690", "compile": "webgoat.combined.source", "parent": "0", "tree": "runs", "from": 0},
 {"child": "2", "compile": "webgoat.combined.source", "parent": "1", "tree": "calls", "from": 1},
 {"child": "101", "compile": "webgoat.combined.source", "parent": "100", "tree": "runs", "from": 100},
 {"child": "1001", "compile": "webgoat.combined.source", "parent": "1000", "tree": "runs", "from": 1000},
 {"child": "10001", "compile": "webgoat.combined.source", "parent": "10000", "tree": "runs", "from": 10000},
 {"child": "10004", "compile": "webgoat.combined.source", "parent": "10003", "tree": "runs", "from": 10003},
 {"child": "10009", "compile": "webgoat.combined.source", "parent": "10004", "tree": "runs", "from": 10004},
 {"child": "10007", "compile": "webgoat.combined.source", "parent": "10005", "tree": "runs", "from": 10005},
 {"child": "10008", "compile": "webgoat.combined.source", "parent": "10007", "tree": "runs", "from": 10007},
 {"child": "1005", "compile": "webgoat.combined.source", "parent": "1001", "tree": "runs", "from": 1001},
 {"child": "1003", "compile": "webgoat.combined.source", "parent": "1002", "tree": "runs", "from": 1002}
 ]

and of course the nodes (which are already created) are:

[{"EIEO": false, "FILECOUNT": 1, "KDM": "code:StorableUnit", "changed": false, "ctx": "1793546528", "inode": "5050", "level": "code", "location": [2607, 18, 2607, 18], "quvioDensity": 1.0, "quviolations": 2, "szAFP": "", "szaep": 10, "szlocs": 2, "text": "final", "type": "fieldModifier", "node": 5050},
 {"EIEO": false, "FILECOUNT": 1, "KDM": "Action:Addresses", "changed": false, "ctx": "259837957", "inode": "50500", "level": "code", "location": [20399, 39, 20399, 39], "quvioDensity": 0.0, "quviolations": 0, "szAFP": "", "szaep": 28, "szlocs": 2, "text": "e", "type": "variableDeclaratorId", "node": 50500},
 {"EIEO": true, "FILECOUNT": 1, "KDM": "data:Writes", "changed": false, "ctx": "1571545022", "inode": "50501", "level": "code", "location": [20399, 42, 20401, 8], "quvioDensity": 0.0, "quviolations": 0, "szAFP": "", "szaep": 27, "szlocs": 4, "text": "{log.error(\"Error occurred while writing the logfile\",e);}", "type": "block", "node": 50501}]

and I need to create a relationship between parent and child, but ONLY if they share the same compile unit. It's possible that two different compiles could have a node 0 (in fact, that's a certainty) and I don't want to create relationships across compile units.

Now based on your example, this is my code

ses.run("UNWIND $batch AS row MATCH (a:ProgNode{inode:row.parent}) MATCH (b:ProgNode{inode:row.child}) CALL apoc.merge.relationship(a, row.tree, {compileunit:row.compile}, apoc.map.removeKeys(properties(row), ['parent', 'child']),b) YIELD rel RETURN 1", batch=BATCH["batch"])

I am not getting an error (good) but I am also not getting a relationship (bad)

in Cypher I would have written this as

MATCH (a:ProgNode {inode: parent, compileunit: compile}) WITH a MATCH (b:ProgNode {inode: child, compileunit: compile}) MERGE (a)-[r:tree {compileunit: '%s', source: '%s'}]->(b);

The r:tree adds a wrinkle as well, since the relationship type can't be parameterized in plain Cypher.
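
What I think I'm after is roughly this (just a sketch, assuming the nodes also carry a compileunit property as in the Cypher above; not tested):

# Sketch: match both ends on inode AND compileunit, then let APOC create the
# dynamically-typed relationship from row.tree.
ses.run(
    "UNWIND $batch AS row "
    "MATCH (a:ProgNode {inode: row.parent, compileunit: row.compile}) "
    "MATCH (b:ProgNode {inode: row.child, compileunit: row.compile}) "
    "CALL apoc.merge.relationship(a, row.tree, {compileunit: row.compile}, "
    "apoc.map.removeKeys(properties(row), ['parent', 'child']), b) "
    "YIELD rel RETURN 1",
    batch=BATCH["batch"])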

This is close to the last thing I need to do. Can you help ?

The first batch you gave is the one for the relationships?
Are you sure inode is a string and not an integer?

Duh - thank you - that was it. Made the nodes integers and it worked. Thank you again
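
For anyone else who hits this: the identity property on the node side was an integer while parent/child in the relationship batch were strings, so the MATCHes silently found nothing. The fix was along these lines (column names as in my batch above; not the exact code I ran):

# Align types before sending the relationship batch: the node ids are integers,
# so the string ids in parent/child never matched anything.
relations['parent'] = relations['parent'].astype(int)
relations['child'] = relations['child'].astype(int)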
