Another "speed up the load" question from a relatively inexperienced Neo4j user

Hello @bill_dickenson :slight_smile:

You can change my code above a bit and it will work with JSON; you just don't need to convert each element of your JSON to a dict in the merge functions, since it's already a list of dicts :slight_smile:

All changed over, and it does look better, but I'm having two odd problems.

Here are the two json files

{
	"0": {
		"EIEO": true,
		"FILECOUNT": 1,
		"KDM": "data:Reads",
		"changed": false,
		"ctx": "1482759651",
		"level": "code",
		"location": [
			22540,
			53,
			22540,
			53
		],
		"node": "54914",
		"quvioDensity": 1.0,
		"quviolations": 2,
		"szAFP": "",
		"szaep": 8,
		"szlocs": 2,
		"text": "UserService",
		"type": "typeName"
	},
	"1": {
		"EIEO": true,
		"FILECOUNT": 1,
		"KDM": "data:Reads",
		"changed": false,
		"ctx": "1482759651",
		"level": "code",
		"location": [
			22540,
			53,
			22540,
			53
		],
		"node": "54914",
		"quvioDensity": 1.0,
		"quviolations": 2,
		"szAFP": "",
		"szaep": 8,
		"szlocs": 2,
		"text": "UserService",
		"type": "typeName"
	},

and the relationships

{
	"0": {
		"compile": "webgoat.combined.source",
		"from": "0",
		"to": "54690"
	},
	"1": {
		"compile": "webgoat.combined.source",
		"from": "1",
		"to": "2"
	},
	"100": {
		"compile": "webgoat.combined.source",
		"from": "100",
		"to": "101"
	},

Code - I dropped some housekeeping

def merge_relation(args):
    """
    Function to create relations from a batch.
    """
    if len(BATCH['batch']) > 1000:
        with graphDB_Driver.session() as ses:
            ses.run("UNWIND $batch AS row MATCH (a:ProgNode{inode:row.a}) MATCH (b:ProgNode{inode:row.b}) CALL apoc.merge.relationship(a, 'PROGRAM', {}, apoc.map.removeKeys(properties(row), ['a', 'b']), b) YIELD rel RETURN 1", batch=BATCH["batch"])
        reset_batch()
    BATCH['batch'].append(args.to_dict())


def merge_node(args):
    """
    Function to create nodes from a batch.
    """
    if len(BATCH['batch']) > 1000:
        with graphDB_Driver.session() as ses:
            ses.run("UNWIND $batch AS row CALL apoc.merge.node(['ProgNode', row.nodetype], {inode:row.inode}, apoc.map.removeKeys(properties(row), ['nodetype', 'inode'])) YIELD node RETURN 1", batch=BATCH["batch"])
        reset_batch()
    BATCH['batch'].append(args.to_dict())



def main(fname):
    print("Starting load of %s - nodes \n" % filenode)
    nodes = pd.read_json(filenode, encoding='utf-8')
    print("Starting load of %s - connections \n" % filematch)
    relations = pd.read_json(filematch, encoding='utf-8')
    print("Files loaded %s - connections \n" % filematch)
    nodes.apply(lambda h: merge_node(h), axis=1)
    reset_batch()
    relations.apply(lambda h: merge_relation(h), axis=1)    
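
The housekeeping I dropped includes the batching helpers; they are roughly this (a sketch, not the exact code I omitted):

BATCH = {'batch': []}

def reset_batch():
    """Clear the accumulated batch after it has been sent to Neo4j."""
    BATCH['batch'] = []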

Two issues.

Nodes load correctly; relationships do not, and the error is a bit obscure.

 File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\json\_json.py", line 1089, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None
ValueError: Expected object or value

The call is the same as before.

And I assume the names have to change but I hesitate.

Did you check in Neo4j browser if the nodes were loaded correctly?

Can you print the content of the batch of relations to check what is in it?

Nothing loaded into Neo4j at all. The whole JSON file is about 54K nodes in this example; I confirmed that in a few places. Now when it loads into pandas, it does look like the whole file loaded, but with the fields pivoted first. (APL strikes back - lol)

    nodes = pd.read_json(filenode, encoding='utf-8')
    print(nodes)

So this is working....

Now we run the APOC call after we pivot the nodes file with nodes.apply using the lambda. I added the rest of the section.

    start_time = time.time()    
    print("Starting load of %s - nodes \n" % filenode)
    nodes = pd.read_json(filenode, encoding='utf-8')
    print(nodes)
    print("Starting load of %s - connections \n" % filematch)
#    relations = pd.read_json(filematch, encoding='utf-8')
    print("Files loaded %s - connections \n" % filematch)
    nodes.apply(lambda h: merge_node(h), axis=1)
    reset_batch()
#    relations.apply(lambda h: merge_relation(h), axis=1)    

but it doesn't look like it pivoted. No error, just silence. I think I am actually seeing each field (15 of them), not each row.


I also commented out the relationship load, as that wasn't loading into pandas. That's odd, as it is simple compared to the other file.

I do feel guilty about asking, but if you do have a rate and are up to the consulting (or even codementor) I am willing to pay to solve this.

At any rate, thank you for the help so far.

Hello, my boss will contact you :slight_smile:

Can you try this? :slight_smile:

def merge_node(args):
    """
    Function to create nodes from a batch.
    """
    if len(BATCH['batch']) > 1000:
        with graphDB_Driver.session() as ses:
            ses.run("UNWIND $batch AS row CALL apoc.merge.node(['ProgNode', row.nodetype], {inode:row.inode}, apoc.map.removeKeys(properties(row), ['nodetype', 'inode'])) YIELD node RETURN 1", batch=BATCH["batch"])
        reset_batch()
    BATCH['batch'].append(args.to_dict())
nodes = pd.read_json(filenode, encoding='utf-8')
nodes = nodes.T
nodes['inode'] = nodes.index
nodes.apply(lambda h: merge_node(h), axis=1)
reset_batch()
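
The transpose is the key part: pd.read_json on a top-level object of objects gives one column per record ("0", "1", ...) with the field names as the index, so .T flips it to one row per record. A tiny illustration with made-up data (not your files):

from io import StringIO
import pandas as pd

# Toy example: two records keyed by "0" and "1", three fields each.
raw = '{"0": {"node": "54914", "level": "code", "type": "typeName"}, "1": {"node": "54915", "level": "code", "type": "block"}}'
df = pd.read_json(StringIO(raw))
print(df.shape)         # (3, 2) -> fields as rows, records as columns
df = df.T
df['inode'] = df.index  # the record key becomes the inode column
print(df.shape)         # (2, 4) -> one row per record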

I have some good news and some bad, but we are close.

It's all working as far as the code goes. I can't see what's being sent, but it is running. Minor changes (inode is now node) and I added some last-record logic. But as soon as it ends, it sends back a pair of error messages and nothing shows up in Neo4j. However, I think it's the same issue.

I did make some minor changes to the code:

def merge_node(args):
    global INNODE, NODECOUNT
    """
    Function to create nodes from a batch.
    """
    INNODE += 1
    if (INNODE % 10000) == 0:
        print("...Sent %s of %s for processig" % (INNODE, NODECOUNT))
    if (len(BATCH['batch']) > 1000) or (INNODE == NODECOUNT):
        if INNODE == NODECOUNT:
            print("...Final Record (%s) added and transmitted" % INNODE)
            BATCH['batch'].append(args.to_dict())            
        with graphDB_Driver.session() as ses:
            ses.run("UNWIND $batch AS row CALL apoc.merge.node(['ProgNode', row.nodetype], {node:row.inode}, apoc.map.removeKeys(properties(row), ['nodetype', 'node'])) YIELD node RETURN 1", batch=BATCH["batch"])
        reset_batch()
    BATCH['batch'].append(args.to_dict())

This is the content of the batch:

Load Neo4j file webgoat
Sections : ['Neo4J', 'SourceMachine']
GraphDatabase.driver(bolt://dev.Veriprism.net:7687
webgoat.combined.source
files: webgoat.combined.source.neo-n webgoat.combined.sourceneo-c
Starting load of webgoat.combined.source.neo-n - nodes

{'batch': [{'EIEO': True, 'FILECOUNT': 1, 'KDM': 'data:Reads', 'changed': False, 'ctx': '1033320531', 'level': 'code', 'location': [4835, 30, 4835, 30], 'node': 10001, 'quvioDensity': 0.5, 'quviolations': 1, 'szAFP': '', 'szaep': 17, 'szlocs': 2, 'text': 'user', 'type': 'typeName'},

This does look reasonable. 

Here is the code that made the connection

uri             = configur.get("Neo4J", "host")
userName        = configur.get("Neo4J", "id")
password        = configur.get("Neo4J", "pw")
print("GraphDatabase.driver(" + uri)
graphDB_Driver  = GraphDatabase.driver(uri, auth=(userName, password))
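
In case it matters, configur is just a ConfigParser set up earlier, something like this (the file name here is a placeholder, not the real one):

from configparser import ConfigParser

# Rough sketch of the setup behind 'configur'; the actual file name and
# credentials are not shown here.
configur = ConfigParser()
configur.read('loader.ini')   # has a [Neo4J] section with host, id and pw keys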

Can you try the code on a local database (build one in Neo4j Desktop) and not a remote one?
Which version of Neo4j are you using? (I advise you to use the latest one, 4.1.)

Loaded up a local database, added APOC.

Ran the code. This is what I found

Starting load of webgoat.combined.source.neo-n - nodes 

Traceback (most recent call last):
  File "F:/ClientSide/current/testload1.py", line 125, in <module>
    main(fname)
  File "F:/ClientSide/current/testload1.py", line 98, in main
    nodes.apply(lambda h: merge_node(h), axis=1)
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\frame.py", line 6878, in apply
    return op.get_result()
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\apply.py", line 186, in get_result
    return self.apply_standard()
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\apply.py", line 296, in apply_standard
    values, self.f, axis=self.axis, dummy=dummy, labels=labels
  File "pandas\_libs\reduction.pyx", line 620, in pandas._libs.reduction.compute_reduction
  File "pandas\_libs\reduction.pyx", line 128, in pandas._libs.reduction.Reducer.get_result
  File "F:/ClientSide/current/testload1.py", line 98, in <lambda>
    nodes.apply(lambda h: merge_node(h), axis=1)
  File "F:/ClientSide/current/testload1.py", line 54, in merge_node
    ses.run("UNWIND $batch AS row CALL apoc.merge.node(['ProgNode', row.nodetype], {node:row.inode}, apoc.map.removeKeys(properties(row), ['nodetype', 'node'])) YIELD node RETURN 1", batch=BATCH["batch"])
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\neo4j\__init__.py", line 499, in run
    self._connection.fetch()
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\neobolt\direct.py", line 422, in fetch
    return self._fetch()
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\neobolt\direct.py", line 464, in _fetch
    response.on_failure(summary_metadata or {})
  File "C:\Users\Bill Dickenson\AppData\Local\Programs\Python\Python37\lib\site-packages\neobolt\direct.py", line 759, in on_failure
    raise CypherError.hydrate(**metadata)
neobolt.exceptions.ClientError: Failed to invoke procedure `apoc.merge.node`: Caused by: java.lang.NullPointerException
>>> 

Did you upgrade the Python Neo4j driver too?

pip install --upgrade neo4j

I did. So Neo4j 4.1, new APOC, New driver. Same issue. Rebooted. Restarted - same issue.

thanks

To be honest, I don't know where this error is coming from.

Can you print the content of the batch before the send to the database?

This was a few posts back, but here are the contents of the batch. Only the first line is shown, but the rest have the same format.

Do you have an example in the batch where a record has one or several null values? I think the problem is coming from there :)

When you have your DataFrame, try replacing all NaN and null values with an empty string, for example.
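
Something like this (a sketch, using the nodes DataFrame from earlier):

# Sketch: replace NaN/None values with an empty string so no nulls reach apoc.merge.node
nodes = nodes.fillna('')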

Embarrassed to say, I found it.

My naming conventions were done in a hurry and I had introduced some inconsistencies. Someone noted that two referenced variables were not there, and when that was fixed, it worked fine. So everything is now working mechanically. The relationships are not being created correctly, but since the nodes are, I think I can puzzle it out.

Thank you again, this was way more complicated than it should have been and you solved it.


No problem, I'm happy to hear this :slight_smile:

Regards,
Cobra

I do need just a tad more help.

So the content of batch is:

[
 {"child": "54690", "compile": "webgoat.combined.source", "parent": "0", "tree": "runs", "from": 0},
 {"child": "2", "compile": "webgoat.combined.source", "parent": "1", "tree": "calls", "from": 1},
 {"child": "101", "compile": "webgoat.combined.source", "parent": "100", "tree": "runs", "from": 100},
 {"child": "1001", "compile": "webgoat.combined.source", "parent": "1000", "tree": "runs", "from": 1000},
 {"child": "10001", "compile": "webgoat.combined.source", "parent": "10000", "tree": "runs", "from": 10000},
 {"child": "10004", "compile": "webgoat.combined.source", "parent": "10003", "tree": "runs", "from": 10003},
 {"child": "10009", "compile": "webgoat.combined.source", "parent": "10004", "tree": "runs", "from": 10004},
 {"child": "10007", "compile": "webgoat.combined.source", "parent": "10005", "tree": "runs", "from": 10005},
 {"child": "10008", "compile": "webgoat.combined.source", "parent": "10007", "tree": "runs", "from": 10007},
 {"child": "1005", "compile": "webgoat.combined.source", "parent": "1001", "tree": "runs", "from": 1001},
 {"child": "1003", "compile": "webgoat.combined.source", "parent": "1002", "tree": "runs", "from": 1002}
 ]

and of course the nodes (which are already created) are:

[{"EIEO": false, "FILECOUNT": 1, "KDM": "code:StorableUnit", "changed": false, "ctx": "1793546528", "inode": "5050", "level": "code", "location": [2607, 18, 2607, 18], "quvioDensity": 1.0, "quviolations": 2, "szAFP": "", "szaep": 10, "szlocs": 2, "text": "final", "type": "fieldModifier", "node": 5050},
 {"EIEO": false, "FILECOUNT": 1, "KDM": "Action:Addresses", "changed": false, "ctx": "259837957", "inode": "50500", "level": "code", "location": [20399, 39, 20399, 39], "quvioDensity": 0.0, "quviolations": 0, "szAFP": "", "szaep": 28, "szlocs": 2, "text": "e", "type": "variableDeclaratorId", "node": 50500},
 {"EIEO": true, "FILECOUNT": 1, "KDM": "data:Writes", "changed": false, "ctx": "1571545022", "inode": "50501", "level": "code", "location": [20399, 42, 20401, 8], "quvioDensity": 0.0, "quviolations": 0, "szAFP": "", "szaep": 27, "szlocs": 4, "text": "{log.error(\"Error occurred while writing the logfile\",e);}", "type": "block", "node": 50501}]

and I need to create a relationship between parent and child, but ONLY if they share the same compile unit. It's possible that two different compiles could have a node 0 (in fact, that's a certainty) and I don't want to create relationships across compile units.

Now based on your example, this is my code

ses.run("UNWIND $batch AS row MATCH (a:ProgNode{inode:row.parent}) MATCH (b:ProgNode{inode:row.child}) CALL apoc.merge.relationship(a, row.tree, {compileunit:row.compile}, apoc.map.removeKeys(properties(row), ['parent', 'child']),b) YIELD rel RETURN 1", batch=BATCH["batch"])

I am not getting an error (good) but I am also not getting a relationship (bad)

in Cypher I would have written this as

MATCH (a:ProgNode {inode: parent, compileunit: compile}) WITH a MATCH (b:ProgNode {inode: child, compileunit: compile}) MERGE (a)-[r:tree {compileunit: '%s', source: '%s'}]->(b);

The r:tree adds a wrinkle as well, since the relationship type can't be parameterized in plain Cypher.
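
What I think I'm after is roughly this (just a sketch, assuming the nodes also carry a compileunit property as in the Cypher above; not tested):

# Sketch: match both ends on inode AND compileunit, then let APOC create the
# dynamically-typed relationship from row.tree.
ses.run(
    "UNWIND $batch AS row "
    "MATCH (a:ProgNode {inode: row.parent, compileunit: row.compile}) "
    "MATCH (b:ProgNode {inode: row.child, compileunit: row.compile}) "
    "CALL apoc.merge.relationship(a, row.tree, {compileunit: row.compile}, "
    "apoc.map.removeKeys(properties(row), ['parent', 'child']), b) "
    "YIELD rel RETURN 1",
    batch=BATCH["batch"])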

This is close to the last thing I need to do. Can you help ?

The first batch you gave is the one for the relationships?
Are you sure inode is a string and not an integer?

Duh - thank you - that was it. Made the nodes integers and it worked. Thank you again
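
For anyone else who hits this: the identity property on the node side was an integer while parent/child in the relationship batch were strings, so the MATCHes silently found nothing. The fix was along these lines (column names as in my batch above; not the exact code I ran):

# Align types before sending the relationship batch: the node ids are integers,
# so the string ids in parent/child never matched anything.
relations['parent'] = relations['parent'].astype(int)
relations['child'] = relations['child'].astype(int)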
