Import data to show employee overlap

I am having trouble figuring out how to import this data to get a graph that shows connections between people who have worked together at the different companies. Essentially, this would show how familiar each programmer is with every other programmer in the data. When the END_DATE is 1/1/9999, the person currently works at that company.
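For reference, the CSV has this shape (the rows here are illustrative samples, not real data):

PERSON_ID,COMPANY_ID,COMPANY_NAME,START_DATE,END_DATE
1001,11111,Acme,8/8/2019,1/1/9999
1002,11111,Acme,3/2/2015,6/30/2018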

Does anyone have an idea of how to accomplish this?

To load this data, you should export it from a spreadsheet to CSV and then use the LOAD CSV command:
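Something along these lines should get you started (a sketch -- adjust the file name and property names to match your CSV):

LOAD CSV WITH HEADERS FROM "file:///UF.csv" AS line
MERGE (p:Person {name_id: line.PERSON_ID})
MERGE (c:Company {comp_id: line.COMPANY_ID})
MERGE (p)-[:WORKED_AT {`Start Date`: line.START_DATE, `End Date`: line.END_DATE}]->(c)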

Give this a shot -- and if you have issues, come back and post a new topic -- and make sure to share what Cypher you've written, and what it's doing that doesn't match your expectations.

Thanks, david.allen! That definitely helped.

Below is what I used to upload the data and the query that I used to attempt to get the graph I wanted.

LOAD CSV WITH HEADERS FROM "file:///UF.csv" AS line
WITH line
MERGE (person:Person{name_id:line.PERSON_ID})
MERGE (comp:Company {comp_id:line.COMPANY_ID})
MERGE (person)-[worked_at:WORKED_AT{
`Start Date`:line.START_DATE, 
`End Date`:line.END_DATE
}]->(comp)

-------------------------------------------------------------------------------

MATCH (p1:Person)-[r1:WORKED_AT]->(c:Company {comp_id:'11111'})<-[r2:WORKED_AT]-(p2:Person)
WHERE (date(r1.`Start Date`) > date(r2.`Start Date`)) AND  ((CASE WHEN r1.`End Date` = '9999-1-1' THEN date() ELSE date(r1.`End Date`) END) < (CASE WHEN r2.`End Date` = '9999-1-1' THEN date() ELSE date(r2.`End Date`) END))
RETURN p1,p2,r1,r2,c

The query has been running for two hours and hasn't returned anything yet. Any ideas on how to speed this up?

Thanks!

Can you run an EXPLAIN of this query, expand all elements of the resulting plan (we need the expanded view), and add it here? Also, do you have an index or unique constraint on :Company(comp_id)?
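If you don't have one yet, this creates it (Neo4j 3.x/4.x syntax) and turns that comp_id lookup into an index seek:

CREATE CONSTRAINT ON (c:Company) ASSERT c.comp_id IS UNIQUE;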

  1. Build the graph, then query the graph, in separate commands.
  2. How big is the CSV? Periodic commit.
  3. When in doubt, APOC overnight.

It's probably better not to have the LOAD CSV and the MATCH ... RETURN running as one query. Not a big deal, but separating them is good practice.

Periodic Commit

Neo4j's memory usage is limited by your Java environment config and the Neo4j config. LOAD CSV will pull the whole CSV, and all those MERGEd nodes, into RAM; once RAM is full, it will slow way down or even hang entirely. That's where USING PERIODIC COMMIT comes in handy.

USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///UF.csv" AS line
WITH line
MERGE (person:Person{name_id:line.PERSON_ID})
MERGE (comp:Company {comp_id:line.COMPANY_ID})
MERGE (person)-[worked_at:WORKED_AT{
`Start Date`:line.START_DATE, 
`End Date`:line.END_DATE 
}]->(comp)

(All those MERGEs are still going to slow things down a bit; see the constraint tip just below.)
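One thing that helps a lot here: unique constraints on the merge keys, so each MERGE is an index lookup rather than a label scan. For the Person side (same 3.x/4.x syntax as the Company constraint above):

CREATE CONSTRAINT ON (p:Person) ASSERT p.name_id IS UNIQUE;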

APOC for the win

If you're still having trouble, try APOC for all your data-heavy needs. You will, however, need to build a mapping config for your CSV:

:param apoccsv => {
    header:true,
    nullValues: [''],
    mapping: {
      `PERSON_ID`: {type: 'int', name: 'person'},
      `COMPANY_ID`: {type: 'int', name: 'company'},
      `COMPANY_NAME`: {ignore: true},
      `START_DATE`: {type: 'string', name: 'start'},
      `END_DATE`: {type: 'string', name: 'end'}
    }
};

CALL apoc.periodic.iterate(
"
    CALL apoc.load.csv('file:///UF.csv', $apoccsv)
    YIELD map AS row
    RETURN row
","
    MERGE (person:Person {name_id: row.person})
    MERGE (comp:Company {comp_id: row.company})
    MERGE (person)-[:WORKED_AT {
        `Start Date`: row.start,
        `End Date`: row.end
    }]->(comp)
", {batchSize:500, iterateList:true,
    // parallel:false -- parallel batches MERGEing the same :Company nodes can deadlock on node locks
    parallel:false,
    params: {apoccsv: $apoccsv}}
);

Thanks for the help, andrew.bowman! I created a unique constraint on comp_id, and below is the output I got from EXPLAIN.

Thanks, looks like the right indexes are being used.

I think to speed things up we should expand out to only one :Person node in the pattern, then collect those rows and UNWIND the collection twice to get the desired cross product. Right now we're paying for the cross-product expansion itself, which can be costly: for 100 employees of the company, that's 100 x 100 = 10k expansions to produce 10k rows. We can get the same rows more cheaply by doing just the 100 expansions, then collecting and UNWINDing twice, without paying that quadratic expansion cost.

Give this a try:

MATCH (p:Person)-[r:WORKED_AT]->(c:Company {comp_id:'11111'})
WITH c, collect({p:p, r:r}) as employees
UNWIND employees as emp1
UNWIND employees as emp2
WITH c, emp1, emp2
WHERE emp1.p <> emp2.p  // no comparing employees to themselves
WITH c, emp1.p as p1, emp1.r as r1, emp2.p as p2, emp2.r as r2
WHERE (date(r1.`Start Date`) > date(r2.`Start Date`)) AND  ((CASE WHEN r1.`End Date` = '9999-1-1' THEN date() ELSE date(r1.`End Date`) END) < (CASE WHEN r2.`End Date` = '9999-1-1' THEN date() ELSE date(r2.`End Date`) END))
RETURN p1,p2,r1,r2,c

And if you're looking for any overlap, rather than the specific kind of overlap in your query, you can use the approach in this StackOverflow answer, which logically reduces to:

DateRangesOverlap = max(start1, start2) < min(end1, end2)

That also lets us get rid of mirrored results (rows that compare the exact same pair of persons with the variables switched, so p1 becomes p2 and p2 becomes p1).

MATCH (p:Person)-[r:WORKED_AT]->(c:Company {comp_id:'11111'})
WITH c, collect({p:p, r:r}) as employees
UNWIND employees as emp1
UNWIND employees as emp2
WITH c, emp1, emp2
WHERE id(emp1.p) < id(emp2.p)  // get rid of mirrored results
WITH c, emp1.p as p1, emp1.r as r1, emp2.p as p2, emp2.r as r2
WHERE apoc.coll.max([date(r1.`Start Date`), date(r2.`Start Date`)]) < apoc.coll.min([date(r1.`End Date`), date(r2.`End Date`)])
RETURN p1,p2,r1,r2,c

Thanks! I think I'm close to a solution, but I'm getting this error:

Neo.ClientError.Statement.SyntaxError: Text cannot be parsed to a Date
"8/8/2019"

I cannot figure out how to change the format of all the dates in the database.

Unfortunately this format is not one of the common formats supported for creating temporal Date values. You would either need to split the datestring and create a date() from its components (sketched below), or use APOC's date parsing to turn the string into a value you can use with the temporal functions (in that case you may need to create a dateTime() from the epoch millis the parse returns, then a date() from that).
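A rough sketch of the split approach (this assumes day/month/year ordering, to match the 'dd/MM/yyyy' format string used in the APOC example below; swap parts[0] and parts[1] if your data is month-first):

WITH "8/8/2019" AS raw
WITH split(raw, '/') AS parts
RETURN date({day: toInteger(parts[0]), month: toInteger(parts[1]), year: toInteger(parts[2])}) AS d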

Try this:
RETURN apoc.date.format(apoc.date.parse("8/8/2019", 'ms', 'dd/MM/yyyy'), 'ms', 'yyyy-MM-dd') as frm

Result: "2019-08-08"
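To convert all the stored date strings in one pass, something like this should work (a sketch, assuming every WORKED_AT relationship stores both dates in the same slash-delimited format; run it once):

MATCH ()-[r:WORKED_AT]->()
WHERE r.`Start Date` CONTAINS '/'  // skip anything already converted
SET r.`Start Date` = apoc.date.format(apoc.date.parse(r.`Start Date`, 'ms', 'dd/MM/yyyy'), 'ms', 'yyyy-MM-dd'),
    r.`End Date` = apoc.date.format(apoc.date.parse(r.`End Date`, 'ms', 'dd/MM/yyyy'), 'ms', 'yyyy-MM-dd')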

I am trying to customize the graph visualization in Neo4j Browser, and I've seen some information on *.grass files for doing so. Is it possible to use a grass file to adjust the thickness of the edges based on the relationship duration (e.g., duration.between(Start Date, End Date))? And can it also adjust the color intensity based on the same value?