
Linear regression model in neo4j 4.1

raghad_baiad
Node Clone

Hello all,

I'm trying to compute predictions for certain properties in a group of nodes. It resembles the solution provided by Lauren Shin in https://towardsdatascience.com/graphs-and-ml-multiple-linear-regression-c6920a1f2e70, except that the model is built for a group of nodes. I tried installing that package in Neo4j 4.1, but it gave some Java errors. As an alternative, I first looked into the functions available in the GDS library, but nothing fit, as this is neither a classification problem nor a link prediction problem. I have therefore decided to use the apoc.math.regr() function and fill in the missing pieces of its output in Cypher, so I have written the following query:

MATCH (r:route)
// order the route nodes by date, newest first
WITH r ORDER BY r.Date_ID DESC
// collect the Route_ID values (now ordered by date)
WITH r, COLLECT(r.Route_ID) AS routes
// iterate over the indexes 0..size(routes)-1
UNWIND range(0, size(routes) - 1) AS i
MATCH (r:route) WHERE r.Route_ID = routes[i]
WITH r, i, routes
CALL apoc.math.regr(r.Route_ID, 'State_BandwidthUsed', 'Date_ID') YIELD r2
RETURN r2

The problem with this query is that the regr function expects a label, not a subset of nodes. Can you help me either fix the package to be compatible with Neo4j 4.1 or correct the Cypher query? Or is there a function I'm missing in the GDS library? I'm willing to work on editing the package but would like some directions to start with. I'd appreciate any help, as I have been trying to solve this problem for a couple of days now.

4 REPLIES

johnmattgrogan
Node Clone

Hi @raghad.baiad,

I've not used the apoc.math.regr function before, so I'm going to spend a little time getting familiar with it, and hopefully we can figure something out. In the meantime, if I understand your problem correctly, you are trying to fit 'State_BandwidthUsed' using 'Date_ID' for the different routes in your data? Is your date coded as a continuous numeric variable?

Would creating a temporary label for each route be a viable option?

Thank you,

-Matt

Update: @johnmattgrogan, thanks to your suggestion of labeling the groups, I'm now able to store the apoc.math.regr() outputs as properties of each group, using the following Cypher:

MATCH (r:route)
// order the route nodes by date, newest first
WITH r ORDER BY r.Date_ID DESC
// collect the Route_ID values (now ordered by date)
WITH r, COLLECT(r.Route_ID) AS routes
// iterate over the indexes 0..size(routes)-1
UNWIND range(0, size(routes) - 1) AS i
MATCH (r:route {Route_ID: routes[i]})
// add the node's Route_ID as a temporary label
CALL apoc.create.addLabels(r, [r.Route_ID])
YIELD node
RETURN r.Route_ID, labels(r), count(r)

and then:

CALL apoc.math.regr('OTNB-0001', 'State_BandwidthUsed', 'Date_ID')
YIELD r2, slope
WITH r2, slope
MATCH (r:route:`OTNB-0001`)
SET r.r2 = r2, r.slope = slope

So my question now is how to loop through those specific labels and then delete them.
Also, is there any way to extend the output of this function with more linear-regression-related parameters? I can help develop the original code if that helps, because I really believe this would be extremely valuable for certain use cases.
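
One way the per-label loop could look is sketched below. It assumes the temporary labels already exist and are equal to each node's Route_ID, as in the labeling query above. It also assumes that apoc.math.regr yields avgX and avgY in addition to r2 and slope (worth confirming against the APOC documentation for your version), and the `intercept` property name is only illustrative.

```cypher
// One row per distinct route identifier (which is also the temporary label name)
MATCH (r:route)
WITH DISTINCT r.Route_ID AS routeId
// apoc.math.regr takes the label as a string, so it can be called once per row
CALL apoc.math.regr(routeId, 'State_BandwidthUsed', 'Date_ID')
YIELD r2, slope, avgX, avgY
// Match the group by property, which avoids needing a dynamic label in MATCH
MATCH (n:route {Route_ID: routeId})
SET n.r2 = r2,
    n.slope = slope,
    // assumption: the fitted line passes through (avgX, avgY)
    n.intercept = avgY - slope * avgX
RETURN routeId, r2, slope, count(n) AS nodesUpdated
```

Routes with fewer than two points, or with a constant Date_ID, may produce errors or NaN values, so they may need to be filtered out first.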

raghad_baiad
Node Clone

Hello @johnmattgrogan ,
Thanks a lot for your reply. Yes, I'm trying to fit 'State_BandwidthUsed' (dependent variable) using 'Date_ID' (independent variable) for different routes, so creating a temporary label for each group of routes (where we want to fit) is valid. The date is coded as a continuous numeric variable. My progress so far is that I gave up on using apoc.math.regr and I'm writing the equivalent Cypher to compute the output. It looks tedious, but this is what I'm working on. Below is my progress on the query and sample output:

MATCH (r:route)
// order the route nodes by date, newest first
WITH r ORDER BY r.Date_ID DESC
// collect the Route_ID values (now ordered by date)
WITH r, COLLECT(r.Route_ID) AS routes
// iterate over the indexes 0..size(routes)-1
UNWIND range(0, size(routes) - 1) AS i
MATCH (r:route) WHERE r.Route_ID = routes[i]
// per-route means of the dependent (Y) and independent (X) variables
WITH r.Route_ID AS Route_ID, avg(r.State_BandwidthUsed) AS avgY, avg(r.Date_ID) AS avgX, count(r) AS c
WITH Route_ID AS r, c, avgX, avgY
RETURN c, avgX, avgY, r, count(r)

As you can see, I'm able to get avgX and avgY and am working on the rest of the least-squares estimators, but I'm worried about the complexity of the Cypher. Many thanks in advance.
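
For reference, the remaining least-squares estimators can probably be computed in plain Cypher without temporary labels. The following is only a sketch under the assumptions in this thread (numeric Date_ID and State_BandwidthUsed properties on :route nodes, grouped by Route_ID); the aliases `pts`, `sxy`, `sxx`, and `syy` are illustrative names, not part of any library.

```cypher
MATCH (r:route)
WHERE r.Date_ID IS NOT NULL AND r.State_BandwidthUsed IS NOT NULL
// Route_ID is the grouping key, so the aggregates are computed per route
WITH r.Route_ID AS routeId,
     count(r) AS n,
     avg(r.Date_ID) AS avgX,
     avg(r.State_BandwidthUsed) AS avgY,
     collect({x: r.Date_ID, y: r.State_BandwidthUsed}) AS pts
// Sums of squares and cross-products around the means
WITH routeId, n, avgX, avgY,
     reduce(s = 0.0, p IN pts | s + (p.x - avgX) * (p.y - avgY)) AS sxy,
     reduce(s = 0.0, p IN pts | s + (p.x - avgX) * (p.x - avgX)) AS sxx,
     reduce(s = 0.0, p IN pts | s + (p.y - avgY) * (p.y - avgY)) AS syy
// Skip routes where x or y is constant, since the fit or R-squared is undefined there
WHERE sxx > 0 AND syy > 0
WITH routeId, n, avgX, avgY,
     sxy / sxx AS slope,
     (sxy * sxy) / (sxx * syy) AS r2
RETURN routeId, n, slope, avgY - slope * avgX AS intercept, r2
ORDER BY routeId
```

Here the slope is sxy / sxx, the intercept follows from the line passing through (avgX, avgY), and for a single predictor R-squared equals the squared correlation, sxy² / (sxx · syy).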

johnmattgrogan
Node Clone

Hi @raghad.baiad,

Do your route names have a specific pattern? If so, using a predicate you can filter for only those labels that match a regex pattern. Something like:

MATCH (n:route)
WHERE ANY(l IN labels(n) WHERE l =~ 'OTNB.*')
// reset the labels on each matching node to just 'route'
CALL apoc.create.setLabels(n, ['route']) YIELD node
RETURN node

From the apoc documentation:

apoc.create.setLabels( [node,id,ids,nodes], ['Label',…​]) - sets the given labels, non matching labels are removed on the node or nodes

This seems like it will answer the direct question, though it feels a bit hacky. I will leave it to others to say whether there is a more Neo4j / Cypher friendly way of doing things.
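
One possible alternative, sketched here on the assumption that each temporary label is exactly the node's Route_ID value, is apoc.create.removeLabels. It removes only the listed labels and leaves any other labels on the node in place.

```cypher
MATCH (r:route)
// Only touch nodes that actually carry their Route_ID as a label
WHERE r.Route_ID IN labels(r)
CALL apoc.create.removeLabels(r, [r.Route_ID]) YIELD node
RETURN count(node) AS nodesCleaned
```

Unlike setLabels, this does not depend on the route names following a pattern such as 'OTNB.*'.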

As for the other parameters of the regression, I'm really not sure. Again, hopefully someone here has better insight than I do.

Thanks,

-Matt
