I am working with network flow logs. The data has the following attributes:
src-ipaddr
dst-ipaddr
src-port
dst-port
protocol
bytes
packets
starttime
endtime
The flows are directed: src->dst.
The problem is as follows:
Suppose I have
50 flows from node X to node A,
50 flows from node Y to node A,
70 flows from node A to node B.
I want to predict whether the 70 flows from A to B truly originated at A, or whether they originated at X or Y and used A as an intermediate hop on the way to B, their final destination. Further, if A is not the true source, I want to predict how many of the 70 flows originated at each true source, i.e., X or Y.
Tasks:
- Determine whether the specific flows from Node A to Node B truly originated from Node A or if Node A was used as an intermediate node by other sources (Node X, Node Y).
- If Node A is not the true source for some or all of the specific flows, estimate the number of flows that originated from each of the possible true source nodes (Node X, Node Y).
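To make the two tasks concrete, here is a minimal baseline sketch in Python. It is only one possible heuristic and not an established method: it assumes a relayed flow leaves A shortly after the matching inbound flow starts, and greedily pairs each A-to-B flow with the most recent unused inbound flow inside a time window. The function name `attribute_flows`, the `max_gap` threshold, and the sample timestamps are all hypothetical. Temporal proximity alone does not prove relaying, so the output is at best a rough estimate to compare a graph ML model against.

```python
from collections import Counter


def attribute_flows(inbound, outbound, relay="A", max_gap=1.0):
    """Guess the origin of each outbound flow from the relay node.

    inbound:  list of (src, starttime) for flows arriving at the relay.
    outbound: list of starttime values for flows leaving the relay.
    max_gap:  assumed maximum delay (seconds) between an inbound flow
              and the outbound flow it triggers (hypothetical threshold).
    """
    unused = sorted(inbound, key=lambda rec: rec[1])
    counts = Counter()
    for out_start in sorted(outbound):
        # Inbound flows that started at most max_gap seconds earlier.
        candidates = [rec for rec in unused
                      if 0 <= out_start - rec[1] <= max_gap]
        if candidates:
            best = candidates[-1]      # most recent inbound flow
            unused.remove(best)        # each inbound flow is used once
            counts[best[0]] += 1
        else:
            counts[relay] += 1         # no match: treat relay as the source
    return counts


# Made-up timestamps, for illustration only.
inbound = [("X", 0.0), ("Y", 0.5), ("X", 5.0)]
outbound = [0.6, 0.9, 3.0, 5.2]
result = attribute_flows(inbound, outbound)
print(result)
```

On real data, bytes, packets, protocol, and ports could be added as further matching criteria, and the window would need tuning against flows whose true origin is known.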
In networking, traceroute is a time-consuming process that inspects packets containing the bytes of data to determine the hops of the flows in the network. I want to assess the viability of predicting the flow links using Neo4j and graph ML, and to measure its success rate without packet inspection.
Any direction on developing the data model and choosing algorithms is much appreciated.
My initial thoughts on graph data model:
Node (ipaddr) = IP address (src or dst)
Edge (flow_to) = directed relationship from a source IP node to a destination IP node.
Properties of this relationship are **bytes, packets, protocol, start, and end**.
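As a rough illustration of this data model outside Neo4j, the sketch below holds it in plain Python: each IP address is a node, and every flow becomes one directed `flow_to` edge carrying the listed properties. The `Flow` class, the `build_graph` helper, the field names `n_bytes` and `n_packets`, the epoch-second timestamps, and the sample records are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    """One flow record; fields follow the attribute list in the question."""
    src_ipaddr: str
    dst_ipaddr: str
    src_port: int
    dst_port: int
    protocol: str
    n_bytes: int
    n_packets: int
    starttime: float  # assumed to be epoch seconds
    endtime: float


def build_graph(flows):
    """Map each directed (src, dst) pair to its list of flow_to edges."""
    edges = defaultdict(list)
    for f in flows:
        edges[(f.src_ipaddr, f.dst_ipaddr)].append({
            "bytes": f.n_bytes,
            "packets": f.n_packets,
            "protocol": f.protocol,
            "start": f.starttime,
            "end": f.endtime,
        })
    return edges


# Made-up records; node names stand in for IP addresses.
flows = [
    Flow("X", "A", 40000, 443, "tcp", 1200, 10, 0.0, 1.0),
    Flow("Y", "A", 40001, 443, "tcp", 800, 6, 0.5, 1.5),
    Flow("A", "B", 40002, 443, "tcp", 1200, 10, 1.1, 2.0),
]
graph = build_graph(flows)
print(len(graph[("A", "B")]))
```

Keeping one edge per flow, rather than aggregating per node pair, preserves the start and end times that any temporal matching would rely on.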