Typo or correct in 04_Predictions.ipynb of Data Science with Neo4j 3.5

In the middle of the Jupyter notebook, there are four lines of code like this:
training_df = apply_graphy_features(training_df, "CO_AUTHOR_EARLY")
training_df.head()
test_df = apply_graphy_features(test_df, "CO_AUTHOR")
test_df.head()

Is "CO_AUTHOR" a typo for "CO_AUTHOR_LATE", or is it correct?

Hi Dongho
Please try the snippet below and see if it works for you.

training_df = apply_graphy_features(training_df, "CO_AUTHOR_EARLY")
training_df.head()
test_df = apply_graphy_features(test_df, "CO_AUTHOR_EARLY")
test_df.head()
...
Thanks
-Sameer

No, I don't think so. This training splits a citation graph into two parts at the year 2006. Everything before 2006 is the CO_AUTHOR_EARLY graph, and everything from 2006 onward is the CO_AUTHOR_LATE graph.

But I found the same issue in several later parts of this hands-on Jupyter notebook:
they all use CO_AUTHOR_EARLY and CO_AUTHOR (instead of CO_AUTHOR_LATE).

Could anyone help verify whether this is a typo or whether there is a reason for it?

You should look at the documentation for Jupyter notebooks in the Neo4j Admin Guide.

What do you mean? Neo4j Admin Guide?

Hey,

No, it isn't a typo.

So we do the splitting into EARLY (train) and LATE (test) graphs to help pick pairs of positive and negative examples to go into the feature matrices.

And then when we're computing the scores for the train matrix we need to make sure that we don't look at any data that's in the test graph, hence using CO_AUTHOR_EARLY for all our computations there.

But when we compute the scores for the test matrix we don't need to worry about that, and it wouldn't actually make sense if we only computed the scores based on the LATE graph, as we'd be missing all of the collaborations that have already happened.
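
As a rough illustration of that point, here is a toy sketch in plain Python. The author names, years, and helper functions are made up for the example and are not taken from the notebook; it also assumes that CO_AUTHOR covers both the EARLY and LATE relationships. It counts common neighbours for one pair on each of the three edge sets:

```python
# Toy co-authorship edges as (author_a, author_b, year); the data is made up.
EDGES = [
    ("alice", "carol", 2003),
    ("bob", "carol", 2004),
    ("alice", "dave", 2007),
    ("bob", "dave", 2008),
    ("alice", "erin", 2009),
]
SPLIT_YEAR = 2006  # EARLY is before 2006, LATE is 2006 onwards


def neighbors(edges):
    """Build an undirected adjacency map from (a, b, year) triples."""
    adjacency = {}
    for a, b, _year in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def common_neighbors(edges, p1, p2):
    """Count the co-authors shared by p1 and p2 in the given edge set."""
    adjacency = neighbors(edges)
    return len(adjacency.get(p1, set()) & adjacency.get(p2, set()))


early = [e for e in EDGES if e[2] < SPLIT_YEAR]   # stands in for CO_AUTHOR_EARLY
late = [e for e in EDGES if e[2] >= SPLIT_YEAR]   # stands in for CO_AUTHOR_LATE
full = EDGES                                      # stands in for CO_AUTHOR (assumed to be everything)

cn_early = common_neighbors(early, "alice", "bob")
cn_late = common_neighbors(late, "alice", "bob")
cn_full = common_neighbors(full, "alice", "bob")
print(cn_early, cn_late, cn_full)
```

In this toy data the training score may only use the early edges, so nothing from the test period leaks in. For the test score, the late edges alone miss the shared co-author from before 2006, while the full edge set sees both.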

Hope that makes sense.

Cheers, Mark


Thank you for taking your precious time to help me, Mark.
Now I understand the reason. God bless you.

I now realize I don't need to worry about overlap, even though apply_graphy_features() uses CO_AUTHOR relationships for the test data. The function takes test_df as its data argument, and its last step returns rows only for the pairs in test_df:
return pd.merge(data, features, on = ["node1", "node2"])

def apply_graphy_features(data, rel_type):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           algo.linkprediction.commonNeighbors(
               p1, p2, {relationshipQuery: $relType}) AS cn,
           algo.linkprediction.preferentialAttachment(
               p1, p2, {relationshipQuery: $relType}) AS pa,
           algo.linkprediction.totalNeighbors(
               p1, p2, {relationshipQuery: $relType}) AS tn
    """
    pairs = [{"node1": node1, "node2": node2} for node1, node2 in data[["node1", "node2"]].values.tolist()]
    features = graph.run(query, {"pairs": pairs, "relType": rel_type}).to_data_frame()
    return pd.merge(data, features, on = ["node1", "node2"])
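
To see what that last merge step does on its own, here is a minimal pandas sketch. The two small frames are hypothetical stand-ins for test_df and for the query result; the node ids and scores are invented and no database is involved:

```python
import pandas as pd

# Hypothetical labelled pairs (stands in for test_df).
data = pd.DataFrame({
    "node1": [1, 2],
    "node2": [3, 4],
    "label": [1, 0],
})

# Hypothetical scores, one row per pair that was sent to the query.
features = pd.DataFrame({
    "node1": [1, 2],
    "node2": [3, 4],
    "cn": [2.0, 0.0],
    "pa": [6.0, 1.0],
    "tn": [3.0, 2.0],
})

# Same call as the last line of apply_graphy_features: an inner join on the pair.
merged = pd.merge(data, features, on=["node1", "node2"])
print(merged)
```

The merge only lines the scores up with the rows of the frame that was passed in. Which relationships feed the scores is still decided by the rel_type argument, not by the merge.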

Hi Mark,
Thanks for this reply. I checked it by changing CO_AUTHOR to CO_AUTHOR_LATE, and the results are the same ;-)