Typo or correct in 04_Predictions.ipynb of Data Science with Neo4j 3.5

Dongho · November 29, 2020, 7:05am

In the middle of the jupyter notebook, there are 4 lines of code like this:
training_df = apply_graphy_features(training_df, "CO_AUTHOR_EARLY")
training_df.head()
test_df = apply_graphy_features(test_df, "CO_AUTHOR")
test_df.head()

Is "CO_AUTHOR" a typo for "CO_AUTHOR_LATE", or is it correct?

sam_gijare · November 30, 2020, 8:48am

Hi Dongho,
Please try the snippet below and see if it works for you:

training_df = apply_graphy_features(training_df, "CO_AUTHOR_EARLY")
training_df.head()
test_df = apply_graphy_features(test_df, "CO_AUTHOR_EARLY")
test_df.head()
...
Thanks
-Sameer

Dongho · November 30, 2020, 1:23pm

No, I don't think so. This training splits a citation graph into two parts based on the year 2006. Everything before 2006 is the CO_AUTHOR_EARLY graph, and everything from 2006 onward is the CO_AUTHOR_LATE graph.

But I found the same issue in many places later in this hands-on Jupyter notebook:
they all use CO_AUTHOR_EARLY and CO_AUTHOR (instead of CO_AUTHOR_LATE).

Could anyone please help verify whether this is a typo or whether there is a reason for it?

sam_gijare · November 30, 2020, 1:48pm

You should look at the documentation for Jupyter notebooks in the Neo4j Admin Guide.

Dongho · November 30, 2020, 1:50pm

What do you mean? Neo4j Admin Guide?

mark.needham · November 30, 2020, 1:54pm

Hey,

No, it isn't a typo.

So we do the splitting into EARLY (train) and LATE (test) graphs to help pick pairs of positive and negative examples to go into the feature matrices.

And then when we're computing the scores for the train matrix we need to make sure that we don't look at any data that's in the test graph, hence using CO_AUTHOR_EARLY for all our computations there.

But when we compute the scores for the test matrix we don't need to worry about that, and it wouldn't actually make sense if we only computed the scores based on the LATE graph, as we'd be missing all of the collaborations that have already happened.
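
To make the difference concrete, here is a minimal pure-Python sketch of the idea. It is not the notebook's code: the author names and edge lists are made-up toy data, and it assumes that CO_AUTHOR covers every collaboration (early plus late). It computes a common-neighbours score for the same pair over the early graph, the full graph, and the late graph alone.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Build an undirected adjacency map from (author, author) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def common_neighbors(adj, a, b):
    """Count the co-authors that a and b share in the given graph."""
    return len(adj.get(a, set()) & adj.get(b, set()))

# Hypothetical toy data: collaborations before and from the split year.
early_edges = [("A", "C"), ("B", "C")]   # stands in for CO_AUTHOR_EARLY
late_edges = [("A", "D"), ("B", "D")]    # stands in for CO_AUTHOR_LATE

early_graph = build_adjacency(early_edges)
full_graph = build_adjacency(early_edges + late_edges)  # stands in for CO_AUTHOR
late_graph = build_adjacency(late_edges)

# Train features: early graph only, so nothing from the test period leaks in.
train_cn = common_neighbors(early_graph, "A", "B")
# Test features: full graph, so earlier collaborations still count.
test_cn = common_neighbors(full_graph, "A", "B")
# Late graph alone would ignore the shared co-author C from the early period.
late_only_cn = common_neighbors(late_graph, "A", "B")

print(train_cn, test_cn, late_only_cn)
```

In this toy setup the late-only score misses the shared early co-author, which is the information the test features would lose if they were computed on the LATE graph alone.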

Hope that makes sense.

Cheers, Mark

Dongho · November 30, 2020, 2:33pm

Thank you for taking the time to help me, Mark.
Now I understand the reason. God bless you.

Dongho · December 3, 2020, 1:24am

I have now realized that I don't need to worry about overlap, even though apply_graphy_features() uses CO_AUTHOR relationships for the test data. The function takes test_df as its data argument, and the rows it returns are only the rows of test_df, because of the last step of the function:
"return pd.merge(data, features, on = ["node1", "node2"])"

def apply_graphy_features(data, rel_type):
    query = """
    UNWIND $pairs AS pair
    MATCH (p1) WHERE id(p1) = pair.node1
    MATCH (p2) WHERE id(p2) = pair.node2
    RETURN pair.node1 AS node1,
           pair.node2 AS node2,
           algo.linkprediction.commonNeighbors(
               p1, p2, {relationshipQuery: $relType}) AS cn,
           algo.linkprediction.preferentialAttachment(
               p1, p2, {relationshipQuery: $relType}) AS pa,
           algo.linkprediction.totalNeighbors(
               p1, p2, {relationshipQuery: $relType}) AS tn
    """
    pairs = [{"node1": node1, "node2": node2} for node1, node2 in data[["node1", "node2"]].values.tolist()]
    features = graph.run(query, {"pairs": pairs, "relType": rel_type}).to_data_frame()
    return pd.merge(data, features, on=["node1", "node2"])
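
As a small check of that last step, here is a sketch of what the merge does, using made-up node ids and feature values in place of the real query result (the `features` frame below is hypothetical and only mimics the columns the query returns). With the default inner join, the output contains exactly the pairs that appear in both frames, so the row set is driven by the input data rather than by the relationship type.

```python
import pandas as pd

# Hypothetical test pairs with their labels (ids and labels are invented).
test_df = pd.DataFrame({
    "node1": [1, 2],
    "node2": [3, 4],
    "label": [1, 0],
})

# Hypothetical stand-in for the query result: one feature row per pair.
features = pd.DataFrame({
    "node1": [1, 2],
    "node2": [3, 4],
    "cn": [2.0, 0.0],
    "pa": [6.0, 1.0],
    "tn": [3.0, 2.0],
})

# Same call as the last line of apply_graphy_features (inner join by default).
merged = pd.merge(test_df, features, on=["node1", "node2"])
print(merged)
```

One caveat worth keeping in mind: if the input frame contained the same pair more than once, the merge would multiply those rows, so this reasoning assumes each pair appears once.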

andreas_kuczera · February 1, 2021, 9:26am

Hi Mark,
Thanks for this reply. I checked by changing CO_AUTHOR to CO_AUTHOR_LATE, and the results are the same ;-)
