Questions about Graph-Modeling

Hi,

I am creating a graph of research papers and their authors. The papers belong to specific topics (marked by tags).

Assumptions:

  • If I am likely to search for something, it’s faster and easier to search and update if I create a relationship and a node for it, rather than hiding it away inside a property of a node
  • A uuid property is only needed if I need to identify a unique node

→ are these assumptions correct?

I have some conceptual questions. I’m unsure how to:

  • create properties automatically: if a person has another email, another email property should automatically be created
  • in contrast, I am likely to search for “which of these authors speaks Chinese”, so speaks_language is a relationship. How do I automatically create a node when a language is detected that does not have a node yet?
  • model dates so that they make sense. Is it a relationship to a “date” node with subsequent relationships to a year and its months? How do I create them automatically?
  • model the author relationship “co_authored_paper” ... at which time (Month. Year)...at which age? …at which institution?

node author
property uuid
property first_name
property middle_name
property last_name
property email
property emailXY (additional)
relationship speaks_language
relationship gender
??relationship employed_from_date_at_institution
??relationship employed_until_date_at_institution
??relationship employed_currently_at_institution
??relationship co_authored_paper ... at which time (Month. Year)...at which age?

node speaks_language_english
node speaks_language_chinese
node speaks_language_german
node speaks_language_japanese
node speaks_language_french
node speaks_language_indian
node speaks_language_XY (additional)

node gender male
node gender female
node gender unknown

node institution_type_university
relationship is_institution_university
node institution_type_company
relationship is_institution_company
node institution_type_ngo
relationship is_institution_ngo
node institution_type_government
relationship is_institution_government
node institution_type_other
relationship is_institution_other

node institution_type_xy (see above)
property institution_name
property address_street
property address_postcode
property address_city
property address_county
property address_country
relationship institution_country
relationship is_institution_of_type

node institution_country_US
node institution_country_UK
node institution_country_China
node institution_country_Germany
node institution_country_France
node institution_country_Italy
node institution_country_Japan
node institution_country_other

node research_paper
property no_of_pages
property file_name
property file_type
property file_size
property updated_yes_or_no
property peer_reviewed_yes_or_no
property download_url
property short_summary
relationship source
relationship paper_is_in_language
relationship tag1
relationship tagXY (additional)
relationship cited_by to author
relationship co_authored_by
relationship paper_publication_date
relationship updated_at_date

node tag1
node tag (additional)

node source_Arxiv
node source_Elsevier
node source_Research_Gate
node source_institution
node source_behind_paywall
node source_XY (additional)

node paper_in_language_english
node paper_in_language_chinese
node paper_in_language_german
node paper_in_language_japanese
node paper_in_language_french
node paper_in_language_indian
node paper_in_language_other

node paper_publication_date …how to model?

Could someone please help me with these questions? What did I mess up or misunderstand completely? Is this graph modeling, or is it creating an ontology? I’ve done most of the GraphAcademy courses but I lack experience. Do you have “tips & tricks” from real-world projects?

Bye

Michael

I can help with some of these…

For alternate properties (like email) it may be best to model this with separate nodes and relationships, similar to what you will have to do with spoken languages.

Let’s look at languages as an example. Let’s say your desired pattern would be something like this:

(:Author)-[:SPEAKS]->(:Language {name})

Assuming you’ve already created the :Author node, here’s how you would create the “author speaks Chinese” part of their graph:

// assume Cypher already has matched to the :Author with variable `a`
...
MERGE (l:Language {name:'Chinese'})
MERGE (a)-[:SPEAKS]->(l)
...

The two separate MERGEs are necessary.

  1. The first MERGE will either match to an existing :Language node, or if it doesn’t already exist, create a new one with that name.
  2. The second MERGE will either MATCH to an existing :SPEAKS relationship from that author for that language, or it will create such a relationship if it doesn’t already exist.
  3. If you had tried to do a single MERGE for the full pattern, it wouldn’t have done what you expected:
    1. MERGE looks for the full pattern, and if it doesn’t exist it creates the full pattern, except for variables in the pattern that were already bound before the MERGE.
    2. So MERGE (a)-[:SPEAKS]->(l:Language {name:'Chinese'}), if that pattern didn’t exist for this author, would create the entire pattern, including a new :Language node for Chinese (since l wasn’t already bound before the MERGE). If a Chinese language node already existed, you would get a duplicate Chinese language node for each such author instead of every author connecting to a single common node.
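
To make the difference concrete, here is a minimal sketch you could run against an empty scratch database. The name property on :Author and the value 'Example Author' are placeholders made up for the demo; they aren’t part of your model.

```cypher
// Setup (hypothetical data): one author, plus a Chinese language node
// that already exists but is not yet connected to this author.
CREATE (:Author {name: 'Example Author'});
CREATE (:Language {name: 'Chinese'});

// Pitfall: a single MERGE on the full pattern. `l` is not bound beforehand,
// and the pattern does not exist for this author, so MERGE creates all of it,
// including a second :Language node.
MATCH (a:Author {name: 'Example Author'})
MERGE (a)-[:SPEAKS]->(l:Language {name: 'Chinese'});

// Counts the Chinese language nodes; after the statements above there are 2.
MATCH (l:Language {name: 'Chinese'})
RETURN count(l) AS chineseNodes;
```

If you swap the single MERGE for the two separate MERGEs shown earlier, the same count stays at 1.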

If you wanted this to be more general, provided that you had a parameter list of strings of languages the author spoke, you could process them like this:

// assume Cypher already has matched to the :Author with variable `a`
...
FOREACH (language IN $languages | 
         MERGE (l:Language {name:language})
         MERGE (a)-[:SPEAKS]->(l))
...

You could do something similar with emails, or other contact info like phone numbers, where there could be more than one. In general this is a better way to store data where there could be an unknown number of possible “properties”, compared with creating a new property on the node for each value, which would require more complicated Cypher for interrogating the data and complicate efforts to index it. For example, if you had multiple email properties, you couldn’t use a single index to look up one value across all of them (well, you can with a fulltext index, but not with any of Neo4j’s native indexes). But by breaking out :Email into its own node and attaching it to :Author nodes, you could create an index on :Email(email), use that for the lookup, and from there traverse to DISTINCT :Author nodes.
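
As a rough sketch of the email version, assuming a relationship type :HAS_EMAIL, an index name author_email, and parameters $uuid, $emails and $email (all of these names are my own picks for illustration, not anything Neo4j prescribes):

```cypher
// One-time setup: index the email property on :Email nodes.
CREATE INDEX author_email IF NOT EXISTS
FOR (e:Email) ON (e.email);

// Attach any number of emails to one author.
// $uuid and $emails (a list of strings) are assumed parameters.
MATCH (a:Author {uuid: $uuid})
FOREACH (address IN $emails |
         MERGE (e:Email {email: address})
         MERGE (a)-[:HAS_EMAIL]->(e));

// Look up by email via the index, then traverse to the author(s).
MATCH (e:Email {email: $email})<-[:HAS_EMAIL]-(a:Author)
RETURN DISTINCT a;
```

The same shape should work for phone numbers or any other multi-valued contact info; only the label, property, and relationship type change.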

Here’s how you would check that an author you matched on spoke at least one of a parameter list of languages:

// assume Cypher already has matched to the :Author with variable `a`
...
WITH a, EXISTS {
      MATCH (a)-[:SPEAKS]->(l:Language)
      WHERE l.name IN $allowedLanguages
      RETURN l
   } as speaksAnAllowedLanguage
...
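
For the simpler question from the original post (“which of these authors speaks Chinese”), a standalone version might look like the sketch below. It assumes the first_name and last_name properties from the model in the question; adjust to whatever you end up storing.

```cypher
// Return every author who has a :SPEAKS relationship to the Chinese node.
MATCH (a:Author)
WHERE EXISTS { (a)-[:SPEAKS]->(:Language {name: 'Chinese'}) }
RETURN a.first_name, a.last_name;
```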

If you had matched to two authors, and wanted to know if they spoke any common languages:

// assume you've already matched to a1 and a2
...
WITH a1, a2, COLLECT {
    MATCH (a1)-[:SPEAKS]->(l:Language)<-[:SPEAKS]-(a2)
    RETURN l.name
  } as commonLanguages
...

________

For papers that are authored, you would put common information on the :Composition node itself (:Composition is just the label I’m using for the paper node in this example). This would typically include info like the date it was written.

As far as metadata about an author’s authorship of a paper, you could either model this as properties on a relationship between the author and the paper they authored, or you could create something like an :Authorship node to capture that snapshot of information (like the age of the :Author at the time of authorship, what university they were at, etc).

So it could look like this:

(:Author)-[:AUTHORED {age, university}]->(:Composition)

or something like this:

(:Author)-[:AUTHORED]->(:Authorship {age, university})-[:AUTHORSHIP_FOR]->(:Composition),
(:Authorship)-[:CREATED_AT]->(:University)

The advantages of breaking out such info into its own node, instead of as just properties on a relationship, include the ability to link that node to other related nodes (since relationships can only connect 2 nodes, but nodes can have relationships to any number of nodes), and in general they are easier to index.
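
Here is a hedged sketch of how the :Authorship variant could be written. The parameters ($authorUuid, $paperUuid, $universityName, $ageAtAuthorship), the uuid property on :Composition, and the name property on :University are assumptions for illustration only.

```cypher
// Find the author and the paper; both are assumed to exist already.
MATCH (a:Author {uuid: $authorUuid})
MATCH (c:Composition {uuid: $paperUuid})

// Reuse the university node if it exists, otherwise create it.
MERGE (u:University {name: $universityName})

// `a` and `c` are bound, so this MERGE only creates the :Authorship node
// and its two relationships when no such path exists between them yet.
MERGE (a)-[:AUTHORED]->(s:Authorship)-[:AUTHORSHIP_FOR]->(c)
  ON CREATE SET s.age = $ageAtAuthorship

// Link the authorship snapshot to the institution.
MERGE (s)-[:CREATED_AT]->(u);
```

Because the snapshot is a node, you can keep hanging more context off it later (a date node, a funding source, and so on) without touching the :Author or :Composition nodes.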


Hi Andrew,

thank you very much! It helped very much.

Later I found out about “Create multiple nodes with a parameter for their properties” and “CREATE using dynamic node labels and relationship types”.

I do have one additional question: What should I use as property of the researcher node?
valid
not valid
→ is there an advantage to having 2 properties, one of which is “not active”, instead of a single property that is present or absent depending on whether it’s valid?
date of snapshot
first name
middle name
birthday
last name
date_of_retrieved_name

The other properties I’ve outsourced as relationships (and subsequent nodes).

And: is it possible to merge two accounts, or is a “set alternate emails” option about to arrive at https://graphacademy.neo4j.com/account/settings/?

Bye

Michael