Loading Nested XML elements with APOC

rlukas · February 14, 2023, 2:28pm

I am certain that this is obvious, just not to me at the moment.

I need to create a graph from an XML file, which contains nested XML elements. XML is a graph represented in a flat text file after all. All the examples I came across are based on a single layer XML format. It seems that the APOC Parse mechanism can understand nested XML structures and returns a map. How can I then use this map to create my graph? Or is there a better way?

Is there a Neo4j mechanism that accepts an XSD file, like JaxB would in the Java world?

There must be a simple solution to this problem.

Can you help me out?

Thanks

glilienfield · February 14, 2023, 9:00pm

I admit that using the library is a bit clunky. There are examples in the documentation. We can try to help with specific XML if you post some and describe what you want to extract. You do end up with very verbose Cypher.

rlukas · February 19, 2023, 3:37pm

Yes, I went through those documentation sections. They do not seem to show how to load nested elements. I can see how the parse works, and it does work, but let's say I have something like this: what would the apoc.load.xml call look like to load it into a graph? Project would have the attributes type, name, and description, and the Project node would point to a Dependencies node. The Dependencies node would point to three Dependency nodes holding those settings. See what I am asking, Captain?

Thank you for this help. I know this is easy, but I am not able to get it working right. Thanks, man.

rlukas · February 19, 2023, 3:48pm

Please forgive me, the interface destroyed the XML file that I attached.

glilienfield · February 19, 2023, 6:10pm

This is a clear example of why I avoid the XML format. The import is very convoluted.

Anyway, I gave it a try. The following parses the xml into variables, so you can see the output.

WITH '<project type="some type"><modelVersion>4.0.0</modelVersion><groupId>xxx</groupId><version>0.0.1-SNAPSHOT</version><name>123</name><description>456</description><dependencies><dependency><groupId>org.neo4j.driver</groupId><artifactId>org.neo4j.driver</artifactId><version>5.5.0</version></dependency><dependency><groupId>org.apache.logging.log4j</groupId><artifactId>log4j-api</artifactId><version>2.14.0</version></dependency><dependency><groupId>org.apache.logging</groupId><artifactId>olog4j-core</artifactId><version>2.14.0</version></dependency></dependencies></project>' AS xmlString
WITH apoc.xml.parse(xmlString) AS value
WITH value.type as project_type,
[i in value._children where i._type = 'modelVersion'|i._text][0] as modelVersion,
[i in value._children where i._type = 'groupId'|i._text][0] as groupId,
[i in value._children where i._type = 'version'|i._text][0] as version,
[i in value._children where i._type = 'name'|i._text][0] as name,
[i in value._children where i._type = 'description'|i._text][0] as description,
[x in [i in value._children where i._type = 'dependencies'|i._children][0]| x._children] as dependencies
with project_type, modelVersion, groupId, version, name, description, [i in dependencies | {
    groupId: [x in i where x._type = 'groupId'|x._text][0],
    artifactId:  [x in i where x._type = 'artifactId'|x._text][0],
    version:  [x in i where x._type = 'version'|x._text][0]
}] as dependencyList
return *
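The query above parses an inline string. For XML stored in a file, a minimal sketch along the same lines could use apoc.load.xml, which yields a map with the same '_type', '_children', and '_text' keys. The file name pom.xml is only a placeholder, and the sketch assumes the file sits in the Neo4j import directory with APOC file access enabled.

```cypher
// Assumption: pom.xml is a hypothetical file in the Neo4j import directory,
// and APOC is allowed to read files (apoc.import.file.enabled=true).
CALL apoc.load.xml('file:///pom.xml') YIELD value
// The yielded map can be navigated exactly like the apoc.xml.parse result.
RETURN value.type AS project_type,
       [i IN value._children WHERE i._type = 'name' | i._text][0] AS name,
       [i IN value._children WHERE i._type = 'description' | i._text][0] AS description
```

The remaining fields and the dependency list can be pulled out with the same list comprehensions as in the inline-string version.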

The following adds entity creation to the inline-string parsing query:

WITH '<project type="some type"><modelVersion>4.0.0</modelVersion><groupId>xxx</groupId><version>0.0.1-SNAPSHOT</version><name>123</name><description>456</description><dependencies><dependency><groupId>org.neo4j.driver</groupId><artifactId>org.neo4j.driver</artifactId><version>5.5.0</version></dependency><dependency><groupId>org.apache.logging.log4j</groupId><artifactId>log4j-api</artifactId><version>2.14.0</version></dependency><dependency><groupId>org.apache.logging</groupId><artifactId>olog4j-core</artifactId><version>2.14.0</version></dependency></dependencies></project>' AS xmlString
WITH apoc.xml.parse(xmlString) AS value
WITH value.type as project_type,
[i in value._children where i._type = 'modelVersion'|i._text][0] as modelVersion,
[i in value._children where i._type = 'groupId'|i._text][0] as groupId,
[i in value._children where i._type = 'version'|i._text][0] as version,
[i in value._children where i._type = 'name'|i._text][0] as name,
[i in value._children where i._type = 'description'|i._text][0] as description,
[x in [i in value._children where i._type = 'dependencies'|i._children][0]| x._children] as dependencies
with project_type, modelVersion, groupId, version, name, description, [i in dependencies | {
    groupId: [x in i where x._type = 'groupId'|x._text][0],
    artifactId:  [x in i where x._type = 'artifactId'|x._text][0],
    version:  [x in i where x._type = 'version'|x._text][0]
}] as dependencyList
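// Create the project node from the scalar values extracted above
create (p:Project {project_type: project_type, description: description, groupId: groupId, modelVersion: modelVersion, name: name, version: version})
// Create one Dependency node per map and link it from the project
foreach (k in dependencyList |
    create (d:Dependency) set d = k
    merge (p)-[:HAS_DEPENDENCY]->(d)
)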
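The query above links the Project straight to each Dependency. If the intermediate Dependencies node from the original question is wanted, one possible variation is sketched below. The literal maps stand in for the values produced by the parsing step, and the HAS_DEPENDENCIES relationship type is a name made up for this illustration.

```cypher
// Hypothetical stand-ins for the values extracted by the parsing query
WITH {project_type: 'some type', name: '123', description: '456'} AS project,
     [{groupId: 'org.neo4j.driver', artifactId: 'org.neo4j.driver', version: '5.5.0'},
      {groupId: 'org.apache.logging.log4j', artifactId: 'log4j-api', version: '2.14.0'}] AS dependencyList
// Project node carrying the scalar attributes
CREATE (p:Project)
SET p = project
// Single container node between the project and its dependencies
// (HAS_DEPENDENCIES is an assumed relationship type, not from the thread)
CREATE (p)-[:HAS_DEPENDENCIES]->(ds:Dependencies)
WITH ds, dependencyList
// One Dependency node per map in the list
UNWIND dependencyList AS dep
CREATE (ds)-[:HAS_DEPENDENCY]->(d:Dependency)
SET d = dep
```

UNWIND is used here instead of FOREACH only so that each dependency is available as an ordinary row; either form should work for this shape of data.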

rlukas · February 20, 2023, 2:29pm

THANKS SIR.. I am digging through this NOW!

rlukas · February 20, 2023, 2:44pm

I guess I am confused by this section. Could you explain why the "|i._children][0]| x._children] as dependencies" part is there?

glilienfield:

[x in [i in value._children where i._type = 'dependencies'|i._children][0]| x._children] as dependencies
with project_type, modelVersion, groupId, version, name, description, [i in dependencies | {

THANKS For Helping Me Out!!

glilienfield · February 20, 2023, 4:14pm

Here is the json 'value' returned from the parser method.

{
  "_children": [
    {
      "_type": "modelVersion",
      "_text": "4.0.0"
    },
    {
      "_type": "groupId",
      "_text": "xxx"
    },
    {
      "_type": "version",
      "_text": "0.0.1-SNAPSHOT"
    },
    {
      "_type": "name",
      "_text": "123"
    },
    {
      "_type": "description",
      "_text": "456"
    },
    {
      "_children": [
        {
          "_children": [
            {
              "_type": "groupId",
              "_text": "org.neo4j.driver"
            },
            {
              "_type": "artifactId",
              "_text": "org.neo4j.driver"
            },
            {
              "_type": "version",
              "_text": "5.5.0"
            }
          ],
          "_type": "dependency"
        },
        {
          "_children": [
            {
              "_type": "groupId",
              "_text": "org.apache.logging.log4j"
            },
            {
              "_type": "artifactId",
              "_text": "log4j-api"
            },
            {
              "_type": "version",
              "_text": "2.14.0"
            }
          ],
          "_type": "dependency"
        },
        {
          "_children": [
            {
              "_type": "groupId",
              "_text": "org.apache.logging"
            },
            {
              "_type": "artifactId",
              "_text": "olog4j-core"
            },
            {
              "_type": "version",
              "_text": "2.14.0"
            }
          ],
          "_type": "dependency"
        }
      ],
      "_type": "dependencies"
    }
  ],
  "_type": "project",
  "type": "some type"
}

The first key from the root is '_children'. It is an array of json objects. The first five elements contain the modelVersion, groupId, version, name, and description values. The sixth element is a json object with key/value pairs '_type:dependencies' and '_children', which is another array of json objects that have the dependency information.

To get the dependencies array, we need the element of the top-level '_children' array that has the key/value pair '_type:dependencies'. This is done with the following expression. It loops through the elements of the value._children array looking for the json object whose '_type' key equals 'dependencies', and returns only the '_children' value of that object. There is only one element with '_type' equal to 'dependencies', so the list comprehension produces a list with one element, and we take the zeroth element.

[i in value._children where i._type = 'dependencies'|i._children][0]

{
  "_children": [
    {
      "_type": "groupId",
      "_text": "org.neo4j.driver"
    },
    {
      "_type": "artifactId",
      "_text": "org.neo4j.driver"
    },
    {
      "_type": "version",
      "_text": "5.5.0"
    }
  ],
  "_type": "dependency"
},
{
  "_children": [
    {
      "_type": "groupId",
      "_text": "org.apache.logging.log4j"
    },
    {
      "_type": "artifactId",
      "_text": "log4j-api"
    },
    {
      "_type": "version",
      "_text": "2.14.0"
    }
  ],
  "_type": "dependency"
},
{
  "_children": [
    {
      "_type": "groupId",
      "_text": "org.apache.logging"
    },
    {
      "_type": "artifactId",
      "_text": "olog4j-core"
    },
    {
      "_type": "version",
      "_text": "2.14.0"
    }
  ],
  "_type": "dependency"
}

The above result is itself a json array of json objects, each with a '_children' key whose value is the array of dependency data and a '_type' key with value 'dependency'. The '_children' arrays hold the dependency data we are seeking. To extract them into a list, the outer list comprehension iterates through the objects shown above and returns just their '_children' values. The final result of the expression is a list of lists, one inner list per dependency:

[[
{
  "_type": "groupId",
  "_text": "org.neo4j.driver"
}
, 
{
  "_type": "artifactId",
  "_text": "org.neo4j.driver"
}
, 
{
  "_type": "version",
  "_text": "5.5.0"
}
], [
{
  "_type": "groupId",
  "_text": "org.apache.logging.log4j"
}
, 
{
  "_type": "artifactId",
  "_text": "log4j-api"
}
, 
{
  "_type": "version",
  "_text": "2.14.0"
}
], [
{
  "_type": "groupId",
  "_text": "org.apache.logging"
}
, 
{
  "_type": "artifactId",
  "_text": "olog4j-core"
}
, 
{
  "_type": "version",
  "_text": "2.14.0"
}
]]
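The two steps described above can be tried in isolation on a cut-down literal map, with no XML involved. The map below is a shortened stand-in for the parser output, and dependencyElements is a variable name introduced only for this sketch.

```cypher
// Cut-down literal standing in for the map returned by apoc.xml.parse
WITH {_type: 'project', _children: [
       {_type: 'name', _text: '123'},
       {_type: 'dependencies', _children: [
         {_type: 'dependency', _children: [
           {_type: 'groupId', _text: 'org.neo4j.driver'},
           {_type: 'version', _text: '5.5.0'}]},
         {_type: 'dependency', _children: [
           {_type: 'groupId', _text: 'org.apache.logging.log4j'},
           {_type: 'version', _text: '2.14.0'}]}
       ]}
     ]} AS value
// Step 1: pick the single 'dependencies' element and take its children
WITH [i IN value._children WHERE i._type = 'dependencies' | i._children][0] AS dependencyElements
// Step 2: keep only the children of each 'dependency' element
RETURN dependencyElements,
       [x IN dependencyElements | x._children] AS dependencies
```

Returning both columns side by side makes it easier to see that the second step only strips the '_type: dependency' wrapper from each element.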

The expression on the next line of the query extracts the dependency data into a list of maps, where 'dependencies' is the variable holding the list of lists above.

[i in dependencies | {
    groupId: [x in i where x._type = 'groupId'|x._text][0],
    artifactId:  [x in i where x._type = 'artifactId'|x._text][0],
    version:  [x in i where x._type = 'version'|x._text][0]
}]
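As a possible shortcut, and assuming the APOC function apoc.map.fromPairs is available in your installation, each inner list can be turned into a map without naming every field. This is a sketch rather than a drop-in replacement; the literal list stands in for the 'dependencies' variable.

```cypher
// Literal stand-in for the 'dependencies' list of lists
WITH [
  [{_type: 'groupId', _text: 'org.neo4j.driver'},
   {_type: 'artifactId', _text: 'org.neo4j.driver'},
   {_type: 'version', _text: '5.5.0'}],
  [{_type: 'groupId', _text: 'org.apache.logging.log4j'},
   {_type: 'artifactId', _text: 'log4j-api'},
   {_type: 'version', _text: '2.14.0'}]
] AS dependencies
// Build [key, value] pairs from each child element, then fold them into a map
RETURN [i IN dependencies |
         apoc.map.fromPairs([x IN i | [x._type, x._text]])] AS dependencyList
```

The trade-off is that every child element becomes a key, so unexpected elements in the XML would also end up as properties, whereas the explicit version keeps only groupId, artifactId, and version.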

rlukas · February 21, 2023, 3:01pm

Thanks Gary.. thanks man.. this is a huge help.. is there anything I can do for you? rlukas@mitre.org

glilienfield · February 21, 2023, 3:37pm

Not use xml...it gives me a headache parsing it with cypher.

No worries, you're welcome.
