Loading Nested XML elements with APOC

I am certain that this is obvious, just not to me at the moment :)

I need to create a graph from an XML file that contains nested XML elements. XML is, after all, a graph represented in a flat text file. All the examples I have come across are based on a single-layer XML format. It seems that the APOC parse mechanism can understand nested XML structures and returns a map. How can I then use this map to create my graph? Or is there a better way?

Is there a Neo4j mechanism that accepts an XSD file, like JAXB would in the Java world?

There must be a simple solution to this problem.

Can you help me out?

Thanks

I admit that using the library is a bit clunky. There are examples in the documentation. We can try to help with specific XML if you want to post some and describe what you want to extract. You do end up with very verbose Cypher.

Yes, I went through those documentation sections. They do not seem to show how to load nested elements. I can see how the parse works, and it does work, but let's say I have something like this. What would the apoc.load.xml call look like to load it into a graph? The Project node would have the attributes type, name, and description, and it would point to a Dependencies node. The Dependencies node would point to three Dependency nodes holding those settings. See what I am asking, Captain?

Thank you for the help. I know this is easy, but I am not able to get it working right. Thanks, man.

Please forgive me, the interface destroyed the XML file that I attached.

This is a clear example of why I avoid the XML format. The import is very convoluted.

Anyway, I gave it a try. The following parses the XML into variables, so you can see the output.

WITH '<project type="some type"><modelVersion>4.0.0</modelVersion><groupId>xxx</groupId><version>0.0.1-SNAPSHOT</version><name>123</name><description>456</description><dependencies><dependency><groupId>org.neo4j.driver</groupId><artifactId>org.neo4j.driver</artifactId><version>5.5.0</version></dependency><dependency><groupId>org.apache.logging.log4j</groupId><artifactId>log4j-api</artifactId><version>2.14.0</version></dependency><dependency><groupId>org.apache.logging</groupId><artifactId>olog4j-core</artifactId><version>2.14.0</version></dependency></dependencies></project>' AS xmlString
WITH apoc.xml.parse(xmlString) AS value
WITH value.type as project_type,
[i in value._children where i._type = 'modelVersion'|i._text][0] as modelVersion,
[i in value._children where i._type = 'groupId'|i._text][0] as groupId,
[i in value._children where i._type = 'version'|i._text][0] as version,
[i in value._children where i._type = 'name'|i._text][0] as name,
[i in value._children where i._type = 'description'|i._text][0] as description,
[x in [i in value._children where i._type = 'dependencies'|i._children][0]| x._children] as dependencies
with project_type, modelVersion, groupId, version, name, description, [i in dependencies | {
    groupId: [x in i where x._type = 'groupId'|x._text][0],
    artifactId:  [x in i where x._type = 'artifactId'|x._text][0],
    version:  [x in i where x._type = 'version'|x._text][0]
}] as dependencyList
return *

Adding entity creation to the above parsing query:

WITH '<project type="some type"><modelVersion>4.0.0</modelVersion><groupId>xxx</groupId><version>0.0.1-SNAPSHOT</version><name>123</name><description>456</description><dependencies><dependency><groupId>org.neo4j.driver</groupId><artifactId>org.neo4j.driver</artifactId><version>5.5.0</version></dependency><dependency><groupId>org.apache.logging.log4j</groupId><artifactId>log4j-api</artifactId><version>2.14.0</version></dependency><dependency><groupId>org.apache.logging</groupId><artifactId>olog4j-core</artifactId><version>2.14.0</version></dependency></dependencies></project>' AS xmlString
WITH apoc.xml.parse(xmlString) AS value
WITH value.type as project_type,
[i in value._children where i._type = 'modelVersion'|i._text][0] as modelVersion,
[i in value._children where i._type = 'groupId'|i._text][0] as groupId,
[i in value._children where i._type = 'version'|i._text][0] as version,
[i in value._children where i._type = 'name'|i._text][0] as name,
[i in value._children where i._type = 'description'|i._text][0] as description,
[x in [i in value._children where i._type = 'dependencies'|i._children][0]| x._children] as dependencies
with project_type, modelVersion, groupId, version, name, description, [i in dependencies | {
    groupId: [x in i where x._type = 'groupId'|x._text][0],
    artifactId:  [x in i where x._type = 'artifactId'|x._text][0],
    version:  [x in i where x._type = 'version'|x._text][0]
}] as dependencyList
create (p:Project {project_type: project_type, description: description, groupId: groupId, modelVersion: modelVersion, name: name, version: version})
foreach (k in dependencyList |
    create (d:Dependency) set d = k
    merge (p)-[:HAS_DEPENDENCY]->(d)
)
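For the layout described in the question (Project pointing to a Dependencies node, which points to each Dependency), a rough sketch using apoc.load.xml on a file could look like the following. The file name pom.xml, the Dependencies label, and the HAS_DEPENDENCIES relationship type are assumptions made for illustration, and loading from a file also assumes apoc.import.file.enabled=true is set in the APOC configuration.

```cypher
// Assumption: pom.xml is a hypothetical file in the Neo4j import directory.
CALL apoc.load.xml('file:///pom.xml') YIELD value
WITH value,
     [c IN value._children WHERE c._type = 'name' | c._text][0] AS name,
     [c IN value._children WHERE c._type = 'description' | c._text][0] AS description,
     // keep the whole <dependencies> element, not just its children
     [c IN value._children WHERE c._type = 'dependencies'][0] AS depsElement
CREATE (p:Project {type: value.type, name: name, description: description})
// Dependencies / HAS_DEPENDENCIES are illustrative names, not from the original answer
CREATE (p)-[:HAS_DEPENDENCIES]->(ds:Dependencies)
WITH ds, depsElement
// one row per <dependency> element
UNWIND depsElement._children AS dep
CREATE (ds)-[:HAS_DEPENDENCY]->(d:Dependency)
SET d.groupId    = [f IN dep._children WHERE f._type = 'groupId' | f._text][0],
    d.artifactId = [f IN dep._children WHERE f._type = 'artifactId' | f._text][0],
    d.version    = [f IN dep._children WHERE f._type = 'version' | f._text][0]
```

The map that apoc.load.xml yields has the same '_type', '_children', and '_text' keys as the one from apoc.xml.parse, so the comprehensions carry over unchanged.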

THANKS SIR.. I am digging through this NOW!

I guess I am confused by one section. Could you maybe give me an explanation of why the "|i._children][0]| x._children] as dependencies" part is needed?

THANKS For Helping Me Out!!

Here is the JSON 'value' returned by the parser function.

{
  "_children": [
    {
      "_type": "modelVersion",
      "_text": "4.0.0"
    },
    {
      "_type": "groupId",
      "_text": "xxx"
    },
    {
      "_type": "version",
      "_text": "0.0.1-SNAPSHOT"
    },
    {
      "_type": "name",
      "_text": "123"
    },
    {
      "_type": "description",
      "_text": "456"
    },
    {
      "_children": [
        {
          "_children": [
            {
              "_type": "groupId",
              "_text": "org.neo4j.driver"
            },
            {
              "_type": "artifactId",
              "_text": "org.neo4j.driver"
            },
            {
              "_type": "version",
              "_text": "5.5.0"
            }
          ],
          "_type": "dependency"
        },
        {
          "_children": [
            {
              "_type": "groupId",
              "_text": "org.apache.logging.log4j"
            },
            {
              "_type": "artifactId",
              "_text": "log4j-api"
            },
            {
              "_type": "version",
              "_text": "2.14.0"
            }
          ],
          "_type": "dependency"
        },
        {
          "_children": [
            {
              "_type": "groupId",
              "_text": "org.apache.logging"
            },
            {
              "_type": "artifactId",
              "_text": "olog4j-core"
            },
            {
              "_type": "version",
              "_text": "2.14.0"
            }
          ],
          "_type": "dependency"
        }
      ],
      "_type": "dependencies"
    }
  ],
  "_type": "project",
  "type": "some type"
}

The first key from the root is '_children'. It is an array of JSON objects. The first five elements hold the modelVersion, groupId, version, name, and description values. The sixth element is a JSON object with the key/value pair '_type: dependencies' and a '_children' key, whose value is another array of JSON objects holding the dependency information.
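If it helps to confirm that shape before writing the longer comprehensions, a small exploratory query along these lines can list just the element names at the top level. The shortened XML string here is made up for the example.

```cypher
// Hypothetical, shortened XML used only to inspect the parsed structure
WITH '<project type="t"><name>123</name><dependencies><dependency><groupId>g</groupId></dependency></dependencies></project>' AS xmlString
WITH apoc.xml.parse(xmlString) AS value
RETURN value._type AS rootType,
       // the '_type' of each direct child of the root element
       [c IN value._children | c._type] AS childTypes
```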

To get the dependencies array, we need the element of the top-level '_children' array that has the key/value pair '_type: dependencies'. The following expression does this. It loops through the elements of the value._children array looking for the JSON object whose '_type' key equals 'dependencies', and returns only the '_children' value of that object. Only one element has '_type' equal to 'dependencies', so the list comprehension produces a list with one element, and [0] takes that element.

[i in value._children where i._type = 'dependencies'|i._children][0]
[
  {
    "_children": [
      {
        "_type": "groupId",
        "_text": "org.neo4j.driver"
      },
      {
        "_type": "artifactId",
        "_text": "org.neo4j.driver"
      },
      {
        "_type": "version",
        "_text": "5.5.0"
      }
    ],
    "_type": "dependency"
  },
  {
    "_children": [
      {
        "_type": "groupId",
        "_text": "org.apache.logging.log4j"
      },
      {
        "_type": "artifactId",
        "_text": "log4j-api"
      },
      {
        "_type": "version",
        "_text": "2.14.0"
      }
    ],
    "_type": "dependency"
  },
  {
    "_children": [
      {
        "_type": "groupId",
        "_text": "org.apache.logging"
      },
      {
        "_type": "artifactId",
        "_text": "olog4j-core"
      },
      {
        "_type": "version",
        "_text": "2.14.0"
      }
    ],
    "_type": "dependency"
  }
]

The above result is itself a JSON array of JSON objects, each with a '_children' key whose value is the array of dependency data and a '_type' key with value 'dependency'. This is the dependency data we are seeking. To extract the '_children' values into a list, the outer comprehension iterates over the data shown above and returns just the '_children' value of each object. The final result of the whole expression is a list of lists, one inner list per dependency:

[[
{
  "_type": "groupId",
  "_text": "org.neo4j.driver"
}
, 
{
  "_type": "artifactId",
  "_text": "org.neo4j.driver"
}
, 
{
  "_type": "version",
  "_text": "5.5.0"
}
], [
{
  "_type": "groupId",
  "_text": "org.apache.logging.log4j"
}
, 
{
  "_type": "artifactId",
  "_text": "log4j-api"
}
, 
{
  "_type": "version",
  "_text": "2.14.0"
}
], [
{
  "_type": "groupId",
  "_text": "org.apache.logging"
}
, 
{
  "_type": "artifactId",
  "_text": "olog4j-core"
}
, 
{
  "_type": "version",
  "_text": "2.14.0"
}
]]
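If the nested comprehension is hard to read, the same two-level walk can be sketched with UNWIND, which gives one row per dependency instead of a list of lists. This is only an alternative phrasing under the same parsed structure; the shortened XML string is made up for the example.

```cypher
// Hypothetical, shortened XML with two dependencies
WITH '<project><dependencies><dependency><groupId>g1</groupId><artifactId>a1</artifactId><version>1.0</version></dependency><dependency><groupId>g2</groupId><artifactId>a2</artifactId><version>2.0</version></dependency></dependencies></project>' AS xmlString
WITH apoc.xml.parse(xmlString) AS value
// level 1: direct children of <project>, keeping only <dependencies>
UNWIND value._children AS child
WITH child WHERE child._type = 'dependencies'
// level 2: each <dependency> element becomes its own row
UNWIND child._children AS dependency
RETURN dependency._children AS fields
```

Each returned row holds the list of '_type'/'_text' maps for one dependency, which is the same data as one inner list of the nested result above.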

The expression on the next line of the query extracts the dependency data into a list of maps, where 'dependencies' is the variable holding the data above.

[i in dependencies | {
    groupId: [x in i where x._type = 'groupId'|x._text][0],
    artifactId:  [x in i where x._type = 'artifactId'|x._text][0],
    version:  [x in i where x._type = 'version'|x._text][0]
}]
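When every child element should simply become a property, one possible shortcut is apoc.map.fromPairs, which builds a map from a list of [key, value] pairs. The sketch below assumes each '_type' appears only once per dependency and uses a hand-written one-dependency list in place of the parsed data.

```cypher
// Hand-written stand-in for the 'dependencies' variable (one dependency only)
WITH [
  [{_type: 'groupId', _text: 'org.neo4j.driver'},
   {_type: 'artifactId', _text: 'org.neo4j.driver'},
   {_type: 'version', _text: '5.5.0'}]
] AS dependencies
RETURN [i IN dependencies |
  // turn each {_type, _text} entry into a [key, value] pair, then into one map
  apoc.map.fromPairs([x IN i | [x._type, x._text]])
] AS dependencyList
```

The trade-off is that the property names are taken from whatever elements are present in the XML, whereas the explicit version above picks out exactly groupId, artifactId, and version.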

Thanks Gary, this is a huge help. Is there anything I can do for you? rlukas@mitre.org

Don't use XML... it gives me a headache parsing it with Cypher.

No worries, you're welcome.