Looking for help with importing conference proceedings in XML

I'm trying to import conference proceedings that are in XML. I've studied the APOC documentation in detail and checked other resources, but I just can't get it to do the right thing. At times it gives results that don't make sense.

It wouldn't be smart to post the data file here; even the stripped-down test data is too large.

At least a couple of things could be going on...

  1. The XML is malformed. One thing to try is to convert it to JSON and see if any errors crop up during the conversion.
  2. The Neo4j XML importer has a bug. Following on from the first step, see if you can do a JSON import, which would sidestep a hypothetical XML importer bug; then again, you could run into a Neo4j JSON importer bug instead.
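Before converting anything, a quick well-formedness check can rule out possibility 1. Here's a minimal sketch using only Python's standard library (the sample strings are made up for illustration):

```python
# Well-formedness check: ElementTree raises ParseError on broken XML.
import xml.etree.ElementTree as ET

def check_xml(text):
    """Return None if the XML parses, else a description of the first error."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as e:
        return str(e)

good = "<proceeding><conference_rec/></proceeding>"
bad = "<proceeding><conference_rec></proceeding>"  # unclosed tag

print(check_xml(good))  # None
print(check_xml(bad))   # a "mismatched tag" message with line/column
```

For a file on disk, `ET.parse(path)` does the same job and its `ParseError` pinpoints the line and column of the first problem.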

Maybe also try it on a very small subset and debug that.

Maybe also try converting the XML to CSV, but use tabs to separate the values. A lot less can go wrong with CSV, and it's usually easier to fix broken CSV than broken XML or JSON.
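The XML-to-tab-separated conversion could be sketched like this in Python, assuming a simple repeating record shape (the `<au>` author records here are invented to match the kind of data being discussed):

```python
# Sketch: flatten repeating <au> records into tab-separated rows.
import csv
import io
import xml.etree.ElementTree as ET

xml_text = """<authors>
  <au><first_name>Mickey</first_name><last_name>Mouse</last_name></au>
  <au><first_name>Donald</first_name><last_name>Duck</last_name></au>
</authors>"""

root = ET.fromstring(xml_text)
out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["first_name", "last_name"])
for au in root.iter("au"):
    # findtext returns the element's text, or the default if it's missing
    writer.writerow([au.findtext("first_name", ""), au.findtext("last_name", "")])

print(out.getvalue())
```

Deeply nested structures won't flatten this cleanly, so this approach fits best once you've isolated one repeating record type.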

It's hard to tell what's wrong without more info. (I haven't tried the XML or JSON importers myself, so I can't vouch for their quality.)

Hello @jeg ,

These are the steps / processes I would recommend -

  1. One of the first steps to take is to understand the XML structure (i.e. root, elements, attributes, and text).
    --> XML -> XML Tree
    If the file is too big, there are several tools that can help. My favorite is XML Notepad from Microsoft (FREE!!!) -> XmlNotepad. (Note - if you know the opening and closing parent tags, you can cut out a small subset; otherwise don't copy and paste random lines.)

  2. Go through this detailed documentation -> Load XML - APOC Documentation

  3. Try to load the root, elements, and attributes one at a time, and slowly build your data model (examples -> Load XML - APOC Documentation).
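If you'd rather not install a GUI tool, the structural overview in step 1 can also be sketched in a few lines of Python (assuming the file fits in memory; the sample document here is a trimmed stand-in):

```python
# Print an indented outline of tags and attributes, a lightweight
# alternative to a GUI XML viewer for getting a feel for the structure.
import xml.etree.ElementTree as ET

def outline(elem, depth=0, max_depth=3):
    attrs = f" {dict(elem.attrib)}" if elem.attrib else ""
    print("  " * depth + elem.tag + attrs)
    if depth < max_depth:
        for child in elem:
            outline(child, depth + 1, max_depth)

root = ET.fromstring(
    '<proceeding ver="6.0"><conference_rec><city>X</city></conference_rec></proceeding>'
)
outline(root)
```

Capping `max_depth` keeps the output readable on documents with several layers of nesting.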

You don't have to convert to JSON/CSV just to check validity; converting would mean mapping everything to key:value pairs in JSON or pivoting to columns in CSV. Instead, open the file in the Visual Studio Code editor (with an XML formatter extension).

Note -> apoc.xml.import doesn't have any bugs, but it is deprecated. Instead use apoc.load.xml -> Load XML - APOC Documentation

We are currently using APOC's XML loading in production, and we haven't seen any bugs.

One of the most popular XML tools used in the enterprise is from Altova. It's a paid product, but trial versions are available. (I wouldn't recommend buying it for a one-off use case. Companies buy it because they build XML schemas.)

An alternative used by data engineers is Python with the xml.etree.ElementTree module or minidom. (Python XML Parser Tutorial | ElementTree and Minidom Parsing | Edureka)
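For completeness, here's a minimal minidom sketch (the second module mentioned above). Unlike ElementTree, minidom exposes CDATA sections as distinct nodes, which matters for proceedings data like this where titles are wrapped in CDATA (the sample string is invented):

```python
# Pull CDATA text out of <title> elements with minidom.
from xml.dom import minidom

doc = minidom.parseString(
    "<content><article_rec><title><![CDATA[Peanuts and other goobers]]></title>"
    "</article_rec></content>"
)
# Each <title>'s first child is a CDATASection node; .data holds the text.
titles = [t.firstChild.data for t in doc.getElementsByTagName("title")]
print(titles)  # ['Peanuts and other goobers']
```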

Instead of posting the test data, here's an overview of the document:

<?xml version="1.0" encoding="ISO-8859-1"?>
<proceeding ver="6.0" ts="04/10/2010">
<X></X>
<Y></Y>
<Z></Z>
</proceeding>

X, Y, and Z are complex, with several layers of nesting.

This statement:

CALL apoc.load.xml(*XML file*)
YIELD value UNWIND value._children AS conference
WITH [item in value._children WHERE item._type="conference_rec"] as conference_rec
return conference_rec;

returns X three times, where X is the "conference_rec" element, the first chunk of data in the file. The exact same thing, repeated. I don't understand that.

I think you really need to provide one example where it's not working.

I presume lots of people have used XML importing for simpler cases without problems, so I'm guessing your simplified example won't demonstrate the problem you are encountering.

(If you want to redact the data, that's OK.)

Yeah, definitely. I never thought it was a problem with APOC itself, just a problem with saying the right thing to make it do what I need. Here is the test file that produces the results in my last post:

<?xml version="1.0" encoding="ISO-8859-1"?>
<proceeding ver="6.0" ts="04/10/2010">
<conference_rec>
	<conference_date>
		<start_date>07/27/1776</start_date>
		<end_date>07/31/1776</end_date>
	</conference_date>
	<conference_loc>
		<city><![CDATA[San Diego]]></city>
		<state>California</state>
	</conference_loc>
</conference_rec>
<proceeding_rec>
	<conference_number>2003</conference_number>
	<pages>142</pages>
</proceeding_rec>
<content>
	<section>
		<section_id>965401</section_id>
		<section_title><![CDATA[some sectiontitle]]></section_title>
		<section_page_from>1</section_page_from>
	<article_rec>
		<title><![CDATA[Peanuts and other goobers]]></title>
		<abstract>
			<par><![CDATA[Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Vel turpis nunc eget lorem dolor sed viverra ipsum.]]></par>
		</abstract>
		<authors>
			<au>
				<first_name><![CDATA[Mickey]]></first_name>
				<last_name><![CDATA[Mouse]]></last_name>
				<affiliation><![CDATA[University of Southern North Dakota at Hoople]]></affiliation>
				<role><![CDATA[Author]]></role>
			</au>
			<au>
				<first_name><![CDATA[Donald]]></first_name>
				<last_name><![CDATA[Duck]]></last_name>
				<role><![CDATA[Author]]></role>
			</au>
		</authors>
	</article_rec>
	<article_rec>
		<title><![CDATA[Efficient extraction of oil from peanuts]]></title>
		<authors>
			<au>
				<first_name><![CDATA[Minnie]]></first_name>
				<last_name><![CDATA[Mouse]]></last_name>
				<affiliation><![CDATA[D Entertainment]]></affiliation>
				<role><![CDATA[Author]]></role>
			</au>
			<au>
				<first_name><![CDATA[Daisy]]></first_name>
				<last_name><![CDATA[Duck]]></last_name>
				<affiliation><![CDATA[Imagineers]]></affiliation>
				<role><![CDATA[Author]]></role>
			</au>
		</authors>
	</article_rec>
	</section>
</content>
</proceeding>

Hello @jeg

In your query, the UNWIND creates N (here, 3) rows,
but then you don't use the values from the UNWIND; instead you apply another filter to the original collection, once per row.

You have to decide which of the two it is:

CALL apoc.load.xml(XML file) YIELD value
WITH [item in value._children WHERE item._type="conference_rec"] as conference_rec
return conference_rec;

or

CALL apoc.load.xml(XML file) YIELD value
UNWIND value._children AS conference
WITH conference WHERE conference._type="conference_rec"
return conference AS conference_rec;
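To see why the original query returned the same answer three times, here is a rough Python analogy (not actual Cypher semantics, and the child records are abbreviated): each row produced by UNWIND still carries the full collection, and the list comprehension re-filters that whole collection on every row.

```python
# Stand-ins for value._children from the test file.
children = [
    {"_type": "conference_rec"},
    {"_type": "proceeding_rec"},
    {"_type": "content"},
]

# UNWIND produces one row per child, but each row also keeps the collection.
rows = [(c, children) for c in children]

# The WITH clause ignored the unwound value and re-filtered the full
# collection on every row, yielding the identical answer three times.
results = [
    [item for item in coll if item["_type"] == "conference_rec"]
    for _, coll in rows
]

print(results)  # three identical one-element lists
```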

Elaine


Thank you, that has gotten me going.