Double splitting strings?

I'm trying an experiment that consists of ingesting a text file that has numbered questions and answers in the following format:

1.) Question : Answer
2.) Question : Answer

I'm able to read the TXT file using the LOAD CSV function just fine. I can separate the answer from the question with the SPLIT function (e.g., split(row, ':') as newrow. The question is how can I do a second split to extract just the question number as well? I tried the following:

unwind(row) as newrow
with split(newrow, ':') as myrow
unwind(myrow) as thisrow
with myrow, split(thisrow, ".)") as q
return q, myrow[0], myrow[1]

But I seem to be looping within myrow constraints as I get q returning with the question number & question, then the answer in the next row, etc.

Any suggestions? I would really like to separate out the actual question number and then the answer to further build the graph. The question text doesn't really matter (consistent number and questions).

Hi @mbandor!

Have you tried without the unwind?

Something like

with split(row, ':') as myrow
with myrow, split(myrow[0], ".)") as q
return q[0], q[1], myrow[1]


Unfortunately split() only works with string values. At this point, row is a list and an error is thrown about the type mismatch.

Hello @mbandor :slight_smile:

Here is a try:

LOAD CSV FROM 'file:///test.txt' AS line
UNWIND line AS item
WITH split(item, ":") AS elements
WITH split(elements[0], ".)") AS questions, elements[1] AS answer
RETURN questions[0] AS question_number, questions[1] AS question, answer



I'm still getting errors. This is the code:

LOAD CSV FROM "file:///abc.txt' as line
UNWIND line as item
WITH split(item, ':') as elements
WITH split(elements[0], ".)") as questions, elements[1] as answer
return questions[0] as q_nbr, questions[1] as q, answer

It doesn't like the second split and throws the following error:

Invalid input ')': expected whitespace or a property key name (line 4, column 27 (offset: 133))
"WITH split(elements[0], ".)") as questions, elements[1] as answer"

My query was working with my text file, can you send me yours?

Unfortunately I can't, however here are the first 3 lines in the text file:

1.) Is this a follow-on report? : Yes
2.) Enter report number of initial report. : TBN-17356-002
3.) Has this information been shared with any other agency? : No

The pattern (question #,), question, : response) are pretty consistent throughout the rest of the file.

Ultimately, I would like to use the to address each question and perform additional processing based on the question number. That is why I'm trying to extract the question number along with the response. I've managed to separate the response but not the question number.

With this data, Cobra's code is working fine. Check line 4 in your .csv file. I used Neo4j 4.3.1 and desktop 1.4.8

Yeah, same, I don't have any errors.

Still no luck. I created a new txt file with only the 3 lines I posted in it. This is the current code I'm trying to execute:

LOAD CSV FROM "file:///mbandor.txt' as line
UNWIND line as item
WITH split(item, ':') as elements
WITH split(elements[0], '.)') as questions, elements[1] as answer
RETURN questions[0] as q_nbr, questions[1] as q, answer;

The error now is a Lexical error at line 5, col 57 (Encountered after : ""

You have a typo first line : LOAD CSV FROM "file:///mbandor.txt' as line instead of LOAD CSV FROM 'file:///mbandor.txt' as line :slight_smile:

Thanks! That is what was causing the error! I appreciate everyone's input. Now to parse the rest of the file and create some nodes and relationships.

1 Like

Let's take this a slightly different direction using the apoc.load.csv call. The reason I'm heading that direction is some of the lines in the text file contain commas (throws off the processing). If I use the apoc.load.csv('mbandor.txt', {sep:':'}) it handles the semi colon without the need for the first UNWIND (from previous example) and gives me the additional benefit of automatically handling lines that have commas in it (either as part of the question and/or part of the answer). This is the prototype code that still needs to separate the question number of the result for further processing:

CALL apoc.load.csv('mbandor.txt',
YIELD list
RETURN list[0] as Question, list[1] as Response;

I also noticed the first line of the file gets skipped. That isn't really an issue as the real file has 2 lines with the report heading before the actual questions, so I can use the SKIP function to address that.

Alternatively I just discovered the FIELDTERMINATOR option for changing the delimiter as an option.

I have a secondary question related to double-splitting the strings. Is there a limit as to how many values can be checked with I have about 75 and I'm getting lexical errors, but lesser amounts seem to work fine. Each line in the file is unique and has to be parsed.

What is the error? Maybe there is a typo?

I tried an experiment and the limit appears to be about 50 conditions. That just means I have to make an additional pass or two to completely parse each file.

@mbandor this is NOT directed at you. It's a comment on APOC and how it's scope has continued to creep over built in cypher capabilities.

As context: my work project isn't allowed to use APOC, so I'm a little sensitive to "just use apoc" :smiley:
This raises one of the serious issues with APOC. There are now functions that make things more complicated than the original cypher method.

FIELDTERMINATOR as you mentioned is the better way to do it, apoc really should be for filling in gaps in cypher's capabilities, not recreating the wheel.

1 Like