Double splitting strings?

mbandor · August 26, 2021, 8:37pm

I'm trying an experiment that consists of ingesting a text file that has numbered questions and answers in the following format:

1.) Question : Answer
2.) Question : Answer
...

I'm able to read the TXT file using the LOAD CSV function just fine. I can separate the answer from the question with the SPLIT function (e.g., split(row, ':') as newrow. The question is how can I do a second split to extract just the question number as well? I tried the following:

unwind(row) as newrow
with split(newrow, ':') as myrow
unwind(myrow) as thisrow
with myrow, split(thisrow, ".)") as q
return q, myrow[0], myrow[1]

But I seem to be looping within myrow constraints as I get q returning with the question number & question, then the answer in the next row, etc.

Any suggestions? I would really like to separate out the actual question number and then the answer to further build the graph. The question text doesn't really matter (consistent number and questions).

Bennu · August 26, 2021, 10:51pm

Hi @mbandor!

Have you tried without the unwind?

Something like

with split(row, ':') as myrow
with myrow, split(myrow[0], ".)") as q
return q[0], q[1], myrow[1]

Bennu

mbandor · September 14, 2021, 1:43pm

Unfortunately split() only works with string values. At this point, row is a list and an error is thrown about the type mismatch.

cobra · September 18, 2021, 7:50am

Hello @mbandor

Here is a try:

LOAD CSV FROM 'file:///test.txt' AS line
UNWIND line AS item
WITH split(item, ":") AS elements
WITH split(elements[0], ".)") AS questions, elements[1] AS answer
RETURN questions[0] AS question_number, questions[1] AS question, answer

Regards,
Cobra

mbandor · September 22, 2021, 3:06pm

I'm still getting errors. This is the code:

LOAD CSV FROM "file:///abc.txt' as line
UNWIND line as item
WITH split(item, ':') as elements
WITH split(elements[0], ".)") as questions, elements[1] as answer
return questions[0] as q_nbr, questions[1] as q, answer

It doesn't like the second split and throws the following error:

Invalid input ')': expected whitespace or a property key name (line 4, column 27 (offset: 133))
"WITH split(elements[0], ".)") as questions, elements[1] as answer"

cobra · September 22, 2021, 3:11pm

My query was working with my text file, can you send me yours?

mbandor · September 22, 2021, 3:17pm

Unfortunately I can't, however here are the first 3 lines in the text file:

1.) Is this a follow-on report? : Yes
2.) Enter report number of initial report. : TBN-17356-002
3.) Has this information been shared with any other agency? : No

The pattern (question #,), question, : response) are pretty consistent throughout the rest of the file.

Ultimately, I would like to use the apoc.do.case to address each question and perform additional processing based on the question number. That is why I'm trying to extract the question number along with the response. I've managed to separate the response but not the question number.

ameyasoft · September 22, 2021, 4:33pm

With this data, Cobra's code is working fine. Check line 4 in your .csv file. I used Neo4j 4.3.1 and desktop 1.4.8

cobra · September 22, 2021, 5:44pm

Yeah, same, I don't have any errors.

mbandor · September 22, 2021, 5:44pm

Still no luck. I created a new txt file with only the 3 lines I posted in it. This is the current code I'm trying to execute:

LOAD CSV FROM "file:///mbandor.txt' as line
UNWIND line as item
WITH split(item, ':') as elements
WITH split(elements[0], '.)') as questions, elements[1] as answer
RETURN questions[0] as q_nbr, questions[1] as q, answer;

The error now is a Lexical error at line 5, col 57 (Encountered after : ""

cobra · September 22, 2021, 5:45pm

You have a typo first line : LOAD CSV FROM "file:///mbandor.txt' as line instead of LOAD CSV FROM 'file:///mbandor.txt' as line

mbandor · September 22, 2021, 5:46pm

Thanks! That is what was causing the error! I appreciate everyone's input. Now to parse the rest of the file and create some nodes and relationships.

mbandor · September 22, 2021, 6:04pm

Let's take this a slightly different direction using the apoc.load.csv call. The reason I'm heading that direction is some of the lines in the text file contain commas (throws off the processing). If I use the apoc.load.csv('mbandor.txt', {sep:':'}) it handles the semi colon without the need for the first UNWIND (from previous example) and gives me the additional benefit of automatically handling lines that have commas in it (either as part of the question and/or part of the answer). This is the prototype code that still needs to separate the question number of the result for further processing:

CALL apoc.load.csv('mbandor.txt',
{sep:':'})
YIELD list
RETURN list[0] as Question, list[1] as Response;

I also noticed the first line of the file gets skipped. That isn't really an issue as the real file has 2 lines with the report heading before the actual questions, so I can use the SKIP function to address that.

Alternatively I just discovered the FIELDTERMINATOR option for changing the delimiter as an option.

mbandor · September 22, 2021, 9:46pm

I have a secondary question related to double-splitting the strings. Is there a limit as to how many values can be checked with apoc.case()? I have about 75 and I'm getting lexical errors, but lesser amounts seem to work fine. Each line in the file is unique and has to be parsed.

cobra · September 23, 2021, 6:02am

What is the error? Maybe there is a typo?

mbandor · September 23, 2021, 1:09pm

I tried an experiment and the limit appears to be about 50 conditions. That just means I have to make an additional pass or two to completely parse each file.

mdfrenchman · September 23, 2021, 1:20pm

@mbandor this is NOT directed at you. It's a comment on APOC and how it's scope has continued to creep over built in cypher capabilities.

As context: my work project isn't allowed to use APOC, so I'm a little sensitive to "just use apoc"
Comment:
This raises one of the serious issues with APOC. There are now functions that make things more complicated than the original cypher method.

FIELDTERMINATOR as you mentioned is the better way to do it, apoc really should be for filling in gaps in cypher's capabilities, not recreating the wheel.

Topic		Replies	Views
String Parsing in Cypher Cypher	5	535	September 21, 2020
Load CSV split Cypher	3	966	June 21, 2019
Trouble with SPLIT and UNWIND syntax Cypher	5	1162	October 29, 2020
How to Split a Row While importing CSV with Cypher Neo4j Graph Platform migrated	27	268	August 10, 2022
Split on multiple levels Cypher cypher	2	148	February 23, 2024

Double splitting strings?

Related topics