I have seen some responses to questions about using NLP to extract information out of large text files from research papers. I'm trying to create a pipeline that feeds research papers from an API (for scholarly articles) into my graph, uses NLTK for NLP extraction, performs GDS algorithms, and then writes back to my in-memory graph.
I already have the NLTK and NLP parts written, but I'm unsure how to feed large texts into my graph. Does anyone have any suggestions?
Feeding large texts into a graph works like any other data load operation. You have several options depending on what's most convenient: loading with the Python client (I'm assuming Python here just because it's popular with folks using Neo4j for NLP), LOAD CSV, or one of the supported connectors (Kafka, Spark, BI).
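As a concrete starting point, here's a minimal sketch of a batched load with the official `neo4j` Python driver. The URI, credentials, the `Paper` label, and the `doi`/`title`/`abstract` properties are all assumptions — adjust them to your own schema:

```python
def chunked(records, size=500):
    """Yield successive batches so a single transaction never gets too large."""
    for i in range(0, len(records), size):
        yield records[i:i + size]

# One parameterized query handles a whole batch via UNWIND.
LOAD_QUERY = """
UNWIND $rows AS row
MERGE (p:Paper {doi: row.doi})
SET p.title = row.title, p.abstract = row.abstract
"""

def load_papers(uri, auth, papers):
    """papers: list of dicts like {"doi": ..., "title": ..., "abstract": ...}."""
    from neo4j import GraphDatabase  # pip install neo4j

    with GraphDatabase.driver(uri, auth=auth) as driver:
        with driver.session() as session:
            for batch in chunked(papers):
                session.run(LOAD_QUERY, rows=batch)
```

Batching with `UNWIND` keeps memory and transaction sizes predictable even when each row carries a large abstract or full text.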
If you're feeding large texts into Neo4j, people usually store them as individual node properties and put a full-text index on them: https://neo4j.com/docs/cypher-manual/current/indexes-for-full-text-search/
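If you go that route, creating and querying the index looks roughly like this — the index name `paperText` and the `Paper` label/properties are assumptions carried over from the sketch of your schema:

```python
# Create a full-text index over the stored text properties (run once).
CREATE_INDEX = """
CREATE FULLTEXT INDEX paperText IF NOT EXISTS
FOR (p:Paper) ON EACH [p.title, p.abstract]
"""

# Query the index with Lucene-style search terms.
SEARCH = """
CALL db.index.fulltext.queryNodes('paperText', $terms)
YIELD node, score
RETURN node.title AS title, score
ORDER BY score DESC
LIMIT 10
"""
```

You'd run both through the driver (`session.run(CREATE_INDEX)`, then `session.run(SEARCH, terms="knowledge graph")`); the `score` column lets you rank papers by relevance rather than doing exact matches over big strings.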
But another option to consider is not storing the large texts in Neo4j at all, and instead storing only the results of the NLP operations. Whether you're doing POS tagging or some kind of entity resolution, you'll ultimately get better results if you store the concept / term / tag graph in Neo4j to query, rather than using Neo4j as a large-text storage solution.
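In that design, the pipeline writes a small `(:Paper)-[:MENTIONS]->(:Entity)` graph instead of raw text. A hedged sketch, assuming your NLTK step hands back `(text, label)` pairs and that the `Entity` label and `MENTIONS` relationship type fit your model:

```python
def entity_rows(doi, entities):
    """Flatten (text, label) pairs into parameter rows for a batched MERGE.

    entities: e.g. [("Neo4j", "ORGANIZATION"), ("Alan Turing", "PERSON")]
    """
    return [{"doi": doi, "name": text, "label": label} for text, label in entities]

# MERGE keeps entities deduplicated across papers, so repeated mentions
# converge on one node and the co-occurrence structure becomes queryable.
LINK_QUERY = """
UNWIND $rows AS row
MATCH (p:Paper {doi: row.doi})
MERGE (e:Entity {name: row.name})
SET e.type = row.label
MERGE (p)-[:MENTIONS]->(e)
"""
```

Once the mentions graph exists, questions like "which papers share the most entities" become simple Cypher traversals, and GDS algorithms (similarity, community detection) run over a graph that's orders of magnitude smaller than the raw text.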
Follow up with what you've tried and more details about your pipeline, and we can maybe get more specific.