Loading .pdf or .doc content into Neo4j

david_allen · July 23, 2019, 4:01pm

This is a big topic that needs to be broken down. The short answer is that you can't load a PDF or DOC file directly into Neo4j. Instead, you need a document pipeline made up of several other pieces.

First, you need to extract the text from the PDF or DOC. There are many different DOC formats, and a PDF is sometimes a scanned image of text rather than actual text. If you have images of text, you need an OCR step to pull the text out of the image.
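One rough way to tell whether a file needs OCR is to check how much text the extraction step actually returned. The sketch below is only a heuristic: the function name `looks_scanned` and the `min_chars` threshold are made up for illustration, and a real pipeline would probably want something more careful.

```python
def looks_scanned(extracted_text: str, min_chars: int = 20) -> bool:
    """Guess whether a document is an image of text.

    Assumption: if text extraction returned fewer than `min_chars`
    non-whitespace characters, the file probably contains scanned
    pages and should be sent through OCR instead.
    """
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_chars
```

Documents flagged this way would go to the OCR step; everything else can go straight on to the next stage.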

Once you have text, you need to think about what kind of graph you're trying to get out of it. Do you want person/place/thing information linked together? Then you might check out the GraphAware NLP plugins for Neo4j, but note that they work on text, not on DOC/PDF files.

Apache Tika is a great package for extracting text from DOC/PDF.

https://tika.apache.org/
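As a minimal sketch of calling Tika from a script: this assumes you are running the Tika server locally on its default port (9998) and sends the file to its `/tika` endpoint with a PUT request, asking for plain text back. The function names are my own, and the URL will differ if your server runs elsewhere.

```python
import urllib.request

# Assumption: a Tika server is running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(data: bytes, tika_url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request that asks Tika for the plain-text content of `data`."""
    return urllib.request.Request(
        tika_url,
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(path: str, tika_url: str = TIKA_URL) -> str:
    """Send the file at `path` to the Tika server and return the extracted text."""
    with open(path, "rb") as f:
        data = f.read()
    with urllib.request.urlopen(build_tika_request(data, tika_url)) as resp:
        return resp.read().decode("utf-8")
```

Calling `extract_text` on a PDF or DOC path should return the document text as a string, provided the server is reachable.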

To summarize, once you have text I'd recommend a pipeline like this:

1. PDF -> Apache Tika -> data file (such as JSON or CSV) that contains the full path and the full text of the original file.
2. Load that data file into Neo4j however you wish (LOAD CSV / apoc.load.json).
3. Use GraphAware NLP to annotate nouns/verbs/etc.
4. Link as appropriate.
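The first two steps could look roughly like the sketch below. It writes one CSV row per document with its path and text, and keeps a LOAD CSV statement alongside as a string. Everything named here is an assumption for illustration: the `documents.csv` file name, the `Document` label, and the `path` and `text` properties. Whitespace in the text is collapsed to single spaces so that each document stays on one CSV line.

```python
import csv
import io

# Assumption: the CSV is placed in Neo4j's import directory as documents.csv,
# and documents are stored as (:Document {path, text}) nodes.
LOAD_DOCUMENTS_CYPHER = """
LOAD CSV WITH HEADERS FROM 'file:///documents.csv' AS row
MERGE (d:Document {path: row.path})
SET d.text = row.text
"""


def write_documents_csv(docs, out) -> None:
    """Write (path, text) pairs to `out` as CSV with a header row.

    `docs` is any iterable of (path, text) tuples; `out` is a text file
    object (open real files with newline="" as the csv module expects).
    """
    writer = csv.writer(out)
    writer.writerow(["path", "text"])
    for path, text in docs:
        # Collapse newlines and runs of whitespace into single spaces.
        writer.writerow([path, " ".join(text.split())])
```

For long or messy text, JSON plus apoc.load.json may be the easier route, since you avoid CSV quoting and escaping issues altogether.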
