Loading .pdf or .doc content into Neo4j

Can someone share best practices for loading content from .pdf or .doc files into Neo4j?
I would like to create relationships based on specific words found in the files' content. I have hundreds of files, so manually converting each one to CSV is not an option.

This is a big topic that needs to be broken down. The short answer is that you can't directly load a PDF or DOC file into Neo4j. Instead, you need a document pipeline made up of several separate pieces.

First, you need to extract text from the PDF or DOC. There are many different DOC formats, and PDFs are sometimes pictures of text rather than actual text. If you have pictures of text, you need an OCR step to pull the text out of the picture.
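
To decide which files need the OCR step, one rough heuristic is to run normal text extraction first and flag any document that yields almost no characters. This is only a sketch: the `needs_ocr` name and the 20-character threshold are my own assumptions, and it judges a whole document rather than individual pages.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Guess whether a document is a scan: extraction returned almost no text.

    min_chars is an arbitrary, assumed threshold; tune it on your own files.
    """
    # Drop all whitespace so pages of blank lines do not count as content.
    visible = "".join(extracted_text.split()) if extracted_text else ""
    return len(visible) < min_chars
```

Files flagged this way can be routed to whatever OCR tool you choose, and the rest can go straight to the next step.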

Once you have text, you need to think more about what kind of graph you're trying to get out of it. Do you want person/place/thing information linked together? Then you might check out the GraphAware NLP plugins for Neo4j, but note that they work on text, not on DOCs/PDFs.

Apache Tika is a great package for extracting text from DOC/PDF.
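
As a rough illustration, here is a minimal Python sketch that walks a folder and writes one CSV row per document. It assumes the `tika` package (tika-python, which needs Java available) for the actual extraction. The function names, the `path`/`text` column names, and the list of suffixes are my own choices, not anything Tika or Neo4j prescribes.

```python
import csv
from pathlib import Path


def tika_extract(path):
    """Extract plain text with tika-python (assumed installed; it needs Java)."""
    from tika import parser  # imported lazily so the rest works without Tika
    parsed = parser.from_file(str(path))
    return parsed.get("content") or ""  # content can be None for empty files


def build_text_csv(input_dir, output_csv, extract=tika_extract,
                   suffixes=(".pdf", ".doc", ".docx")):
    """Write one row per document: its full path and its full extracted text."""
    count = 0
    with open(output_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "text"])  # header row, column names are assumptions
        for path in sorted(Path(input_dir).rglob("*")):
            if path.is_file() and path.suffix.lower() in suffixes:
                writer.writerow([str(path.resolve()), extract(path).strip()])
                count += 1
    return count
```

Calling `build_text_csv("docs", "documents.csv")` (both names hypothetical) would process every matching file under `docs`. The `extract` argument is there so you can swap in a different extractor, or an OCR step, for files Tika cannot read.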


Once you have the text, I'd recommend a pipeline like the following. To summarize:

  1. PDF/DOC -> Apache Tika -> data file (such as JSON or CSV) that contains the full path and the full text of each original file.
  2. Load that data file into Neo4j however you wish (LOAD CSV / apoc.load.json)
  3. Use GraphAware NLP to annotate nouns/verbs/etc.
  4. Link as appropriate.
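
If all you need is relationships based on specific words, the linking step can be sketched with plain keyword matching instead of full NLP annotation. To be clear, this is a simpler stand-in for step 3, not what GraphAware NLP does. The `Document` and `Keyword` labels, the `MENTIONS` relationship type, and the function name are hypothetical, and the whole-word matching assumes keywords are ordinary words (letters and digits).

```python
import re

# Hypothetical graph model; adapt the labels and relationship type to your own.
LINK_QUERY = """
UNWIND $rows AS row
MERGE (d:Document {path: row.path})
MERGE (k:Keyword {name: row.keyword})
MERGE (d)-[m:MENTIONS]->(k)
SET m.count = row.count
"""


def keyword_rows(documents, keywords):
    """Return one dict per (document, keyword) pair where the keyword occurs.

    documents is an iterable of (path, text) pairs. Matching is
    case-insensitive and whole-word only, so "graph" does not match "graphs".
    """
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    rows = []
    for path, text in documents:
        for kw, pattern in patterns.items():
            hits = len(pattern.findall(text))
            if hits:
                rows.append({"path": path, "keyword": kw, "count": hits})
    return rows
```

The resulting list can be passed as the `$rows` parameter when running `LINK_QUERY`, for example through the official Neo4j Python driver with `session.run(LINK_QUERY, rows=rows)`. Using `MERGE` keeps the load repeatable, so re-running it over the same files should not create duplicate nodes or relationships.
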

Thank you, this is exactly what I needed to get started.
