Loading .pdf or .doc content into Neo4j

Can someone share best practices for loading content from .pdf or .doc files into Neo4j?
I would like to create relationships based on specific words found in the files' content. I have hundreds of files, so manually converting each one to CSV is not an option.

This is a big area that needs to be broken down into pieces. The short answer is that you can't directly load a PDF or DOC file into Neo4j. Instead, you need a document pipeline, which requires several other pieces.

First, you need to extract text from the PDF or DOC. There are many different DOC formats, and PDFs are sometimes scanned images of text rather than actual text. If you have images of text, you need an OCR step to pull the text out of the image.

Once you have text, you need to think more about what kind of graph you're trying to get out of it. Do you want person/place/thing information linked together? Then you might check out the GraphAware NLP plugins for Neo4j - but they work on text, not on docs/PDFs.

Apache Tika is a great package for extracting text from DOC/PDF.

https://tika.apache.org/

Once you have text, I'd recommend a pipeline like this:

  1. PDF -> Apache Tika -> data file (such as JSON or CSV) that contains the full path of the file and full text of the original file.
  2. Load that data file into Neo4j however you wish (LOAD CSV / apoc.load.json)
  3. Use GraphAware NLP to annotate nouns/verbs/etc.
  4. Link as appropriate.
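To make step 1 concrete, here's a minimal Python sketch. It assumes the `tika` pip package (which runs against a local Apache Tika server behind the scenes); `make_record` and the injectable `extract` parameter are just names I made up for illustration. It walks a directory and writes one JSON line per document, holding the file's full path and full text:

```python
import json
from pathlib import Path

def make_record(path, text):
    # One record per document: the file's path plus its full text.
    return {"path": str(path), "text": text}

def extract_with_tika(path):
    # Assumes the `tika` pip package is installed; it talks to a
    # local Apache Tika server to do the actual extraction.
    from tika import parser
    parsed = parser.from_file(str(path))
    return parsed.get("content") or ""

def build_data_file(doc_dir, out_path, extract=extract_with_tika):
    # Walk the directory and write one JSON line per PDF/DOC file,
    # ready to load on the Neo4j side (e.g. with apoc.load.json).
    with open(out_path, "w", encoding="utf-8") as out:
        for p in sorted(Path(doc_dir).rglob("*")):
            if p.suffix.lower() in {".pdf", ".doc", ".docx"}:
                out.write(json.dumps(make_record(p, extract(p))) + "\n")
```

On the Neo4j side you could then iterate the lines and `MERGE` one `Document` node per record.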

Thank you, this is exactly what I needed to get started.


Hi David. As far as I know, GraphAware is not free, so I was wondering if there is another way to implement this without using paid services?

Yes, it's possible to do with free tools. One approach you might consider is using a programming-language library like NLTK for Python. There are natural language libraries for every programming language under the sun.
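As a toy illustration of the idea in the original question (linking documents by specific words found in their content), here's a pure-Python sketch; `find_terms` is just a name I made up, and a real pipeline would use NLTK for tokenization, POS tagging, and lemmatization instead of plain regex matching:

```python
import re

def find_terms(text, vocabulary):
    # Return the vocabulary terms that occur as whole words in text
    # (case-insensitive). Each (document, term) hit would become a
    # relationship such as (:Document)-[:MENTIONS]->(:Term) in Neo4j.
    lowered = text.lower()
    return {t for t in vocabulary
            if re.search(r"\b" + re.escape(t.lower()) + r"\b", lowered)}
```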

A word of warning, though: this is a lot of integration work that I personally would not undertake if I could possibly avoid it. You'll need to build code that pulls data out of the database, does the NLP work, supports certain kinds of customizability, and then writes the results back to Neo4j. That would take some time.
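That round trip might look like the skeleton below; `read_docs`, `annotate`, and `write_links` are hypothetical stand-ins for your Neo4j driver calls and whatever NLP library you pick:

```python
def enrich_documents(read_docs, annotate, write_links):
    # Pull (id, text) pairs out of the database, run NLP over each
    # text, and write the resulting (document, entity) links back.
    for doc_id, text in read_docs():
        for entity in annotate(text):
            write_links(doc_id, entity)
```

The injectable functions keep the NLP step testable without a live database, which is worth doing given how much glue code this pipeline needs.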

Since this original post, GraphAware's NLP has been built into their Hume product. If I only needed NLP with graphs, these days I'd use cloud APIs for the purpose, such as Google's Natural Language API. Those aren't free options either, though.

Thank you David. Would you suggest NLTK over spaCy? I am going to use it on scientific research papers. I agree that the Google Cloud API would be a great option for me, but I am interested in free options.

I don't have anything good or bad to say about spaCy. I've used NLTK in Python before and liked it, but as I said, it's been a while; nowadays I mostly use cloud APIs for this purpose.

In general, if you're evaluating these kinds of libraries, make sure to check out the various corpora they offer. The functionality and quality of the library itself is only part of the picture; a lot of the time you'll need multilingual corpora or other data files that are part of using the library but aren't code, if you know what I mean.


I was wondering about the following scenario:
If the business is not interested in searching the content of the MS Word/Excel/CSV/PDF/image file, is it possible to store it as some sort of "BLOB" and provide a relationship to a node? An example would be including a document such as an image of a hotel invoice and associating that image with a specific employee node. (Pardon my technical vocabulary, I started 10 minutes ago!)
Thanks.