Loading .pdf or .doc content into Neo4j

Can someone share best practices for loading content from .pdf or .doc files into Neo4j?
I would like to create relationships based on specific words found in the files' content. I have hundreds of files, so manually converting each one to CSV is not an option.

This is a big area that needs to be broken down into pieces. The short answer is that you can't directly load a PDF or DOC file into Neo4j. Instead, you need a document pipeline, which requires several other pieces.

First, you need to extract text from the PDF or DOC. There are many different DOC formats, and PDFs are sometimes scanned images of text rather than actual text. If you have images of text, you need an OCR step to pull the text out of the image.

Once you have text, you need to think more about what kind of graph you're trying to get out of it. Do you want person/place/thing information linked together? Then you might check out the GraphAware NLP plugins for Neo4j - but they work on text, not on docs/PDFs.

Apache Tika is a great package for extracting text from DOC/PDF.

https://tika.apache.org/

Once you have text, I'd recommend a pipeline like this:

  1. PDF -> Apache Tika -> data file (such as JSON or CSV) that contains the full path of the file and full text of the original file.
  2. Load that data file into Neo4j however you wish (LOAD CSV / apoc.load.json)
  3. Use GraphAware NLP to annotate nouns/verbs/etc.
  4. Link as appropriate.
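To make step 1 concrete, here's a minimal Python sketch. It assumes the `tika` pip package (which runs against a local Apache Tika server behind the scenes); `make_record` and the injectable `extract` parameter are just names I made up for illustration. It walks a directory and writes one JSON line per document, holding the file's full path and full text:

```python
import json
from pathlib import Path

def make_record(path, text):
    # One record per document: the file's path plus its full text.
    return {"path": str(path), "text": text}

def extract_with_tika(path):
    # Assumes the `tika` pip package is installed; it talks to a
    # local Apache Tika server to do the actual extraction.
    from tika import parser
    parsed = parser.from_file(str(path))
    return parsed.get("content") or ""

def build_data_file(doc_dir, out_path, extract=extract_with_tika):
    # Walk the directory and write one JSON line per PDF/DOC file,
    # ready to load on the Neo4j side (e.g. with apoc.load.json).
    with open(out_path, "w", encoding="utf-8") as out:
        for p in sorted(Path(doc_dir).rglob("*")):
            if p.suffix.lower() in {".pdf", ".doc", ".docx"}:
                out.write(json.dumps(make_record(p, extract(p))) + "\n")
```

On the Neo4j side you could then iterate the lines and `MERGE` one `Document` node per record.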

Thank you, this is exactly what I needed to get started.


Hi David. As far as I know, GraphAware is not free, so I was wondering if there is another way to implement this without using paid services?

Yes, it's possible to do with free tools. One approach you might consider is using a programming-language library like NLTK for Python. There are natural language libraries for every programming language under the sun.
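As a toy illustration of the idea in the original question (linking documents by specific words found in their content), here's a pure-Python sketch; `find_terms` is just a name I made up, and a real pipeline would use NLTK for tokenization, POS tagging, and lemmatization instead of plain regex matching:

```python
import re

def find_terms(text, vocabulary):
    # Return the vocabulary terms that occur as whole words in text
    # (case-insensitive). Each (document, term) hit would become a
    # relationship such as (:Document)-[:MENTIONS]->(:Term) in Neo4j.
    lowered = text.lower()
    return {t for t in vocabulary
            if re.search(r"\b" + re.escape(t.lower()) + r"\b", lowered)}
```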

A word of warning, though: this is a lot of integration work that I personally would not undertake if I could possibly avoid it. You'll need to build code that pulls data out of the database, does the NLP work, supports certain kinds of customizability, and then writes the results back to Neo4j. That would take some time.
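That round trip might look like the skeleton below; `read_docs`, `annotate`, and `write_links` are hypothetical stand-ins for your Neo4j driver calls and whatever NLP library you pick:

```python
def enrich_documents(read_docs, annotate, write_links):
    # Pull (id, text) pairs out of the database, run NLP over each
    # text, and write the resulting (document, entity) links back.
    for doc_id, text in read_docs():
        for entity in annotate(text):
            write_links(doc_id, entity)
```

The injectable functions keep the NLP step testable without a live database, which is worth doing given how much glue code this pipeline needs.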

Since this original post, GraphAware's NLP has been built into their Hume product. If I only needed NLP with graphs, these days I'd use cloud APIs for the purpose, such as Google's Natural Language API. Those aren't free options either, though.

Thank you David. Would you suggest NLTK over spaCy? I am going to use it on scientific research papers. I agree that the Google Cloud API would be a great option for me, but I am interested in free options.

I don't have anything good or bad to say about spaCy. I've used NLTK in Python before and liked it, but as I said, it's been a while; nowadays I mostly use cloud APIs for this purpose.

In general, if you're evaluating these kinds of libraries, make sure to check out the various corpora they offer. The functionality and quality of the library itself is only part of the picture; a lot of the time you'll need multilingual corpora or other data files that are part of using the library but aren't code, if you know what I mean.


I was wondering about the following scenario:
If the business is not interested in searching the content of the MS Word/Excel/CSV/PDF/image file, is it possible to store it as some sort of "BLOB" and provide a relationship to a node? An example would be including a document such as an image of a hotel invoice and associating that image with a specific employee node. (Pardon my technical vocabulary, I started 10 minutes ago!)
Thanks.