Can Neo4j search for keywords in websites using links stored in properties?

Hi guys, real novice here, just experimenting with graphs. I have a question: can you include links to websites in node properties so that a graph search also searches those websites? Please excuse me if this is a stupid question.

Many thanks

The short answer is (technically speaking) no, at least not with plain Cypher. You can store a URL and return the property (note that URLs are even clickable in Neo4j Browser...).

The long answer is that you could implement the functionality in a few different ways. Just some examples off the top of my head:

  1. Close to what you want, but architecturally probably not a good idea: use APOC in the Cypher query as a second step to load the web page content and search it, e.g. apoc.load.html.
  2. Drive the desired experience from a program you write in language N (e.g. Java, Python), and search the websites from that language after you retrieve a list of URLs from Neo4j.
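
A minimal sketch of option 2 in Python, using only the standard library. It assumes you already have a list of URLs returned by Neo4j; the helper names (html_to_text, fetch_html, pages_mentioning) are made up for illustration, and the matching is a plain case-insensitive substring test rather than real search.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the page text with tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def fetch_html(url, timeout=10):
    """Download one page and decode it (falls back to UTF-8)."""
    with urlopen(url, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")


def pages_mentioning(urls, keyword, fetch=fetch_html):
    """Return the URLs whose page text contains keyword (case-insensitive).

    fetch is injectable so the search logic can be tried without network access.
    """
    needle = keyword.casefold()
    hits = []
    for url in urls:
        try:
            text = html_to_text(fetch(url))
        except (OSError, ValueError):
            continue  # unreachable or malformed URL: skip it
        if needle in text.casefold():
            hits.append(url)
    return hits
```

The URL list would come from a query along the lines of MATCH (n) WHERE n.url IS NOT NULL RETURN n.url, where the property name url is an assumption about your data. If you prefer BeautifulSoup, its get_text() method could stand in for html_to_text here.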

Thanks Joel. Very informative, appreciate your time.

We had a use case similar to yours: we had to search for a keyword inside a few websites referenced in the data.
You can't perform this operation directly in Neo4j. Instead, implement it with Python and BeautifulSoup.

Thank you Dominic, very informative.

One thing you can do is periodically download the text of each website (which will be slow, and may annoy the websites if you hit them constantly) and store that text in the Neo4j DB. Then build a full-text index on this text field. You can fetch the pages as Joel D suggested, with apoc.load.html, but run it periodically.

This is what search engines do: periodically grab the text from websites and index the text. It's a potentially expensive process.
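
A rough sketch of the store-then-index idea, in Python. The Page label, the url, text and fetchedAt properties, and the index name pageText are all assumptions for illustration. The run argument is any callable that behaves like session.run from the official neo4j Python driver (a query string plus keyword parameters), and pages is a dict of URL to already-downloaded text.

```python
# Cypher statements; label, property and index names are illustrative assumptions.
CREATE_INDEX = (
    "CREATE FULLTEXT INDEX pageText IF NOT EXISTS "
    "FOR (p:Page) ON EACH [p.text]"
)
STORE_TEXT = (
    "MATCH (p:Page {url: $url}) "
    "SET p.text = $text, p.fetchedAt = datetime()"
)
SEARCH = (
    "CALL db.index.fulltext.queryNodes('pageText', $keyword) "
    "YIELD node, score RETURN node.url AS url, score"
)


def refresh_page_text(run, pages):
    """Make sure the full-text index exists, then store the latest text per URL.

    Meant to be called from a periodic job (cron or similar), not on every search.
    """
    run(CREATE_INDEX)
    for url, text in pages.items():
        run(STORE_TEXT, url=url, text=text)


def search_pages(run, keyword):
    """Return the URLs of pages whose stored text matches keyword."""
    return [record["url"] for record in run(SEARCH, keyword=keyword)]
```

With the real driver you would open a session and pass session.run as run. The search then only touches text already stored in the database, so the slow downloads stay in the periodic job.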

Downloading and storing the text could lead to problems:

  1. Downloading the text and keeping it up to date is a painful task.
  2. Parsing nested tags would be more difficult.
  3. If only the root tag is downloaded, it has to be loaded into memory again to parse the inner tags.

As I mentioned before, this functional module can reside outside Neo4j and be built with Python + BeautifulSoup.