Semantic search: searching an array of vectors stored as a node property (querying a node index)

Hi there folks, I am creating a graph where I link the main entity (the title of the document / paragraph) to child nodes (all related entities in the doc / para). Simple example:

( Albert Einstein ) [ : Discovered ] ( space time )
( Albert Einstein ) [ : Discovered ] ( photoelectric effect )

I am storing some details (a short summary of the relation between the main node and the child), and in order to be able to search the relationship vectors I am creating an index on the relationship. So far so good. The problem, however, is that there could be a whole lot of information about "space time" that I would like to store as a property of the child node.

That could mean multiple lines, and in order to store them as vectors I will have to either:
a) keep a fixed size (truncate / pad, since different child nodes will have different lengths of context), which will lead to loss of info in a lot of cases
b) have "array_vector" as a property where I store the array of embedded vectors of the chunked text.

The problem with (b) is that the index is specified with a fixed size when it is created, and hence, theoretically, it will error out when I use an array of, let's say, vectors of size 384 instead of a single vector of size 384.

c) a terrible hack would be to split the child nodes further by adding some additional info about each chunk (e.g. if the child node has 3 sentences as its property, and assuming every sentence turns into a vector of size 384, I will need to create 3 separate child nodes with the same entity + some additional info and separate 384-sized vectors; this way they will all be indexed).

The only advantage of the above would be that, since all 3 would be connected to the main and child entity, they should show up in a semantic search and I could then combine all the info.
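
A minimal Python sketch of option (c), under stated assumptions: `fake_embed` is a hypothetical stand-in for a real sentence-embedding model, the sentence splitter is deliberately naive, and the dict keys (`entity`, `chunk`, `text`, `embedding`) are illustrative names, not anything the database requires. The point is only that each chunk ends up with its own flat vector of size 384, which is the shape a fixed-dimension index expects.

```python
import hashlib
import re

DIM = 384  # vector size mentioned in the thread


def fake_embed(text: str, dim: int = DIM) -> list[float]:
    """Placeholder embedding (assumption): a deterministic pseudo-vector
    derived from a hash. Swap in a real embedding model; this only fixes
    the shape of the output."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]


def split_sentences(text: str) -> list[str]:
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk_nodes(entity: str, context: str) -> list[dict]:
    """One record per chunk, each carrying a single flat vector."""
    return [
        {"entity": entity, "chunk": i, "text": s, "embedding": fake_embed(s)}
        for i, s in enumerate(split_sentences(context))
    ]


context = (
    "In physics, spacetime is a mathematical model. "
    "Spacetime diagrams are useful. "
    "Space and time took on new meanings."
)
nodes = chunk_nodes("space time", context)
```

Each record in `nodes` could then be written as its own child node linked to the same entity, so every vector is indexed individually.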

Sadly I can't think of any other way to do this, unless the ninjas come to my rescue. Appreciate all the patience and opinions.

I would just do a single embedding for the whole space-time text if that's still selective enough in search (you can also have an LLM generate a summary with key terms and embed only that summary).

Otherwise I think splitting them up into individual nodes makes more sense; think of those as Fact nodes or Article nodes that cover different aspects. In a vector db you would also create different entries for them.
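
If the chunks do end up as separate nodes, the hits from a vector search have to be stitched back together per entity. A rough sketch of that step in Python, assuming the search returns `(entity, text, score)` tuples (a hypothetical shape, not the output format of any particular database) and that a higher score means a closer match:

```python
from collections import defaultdict


def combine_hits(hits):
    """Group chunk-level hits by entity and rank entities by their best hit.

    hits: iterable of (entity, text, score) tuples (assumed shape).
    """
    grouped = defaultdict(list)
    for entity, text, score in hits:
        grouped[entity].append((score, text))

    results = []
    for entity, items in grouped.items():
        # Best-scoring chunk first within each entity.
        items.sort(key=lambda item: item[0], reverse=True)
        results.append(
            {
                "entity": entity,
                "best_score": items[0][0],
                "context": " ".join(text for _, text in items),
            }
        )

    # Entities ordered by their single best chunk score.
    results.sort(key=lambda r: r["best_score"], reverse=True)
    return results


hits = [
    ("space time", "Lorentz transformation chunk", 0.85),
    ("photoelectric effect", "Photoelectric chunk", 0.80),
    ("space time", "Spacetime diagrams chunk", 0.91),
]
combined = combine_hits(hits)
```

Ranking by the best chunk is just one choice; averaging the scores or counting hits per entity would be equally easy to swap in.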

M

Thanks Michael, I concur. The first approach I wouldn't risk, since I don't trust summaries generated by even GPT-4 (they are good most of the time, but when I start pumping in official documents and spreadsheets with textual info, their summarization is suspect).

Coming to the 2nd approach, could you please confirm whether my understanding of your recommendation is what you intended:
-> child node -> space time
-> context - In physics, spacetime is a mathematical model that fuses the three dimensions of space and the one dimension of time into a single four-dimensional continuum. Spacetime diagrams are useful in visualizing and understanding relativistic effects such as how different observers perceive where and when events occur. However, space and time took on new meanings with the Lorentz transformation and special theory of relativity.
-> now within this context itself I see at least 2 main "sub entities", if I can call them that ("spacetime diagrams" & "Lorentz transformation"), so I can comfortably fit the sub-context into 2 separate nodes
-> or we just call the entities "space time1", "space time2", etc. and store chunkable contexts within them

Up to you, you can substructure the data as any of these:

  • pieces of text (paragraphs)
  • concept nodes like you showed
  • or concrete entities like Person, Concept, Experiment, Theory

Whatever helps you best to represent the data for your use-case.
