Request performance on data with large amount of properties

DimGun · March 29, 2021, 11:30am

Hi folks,

I'm trying to build a solution that allows for BIM (building information model) analysis.

The problem I'm facing is request performance over a sizable number of elements (100M), where every element has up to 100 properties drawn from 16K unique properties.

Below you'll find a more detailed description of the problem, the experiments with Neo4j I've already done, the requests I'm struggling with, and what I'm trying to achieve.

Maybe someone could share their experience with similar tasks. I would appreciate any ideas or suggestions.

TL;DR

The data

Basically, a BIM is a DAG with a set of properties attached to every node.

Every BIM evolves over time (new elements are added, some are deleted, some change their values or their place in the hierarchy), and I need to track every change so that I can rewind a model to any state in the past.
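One way to picture the "rewind to any state" requirement is with validity intervals: each element state records the version in which it appeared and the version in which it was replaced or deleted. The sketch below is a minimal in-memory illustration only. The names `ElementState`, `valid_from`, `valid_to`, and `snapshot` are assumptions made for the example, not part of any existing schema, and changes of position in the hierarchy are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ElementState:
    element_id: str
    props: tuple                    # immutable (name, value) pairs
    valid_from: int                 # first version that contains this state
    valid_to: Optional[int] = None  # first version that no longer contains it; None = still current


def snapshot(states, version):
    """Return {element_id: props} as the model looked at the given version."""
    result = {}
    for s in states:
        if s.valid_from <= version and (s.valid_to is None or version < s.valid_to):
            result[s.element_id] = dict(s.props)
    return result


# Hypothetical history: wall-1 changes its volume in version 3,
# door-7 is added in version 2 and deleted in version 3.
states = [
    ElementState("wall-1", (("volume", 12.0),), valid_from=1, valid_to=3),
    ElementState("wall-1", (("volume", 14.5),), valid_from=3),
    ElementState("door-7", (("width", 0.9),), valid_from=2, valid_to=3),
]

for v in (1, 2, 3):
    print(v, snapshot(states, v))
```

With this layout, a deletion only closes an interval and an unchanged element needs no new record at all.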

On top of the BIM I have user-defined hierarchies (UDHs) of elements, which can link together nodes across multiple BIMs and from different levels. For example, a UDH could be a set of walls collected across 10 buildings, grouped by floor number.

In the future, every UDH should be able to assign additional properties to model nodes, groups of nodes, or elements from other UDHs.

Current solution

The solution makes complex analytical queries on these structures, like "calculate the total volume of elements with a specific property that are included in both UDH Foo and UDH Bar", or "get all unique values of property XYZ across all elements of UDH Foo".
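To make the two example queries concrete, here is a minimal in-memory sketch. It assumes a UDH can be flattened to the set of element ids it contains; the data, the property names (`volume`, `material`), and the helper names are all made up for illustration.

```python
# Hypothetical elements with sparse properties.
elements = {
    "e1": {"volume": 2.0, "material": "concrete"},
    "e2": {"volume": 3.5, "material": "concrete"},
    "e3": {"volume": 1.0},
    "e4": {"volume": 4.0, "material": "brick"},
}

# Each UDH flattened to the set of element ids it contains (an assumption).
udh_foo = {"e1", "e2", "e3"}
udh_bar = {"e2", "e3", "e4"}


def total_volume(elements, groups, required_prop):
    """Sum `volume` over elements that are in every group and have `required_prop`."""
    members = set.intersection(*groups)
    return sum(
        elements[e].get("volume", 0.0)
        for e in members
        if required_prop in elements[e]
    )


def unique_values(elements, group, prop):
    """Collect the distinct values of `prop` across the elements of one group."""
    return {elements[e][prop] for e in group if prop in elements[e]}


print(total_volume(elements, [udh_foo, udh_bar], "material"))
print(unique_values(elements, udh_foo, "material"))
```

Only `e2` is in both hierarchies and carries `material`, so the first call sums a single volume.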

Right now I'm using a custom solution based on a relational database to represent this data. It is rather cumbersome, as it requires building extremely complex queries to manipulate the data.

Experiments with Neo4j

It seems like Neo4j could drastically simplify the representation of BIMs and UDHs. But I'm concerned about performance, because the data sets are pretty large: every version of a BIM contains around 300K nodes with 100 properties each, with around 2K unique properties in total per BIM. I researched this, and it seems that Neo4j does not handle a large number of properties well; this is discussed, for example, in the topic Best practices on number of properties for a node. It was also stated that Neo4j does a linear search among node properties, which could result in slow queries when looking for elements with a specific property.

I made a prototype in which I uploaded several models to a Neo4j database and benchmarked requests. A request that collects all unique property names across all BIM nodes, where each node has 160-240 properties out of 500 unique properties, took:

747 ms for 10K elements;
1568 ms for 20K elements;
3804 ms for 50K elements;
no result for 100K elements.

Time grew linearly, and on the last test the database just stopped responding.
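As a quick check of the linear trend, the three completed runs can be normalized to a cost per 1K elements and fitted with a line through the origin. This is plain arithmetic on the numbers above; the 100K figure it prints is an extrapolation under the assumption that the trend would have continued, not a measurement.

```python
# (elements, milliseconds) for the three runs that completed.
measurements = [(10_000, 747), (20_000, 1568), (50_000, 3804)]

for n, ms in measurements:
    print(f"{n:>6} elements: {ms / n * 1000:.1f} ms per 1K elements")

# Least-squares slope for a line through the origin: sum(n * ms) / sum(n * n).
slope = sum(n * ms for n, ms in measurements) / sum(n * n for n, _ in measurements)

# Extrapolated time for 100K elements, valid only if the growth stayed linear.
extrapolated_ms = slope * 100_000
print(f"extrapolated for 100K elements: {extrapolated_ms / 1000:.1f} s")
```

The per-1K cost stays between roughly 75 and 78 ms, so a linear fit would put the 100K run at about 7.6 s.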

What's next

Right now I'm thinking of a hybrid solution that uses a column-oriented database such as Vertica, ClickHouse, or HBase to store properties, which should be a good fit given the sparse nature of the properties, and Neo4j to store the relationships between nodes.

For reference, here is what I'm trying to achieve.
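The reason a column layout suits sparse properties can be shown with a small in-memory stand-in: each property gets its own mapping from element id to value, so elements that lack a property cost nothing, and a lookup by property never has to scan the other properties of a node. The class `PropertyColumns` and its methods are hypothetical names for this sketch and do not mirror the API of Vertica, ClickHouse, or HBase. The list of member ids is assumed to come from the graph side.

```python
from collections import defaultdict


class PropertyColumns:
    """Toy column layout: property name -> {element_id: value}."""

    def __init__(self):
        self._columns = defaultdict(dict)

    def put(self, element_id, props):
        for name, value in props.items():
            self._columns[name][element_id] = value

    def property_names(self, element_ids):
        """Names of all properties present on at least one of the given elements."""
        ids = set(element_ids)
        return {name for name, col in self._columns.items() if not ids.isdisjoint(col)}

    def unique_values(self, name, element_ids):
        """Distinct values of one property across the given elements."""
        col = self._columns.get(name, {})
        return {col[e] for e in element_ids if e in col}

    def select(self, name, predicate, element_ids):
        """Ids of the given elements whose property value satisfies the predicate."""
        col = self._columns.get(name, {})
        return {e for e in element_ids if e in col and predicate(col[e])}


store = PropertyColumns()
store.put("e1", {"volume": 2.0, "material": "concrete"})
store.put("e2", {"volume": 3.5})
store.put("e3", {"fire_rating": "EI60"})

# Assumed to be supplied by the graph store: the elements of one BIM version or UDH.
version_members = ["e1", "e2"]

print(store.property_names(version_members))
print(store.unique_values("volume", version_members))
print(store.select("volume", lambda v: v > 3, version_members))
```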
Source data:

The database will contain 600+ building information models.
Every model has 10-60 versions.
Every version has up to 700K elements.
Every element has 20-100 properties out of 16K unique properties.

Example requests:

1. Get all unique property names across a single version of a BIM or UDH.
2. Get all unique values of a property across a single version of a BIM or UDH.
3. Get all elements from a single version of a BIM or UDH whose property satisfies a logical expression.
4. Calculate an expression for every element from a single version of a BIM or UDH, like "(propA * propB) / propC * 100".
5. Recursively calculate an aggregate value (like COUNT or SUM(propA)) for every node of a single version of a BIM or UDH.
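The last two requests in the list above (per-element expression and recursive aggregation) can be sketched in memory as follows. The hierarchy, the property values, and the function names are invented for illustration. Because the structure is a DAG, the sketch assumes a shared descendant should be counted once per ancestor, so it aggregates over the set of descendants instead of adding up the children's totals.

```python
from functools import lru_cache

# Hypothetical hierarchy; wallB sits under two floors, so this is a DAG, not a tree.
children = {
    "building": ["floor1", "floor2"],
    "floor1": ["wallA", "wallB"],
    "floor2": ["wallB", "wallC"],
}

props = {
    "wallA": {"propA": 2.0, "propB": 3.0, "propC": 4.0},
    "wallB": {"propA": 1.0, "propB": 5.0, "propC": 2.0},
    "wallC": {"propA": 4.0},
}


def evaluate(p):
    """(propA * propB) / propC * 100, or None if a property is missing or propC is 0."""
    try:
        return (p["propA"] * p["propB"]) / p["propC"] * 100
    except (KeyError, ZeroDivisionError):
        return None


@lru_cache(maxsize=None)
def descendants(node):
    """All nodes reachable from `node`, each listed once even if reachable by several paths."""
    result = set()
    for child in children.get(node, ()):
        result.add(child)
        result |= descendants(child)
    return frozenset(result)


def aggregate(node, prop):
    """Return (COUNT of nodes in the subgraph including `node`, SUM of `prop` over them)."""
    members = descendants(node) | {node}
    values = [props[m][prop] for m in members if prop in props.get(m, {})]
    return len(members), sum(values)


print({name: evaluate(p) for name, p in props.items()})
print({node: aggregate(node, "propA") for node in children})
```

Materializing descendant sets keeps the logic simple but can use a lot of memory on deep hierarchies, so a real implementation would likely need a different strategy.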

Timing:

Requests 1-3 should take less than 3 sec.
Requests 4-5 should take less than 60 sec.
BIM write (import) time should be less than 600 sec.

Additional requirements:

Optimal storage. A model has multiple versions. Each version introduces around 10K new elements, deletes around 10K elements, and changes around 10K elements. So in every version thousands of elements remain the same, and they should be "linked" to the new version, not recreated.
Storage should be scalable to support thousands of BIMs in the future.
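A back-of-the-envelope comparison shows how much the "link, don't recreate" requirement matters. The sketch assumes the upper bounds stated above (700K elements per version, 60 versions) and that a new element state is stored only for added and changed elements, while a deletion stores nothing new. The function names are made up for the example.

```python
def naive_states(elements_per_version, versions):
    """Every version stores a full copy of every element."""
    return elements_per_version * versions


def delta_states(elements_per_version, versions, new_per_version, changed_per_version):
    """The first version is stored in full; later versions store only new and changed elements."""
    return elements_per_version + (versions - 1) * (new_per_version + changed_per_version)


naive = naive_states(700_000, 60)
delta = delta_states(700_000, 60, 10_000, 10_000)

print(f"full copies: {naive:,} element states per model")
print(f"deltas only: {delta:,} element states per model")
print(f"ratio: {naive / delta:.1f}x")
```

For one model at these upper bounds, that is 42,000,000 element states with full copies versus 1,880,000 with deltas.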

akollegger1 · March 31, 2021, 12:47pm

Hi @DimGun ,

I keep bookmarking this post for reply, and haven't had time for a thoughtful response.

Brief notes to consider:

the size of the graph seems manageable, it's just the properties per node that seem challenging
indexes can be used in Neo4j similarly to an RDBMS for improving look-ups by property value
the approach you describe with another store for properties is definitely worth evaluating
take a look at neo4j-versioner-core (GitHub: h-omer/neo4j-versioner-core), an Entity-State model managed by Neo4j procedures

-ABK

DimGun · April 6, 2021, 4:27pm

Hi Andreas,
thanks for the reply! I read through the versioner-core documentation. It's a pretty interesting approach to representing relationships between versioned entities, and I should really give it a try once we resolve the issue with the number of properties.
