I'm trying to build a solution that allows for BIM (building information model) analysis.
The problem I'm facing is query performance over a fairly large number of elements (about 100M), where every element has up to 100 properties out of 16K unique properties.
Below is a more detailed description of the problem, the experiments I have already done with Neo4j, the requests I'm struggling with, and what I'm trying to achieve.
Maybe someone could share their experience with similar tasks. I would appreciate any ideas or suggestions.
Basically, a BIM is a DAG with a set of properties attached to every node.
Every BIM evolves over time (new elements are added, some are deleted, some change their values or their place in the hierarchy), and I need to track every change so that I can rewind a model to any state in the past.
On top of the BIM I have user-defined hierarchies (UDHs) of elements, which can link together nodes from different levels across multiple BIMs. For example, it could be a set of walls collected across 10 buildings, grouped by floor number.
In the future every UDH should be able to assign additional properties to model nodes, groups of nodes, or elements from other UDHs.
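To make the "rewind to any state" requirement concrete, here is a minimal sketch of one common way to do it: every element state carries a validity interval of version numbers, so unchanged elements are stored once and shared by all versions they live in. The names `ElementState`, `VersionedModel`, `commit` and `snapshot` are made up for illustration, and the sketch only covers property changes, not changes of place in the hierarchy (edges could be versioned the same way).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ElementState:
    element_id: str
    props: dict
    valid_from: int                 # first version in which this state exists
    valid_to: Optional[int] = None  # first version in which it no longer exists


class VersionedModel:
    """Append-only history of element states (hypothetical helper)."""

    def __init__(self):
        self.states = []   # every state ever written
        self.current = {}  # element_id -> state that is still open

    def _close(self, element_id, version):
        state = self.current.pop(element_id, None)
        if state is not None:
            state.valid_to = version

    def commit(self, version, upserts=None, deletes=()):
        # Deleted elements only get their interval closed.
        for element_id in deletes:
            self._close(element_id, version)
        # Added or changed elements get a new state; untouched ones cost nothing.
        for element_id, props in (upserts or {}).items():
            self._close(element_id, version)
            state = ElementState(element_id, dict(props), version)
            self.states.append(state)
            self.current[element_id] = state

    def snapshot(self, version):
        # A state is visible in `version` if valid_from <= version < valid_to.
        return {
            s.element_id: s.props
            for s in self.states
            if s.valid_from <= version
            and (s.valid_to is None or version < s.valid_to)
        }


model = VersionedModel()
model.commit(1, upserts={"wall-1": {"Volume": 2.0}, "door-1": {"Width": 0.9}})
model.commit(2, upserts={"wall-1": {"Volume": 2.5}}, deletes=["door-1"])
```

With this layout, `model.snapshot(1)` still returns the original wall and door, while `model.snapshot(2)` returns only the changed wall. The full scan in `snapshot` is just for readability; a real store would index the intervals.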
The solution runs complex analytical queries on these structures, like "calculate the total volume of elements with a specific property that are included in both UDH Foo and UDH Bar", or "get all unique values of property XYZ across all elements of UDH Foo".
Right now I'm using a custom solution based on a relational database to represent this data. It is rather cumbersome, as it requires building extremely complex queries to manipulate the data.
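As an illustration of what these two example queries mean, here is a small in-memory sketch. It assumes each UDH can be flattened to a set of element ids and that properties are plain dictionaries; the property names `Volume` and `FireRating`, the element ids, and both function names are invented for the example.

```python
def total_volume(elements, udh_members, udh_names, required_prop, volume_prop="Volume"):
    """Sum volume_prop over elements that are in ALL listed UDHs and have required_prop."""
    ids = set.intersection(*(udh_members[name] for name in udh_names))
    return sum(
        elements[i][volume_prop]
        for i in ids
        if required_prop in elements[i] and volume_prop in elements[i]
    )


def unique_values(elements, member_ids, prop):
    """Distinct values of prop across the given elements; elements without it are skipped."""
    return {elements[i][prop] for i in member_ids if prop in elements[i]}


# Hypothetical sample data: four walls, two overlapping UDHs.
elements = {
    "w1": {"Volume": 2.0, "FireRating": "EI60"},
    "w2": {"Volume": 3.0, "FireRating": "EI30"},
    "w3": {"Volume": 4.0},
    "w4": {"Volume": 5.0, "FireRating": "EI60"},
}
udh_members = {"Foo": {"w1", "w2", "w3"}, "Bar": {"w2", "w3", "w4"}}

volume = total_volume(elements, udh_members, ["Foo", "Bar"], "FireRating")
ratings = unique_values(elements, udh_members["Foo"], "FireRating")
```

Only `w2` is in both hierarchies and has a `FireRating`, so `volume` is 3.0, and `ratings` contains `EI60` and `EI30`.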
Experiments with Neo4j
It seems like Neo4j could drastically simplify the representation of BIMs and UDHs. But I'm concerned about performance, because the data sets are pretty large: every version of a BIM contains around 300K nodes with 100 properties each, and the total number of unique properties per BIM is around 2K. I researched this topic, and it seems that Neo4j does not handle a large number of properties per node well; this is discussed, for example, in the topic "Best practices on number of properties for a node". It was also stated that Neo4j does a linear search among node properties, which could result in slow queries when looking for elements with a specific property.
I made a prototype: I uploaded several models to a Neo4j database and benchmarked requests. A request that collects all unique property names across all BIM nodes, where each node has 160-240 properties out of 500 unique properties, took:
- 747 ms for 10K elements;
- 1568 ms for 20K elements;
- 3804 ms for 50K elements;
- on 100K elements.
Time grew linearly, and on the last test the database just stopped responding.
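The linear growth can be checked directly from the three completed measurements above by computing the cost per element. The 100K figure in the last line is only an extrapolation under the assumption that the trend continued; it is not a measurement.

```python
# Measured timings from the prototype: number of elements -> milliseconds.
measurements = {10_000: 747, 20_000: 1568, 50_000: 3804}

# Microseconds spent per element in each run.
per_element_us = {n: ms / n * 1000 for n, ms in measurements.items()}
for n, us in per_element_us.items():
    print(f"{n:>6} elements: {us:.1f} us per element")

# Assumption: the same per-element cost would hold at 100K elements.
mean_us = sum(per_element_us.values()) / len(per_element_us)
extrapolated_100k_s = mean_us * 100_000 / 1_000_000
print(f"extrapolated for 100K elements: {extrapolated_100k_s:.1f} s")
```

The per-element cost stays in the 75-78 microsecond range for all three runs, so a strictly linear trend would put 100K elements at well under 10 seconds. That suggests the failure on the last test was caused by something other than the scan itself, for example memory pressure.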
Right now I'm thinking of a hybrid solution that uses a column-oriented database such as Vertica, ClickHouse, or HBase to store properties, which should fit the sparse nature of the properties well, and Neo4j to store relationships between nodes.
For reference, this is what I'm trying to achieve:
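To show why a column layout suits sparse properties, here is a toy in-memory stand-in for the property side of such a hybrid. It is not the API of Vertica, ClickHouse, or HBase; the class `PropertyColumns` and its methods are hypothetical. The assumption is that the graph store resolves a version or UDH to a set of element ids, and the property store is then queried per column.

```python
from collections import defaultdict


class PropertyColumns:
    """One sparse column per property name: name -> {element_id: value}."""

    def __init__(self):
        self.columns = defaultdict(dict)

    def put(self, element_id, props):
        # Only properties an element actually has take up space.
        for name, value in props.items():
            self.columns[name][element_id] = value

    def property_names(self, element_ids):
        # A column is relevant if at least one of the elements appears in it.
        ids = set(element_ids)
        return {name for name, col in self.columns.items() if not ids.isdisjoint(col)}

    def unique_values(self, name, element_ids):
        # Touches a single column, regardless of how many other properties exist.
        col = self.columns.get(name, {})
        return {col[i] for i in element_ids if i in col}


store = PropertyColumns()
store.put("w1", {"Volume": 2.0, "Material": "Concrete"})
store.put("w2", {"Volume": 3.0})
store.put("d1", {"Width": 0.9, "Material": "Wood"})

# In the hybrid design these ids would come from the graph database.
ids_from_graph = ["w1", "w2"]
names = store.property_names(ids_from_graph)
materials = store.unique_values("Material", ids_from_graph)
```

Here `names` contains `Volume` and `Material` but not `Width`, and `materials` contains only `Concrete`. The point of the layout is that a query on one property never has to walk through the other properties of each element.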
- Database would contain 600+ building information models.
- Every model has 10-60 versions.
- Every version has up to 700K elements.
- Every element has 20-100 properties out of 16K unique properties.
Requests:
1. Get all unique property names across a single version of a BIM or UDH.
2. Get all unique values of a property across a single version of a BIM or UDH.
3. Get all elements from a single version of a BIM or UDH whose property satisfies a logical expression.
4. Calculate an expression for every element of a single version of a BIM or UDH, like "(propA * propB) / propC * 100".
5. Recursively calculate an aggregate value (like COUNT or SUM(propA)) for every node of a single version of a BIM or UDH.
- Requests 1-3 should take less than 3 sec.
- Requests 4-5 should take less than 60 sec.
- BIM write (import) time should be less than 600 sec.
- Efficient storage. A model has multiple versions. Each version introduces around 10K new elements, deletes around 10K elements, and changes around 10K elements. So in every version most elements remain the same, and they should be "linked" to the new version, not recreated.
- Storage should be scalable to support thousands of BIMs in the future.
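For clarity, here is a small in-memory sketch of what the filter, per-element expression, and recursive aggregate requests from the list above are expected to return. It ignores storage and performance entirely. The function names and the sample hierarchy are invented, and the aggregate counts each descendant once even if it is reachable through several parents, which is an assumption about the intended semantics on a DAG.

```python
def select(elements, prop, predicate):
    """Filter: ids of elements whose property satisfies the predicate."""
    return [eid for eid, props in elements.items()
            if prop in props and predicate(props[prop])]


def evaluate(elements, formula, needed):
    """Expression: apply formula to every element that has all needed properties."""
    return {eid: formula(props) for eid, props in elements.items()
            if all(name in props for name in needed)}


def rollup_sum(children, elements, prop):
    """Aggregate: SUM(prop) over each node and all of its distinct descendants."""
    reach = {}

    def descendants(node):
        if node not in reach:
            found = {node}
            for child in children.get(node, ()):
                found |= descendants(child)
            reach[node] = found
        return reach[node]

    nodes = set(children) | {c for cs in children.values() for c in cs}
    return {n: sum(elements.get(d, {}).get(prop, 0) for d in descendants(n))
            for n in nodes}


# Hypothetical data: "wall" is shared by two floors, so the hierarchy is a DAG.
elements = {
    "slab": {"propA": 2, "propB": 3, "propC": 4},
    "wall": {"propA": 1, "propB": 5, "propC": 2},
    "door": {"propA": 4},
}
children = {
    "building": ["floor1", "floor2"],
    "floor1": ["slab", "wall"],
    "floor2": ["wall", "door"],
}

big = select(elements, "propA", lambda v: v > 1)
scores = evaluate(elements, lambda p: (p["propA"] * p["propB"]) / p["propC"] * 100,
                  needed=("propA", "propB", "propC"))
totals = rollup_sum(children, elements, "propA")
```

In this example `big` is `["slab", "door"]`, `scores` has 150.0 for the slab and 250.0 for the wall (the door is skipped because it lacks `propB` and `propC`), and `totals["building"]` is 7 because the shared wall is counted once. Keeping a descendant set per node is fine for a sketch but would not scale to 700K elements as written.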