Materialized View for Org-roam.db using sqlite triggers

dmg · June 1, 2024, 10:05pm

You can debug the difference in execution time by using

explain <query>;

this will give a lot of info on how it is computed (sqlite is not very good at summarizing the execution plan though)

I would try removing the left join from both queries. It is not needed. If the join (without left) takes longer in the mview, and the plans look the same, the only reasonable explanation is that mview is a wider table than nodes.

akashp · June 1, 2024, 10:12pm

https://hastebin.skyra.pw/ohoserokeg.pgsql

I posted it earlier in here

There is an extra ISNULL running

dmg · June 1, 2024, 10:17pm

that is a very quick operation. Add file NOT NULL to your mtable.

that should remove that check.

akashp · June 1, 2024, 10:21pm

Thanks so much – fixed it, awesome. The check is gone

akashp · June 2, 2024, 7:49am

Posting my final tests -
I conclude the following:

FULL DATA SHEET. < CLICK HERE

Conclusions

Materialized View wins out clearly when Join Complexity is the criteria.
Because of the data structure - joins between nodes_view and files table is slower than join between nodes and files table ever so slightly.

Further we get the result that querying for the intersection of attributes available in the nodes and nodes_view take the same amount of time.

The exact rationale as to why there is a slight degradation when joining with files table from the nodes_view compared to nodes is not clear to me.

select ... from sqlite_stat1;
---------------------------------------------------------------
TABLE | INDEX | ROW SIZE | AVERAGE ENTRIES PER INDEX KEY

files |sqlite_autoindex_files_1 | 10000  1

nodes |sqlite_autoindex_nodes_1 | 20000 1

nodes_view |nodes_view_file | 20000 2

nodes_view | sqlite_autoindex_nodes_view_1| 20000 1

Some Quick Statistic:

(Control vs Experiment)
+ Query for entries in node-list: 0.6 sec vs 0.2 sec (nodes+files+tags+refs+aliases)
+ Query for * in nodes;: 0.1 seconds
- Query for ... in nodes join files;: 0.16 vs 0.20 seconds

So to conclude,

Formulate query from specialized tables for table join < 2
.: for simple query bw nodes & files
Utilize Materialized Views when table join >=2
.: for complex query between nodes, files, tags, refs & aliases

Also Note:

ALL RESULTS ARE BEST AVERAGES

LATENCY GOES UP 4x when doing first query. => 0.2 = 0.8 secs (low 1%), etc
Why Choose best averages? Because they are consistent to compare relative efficiency - when latency goes up - it goes up by equal factors for all classes. These results only show relative efficiency - not absolute time which users should expect to get. Actual times may fluctuate wildly. But they fluctuate by similar factors.

dmg · June 2, 2024, 9:08pm

How many tags and aliases do you have in that database? The performance of the join query is very dependent on the size of the tables it joins.

Yes, it runs 3 times faster (0.6 vs 0.2 s), but this time is negligible compared to the time needed to process the results. And that is with a very large number of files; for users with a reasonable number of files/nodes, the improvement will be smaller.

I think that for most people, the gain will depend heavily on how many tags they use.
I suspect these 0.4 seconds gained are way less than the gain obtained by adding the new constructor and rewriting the creation of the nodes. That is why I feel that optimizing the DBMS should be done last, and why using the profiler should be used to really identify slow parts of the processing.

In fact, if you only did this optimization, the actual gain would have been minuscule since most of the time is spent building the list of candidates (and the GC that comes with it).

For my own instance of org-roam, memoizing org-roam-node-read–to-candidate
and rewriting org-roam-node have made org-roam-find-node almost instantaneous.

But I know that this performance is not going to last for ever because I keep adding nodes.

akashp · June 3, 2024, 4:04am

You are correct -

Thats why we need to put together various fixes - in conjunction to each other. In the end, no one solution will work - we need to put forward many of these together.

For example consider the 2 cache system we had in the beginning - the weakpoint was that when db changed, cache regeneration took so much time, then we saw that the display candidate format cache could speed up the regeneration and so on, similarly this could probably help us save the last mile when the node size is huuge- I mean reaching 100,000+.

For me personally, although I prepared this, this is of no use, because I simply don’t query so much information, this cache actually slows down my query because I only query from the two tables – nodes and files - in so far, one table join is faster for that case.

So I put together many of these fixes, and there is no one solution – but a collection of them - when required. What we require now is a guide that puts together these fixes in a single place, a guide that summarises what we have been doing in a single place.

I am confident - with a combination of various fixes - we can make the system highly responsive in various situations for very high node count, reaching 500,000+ and so on.

d12frosted · June 3, 2024, 4:28am

for some reason it didn’t scroll, so I didn’t see tags initially. Checked on a laptop, and now I can see the full script.

Indeed, you are right.

Topic		Replies	Views
Utility package: org-roam-timestamps Development	6	1415	December 7, 2023
Thoughts on DB structure Development	13	1685	May 8, 2020
Automated testing for extensions of org roam Development	0	419	September 15, 2020
Org Roam on alpine linux Troubleshooting	3	343	October 20, 2021
Org-roam as filesystem Development	6	160	November 16, 2024

Materialized View for Org-roam.db using sqlite triggers

Conclusions

Also Note:

Related topics