You speak of the sun - I speak of the bright star in the sky.
Do you know what the memory limit for garbage collection is? 8MB on 64-bit systems.
You did stop GC from interrupting by avoiding &key (or "named parameters", as you call them). But you have only delayed its appearance; it will come back to haunt you soon. You are refusing to see the systemic problem and treating your own experience as the absolute measure of it.
I think the issue is that the goal of this post was to improve org-roam-node-list. Not to fix the overall performance problems of org-roam-find-node. The two problems are related but are not the same.
I want to thank you, because you showed me the real danger of not letting GC run concurrently. Your code caused my entire 16GB to be consumed in a matter of minutes. If I did not have a running meter in my status bar it would have been a real problem. The original function produced runaway memory consumption very quickly.
Will give it a try. Thank you for the details. Good to know about named parameters and mapcan vs append (to avoid the "ephemeral creation of data objects").
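For anyone following along, a minimal sketch of the mapcan-vs-append point (illustrative helpers, not the actual org-roam code): `append` copies every list but the last, creating short-lived garbage, while `mapcan` splices the sublists together destructively with `nconc`.

```elisp
(require 'cl-lib)

(defun my/collect-with-append (rows)
  "Concatenate ROWS non-destructively; allocates a fresh copy of each row."
  (apply #'append rows))

(defun my/collect-with-mapcan (rows)
  "Concatenate ROWS destructively; reuses the existing cons cells.
Note: this modifies the sublists of ROWS in place."
  (mapcan #'identity rows))

;; (my/collect-with-mapcan (list (list 1 2) (list 3 4))) => (1 2 3 4)
```

The destructive version allocates no new cons cells for the concatenation, which is exactly the "ephemeral object" pressure being discussed.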
@dmg But my problem is that this doesn't carry over into the actual database-querying part. Benchmarking the construction in isolation and benchmarking it inside the query don't match up.
I pasted the wrong result the first time; the GC is indeed low, but the change is very small in the grand scheme of things. For 20,000 nodes you get roughly under 1 second less GC, and even that varies. It is unnoticeable at large node counts.
I think we can both agree that a holistic fix to this issue requires applying a set of fixes in a systematic manner:
a custom constructor function tailored to need, plus
an increased GC threshold, depending on node count (with the catch that when GC does run, it runs with more intensity), a tailored query,
and overall optimisations throughout the read protocol.
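As a hedged sketch of the first item, a "constructor tailored to need" can be a positional (BOA) constructor that fills only the slots the query actually returns, instead of a general keyword-argument constructor. The struct and slot names here are illustrative, not the real org-roam definitions:

```elisp
(require 'cl-lib)

;; Define a positional constructor alongside the default keyword one.
(cl-defstruct (my/node
               (:constructor my/node-create-from-row (id file title)))
  id file title)

(defun my/rows->nodes (rows)
  "Turn query ROWS of the form (ID FILE TITLE) into node structs.
Avoids keyword-argument parsing on every row."
  (mapcar (lambda (row) (apply #'my/node-create-from-row row)) rows))
```

Keyword parsing happens once per constructed object, so on tens of thousands of rows the positional form saves both time and allocation.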
We did reach the same conclusion.
Also, we should not be afraid to take a systems approach to problems rather than trying to solve systemic problems in piecewise fashion.
I suggested not porting something like this upstream, because users will need to write it with their own needs in mind (what information they want to query, and so on). It should be left to the discretion of users; we should only try to elucidate the main principles to be employed in the process.
With this I conclude; the forum is explicitly asking me to shut up now:
"You've posted a lot in this topic! Consider giving others an opportunity to reply here and discuss things with each other as well."
I apologise for being frustrated earlier. It was a pleasure working with you; I got to know many things with your help.
I came across that earlier. I think there are dangers in setting this variable arbitrarily, even with complex mechanisms like GCMH, and I don't think we need to take those risks. I will post one last picture here showing how, with a very minimal change to the GC threshold (not setq but a simple let-binding), we can get very good results: just increasing it to 16MB from 8MB around the calls. The other pane shows that the node count is indeed 20,000. The node list not only generates the list but also does complex sorting based on three criteria.
Not counting the minimal GC, the result is very close to ~0.01 seconds.
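The let-binding approach described above can be sketched as follows; the wrapper name is hypothetical, `org-roam-node-list` and `gc-cons-threshold` are real, and 16MB is simply the value discussed in this thread:

```elisp
(defun my/org-roam-node-list-fast ()
  "Call `org-roam-node-list' with a temporarily raised GC threshold.
The old threshold is restored automatically when the `let' exits,
so there is no global `setq' to clean up after."
  (let ((gc-cons-threshold (* 16 1024 1024))) ; 16MB, per the thread
    (org-roam-node-list)))
```

Because `let` dynamically rebinds the variable only for the duration of the call, any garbage accumulated is collected at the normal threshold afterwards.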
I had the wonderful idea of creating a database of 60k nodes. I took the Linux kernel, added a header to each file with a unique ID and title, and started…
that was more than 6 hours ago!!! At least it seems to be doing something,
though it has not updated the DB file in the last hour or so.
There is a lot of room for improvement in org-roam for large knowledge bases.
I think it's not the fault of Org-roam per se; Emacs is not designed to handle such cases. For very large knowledge systems we might have to give up a DB system altogether and move to something like Denote, which organises based on a naming convention.
I ended up creating only a DB of 18k files/nodes (one node per file).
Here are the benchmark results for org-roam-node-list. They still show an improvement with the new processing: there is less garbage collection.
However, the rest of the processing (without caching) means that it takes around 5 seconds to find a node. I ran with gc-cons-threshold set to 50MB.
One of the problems with these artificial datasets is that the joins don't do much work, because there are no tags, refs, or aliases.
Of course your fix works: it reduces GC by not creating temporary objects, there is no doubt about it. But we need to test at the 8MB GC threshold; that should be our target, because it's the default in Emacs, and there the memory limit kicks in first. Your suggestion was crucial in getting the last mile in my setup.
If the GC threshold is low, the garbage created while querying nullifies the effect of the custom constructor function; that is what I was trying to say.
I also got the display-candidate generation down to under 0.1 seconds by making the generation much faster. So in total I get about 0.2 to 0.3 seconds for the full circuit to complete for 20k nodes.
And please rethink running GC at 50MB. In my tests, anything above 16MB has diminishing returns; moreover, when GC does happen at larger thresholds, it takes longer to complete. Try reducing it to 16MB; I think you will find the same benefits.
Also, I think the joins won't cost much: there are indexes on each of the foreign-key columns of the tables, so they shouldn't alter the result by much.
Oh, this thread is interesting. I haven't read it fully yet, but it seems that you folks found a lot of bottlenecks in the current Lisp implementation of the query/population routine.
One of the other things I found some time ago is how slow multiple joins become as the number of notes grows. I reported it as #1997 and even provided a link to the solution I use in vulpea, but I never found enough time and motivation to contribute it to Org-roam.
"Also, I think the joins won't cost much: there are indexes on each of the foreign-key columns of the tables, so they shouldn't alter the result by much."
So according to my findings, they do tax. A flat table is much faster, at least according to my benchmarks.
So maybe both approaches combined could give tremendous speed improvements.
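A hedged sketch of the flattened-table idea using the SQLite support built into Emacs 29+: run the expensive joins once, store the result in a single denormalized table, and read from that afterwards. The table and column names are illustrative, not the actual org-roam (or vulpea) schema:

```elisp
;; Requires Emacs 29+ compiled with SQLite support.
(let ((db (sqlite-open "/tmp/notes-flat-demo.db")))
  ;; One denormalized row per node; tags collapsed into a single column.
  (sqlite-execute db "CREATE TABLE IF NOT EXISTS notes_flat
                        (id TEXT PRIMARY KEY, title TEXT, tags TEXT)")
  ;; Pay the join cost once, at sync time.
  (sqlite-execute db "INSERT OR REPLACE INTO notes_flat
                        SELECT n.id, n.title, group_concat(t.tag, ',')
                        FROM nodes n
                        LEFT JOIN tags t ON t.node_id = n.id
                        GROUP BY n.id")
  ;; Subsequent reads are a single table scan, with no joins at all.
  (sqlite-select db "SELECT id, title, tags FROM notes_flat"))
```

This assumes source tables `nodes` and `tags` already exist; the trade-off, as noted below, is extra work during delta syncs to keep the flat table current.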
Thanks. I wanted to create a custom table and replicate all the data in one place, but since I didn't have an appropriate data set I thought the indexes would cover it.
Will look into it, thanks for chiming in. You're the OG optimiser; your agenda mod has made a night-and-day difference for me.
Can you please comment on the optimisations I made? I will post the full mod I made to the read protocol. Performance has really increased.
I'll add your optimisations.
I mentioned materialized views here. When I thought about it then, I was looking for some way to offload as much of the processing as possible to SQLite itself; I didn't want to burden the write-to-DB protocol with more entries. But when I researched it further I realised SQLite doesn't have native support for materialized views, so I gave up on that path. My reasoning was mostly data-related: I had no way to run a controlled experiment and extract data, so it did not feel right to try to solve a problem I couldn't be sure about. I will study it more.
Edit:
After reading your vulpea-db.el, I feel you have already made many of these modifications and generalised them very efficiently. We do have to spend more time during delta syncs, but if that is offset by efficiency in this regard, it may be worth it.
I think your package solves many of the efficiency problems, and you identified the bottlenecks well before us. In effect, the bottlenecks in org-roam-node-list and org-roam-node-read--completions are solved more comprehensively by you.
Maybe @dmg's GC-reduction technique would be very useful here; that is all that would need to be imported into your approach.
From my first reading, you have already taken care of the query/filtering/sorting bottlenecks.
Also, if we flatten the structure, we need not be so restrictive about what we take. I have tuned this to my exact needs.
To be very honest, I eliminated all table joins except files, because I could never be sure whether the joins do tank performance, even though I told myself the indexes would be enough. It was always bugging me; thanks for the confirmation.
This can be generalised without losing performance. I think org-roam tries to be too general: in having to process all the alternative choices a user can make, it has to take more indirect routes, and this redundancy comes into sharp conflict with the limits of Elisp itself once the node count exceeds a certain threshold.
It would be excellent to work with you and maybe improve on my approach more.
"Also, I think the joins won't cost much: there are indexes on each of the foreign-key columns of the tables, so they shouldn't alter the result by much."
I think the query evaluation plan will depend a lot on the actual data of a user. The work done by the joins depends a lot on the number of tags, refs and aliases used.
In fact, I have been thinking that there is no reason why this table couldn't be kept up to date as a user modifies files.
By the way, you can use EXPLAIN in SQLite. Its output is not as readable as that of other DBMSs, but it gives an idea of what SQLite does when answering a query.
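For example, `EXPLAIN QUERY PLAN` can be run directly from Emacs 29+ with the built-in SQLite functions. This is a sketch: the database path and the table/column names are illustrative, not the exact org-roam schema:

```elisp
;; Requires Emacs 29+ compiled with SQLite support.
(let ((db (sqlite-open "~/.emacs.d/org-roam.db"))) ; illustrative path
  (sqlite-select
   db
   "EXPLAIN QUERY PLAN
    SELECT n.id, n.title
    FROM nodes n
    LEFT JOIN tags t ON t.node_id = n.id"))
;; Each returned row describes one step of the plan, showing whether a
;; table is scanned (\"SCAN nodes\") or an index is used
;; (\"SEARCH tags USING INDEX ...\").
```

This makes it easy to check whether the foreign-key indexes discussed above are actually being used for a given query.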