Rewriting org-roam-node-list for speed (it is not sqlite)

dmg · May 28, 2024, 5:32pm

One more thing.

The time spent in the query is currently just a very small fraction of the overall time.
So even if we have an ultra fast query, the rest of the processing will result in a slow org-roam-node-find

d12frosted · May 29, 2024, 6:51am

Look, I don’t say that the horrific SQL I’ve contributed to Org Roam is the only source of slowness. I am pretty sure that with Org Roam there are two key factors - the code that manipulates SQL result (at least whole this thread acts as evidence) and multiple joins.

The benchmarks I’ve shared prove that multiple joins are slow. After all, I didn’t change the processing code - the difference between numbers is based solely on switching from joins to a flat table. More over, flat table scales horizontally unlike joins, which is why I think it’s critical for (a) performance in Org Roam and (b) providing good APIs (today IMO it’s messy and leaky).

Thank you! I will take a look. This week is super busy for me, so I can’t promise to provide any valuable feedback very soon. But I’ve started to read the code. So thanks for sharing.

I don’t think generalisation is a problem here. Org Roam was designed for a very specific use-case. And once people started to use it outside this use-case, it turned out that it doesn’t scale. And instead of changing the core, people just added more leaking abstractions.

IMO there is a need of a core, which has nothing to do with being a Roam replica. Basically a performant database that can scale for 1m+ notes and some good APIs which you can use to implement Roam replica, Vino, various Wikis, etc. Basically, this is why I’ve created Vulpea. I hoped that V2 implementation would render Vulpea obsolete, but here we are.

And important to note. I am not blaming anyone. When I speak about problems I just mean that there are things that we as community can solve.

dmg · May 29, 2024, 7:50am

Sorry, I don’t know what vulpea does. I will check it out.

I do agree 100% than joins are slower than querying a single table. Plus, this query has 3 group-bys. But as the size of the database grows, the time inside emacs processing the results will still make searching for nodes unusable, even if the query is extremely fast.

I was thinking about how to make this particular query faster (org-roam-node-list)

I think the challenge is how to improve performance without adding complexity.

Are you suggesting to change the schema and keep some data non-normalized? That will require changes in many areas.

Another potential solution is to materialize the query. There are several strategies.

one is DB managed (using triggers): eg. SQLite triggers as replacement for a materialized view | madflex

Another one is to keep a timestamp of when the last node was added/modified/deleted, and if the materialized query is older, then recompute (async). This can be controlled with a flag.

By the way, my org-roam db is around 1k nodes, 400 tags, and 1 alias.

d12frosted · May 29, 2024, 9:03am

Yes, I understand But again, if you fix the code, you will still eventually run into issue with joins. What I am trying to say is that both issues must be addressed. At least to my understanding.

This is why I never found enough motivation to contribute a flat table to Org Roam. And considering there was no interest/traction in the proposal + war + personal stuff… You know the drill

I might be mistaken (and please correct me if I am confusing things), but sounds like you are mixing two concerns into one pot - write efficiency and read efficiency. All for the sake of keeping the performance improvement implementation as simple as possible with regards to current implementation. Which I can relate as you can see

The way I see it there is no real issue with writing. I mean, it can be improved in many ways, but there is an issue with read performance and it’s scattered all over Org Roam code base. Maybe I am pessimistic, but I don’t see a quick solution - there is a need to rewrite quite a lot of the stuff. Most of the improvements I’ve seen so far (and I am sure I haven’t seem all of them) - in this thread and in other threads - are covering a subset of features and use-cases (i.e. they ignore different slots of org-roam-node structure or access patterns).

Oh thanks for sharing. I need to read about that.

d12frosted · May 29, 2024, 9:11am

Ok, turns out, it’s not a big gist

As I said, this thread is interesting and I want to replicate/benchmark some of the claims myself For example, the new constructor - this is an interesting finding. I also recall Chris Wellons having a few articles on Emacs Lips performance. For example, How to Write Fast(er) Emacs Lisp.

Overall, I feel like while the gist does explore performance improvement options, it does it in a very narrow fashion. Just to explain what I mean let me ask a question. How would you modify the gist to include tags in +org-roam-node-display-template? I understand that your gist is just an example, an illustration of approach, but I am still curious

akashp · May 29, 2024, 11:59am

For my personal part - I dont need to - I rely on a custom tags finding mechanism to search and go through them using ripgrep. I just use org roam as a db for my IDs thats all. That is why I tore it down entirely I need it to do something very specific. No table joins, no nothing - just pure performance, just the basic core. Its for my very personal use, my idea is that people should make their own read functionality - just dont query what is not needed.

Even perhaps come up with with multiple node-list functions when to query what is needed - my feeling is that if we are querying too much information - and GC comes into play - it always doubles execution time, so while programming in elisp - we have to be extremely frugal - memory limitations will come much before computation limitations.

The constructor function limits the creation of temporary variables and staves off GC.

I want to discuss it in more detail - but this thread has gotten so long - I just want to give you a gist of what I find interesting - For example quering for properties triggers GC fast, so we need to be very frugal in what we query - it is an absolute waste of resource to query everything all the time. What do you think?

akashp · May 29, 2024, 12:10pm

Also I am curious - why do people use the read mechanism to search for all these informations - why dont we write sql statements to get complicated data? Why rely on the read mechanism and create this redundant data pulling from the DB which calls GC?

I feel like the node-list function is simply an example - and users ought to destroy the structure to erect in its place their own.

The display candidate generation is another maze - why have this very complicated and slow mechanism to deduce the display format - its ridiculous how many different function it goes through to when all i needed for my use case was just one 3 lines function.

This has immense improvements because I believe we need to do everything to prevent memory limitations from coming into play – this is my intuition so far.

nobiot · May 29, 2024, 1:56pm

I understand this sentiment from those who can code SQL statements easily within Elisp. I have also used Org-roam as a collection of “sample” code – good examples.

I think a simple response to your first question is, “why not?”. Org-roam V2 was developed to cater to a combination of two main sets of requirements:

Needs of its user base and learning from V1 implementation. Many of us do not know how to write SQL or Elisp, but are willing to put in some efforts in learning Org-roam and Emacs for their own note-taking workflow.
Needs of the author (Jethro) for ease of maintenance and his note-taking workflow.

You can see this from Jethro’s blog articles “Releasing Org-roam v2 (July 2021)” and “Org-roam: A Year In Review (December 2021)”.

You can have your own evaluation about how well Org-roam has achieved these goals.

My sense is that Org-roam has done well for these goals, and lets many users customize it to their own note-taking requirements without having to dig into the technical detail of SQL and Elisp implementations.

As a consequence, some advanced users have reported performance issues when they have about 4-10K notes – now thanks to many including you and this thread there are some mitigations available – I find this a great contribution to the community. So thank you, @akashp @dmg @d12frosted (and the GitHub PR thread #2341 and some other issue threads).

I am hovering at around 300 notes mark for the past 1-2 years, so I experience no performance issue. And, even with this number of notes, I feel I have still benefited from having Org-roam for personal notes and work notes alike. And I feel reassured that I have solutions when I hit a performance issue thanks to your collective efforts

It may look “ridiculous” to those who can see the technical detail, but at the same time, its ease of customizing gives a lot of control and power to the hands of users. I think it’s a good thing about Org-roam.

d12frosted · May 29, 2024, 2:03pm

And I find it troubling if we are talking about a library. Of course, everyone can figure performance issues by themselves, but it’s not helpful when we talk about the library itself. Aye, in a way performance and devex is a spectrum, but IMO when we talk about Org Roam it’s possible to come up with a compromise.

A few random thoughts:

There is a public API with org-roam-node defined. There are slots. If you get org-roam-node from a DB, it must be fully populated otherwise it’s a bad API. Of course, you can say that org-roam-node is bare minimum and everything else is queried separately from DB, but it has little value to the end user that needs this extra piece - they have to perform extra queries and they need to figure out performance by themselves. Using ripgrep instead of SQLite? Boy, it’s getting too complicated We are talking about mere 10-100k rows in a local database
IMO if a library enforces users to query from its DB directly, then it offers a bad API and makes further changes more complicated. Every org-roam-db-query outside of Org Roam is a loan that someone needs to pay in the future.

It’s a waste for you. Maybe for someone else. Maybe for everyone. The real question is - if get/query doesn’t return something (for example, tags), why is it part of the org-roam-node struct definition? And if it is still part of it, how would you get it? Considering one of the driving forces of the query performance topic - completion - how would they do it in a performant way?

And regarding tags, specifically, I am 100% sure people will come and ask these questions because I have examples Aliases? Not sure. Links? Not sure.

I hope my point is clear

That’s a good question. I think it’s more of a balance between performance and API conveniency. As long as API is fast - no one cares about extra data being pulled. And once you hit performance issues, it boils down to optimisations that are relevant for specific use case. A few examples:

There is no need to pull all data, let me just implement a quick list operation tailored for my completion pattern. Basically what you do here.
My notes are ‘namespaced’ by certain tags, and I don’t normally want to mix notes from different namespaces in a single completion, so I can reduce amount of in-memory notes by using some specialised queries. Basically, what I do with specialized queries in vulpea.

But of course, there might be some issues with the library itself and fixing them bring performance boost, so you can go back to your convenient APIs

akashp · May 29, 2024, 2:18pm

I agree with both of you - @nobiot and @d12frosted - I dont find these defects in org-roam itself - I was merely listing a mitigating technique that can be applied. I am definitely NOT ASKING for these to be pushed upstream - nor do I want org-roam to come default so restrictive – I was merely suggesting users who want performance can merely be frugal and get it –

I think we should by all account never take the “user” as naive – I believe users will figure out things very fast,

In so far I believe this is not to be fixed issue, because there is nothing to fix - but we should change our approach to how we perceive this problem, I think there should not be a large change to the basic principles of org-roam it is very stable, and with this - I feel the performance debate is solved for me – all the causes and mitigations have been very clearly and elaborately detailed.

Thank you all.

nobiot · May 29, 2024, 2:35pm

I think that the change @dmg has explored and suggested in an earlier thread would be worth proposed upstream, because it’s non-intrusive and yet improves performance.

– I believe your point was that the improvement with this method is minuscule under a certain condition, or it is not relevant for your personal changes you have implemented.

I could not really follow this part of the debate, because this change seems to still improve the performance under other non-trivial conditions for other users. Did I misread it?

akashp · May 29, 2024, 2:39pm

I think @dmg wants to not use named parameters &keys and directly supply the constructor function with all the args –

but this will require us two constructor fn and one dedicated to this querying ops because we need a cons funs that takes named params too

My opinion was that for users who use 8mb gc - this wont show up because the bottleneck in query will hit first and if gc runs – it would take the same amount of time.

So if user wants to put forward a custom cons fn – they would have to follow through more customisations and leave upto the user here to select their query lists and changes they would want to make

basically we cannot pull out the cons fn that takes &keys – and putting another one alongside it whose effects will show up prominently when other mitigation tactics are in play will be messy.

dmg · May 29, 2024, 2:55pm

For my use case, finding an existing node (org-roam-node-find) is the bottleneck.

When I profiled the code, 10-20% of the time it was sqlite. The rest was emacs.

With the improvements I have proposed:

Use of a new constructor for org-roam-node
Remove unnecessary data structures and computation (and some GC)
Use of a cache

Has made it go from 2 seconds to feel (most of the times) almost instantaneous.

My approach is always to hit the performance bottlenecks where they happen with the minimum intrusion, so I can do it with the smallest changes. Even a rewrite benefits from this work, since it give us an understanding of what the bottlenecks are and we can prepare for them.

May I ask what are your performance complaints when using org-roam (aside from org-roam-node-find)? Perhaps I don’t use some features that are important to others that are affecting the user experience.

dmg · May 29, 2024, 3:03pm

Yes, and document it well so it is clear that that constructor is only to be used for when creating a large number of nodes at once (and why).

The other constructor needs to be kept. It is used in a lot of different places.

akashp · May 29, 2024, 3:09pm

I think if you should put this - you should also pull in the value of org-roam-db-gc-threshold here its value is set to 8mb by default but this was created for org-roam to get temporary gc limit boosts - documentation should also state to push it to 16mb or somewhere around there to get the benefits.

This will extend the GC threshold for the query and will let your cons function shine through and show its effect more prominently – it will be non intrusive too because the variable is already defined in org-roam.

I think if a custom cons function is present and formatted correctly it should also hint to the user that they may choose to not fill all the details - but for api reasons we need to fill all the slots - I think with the GC value pulled here with ample documentation - this makes perfect sense.

dmg · May 29, 2024, 3:13pm

Isn’t the API the sqlite database? Wasn’t that the point of using SQLite?

For me it is. I have several functions where I query the DB for the information I want (and in some cases I even modified the DB directly).

and if that is the case, for finding nodes, we do not need to create a org-roam-node. All we need is a function that opens the file that has that node in the proper location. In a way, that is the beauty of org-roam and why it is easy to build on top of it.

dmg · May 29, 2024, 7:53pm

Ok, i did a bit of research. Here are all the functions that read the entire collection of nodes as part of the their processing (I have to say, I rarely cross-link notes)

org-roam-refile
org-roam-node-find
org-roam-node-random
org-roam-node-insert

d12frosted · May 30, 2024, 5:51am

My bottleneck was always org-roam-db-query and whatever uses it. I’ve many flows when I query information and not only for completion (for example, I build two sites from subsets of my private notes).

My second issue with Org Roam lies within its API. On one hand, it’s tied to a single use case - Roam replica, so more generic use cases are not well considered/hacked. On the other hand, it’s API is leaky. Transition from v1 to v2 clearly displayed it.

I can’t say for the Org Roam author, but IMO nope and nope. SQLite is a mere implementation detail. And the fact that users are forced to query cache database says a lot about API. Once users start to rely on implementation detail (i.e. use them directly), it become incredibly complex to actually move forward and perform changes that are not breaking but beneficial to users.

Just as an example, when vulpea#116 landed I had to change zero code in my 10+ apps/scripts that are using this library, but everyone gained huge performance boost. I don’t think that vulpea is a pinnacle of design, but I am just trying to emphasise that some abstractions are good. And they are needed for evolutionary process without changing every piece of code that uses the library.

But I think we detracted from the original thread.

akashp · May 30, 2024, 8:02am

I agree with you here, the query is also a bottleneck – OP doesn’t have the problem currently because I think of a mixture of factors - one being that the query for their case doesn’t spend a lot of time in table joins - which was also my case, the second being - OP is running a very high abnormal GC threshold that hides the problem somewhat under the rug.

I want to ask you something else - do you think latching onto the write to db protocol will be a good compromise - given the sync takes like forever for large files already?

I think vulpea will work for some people, but some others may not want to make the tradeoff. Also I agree with you here that users having to break the api to get what they want looks bad from software development pov – but I think a lot of users stayed back in org-roam v1 and never migrated, it shows I think org-roam as being a library for people to build their workflow around rather than seeing it as a stand alone software. What I am saying is that I think we should not put a library like org-roam in the same plane of existence as something like Microsoft Excel or Obsidian or Logseq or however many that comes and goes.

I find that beautiful in org-roam. Its so simple and effective, it builds on things existing rather than trying to build a wheel. These are my feelings.

d12frosted · May 30, 2024, 2:29pm

In any case, there is a compromise to be made – either to optimise for write operations or to optimise for read operations. Given the most common use patterns, I believe it makes sense to optimise for write operations. This is why, in Vulpea, I’ve opted for a flat structure. However, this decision comes with its own tradeoff: the need to build the cache upfront. In my experience, this is acceptable because it is done once, and then the cache is populated incrementally, which works fast enough.

That being said, I do think the write routine can be improved. I’m confident that performance enthusiasts could extract significant optimisations.

My main frustration, however, is the lack of a good way to extend or hook into the Org Roam sync routine. In Vulpea, this necessitates parsing the file a second time. This is something I’m looking to address, although I fear it might lead me to abandon Org Roam under the hood, which is ultimately an implementation detail.

What tradeoff are you talking about? If you are talking about write performance, it’s quite neglectable unless you have a huge file or perform a full re-sync. But the current situation at least allows to try vulpea without breaking Org Roam.

I tend to disagree with your conclusion. Assuming that many users stayed with V1, it simply means that they find V1 better suited to their needs compared to V2, or that the merits of V2 do not justify the cost of migration. I am certain that some people use Org Roam as a library (I am one such example – I use no interactive functions from Org Roam). For these users, the cost is even higher because, unlike ‘standalone’ users, they have to migrate their notes and all their code. The more they relied on implementation details, the harder the migration becomes.

When V2 started, I tried to provide feedback to make V2 more suitable for library writers, and I believe Jethro improved the situation tremendously. However, the migration wasn’t easy for everyone.

Despite any critical comments, I have no negative thoughts about Org Roam or its maintainers. I am very happy that Org Roam became a part of the Emacs landscape. So I share your positive view

Topic		Replies	Views
Improving performance of node-find et. al Development	31	1020	July 28, 2024
Org-roam-node-read speedup without cache - Guide Guides	2	115	June 28, 2024
How to make org-roam-node-find faster? Requests	4	711	December 31, 2022
Improving org-roam-format-template (next to make org-roam-node-find fast) Development	8	146	May 30, 2024
Difficult to use at >20,000 nodes? Troubleshooting	2	370	October 23, 2023

Rewriting org-roam-node-list for speed (it is not sqlite)

Related topics