Rewriting org-roam-node-list for speed (it is not sqlite)

I will maybe give the next month to extracting the table-flattening strategy from vulpea. I don't know if triggers would work for us, but I am willing to experiment with the path @dmg has suggested.

– Currently I am hesitant to switch to vulpea from my personal use case, not really because of performance requirements, but because I think your package is exceptionally good at augmenting org-roam, to the point of replacing it. Vulpea is in itself a replacement for org-roam; it merely utilises org-roam for its implementation, so users interact with vulpea instead of org-roam. It has many additions to its structure that will no doubt expand what users can do. But my problem with org-roam has always been that it does too much! So I am always looking to blow large chunks out of it in my distribution, things that are redundant – I think I yearn for the same thing as you: an org-roam-core.

I think if, in the long run, you take the route of detangling from org-roam, nothing will change, because vulpea doesn't share a user-facing interface with org-roam – except that instead of having to install two packages, users may install just one.

I am studying your code – and since I have no doubt you're exceptionally good at it, I will unashamedly steal from your code back into my distribution. But here too I need to choose the things I want to include in the db – for example, I don't have any requirement for updating the attach dir back to the db, the meta table and so on.

So I think my requirement is, finally, a convex combination of vulpea and the org-roam db, and instead of trying to get to it through configuration, I want to code the delta, so that I can make a direct jump from my position – but this is just my personal implementation problem.

Your package is no doubt exceptionally optimised otherwise.

That’s exactly what my org-node did, plus detangling from SQL – so I’m pinging you in case we’re reinventing the same wheel! If you’ve made progress on that path since May, we could learn from each other.

I think that sqlite as an external caching mechanism is a great idea. Usually the performance issues we have been observing in org-roam are not due to sqlite, but to the decisions and implementation of its functionality.

@akashp has done excellent work creating a materialized view that makes querying the database a simple sequential scan.
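SQLite has no native materialized views, so the usual pattern is an ordinary table kept in sync by triggers on the base tables. The following is a minimal, hypothetical sketch of that pattern (the table and column names are made up for illustration, not @akashp's actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Base tables, loosely modeled on an org-roam-like schema (hypothetical).
CREATE TABLE nodes (id TEXT PRIMARY KEY, file TEXT, title TEXT);
CREATE TABLE tags  (node_id TEXT, tag TEXT);

-- "Materialized view": a flat table that pre-joins nodes with their tags,
-- so reads become a single sequential scan with no JOIN at query time.
CREATE TABLE nodes_view (id TEXT, file TEXT, title TEXT, tags TEXT);

-- Triggers keep the flat table in sync with the base tables.
CREATE TRIGGER nodes_ai AFTER INSERT ON nodes BEGIN
  INSERT INTO nodes_view VALUES (new.id, new.file, new.title, '');
END;
CREATE TRIGGER tags_ai AFTER INSERT ON tags BEGIN
  UPDATE nodes_view
     SET tags = tags || ':' || new.tag
   WHERE id = new.node_id;
END;
""")

con.execute("INSERT INTO nodes VALUES ('abc', 'notes.org', 'My Note')")
con.execute("INSERT INTO tags VALUES ('abc', 'emacs')")
print(con.execute("SELECT title, tags FROM nodes_view").fetchall())
# → [('My Note', ':emacs')]
```

A real implementation would also need DELETE/UPDATE triggers, but the point is that the expensive join work happens once, at write time.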

Also, one of the slowest tasks is parsing an entire collection of org files to create the original database. But this does not have to be done in emacs. There is a very good library for python to parse org files. Thus, mass updates of the DB do not have to be done inside emacs. The sqlite database becomes an API.
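Whatever parsing library one picks, the "DB as API" idea can be sketched with just the Python standard library: scan org files for IDs and titles (a crude regex here; a real org parser is far more robust, and IDs properly belong to headings, not whole files) and load them into SQLite from outside Emacs.

```python
import os
import re
import sqlite3
import tempfile

# Hypothetical minimal sketch: extract node IDs and the file title from an
# org file and insert them into SQLite, entirely outside Emacs.
ID_RE    = re.compile(r'^\s*:ID:\s+(\S+)', re.MULTILINE)
TITLE_RE = re.compile(r'^#\+title:\s*(.+)$', re.MULTILINE | re.IGNORECASE)

def index_org_file(path, con):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    title = TITLE_RE.search(text)
    for node_id in ID_RE.findall(text):
        con.execute("INSERT OR REPLACE INTO nodes VALUES (?, ?, ?)",
                    (node_id, path, title.group(1) if title else None))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE nodes (id TEXT PRIMARY KEY, file TEXT, title TEXT)")

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "note.org")
    with open(path, "w", encoding="utf-8") as f:
        f.write(":PROPERTIES:\n:ID: 1234-abcd\n:END:\n#+title: Example\n")
    index_org_file(path, con)

print(con.execute("SELECT id, title FROM nodes").fetchall())
# → [('1234-abcd', 'Example')]
```

Emacs then only reads the resulting database; the mass update can run in a cron job or a git hook.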

The real problem with the database is how to facilitate incremental search on a node (title or filename) when there are thousands of nodes. Each node from the DB needs to be converted to a suitable emacs data structure. This takes time. And what is worse, this work is discarded every single time we query for a node (with the corresponding penalty in garbage collection).

This is where the caching work of @akashp shines (both at the database level with the materialized views, and caching the nodes data structures so they don’t have to be retrieved from the DB every single time).

Nonetheless I agree that org-roam is one implementation of the fundamental idea that a node is a heading with an ID, and that a node has tags and potentially an alias.

This means there are multiple “org-roams” that can be built, and perhaps that is the best way to move into the future.

I also agree with @meedstrom1 that, for me, there is a before starting to use org-roam and an after. Now I want to have all my org files (in many different directories) under org-roam. It has allowed me to forget about where the files are actually located (I now find them as nodes). This is why I feel that the templating system of org-roam is lacking (e.g. I have now implemented templates where you can specify the directory where you want the new node to be created: dmgerman/org-roam@cc89958, “Add support for functions in the specification of a path”).

Ideally I would like to be able to blacklist subdirectories/files from org-roam processing (e.g. org files that come with modules I download or repos I clone). Other than that, org-roam (or its equivalent) should handle ALL the org files that have at least one ID.


Yea, most of the work is not EmacSQL, but the fact that org-roam scans files by visiting each file with org as the major mode, and uses an expensive functional style of programming (it’s expensive in Elisp). By contrast, the function org-node-worker--collect-dangerously reads almost like a C program, staying in fundamental-mode, making no large point jumps (also expensive in Elisp) and processing each file in a single pass.

From my experience, the bottlenecks of a C-u M-x org-roam-db-sync seem to be:

  • ~60% using org as major mode
  • ~20% the programming style - it’s slow enough on its own that even in an already-open org buffer, saving a big buffer containing 500 nodes & 3000 links takes a few seconds on my 1GHz tablet
  • ~20% EmacSQL or SQLite. I say that because my org-node-fakeroam-db-rebuild does nothing but translate from org-node’s tables and send it to the DB in the exact same way org-roam does, so I expected it to be instant, yet it takes about 20 seconds. There is probably a way to batch transactions or configure the DB with some SQL PRAGMA calls so that it would be instant…
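
On that last bullet, the standard SQLite tricks are to wrap the whole batch in one transaction and relax the journaling PRAGMAs. A hedged sketch of what that configuration might look like (I haven't checked what EmacSQL actually does here; the schema is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE links (src TEXT, dst TEXT)")

# Typical speed-oriented PRAGMAs: trade durability for write throughput,
# which is acceptable for a cache that can always be rebuilt from the org files.
con.execute("PRAGMA synchronous = OFF")
con.execute("PRAGMA journal_mode = MEMORY")

rows = [("id-%d" % i, "id-%d" % (i + 1)) for i in range(10000)]

# One transaction for the whole batch instead of one implicit
# transaction (and one fsync) per INSERT.
with con:
    con.executemany("INSERT INTO links VALUES (?, ?)", rows)

print(con.execute("SELECT count(*) FROM links").fetchone()[0])
# → 10000
```

Per-statement transactions are usually the dominant cost; batching them is often the difference between seconds and milliseconds.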

Org-node implements none of the above bottlenecks, so it parses the entire collection of org files in a second, whereas it takes org-roam three minutes.

It’s interesting to ask why SQL is a good idea – brings to mind https://yourdatafitsinram.net/, although that’s a slightly different lesson. The amount of data we need to handle is small; a few thousand nodes and ten thousand links is nothing to a modern computer. It seems the main thing it brings to the table is giving people familiar with EmacSQL or SQL a way to query for complex things like “all nodes that have a link back to this one” or “all nodes with tag TAG and DATE > 2024”.
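For illustration, that backlinks query really is a one-liner in SQL. The table and column names below are hypothetical, not org-roam's exact schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE nodes (id TEXT PRIMARY KEY, title TEXT);
CREATE TABLE links (src TEXT, dst TEXT);
""")
con.executemany("INSERT INTO nodes VALUES (?, ?)",
                [("a", "Alpha"), ("b", "Beta"), ("c", "Gamma")])
con.executemany("INSERT INTO links VALUES (?, ?)",
                [("b", "a"), ("c", "a"), ("a", "c")])

# Backlinks: every node whose outgoing link points at node 'a'.
backlinks = con.execute("""
    SELECT n.title FROM links l JOIN nodes n ON n.id = l.src
    WHERE l.dst = ? ORDER BY n.title
""", ("a",)).fetchall()
print(backlinks)
# → [('Beta',), ('Gamma',)]
```

The same query against an in-memory Elisp structure is a hash lookup plus a loop, so the expressiveness argument cuts both ways for data this small.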

You see the downside of SQL right in this thread: that you need to get the data out of the DB and cache it in a suitable emacs data structure anyway to have a performant search. So why have the DB at all, why not leave it out and use only the emacs data structure?

It’s like putting a perfectly serviceable car on the back of a truck and then driving the truck to get the car around. If we already have a cache, we shouldn’t wrap it in a second cache that just caches the results of the first.

I sense this is down to a misconfigured DB. I’m still surprised that getting data out of a 1–5 MB database would take that long; it seems to run counter to the whole purpose of SQLite. Is that what @akashp fixed? I’m not sure I understood the materialized view. :grinning:

Exactly. The DB becomes the API.

Run the profiler. I did. What I observe is what I said at the beginning of this thread: it is easy to blame the database, but the database is NOT the reason org-roam functions are usually slow (neither is functional programming by itself, but that is a different topic).

This is the output of profiling org-roam-db-sync (I erased the database, 76% of my test was org-roam-db-sync, and 23% doing garbage collection)

You will see that the time inside actual SQL code is very small. Most of the time is spent running org-mode (almost 31%, plus running hooks for extensions that sit on top of org-mode, another 18%). 49% is org-roam and 23% garbage collection. Simply loading the file in fundamental-mode (as you correctly do) would save 50% of the time.

9% is org-roam-db-insert-link, but mostly getting the data from the file (6% is org-id-get).

So I agree, the processing of the org-file is what makes org-roam-db-sync slow.

But org-roam-db-sync is run on an empty filesystem only once in a blue moon. It is not worth optimizing before many other things.

It would be nice if my parser didn’t have to look like a C program. It may be possible with some refactoring – as you say, functional programming needn’t be slow. Funcalls have some overhead in Emacs Lisp (especially anonymous lambdas), so one just needs to be smart about it.

But the problem with the org-roam codebase is more the kind of thing you mention: calling org-id-get to get the ID each time. Better to think about how that function does its thing (probably a lot of back-and-forth searches) and see if it’d be faster to roll it up into a bigger function that collects the other metadata at the same time, instead of calling org-id-get, then org-get-title, then org-get-properties and so on. That’s what I mean by the programming style: it easily compounds into a lot of duplicated work if you’re not careful.

I suppose the org-element-cache is meant to help with that, if you haven’t disabled it due to the bugs. So now we have three caches in the game :wink: I wonder if you could actually build the org-roam DB straight from the org-element-cache. Just visit all files once, and the necessary info may already be stored!


But org-roam-db-sync is run on an empty filesystem only once in a blue moon. It is not worth optimizing before many other things.

Actually, I used to run C-u M-x org-roam-db-sync a lot. Partly because I publish my files out of a copy in /tmp and ran a bunch of DB queries in the publishing process. Partly because I happened to find the DB out of sync pretty often, not sure why.

Anyhoo, the speed of a full DB sync is very related to the speed of saving one big file, which is a much bigger problem: that must be instant!

I run org-roam-db-sync daily too (mostly because I use different computers, and the DB is not shared).

One suggestion: run it with a cron job, outside your main emacs instance. The DB will be locked during this process, so there is no chance the DB will be left in an inconsistent state. I run it at 4 AM.

I agree with you on most (if not all) points. I would say that Jethro Kuan prioritized having no bugs over a fast solution. I think it was the right decision. The next generation of org-roam can address some of these performance limitations without breaking the model that he developed.

As you already mentioned, your implementation is significantly different, but the org-roam files are still the same, and fully compatible with the original org-roam. I think that is what we should strive to continue achieving: solutions that don’t break the org-roam model (at least until we find that the model is not good enough).

I so much wish that denote (Prot’s alternative to org-roam) had used IDs instead of its own IDENTIFIER field. Had Prot used ID, both tools could have coexisted.

I don’t use any cache myself – that was my whole point: to not have to use a second cache. I posted my solution a few months ago – it got no traction, but I haven’t touched elisp in a while and everything has been working well. This problem is a done deal for me, at least.

@dmg Professor, you and @meedstrom1 have been talking about the sync protocol in a thread dedicated to the read protocol – maybe a new thread would be easier to follow for the topic at hand.

Thanks for sharing. I will take a look :pray:

But you can do it out of the box. Just use sexps in your templates. I’ve been doing this for ages already (in vanilla org-mode and, I think, org-roam v1):

;; template value
'(("d" "default" plain "%?" :if-new (file+head "%(vulpea-subdir-select)/%<%Y%m%d%H%M%S>-${slug}.org" "#+title: ${title}\n\n")
  :unnarrowed t))

;; function impl
(defun vulpea-subdir-select ()
  "Select notes subdirectory."
  (interactive)
  (let ((dirs (cons
               "."
               (seq-map
                (lambda (p)
                  (string-remove-prefix vulpea-directory p))
                (directory-subdirs vulpea-directory 'recursive)))))
    (completing-read "Subdir: " dirs nil t)))

You can achieve so many things with this :slight_smile:


Thank you. This is very useful. I didn’t realize that it was possible to have an s-exp in this way (I also get confused with using ' and ` in the template, but that is a different story).

But I got very puzzled, because the documentation says that a target is required (but this template works and it does not have a target).

I had to dig into the code to understand what was going on.

This is actually a deprecated feature. This template should use :target instead of :if-new. I tested it, and that works too.