Materialized View for Org-roam.db using sqlite triggers

akashp · May 31, 2024, 7:13pm

Testing a Materialized View implementation using triggers.
Requesting testers and general help and advice, thanks for your time.

gist.github.com

https://gist.github.com/akashpal-21/561d4e5d6939427534b5b38982582946

org-roam-db-materialized-view.md

# The story so far

While using Org-roam - the data is stored across tables in the database.
The tables are as follows

``` example
(files . file title hash atime mtime)

(nodes . id file level pos todo priority scheduled deadline title properties olp)

This file has been truncated. show original

Test Suite to generate org samples is given below – click arrow:

Test Suite

#!/bin/bash

# Prompt the user for the number of files to create
read -p "Enter the number of files to create: " N

# Directory to store the files
OUTPUT_DIR="org_files"
mkdir -p $OUTPUT_DIR

# Loop to create N org files
for ((i=1; i<=N; i++))
do
  # Generate UUIDs
  FILE_UUID=$(uuidgen)
  H1_UUID=$(uuidgen)

  # Create the org file
  FILE_NAME="test_${i}.org"
  cat <<EOL > $OUTPUT_DIR/$FILE_NAME
:PROPERTIES:
:ID: $FILE_UUID
:ROAM_REFS: https://www.example.com
:ROAM_ALIASES: alias_${i}0
:END:
#+TITLE: test_$i
#+FILETAGS: :tag0_$i:

* h1_$i                                         :tag1_$i:tag2_$i:
:PROPERTIES:
:ID: $H1_UUID
:ROAM_ALIASES: alias_${i}1
:END:
EOL

  # Print progress
  echo "Created file $i of $N: $OUTPUT_DIR/$FILE_NAME"
done

echo "Created $N org files in $OUTPUT_DIR directory."

akashp · May 31, 2024, 9:02pm

Query Results

Node size: 20k
Tags: 30k
refs: 10k
aliases: 20k

Code sheet:
https://hastebin.skyra.pw/akudehizuv.org

dmg · May 31, 2024, 11:45pm

Some suggestions.

They are called materialized views, not material views.
Add foreign key constraints to the table
Use a query to create the table, because otherwise your code assumes that the triggers populate it. Simply do create table XYZ as select …;
Use a good name for the table. I would add _mview (or something like that as a prefix to inform the user that it is actually a view).

It looks good, but i think it would need testing. The good news is that it is trivial to test it.

Enable your changes to the DBMS, keep using org-roam and after some days run:

WITH original as (select ... <original query here>),
    first as (select * from myview except select * from original),
    second as (select * from original except select * from myview),
    select * from first union select * from second;

should return an empty set.

dmg · May 31, 2024, 11:46pm

By the way, these changes will not reduce the amount of GC collection that the query requires to run inside emacs. Only the execution time of the query.

dmg · May 31, 2024, 11:51pm

Oh, important

Add indexes on each of the attributes you search for. Otherwise each update will be O(n) on the number of tuples in the table.

(less important)

Primary keys by definition cannot be null, so no need to indicate it. Also, they create an index by default, so no need to create an index on id.

d12frosted · June 1, 2024, 6:06am

@akashp Thanks for sharing the results and the script you used to generate data. What I don’t understand - do you actually use tags, aliases and links? If not, you should. I have the following set of test notes you can use in your benchmarks as well - GitHub - d12frosted/vulpea-test-notes: Just some generated notes to test performance.

I will read the code later (tomorrow or early next week).

@dmg I think this is the point of this test - just to see how performance improves if you touch only the query performance.

And I agree with all other comments

akashp · June 1, 2024, 9:08am

Yes, I have also included the code that I used to generate the sample. Its in the first post, click the arrow to reveal it.

Links? As in like the links table? But it wasnt part of the query,
It contains tags, aliases and refs

:PROPERTIES:
:ID: $FILE_UUID
:ROAM_REFS: https://www.example.com
:ROAM_ALIASES: alias_${i}0
:END:
#+TITLE: test_$i
#+FILETAGS: :tag0_$i:

* h1_$i                                         :tag1_$i:tag2_$i:
:PROPERTIES:
:ID: $H1_UUID
:ROAM_ALIASES: alias_${i}1
:END:
EOL

Please take your time and improve on it. Thank you so much.

akashp · June 1, 2024, 10:11am

Hi I need some help. I had made the files.file as the foreign key - as we have to do this table join at the minimum, what other foreign key constraints should we add?

I first create the table - then populate it, then add the triggers - I tested whether the table tracks changes, which I confirm, I am not understanding this point, please clarify if possible.
We first create an empty table - then do a table populate - which it does very quickly because it is simply replicating the data, after which only we add the triggers to track changes.

Also thanks for catching its called “materialised view” I feel so goofy calling it just “material view” it is now public for millions to see - I will change it soon.

Can I also ask what extra index should we add? Jethro had kept three index by default and on the foreign keys of the table - I created an extra index on the files.file, I thought the index were there for table joins. I am very new to sql.

Thanks for all your support, much appreciated.

dmg · June 1, 2024, 5:11pm

YOu can create the table with a create and then populate it.
Since each trigger is going to find and/or modify few tuples at a time, you need to make sure that any selection (the technical term for the where clause) is computed using an index. basically, if you search for X = ‘value’, then index X. If you have two predicates it gets more complex, but I think in this case all are simple selections.
a primary key creates an index on all the attributes of the primary key (for these tables, all are singletons, so no need to worry about order of attributes in the keys).

4.You can only create a foreign key to an attribute (or set of attrs) that have an index (otherwise any update operation will be extremely slow). I think all these indexes are already created.

No, don’t create indexes on the tables that already exist. Create indexes on the attributes of the materialized table.
And I don’t know how I missed this: wrap this in a transaction! By default all DBMS do one transaction per statement.

I think that is all. sqlite is a bit non standard compliant, but I think everything we have discussed is easily doable in sqlite.

akashp · June 1, 2024, 6:34pm

@dmg Ok i understand – I have made the modifications as you requested please check

https://hastebin.skyra.pw/bufawosoxe.sql

We change the table name to mview simply
I did not want to join together the create and insert into (table population) into a single operation since I want the following to show up while doing .schema

CREATE TABLE mview (id PRIMARY KEY, file, level, pos, todo, priority, scheduled, deadline, title, properties, olp, tag, alias, type_ref, FOREIGN KEY (file) REFERENCES files (file) ON DELETE CASCADE);
CREATE INDEX mview_file ON mview (file);

Since the file column is the Foreign Key column we supply a index to this - this index is a non unique index.

– for my own reference:
I mistakenly supplied an index to files.file (in my setup that was not part of the code) but this is redundant because it is already the Primary Key to the files table.

– continued…
4. Since mview.id is the primary key - an index is automatically created for it - and since we utilise this in the WHERE clauses for the triggers – i think nothing else needs to be done

Added BEGIN TRANSACTION, and COMMIT; to wrap the entire operation. Thanks for teaching me about transactions - I was actually creating a bash script to supply separate entries one by one

Please have a look.

akashp · June 1, 2024, 7:59pm

I want to report a worsening result for small query set. Quite strange - any rationale ?

code sheet: Hastebin

I can replicate this worsening result consistently for smaller queries. Interesting

dmg · June 1, 2024, 8:09pm

I would have called it node_list_view
As I tell my students, format the query to make it readable, put the attributes in one or more lines (I usually put 5 at a time), and each constraints in its own line.
They are not related. The FK constraint says: if we add a tuple to this table, the file must exist. a) An index on mview(file) is used to quickly find the tuples when the files table gets modified. then b) when you do the new insert/update in mview, the DBMS goes and checks the new file attribute exists in files (uses the index of the file-id in files).
(This is my personal view). I am totally opposed to having the same attribute name for different attributes in the schema of a db. This is very pernicious and pervasive by users of mySQL. I would named the id attribute mview_id. It also makes joins easier (not needed in this case, but then in a properly named schema you can use join using, instead of join on).
A big mistake many programmers do is to not give all the work to the DBMS all at once. the DBMSs are extremely efficient at doing what they do. But in this case, it is also about maintaining referential integrity. Either all the work is done or none is. With a transaction there is no way to leave the DB in an inconsistent state (unless the programmer makes an error).
For performance (updated). I see you run the query. The time difference between queries is negligible. Use explain (see bleow)

Also, keep in mind that multiple executions of the same query are affected by caching done by the DBMS. It is not the same to run the query the first time than the second time.

You can also measure a query (though you have to account for the time opening and closing the database) with bash:

time sqlite3 database.db < query > outputfile.

And the most important tool in sqlite: explain. Use:

 explain query plan <query>

I soo much wish it has explain analyze

dmg · June 1, 2024, 8:18pm

How many tuples are you joining? (what is size of the left hand side of the join)

dmg · June 1, 2024, 8:20pm

and make sure they return the same tuples.

akashp · June 1, 2024, 8:22pm

https://hastebin.skyra.pw/ohoserokeg.pgsql

Here is the explain <query> for both the query – they should output the Exact same thing – both 20,000

The explain query plan <query> :

sqlite> explain query plan select mview.id, files.title, files.mtime, mview."level", mview.pos, mview.file, mview.title, mview.olp
from mview left join files on mview.file = files.file;
QUERY PLAN
|--SCAN mview
`--SEARCH files USING INDEX sqlite_autoindex_files_1 (file=?) LEFT-JOIN
sqlite> explain query plan select nodes.id, files.title, files.mtime, nodes."level", nodes.pos, nodes.file, nodes.title, nodes.olp
from nodes left join files on nodes.file = files.file;
QUERY PLAN
|--SCAN nodes
`--SEARCH files USING INDEX sqlite_autoindex_files_1 (file=?) LEFT-JOIN

In the explain <query> we see that for the mview case - the execution is slightly different - for the nodes case we use a simple execution plan - but for the mview case isnull is running – which eats some time off execution.

dmg · June 1, 2024, 8:31pm

Yes, it makes sense. mview is wider than nodes, so it needs more space in memory, and more block reads than nodes.

But what I don’t understand is why are you comparing these queries? You don’t need to use either one in org-roam, do you?

akashp · June 1, 2024, 8:31pm

No – I was just surprised – I expected mview to be faster for all cases – but for cases where we dont require join it is actually slower – when I think about it it does make sense.

Thanks

I will also do the other fixes you suggested - primarily related to namespace and formatting.

dmg · June 1, 2024, 8:36pm

By the way, please format the queries so they are readable.
Make sure it is easy to separate the different parts of the query (and table construction)
and use indentation for this purpose. Strings in emacs can be multiline

#+begin_src emacs-lisp   :exports both
"select <projection> -- few attributes per line
 from
    <table1> 
    left join <table2> on <predicate> 
 where <selection>
 group by
    <group by attrs>
 having
    <having attributes>"
#+end_src

dmg · June 1, 2024, 8:57pm

do you have a foreign key in mview to files (for this test)?

Change your query to join. Since the file is guaranteed to exist, the left join is superfluous (and probably the planner is not able to optimize it and checks the result against the table to see if it needs to add the left side of the left join).

akashp · June 1, 2024, 9:32pm

Yes-- CREATE TABLE IF NOT EXISTS mview (id PRIMARY KEY, file, level, pos, todo, priority, scheduled, deadline, title, properties, olp, tag, alias, type_ref, FOREIGN KEY (file) REFERENCES files (file) ON DELETE CASCADE)

versus nodes CREATE TABLE nodes (id NOT NULL PRIMARY KEY, file NOT NULL, level NOT NULL, pos NOT NULL, todo , priority , scheduled text, deadline text, title , properties , olp , FOREIGN KEY (file) REFERENCES files (file) ON DELETE CASCADE)
The results aren’t different if I use join/ inner join / left join – all have same results

Topic		Replies	Views
Utility package: org-roam-timestamps Development	6	1417	December 7, 2023
Thoughts on DB structure Development	13	1685	May 8, 2020
Automated testing for extensions of org roam Development	0	420	September 15, 2020
Org Roam on alpine linux Troubleshooting	3	343	October 20, 2021
Org-roam as filesystem Development	6	165	November 16, 2024

Materialized View for Org-roam.db using sqlite triggers

Query Results

Related topics