Testing a Materialized View implementation using triggers.
Requesting testers and general help and advice, thanks for your time.
Test Suite to generate org samples is given below – click arrow:
Test Suite
#!/bin/bash
# Prompt the user for the number of files to create
read -p "Enter the number of files to create: " N
# Directory to store the files
OUTPUT_DIR="org_files"
mkdir -p $OUTPUT_DIR
# Loop to create N org files
for ((i=1; i<=N; i++))
do
# Generate UUIDs
FILE_UUID=$(uuidgen)
H1_UUID=$(uuidgen)
# Create the org file
FILE_NAME="test_${i}.org"
cat <<EOL > $OUTPUT_DIR/$FILE_NAME
:PROPERTIES:
:ID: $FILE_UUID
:ROAM_REFS: https://www.example.com
:ROAM_ALIASES: alias_${i}0
:END:
#+TITLE: test_$i
#+FILETAGS: :tag0_$i:
* h1_$i :tag1_$i:tag2_$i:
:PROPERTIES:
:ID: $H1_UUID
:ROAM_ALIASES: alias_${i}1
:END:
EOL
# Print progress
echo "Created file $i of $N: $OUTPUT_DIR/$FILE_NAME"
done
echo "Created $N org files in $OUTPUT_DIR directory."
They are called materialized views, not material views.
Add foreign key constraints to the table
Use a query to create the table, because otherwise your code assumes that the triggers populate it. Simply do create table XYZ as select …;
Use a good name for the table. I would add _mview (or something like that as a prefix to inform the user that it is actually a view).
It looks good, but i think it would need testing. The good news is that it is trivial to test it.
Enable your changes to the DBMS, keep using org-roam and after some days run:
WITH original as (select ... <original query here>),
first as (select * from myview except select * from original),
second as (select * from original except select * from myview),
select * from first union select * from second;
By the way, these changes will not reduce the amount of GC collection that the query requires to run inside emacs. Only the execution time of the query.
Hi I need some help. I had made the files.file as the foreign key - as we have to do this table join at the minimum, what other foreign key constraints should we add?
I first create the table - then populate it, then add the triggers - I tested whether the table tracks changes, which I confirm, I am not understanding this point, please clarify if possible.
We first create an empty table - then do a table populate - which it does very quickly because it is simply replicating the data, after which only we add the triggers to track changes.
Also thanks for catching its called “materialised view” I feel so goofy calling it just “material view” it is now public for millions to see - I will change it soon.
Can I also ask what extra index should we add? Jethro had kept three index by default and on the foreign keys of the table - I created an extra index on the files.file, I thought the index were there for table joins. I am very new to sql.
YOu can create the table with a create and then populate it.
Since each trigger is going to find and/or modify few tuples at a time, you need to make sure that any selection (the technical term for the where clause) is computed using an index. basically, if you search for X = ‘value’, then index X. If you have two predicates it gets more complex, but I think in this case all are simple selections.
a primary key creates an index on all the attributes of the primary key (for these tables, all are singletons, so no need to worry about order of attributes in the keys).
4.You can only create a foreign key to an attribute (or set of attrs) that have an index (otherwise any update operation will be extremely slow). I think all these indexes are already created.
No, don’t create indexes on the tables that already exist. Create indexes on the attributes of the materialized table.
And I don’t know how I missed this: wrap this in a transaction! By default all DBMS do one transaction per statement.
I think that is all. sqlite is a bit non standard compliant, but I think everything we have discussed is easily doable in sqlite.
I did not want to join together the create and insert into (table population) into a single operation since I want the following to show up while doing .schema
CREATE TABLE mview (id PRIMARY KEY, file, level, pos, todo, priority, scheduled, deadline, title, properties, olp, tag, alias, type_ref, FOREIGN KEY (file) REFERENCES files (file) ON DELETE CASCADE);
CREATE INDEX mview_file ON mview (file);
Since the file column is the Foreign Key column we supply a index to this - this index is a non unique index.
– for my own reference:
I mistakenly supplied an index to files.file (in my setup that was not part of the code) but this is redundant because it is already the Primary Key to the files table.
– continued…
4. Since mview.id is the primary key - an index is automatically created for it - and since we utilise this in the WHERE clauses for the triggers – i think nothing else needs to be done
Added BEGIN TRANSACTION, and COMMIT; to wrap the entire operation. Thanks for teaching me about transactions - I was actually creating a bash script to supply separate entries one by one
As I tell my students, format the query to make it readable, put the attributes in one or more lines (I usually put 5 at a time), and each constraints in its own line.
They are not related. The FK constraint says: if we add a tuple to this table, the file must exist. a) An index on mview(file) is used to quickly find the tuples when the files table gets modified. then b) when you do the new insert/update in mview, the DBMS goes and checks the new file attribute exists in files (uses the index of the file-id in files).
(This is my personal view). I am totally opposed to having the same attribute name for different attributes in the schema of a db. This is very pernicious and pervasive by users of mySQL. I would named the id attribute mview_id. It also makes joins easier (not needed in this case, but then in a properly named schema you can use join using, instead of join on).
A big mistake many programmers do is to not give all the work to the DBMS all at once. the DBMSs are extremely efficient at doing what they do. But in this case, it is also about maintaining referential integrity. Either all the work is done or none is. With a transaction there is no way to leave the DB in an inconsistent state (unless the programmer makes an error).
For performance (updated). I see you run the query. The time difference between queries is negligible. Use explain (see bleow)
Also, keep in mind that multiple executions of the same query are affected by caching done by the DBMS. It is not the same to run the query the first time than the second time.
You can also measure a query (though you have to account for the time opening and closing the database) with bash:
time sqlite3 database.db < query > outputfile.
And the most important tool in sqlite: explain. Use:
Here is the explain <query> for both the query – they should output the Exact same thing – both 20,000
The explain query plan <query> :
sqlite> explain query plan select mview.id, files.title, files.mtime, mview."level", mview.pos, mview.file, mview.title, mview.olp
from mview left join files on mview.file = files.file;
QUERY PLAN
|--SCAN mview
`--SEARCH files USING INDEX sqlite_autoindex_files_1 (file=?) LEFT-JOIN
sqlite> explain query plan select nodes.id, files.title, files.mtime, nodes."level", nodes.pos, nodes.file, nodes.title, nodes.olp
from nodes left join files on nodes.file = files.file;
QUERY PLAN
|--SCAN nodes
`--SEARCH files USING INDEX sqlite_autoindex_files_1 (file=?) LEFT-JOIN
In the explain <query> we see that for the mview case - the execution is slightly different - for the nodes case we use a simple execution plan - but for the mview case isnull is running – which eats some time off execution.
No – I was just surprised – I expected mview to be faster for all cases – but for cases where we dont require join it is actually slower – when I think about it it does make sense.
Thanks
I will also do the other fixes you suggested - primarily related to namespace and formatting.
By the way, please format the queries so they are readable.
Make sure it is easy to separate the different parts of the query (and table construction)
and use indentation for this purpose. Strings in emacs can be multiline
#+begin_src emacs-lisp :exports both
"select <projection> -- few attributes per line
from
<table1>
left join <table2> on <predicate>
where <selection>
group by
<group by attrs>
having
<having attributes>"
#+end_src
do you have a foreign key in mview to files (for this test)?
Change your query to join. Since the file is guaranteed to exist, the left join is superfluous (and probably the planner is not able to optimize it and checks the result against the table to see if it needs to add the left side of the left join).
Yes-- CREATE TABLE IF NOT EXISTS mview (id PRIMARY KEY, file, level, pos, todo, priority, scheduled, deadline, title, properties, olp, tag, alias, type_ref, FOREIGN KEY (file) REFERENCES files (file) ON DELETE CASCADE)
versus nodes CREATE TABLE nodes (id NOT NULL PRIMARY KEY, file NOT NULL, level NOT NULL, pos NOT NULL, todo , priority , scheduled text, deadline text, title , properties , olp , FOREIGN KEY (file) REFERENCES files (file) ON DELETE CASCADE)
The results aren’t different if I use join/ inner join / left join – all have same results