Thoughts on DB structure

tecosaur · May 5, 2020, 3:15pm

Disclaimer

I’m not overly familiar with databases, so please take my thoughts with a pinch of salt. Hopefully, they’re of use

Background

In Custom Roam-style link I expressed some thoughts on the DB structure. In Application of Graph Theory to Roam link network I mention adding a new table to the DB. I should probably make a thread to further discussion on this.

Current Structure

The file org-roam.db is an SQLite DB with the following four tables.
All fields not null, unless labelled as “nullable”

files
- file : text : unique
- hash : text
- last_modified : text
links
- from : text
- to : text
- type : text
- properties : text
refs
- ref : text, unique
- file : text
titles
- file : text
- tites : nullable

It strikes me that there appears to be a bit of duplication (e.g. file) and redundantly small tables (e.g. titles). I also suspect that the introduction of UUIDs for the notes could be rather convenient, and lead to one or two minor, but nice, advantages.

An idea of mine

Just a starting point.

FIlename and titles can change. It seems like a nice idea to have a stable UUID for each note. AIn the default filename the current time to the nearest second is used, e.g. 20200429235659. Adding a created field would thence be able to both record the creation date (could come in handy now and then), and should be able to serve as a UUID, as of course, the creation date doesn’t change . If somebody (somehow) creates two files within the same second, I think it would be fine to just pretend the second file was created a second later.

Adding a UUID also allows the links table to have stable connections. By this, I mean that links are impervious to file renaming, the UUID field would never need to be updated. Reducing the number of actions that need to be performed on a file rename seems beneficial. To me, on a surface glance, this seems more robust.

With the introduction of Custom Roam-style link files can be referenced using [[roam:title]] notation. Since one would want such links to uniquely point to a file, it sounds sensible to require titles to be unique (I imagine with an on-save hook where the DB is updated, which probably already exists).

Proposed structure

notes
- UUID / created : integer : unique (e.g. 20200429235659)
- file : text : unique (e.g. “20200429235659-zettelkasten-method. org”)
- title : text : unique (e.g. “The Zettelkasten Method”)
- hash : text
- refs : text : nullable
links
- from : number (UUID, e.g. 20200429235659)
- to : number (UUID, e.g. 20200429231326)
- type : text (e.g. “roam”)
- properties : text

I’m not solid on what a ref is, perhaps it does warrant it’s own little table? IDK.

Network charachteristics extension

Since it’s relevant, in Application of Graph Theory to Roam link network I mention a new table which I think would complement new network analysis capabilities. This is the proposed structure.

network_features
- UUID : number
- page_rank : number
- betweeness_centrality : number
- k_core : number
- community_index : number

Perhaps this should be merged with notes? Again, IDK.

Let me know what you think

alan · May 5, 2020, 7:01pm

For UUID I would recommend org-id-new.

I also like the idea of enforcing uniqueness of titles (and aliases), since it is unlikely that a user intends to duplicate them and it would be best to return an error at time of creation/edit rather than letting the user discover the problem at time of access and have to manually resolve it.

However as currently implemented this would need to be enforced by org-roam rather than the db, I believe, since titles/aliases are stored as lists. Nothing difficult, just a note.

tecosaur · May 6, 2020, 1:40am

Mmmm. That’s what I was thinking. DB query to make sure unique, else tell the user they probably want to change it.

bruce · May 6, 2020, 10:54am

I have no SQL expertise, but your proposed structure does look more sensible to me, with the two notes and links tables.

Only question would be how one would deal with aliases in your model.

And I’d be curious what the counter argument would be.

On your last question, perhaps whether to add the new table, or merge with the others, comes down to technical considerations?

Logically, though, I think (but am not certain) there’s no need for a third table; those columns belong in notes.

jethro · May 7, 2020, 5:48am

While having a UUID initially seemed like a good idea to me, I now have my reservations.

Will the UUID be available in the Org file? If so, this makes the system brittle: it’s easy to mistakenly edit the UUID.
If it’s not available in the Org file, then this means that the UUID is stored only in the database. How do we then do a partial rebuild, what will the UUIDs be then?

I feel like it may eventually complicate the code while adding little benefit, since the UUIDs need to be maintained somehow, but it’s hard to say without a prototype.

good_matty · May 7, 2020, 7:22pm

I’ve done a lot of database design, so let me describe what I believe would be an optimal schema, without any regard for the efficiency of using this within the system.

First off, I think the introduction of a UUID for notes is a great idea. The titles and file locations ought to be able to change while still maintaining the links (and existence!) of a given note. I’d suggest a normalized schema like the one you started with:

note
** UUID : int : unique
** hash : text
link
** from : int : FK to note.UUID
** to : int : FK to note.UUID
** type : text/enum
** properties : text/tuple/whatever
title
** UUID : int : FK to note.UUID
** title : text : unique
** updated_at : timestamp
file
** UUID : int : FK to note.UUID
** file : text : unique
** updated_at : timestamp

(I’m not sure what the refs table does, but it seems like it would just be another table with the same format as link or title).

While this does seem like there might be some small tables, there’s an advantage in having this type of normalized design – it decouples text/filename updates from one-another (as well as the note itself), and allows for an easy recording of historical changes to the note title or filename (could be useful if you wanted to search for a note across an old title or something). It also would enable aliases, if you wanted to do that (you’d have to change the title schema a bit, but not a big deal).

This does mean that the queries to surface titles / filenames are a bit more complicated (you need to make sure you get the current one, do a join, etc.) but I think the safety in normalization and the ability to access historical metadata might be worth it. It seems like this is how it’s already laid out, already.

Besides that, I think that your initial instinct of having a UUID is a good idea, where the UUIDs are stored only in the database. You have the problem of initially creating them no matter what (which is what you’d do in this “partial rebuild” situation, I believe), which doesn’t seem to be an issue.

jethro · May 8, 2020, 6:48am

I like the idea of decoupling filename updates from one-another, but unfortunately with plain-text it isn’t so simple. Org links are link to file names, and not UUIDs. So if a filename is edited, org-roam must still visit all files pointing to the current file, and update the links. This can only be avoided by using non-file: links. The solution would then be almost equivalent to using Org IDs everywhere and link via IDs instead, which I think we should eventually support.

What I’m struggling to figure out is the benefits of adding UUID over using Org IDs, which will reduce the re-engineering required.

tecosaur · May 8, 2020, 7:06am

The reason why this entered my mind was thinking about file renaming.

At the moment, if I rename a file outside of Emacs, are all my links broken? Is there an easy way that all the links can be updated?

This is the main problem that I saw UUIDs as a simple solution to.

jethro · May 8, 2020, 7:20am

Yes, the links are broken. But I don’t think UUIDs is a solution to this: in fact, nothing is. The only solution I can imagine is to have Emacs run something, listening to inotify.

tecosaur · May 8, 2020, 7:25am

I’m thinking by recording a few other metrics e.g. title in file, creation date, line count etc. One might be able to create an org-roam-guess-uuid function which could be run on files which have ‘magically appeared’ (at the time that org-roam looks at them). Then if

A file in the DB appears to no longer exist
a new file which has the same metrics as the disappeared file is found (via org-roam-guess-uuid)

One could be reasonably confident that the new file is the old file renamed. Then we could

link the new file to the old UUID
identify all links, and replace with new file

jethro · May 8, 2020, 7:50am

All this feels like a bit of over-engineering. We have the appropriate machinery in place for automatically updating links when renaming or moving files within Emacs. I have mentioned this in other places before, but I think it’s worth repeating: I don’t think deleting/renaming files is supposed to be a key operation. The names/locations of the files shouldn’t be important, at least not as important as the title of the note, and the links to/from it. This is one of the reasons why supporting Org IDs isn’t as high on my priority list. If the concern is really breakage from deleting and moving files, then the solution already exists: Org IDs, and effort should go towards supporting that rather than building our own.

tecosaur · May 8, 2020, 8:59am

Yea, it does actually. As long as org-roam can do renaming itself well enough, this doesn’t sound like it’s worth worrying about

The one other point of value of a UUID perhaps worth considering is it’s value as a single identifier, but I guess we could just use the title/file for that.

Another thing in my mind is how this seems to be a lot of small tables. @good_matty you seem to be suggesting this is a good thing, would you mind explaing some more? To me at the moment I still get the impression that a having less tables (as in my original post) may be easier.

bruce · May 8, 2020, 10:52am

I think he was just referring to the general reason people advocate for “normalized” vs “denomrmalized” designs (his proposal being the former).

good_matty · May 8, 2020, 3:53pm

Pretty much! If you don’t have efficiency problems and you’re willing to write slightly more complicated queries, it’s generally preferable to have some normalized database design. Depending on what it is that you’re doing, it can save you from some situations where the database can become inconsistent, and often makes it easier to implement certain features.

In this case, doing something like keeping the titles table separate from the notes table allows you to more easily add/delete titles without going anywhere near other data, keep track of historical title information without lots of duplication in the notes table, and potentially allows for a note to have no title without introducing NULLs.

Topic		Replies	Views
Org-roam as filesystem Development	6	165	November 16, 2024
Why are file links being created when IDs exist? Troubleshooting	3	672	March 24, 2021
New default DB file location, and how to keep using the current DB file as is via customizing Tips	5	3735	October 6, 2020
Utility package: org-roam-timestamps Development	6	1417	December 7, 2023
Recording the date that connections were made Requests	3	227	March 8, 2024