I’ve been thinking a bit about applying some graph theory concepts to Roam, and others have expressed interest. This thread exists to further discussion of the idea and its implementation.
Why Pursue This
I’m sure I won’t have to argue that being able to view a network of links is great. I think by utilising some graph theory concepts (namely centrality and communities) we can make it easier to
1. identify important ideas in the network
2. identify closely linked clumps of notes
3. make the graph nicer to look at.
I think the benefits of 1-2 are clear. 3 is a nice bonus which comes with them.
I propose sizing nodes based on importance/centrality, and colouring them based on groups/communities.
I suspect that applying Graph Theory to the note links network will make it easier to identify particular items of interest, and gain a more accurate picture of the overall state of the network.
Graph Theory
I’ll try to give just a brief overview of the relevant ideas and the general shape of the maths, so you can get a rough sense of the calculations involved. I’ve left out details of the formulae because I doubt they’re of interest, and Wikipedia exists.
V = vertices = nodes = notes
E = edges = connections = links
Centrality
A measure of importance. A simplistic measure is how many connections a note has; however, this fails to take into account how important the connected notes themselves are. Accounting for this leads to a recursive definition, at which point matrix algebra and eigenvectors get involved. This is 𝓞(V³).
I’d consider using a tweaked version of this, namely PageRank (eigenvector centrality + a starting weight + weighting by vertex degree), while experimenting with other variants.
Good for finding vertices which are connected a lot
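To make this concrete, here is a minimal sketch using the python bindings for igraph (purely illustrative: the toy graph is made up, and this is not the proposed compiled implementation):

```python
import igraph as ig

# A toy note network: five notes, with note 2 acting as a hub.
g = ig.Graph(n=5, edges=[(0, 2), (1, 2), (2, 3), (3, 4)])

# PageRank ~ eigenvector centrality + damping/starting weight + degree weighting.
print(g.pagerank(damping=0.85))
# The hub (vertex 2) comes out with the highest score.
```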
Betweenness centrality
A different take on the same idea (importance): the extent to which a vertex lies on the shortest paths between other vertices. 𝓞(VE)
Good for finding vertices which bridge large groups of vertices
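As a quick illustration of why this is the "bridge-finding" measure (again just a toy sketch with python-igraph):

```python
import igraph as ig

# Two triangles joined through a single bridging vertex (3).
g = ig.Graph(n=7, edges=[(0, 1), (1, 2), (0, 2),   # clump A
                         (4, 5), (5, 6), (4, 6),   # clump B
                         (2, 3), (3, 4)])          # the bridge
print(g.betweenness())
# Vertex 3 scores highest: every shortest path between the clumps runs through it.
```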
k-cores
Basically, incrementally picking off the nodes with < k connections until none are left below that threshold; whatever remains is the k-core. 𝓞(V).
Good for finding a tightly-connected core
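A toy sketch of the same idea with python-igraph (igraph assigns a coreness value to every vertex in one pass):

```python
import igraph as ig

# A fully-connected core (0-3) with a loosely attached tail (4, 5).
g = ig.Graph(n=6, edges=[(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3),
                         (3, 4), (4, 5)])
print(g.coreness())
# => [3, 3, 3, 3, 1, 1]: the clique survives the k = 3 peeling, the tail does not.
```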
Communities
Clusters of closely-related vertices.
Greedy Modular Communities
Modularity is (roughly speaking) a measure of how many connections occur within communities vs. across communities. Maximising this (which can be done greedily) is an approach which generally works well. 𝓞 (N log N)
Good for finding clumps of tightly-connected vertices
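And a toy sketch of greedy modularity maximisation with python-igraph (the membership labels are what I’d propose using as the community index/colour):

```python
import igraph as ig

# Two obvious clumps connected by a single edge.
g = ig.Graph(n=6, edges=[(0, 1), (1, 2), (0, 2),
                         (3, 4), (4, 5), (3, 5),
                         (2, 3)])
membership = g.community_fastgreedy().as_clustering().membership
print(membership)
# => something like [0, 0, 0, 1, 1, 1]: each triangle ends up in its own community.
```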
Implementation
Implementing this in elisp would be (1) a huge effort and (2) likely to run into performance issues. Common libraries exist in python and R; however, performance would likely still be an issue, and requiring, say, R libraries would make end-user installation a prohibitive hassle.
From my googling so far, the most promising option appears to be making use of https://igraph.org/ to create a c/c++ source file which can be compiled on the end-user’s computer, in a similar manner to vterm. Providing pre-compiled executables would be nice, but I’m not sure how well this would work.
Since the centrality/community values are constant for a particular network structure, my current idea is to add a DB table to store the values and minimise recomputation, with a form such as this:
- network_features
  - UUID
  - page_rank
  - betweenness_centrality
  - k_core
  - community_index
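In plain SQLite terms, the table might look something like this (a sketch only; the column names are just the ones proposed above, the DB path is a placeholder, and the real schema would go through whatever DB layer the package already uses):

```python
import sqlite3

conn = sqlite3.connect("roam.db")  # placeholder path, not the actual DB location
conn.execute("""
    CREATE TABLE IF NOT EXISTS network_features (
        uuid                   TEXT PRIMARY KEY,
        page_rank              REAL,
        betweenness_centrality REAL,
        k_core                 INTEGER,
        community_index        INTEGER
    )
""")
conn.commit()
conn.close()
```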
Then, when the user asks for a view of the graph, these metrics can simply be retrieved if the network has not changed. If it has, the values need updating, so the c/c++ program could be executed asynchronously, with the results written directly to the DB.
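To sketch what that update pass might do (in python rather than the proposed c/c++, and with made-up function and argument names, so treat it purely as pseudocode that happens to run):

```python
import sqlite3
import igraph as ig

def refresh_network_features(db_path, uuids, edges):
    """Recompute all the metrics for the current network and write them back.
    `uuids` maps vertex index -> note UUID; `edges` are (index, index) pairs."""
    g = ig.Graph(n=len(uuids), edges=edges)
    membership = g.community_fastgreedy().as_clustering().membership
    rows = list(zip(uuids, g.pagerank(), g.betweenness(), g.coreness(), membership))

    conn = sqlite3.connect(db_path)
    conn.executemany(
        """INSERT OR REPLACE INTO network_features
           (uuid, page_rank, betweenness_centrality, k_core, community_index)
           VALUES (?, ?, ?, ?, ?)""",
        rows,
    )
    conn.commit()
    conn.close()
```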
Final Thoughts
I, for one, am a huge fan of this idea. However, I lack experience with c/c++ and therefore am currently unlikely to be able to give this a shot. If someone would be willing to have a go at this, that would be brilliant! While I may not be able to help with language-specific nuances, I would be keen to discuss basically anything else surrounding this.