(Prototype) org-similarity: generate a list of similar documents (Org-mode)

I’m a hardcore fan of org-roam and use it daily to organize my ideas and, above all, to discover new connections between them. The organizing part is in check, but as the number of notes grows, finding similar or related notes by hand becomes a daunting task with suboptimal results. Therefore, I decided to code my first Emacs package :slight_smile: Behold, org-similarity!

For installation instructions and usage:

Under the hood, the package relies on Python’s scikit-learn and nltk modules for text feature extraction and pre-processing. More specifically, it cleans the org files (by stripping the front matter and some undesired characters), tokenizes the documents, replaces each token with its linguistic stem, generates a TF-IDF sparse matrix, and calculates the cosine similarity between the note you are currently editing and the other notes in a directory of your choice. It works with org-roam and org-mode in general. Here’s a demo:

[Demo animation]
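
To make the pipeline above more concrete, here is a minimal, dependency-free sketch of the same idea. It is not the package's actual code: the crude suffix stripping merely stands in for a real nltk stemmer, the hand-rolled TF-IDF stands in for scikit-learn's vectorizer (only the smoothed IDF formula mirrors scikit-learn's default), and all function names and sample notes are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase, split on non-letters, and crudely strip a few suffixes.

    The suffix stripping is only a placeholder for a real stemmer.
    """
    stems = []
    for tok in re.findall(r"[a-z]+", text.lower()):
        for suffix in ("ing", "ed", "es", "s"):
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[: -len(suffix)]
                break
        stems.append(tok)
    return stems


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query_index, docs):
    """Rank every other document by similarity to docs[query_index]."""
    vecs = tfidf_vectors(docs)
    scores = [
        (cosine(vecs[query_index], v), i)
        for i, v in enumerate(vecs)
        if i != query_index
    ]
    return sorted(scores, reverse=True)


# Hypothetical notes; index 0 plays the role of the note being edited.
notes = [
    "Emacs packages for taking linked notes",
    "Linking notes in Emacs with org-roam",
    "A recipe for sourdough bread",
]
ranking = most_similar(0, notes)  # list of (score, index), best match first
```

In this toy corpus the second note ranks above the bread recipe because it shares several stems with the first one, while the recipe shares only the word "for".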

This is both my first Emacs package and “useful” git repo, so it was a lot of fun to learn new stuff, and I’d love to have the community’s feedback to improve this package! :slight_smile:

PS: org-roam users might want to set org-similarity-directory to org-roam-directory and change how links are created, like this:

(setq org-similarity-directory org-roam-directory)
(setq org-similarity-use-id-links t)

:astonished:


Awesome!
If I changed the extension in this line, and the “#+title:” to an appropriate signifier in this line, would it work for non-org files? I have Markdown files in mind.

Ah, I would also need to deal with the links in the token function… I am guessing.

Yes! It should work with your proposed changes, plus replacing the org-mode keywords with their Markdown equivalents in this file: https://github.com/soldeace/org-similarity/blob/main/assets/junkchars.txt


This is fantastic!

How big of a step is it to also TF-IDF saved web pages?

I save them as full HTML (as Zotero does), and since my own notes about them are terse and usually cover only what I knew was interesting about them at the time, most future discovery of related notes would have to come from the .html files, not the .org files.

I guess you could say the same about PDFs…

The other thing I’ve been wondering about is how to merge similarity metrics coming from different sources e.g.

  • Text TF-IDF, maybe after SVD topic clustering or the like
  • User-generated links, as done by Hugo Cisneros
  • User-generated tags
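
Just to make the question concrete, one naive option would be a weighted sum of the three signals. The sketch below is only an assumption about how such a merge could look: the weights, the Jaccard overlap for tags, and the function names are all made up, and nothing like this exists in org-similarity as far as I know.

```python
def jaccard(a, b):
    """Overlap between two tag sets, from 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def combined_score(text_sim, tags_a, tags_b, linked, weights=(0.6, 0.2, 0.2)):
    """Blend text similarity, tag overlap, and an explicit-link flag.

    The default weights are arbitrary placeholders and sum to 1, so the
    result stays in [0, 1] as long as text_sim does.
    """
    w_text, w_tags, w_link = weights
    link_score = 1.0 if linked else 0.0
    return w_text * text_sim + w_tags * jaccard(tags_a, tags_b) + w_link * link_score


# Hypothetical pair of notes: moderate text similarity, one shared tag, linked.
score = combined_score(0.5, {"emacs", "notes"}, {"emacs", "python"}, linked=True)
```

Whether a plain weighted sum is good enough, or whether the weights should be tuned per collection, is exactly the open question.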

But for the moment, I’m really happy to see the org-similarity prototype.

Thanks!


Thank you for the feedback!

With regard to your ideas:

Scanning other types of files would certainly be a great addition, and it should be fairly straightforward to implement. Strangely enough, I thought there would already be mature tools for the job, but I couldn’t find any in a quick search.
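
For saved web pages, the main extra step would be turning the HTML into plain text before it reaches the tokenizer. Here is a rough sketch using only Python's standard library; the class and function names are hypothetical, this is not part of org-similarity, and real saved pages would likely need more cleanup (navigation menus, boilerplate, and so on).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# A tiny made-up page standing in for a saved snapshot.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Some <b>bold</b> text.</p>"
    "<script>var x = 1;</script></body></html>"
)
text = html_to_text(page)
```

The resulting string could then go through the same cleaning, stemming, and TF-IDF steps as an org file.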

I didn’t know about Hugo Cisneros. And wow, his digital garden of notes is amazing! Truly an inspiration. There are a lot of great machine learning tools inside scikit-learn, so let’s see what we can do with them! :slight_smile:

Awesome!

Thanks for sharing this. I can’t wait to try it!


I have written another discovery tool called “delve” (on GitHub). Would it be possible to abstract the function that returns the list of similar notes, so that I could add it to delve? Delve uses a hierarchical list paradigm to explore the notes.
https://github.com/publicimageltd/delve


For starters, “Delve” is such a cool name! Props! :slight_smile: I’ll take a look at your code and see what I can do with mine to make them talk.

Thanks! Looking at your code, I see that you actually use Python to compute the result. So I guess the easiest way would be to use your package as a kind of plugin. We just need to make clear how your results are passed (given that they are more than just a list of pages) and define how to deal with errors. The general command would be “apply org-similarity to the page at point”. Sounds good to me.
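
One possible shape for that hand-off, purely as a suggestion: the Python side could print a small JSON document that carries either the ranked results or an error message, which Emacs could then read with something like json-parse-string. The field names, the helper name, and the file names below are made up; org-similarity's current output may look quite different.

```python
import json


def results_to_json(scores, error=None):
    """Serialize (path, score) pairs, or an error, for the Emacs side.

    The schema ("status", "message", "results") is a hypothetical
    convention, not something either package defines today.
    """
    if error is not None:
        return json.dumps({"status": "error", "message": str(error), "results": []})
    ranked = sorted(scores, key=lambda pair: pair[1], reverse=True)
    results = [{"file": path, "score": round(score, 4)} for path, score in ranked]
    return json.dumps({"status": "ok", "results": results})


# Hypothetical output for two notes, and for a failure case.
ok_payload = results_to_json([("a.org", 0.12), ("b.org", 0.87)])
error_payload = results_to_json([], error=FileNotFoundError("notes directory not found"))
```

With an explicit status field, delve could show the error message to the user instead of trying to parse a traceback.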