(Prototype) org-similarity: generate a list of similar documents (Org-mode)

soldeace · December 5, 2020, 4:03pm

I’m a hardcore fan of org-roam and use it on a daily basis to organize my ideas and, above all, discover new connections between them. The organizing part is in check, but as the number of notes grows, finding similar or related notes by hand becomes a daunting task with suboptimal results. Therefore, I decided to code my first Emacs package: behold, org-similarity!

For installation instructions and usage:

The package actually uses the power of Python’s scikit-learn and nltk modules for text feature extraction and pre-processing. More specifically, it cleans the org files (by stripping the front matter and some undesired characters), tokenizes the documents, replaces each token with its linguistic stem, generates a TF-IDF sparse matrix, and calculates the cosine similarity between the note you are currently editing and the other notes in a directory of your choice. It works with org-roam and org-mode in general. Here’s a demo:

[demo GIF]
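
The steps described above (clean, tokenize, stem, TF-IDF, cosine similarity) can be sketched roughly as follows. This is a dependency-free illustration and not the package's code: it swaps in a toy suffix stripper for nltk's stemmers and a hand-rolled TF-IDF for scikit-learn's vectorizer, and every function name here is hypothetical.

```python
import math
import re
from collections import Counter

def clean(text):
    """Drop Org keyword lines such as '#+title:' and keep only letters."""
    body = [ln for ln in text.splitlines() if not ln.lstrip().startswith("#+")]
    return re.sub(r"[^a-z\s]", " ", "\n".join(body).lower())

def crude_stem(token):
    """Toy stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def tokenize(text):
    return [crude_stem(tok) for tok in clean(text).split()]

def tfidf_vectors(docs):
    """Return one {term: weight} dict per document."""
    counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    # Smoothed IDF, so terms present in every document keep a small weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return [{t: tf * idf[t] for t, tf in c.items()} for c in counts]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_similar(current, others):
    """Sort the names in `others` by similarity to `current`, best first."""
    names = list(others)
    vectors = tfidf_vectors([current] + [others[n] for n in names])
    scores = [(name, cosine(vectors[0], vec)) for name, vec in zip(names, vectors[1:])]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up notes, for illustration only.
notes = {
    "gardening.org": "#+title: Gardening\nPlanting tomatoes and watering the garden.",
    "emacs.org": "#+title: Emacs\nConfiguring Emacs packages with Lisp.",
}
ranking = rank_similar("#+title: Plants\nI watered the tomato plants in my garden.", notes)
```

Here `ranking` lists the gardening note first, because after stemming it shares several terms with the current note, while the Emacs note shares none. In a real setup the scikit-learn and nltk components would replace the toy parts.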

This is both my first Emacs package and “useful” git repo, so it was a lot of fun to learn new stuff, and I’d love to have the community’s feedback to improve this package!

PS: org-roam users might want to change the value of org-similarity-directory to org-roam-directory, and how links are created, like this:

(setq org-similarity-directory org-roam-directory)
(setq org-similarity-use-id-links t)

mshevchuk · December 5, 2020, 4:13pm

nobiot · December 5, 2020, 5:25pm

Awesome!
If I changed the extension in this line, and the “#+title:” to an appropriate signifier in this line, would it work for non-org files? I have Markdown files in mind.

Ah, I would also need to deal with the links in the token function… I am guessing.

soldeace · December 5, 2020, 5:38pm

Yes! It should work with your proposed changes, plus replacing the org-mode keywords with their Markdown equivalents in this file: https://github.com/soldeace/org-similarity/blob/main/assets/junkchars.txt
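
As a rough idea of what Markdown-aware cleaning might look like before tokenizing, here is a minimal sketch. It assumes notes with an optional YAML front matter block, inline links, and wiki-style links; `clean_markdown` is a hypothetical helper and not part of org-similarity.

```python
import re

def clean_markdown(text):
    """Strip YAML front matter and reduce links to their visible text."""
    # Remove a leading '---' ... '---' front matter block, if present.
    text = re.sub(r"\A---\n.*?\n---\n", "", text, flags=re.DOTALL)
    # [label](target) -> label
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # [[wiki link]] -> wiki link
    text = re.sub(r"\[\[([^\]]*)\]\]", r"\1", text)
    # Drop heading markers at the start of lines.
    text = re.sub(r"^#+[ \t]*", "", text, flags=re.MULTILINE)
    return text.strip()

sample = "---\ntitle: Demo\n---\n# Heading\nSee [my note](note.md) and [[Other Note]]."
cleaned = clean_markdown(sample)
```

The link targets are dropped so that file names and URLs do not end up as tokens in the similarity computation.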

scotto · December 6, 2020, 7:58pm

This is fantastic!

How big of a step is it to also TF-IDF saved web pages?

I save them in full html (as is done by Zotero) and since my own notes about them are terse, and usually cover only the thing I knew was interesting about them at the time, most of the future related notes discovery would have to come from .html, not .org files.

I guess you could say the same about pdfs…

The other thing I’ve been wondering about is how to merge similarity metrics coming from different sources, e.g.:

Text TF-IDF, maybe after SVD topic clustering or the like
User-generated links, as done by Hugo Cisneros
User-generated tags

But for the moment, I’m really happy to see the prototype org-similarity

Thanks!

soldeace · December 7, 2020, 1:14pm

Thank you for the feedback!

With regard to your ideas:

Scanning other types of files would certainly be a great addition. It should be pretty straightforward to implement such a thing. Strangely enough, I thought there would already be mature tools for the job, but couldn’t find any with a quick search.

I didn’t know Hugo Cisneros. And wow, his digital garden of notes is amazing! Truly an inspiration. There are a lot of great machine learning tools inside scikit-learn; let’s see what we can do with them!
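
On scanning saved HTML pages: one possible way to get plain text out of them before tokenizing is Python's standard `html.parser` module. The sketch below is only an illustration under that assumption; `TextExtractor` and `html_to_text` are hypothetical names, and nothing like this exists in the package yet.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Some <b>saved</b> page.</p></body></html>"
)
text = html_to_text(page)
```

The resulting string could then go through the same tokenizing and TF-IDF steps as an org file.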

konubinix · December 9, 2020, 8:10am

Awesome!

Thanks for sharing this. I can’t wait to try it!

public_image_limited · December 11, 2020, 8:53pm

I have written another discovery tool called “delve” (on GitHub). Would it be possible to somehow abstract the function which returns a list of similarities, so that I could add it to delve? Delve uses a hierarchical list paradigm to explore the notes.
https://github.com/publicimageltd/delve

soldeace · December 12, 2020, 4:04am

For starters, “Delve” is such a cool name! Props! I’ll take a look at your code and see what I can do with mine to make them talk.

public_image_limited · December 12, 2020, 10:01am

Thanks! Looking at your code, I see that you actually use Python to find the result. So I guess the easiest way would be to use your package as a kind of plugin. We just need to make clear how your results are passed (given that it is more than just a list of pages), and define how to deal with errors. The general command would be “apply org-similarity to the page at point”. Sounds good to me.
