tl;dr: org-similarity is an Emacs package prototype that helps you find similar notes by tapping into Python's powerful machine learning packages like nltk and scikit-learn. Version 0.2 comes with some improvements and bug fixes, and is now compatible with org-roam v2. I would really appreciate your help in testing it! Installation instructions are in the link below.
Almost exactly two years ago, I created a post here about my prototype called org-similarity. It generates a list of documents that are similar to the open buffer, in descending order of similarity. Unfortunately, I had to freeze its development due to big changes in my personal and professional life. The project wound up a bit stale, especially after the org-roam v2 release and all its changes (for the better, of course). org-similarity didn't quite get along with the properties drawer, and instead of id: links, it still generated file: ones, following the old org-roam convention.
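To make the difference between the two link styles concrete, here's a rough Python sketch. It is not the package's actual code: the function names are made up for illustration, and it uses a plain regex instead of a real org parser. It assumes an org-roam v2 style note, where the ID lives in a properties drawer at the very top of the file.

```python
import re
from pathlib import Path
from typing import Optional

# A properties drawer sitting at the very top of the file (org-roam v2 style).
DRAWER = re.compile(r"\A\s*:PROPERTIES:\n(.*?)\n[ \t]*:END:", re.DOTALL | re.IGNORECASE)
# An ":ID: value" line inside that drawer.
ID_LINE = re.compile(r"^[ \t]*:ID:[ \t]+(\S+)[ \t]*$", re.MULTILINE | re.IGNORECASE)
# The "#+title:" keyword line.
TITLE_LINE = re.compile(r"^#\+title:[ \t]*(.+?)[ \t]*$", re.MULTILINE | re.IGNORECASE)


def file_level_id(text: str) -> Optional[str]:
    """Return the ID stored in the top-of-file properties drawer, if any."""
    drawer = DRAWER.match(text)
    if drawer is None:
        return None
    found = ID_LINE.search(drawer.group(1))
    return found.group(1) if found else None


def title_of(text: str, fallback: str) -> str:
    """Return the #+title value, or the fallback when the note has no title."""
    found = TITLE_LINE.search(text)
    return found.group(1) if found else fallback


def org_link(path: Path, text: str, prefer_id: bool = True) -> str:
    """Build an org link to a note, using id: when possible and file: otherwise."""
    title = title_of(text, fallback=path.stem)
    note_id = file_level_id(text) if prefer_id else None
    target = f"id:{note_id}" if note_id else f"file:{path.as_posix()}"
    return f"[[{target}][{title}]]"
```

Notes without an ID (or with `prefer_id=False`) fall back to a file: link, so older notes still get linked.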
However, in recent weeks, a few people have thanked me for the package (rudimentary as it is), and others have kindly asked for bug fixes. That was kinda reinvigorating (especially after seeing the repo get more than 30 stars!). Well, I'm here to say that I intend to resume work on org-similarity, and today I refactored a lot of the code and fixed some long-overdue bugs.
Here are the major changes in v0.2:
- Automated installation of Python dependencies. Users no longer need to worry about installing org-similarity's Python dependencies manually. To make things easier for non-Pythonists, the former version's recommendation was to install deps manually via pip (which is not ideal: one should avoid mixing a project's requirements with the system's Python environment). With the help of virtual environments, all dependencies are now downloaded during the first run and installed inside the Emacs directory (more on that below). As a bonus, dependency checks are done at runtime, so when you get a package upgrade, the Python dependencies will be updated automatically as well.
- Better org-roam v2 compatibility. Now, org-similarity correctly parses the properties drawer and offers the option to create links using either IDs or filenames (as before).
- Refactored and optimized Python code. I must level with you: the first version of org-similarity was made in quite a hurry, so all of it, but especially the text preprocessing, was very hacky, and I was kinda embarrassed to expose that monster to the world. The most recent version relies on orgparse, a nifty Python package that handles org-mode files in a sane way.
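For the curious, the general idea behind the automated installation can be sketched like this. Take it as a simplified illustration under my own assumptions rather than the package's real code: the function names, the stamp file, and the example paths are all hypothetical. The sketch creates a virtual environment on first run and reinstalls the requirements only when the requirements file has changed since the last install.

```python
import hashlib
import subprocess
import sys
import venv
from pathlib import Path

# Hypothetical marker file that remembers which requirements were last installed.
STAMP_NAME = ".requirements.sha256"


def requirements_digest(requirements: Path) -> str:
    """Hash the requirements file so that changes can be detected cheaply."""
    return hashlib.sha256(requirements.read_bytes()).hexdigest()


def venv_python(env_dir: Path) -> Path:
    """Return the path of the Python interpreter inside the virtual environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def needs_install(env_dir: Path, requirements: Path) -> bool:
    """True when the environment is missing or was built from other requirements."""
    stamp = env_dir / STAMP_NAME
    if not venv_python(env_dir).exists() or not stamp.exists():
        return True
    return stamp.read_text().strip() != requirements_digest(requirements)


def ensure_environment(env_dir: Path, requirements: Path) -> Path:
    """Create or update the virtual environment, then return its interpreter."""
    python = venv_python(env_dir)
    if needs_install(env_dir, requirements):
        # Creates the environment if absent; an existing one is reused.
        venv.EnvBuilder(with_pip=True).create(env_dir)
        subprocess.run(
            [str(python), "-m", "pip", "install", "-r", str(requirements)],
            check=True,
        )
        (env_dir / STAMP_NAME).write_text(requirements_digest(requirements))
    return python


# Example call (paths are placeholders, not the package's real layout):
# python = ensure_environment(Path("some-dir/venv"), Path("some-dir/requirements.txt"))
```

Because the check runs every time, a package upgrade that ships a new requirements file triggers a reinstall on the next run, and nothing touches the system's Python environment.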
Here’s the link to the repository: GitHub - brunoarine/org-similarity: Emacs package that helps org-mode users discover similar documents via TF-IDF and cosine similarity
I have a request for you people: please install the package and see if it works, and if it doesn't, I'd be very happy to address the issues. Also, if you have cool ideas for its implementation, don't hesitate to tell me! This is ticket opening season.
A few notes:
- Judging by the code, I'm a fairly mediocre Elisp programmer. If you have any suggestions on how to make it look nicer, I'm all ears. I'm also a little bit concerned about best practices when it comes to mixing two different languages. I make extensive use of well-known (and heavyweight) Python libraries to deliver the similarity results in a short time. I wonder if it's good practice to download and store them in a virtual environment inside the Emacs packages directory (a few projects download and compile stuff there too, so I think it doesn't hurt).
- In the future, I intend to implement more features, like grouping similar documents with clustering algorithms, or making it possible to choose other algorithms to infer document similarity. From my experience though, cosine distance + TF-IDF is a rather robust way to evaluate document distance without involving hundreds of megabytes of machine learning models, so I guess this is the best bang for the buck for the time being.
- One day, I intend to re-implement the Python part of the package in pure Elisp. In fact, I've already started doing it. Speed is still an issue, though. (Some Python packages are optimized for heavy calculation and use Fortran under the hood, which is hard to beat.)
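In case TF-IDF + cosine similarity sounds mysterious, here's a toy version in plain Python with no dependencies. It is only a sketch of the technique, not what the package runs (the real thing leans on scikit-learn and does proper text preprocessing). The tokenizer is deliberately naive, and the function names are mine.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase alphanumeric runs, no stemming or stop words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Turn each document into a sparse {term: weight} TF-IDF vector."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, so terms present in every document keep a nonzero weight.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1
        for term, df in doc_freq.items()
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({term: count * idf[term] for term, count in counts.items()})
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query_index, documents):
    """Return (index, score) pairs for all other documents, most similar first."""
    vectors = tfidf_vectors(documents)
    query = vectors[query_index]
    scores = [
        (i, cosine_similarity(query, vector))
        for i, vector in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Calling `rank_similar(0, notes)` on a list of note texts gives the other notes ordered from most to least similar to the first one. Everything is just word counts and a bit of arithmetic, which is why no big pretrained model is needed.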
Happy holidays to everyone!