Org-similarity 0.2 released

tl;dr org-similarity is an Emacs package prototype that helps you find similar notes by tapping into Python’s powerful machine learning packages like nltk and scikit-learn. Version 0.2 comes with some improvements and bug fixes, and is now compatible with org-roam v2. I would really appreciate your help in testing it! Installation instructions are in the link below.

Almost exactly two years ago, I created a post here about my prototype called org-similarity. It generates a list of documents that are similar to the open buffer, in descending order of similarity. Unfortunately, I had to freeze its development due to big changes in my personal and professional life. The project wound up a bit stale, especially after the org-roam v2 release and all its changes (for the better, of course). org-similarity didn’t quite get along with the properties drawer, and instead of id: links, it still generated file: ones, as org-roam v1 did.

However, in recent weeks, a few people have thanked me for the package (even if rudimentary), and others have kindly asked for bug fixes. That was kinda reinvigorating (especially after seeing the repo get more than 30 stars!). Well, I’m here to say that I intend to resume work on org-similarity, and today I refactored a lot of the code and fixed some long-overdue bugs :slight_smile:

Here are the major changes in v0.2:

  • Automated installation of Python dependencies. Users no longer need to worry about installing org-similarity’s Python dependencies manually. To make things easier for non-Pythonists, the former version recommended installing the dependencies manually via pip, which is not ideal: one should avoid mixing a project’s requirements with the system’s Python environment. With the help of virtual environments, all dependencies are now downloaded during the first run and installed inside the Emacs directory (more on that below). As a bonus, dependency checks are done at runtime, so when you get a package upgrade, the Python dependencies are updated automatically as well.

  • Better org-roam v2 compatibility. Now, org-similarity correctly parses the properties drawer and offers the option to create links using either IDs or filenames (as before).

  • Refactored and optimized Python code. I must level with you: the first version of org-similarity was made in quite a hurry, so all of it, but especially the text preprocessing, was very hacky, and I was kinda embarrassed to expose that monster to the world :slight_smile: The most recent version leverages orgparse, a nifty Python package that handles org-mode files in a sane way.
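
For readers curious how the automated dependency installation described above can work in principle, here is a minimal sketch using only Python's standard `venv` and `subprocess` modules. It is not the package's actual code: the function names, the `env_dir` location, and the `requirements.txt` file are assumptions made for illustration.

```python
import subprocess
import sys
import venv
from pathlib import Path


def venv_python(env_dir: Path) -> Path:
    """Return the interpreter path inside a virtual environment."""
    if sys.platform == "win32":
        return env_dir / "Scripts" / "python.exe"
    return env_dir / "bin" / "python"


def pip_install_command(env_dir: Path, requirements: Path) -> list:
    """Build the command that installs requirements into the environment."""
    return [str(venv_python(env_dir)), "-m", "pip", "install", "-r", str(requirements)]


def ensure_environment(env_dir: Path, requirements: Path) -> None:
    """Create the environment on first run, then (re)install requirements.

    Running pip on every call is what lets an upgraded requirements file
    pull in new dependency versions; pip skips what is already satisfied.
    """
    if not venv_python(env_dir).exists():
        venv.create(env_dir, with_pip=True)  # isolated from the system Python
    subprocess.run(pip_install_command(env_dir, requirements), check=True)
```

Calling `ensure_environment(Path("some/dir/venv"), Path("requirements.txt"))` (both paths hypothetical) would create the environment once and reuse it afterwards, without touching the system-wide Python packages.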

Here’s the link to the repository: GitHub - brunoarine/org-similarity: Emacs package that helps org-mode users discover similar documents via TF-IDF and cosine similarity

I have a request for you people: please install the package and see if it works, and if it doesn’t, I’d be very happy to address the issues. Also, if you have cool ideas for its implementation, don’t hesitate to tell me! This is ticket opening season :slight_smile:

A few notes:

  • Judging by the code, I’m a fairly mediocre Elisp programmer. If you have any suggestions on how to make it look nicer, I’m all ears. I’m also a little bit concerned about best practices when it comes to mixing two different languages. I make extensive use of well-known (and heavyweight) Python libraries to deliver the similarity results in a short time. I wonder if it’s good practice to download and store them as virtual environments inside the Emacs packages directory (a few projects download and compile stuff there too, so I think it doesn’t hurt).
  • In the future, I intend to implement more features, like grouping similar documents with clustering algorithms, or making it possible to choose other algorithms to infer document similarity. From my experience though, cosine distance + TF-IDF is a rather robust way to evaluate document distance without involving hundreds of megabytes of machine learning models, so I guess this is the best bang for the buck for the time being.
  • One day, I intend to re-implement the Python part of the package in pure Elisp. In fact, I’ve already started doing it. Speed is still an issue. (Some Python packages are optimized for heavy calculation and use Fortran under the hood—hard to beat that).
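
To make the TF-IDF plus cosine approach mentioned above concrete, here is a small standard-library-only sketch. It is an illustration, not the package's implementation (which relies on nltk and scikit-learn and includes stemming); the function names and the toy documents are hypothetical. The IDF term uses the smoothed form ln((1 + n) / (1 + df)) + 1, which, to my knowledge, matches scikit-learn's default.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Map {name: text} to {name: {term: weight}} with smoothed IDF."""
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(counts)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts.values() for term in c)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    return {name: {t: tf * idf[t] for t, tf in c.items()} for name, c in counts.items()}


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(target, docs):
    """Return (name, score) pairs for every other document, best first."""
    vectors = tfidf_vectors(docs)
    scores = [
        (name, cosine_similarity(vectors[target], vec))
        for name, vec in vectors.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


notes = {
    "a.org": "emacs org mode notes",
    "b.org": "org mode notes in emacs",
    "c.org": "fortran numerical linear algebra",
}
ranking = rank_similar("a.org", notes)
```

In this toy example, `b.org` ranks first for `a.org` because the two share most of their vocabulary, while `c.org` shares no terms with it and scores zero. Cosine distance is simply one minus this similarity.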

Happy holidays to everyone!


I installed 0.2 but ran into some problems. I fixed one with a PR on GitHub, but I still couldn’t make it work because of this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 18590: invalid start byte.

Can I set up org-similarity for multiple languages? English is my second language. Thank you very much for your awesome package!

@suliveevil Looks like it’s an issue from the orgparse package. It’s probably trying to open a file using the wrong encoding. I’ll investigate it further and try to either contact the package maintainers or implement some extra preprocessing step. Which language are your documents in? Could you give me a sample of some unicode characters that are raising this exception?
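
As a rough illustration of what such an extra preprocessing step might look like, the sketch below tries UTF-8 first and falls back to another encoding. This is an assumption about one possible fix, not what orgparse or org-similarity actually does; the function names and the choice of Latin-1 as the fallback are hypothetical. Byte 0xf8 is never a valid UTF-8 start byte, but in Latin-1 it is the letter "ø".

```python
from pathlib import Path


def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Decode bytes with the first encoding that works.

    Returns (text, encoding_used). Latin-1 maps every byte to a character,
    so with the default tuple the final lossy branch is never reached.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, replacing undecodable bytes with U+FFFD.
    return data.decode("utf-8", errors="replace"), "utf-8 (lossy)"


def read_text_with_fallback(path, encodings=("utf-8", "latin-1")):
    """Read a file as text, tolerating files that are not valid UTF-8."""
    return decode_with_fallback(Path(path).read_bytes(), encodings)
```

A fallback like this only guesses: a file in some other legacy encoding would decode without error but show the wrong characters, which is why knowing the language of the documents matters.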

org-similarity can be set up for different languages (in fact, any language supported by the Snowball stemmer that is packaged with nltk). You can do that by customizing the org-similarity-language variable (see here).

Thanks for the heads up!

this is really cool, thank you for making it! I’ve been trying to do something similar, also with python, but mine is suuuuuuuper hacky and basically brute force comparing by word count. this sounds way better!

Definitely going to give this a spin.


Hello, Bruno! I didn’t know about org-similarity before posting, ironically, about a similar proof of concept.

Would you mind looking at the post and seeing whether those ideas go in the same direction you intend for the package? I’m on the fence between investing time to make it a tool for others or keeping it as a hacky personal thing. From my perspective, it relates to this part:

In the future, I intend to implement more features, like grouping similar documents with clustering algorithm…

I’m trying it now, nice job!

Hello @lgmoneda! I was about to reply to your thread. I started doing the updates to org-similarity in one day, and ended up creating this post too late at night, and didn’t see your thread right at the top. And now that I did, I was quite surprised to realize we’re trying to tackle the same problem! What are the odds? :slight_smile:

I still have to take a closer look, but I loved the premise from your post. As soon as humanly possible I’ll take a better look into it!
