Org-similarity v2.1.0 released: lexical similarity search for Emacs

tl;dr: org-similarity is an Emacs package that finds documents similar to the current buffer, or to an ad-hoc query, using lexical similarity algorithms such as TF-IDF and BM25.

If your org-roam notes are atomic and stored one per file, this might be the perfect package for discovering connections between a new note and your knowledge base.
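To give a rough idea of what "lexical similarity" means here, below is a minimal, standard-library-only sketch of TF-IDF with cosine similarity. It is not the code used by org-similarity or findlike: the function names, the tokenizer, the choice to include the query when computing document frequencies, and the smoothed IDF formula (a common variant) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumed variant): terms found in every document
        # still keep a small positive weight.
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query, corpus):
    """Rank corpus documents by similarity to the query, best match first."""
    # Assumption: the query is weighted together with the corpus.
    vectors = tfidf_vectors([query] + list(corpus))
    query_vec, doc_vecs = vectors[0], vectors[1:]
    scored = [(cosine(query_vec, v), doc) for v, doc in zip(doc_vecs, corpus)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical notes standing in for a small knowledge base.
notes = [
    "Spaced repetition improves long-term memory retention",
    "Sourdough bread needs a mature starter",
    "Memory palaces are a mnemonic technique",
]
for score, note in rank_similar("How does spaced repetition help memory?", notes):
    print(f"{score:.3f}  {note}")
```

Documents that share rare words with the query rank highest, while documents with no words in common score zero. That is why this approach works best on notes that use a consistent vocabulary.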

How to install it

Using straight.el

Add the following to your init.el:

(straight-use-package '(org-similarity :type git :host github :repo "brunoarine/org-similarity" :branch "main"))

Doom Emacs

Add the following to packages.el:

(package! org-similarity :recipe (:host github :repo "brunoarine/org-similarity" :branch "main"))

Manual installation

Clone the repository:

git clone https://github.com/brunoarine/org-similarity

Then add the package to your load-path in your init.el:

(add-to-list 'load-path "/path/to/org-similarity")

Please check its official repository for updated instructions.

Q&A

Did org-similarity really jump from v0.2 to v2.1.0?

No. Some of the new sweet features had been around since January, but they weren’t officially released. Some remained in the develop branch, and others in specific feature branches.

But eventually I had some good ideas (I hope), the user base started to grow, and a sense of urgency ensued. I had to do what was possible with the little window of opportunity I had between work projects, so I released everything everywhere all at once.

What are the changes in each intermediate release tag?

v0.3

This release tag refers to the code that remained in the main branch from December 2022 until July 2023.

  • The org-similarity-sidebuffer function was added to display the similarity results in a side-window.
  • Better Python availability check (“python3” instead of “python”).
  • Fixed a typo in the code that prevented the results list from being rendered correctly when org-similarity-use-id-links was set to t.
  • The format_results function was modified to use the get_relative_path function.

v1.0.0

This release tag refers to the code that remained in the develop branch from December 2022 until July 2023.

  • Semantic versioning FTW.
  • Added BM25 as an alternative algorithm.
  • Added heading and prefix options.
  • Implemented a filter for minimum document size.
  • Added the org-similarity-remove-first option.
  • Decoupled the interpreter and dependency checks from the main function.
  • Renamed predicate functions for clarity.
  • Refactored command, executable, and dependency checks.

v2.0.0

The big deal.

  • Turned the Python part of the package into a standalone CLI tool called findlike, becoming a project on its own.

v2.1.0

  • Bug fixes, aesthetic improvements, and a new option to filter results by similarity score.

I didn’t like the new version / The older version gave me better results. What do I do?

If you prefer to stick to a particular release tag, just replace “main” in the :branch property with your preferred release tag (e.g. “v0.3”). The description of each tag is found above.

What’s the biggest change since v0.2?

In my opinion, it’s the stripping of all the Python code that was being shipped in the Emacs package.

Though (E)lisp is a fascinating programming language, it doesn’t seem to be ideal for matrix operations. So, in the beginning I resorted to Python and its weathered suite of numerical packages to perform the heavy lifting.

Mixing Lisp and non-Lisp code in the same Emacs package is not unheard of, so I was cool with that. (Even though handling virtual environments and Python package management from inside Emacs became my personal hell, and unsurprisingly, most of the tickets in the issue tracker were somehow related to that mess.)

But the fact is that I wanted to reuse that Python script in my terminal and elsewhere. That script was too useful to remain locked away in the dark towers of .emacs.d/.local.

I had enough reasons to separate concerns and turn the engine behind org-similarity into a standalone program. So I created findlike, a command-line tool that can be used anywhere.

What I love about command-line programs is that they are universal. You can use them in the shell as standalone programs, or incorporate them into your scripts and plugins. Hell, you can even make them work for you when serving dynamic web content.

Given that findlike operates as a command-line program and returns the results either as a list (just like org-similarity's Python routines did) or as JSON structured data, there should be no major interoperability hiccups. However, results may be slightly different because findlike uses a different tokenization procedure and a different list of stopwords.
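To illustrate why a different stopword list can shift the scores, here is a minimal Okapi BM25 sketch with an optional stopword filter. It is only an illustration, not findlike's implementation: the function names, the tokenizer, the stopword set, the default `k1` and `b` values, and the always-positive IDF variant are assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text, stopwords=frozenset()):
    """Lowercase, split into word tokens, and drop any stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in stopwords]


def bm25_scores(query, corpus, stopwords=frozenset(), k1=1.5, b=0.75):
    """Score each document in corpus against query with Okapi BM25."""
    docs = [tokenize(d, stopwords) for d in corpus]
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query, stopwords))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Assumed IDF variant that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            denom = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


# Hypothetical notes and stopword list, for illustration only.
notes = [
    "Spaced repetition improves long-term memory retention",
    "Sourdough bread needs a mature starter",
    "Memory palaces are a mnemonic technique",
]
query = "How does spaced repetition help memory?"
plain = bm25_scores(query, notes)
filtered = bm25_scores(query, notes, stopwords=frozenset({"a", "are", "how", "does"}))
print(plain)
print(filtered)
```

In this toy corpus the best match is the same in both runs, but the scores differ: removing stopwords changes the document lengths, and BM25 normalizes by length. On a real knowledge base, differences like this can reorder results that have close scores.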

I see you still haven’t integrated modern machine learning models, like the BERT family, into it. Why?

I didn’t feel like they would fit inside an Emacs package. Call me crazy or a purist, but as an Emacs user, I’d be really uncomfortable finding out that my .emacs.d folder suddenly takes up ~1GB, with 95% of it coming from a single package. I don’t want to be THAT developer ;)

On the other hand, I’m totally willing to integrate a small language model into findlike. So maybe we’ll have something like this in the near future.

Acknowledgements

I’d like to thank all of you who helped me test org-similarity and report bugs in the issue tracker.

And special thanks to @suliveevil and @AuroraDraco for their interest in the package and for contributing improvements!
