An idea for a roam-based bibliographic manager

I can’t stop thinking about using org-roam for full-stack bibliographic management, so I thought I’d share the idea. I don’t think I have all the skills necessary to do this myself, so I’m hoping someone will want to pick it up and/or collaborate with me on it.

The way I see it, maintaining org-roam bibliographic notes can be a bit kludgey at times, especially if you’re like me, and use org-roam, org-roam-bibtex, org-ref, helm-bibtex, bibtex-completion, and more. Here are some problems with that stack, as I see them:

Problems

  1. That’s a big software stack, with lots of moving parts to keep track of.
  2. Helm-bibtex and its siblings need to parse all your bibtex files each time you try to insert a citation. That can get pretty time-intensive, if you have a large bibtex file, like I do.
  3. Org-roam notes reference citations (cite:stanley2021), which point to entries in bibtex files (@article{stanley2021...), which are what ultimately contain the bibliographic metadata. But this means that you’re maintaining notes for bibliographic entries in two places, ultimately. This can get messy.
  4. Org-ref is slowly being replaced with org-cite, the built-in citation mechanism for Org-mode.

Proposed Solution

What if, instead of trying to Frankenstein together a bibliographic manager by combining org-roam, org-ref, org-roam-bibtex, helm-bibtex, and others, you could just use org-roam as your bibliographic manager? That way, not only would you simplify your software stack, but you could simplify your notes, too, by avoiding BibTeX altogether.

The pieces are all almost there. Org-bibtex, which is already included in Org, provides templates for org-based bibliographic metadata storage, using PROPERTIES drawers.

Here’s an example:

* A multi-language computing environment for literate programming and reproducible research    :babel:
  :PROPERTIES:
  :BTYPE: article
  :AUTHOR: Eric Schulte and Dan Davison and Tom Dye and Carsten Dominik
  :JOURNAL: Journal of Statistical Software
  :VOLUME: 46
  :NUMBER: 3
  :YEAR:     2012
  :MONTH: January
  :CUSTOM_ID: schulte2012babel
  :END:
Some annotation about babel.

This isn’t too far off from what an org-roam node looks like, as generated by org-roam-bibtex.

If this were your canonical format for bibliographic metadata, it would allow you to do some sophisticated org-ql queries for finding, say, all entries from 2021. It would also allow you to leverage hierarchical org-roam nodes to represent articles in edited collections: the collection itself could be the parent heading, and the collection’s articles could be subheadings.
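To make the query idea concrete, here is a minimal sketch, written in Python rather than Elisp purely to illustrate the data model. It is not org-ql or org-roam code: `parse_entries` and `entries_from_year` are hypothetical helpers, the second entry is made up, and the parser only handles the simple drawer layout shown in the example above.

```python
import re

ORG_TEXT = """\
* A multi-language computing environment for literate programming and reproducible research    :babel:
  :PROPERTIES:
  :BTYPE: article
  :YEAR:     2012
  :CUSTOM_ID: schulte2012babel
  :END:
Some annotation about babel.
* A hypothetical second entry
  :PROPERTIES:
  :BTYPE: book
  :YEAR: 2021
  :CUSTOM_ID: stanley2021
  :END:
"""

# A heading: stars, title, and an optional trailing tag group like ":babel:".
HEADING = re.compile(r"^\*+\s+(.*?)(?:\s+:[\w:]+:)?\s*$")
# A drawer line: ":NAME: value" (also matches ":PROPERTIES:" and ":END:").
PROPERTY = re.compile(r"^\s*:([A-Za-z_]+):\s*(.*?)\s*$")


def parse_entries(text):
    """Return one dict per heading: its title plus its drawer properties."""
    entries, current, in_drawer = [], None, False
    for line in text.splitlines():
        heading = HEADING.match(line)
        if heading:
            current = {"TITLE": heading.group(1)}
            entries.append(current)
            in_drawer = False
            continue
        prop = PROPERTY.match(line)
        if prop is None or current is None:
            continue  # body text, or a stray line before the first heading
        key, value = prop.group(1).upper(), prop.group(2)
        if key == "PROPERTIES":
            in_drawer = True
        elif key == "END":
            in_drawer = False
        elif in_drawer:
            current[key] = value
    return entries


def entries_from_year(entries, year):
    """Keep only the entries whose YEAR property equals the given year."""
    return [entry for entry in entries if entry.get("YEAR") == str(year)]
```

With this, `entries_from_year(parse_entries(ORG_TEXT), 2021)` returns only the made-up `stanley2021` entry. In Emacs the same filter would be a property match over headings, which is the kind of query org-ql is meant for.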

What would be needed

What would be needed for something like this is, I imagine:

  • Importer functions that could take ISBNs or DOIs, or even plain-text queries, and convert them to org-roam nodes, containing bibliographic metadata gleaned from some REST API. It could just be a light rewrite of org-ref’s isbn-to-bibtex function, and other functions like that.
  • Functions to cache bibliographic metadata in org-roam’s database, whenever it scans for changes.
  • Functions for helm, ivy, or whatever other system, to select bibliographic items from org-roam nodes, based on that cached metadata, like titles, authors, years, and so on.
  • A backend for org-cite’s citeproc, that can look up bibliographic data from the org-roam database, rather than looking for it in a bibtex file. That way, when exporting an org file containing citations, it can generate the Works Cited page automagically, using org-roam data.
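The first three bullets can be sketched roughly as follows, again in Python only to show the shape of the data flow. Everything here is an assumption for illustration: the record keys (`key`, `type`, `title`, `authors`, `journal`, `year`), the `refs` table, and the helper names are made up, the sample record is fictional, and the table is not org-roam's actual database schema. Fetching the record from a DOI or ISBN service is left out.

```python
import sqlite3

# A fictional record, shaped the way an importer might hand it over.
EXAMPLE = {
    "key": "stanley2021",
    "type": "article",
    "title": "A hypothetical article",
    "authors": ["A. Stanley", "B. Coauthor"],
    "journal": "Journal of Examples",
    "year": 2021,
}


def to_org_heading(record, level=1):
    """Render a metadata dict as an Org heading with a PROPERTIES drawer."""
    # Property names follow the org-bibtex example earlier in the thread.
    properties = {
        "BTYPE": record["type"],
        "AUTHOR": " and ".join(record["authors"]),
        "JOURNAL": record.get("journal"),
        "YEAR": record.get("year"),
        "CUSTOM_ID": record["key"],
    }
    lines = ["*" * level + " " + record["title"], "  :PROPERTIES:"]
    for name, value in properties.items():
        if value not in (None, ""):  # skip fields the source did not provide
            lines.append(f"  :{name}: {value}")
    lines.append("  :END:")
    return "\n".join(lines) + "\n"


def build_index(records):
    """Cache the searchable fields in an in-memory SQLite table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE refs (key TEXT PRIMARY KEY, title TEXT, authors TEXT, year INTEGER)"
    )
    db.executemany(
        "INSERT INTO refs VALUES (?, ?, ?, ?)",
        [
            (r["key"], r["title"], " and ".join(r["authors"]), r.get("year"))
            for r in records
        ],
    )
    return db


def search(db, text):
    """Return the keys of entries whose title or authors contain the text."""
    pattern = f"%{text}%"
    rows = db.execute(
        "SELECT key FROM refs WHERE title LIKE ? OR authors LIKE ? ORDER BY key",
        (pattern, pattern),
    )
    return [row[0] for row in rows]
```

A completion front end (helm, ivy, or anything else) would then only need to call something like `search` against the cached table, instead of re-reading the notes themselves.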

Let me know what you think, and whether you think this might deserve further thought and effort.

Org-roam v2 and org-cite both reflect a move towards more focused, and modular, tools. Org-cite, in particular, is incredibly modular.

I think the future is likely best served with that direction: equally focused and modular tools that integrate these pieces well.

The recent addition of org-cite indexing in org-roam, which includes a new citations db table, seems to provide a good foundation for just this.

I maintain bibtex-actions, and most of its code is independent of org. And none of it is specific to org-roam; not even the “open notes” embark menu items you see below.

But there are a few key places that integrate it with org-cite, so you end up with things like this within org/org-roam.

Or here’s the embark at-point integration menu.

So I think there’s a lot of opportunity for doing cool stuff with org-roam v2 and org-cite.

BTW, this issue seems related to your point on org-bibtex.

How would an Org-based literature database work with exporting citations to PDF? Would you then still need to maintain the BibTeX file?

Researching my own question, I stumbled on this interesting discussion by the Org-ref developer John Kitchin:

Let me know what you think, and whether you think this might deserve further thought and effort.

It definitely deserves further thought and evaluation. I would be interested to see a working prototype of this, but personally I wouldn’t invest my time in it right now, and here’s why:

  1. It’s technically more challenging than it may seem. And we already have a reference manager that indexes bibliographic data into an SQL database. It’s called Zotero.
  2. Emacs is more about text data, and BibTeX, with all its cons, cons and cons, fits perfectly into this scheme. The BibTeX format is mature and widely used. It’s the native format for JabRef, BibDesk and a few smaller-scale programs, and is supported to varying extents by all reference managers, by all the publishing houses and so on.
  3. An Org-based reference manager is ultimately limited to Emacs for the foreseeable future. I played with Org-bibtex many years ago and couldn’t figure out how to fit it into my workflow.

  1. That’s a big software stack, with lots of moving parts to keep track of.

In fact, no. Helm-bibtex is an interface to bibtex-completion and is not independent of it. Org-roam-bibtex is an extension to Org-roam and is not independent of it. You have two packages for two different tasks. It’s a sort of Unix way within Emacs.

  2. Helm-bibtex and its siblings need to parse all your bibtex files each time you try to insert a citation. That can get pretty time-intensive, if you have a large bibtex file, like I do.

Maybe something’s wrong with your config? Helm-bibtex and similar packages re-parse the BibTeX file only if it has changed. They store all the data in an internal cache, and reading from that cache is practically instantaneous. For me the initial (re-)parsing takes a bearable amount of time (~2 s) on ~2000 BibTeX entries, and it happens only once in a while, when I update the BibTeX file. As I mostly work with BibTeX in an external program (BibDesk), I often don’t notice it.
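The general pattern, re-parse only when the file has changed and otherwise serve the cached result, can be sketched like this. It is a Python illustration of the idea, not bibtex-completion's actual code: `FileCache` is a hypothetical helper, and using modification time plus file size as the change signal is an assumption made for this sketch.

```python
import os


class FileCache:
    """Cache the parsed form of files, re-parsing only when a file changes."""

    def __init__(self, parse):
        self._parse = parse  # function: file text -> parsed result
        self._cache = {}     # path -> (signature, parsed result)

    def get(self, path):
        stat = os.stat(path)
        # Assumed change signal: modification time plus size.
        signature = (stat.st_mtime_ns, stat.st_size)
        cached = self._cache.get(path)
        if cached is not None and cached[0] == signature:
            return cached[1]  # unchanged: skip the expensive parse
        with open(path, encoding="utf-8") as handle:
            result = self._parse(handle.read())
        self._cache[path] = (signature, result)
        return result
```

With a cache like this, the slow parse runs once after each edit to the file, and every later lookup is a dictionary access.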

  3. Org-roam notes reference citations (cite:stanley2021), which point to entries in bibtex files (@article{stanley2021...), which are what ultimately contain the bibliographic metadata. But this means that you’re maintaining notes for bibliographic entries in two places, ultimately. This can get messy.

I wouldn’t agree with that. You store bibliographic data in a proven and future-proof format. You however don’t store your PDF files in that format, do you? You just keep links to them. Neither do you store your bibliographic data in PDF format, although most modern PDF files in fact keep such data internally. So why would it be messy to store a conceptually rich-text note in a separate file?

  4. Org-ref is slowly being replaced with org-cite, the built-in citation mechanism for Org-mode.

No. Org-ref is adapting to the new mechanism, but it’s not going to be replaced by it.

All that said, I would encourage you to pursue your ideas. They have potential. They may not result in what you originally anticipated, but in something even more useful.

Org-cite alone will not replace org-ref, given the latter’s scope is much wider (it includes cross-referencing, glossaries, etc.).

But org-cite + smaller and more focused packages, along with possible tweaks to org, could.

Hopefully org-ref itself evolves in that direction, or if not, other developers and packages provide that.

@bruce By the way, do you have any idea why Org-cite’s parsing is so slow? I didn’t measure it precisely, but I managed to count to 100 while waiting for parsing initiated by org-cite-insert to finish. Bibtex-actions and Bibtex-completion parse the same library of 1954 entries within 2-3 seconds.

You mean the oc-basic insert processor?

I noticed that too. Does it not use parsebib?

Yes, I’m currently using the basic processor. I now have more time for a little bit of Elisp, so will try to dig into it, thanks for the tip.

You know about oc-bibtex-actions?

There’s an insert processor there (config example here).

Also:


I have a setup which is quite close to the one that you propose, except that all of my bibliography (and notes, and links to PDFs) is in a large biblio.org file.

I have commands that parse BibTeX that I have copied and pasted, and insert it at the right location in the file (which is sorted by reference id). Every time I add such an entry, org-bibtex just regenerates the .bib file, and I use org-ref to insert citations easily, go to the node, or get the PDF.
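For readers wondering what regenerating the .bib file amounts to, here is a rough sketch of the mapping from Org properties to a BibTeX entry. It is a Python illustration only, not org-bibtex's code: `to_bibtex` is a hypothetical helper, and it assumes the title has already been copied from the heading into a `TITLE` key. The sample values come from the babel example at the top of the thread.

```python
ENTRY = {
    "BTYPE": "article",
    "CUSTOM_ID": "schulte2012babel",
    "TITLE": "A multi-language computing environment for literate "
             "programming and reproducible research",
    "JOURNAL": "Journal of Statistical Software",
    "VOLUME": "46",
    "NUMBER": "3",
    "YEAR": "2012",
}


def to_bibtex(properties):
    """Render Org-style properties as a BibTeX entry string."""
    # BTYPE and CUSTOM_ID become the entry type and key, not fields.
    special = {"BTYPE", "CUSTOM_ID"}
    fields = [
        f"  {name.lower()} = {{{value}}}"
        for name, value in properties.items()
        if name not in special
    ]
    header = f"@{properties['BTYPE']}{{{properties['CUSTOM_ID']},"
    return header + "\n" + ",\n".join(fields) + "\n}\n"
```

Calling `to_bibtex(ENTRY)` gives an `@article{schulte2012babel, ...}` entry with one `name = {value}` line per remaining property. The sketch does no escaping of special characters, which a real exporter would have to handle.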

My big problem for a while was a workflow for capturing. Now I use this: I open my PDF in Emacs. If the paper interests me, I copy the title, and I have one function that uses biblio.el to look up entries in a database, retrieves the BibTeX, puts it in my .org file at the correct location, copies the PDF to a suitable place, and inserts a link to it in the biblio.org file.

If that is interesting, I can put my code online. I would be interested in splitting this file into several org-roam notes, as it is becoming quite large.

(But since in org-roam v2 notes can be outline headings within a file, I think my setup already works quite well with org-roam.)