Forgive me please: I am completely new to emacs and org-mode.
Can org-mode handle a very large number of notes? I have in mind more than a million.
Also, forgive me if the question seems stupid, but I could not find it answered anywhere.
In theory, there is no limit on the number: as many as your storage can hold. After all, notes are text files.
Org mode’s performance depends more on the size of each file than on the number of files. You would access files one at a time and wouldn’t access a million files at once.
For Org-roam, which is based on a database, there seems to be a practical limit on the number of notes. Some people have reported (either on Slack or GitHub; source lost) that performance degrades at around 40K notes.
So org-mode may be better for handling a larger number of files?
Better than what?
It may well be the case, but I cannot jump to that conclusion. What would you like to determine by asking this question? Perhaps you could explain a bit more of your context and motivation to help us understand where you would like to get to.
Let me explain and ask these two questions:
Can Org-roam handle a very large number of notes when the number exceeds a million?
Can Org mode handle a very large number of notes when the number exceeds a million?
My issues in answering these questions are as follows:
I can’t answer the second question above for Org mode because I don’t know what it means for Org mode to “handle” a large number of notes.
When you use Org mode, you open a single file and then Emacs turns on Org mode. I am not sure if you can get Org mode to “handle” multiple files at the same time like Org-roam does using the database. When you open a note using Org-roam, Emacs turns on Org mode. I am not sure if I can compare the two in this case.
The premise of having a million files does not seem reasonable for a personal knowledge management tool.
Org-roam is designed as a personal tool, and the basic premise is that you write your notes by your own hand (refer to the guide by Jethro, the author). Even if you write 10 notes every day, 365 days a year (I struggle to write so many in a single day), it would take you about 274 years to have 1 million notes (1,000,000 / 3,650 per year). What’s the reason behind more than 1 million notes?
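For anyone who wants to check the arithmetic, here is a quick sketch. The inputs (10 notes a day, 365 days a year, a target of 1 million notes) are the figures from the estimate above; the function name is made up for illustration.

```python
def years_to_reach(target_notes: int, notes_per_day: int, days_per_year: int = 365) -> float:
    """Return how many years it takes to write `target_notes` at a steady pace."""
    notes_per_year = notes_per_day * days_per_year  # 10 * 365 = 3650
    return target_notes / notes_per_year


years = years_to_reach(1_000_000, 10)
print(f"{years:.1f} years")  # roughly 274 years
```

Even at 100 notes a day it would still take more than 27 years, so a million hand-written notes is not a realistic target for one person.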
Thanks for a detailed answer. Appreciate it.
Let me specify:
I have a large collection of OCRed books (15,000+), as well as over 15,000,000 .txt files (university department material). Each file ranges from a couple of paragraphs to the equivalent of a 1,000-page book.
I thought I might use org-mode or org-roam as a sort of Memex machine (if you know Vannevar Bush’s idea) to access this large library, so that it may be “consulted with exceeding speed and flexibility.” I know that org-mode and other handy tools like Obsidian are built on the idea of personal note taking, rather than serving as a more “universal” knowledge management tool.
Perhaps I should just drop the idea of using these tools aimed at personal note taking?
Should I rather look for enterprise-level solutions?
But then I wonder, is it not all just a matter of CPU performance?
Thank you for the detail. You have interesting needs.
Apologies if anything below sounds too obvious to you. I cannot assume your software knowledge. I would not claim to be an expert in these things (I’d invite others to jump in to help); this is as far as I could get on this topic.
As I understand it, you would like to have a personal full-text search engine for the books and text files stored in your computer. You would like to type some keyword(s) or phrase(s) to look for documents that contain them somewhere in their content, not just in their file names or titles – like Google does for the web.
That’s not Org or Org-roam. This is not because they are personal tools; it is because they are not full-text search engines.
I think you could extend your hunt in two broad categories of software tools. One category is command line utilities such as grep and Ripgrep. The other is full-text search engines (usually a combination of indexer and user interface).
Try Ripgrep or grep first with a keyword that you know occurs in some of your documents. This is just because they are easy to get and try (especially on macOS or Linux). If it works for the volume of documents, it can be a solution for you. My guess is that it would take too long to be practical for your daily use.
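If you want a rough feel for the cost of a brute-force scan before installing anything, something like the sketch below might do. It is not Ripgrep and will be far slower; it only illustrates what "search every file for a keyword" means and lets you time it on a sample folder. The function names and the `*.txt` pattern are my assumptions.

```python
import time
from pathlib import Path


def search_files(root: str, keyword: str, pattern: str = "*.txt") -> list[tuple[str, int, str]]:
    """Return (path, line number, line) for every line that contains `keyword`."""
    hits = []
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going on files with odd encodings.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if keyword in line:
                    hits.append((str(path), number, line.rstrip("\n")))
    return hits


def timed_search(root: str, keyword: str) -> tuple[int, float]:
    """Run one search and return (number of hits, elapsed seconds)."""
    start = time.perf_counter()
    hits = search_files(root, keyword)
    return len(hits), time.perf_counter() - start
```

Calling `timed_search` on a subfolder holding, say, 1% of the library gives a number you can extrapolate from, with the caveat that disk caching makes a second run much faster than the first.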
Then you could look for a search engine that you could get and set up. These tools tend to be much more complex to get working. Maybe something like DocFetcher. I’ve never used it, so I’m not sure if it is good or works for the volume of data you have, but this one looks easy enough.
A more complex “bespoke” solution might be to use something like Meilisearch. There are a number of similar tools; refer to this, for example, for a list of these tools (the author is clearly in favour of Meilisearch but the list might be useful for you). Meilisearch does not index text files directly, so you would need to craft a program to generate JSON files from your text data to be ingested into Meilisearch.
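As a rough idea of what such a program could look like, here is a minimal sketch that turns a folder of text files into one JSON array of documents. The field names (`id`, `path`, `title`, `content`) are my assumptions, not something Meilisearch prescribes beyond needing a unique identifier per document; please check its documentation for the exact requirements before relying on this.

```python
import hashlib
import json
from pathlib import Path


def build_documents(root: str, pattern: str = "*.txt") -> list[dict]:
    """Turn each matching text file under `root` into one JSON-ready document."""
    root_path = Path(root)
    documents = []
    for path in sorted(root_path.rglob(pattern)):
        if not path.is_file():
            continue
        relative = path.relative_to(root_path).as_posix()
        documents.append({
            # A hash of the relative path gives a stable, purely alphanumeric id.
            "id": hashlib.sha1(relative.encode("utf-8")).hexdigest(),
            "path": relative,
            "title": path.stem,
            "content": path.read_text(encoding="utf-8", errors="replace"),
        })
    return documents


def write_batch(documents: list[dict], out_file: str) -> None:
    """Write the documents as a single JSON array."""
    text = json.dumps(documents, ensure_ascii=False, indent=2)
    Path(out_file).write_text(text, encoding="utf-8")
```

For 15 million files you would not want one giant JSON file; writing one batch per subfolder, or per few thousand files, is probably more practical.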
Thanks for a detailed answer.
Yes, I thought about ripgrep. As far as accessibility and search speed go, it seems the best option for full-text search over a large number of files.
Maybe integrating ripgrep with emacs would do the job? It can be made interactive, and to use it with emacs would be interesting.
The thing is, in addition to all this (ripgrep full-text search over my gargantuan library, possibly made interactive), I need to incorporate internal links. This is what note taking tools such as org-mode or Obsidian have, crafted as they are for personal note taking. What I need is more of a general knowledge management tool/base, a more convenient digital library, a semi-Memex (call it what you may) to consult. Or can you suggest any other tool that may do this?
So a good answer to my requirements may be org-mode with interactive ripgrep, since org-mode can do internal links? Maybe something like deadgrep would do?
Sorry if asking impertinently, but it is helpful to have some feedback on ideas.
You may want to look into Devonthink (macOS only), which was made with these exact requirements in mind. In addition to your wish list, it is also really good at figuring out which documents cover similar topics (with some kind of AI). It is also one of the few programs that would have no problems dealing with a million documents.
Org roam is better at being a pkm system, something quite different than an “everything bucket”. Org mode is a way to format text files (but also can do much more), perhaps similar to markdown or FoldingText.
Actually, I’ve tested ripgrep vis-a-vis my gargantuan library and the search performance is pretty satisfactory. Thus I foresee no problem trying org-mode.
Though still need to research how org-mode and ripgrep can work together.
Deadgrep may do?
In particular inspired by this article.
Frankly, I am also attracted by the long-term robustness of org. I would rather convert to a time-robust tool, even if it means sacrificing some of the fancier functions and speed of something like Devonthink.
There might be a misunderstanding here: org-mode is just a way to format text files. It does not have a database behind it. The “Agenda View” already struggles with 500+ files exactly because of this (see here, for example). Org roam does have a database layer but may struggle north of 40K notes (or nodes?). You could leave your library and org roam in separate folders and create notes in org-roam for the sources you have read and the topics you discovered (going in the direction of a Zettelkasten).
Btw Devonthink works perfectly fine with text files and does not change them in any way (unlike Evernote, OneNote etc). In fact, a DT database file is a directory (macOS files are strange) and you can easily copy any file/folder without opening DT. It is also easy to index external folders in DT.
Thanks for suggestions.
Right. I meant using emacs and ripgrep.
The ideal one tool = large library + ripgrep + ?
? = something that supports internal links
Emacs has some packages that interface with Ripgrep; e.g. refer to this post in this forum. I use both
I think you have enough tools to start forming your own setup. I’d experiment with the following setup if I had a similarly large knowledge base to start with.
- Have clear separation of your own thoughts and the library of text files and books
- Have a root directory/folder for your notes and library — this way, you can use Ripgrep, etc. to search your notes and library at the same time
- Your own thoughts are written as notes with Org-roam
- Notes can have a forward link to the source file in your library
- “Internal links” are established among your thoughts, thus among your notes, not among text files in the library
- Optionally use my Org-remark to highlight parts of text files and write “marginal notes” inside Org-roam — this way, you can “add” links from the article to your notes without changing the files in your library (the yellow highlighter in the illustration)
I believe it’s the same model @laotang suggests above.
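To make the "note links forward to the source" idea concrete, here is a small sketch that generates the text of an Org note pointing at a file in the library. It assumes notes and library sit in sibling folders under one root, and that a note needs an ID property drawer at the top to be picked up as a node (that is my understanding of Org-roam v2; please verify against your version). The function name and the `Source:` line are made up for illustration.

```python
import uuid
from pathlib import Path
from typing import Optional


def org_note_for_source(title: str, source_path: str, note_id: Optional[str] = None) -> str:
    """Return the text of an Org note that links forward to a library file."""
    node_id = note_id or str(uuid.uuid4())  # fresh ID unless one is supplied
    file_name = Path(source_path).name
    lines = [
        ":PROPERTIES:",
        f":ID: {node_id}",
        ":END:",
        f"#+title: {title}",
        "",
        # Standard Org link syntax: [[file:PATH][DESCRIPTION]]
        f"Source: [[file:{source_path}][{file_name}]]",
        "",
    ]
    return "\n".join(lines)
```

For example, `org_note_for_source("Thoughts on the Memex", "../library/some-article.txt")` (a hypothetical file name) yields a note whose link opens the library file without the library file itself being touched.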
Thanks for such a detailed answer and helpful sketch. Appreciate it.
This is precisely what I was aiming at. Will try to build upon something like this.
Though changing the files/notes in the library folder (i.e. writing in content) will do no harm, no?
I guess that depends on the nature of change. The text files and OCRed books are works of others, so it’s no good if you changed the original content.
The key principle is probably that you should be able to clearly differentiate which parts of the file content are the original and which parts are your additions. You might like to consider a long-term solution for this, as we tend to forget what we changed within as little as a year or two…
- Changing the original content can be problematic, as it might make it difficult to make proper reference if you wish to have verbatim quotes.
- Changing the “format” might be OK: adding metadata, adding your links to other files, or adding end-of-line characters ("\n") to materials where the line number is of no significance (I believe some legal materials may need to record the line numbers as per the original). For example, it might be OK if you converted the files to .org files and then added your metadata to them.
You might like to consider version management, such as Git? Not sure if it is practical to use it for a folder that contains >15 million files, though…
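If Git turns out to be too heavy for that many files, a lighter-weight fallback might be to record a checksum of every original file once and compare against it later. This is only a sketch of that idea, not a substitute for version control: it tells you that a file changed or disappeared, not what changed. The function names are made up.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: str) -> dict[str, str]:
    """Map each file's path (relative to `root`) to its digest."""
    root_path = Path(root)
    return {
        path.relative_to(root_path).as_posix(): file_digest(path)
        for path in root_path.rglob("*")
        if path.is_file()
    }


def changed_files(root: str, manifest: dict[str, str]) -> list[str]:
    """List files from `manifest` that were modified or removed since it was built."""
    current = build_manifest(root)
    return sorted(name for name, digest in manifest.items() if current.get(name) != digest)
```

You would build the manifest once, before you start annotating, store it outside the library folder (for instance as JSON), and run `changed_files` whenever you want to know which originals you have touched.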
I often add comments to those library files as I read them (practicing writing skills by rewriting sentences of various authors—learning by imitation); of course, those comments are always differentiated.
The library files are mostly creative commons or archival stuff (university library with old books) so by and large no copyright issues.
But from a technical or performance standpoint there is no barrier to editing those library files in emacs, no?
Also, any org-roam alternative in emacs that can do internal links?
Anyway, thank you very much for all your suggestions.
I’ll give it a try.
For links, you can use Org mode for .txt files if you wish; you can set up Emacs to associate the .txt file extension with Org mode. Or you can use any other mode, such as Markdown mode or Howm mode, the latter of which uses >>> to indicate a link.
I don’t know many other modes, but I’m sure there are many more in the Emacs universe.
I see no technical issues. What’s your concern? As far as you have told us, the files are plain text files, and Emacs works well on them. The only caveat is that if any of the files has no newline characters (that is, the entire file is one long line of text), Emacs is known to struggle with a very long single line; see this EmacsWiki: So Long
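If you want to find out in advance which files might cause this, a small scan along these lines could flag them. The 10,000-byte threshold is an arbitrary assumption on my part, not a documented Emacs limit, and the function names are made up.

```python
from pathlib import Path


def longest_line_length(path: Path) -> int:
    """Return the length in bytes of the longest line, excluding the line ending."""
    longest = 0
    with path.open("rb") as handle:
        for line in handle:
            longest = max(longest, len(line.rstrip(b"\r\n")))
    return longest


def files_with_long_lines(root: str, threshold: int = 10_000,
                          pattern: str = "*.txt") -> list[tuple[str, int]]:
    """List (path, longest line length) for files exceeding `threshold` bytes."""
    flagged = []
    for path in Path(root).rglob(pattern):
        if path.is_file():
            length = longest_line_length(path)
            if length > threshold:
                flagged.append((str(path), length))
    return sorted(flagged)
```

Files that show up in the result are candidates for re-wrapping (where line numbers do not matter) before you open them in Emacs.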
Changing the filename and headlines might break org-mode links, if you are not using org-id. Org-remark is a terrific way to add comments and remarks. Another one would be to use org-mode footnotes, which would then stay within the original source.
I use a similar system. My library, however, is spread between folders, Devonthink and Zotero (all can be linked to by org-mode).