An Archivist’s View – Jamie Todd Rubin

Productivity

May 26, 2026

Note: This post is the first in a new series of posts describing ark, a command-line-based personal archive system I have developed and am using as my primary archiving tool. If you are not interested in this type of tech post, feel free to skip it. For those who are interested, I plan to publish a new entry in the series each Tuesday for the next 14 weeks.

ark is not publicly available on GitHub at the moment. The system has been highly tailored to how I work. It is also highly tailored to the Mac environment. I am on the fence about making it publicly available because I don't have the time or inclination to support it. While I was careful with the design, the design was egocentric in that the one and only user I had in mind was me. Depending on the feedback I hear over the next fourteen weeks, I'll decide whether to make the code publicly available.

A couple of months ago, I was searching for a trust document. I've got two computers and two external drives. I tried multiple combinations of searches across all devices. I used Spotlight on both machines, and then switched to Unix-based search commands. Despite knowing the document existed somewhere, my search abilities couldn't surface it.

I. A Moment of Friction

I have thought about personal archives now and then, the kind you read about in biographies of notable people. I recall reading about how Boston University asked Isaac Asimov to collect his papers. I began to wonder if it was possible to create a similar archive for all of my papers. I took a small step in this direction in late 2024 with my Personal Archive System. This was a web-based experiment to see what was in the realm of the possible, but I didn't give much thought to the long-term design and architecture, the very kinds of things I do in my day job.

At the same time, I had been using Claude Code at work to help me build a command-line system that made it easy to interact with Jira and tie Jira into an LLM for easy summarizing. I was impressed by how well Claude Code worked and how it felt almost like a real collaborator. In that project, I acted as an architect, and Claude Code did all of the grunt work for me.

At the end of March, I decided to sit down and, with Claude Code as a kind of partner¹, talk through how to best design a personal archive system that takes its core design principles from real archiving principles, while meeting my requirements for what I wanted in an archive system.

II. An Archivist's Lens

Longtime readers know that I have gone through at least two major iterations of personal archiving of sorts. The first, in the early 2010s, was the time I spent using Evernote to go paperless. That experiment lasted several years. Ultimately, however, it wasn't a good fit for what I was trying to do. Part of the problem was that, at the time, I'm not sure I knew what I was trying to do. I knew that I wanted to be able to find things quickly. To do that, Evernote required some amount of metadata infrastructure (notebooks, tags), and for me, maintaining that became a roadblock.

The second wave, in the 2020s, has been my use of Obsidian to go practically paperless. One thing that attracted me to Obsidian was its simplicity. At its core, it was plain text, the most basic, most portable form of data there is. Another thing that attracted me was that it was entirely local. No need to store data in the cloud. Everything was on my local machine. Ultimately, however, what I discovered was that both Evernote and Obsidian were working tools. That is, places to do work, take notes, etc., rather than a stable archive of work already done. In other words, these tools are optimized for now. An archive is optimized for posterity.

As I worked on the design of the system, one question guided every decision: how would a real archive handle this, and where does the personal context require an adaptation? One obvious difference: in a public archive, the archivist and the subject are different people; here they are the same person. Other than that, the principles that guide a public archive could apply here. These principles include things like:

Provenance
Finding aids
Controlled vocabulary
Sensitivity
Accession

For instance, every item in the archive comes from somewhere in either the digital or physical world. This is its provenance. An archivist typically organizes items in an archive in a hierarchy that starts with series at the top. A series might have sub-series. Items in a sub-series might go in one or more files. Within the files are the items themselves. There is always, therefore, a clear path to an item in the archive.
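As a rough illustration of that path, here is a minimal Python sketch of how an item's place in the hierarchy, along with its provenance, might be modeled. The class name, field names, and sample values are all hypothetical; this post does not describe ark's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ArchiveItem:
    """Hypothetical record: where an item lives and where it came from."""
    series: str                 # top of the hierarchy
    subseries: Optional[str]    # a series may or may not have sub-series
    file: str                   # the file (folder) holding the item
    name: str                   # the item itself
    provenance: str             # the digital or physical origin of the item

    def path(self) -> str:
        """Return the path from series down to the item."""
        parts = [self.series, self.subseries, self.file, self.name]
        # Skip the sub-series level when the series has none.
        return "/".join(p for p in parts if p)


item = ArchiveItem(
    series="Legal",
    subseries="Estate Planning",
    file="Trusts",
    name="living-trust.pdf",
    provenance="scan of paper original",
)
print(item.path())
```

Because the path is derived from the hierarchy rather than stored separately, an item cannot end up with a location that disagrees with its series and file.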

III. Non-Negotiables

I came into the design discussion with several non-negotiable design decisions:

Local data storage: The archive would be designed to be stored locally on a file system.
A clear data egress boundary: A personal archive by its very nature will contain sensitive documents from across a life: medical, financial, legal, etc. The sensitivity of these documents is decided at ingest and enforced at egress as a hard rule. As we will see, the archive makes use of LLMs for a variety of tasks. But most LLMs reside on the internet and that means sending data to them. ark was designed to block sensitive data from egress to these sources².
Hands-off automation: When bringing in hundreds of thousands of documents, manual classification, linking, etc. is out of the question. There have to be mechanisms for automating this process.
Durability over cleverness: Every architectural decision, from the local file storage to the database, Unix composability, and the agnostic LLM layer, favors longevity. Services come and go. Formats change. APIs get deprecated. The goal isn't to have the cleverest tool; it is to have a tool that still works fifty or a hundred years from now.
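To make the egress boundary concrete, here is a minimal Python sketch of the idea: sensitivity is assigned once at ingest, and a single gate checks it before anything leaves the machine. The sensitivity levels, class names, and the `texts_for_destination` function are assumptions made for illustration, not ark's actual code.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Sensitivity(IntEnum):
    """Hypothetical sensitivity levels, assigned at ingest."""
    PUBLIC = 0
    PERSONAL = 1
    RESTRICTED = 2  # e.g. medical, financial, legal


@dataclass(frozen=True)
class Document:
    doc_id: str
    text: str
    sensitivity: Sensitivity  # frozen: cannot be changed after ingest


class EgressBlocked(Exception):
    """Raised when sensitive material is about to leave the machine."""


def texts_for_destination(docs: List[Document], is_local: bool,
                          max_remote: Sensitivity = Sensitivity.PUBLIC) -> List[str]:
    """Return document text for an LLM call, or refuse as a hard rule.

    A local model may see everything; a remote one only sees documents
    at or below max_remote.
    """
    if not is_local:
        blocked = [d.doc_id for d in docs if d.sensitivity > max_remote]
        if blocked:
            # Fail loudly instead of silently dropping documents.
            raise EgressBlocked("blocked from egress: " + ", ".join(blocked))
    return [d.text for d in docs]
```

Raising an exception, rather than quietly filtering, is one way to treat the boundary as a hard rule: a caller cannot send a partial set of documents without noticing that something was held back.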

IV. Pipeline as an Operational Spine

There are seven stages that form the operational backbone of ark.

Ingest: Includes parsers for email, PDFs, images, Office documents, and Markdown, as well as other sources of data like calendars, text messages, read data, music, health data, and more. When ingested, this data goes through type detection, text extraction, OCR, and deduplication. Think of this as the narrow waist where everything entering the archive gets normalized into the same shape.
Enrich: Two flavors of enrichment: (1) automated, which includes things like classification, embeddings, mood scoring, place and person extraction; (2) human enrichment, which includes things like annotations and people curation.
Store: Content-addressed file storage plus a SQLite database holding records, the full-text index, embeddings, annotations, and the relationship graph. Ingested data enters the store and becomes a permanent immutable document³.
Search: More formally, retrieval. This includes full-text, semantic, hybrid, and person-aware searches, plus a scoring system that ranks results by signal rather than by raw text match.
Surface: What the archive can show you outside of searches: timelines, “on this day”, action-item digests, etc. These are distinct from retrieval in that you don't explicitly ask for them. The archive offers them.
Synthesis: Creating “bundles”, LLM tasks, deep searches, MCP. This is where the archive stops being a corpus and starts answering questions.
Stewardship: Proactively looking for and creating actions based on what is coming into the archive; automatically drafting replies, etc., for incoming activities. This is the bridge to automation that may allow the archive to act on my behalf as a kind of personal assistant.
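The Store stage pairs content-addressed files with a SQLite catalog. As a sketch of that general technique, and under the assumption that the address is a SHA-256 digest of the file's bytes (the post does not say which hash ark uses), the following Python shows how storing the same bytes twice yields one stored file while still recording both origins. The table names, directory layout, and function names are hypothetical.

```python
import hashlib
import sqlite3
from pathlib import Path
from typing import Tuple


def open_store(root: Path) -> sqlite3.Connection:
    """Create the (hypothetical) catalog tables if they do not exist."""
    root.mkdir(parents=True, exist_ok=True)
    db = sqlite3.connect(str(root / "catalog.db"))
    db.execute("CREATE TABLE IF NOT EXISTS documents "
               "(digest TEXT PRIMARY KEY, size INTEGER NOT NULL)")
    db.execute("CREATE TABLE IF NOT EXISTS sources "
               "(digest TEXT NOT NULL REFERENCES documents(digest), "
               "source TEXT NOT NULL)")
    return db


def store_document(root: Path, db: sqlite3.Connection,
                   data: bytes, source: str) -> Tuple[str, bool]:
    """Store bytes under their digest; return (digest, newly_stored)."""
    digest = hashlib.sha256(data).hexdigest()
    # Fan out by the first two hex characters so no directory grows huge.
    path = root / "blobs" / digest[:2] / digest
    is_new = not path.exists()
    if is_new:
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)  # written once, never modified afterwards
    db.execute("INSERT OR IGNORE INTO documents (digest, size) VALUES (?, ?)",
               (digest, len(data)))
    # Record every origin, even for duplicates, so provenance is preserved.
    db.execute("INSERT INTO sources (digest, source) VALUES (?, ?)",
               (digest, source))
    db.commit()
    return digest, is_new
```

Because the file name is derived from the content, deduplication falls out of the storage scheme itself: identical bytes from two drives map to the same address, and only the list of sources grows.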

V. What's Coming in the Rest of the Series

The remainder of this series will describe each of these layers from inside the design:

Part 2 zooms into the data model and graph that serve as the connective tissue between documents, people, and annotations.
Part 3 covers the ingestion process — how data comes into the archive, in bulk, and through a daily ingestion process.
Part 4 covers retrieval.
Parts 5-6 cover the enrichment layers.
Part 7 gives a tour of the way books and reading are handled by ark. Both are a significant part of my daily life, and warranted their own functions in the system.
Part 8 covers synthesis: bundles, LLM tasks, deep searches, and the “recommend” surface.
Part 9 covers stewardship — the layer that turns ingested material into prioritized work.
Part 10 covers publishing back out via the outbox.
Part 11 covers the operational tools and functions of living with the archive day-to-day: sync, location, timeline, day rank, etc.
Part 12 covers the four interfaces: CLI, TUI, MCP, and Vim.
Part 13 is all about the long game: production, backup, export, and durability.
Part 14 closes with a portrait of what the archive actually holds.

VI. The Long Game

ark looks the way it does because archivists solved these problems first, and because the design choices that matter most at a five-year horizon are different from those that matter most at a fifty-year horizon. Keep that in mind as you read the rest of the series.

VII. Coda

It took about a week of work to get the core system up and running, and to get 80% of the documents I had into the archive. As I write this, there are 680,497 items in the archive. There were around 400,000 at the end of that first week. At that point, I used the tool to look for the trust document I had been unable to find. By then, the enrichment layer was in place, and the basic search functionality automatically enriched the search as well. I ran a simple search command in ark:

ark search "living trust" --type pdf

which, after about 0.30 seconds, returned a single match: the exact document I was looking for in the first place.
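For readers curious what a query like that might do underneath, here is a minimal Python sketch of a phrase search with a type filter, using SQLite's FTS5 full-text extension. That ark uses FTS5 at all is an assumption (the post says only that the SQLite database holds the full-text index), and the table, column names, and sample rows are hypothetical. The sketch also leaves out the enrichment and ranking-by-signal described earlier.

```python
import sqlite3
from typing import Iterable, List, Tuple


def build_index(rows: Iterable[Tuple[str, str, str]]) -> sqlite3.Connection:
    """Build an in-memory FTS5 index from (title, body, doc_type) rows."""
    db = sqlite3.connect(":memory:")
    # doc_type is stored for filtering but excluded from the text index.
    db.execute("CREATE VIRTUAL TABLE docs "
               "USING fts5(title, body, doc_type UNINDEXED)")
    db.executemany(
        "INSERT INTO docs (title, body, doc_type) VALUES (?, ?, ?)", rows)
    return db


def search(db: sqlite3.Connection, phrase: str, doc_type: str) -> List[str]:
    """Return titles of documents of one type that contain the exact phrase."""
    # Double quotes make FTS5 treat the words as one phrase;
    # embedded quotes are escaped by doubling them.
    query = '"' + phrase.replace('"', '""') + '"'
    cur = db.execute(
        "SELECT title FROM docs WHERE docs MATCH ? AND doc_type = ? "
        "ORDER BY rank",
        (query, doc_type),
    )
    return [row[0] for row in cur]


db = build_index([
    ("Family Living Trust", "This living trust agreement is made by the grantor.", "pdf"),
    ("Contractor invoice", "Can we trust the living room contractor?", "pdf"),
    ("Estate notes", "Questions about the living trust for the attorney.", "markdown"),
])
print(search(db, "living trust", "pdf"))
```

The second sample row contains both words but not the phrase, and the third contains the phrase but is not a PDF, so only the first row satisfies both conditions.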

Two months of nights and weekends in — about 1,250 commits, 116,000 lines of Python, 4,800 tests — ark holds 680,497 documents across fifty tables: emails going back to 1994, every diary entry I've digitized, photos with their GPS, scanned tax records, blog posts, calendar events, books I've finished, music I've listened to. The 9 GB SQLite database and 125 GB content store live on this laptop and back themselves up overnight. Everything I've described above is what makes that pile navigable.
