
A Day with the Archives – Jamie Todd Rubin

Productivity

June 3, 2026


“Pivot! Pivot! PIVOT!” — Ross Geller

I had planned to write today about ark's store and how it manages the items in the archive. But it occurred to me that in writing about ark last week, I didn't really show any of the system. Instead, I told a lot about it. And as they say in the writing world: show, don't tell. So this week I've opted to show what ark looks like so you get a tour before seeing the plumbing. Part 1 made the case for the archivist's lens. Part 2 shows what it looks like. I've cast this as seven “scenes” through a day (not necessarily a literal day, but seven realistic moments stitched into one). The post looks long only because of the archive output for the examples I've provided.

Scene 1 – Morning: Catching Up

ark holds a wide range of items from throughout my life. I need an easy way to see what's there for a given day, and this is where the ark day command comes in handy. Here is what it looks like for the day before I began writing this post:

A few things to note here. First, ark is pulling information into this view from a variety of sources in the archive. The header contains the date (and how much time has passed between that date and the time the command was run). It has my location, which ark captures either explicitly through artifacts in the archive, or through a default location inferred over time. The health information (steps, sleep, etc.) comes from Apple Health.

Seven sections follow that try to provide a high-level picture of the day: calendar events and diary entries lead the way. The bracketed numbers (for example, [1]) are sigils that can be used with other ark commands in place of longer document IDs. So I can type ark show @3 to see the full diary entry.

Reading is important to me, so I can see what I was reading on that day, including what I started and what I finished. Any reminders from Apple Reminders show up for that day. Email appears too, but note that it is filtered in this view to show only mail from known contacts; despite my getting 37 messages, none were from known contacts that I'd want appearing here.

There is a list of people I interacted with on that day, or who appear in documents that were ingested on that day, followed by a list of text messages (or message groups) that took place on that day.

The command provides an easy way to get a view of any day in the archive. I could also have ark summarize the day for me via ark day 2026-05-30 --summarize. Like most ark commands, there are plenty of options for filtering and controlling the output. Here is what ark day --help looks like:

Scene 2 – Triage: What Wants Attention

With a critical mass in the archive, and new items always coming in, it was a logical next step to make use of all of that information. After reviewing the day, I might make use of ark's “stewardship” layer. ark reviews items in the archive, as well as items coming into the archive, and identifies things that require my attention or action. From the CLI I can get a digest of this as follows:

or in TUI:

Same digest in TUI view

Here, I can see at a glance the things that require my attention and take action on them¹. ark uses the Eisenhower matrix to sort actions into four quadrants: (1) Do First, (2) Schedule, (3) Delegate, and (4) Delete. That is what Q1, Q2, etc., represent.
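
To make the quadrants concrete, here is a minimal sketch of the standard Eisenhower mapping from urgency and importance to a quadrant label. It is an illustration only, not ark's actual code: the Action class, its urgent and important flags, and the quadrant function are all hypothetical names, and how ark decides what counts as urgent or important is not described here.

```python
from dataclasses import dataclass


@dataclass
class Action:
    # Hypothetical fields; ark's real action records are not shown in this post.
    title: str
    urgent: bool
    important: bool


def quadrant(action: Action) -> str:
    """Map an action to an Eisenhower quadrant label (Q1..Q4)."""
    if action.urgent and action.important:
        return "Q1"  # Do First
    if action.important:
        return "Q2"  # Schedule
    if action.urgent:
        return "Q3"  # Delegate
    return "Q4"  # Delete


if __name__ == "__main__":
    for a in (
        Action("Sign permission slip due today", urgent=True, important=True),
        Action("Plan summer trip", urgent=False, important=True),
    ):
        print(quadrant(a), a.title)
```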

ark knows how to dedupe these items as they come into the archive so that if I get 3 reminder emails about the same thing, the thing only shows up once in the digest (even though all 3 emails are stored in the archive).
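
One plausible way to get this behavior is to group incoming messages under a normalized key, so the digest shows one entry while every source document stays on record. The sketch below assumes that approach; the normalize rule (lowercase, strip "Re:", "Fwd:", and "Reminder:" prefixes, collapse whitespace) and the DigestItem and build_digest names are hypothetical, not how ark actually does it.

```python
import re
from dataclasses import dataclass, field


@dataclass
class DigestItem:
    key: str
    title: str
    source_ids: list[int] = field(default_factory=list)


def normalize(subject: str) -> str:
    """Reduce a subject line to a comparison key (assumed rule, for illustration)."""
    s = subject.lower()
    # Strip any run of leading reply/forward/reminder prefixes.
    s = re.sub(r"^((re|fwd?|reminder)\s*:\s*)+", "", s)
    return re.sub(r"\s+", " ", s).strip()


def build_digest(emails: list[tuple[int, str]]) -> list[DigestItem]:
    """Collapse emails about the same thing into one digest item each."""
    items: dict[str, DigestItem] = {}
    for doc_id, subject in emails:
        key = normalize(subject)
        item = items.setdefault(key, DigestItem(key=key, title=subject))
        item.source_ids.append(doc_id)  # every email is still recorded
    return list(items.values())
```

With three reminder emails about the same appointment, build_digest returns a single item whose source_ids lists all three documents.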

As I work through the day, ark prepare digest is a nice follow-up to looking at the day (or the previous day): together they give me the context of what happened and what needs to happen.

Scene 3 – Looking Back: Finding What's In the Archive

Let's face it, a big reason for having an archive is so that you can search it. ark has sophisticated search capabilities that allow for a rich search experience. First, ark can do full-text searches of anything in the archive, whether text, PDFs, Office documents, email, text messages, annotations, even photos. That's because when something enters the archive, it goes through a process that “enriches” what comes in, making everything searchable.

Let's start with a regular full-text search. Here, I am going to search for items in the archive with the words “tax 2025”. The other parameters in the command tell it to limit the results to 10 items, to leave out snippets of the items, and to show how long it took the search to run:

Not a bad mix of items, but probably not quite what I was looking for, either. The search ran in about a quarter of a second, but the results were meh. I was really looking for tax-related documents, not documents with the words “tax” and “2025” in them.

This is where a semantic search comes in. Semantic searches allow searching for meaning.

In this case, I searched for “documents I'll need at the end of the year”, added a filter from the beginning of the year, and added the --semantic parameter to tell ark to perform a semantic search. The results are much closer to what I was looking for: compared with the simple full-text search, they are tax documents or tax-adjacent documents. The trade-off is (a) more time during ingestion to give each item embeddings² that can be used by semantic search, and (b) a slightly longer time to execute the query (because it uses AI): just over 1 second compared to 0.250 seconds. But even against an archive with 700,000 items, the change in speed is negligible, and the quality of the search results far outweighs it.
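
For readers new to embeddings, the core idea can be sketched in a few lines: each item is stored as a vector, the query is turned into a vector too, and results are ranked by how closely the vectors point in the same direction (cosine similarity). This is a toy illustration under stated assumptions, not ark's implementation: the three-number vectors and document names are made up by hand, whereas real embeddings come from a model and have hundreds of dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_search(query_vec: list[float], index: dict[str, list[float]], limit: int = 10) -> list[str]:
    """Return up to `limit` document ids, most similar to the query first."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:limit]]


if __name__ == "__main__":
    # Hand-made toy vectors; the first axis loosely stands for "tax-ness".
    index = {
        "w2-form": [0.9, 0.1, 0.0],
        "soup-recipe": [0.0, 0.2, 0.9],
        "mortgage-interest-statement": [0.8, 0.3, 0.1],
    }
    print(semantic_search([1.0, 0.2, 0.0], index, limit=2))
```

Note that neither matching document needs to contain the query's words; closeness in the vector space is what ranks them.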

Scene 4 – A Conversation with My Archive via Claude Desktop (MCP)

With more than 7,300 posts here written over 20+ years, I can't always remember if I've written about something before. For instance, as I write this, I am about to head up to Boston for a conference. In reviewing the profiles of the various attendees, I couldn't help being bemused by the wild garden of jargon on display. Immediately I thought: great topic for the blog. But I've written a lot over the years and it was entirely possible I'd written about this before. Since all of my blog posts are captured in ark, I decided to check there first.

One of the great things about the design of the archive is that many of the commands in the system are exposed as tools to LLMs via an MCP server. That means I don't always have to sit at the command line. I can go into Claude Desktop, for instance, and ask it directly: “I want to write a blog post about how business jargon annoys me. What have I written about jargon in the past?”

A conversation with Claude

If you can't read the screenshot, here is what Claude said in answer to my question:

Searched memory

The conversation search didn't turn up anything specifically about jargon. Let me try the ark search directly, since that's where your blog posts live.

Used ark integration, loaded tools

There's a very promising hit: “Meaningless Statements, Vague Gibberish, and Useless Qualifications” (2021) and “What not to do in a resume” (2006). Also that “AI jargon” email from 2025. Let me pull the relevant ones.

Used ark integration

You've got two solid posts in the archive that directly touch on jargon. Here's what I found:

“Meaningless Statements, Vague Gibberish, and Useless Qualifications” (March 10, 2021) — Your most directly relevant piece. It was prompted by a spam email pitching IT services that was so thick with buzzwords it meant nothing. You used the Foundation passage where Holk's symbolic logic analysis of a politician's speech finds, after eliminating all the meaningless statements and vague gibberish, that nothing is left. You particularly skewered “talk synergies,” looked it up in Merriam-Webster, and imagined an email parser that auto-replies with Strunk & White's Rule 17: Omit needless words. The post got three comments.

“What not to do in a resume” (November 30, 2006) — This is a broader piece about resume writing, but item #4 is explicitly about jargon: “Don't overuse jargon.” You wrote that you grow annoyed when bullet points read like a foreign language, and gave the example “Implemented best-of-breed, open-standards ITIL-based WEP-encryption application” as something that sounds ridiculous even if true.

So you've touched on this twice — once obliquely (resumes) and once head-on (the spam email). The 2021 post has the richest material, with the Asimov hook being particularly strong. For a new blog post, you might want to build on that Foundation framing more broadly, or shift the lens from spam email specifically to workplace communication generally. The 2021 post was reactive; a new one could be more systematic — here's what jargon actually costs us, here are the worst offenders, here's what plain language looks like instead.

So right away, I know that I've already written about jargon on the blog at least twice before, the last time in 2021 — so maybe it is time for something new.

I wanted to point out the line where Claude said “Used ark integration, loaded tools.” When Claude couldn't find anything in our conversations about jargon, it immediately knew to use the ark tool provided via the MCP integration to check the archive, and it used the correct search terms to find previous blog posts about jargon. All I had to do was ask my question.

This is easier to use than the search commands on the CLI, but the two also serve different purposes. The CLI tools can be used in conjunction with other tools and commands through pipes, redirects, and scripts. Using Claude lets me interact with the archive as part of other work I am doing. It also allows me to follow up on things. For instance, I might wonder how my writing style changed in the 15 years that separate the two posts.

It is cut off in the screenshot, but here is the short version at the end of Claude's response:

Short version: 2006 Jamie is a practical manager sharing hard-won tips. 2021 Jamie is a writer using an annoyance as a lens to say something about language, meaning, and the value of clarity — and reaching for Asimov to make the point stick.

Keep in mind, I could do this with any of the 700,000 documents in the archive. That said, there are some limits with the MCP, all intentionally imposed.

All documents in the archive get a sensitivity level. Any documents with a “Restricted” sensitivity are never returned to the MCP. These documents never leave the local machine. If I want LLM summaries of these documents, I can get them from the CLI, which will fall back to a local LLM that runs on my laptop.
Documents marked “Sensitive” are swept for sensitive information and that information is redacted before returning it to the MCP. For instance, if an SSN, account number, phone number, etc., appears in a document, those all get replaced with “[REDACTED]” before the document is returned to the MCP.
All documents are swept for a known list of sensitive information (SSNs, etc.) and are redacted before leaving the local machine, regardless of sensitivity.
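
The three rules above can be sketched as a single gate that every document passes through before it is handed to the MCP. This is a minimal illustration under assumptions, not ark's code: the regular expressions are simplified, the split between a base pattern list and extra patterns for “Sensitive” documents is a guess at how the rules combine, and "Normal" is a made-up name for any other sensitivity level.

```python
import re
from typing import Optional

# Simplified, hypothetical patterns; a real sweep would cover far more cases.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

BASE_PATTERNS = [SSN]      # swept from every document
SENSITIVE_EXTRA = [PHONE]  # additionally swept from "Sensitive" documents


def prepare_for_mcp(text: str, sensitivity: str) -> Optional[tuple[str, int]]:
    """Return (redacted_text, redaction_count), or None if the document must stay local."""
    if sensitivity == "Restricted":
        return None  # never leaves the local machine
    patterns = list(BASE_PATTERNS)
    if sensitivity == "Sensitive":
        patterns += SENSITIVE_EXTRA
    total = 0
    for pattern in patterns:
        text, count = pattern.subn("[REDACTED]", text)
        total += count
    return text, total


if __name__ == "__main__":
    print(prepare_for_mcp("Call 555-123-4567 about SSN 123-45-6789", "Sensitive"))
```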

Still, I can converse with about 90% of the documents that are in the archive, and that can be fun. For instance, I asked Claude what music I was listening to on past May 31sts³, and it came back with this:

The best part about the MCP integration, for me, is that I use Claude as a kind of personal assistant and with access to my archive, Claude is much more effective in doing things for me because it has a rich context to draw from.

Scene 5 – Adding My Voice: The Annotation System

It is one thing to have artifacts in an archive. But real value is added to the archive when you can layer in context to those artifacts. This is where the annotation system comes in. When items are pulled into the archive, they become read-only and cannot be changed⁴. The annotation system provides a layer atop the item itself to provide context to that item. When showing that item in ark, any annotations that are associated with the item are displayed as part of the output. Annotations are available in searches, and there are multiple ways of making annotations.

Directly through the CLI via the ark annotate command. This command uses the default editor set in your shell⁵.
Redirected into the ark annotate command. For instance, if I made a bunch of notes in a file called notes.txt, I could do this at the command line: cat notes.txt | ark annotate document 1234 to add the annotation to document 1234.
Via the MCP. Yes, I can ask Claude to annotate for me.

I have a book club meeting this evening. Our book was Wild Dark Shore by Charlotte McConaghy. ark captures my books and reading so I can search for the book in ark as follows:

That [1] on the line with the matching record is a “sigil”. It is a shortcut reference to the item that I can use in future commands (until another search is run) instead of having to provide the full id (725665 in this case) each time I want to refer to it. When a book is added, ark searches sources on the Internet to pull in metadata about the book. When I look at the record via ark show here is what I see:
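
The sigil idea is simple enough to sketch: keep the ordered ids from the most recent search, and translate a short reference back to the full id on demand. The code below is a hypothetical illustration, not ark's implementation; the SigilTable class is invented, the @N form follows the ark show @3 example earlier in this post, and the second id in the usage example is made up.

```python
class SigilTable:
    """Maps short @N references to full document ids from the latest search."""

    def __init__(self) -> None:
        self._ids: list[int] = []

    def assign(self, doc_ids: list[int]) -> dict[int, int]:
        """Replace the table with a new result list; sigils start at 1."""
        self._ids = list(doc_ids)
        return {n: doc_id for n, doc_id in enumerate(self._ids, start=1)}

    def resolve(self, ref: str) -> int:
        """Turn '@N' into a document id; a plain number is treated as a full id."""
        if ref.startswith("@"):
            n = int(ref[1:])
            if not 1 <= n <= len(self._ids):
                raise KeyError(f"no such sigil: {ref}")
            return self._ids[n - 1]
        return int(ref)


if __name__ == "__main__":
    table = SigilTable()
    table.assign([725665, 101])   # results of one search (101 is made up)
    print(table.resolve("@1"))    # the full id behind sigil 1
    table.assign([42])            # a new search replaces the old sigils
    print(table.resolve("@1"))
```

Because the table is rebuilt on every search, a sigil is only meaningful until the next search runs, which matches the behavior described above.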

After all of that metadata for the book record, note the section called --- annotations ---. This is where I have gone and added annotations to this particular book record. I've added two, as you can see, but I can add as many as I want. These are not part of the book record, but rather are linked to the book record. The ark show command is pulling all of this together when it shows the book to me.

Adding the annotation is as simple as running the command. Or, asking Claude to do it for you.

Here, not only was Claude easily able to find what I was looking for, but once found, it summarized it as requested, and then added the annotation that I asked for. For me, this turned annotation from a chore into something easy. I can sit with my book and as I read, tell Claude to add various annotations to the book for the thoughts or passages I want to comment on.

Scene 6 – `ark quality`: The Archive That Knows Where it is Weak

Archivists work to ensure the quality of the archives they manage. That not only means preserving the archive but enriching it with as much information as they can about the contents of the archive. Each item in ark has a certain set of metadata. It has the date of the item, the date it came into the archive, the source it came from, its location in the real world (if it is not digital in the first place). Each item has a title, a series and sub-series, a format, a sensitivity level. Items can have OCR data, embeddings, generated descriptions, and more. With all of that possible data surrounding an item in the archive, some way is needed to assess the quality of items in the archive.

Enter the ark quality command.

This command provides a way of looking at the archive as a whole and seeing the overall quality of its items. It also allows for looking at a specific item and seeing the quality of that item. Quality is a score from 0 to 100 that measures how complete the picture of the item (or archive) is. An item with no metadata at all rates a 0. An item with every possible element filled in rates 100. Here is a quality summary of the entire archive.
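
A completeness score like this can be pictured as the share of metadata fields that are filled in. The sketch below assumes every field counts equally, which may well not be how ark weighs them; the field names are hypothetical stand-ins for the metadata listed above, and quality_score is an invented name.

```python
# Hypothetical field names based on the metadata described in this post.
FIELDS = (
    "item_date", "ingested_at", "source", "location",
    "title", "series", "sub_series", "format",
    "sensitivity", "ocr_text", "embedding", "description",
)


def quality_score(item: dict) -> int:
    """Score 0-100: percentage of known fields that have a non-empty value."""
    filled = sum(1 for name in FIELDS if item.get(name) not in (None, "", [], {}))
    return round(100 * filled / len(FIELDS))


def archive_score(items: list[dict]) -> float:
    """Average item score across the archive (0.0 for an empty archive)."""
    if not items:
        return 0.0
    return sum(quality_score(i) for i in items) / len(items)


if __name__ == "__main__":
    attachment = {"title": "IMG_0042.jpg", "description": "A receipt", "embedding": [0.1, 0.2]}
    print(quality_score(attachment))
```

An item with nothing filled in scores 0 and one with every field filled scores 100, matching the two endpoints described above.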

Here you can see that about 60% of the items in the archive fall into the top quadrant of quality. The highest quality class of items are diary entries. The lowest quality are attachments. This makes sense because attachments are pulled in from email messages and outside of getting descriptions from vision models and embeddings on those descriptions, they have almost no other metadata.

Anything that comes into the archive has a title. Often it is simply the name of the file that was ingested, and we all know that filenames can be ambiguous or provide little useful information about what the file contains. ark can assess the quality of titles as well, for instance:

Note that in addition to listing items in the archive with poor quality titles, the command also tells you how you can raise the quality of those titles using ark enrich titles. Automated enrichment is a key element of ark. It uses LLMs to read the document and give a descriptive title to the document based on the context. Each document class has its own template of title format. So for instance a letter might have the format “Sender-to-recipient (date)” and the LLM will produce titles in that format so that they are consistent across that class of items, raising the quality and consistency of the archive.

ark also uses a controlled vocabulary for its metadata like series and sub-series. When something is ingested it may get a series that falls outside that controlled vocabulary. ark quality vocab can help to identify such items:
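
Checking items against a controlled vocabulary amounts to a set-membership test, optionally with a suggestion for the nearest allowed term. Here is a minimal sketch of that idea; the series names in CONTROLLED_SERIES are invented for illustration, the function names are hypothetical, and the fuzzy suggestion via Python's difflib is an assumption rather than anything ark is known to do.

```python
import difflib
from typing import Optional

# Hypothetical vocabulary; ark's real series names are not listed in this post.
CONTROLLED_SERIES = {"Correspondence", "Financial", "Reading"}


def out_of_vocabulary(items: list[tuple[int, str]]) -> list[tuple[int, str]]:
    """Return (doc_id, series) pairs whose series is not in the vocabulary."""
    return [(doc_id, series) for doc_id, series in items if series not in CONTROLLED_SERIES]


def suggest(series: str) -> Optional[str]:
    """Suggest the closest allowed series name, or None if nothing is close."""
    matches = difflib.get_close_matches(series, sorted(CONTROLLED_SERIES), n=1, cutoff=0.6)
    return matches[0] if matches else None


if __name__ == "__main__":
    ingested = [(1, "Financial"), (2, "Finance"), (3, "Misc")]
    for doc_id, series in out_of_vocabulary(ingested):
        print(doc_id, series, "->", suggest(series))
```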

Once such items are identified, ark provides commands for fixing them, either one item at a time or across the entire series.

All of this helps to raise the quality of the archive, making searching more accurate.

Scene 7 – Closing the Day

We'll conclude the tour with one way of looking at the work I got done yesterday. This is a good example because it shows how ark was really designed with core Unix principles, each command doing something the output of which could be used as input for other commands.

Let's begin by looking at what the archive ingested yesterday. The easy way to do this is as follows (I'm limiting this to just 10 entries for space):

But if we wanted to know the total number of things ingested from yesterday we can do this:

Next, we can use the ark bundle command to generate a “bundle” of one or more documents from ark. This one bundle would contain the contents of all of the documents we put into it. So for instance, if we wanted to generate a bundle with the full contents of everything that came into ark yesterday we could do this:

which would produce output containing the contents of all 81 items ingested yesterday — a total of 799 lines so I won't show them all here.

Finally, we can redirect that bundle to an ark task. A task is a prompt for an LLM to do something with the data. We'll use my “shipped-yesterday” task, which summarizes everything I “shipped” (code commits, posts written, emails sent, etc.). Putting it all together, it looks like this:

The first thing to note is the warning that ark gives on that second line after the command:

ark: warning: Redacted 1 sensitive pattern(s) before sending

This is an example of what I was talking about earlier. Sensitive information never leaves the local machine; it is either blocked, or, as in this case, redacted before sending it to an LLM.

The rest summarizes the work I did yesterday, based on how I specified the way I wanted it summarized in the “shipped-yesterday” task.

Seven scenes, but really seven angles on one habit. Most days I use four or five of these without thinking about it: the archive surfaces the day, I notice what wants attention, I look for something I half-remember, I add a note to something I just read, I check what shipped. None of those scenes are unique to ark taken in isolation — search engines search, task managers triage, note apps annotate. What's unusual is that they sit on top of the same archive, share the same discipline, and the parts compound. That's the payoff Part 1 promised the archivist's lens would deliver.

The next post takes the floor out from under all of this and shows what holds it up: ark's data model. A single SQLite database, a clean graph connecting documents to people to one another, and a schema disciplined enough that it should still answer the same questions thirty years from now. Plumbing always sounds duller than the demo, but it's where this stops being a pile of clever commands and starts being an archive built to last.
