Can an algorithm help me sort important, actionable messages from everything else? The startup pitch might be: “TikTok, but for keeping focus on my life's top priorities.”
This is the premise for my personal information firehose project. In this phase of research, I tackled the question of whether it is possible to train a model to recognize my personal preferences on email triage.
Although I thought the answer was likely yes, I know this is a different problem than (for example) spam filtering. As the Gmail priority inbox creators wrote in 2010:
While ideas were borrowed from the application of ML in Gmail spam detection, importance ranking is harder as users disagree on what is important, requiring a high degree of personalization.
My solution: an embedding-based classifier
My preferred solution so far is to compute an embedding for each email and classify it with logistic regression.
The workflow is like this:
- Connect to my Fastmail account and download a few hundred recent emails
- I hand-label each of these with my desired label (priority, fyi, ignore) — aka the golden set
- Compute an embedding (vector of a few thousand numbers capturing the "meaning" of the email)
- Compute some features like whether I am in
"to" vs. "cc", and whether I have sent an email to the sender in the past (see the sketch after this list)
- Train a model: a logistic regression classifier trained on 75% of the golden set
- Evals: use the other 25% of the golden set to score the model's labeling results
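To make the middle steps concrete, here's a rough TypeScript sketch of turning an email into the vector the classifier sees: the embedding from OpenAI's embeddings API plus a few hand-built features. The type and helper names here are illustrative, not the app's actual code.

```typescript
// Sketch only: EmailMessage, knownSenders, etc. are illustrative names.
interface EmailMessage {
  from: string;
  to: string[];
  cc: string[];
  subject: string;
  body: string;
}

// Embed the email text with OpenAI's embeddings API.
async function embedEmail(email: EmailMessage, apiKey: string): Promise<number[]> {
  const res = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "text-embedding-3-large",
      input: `${email.subject}\n\n${email.body}`.slice(0, 8000), // truncate very long emails
    }),
  });
  const json = await res.json();
  return json.data[0].embedding; // vector of a few thousand numbers
}

// Hand-built features appended to the embedding.
function handFeatures(email: EmailMessage, myAddress: string, sentTo: Set<string>): number[] {
  return [
    email.to.includes(myAddress) ? 1 : 0, // I'm in "to"...
    email.cc.includes(myAddress) ? 1 : 0, // ...vs. only cc'd
    sentTo.has(email.from) ? 1 : 0,       // I've emailed this sender before
  ];
}

// The combined feature vector that the logistic regression classifier trains on
// (75% of the golden set for training, 25% held out for evals).
async function featureVector(
  email: EmailMessage,
  myAddress: string,
  sentTo: Set<string>,
  apiKey: string
): Promise<number[]> {
  const embedding = await embedEmail(email, apiKey);
  return [...embedding, ...handFeatures(email, myAddress, sentTo)];
}
```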
This is an entirely client-side, local-only TypeScript/React app running in the browser. The exceptions are fetching mail from Fastmail and computing the embeddings via a call to OpenAI. I used IndexedDB + Dexie for storage.
There's no user-facing UI (e.g. an inbox) yet, since this phase of the project is focused entirely on verifying that I can train a suitable model.
How it performed
Evals are a common way to test results of a machine learning algorithm. They provide scores that show how well or poorly the model is performing, and create an iteration loop for improving it.
I wrote my own simple eval runner that runs directly in the app. For each example from the eval set, it produces probabilities for all three labels. I hand-chose some thresholds (probability >= 0.5, and at least 0.2 higher than the next closest label) as a proxy for confidence.
Then:
- If a confident prediction matches my manual label from the golden set, score a pass ✅
- If a confident prediction doesn't match, score a fail ❌
- If the model doesn't have enough confidence in its prediction, then it's a skip ⏩️
At the end, it calculates a final score for accuracy (correct predictions, excluding skips) and coverage (the percentage it predicted without skipping). Plus some other stats useful for debugging, like a confusion matrix (visualization of how errors distribute).
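Here's roughly what that scoring logic looks like. This is a simplified sketch of the rules above, not the actual eval runner; the types are made up for illustration.

```typescript
type Label = "priority" | "fyi" | "ignore";

interface EvalExample {
  probs: Record<Label, number>; // model output, e.g. { priority: 0.17, fyi: 0.57, ignore: 0.26 }
  expected: Label;              // my hand label from the golden set
}

type Outcome = "pass" | "fail" | "skip";

function scoreExample({ probs, expected }: EvalExample): Outcome {
  const ranked = (Object.entries(probs) as [Label, number][]).sort((a, b) => b[1] - a[1]);
  const [topLabel, topProb] = ranked[0];
  const runnerUpProb = ranked[1][1];

  // Confidence thresholds: top probability >= 0.5 and at least 0.2 above the runner-up.
  const confident = topProb >= 0.5 && topProb - runnerUpProb >= 0.2;
  if (!confident) return "skip";
  return topLabel === expected ? "pass" : "fail";
}

function summarize(examples: EvalExample[]) {
  const outcomes = examples.map(scoreExample);
  const passes = outcomes.filter((o) => o === "pass").length;
  const fails = outcomes.filter((o) => o === "fail").length;
  const skips = outcomes.filter((o) => o === "skip").length;
  return {
    accuracy: passes / (passes + fails),          // correct predictions, excluding skips
    coverage: (passes + fails) / examples.length, // fraction predicted without skipping
    skips,
  };
}
```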
Individual predictions
Here's an example of a successful prediction. It's a security notice, so should go to the FYI label.
27/90 A new device is using your a.. p=0.17,f=0.57,i=0.26 expected f ✅
The model gives 0.57 probability of the correct label, with the next closest (ignore) at 0.26. That's above the thresholds I chose, so it makes the prediction, and passes.
Here's a failure, a payment receipt for a passport renewal:
52/90 Pay.gov Payment Confirmation.. p=0.18,f=0.59,i=0.23 expected p ❌
I consider this a priority item, because I need to take action (print it out and include it in the passport renewal documentation package). But it looks like a receipt, and receipts usually go to FYI, so the model predicts that label.
This is a fair prediction, actually! It's a good example of how user preferences are much more subtle than broad heuristics like "all receipts go to FYI."
Here's a low-confidence case, a customer satisfaction survey from a photography studio I visited recently:
69/90 Your opinion matters p=0.19,f=0.35,i=0.45 expected i ⏭️
I labeled this one "ignore" because I rarely pay attention to followup surveys. But again, this is a genuinely ambiguous case: I've emailed with this sender before and interacted with their automated emails in the past. The model's top prediction is right (ignore at 0.45, higher than the others), but it's below the 0.5 threshold and too close to fyi (only a 0.10 gap) to count as confident.
Overall scores
ACCURACY:
All predictions: 74/90 correct = 82%
High confidence: 64/69 correct = 93%
Coverage: 77% (69/90 meet threshold)
The model is right 93% of the time when it's confident. That's very good! I'd consider 95-98% accuracy an ideal to strive for.
Coverage is weaker at 77%, which would leave two or three out of every ten emails needing to go into a "to review" bucket for the user. This could stand improvement, but it feels within striking distance of good coverage.
We can also look at the confusion matrix, which tells us how the model was wrong for each label:
| actual \ predicted | priority | fyi | ignore | (skip) |
|---|---|---|---|---|
| priority | 5 | 1 | 0 | 5 |
| fyi | 0 | 28 | 2 | 7 |
| ignore | 0 | 2 | 31 | 9 |
For priority, the model almost never predicts the wrong label; when it's not sure, it skips. So this is a coverage problem, but that's the failure mode I prefer: better to skip than to miscategorize a priority item.
For good measure I also include some per-class metrics. Here we can see that priority is the label that needs the most attention, perhaps with more samples, or different heuristics on what confidence is considered acceptable.
| | Precision | Recall | F1 | Skip |
|---|---|---|---|---|
| priority | 1.00 | 0.45 | 0.63 | 45% |
| fyi | 0.90 | 0.76 | 0.82 | 19% |
| ignore | 0.94 | 0.74 | 0.83 | 21% |
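For the curious, these per-class numbers fall out of the confusion matrix roughly like this (a sketch; note that skips count against recall but not precision):

```typescript
type Label = "priority" | "fyi" | "ignore";

// One row of the confusion matrix: confident predictions plus skips,
// for all emails whose actual label is this class.
interface ClassRow {
  label: Label;
  predicted: Record<Label, number>;
  skipped: number;
}

function perClassMetrics(rows: ClassRow[]) {
  return rows.map((row) => {
    const actualTotal =
      Object.values(row.predicted).reduce((a, b) => a + b, 0) + row.skipped;
    const truePositives = row.predicted[row.label];
    // All confident predictions of this label, across every actual class.
    const predictedTotal = rows.reduce((sum, r) => sum + r.predicted[row.label], 0);

    const precision = truePositives / predictedTotal;
    const recall = truePositives / actualTotal; // skipped items still count as missed
    const f1 = (2 * precision * recall) / (precision + recall);
    const skipRate = row.skipped / actualTotal;
    return { label: row.label, precision, recall, f1, skipRate };
  });
}
```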
So overall I feel good about using this embedding-based classifier approach. But I also explored some other variations to compare.
Fine-tuning an LLM
As an alternate approach, "just send it to an LLM" is a good default in today's world.
I used a prompt like, "You are my assistant and triage emails into three categories according to these criteria..." with GPT-5.2. This did great on obvious categorizations that were in the prompt (e.g. "newsletters go to FYI") but poorly on any borderline or ambiguous case. It scored 68% on all predictions, compared to 82% with the embedding-based classifier approach.
Clearly the model needs more examples, but putting all 300 into the prompt doesn't seem like a good approach. The answer is fine-tuning!
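For context, OpenAI's chat-model fine-tuning takes a JSONL file with one training example per line, in the same messages format as the chat API. A minimal sketch of the export (the system prompt and types here are illustrative, not my exact export code):

```typescript
// One chat-formatted training example per line of the JSONL file.
interface LabeledEmail {
  subject: string;
  body: string;
  label: "priority" | "fyi" | "ignore";
}

function toJsonl(samples: LabeledEmail[]): string {
  return samples
    .map((s) =>
      JSON.stringify({
        messages: [
          { role: "system", content: "Triage this email as priority, fyi, or ignore." },
          { role: "user", content: `${s.subject}\n\n${s.body}` },
          { role: "assistant", content: s.label },
        ],
      })
    )
    .join("\n");
}
```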
I took the same set of ~300 samples used to train the classifier and exported them to a JSONL file suitable for upload to OpenAI's web interface. From this I fine-tuned a gpt-4.1-mini model, which performs like this:
ACCURACY:
All predictions: 74/90 correct = 82%
High confidence: 74/90 correct = 82%
Coverage: 100% (90/90 meet threshold)
Confusion matrix:

| actual \ predicted | priority | fyi | ignore | (skip) |
|---|---|---|---|---|
| priority | 6 | 5 | 0 | 0 |
| fyi | 1 | 34 | 2 | 0 |
| ignore | 0 | 8 | 34 | 0 |
Per-class metrics:

| | Precision | Recall | F1 | Skip |
|---|---|---|---|---|
| priority | 0.86 | 0.55 | 0.67 | 0% |
| fyi | 0.72 | 0.92 | 0.81 | 0% |
| ignore | 0.94 | 0.81 | 0.87 | 0% |
This obviously could be improved a lot, for example teaching it to skip things it's unsure about. But 82% accuracy is not shabby!
That said, overall I don't like the LLM path as much as the embedding-based classifier. It feels like overkill to use an LLM for this simple labeling task. Fine-tuning jobs are much slower (~10 minutes) than computing embeddings (~1 second per email) and training the classifier (nearly instant).
I also think that using an LLM would encourage me as the developer to spend effort on prompt tuning, but the whole idea here is to have the model extrapolate the user's preferences from examples. Still, it's good to know there's another viable path for this email triage problem.
Local embeddings
It feels very natural to want to make this whole project local-first. Email is the "keys to your kingdom" (password resets, etc.), so sending message content off to a third party for embeddings doesn't feel great. What's more, the app is already 100% local except for the embeddings, so why not try bringing that part on-device?
I tried computing embeddings both in-browser (Nomic) and with a self-hosted Python server (PyTorch), using models of various sizes. Here's a comparison table, sorted from best to worst (in my judgement):
| Host | Model | Dimensions | Accuracy | Coverage |
|---|---|---|---|---|
| OpenAI | text-embedding-3-large | 3072 | 93% | 77% |
| OpenAI | text-embedding-3-small | 1536 | 93% | 74% |
| self-hosted Python | bge-m3 | 1024 | 97% | 64% |
| self-hosted Python | mxbai-embed-large-v1 | 1024 | 88% | 63% |
| in browser | nomic-embed-text-v1.5 | 768 | 87% | 59% |
So OpenAI is still the best, but only by a smidge. These all basically work! The in-browser option feels underpowered for this task, but bge-m3 looks like it could do the job with some tuning.
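As a reference point for the in-browser path, here's a minimal sketch of computing a Nomic embedding with Transformers.js, assuming the Xenova ONNX build of nomic-embed-text-v1.5 (illustrative, not exactly what the app does):

```typescript
import { pipeline } from "@xenova/transformers";

// Load the ONNX build of the Nomic model once; weights are downloaded and cached by the browser.
const extractor = await pipeline("feature-extraction", "Xenova/nomic-embed-text-v1.5");

async function embedLocally(text: string): Promise<number[]> {
  // Nomic embedding models expect a task prefix (e.g. "search_document: " or "classification: ").
  const output = await extractor(`search_document: ${text}`, {
    pooling: "mean",
    normalize: true,
  });
  return Array.from(output.data as Float32Array); // 768 dimensions
}
```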
Findings
The research question for this phase was: "Can a model sort according to highly user-specific preferences, specified only by a set of examples?" And the answer, happily, is yes.
Is 93% accuracy and 77% coverage "good enough" for real world use? I'm not sure, but there are many levers I can use to improve scores. For example there are no features right now related to threads, and the confidence thresholds were numbers I picked arbitrarily and didn't iterate on. I feel it's probable I could fiddle with knobs for a while and eventually hillclimb to 90%+ coverage and 95%+ accuracy.
Other notes:
- On the architecture, I find it really nice to separate “understand the email” (compute embedding, slow and expensive) from “label according to user preferences” (logistic regression classifier, fast and cheap).
- Predictions were poor until I got to over 250 samples in the training set. It was also important to have relatively balanced labels. I had to go dig out a bunch of older priority messages from my archive, since any given recent set tends to be less than 10% priority.
- I really enjoyed using IndexedDB + Dexie + the dev tools database inspector. It was quick to work with, and easy to go in and do things like deleting all the embeddings or deleting a particular model's data (see the sketch after this list).
- Local / open text embedding models work fine on consumer hardware. Having to run a separate Python server isn't ideal, but it shows that something local-first, or at least fully self-hosted, is entirely plausible.
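Since I keep mentioning Dexie: here's a simplified sketch of what a schema like this looks like, including the kind of one-off cleanup I mean (table and field names are illustrative):

```typescript
import Dexie, { Table } from "dexie";

// Simplified sketch of the Dexie schema; the real table and field names may differ.
interface StoredEmail {
  id: string;        // message id from Fastmail
  subject: string;
  receivedAt: number;
  label?: "priority" | "fyi" | "ignore"; // hand label, if part of the golden set
}

interface StoredEmbedding {
  emailId: string;
  model: string;     // e.g. "text-embedding-3-large"
  vector: number[];
}

class TriageDB extends Dexie {
  emails!: Table<StoredEmail, string>;
  embeddings!: Table<StoredEmbedding, string>;

  constructor() {
    super("triage");
    this.version(1).stores({
      emails: "id, receivedAt, label",
      embeddings: "emailId, model",
    });
  }
}

const db = new TriageDB();

// The kind of one-off maintenance that's easy with Dexie + the dev tools inspector:
await db.embeddings.clear();                                  // delete all embeddings
await db.embeddings.where("model").equals("bge-m3").delete(); // or just one model's
```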
What's next
So accurate email triage is possible for a user (me) who has hand-labeled hundreds of emails. But up-front labeling would be an unreasonable thing to ask of any other user.
This initial golden set is also a fixed point in time: my triage needs are likely to be different (because my life is different!) in six or twelve months' time.
So this implies two follow-up research questions:
- How can we bootstrap getting the data needed to start making useful predictions? Perhaps we mine the user's mailbox history for signals like replies and read/unread status. Or maybe this is where a generic LLM prompt might be able to do triage with 60% coverage, and send all the rest to a "to review" category to encourage the user to start building up their own golden set.
- How can we adapt the filtering over time to reflect the user's changing preferences? For example when you're hunting for a job, inbound job leads may be a priority. When you just started a new job, a message from a recruiter is an irrelevant distraction.
#1 feels like more of a product onboarding question than thorny research, so I'll probably put it off for now.
#2 is where I want to go next. Is there a fast, low-friction, as-it-happens method for the user to report their desired label for a message? Perhaps this is even "in anger"—they get a phone notification for a priority message that is actually not a priority, and they can signal in that moment that they shouldn't be seeing this with no more effort than just dismissing the notification.
Ideally I might even go fully to implicit feedback, compared to the explicit feedback of selecting a label. For example, watching a YouTube video for a while will implicitly signal to the algorithm that you find it interesting, whether or not you thumbs-up (= explicit feedback) the video.
But this implies detailed data coming in through the UI. For this project, I'd prefer to build as little UI as possible. Users already have a zillion inboxes to check; I don't need to give them another one. But if the triage relies on, say, applying labels in the user's mailbox that they read in their existing mail client, then we don't control the UI, and we get much less access to implicit feedback like "did the user scroll to the bottom of this message?"