Embeddings are the bridge between human language and product intelligence. They turn text into dense vectors that can be compared, clustered, and searched, which is why they power semantic search, recommendations, deduplication, and memory features. But a list of numbers rarely builds intuition.
This tutorial makes the pipeline tangible. You will watch tokenization create IDs, see a masked model score candidate tokens, generate sentence embeddings with different pooling strategies, and finish by ranking results with cosine similarity, all running locally in the browser with Transformers.js.
The lesson is simple: tokens feed the model, logits expose the model's beliefs about the missing token, embeddings compress meaning, and vector search turns that meaning into ranked results. We will make each step visible.
In this walkthrough you will:
- Watch how token IDs and special tokens are created.
- See masked token predictions and how logits become probabilities.
- Generate embeddings and inspect their structure.
- Store vectors in localStorage and rank results with cosine similarity.
Everything runs locally in your browser. The first run will download the models you choose and cache them for reuse.
Running embeddings on edge devices changes the product equation:
- Lower latency: users get instant results without round‑tripping to a server.
- Privacy by default: sensitive text never leaves the device.
- Cost control: similarity search happens client‑side, reducing inference spend.
- Resilience: features keep working in flaky networks or offline.
Primer: tokens, logits, embeddings
Tokenization converts text into IDs. Models like BERT prepend special tokens such as [CLS] (classification) and append [SEP] (separator) so the model can identify boundaries. Those IDs are the true input to the transformer.
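As a quick standalone example (outside the lab below), tokenizing a sentence with Transformers.js, assuming the @xenova/transformers package, looks roughly like this:
import { AutoTokenizer } from "@xenova/transformers";
const tokenizer = await AutoTokenizer.from_pretrained("Xenova/bert-base-uncased");
const { input_ids } = await tokenizer("Embeddings power semantic search.");
// The IDs include the special [CLS] and [SEP] tokens at the start and end.
console.log(Array.from(input_ids.data, Number));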
Logits are the raw scores a model produces before softmax. In a fill‑mask task, the model assigns a score to every vocabulary token for the masked position. Softmax turns those scores into probabilities so you can compare candidates.
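As a quick illustration, here is a minimal softmax over three made-up logits; the real model does the same thing across its whole vocabulary for the masked position.
// Hypothetical logits for three candidate tokens at the masked position.
const logits = { fox: 7.1, dog: 5.4, cat: 4.9 };
// Exponentiate each logit, then divide by the sum so the scores become probabilities.
const exps = Object.entries(logits).map(([token, score]) => [token, Math.exp(score)]);
const total = exps.reduce((sum, [, value]) => sum + value, 0);
const probabilities = Object.fromEntries(exps.map(([token, value]) => [token, value / total]));
console.log(probabilities); // roughly { fox: 0.77, dog: 0.14, cat: 0.09 }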
Embeddings are the vector representations that come out of the model. Sentence embeddings are produced by pooling token embeddings (mean pooling) or by using a designated token (CLS pooling). Normalizing vectors makes cosine similarity a simple dot product, which is perfect for nearest‑neighbor search.
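A tiny sketch with made-up token vectors shows both ideas: mean pooling averages token vectors into one sentence vector, and normalizing to unit length turns cosine similarity into a plain dot product.
// Hypothetical token embeddings: three tokens, four dimensions each.
const tokenVectors = [
  [0.2, -0.1, 0.4, 0.3],
  [0.0, 0.5, -0.2, 0.1],
  [0.3, 0.1, 0.2, -0.4],
];
// Mean pooling: average each dimension across tokens.
const pooled = tokenVectors[0].map((_, dim) =>
  tokenVectors.reduce((sum, vec) => sum + vec[dim], 0) / tokenVectors.length
);
// Normalize to unit length so cosine similarity reduces to a dot product.
const norm = Math.sqrt(pooled.reduce((sum, value) => sum + value * value, 0));
const sentenceVector = pooled.map(value => value / norm);
const dot = (a, b) => a.reduce((sum, value, i) => sum + value * b[i], 0);
console.log(dot(sentenceVector, sentenceVector)); // 1: a vector is maximally similar to itself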
Interactive lab
Each step is executable. Change a setting, watch the code update, and re-run to see the impact immediately.
Desktop required
Open this post on a desktop browser to run the interactive embeddings and vector search lab.
Model downloads
The first run will download model weights to your browser cache. Subsequent runs reuse the cached files.
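If you want to see or control that caching, Transformers.js exposes an env object; a minimal sketch, assuming the @xenova/transformers package, might look like this:
import { env, pipeline } from "@xenova/transformers";
// Cache downloaded weights with the browser Cache API so later runs skip the network.
env.useBrowserCache = true;
// Always resolve models from the Hugging Face Hub rather than local files.
env.allowLocalModels = false;
// The first call downloads and caches the weights; subsequent calls reuse them.
const embedder = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");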
Transformers.js embeddings lab
Each step runs locally in the browser so you can see how tokens, logits, embeddings, and vector search connect.
Shared
Shared helpers
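The lab's shared helpers are referenced by every step but not listed in this write-up. Below is a rough sketch of how they could be implemented with Transformers.js; the names mirror the steps' code, but the bodies are assumptions, not the lab's exact implementation.
import { AutoTokenizer, pipeline } from "@xenova/transformers";
// Cache loaded tokenizers and pipelines so repeated runs reuse the same instances.
const cache = new Map();
async function getTokenizer(modelId) {
  const key = "tokenizer:" + modelId;
  if (!cache.has(key)) cache.set(key, await AutoTokenizer.from_pretrained(modelId));
  return cache.get(key);
}
async function getPipeline(task, modelId) {
  const key = task + ":" + modelId;
  if (!cache.has(key)) cache.set(key, await pipeline(task, modelId));
  return cache.get(key);
}
async function getEmbedder(modelId) {
  return getPipeline("feature-extraction", modelId);
}
// Flatten a Transformers.js tensor into a plain array of numbers.
function toVector(tensor) {
  return Array.from(tensor.data);
}
async function embedText(modelId, text, { pooling = "mean", normalize = true } = {}) {
  const embedder = await getEmbedder(modelId);
  return toVector(await embedder(text, { pooling, normalize }));
}
// Cosine similarity; with unit-length vectors this reduces to the dot product.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB) || 1);
}
// Tiny localStorage-backed store for the vector search step.
function loadStore(key) {
  try {
    return JSON.parse(localStorage.getItem(key));
  } catch {
    return null;
  }
}
function saveStore(key, value) {
  localStorage.setItem(key, JSON.stringify(value));
}
// Deterministic color for a token chip (purely cosmetic).
function colorFromText(text) {
  let hue = 0;
  for (const char of text) hue = (hue * 31 + char.charCodeAt(0)) % 360;
  return "hsl(" + hue + ", 70%, 45%)";
}
// Map a probability to a bar color whose opacity tracks the score.
function scoreColor(score) {
  return "rgba(99, 179, 237, " + (0.35 + 0.65 * score).toFixed(2) + ")";
}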
Step 1
Tokenize text into model-ready pieces
Transformers do not see words; they see token IDs. A tokenizer breaks text into a sequence of IDs that the model can embed and attend over.
Edit the sentence to watch how subwords appear, and toggle special tokens to see how sequence boundaries are encoded.
Run the step to see output and a rendered result.
const modelId = "Xenova/bert-base-uncased";
const text = "Build in-browser models that never leave the device.";
const showSpecialTokens = true;
const tokenizer = await getTokenizer(modelId);
const encoded = await tokenizer(text);
const ids = Array.from(encoded.input_ids.data ?? []);
const rawTokens = tokenizer.model?.convert_ids_to_tokens
? tokenizer.model.convert_ids_to_tokens(ids)
: tokenizer.convert_ids_to_tokens
? tokenizer.convert_ids_to_tokens(ids)
: ids.map(id => String(id));
const tokens = rawTokens.map((token, index) => ({
id: ids[index],
text: token,
special: token.startsWith("["),
color: token.startsWith("[") ? "rgba(255, 255, 255, 0.12)" : colorFromText(token),
}));
const visible = showSpecialTokens ? tokens : tokens.filter(token => !token.special);
return {
type: "tokens",
title: "Tokenization (" + visible.length + " tokens)",
note:
"Tokenizer: " +
modelId +
". Special tokens like [CLS] and [SEP] mark sequence boundaries before the model ever sees a word.",
tokens: visible,
insights: [
{
title: "Token IDs are the model's vocabulary",
body: "Every token maps to a numeric ID. The embedding table uses that ID to look up the vector the model starts with.",
accent: "Vocabulary",
},
{
title: "Subwords explain unknown words",
body: "Rare words are split into smaller pieces, so the model can recombine meanings without inventing new IDs.",
accent: "Subwords",
},
{
title: "Token count drives cost",
body: "Longer inputs create more tokens, which means more attention work and higher latency.",
accent: "Compute",
},
],
};
Step 2
Predict a masked token (logit intuition)
Masked language models score every vocabulary token for the missing slot. The highest scores become the top guesses after softmax.
Change the prompt and Top‑K slider to see how the distribution shifts with different context.
Run the step to see output and a rendered result.
const modelId = "Xenova/bert-base-uncased";
const prompt = "The quick brown [MASK] jumps over the lazy dog.";
const topK = 5;
const fillMask = await getPipeline("fill-mask", modelId);
const predictions = await fillMask(prompt, { topk: topK });
const candidates = Array.isArray(predictions[0]) ? predictions[0] : predictions;
const items = candidates.slice(0, topK).map(item => ({
token: item.token_str.trim(),
score: item.score,
color: scoreColor(item.score),
}));
return {
type: "logits",
title: "Masked token guesses",
note:
"Prompt: " +
prompt +
". Scores come from a softmax over the model's logits for the masked position.",
items,
insights: [
{
title: "Logits become probabilities",
body: "The model outputs raw scores (logits). A softmax converts them into probabilities that sum to 1.",
accent: "Softmax",
},
{
title: "Context reshapes the distribution",
body: "Change a word near the mask and the ranking shifts, because attention pulls in different signals.",
accent: "Attention",
},
{
title: "Top‑K is a slice, not the whole story",
body: "The model scores every token in the vocabulary, but we only visualize the top few for clarity.",
accent: "Top‑K",
},
],
};
Step 3
Generate a sentence embedding
Embeddings compress meaning into a numeric vector you can compare. The model produces a vector per token, then pooling combines them into a single sentence representation.
Switch between mean and CLS pooling, then watch how the vector preview and norm change.
Run the step to see output and a rendered result.
const modelId = "Xenova/all-MiniLM-L6-v2";
const pooling = "mean";
const normalize = true;
const text = "Embeddings turn text into a numeric fingerprint you can compare.";
const embedder = await getEmbedder(modelId);
const tensor = await embedder(text, { pooling, normalize });
const vector = toVector(tensor);
let sum = 0;
let min = vector[0] ?? 0;
let max = vector[0] ?? 0;
for (const value of vector) {
sum += value * value;
if (value < min) min = value;
if (value > max) max = value;
}
return {
type: "embedding",
title: "Embedding snapshot",
dimensions: vector.length,
norm: Math.sqrt(sum),
min,
max,
preview: vector.slice(0, 16),
note:
"First 16 dimensions from " +
modelId +
". Pooling decides how token vectors combine before similarity comparisons.",
insights: [
{
title: "Pooling defines the sentence view",
body: "Mean pooling averages all tokens. CLS uses the first token embedding as a summary signal.",
accent: "Pooling",
},
{
title: "Normalization stabilizes similarity",
body: "Unit‑length vectors make cosine similarity a simple dot product and keep scores consistent.",
accent: "Normalize",
},
{
title: "Embeddings are high‑dimensional fingerprints",
body: "Small edits nudge the vector, which is why similar phrases end up close together.",
accent: "Geometry",
},
],
};
Step 4
Run vector search with local storage
Vector search ranks documents by semantic similarity. We embed the query, compare it to stored vectors, and return the nearest neighbors.
Add a new document or swap datasets to see how the ranking shifts, and use localStorage to keep the index in your browser.
Run the step to see output and a rendered result.
const datasetId = "support";
const documents = [
"Checkout is stuck on loading after I click pay.",
"Cannot reset my password because the email never arrives.",
"The export button spins forever on large datasets.",
"Billing page shows two invoices for the same month.",
"Mobile layout cuts off the submit button on iOS.",
"I want a webhook when a job finishes successfully."
];
const query = "How do I fix the checkout spinner?";
const newDocument = "";
const topK = 3;
const persist = true;
const resetStore = false;
const modelId = "Xenova/all-MiniLM-L6-v2";
const pooling = "mean";
const normalize = true;
const storeKey = "embeddings-lab:" + datasetId;
const seedDocs = documents.map((text, index) => ({
id: "seed-" + index,
text,
}));
let stored = persist ? loadStore(storeKey) : null;
if (!Array.isArray(stored) || resetStore) {
stored = seedDocs;
}
if (newDocument.trim() && !stored.some(doc => doc.text === newDocument.trim())) {
stored = [
...stored,
{ id: "custom-" + Date.now(), text: newDocument.trim() },
];
}
const withVectors = [];
for (const doc of stored) {
const embedding = doc.embedding || (await embedText(modelId, doc.text, { pooling, normalize }));
withVectors.push({ ...doc, embedding });
}
if (persist) saveStore(storeKey, withVectors);
const queryVector = await embedText(modelId, query, { pooling, normalize });
const ranked = withVectors
.map(doc => ({
id: doc.id,
text: doc.text,
score: cosineSimilarity(queryVector, doc.embedding),
}))
.sort((a, b) => b.score - a.score)
.slice(0, topK);
return {
type: "vector-search",
title: "Top matches",
query,
count: withVectors.length,
items: ranked,
insights: [
{
title: "Query and documents share the same space",
body: "The query is embedded with the same model, then compared to each stored vector using cosine similarity.",
accent: "Similarity",
},
{
title: "Local storage is your lightweight vector DB",
body: "The lab stores vectors in localStorage so you can keep adding documents without a server.",
accent: "Local",
},
{
title: "Embedding choice changes ranking",
body: "Switch the model or pooling and the nearest neighbors will reorder.",
accent: "Tuning",
},
],
};
Concept map
Tokenization → Logits → Embeddings → Vector search
Tokenization creates a list of IDs the model understands. A masked model produces logits for the missing token, which are converted into probabilities. Embedding models compress entire sequences into fixed‑length vectors. Vector search compares those vectors to rank the nearest neighbors.
Why this matters for real apps
- Search without keywords: embeddings let users search by meaning, not exact phrasing.
- Product feedback clustering: vectors reveal themes without manual tagging.
- Local-first AI: in‑browser embeddings keep data private and reduce infrastructure.
What to notice as you experiment
- Token boundaries: word pieces like ##ing show how subwords are stitched together.
- Masked predictions: top tokens hint at what the model thinks is the most likely continuation.
- Embeddings: tiny edits to the sentence change the vector preview and similarity scores.
- Vector search: rankings shift as you add documents or switch pooling strategies.