Disclaimer: Signal Ward is an educational simulation. All clinical scenarios are fictional. Nothing in this course constitutes medical advice.
Vikram: Khalil, when you parse a URL like /patient/123/records, you split it into segments: "patient", "123", "records". Each segment means something different. Tokenization does the same for text.
Khalil: So "patient reports chest pain" becomes four tokens?
Vikram: Right. And each token maps to an index in our vocabulary. "patient" might be index 42, "reports" index 107, and so on. The model only sees these numbers.
You parse URLs all the time. new URL() breaks a URL into structured parts — protocol, host, pathname, search params. Tokenization does the same for natural language: it breaks a sentence into structured parts that a model can process.
new URL('https://hospital.org/patient/123').pathname.split('/')
'patient reports pain'.split(/\s+/).map(t => vocab[t])

// URL parsing: string → structured segments
const url = new URL('https://hospital.org/patient/123/records');
const segments = url.pathname.split('/').filter(Boolean);
// ['patient', '123', 'records']
// Text tokenization: string → structured tokens
function tokenize(text: string): string[] {
  return text.toLowerCase()
    .replace(/[.,!?;:]/g, ' ')
    .split(/\s+/)
    .filter(Boolean);
}
const tokens = tokenize('Patient reports chest pain.');
// ['patient', 'reports', 'chest', 'pain']

A vocabulary maps each unique word to an integer ID. This is exactly like an i18n translation dictionary: each key maps to a value.
// Build vocabulary from a corpus of notes
function buildVocab(corpus: string[]): Map<string, number> {
  const vocab = new Map<string, number>();
  vocab.set('<PAD>', 0); // padding token
  vocab.set('<UNK>', 1); // unknown words
  let index = 2;
  for (const text of corpus) {
    for (const token of tokenize(text)) {
      if (!vocab.has(token)) {
        vocab.set(token, index++);
      }
    }
  }
  return vocab;
}
const corpus = [
  'Patient reports chest pain',
  'Patient denies fever or chills',
];
const vocab = buildVocab(corpus);
// Map { '<PAD>' => 0, '<UNK>' => 1, 'patient' => 2, 'reports' => 3, ... }
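The <PAD> token reserved at index 0 matters once you batch notes of different lengths: shorter sequences get right-padded so every row in the batch has the same length. A minimal sketch under that assumption (`padTo` and `PAD_ID` are illustrative names, not part of the lesson code):

```typescript
// Illustrative padding helper; assumes <PAD> maps to id 0 as above.
const PAD_ID = 0;

function padTo(ids: number[], length: number): number[] {
  // Truncate if too long; right-pad with PAD_ID if too short.
  if (ids.length >= length) return ids.slice(0, length);
  return ids.concat(Array(length - ids.length).fill(PAD_ID));
}

const batch = [
  [2, 3, 4, 5],    // 'patient reports chest pain'
  [2, 6, 7, 8, 9], // 'patient denies fever or chills'
];
const maxLen = Math.max(...batch.map(ids => ids.length));
const padded = batch.map(ids => padTo(ids, maxLen));
// [[2, 3, 4, 5, 0], [2, 6, 7, 8, 9]]
```

This is the same trick as right-padding strings to align columns in a table: the content doesn't change, only the shape.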
// Convert text to token IDs
function encode(text: string, vocab: Map<string, number>): number[] {
  return tokenize(text).map(t => vocab.get(t) ?? 1); // 1 = <UNK>
}
console.log(encode('Patient reports chest pain', vocab));
// [2, 3, 4, 5]

Build a tokenizer and vocabulary encoder for clinical notes.
Implement an encode function that takes a text string and a vocabulary Map<string, number>, tokenizes the text (lowercase, strip punctuation, split on whitespace), and returns an array of token IDs. Use 1 as the ID for unknown words.
function tokenize(text: string): string[] {
  return text.toLowerCase()
    .replace(/[.,!?;:]/g, ' ')
    .split(/\s+/)
    .filter(Boolean);
}

function encode(text: string, vocab: Map<string, number>): number[] {
  // 1. Tokenize the text
  // 2. Map each token to its vocab index (use 1 for unknown)
  return null; // your code here
}

const vocab = new Map([
  ['<PAD>', 0], ['<UNK>', 1],
  ['patient', 2], ['reports', 3],
  ['chest', 4], ['pain', 5],
]);
const ids = encode('Patient reports chest pain', vocab);
Diagnostic One can now split clinical notes into tokens and map them to vocabulary indices.
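Going the other direction also works: since the vocabulary is a one-to-one map, you can invert it and turn token IDs back into tokens. A minimal sketch, assuming the vocab shape used throughout this lesson (`decode` and `inverse` are illustrative names):

```typescript
// Illustrative inverse lookup: token IDs back to tokens.
function decode(ids: number[], vocab: Map<string, number>): string[] {
  const inverse = new Map<number, string>();
  for (const [token, id] of vocab) inverse.set(id, token);
  return ids.map(id => inverse.get(id) ?? '<UNK>');
}

const vocab = new Map([
  ['<PAD>', 0], ['<UNK>', 1],
  ['patient', 2], ['reports', 3],
  ['chest', 4], ['pain', 5],
]);
console.log(decode([2, 3, 4, 5], vocab));
// ['patient', 'reports', 'chest', 'pain']
```

Note that decoding is lossy: casing and punctuation stripped during tokenization cannot be recovered.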
Next: connecting all the pieces in a complete forward pass.