Disclaimer: Signal Ward is an educational simulation. All clinical scenarios are fictional. Nothing in this course constitutes medical advice.
Vikram: Khalil, when you parse a URL like /patient/123/records, you split it into segments: "patient", "123", "records". Each segment means something different. Tokenization does the same for text.
Khalil: So "patient reports chest pain" becomes four tokens?
Vikram: Right. And each token maps to an index in our vocabulary. "patient" might be index 42, "reports" index 107, and so on. The model only sees these numbers.
You parse URLs all the time. new URL() breaks a URL into structured parts — protocol, host, pathname, search params. Tokenization does the same for natural language: it breaks a sentence into structured parts that a model can process.
new URL('https://hospital.org/patient/123').pathname.split('/')
'patient reports pain'.split(/\s+/).map(t => vocab[t])

// URL parsing: string → structured segments
const url = new URL('https://hospital.org/patient/123/records');
const segments = url.pathname.split('/').filter(Boolean);
// ['patient', '123', 'records']
// Text tokenization: string → structured tokens
function tokenize(text: string): string[] {
  return text.toLowerCase()
    .replace(/[.,!?;:]/g, ' ')
    .split(/\s+/)
    .filter(Boolean);
}
const tokens = tokenize('Patient reports chest pain.');
// ['patient', 'reports', 'chest', 'pain']

A vocabulary maps each unique word to an integer ID. This is exactly like an i18n translation dictionary: each key maps to a value.
// Build vocabulary from a corpus of notes
function buildVocab(corpus: string[]): Map<string, number> {
  const vocab = new Map<string, number>();
  vocab.set('<PAD>', 0); // padding token
  vocab.set('<UNK>', 1); // unknown words
  let index = 2;
  for (const text of corpus) {
    for (const token of tokenize(text)) {
      if (!vocab.has(token)) {
        vocab.set(token, index++);
      }
    }
  }
  return vocab;
}
const corpus = [
  'Patient reports chest pain',
  'Patient denies fever or chills',
];
const vocab = buildVocab(corpus);
// Map { '<PAD>' => 0, '<UNK>' => 1, 'patient' => 2, 'reports' => 3, ... }
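The <PAD> token reserved at index 0 matters once you batch notes of different lengths: shorter sequences get right-padded so every row in the batch has the same length. A minimal sketch under that assumption (`padTo` and `PAD_ID` are illustrative names, not part of the lesson code):

```typescript
// Illustrative padding helper; assumes <PAD> maps to id 0 as above.
const PAD_ID = 0;

function padTo(ids: number[], length: number): number[] {
  // Truncate if too long; right-pad with PAD_ID if too short.
  if (ids.length >= length) return ids.slice(0, length);
  return ids.concat(Array(length - ids.length).fill(PAD_ID));
}

const batch = [
  [2, 3, 4, 5],    // 'patient reports chest pain'
  [2, 6, 7, 8, 9], // 'patient denies fever or chills'
];
const maxLen = Math.max(...batch.map(ids => ids.length));
const padded = batch.map(ids => padTo(ids, maxLen));
// [[2, 3, 4, 5, 0], [2, 6, 7, 8, 9]]
```

This is the same trick as right-padding strings to align columns in a table: the content doesn't change, only the shape.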
// Convert text to token IDs
function encode(text: string, vocab: Map<string, number>): number[] {
  return tokenize(text).map(t => vocab.get(t) ?? 1); // 1 = <UNK>
}
console.log(encode('Patient reports chest pain', vocab));
// [2, 3, 4, 5]

Build a tokenizer and vocabulary encoder for clinical notes.
Implement an encode function that takes a text string and a vocabulary Map<string, number>, tokenizes the text (lowercase, strip punctuation, split on whitespace), and returns an array of token IDs. Use 1 as the ID for unknown words.
function tokenize(text: string): string[] {
  return text.toLowerCase()
    .replace(/[.,!?;:]/g, ' ')
    .split(/\s+/)
    .filter(Boolean);
}

function encode(text: string, vocab: Map<string, number>): number[] {
  // 1. Tokenize the text
  // 2. Map each token to its vocab index (use 1 for unknown)
  return null; // your code here
}

const vocab = new Map([
  ['<PAD>', 0], ['<UNK>', 1],
  ['patient', 2], ['reports', 3],
  ['chest', 4], ['pain', 5],
]);
const ids = encode('Patient reports chest pain', vocab);
Diagnostic One can now split clinical notes into tokens and map them to vocabulary indices.
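Going the other direction also works: since the vocabulary is a one-to-one map, you can invert it and turn token IDs back into tokens. A minimal sketch, assuming the vocab shape used throughout this lesson (`decode` and `inverse` are illustrative names):

```typescript
// Illustrative inverse lookup: token IDs back to tokens.
function decode(ids: number[], vocab: Map<string, number>): string[] {
  const inverse = new Map<number, string>();
  for (const [token, id] of vocab) inverse.set(id, token);
  return ids.map(id => inverse.get(id) ?? '<UNK>');
}

const vocab = new Map([
  ['<PAD>', 0], ['<UNK>', 1],
  ['patient', 2], ['reports', 3],
  ['chest', 4], ['pain', 5],
]);
console.log(decode([2, 3, 4, 5], vocab));
// ['patient', 'reports', 'chest', 'pain']
```

Note that decoding is lossy: casing and punctuation stripped during tokenization cannot be recovered.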
Next: connecting all the pieces in a complete forward pass.