sct embed
Generate vector embeddings from a SNOMED CT NDJSON artefact and write an Apache Arrow IPC file for semantic vector search.
Embeddings are produced by a local Ollama instance — no bundled model, no external API key. The Arrow IPC output can be queried in DuckDB, loaded into Python (PyArrow/Pandas), or imported into LanceDB or any Arrow-compatible vector store.
sct embed is the only sct subcommand that requires an external process (Ollama). All others work fully offline.
Usage
sct embed --input <NDJSON> [--output <FILE>] [--model <MODEL>] [--batch-size <N>] [--ollama-url <URL>]
Options
| Flag | Default | Description |
|---|---|---|
| --input <FILE> | (required) | NDJSON file produced by sct ndjson. Use - for stdin. |
| --output <FILE> | snomed-embeddings.arrow | Output Arrow IPC file. |
| --model <MODEL> | nomic-embed-text | Ollama model name to use for embeddings. |
| --batch-size <N> | 64 | Number of concepts to embed per Ollama API call. |
| --ollama-url <URL> | http://localhost:11434 | Ollama base URL. |
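To make the meaning of --input - and --batch-size concrete, here is a minimal Python sketch of reading NDJSON from a file or stdin and grouping records into batches. It only illustrates the option semantics and is not how sct embed is implemented; the helper names open_input, read_concepts, and batched are hypothetical.

```python
import json
import sys
from itertools import islice
from typing import Iterable, Iterator, TextIO


def open_input(path: str) -> TextIO:
    """Return stdin when path is "-", otherwise open the file for reading."""
    return sys.stdin if path == "-" else open(path, encoding="utf-8")


def read_concepts(stream: Iterable[str]) -> Iterator[dict]:
    """Parse one JSON object per non-empty line (NDJSON)."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def batched(items: Iterable[dict], size: int = 64) -> Iterator[list]:
    """Yield lists of at most `size` items; the last batch may be shorter."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch
```

Each list yielded by batched(read_concepts(open_input("-")), 64) would correspond to one embedding request of up to 64 concepts.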
Prerequisites: Ollama
This command requires Ollama to be running with the nomic-embed-text model pulled:
# Install Ollama (see https://ollama.com/download)
ollama pull nomic-embed-text
ollama serve # or it may already be running as a service
Verify it's working:
curl http://localhost:11434/api/embed \
-d '{"model": "nomic-embed-text", "input": ["test"]}'
If Ollama is not running when you run sct embed, you will see a helpful error with instructions to start it.
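The same check can be made from Python using only the standard library. This sketch mirrors the curl call above: it posts a model name and a list of inputs to /api/embed and reads the embeddings field from the JSON response. The function names build_embed_request and embed are hypothetical, and the 60-second timeout is an arbitrary assumption.

```python
import json
import urllib.request


def build_embed_request(texts, model="nomic-embed-text",
                        base_url="http://localhost:11434"):
    """Build a POST request for Ollama's /api/embed endpoint."""
    payload = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts, **kwargs):
    """Send the request and return one vector per input text.

    Raises urllib.error.URLError if Ollama is not reachable.
    """
    request = build_embed_request(texts, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as resp:
        return json.load(resp)["embeddings"]
```

With Ollama running, len(embed(["test"])[0]) should report the model's embedding dimension.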
Example
# Pull the model once
ollama pull nomic-embed-text
# Generate embeddings (takes ~30 minutes for 831k concepts on CPU)
sct embed \
--input snomed.ndjson \
--output snomed-embeddings.arrow
Custom Ollama URL (e.g. remote GPU host)
sct embed \
--input snomed.ndjson \
--ollama-url http://192.168.1.100:11434 \
--output snomed-embeddings.arrow
Embedding text format
Each concept is embedded as a single string combining all its human-readable content:
{preferred_term}. {fsn}. Synonyms: {synonyms joined with ", "}. Hierarchy: {hierarchy_path joined with " > "}.
Example:
Heart attack. Myocardial infarction (disorder). Synonyms: Cardiac infarction, MI - Myocardial infarction. Hierarchy: SNOMED CT Concept > Clinical finding > Disorder of cardiovascular system > Ischemic heart disease > Myocardial infarction.
This rich format means the query sct semantic "blocked coronary artery" can match Myocardial infarction even though none of those words appear in the preferred term.
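The template above can be sketched as a small Python function. The dictionary keys are taken from the placeholder names in the template and are an assumption about the NDJSON record layout; the function name embedding_text is hypothetical.

```python
def embedding_text(concept: dict) -> str:
    """Combine a concept's human-readable fields into one string to embed."""
    synonyms = ", ".join(concept["synonyms"])
    hierarchy = " > ".join(concept["hierarchy_path"])
    return (
        f"{concept['preferred_term']}. {concept['fsn']}. "
        f"Synonyms: {synonyms}. Hierarchy: {hierarchy}."
    )


# The concept from the example above, written as a record
concept = {
    "preferred_term": "Heart attack",
    "fsn": "Myocardial infarction (disorder)",
    "synonyms": ["Cardiac infarction", "MI - Myocardial infarction"],
    "hierarchy_path": [
        "SNOMED CT Concept",
        "Clinical finding",
        "Disorder of cardiovascular system",
        "Ischemic heart disease",
        "Myocardial infarction",
    ],
}
print(embedding_text(concept))
```

Running this prints the example string shown above.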
Output format
The output is a single Arrow IPC (.arrow) file with the following schema:
| Column | Type | Description |
|---|---|---|
| id | utf8 | SCTID |
| preferred_term | utf8 | Preferred term |
| hierarchy | utf8 | Top-level hierarchy name |
| embedding | fixed_size_list<float32>[N] | Vector embedding (dimension determined by model) |
For nomic-embed-text the dimension is 768.
Querying the embeddings
Via sct semantic (recommended)
sct semantic "blocked coronary artery" --embeddings snomed-embeddings.arrow --limit 5
See sct semantic for full documentation.
DuckDB (vector similarity search)
INSTALL vss;
LOAD vss;
SELECT id, preferred_term, hierarchy,
array_cosine_similarity(embedding, $query_vec::FLOAT[768]) AS score
FROM read_ipc_auto('snomed-embeddings.arrow')
ORDER BY score DESC
LIMIT 10;
Python (PyArrow + NumPy)
import pyarrow.ipc as ipc
import numpy as np
import ollama
# Load embeddings
with ipc.open_file("snomed-embeddings.arrow") as f:
    table = f.read_all()
embeddings = np.array(table["embedding"].to_pylist(), dtype=np.float32)
# Embed query
resp = ollama.embed(model="nomic-embed-text", input=["heart attack"])
q = np.array(resp["embeddings"][0], dtype=np.float32)
# Cosine similarity
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normed = embeddings / (norms + 1e-9)
q_normed = q / (np.linalg.norm(q) + 1e-9)
scores = normed @ q_normed
top_idx = np.argsort(scores)[::-1][:10]
ids = table["id"].to_pylist()
terms = table["preferred_term"].to_pylist()
for i in top_idx:
    print(f"{scores[i]:.4f} {ids[i]} {terms[i]}")
Import into LanceDB
import lancedb
import pyarrow.ipc as ipc
with ipc.open_file("snomed-embeddings.arrow") as f:
    table = f.read_all()
db = lancedb.connect("snomed.lance")
db.create_table("concepts", data=table, mode="overwrite")
Notes
- Embedding 831k concepts takes significant time on CPU (~30 min). A GPU or Apple Silicon machine will be much faster.
- nomic-embed-text produces 768-dimensional float32 vectors. Other models with different dimensions will work automatically.
- The complete dataset is held in memory during embedding. For limited RAM, use --batch-size 16 or lower.
- The .arrow file is also consumed by sct mcp --embeddings to expose snomed_semantic_search to AI clients.