sct parquet
Export a SNOMED CT NDJSON artefact to a Parquet file, directly queryable by DuckDB without any import step.
When to use: analytics, data science, columnar queries, or loading into pandas/Polars/R. For FTS and exact lookups, sct sqlite is better.
Usage
sct parquet --input <NDJSON> [--output <PARQUET>]
Options
| Flag | Default | Description |
|---|---|---|
--input <FILE> |
(required) | NDJSON file produced by sct ndjson. Use - for stdin. |
--output <FILE> |
snomed.parquet |
Output Parquet file path. |
Example
sct parquet \
--input snomedct-monolithrf2-production-20260311t120000z.ndjson \
--output snomed.parquet
Schema
| Column | Type | Notes |
|---|---|---|
id |
VARCHAR |
SCTID |
fsn |
VARCHAR |
Fully Specified Name |
preferred_term |
VARCHAR |
Preferred term for selected locale |
synonyms |
VARCHAR |
JSON array of strings |
hierarchy |
VARCHAR |
Top-level hierarchy label |
hierarchy_path |
VARCHAR |
JSON array of strings |
parents |
VARCHAR |
JSON array of {id, fsn} |
children_count |
BIGINT |
|
active |
BOOLEAN |
|
module |
VARCHAR |
SNOMED module identifier |
effective_time |
VARCHAR |
YYYYMMDD |
attributes |
VARCHAR |
JSON object of attribute groups |
ctv3_codes |
VARCHAR |
JSON array of strings (UK edition) |
read2_codes |
VARCHAR |
JSON array of strings (UK edition) |
schema_version |
BIGINT |
Artefact schema version |
Array/object columns are stored as JSON strings. DuckDB's json_extract, json_extract_string, and unnest can operate on them directly.
Example queries (DuckDB)
Queries are ordered from simple to complex. For context on the JSON array columns
(synonyms, hierarchy_path, parents, attributes) — they are stored as
VARCHAR JSON strings, so use DuckDB's json_extract, json_array_length, and
from_json rather than list functions.
Lookup and filter
Find a concept by preferred term
duckdb -c "
SELECT id, preferred_term, hierarchy
FROM 'snomed.parquet'
WHERE preferred_term ILIKE '%myocardial infarction%'"
Concepts with a specific attribute present
duckdb -c "
SELECT id, preferred_term
FROM 'snomed.parquet'
WHERE json_extract_string(attributes, '$.finding_site') IS NOT NULL
LIMIT 10"
Concepts modified in a given release
duckdb -c "
SELECT preferred_term, effective_time
FROM 'snomed.parquet'
WHERE effective_time = '20260311'
ORDER BY preferred_term
LIMIT 20"
Export a hierarchy to CSV
duckdb -c "
COPY (
SELECT id, preferred_term, fsn
FROM 'snomed.parquet'
WHERE hierarchy = 'Procedure'
ORDER BY preferred_term
) TO 'procedures.csv' (HEADER, DELIMITER ',')"
Aggregates and distributions
Concept count by top-level hierarchy
duckdb -c "
SELECT hierarchy, COUNT(*) AS n
FROM 'snomed.parquet'
WHERE active = true
GROUP BY hierarchy
ORDER BY n DESC"
Leaf concepts per hierarchy
Leaf concepts have no children — the most specific, fully-refined terms in the polyhierarchy.
duckdb -c "
SELECT hierarchy, COUNT(*) AS leaf_count
FROM 'snomed.parquet'
WHERE children_count = 0 AND active = true
GROUP BY hierarchy
ORDER BY leaf_count DESC"
Release timeline — how many concepts were last updated per release date
duckdb -c "
SELECT effective_time, COUNT(*) AS n
FROM 'snomed.parquet'
WHERE active = true
GROUP BY effective_time
ORDER BY effective_time DESC
LIMIT 20"
Synonym count histogram
Shows how terminology richness is distributed across the concept space.
duckdb -c "
SELECT json_array_length(synonyms) AS n_synonyms, COUNT(*) AS concepts
FROM 'snomed.parquet'
WHERE active = true
GROUP BY n_synonyms
ORDER BY n_synonyms"
Hierarchy depth profile
hierarchy_path is a JSON array of ancestor labels from root to concept.
Its length is the concept's depth in the polyhierarchy.
duckdb -c "
SELECT hierarchy,
ROUND(AVG(json_array_length(hierarchy_path)), 1) AS avg_depth,
MAX(json_array_length(hierarchy_path)) AS max_depth
FROM 'snomed.parquet'
WHERE active = true
GROUP BY hierarchy
ORDER BY avg_depth DESC"
JSON expansion
Concepts with the most synonyms
duckdb -c "
SELECT id, preferred_term, json_array_length(synonyms) AS n_synonyms
FROM 'snomed.parquet'
WHERE active = true
ORDER BY n_synonyms DESC
LIMIT 10"
Concepts with the most parents (deepest polyhierarchy membership)
Each entry in parents is a direct IS-A parent. High parent counts indicate
concepts that sit at the intersection of multiple classification axes.
duckdb -c "
SELECT id, preferred_term, json_array_length(parents) AS parent_count
FROM 'snomed.parquet'
WHERE active = true
ORDER BY parent_count DESC
LIMIT 10"
Expand synonyms to one row each
from_json parses the JSON array string into a typed list; unnest explodes it
into individual rows.
duckdb -c "
SELECT id, preferred_term,
unnest(from_json(synonyms, '[\"VARCHAR\"]')) AS synonym
FROM 'snomed.parquet'
WHERE active = true
AND json_array_length(synonyms) > 0
AND preferred_term ILIKE '%myocardial infarction%'
ORDER BY id, synonym"
Window functions
Most common finding sites across Clinical findings, with share percentage
duckdb -c "
WITH sites AS (
SELECT
json_extract_string(attributes, '$.finding_site[0].fsn') AS site,
COUNT(*) AS n,
ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct
FROM 'snomed.parquet'
WHERE json_extract(attributes, '$.finding_site') IS NOT NULL
AND active = true
GROUP BY site
)
SELECT site, n, pct
FROM sites
ORDER BY n DESC
LIMIT 15"
Top 2 most-synonymised concepts per hierarchy
QUALIFY filters rows after window functions are evaluated — equivalent to
wrapping in a subquery and filtering on the ranked result, but more concise.
duckdb -c "
SELECT id, preferred_term, hierarchy,
json_array_length(synonyms) AS n_synonyms,
RANK() OVER (
PARTITION BY hierarchy
ORDER BY json_array_length(synonyms) DESC
) AS rank_in_hierarchy
FROM 'snomed.parquet'
WHERE active = true
QUALIFY rank_in_hierarchy <= 2
ORDER BY hierarchy, rank_in_hierarchy"
Load into Python
import polars as pl
df = pl.read_parquet("snomed.parquet")
df.filter(pl.col("hierarchy") == "Clinical finding").head(10)
import pandas as pd
df = pd.read_parquet("snomed.parquet")
df[df.hierarchy == "Procedure"].preferred_term.head(20)
Tips
- DuckDB reads Parquet files in-place with zero import overhead — just reference the file path directly in queries.
- The Parquet file is ~250 MB for the full UK Monolith (vs ~1.2 GB NDJSON), owing to columnar compression.
- Written in batches of 50,000 rows using Arrow for memory efficiency.
- For analytics workloads, Parquet is faster than SQLite; for FTS and exact lookups, prefer
sct sqlite.