sct parquet

Export a SNOMED CT NDJSON artefact to a Parquet file that DuckDB can query directly, with no import step.

When to use: analytics, data science, columnar queries, or loading into pandas/Polars/R. For FTS and exact lookups, sct sqlite is better.


Usage

sct parquet --input <NDJSON> [--output <PARQUET>]

Options

Flag Default Description
--input <FILE> (required) NDJSON file produced by sct ndjson. Use - for stdin.
--output <FILE> snomed.parquet Output Parquet file path.

Example

sct parquet \
  --input snomedct-monolithrf2-production-20260311t120000z.ndjson \
  --output snomed.parquet

Schema

Column Type Notes
id VARCHAR SCTID
fsn VARCHAR Fully Specified Name
preferred_term VARCHAR Preferred term for selected locale
synonyms VARCHAR JSON array of strings
hierarchy VARCHAR Top-level hierarchy label
hierarchy_path VARCHAR JSON array of strings
parents VARCHAR JSON array of {id, fsn}
children_count BIGINT Number of direct child concepts (0 = leaf)
active BOOLEAN Whether the concept is active in the release
module VARCHAR SNOMED module identifier
effective_time VARCHAR YYYYMMDD
attributes VARCHAR JSON object of attribute groups
ctv3_codes VARCHAR JSON array of strings (UK edition)
read2_codes VARCHAR JSON array of strings (UK edition)
schema_version BIGINT Artefact schema version

Array/object columns are stored as JSON strings. DuckDB's json_extract and json_extract_string operate on them directly; to use list functions such as unnest, first parse the string with from_json (see the JSON expansion examples below).
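The same JSON-string convention applies to consumers outside DuckDB: decode each column value with an ordinary JSON parser. A minimal Python sketch, using illustrative placeholder values rather than a real artefact row:

```python
import json

# Illustrative row values mimicking the Parquet schema above
# (the SCTIDs and terms here are placeholders, not real codes).
row = {
    "id": "100000001",
    "synonyms": '["Heart attack", "MI"]',
    "parents": '[{"id": "100000000", "fsn": "Example parent (disorder)"}]',
}

# Decode the JSON-string columns into native Python structures.
synonyms = json.loads(row["synonyms"])
parents = json.loads(row["parents"])

print(synonyms)           # list of synonym strings
print(parents[0]["id"])   # SCTID of the first direct parent
```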


Example queries (DuckDB)

Queries are ordered from simple to complex. The JSON columns (synonyms, hierarchy_path, parents, attributes) are stored as VARCHAR JSON strings, so use DuckDB's json_extract, json_array_length, and from_json rather than native list functions.


Lookup and filter

Find a concept by preferred term

duckdb -c "
  SELECT id, preferred_term, hierarchy
  FROM 'snomed.parquet'
  WHERE preferred_term ILIKE '%myocardial infarction%'"

Concepts with a specific attribute present

duckdb -c "
  SELECT id, preferred_term
  FROM 'snomed.parquet'
  WHERE json_extract_string(attributes, '$.finding_site') IS NOT NULL
  LIMIT 10"

Concepts modified in a given release

duckdb -c "
  SELECT preferred_term, effective_time
  FROM 'snomed.parquet'
  WHERE effective_time = '20260311'
  ORDER BY preferred_term
  LIMIT 20"

Export a hierarchy to CSV

duckdb -c "
  COPY (
    SELECT id, preferred_term, fsn
    FROM 'snomed.parquet'
    WHERE hierarchy = 'Procedure'
    ORDER BY preferred_term
  ) TO 'procedures.csv' (HEADER, DELIMITER ',')"

Aggregates and distributions

Concept count by top-level hierarchy

duckdb -c "
  SELECT hierarchy, COUNT(*) AS n
  FROM 'snomed.parquet'
  WHERE active = true
  GROUP BY hierarchy
  ORDER BY n DESC"

Leaf concepts per hierarchy

Leaf concepts have no children — the most specific, fully-refined terms in the polyhierarchy.

duckdb -c "
  SELECT hierarchy, COUNT(*) AS leaf_count
  FROM 'snomed.parquet'
  WHERE children_count = 0 AND active = true
  GROUP BY hierarchy
  ORDER BY leaf_count DESC"

Release timeline — how many concepts were last updated per release date

duckdb -c "
  SELECT effective_time, COUNT(*) AS n
  FROM 'snomed.parquet'
  WHERE active = true
  GROUP BY effective_time
  ORDER BY effective_time DESC
  LIMIT 20"

Synonym count histogram

Shows how terminology richness is distributed across the concept space.

duckdb -c "
  SELECT json_array_length(synonyms) AS n_synonyms, COUNT(*) AS concepts
  FROM 'snomed.parquet'
  WHERE active = true
  GROUP BY n_synonyms
  ORDER BY n_synonyms"

Hierarchy depth profile

hierarchy_path is a JSON array of ancestor labels from root to concept. Its length is the concept's depth in the polyhierarchy.

duckdb -c "
  SELECT hierarchy,
         ROUND(AVG(json_array_length(hierarchy_path)), 1) AS avg_depth,
         MAX(json_array_length(hierarchy_path)) AS max_depth
  FROM 'snomed.parquet'
  WHERE active = true
  GROUP BY hierarchy
  ORDER BY avg_depth DESC"

JSON expansion

Concepts with the most synonyms

duckdb -c "
  SELECT id, preferred_term, json_array_length(synonyms) AS n_synonyms
  FROM 'snomed.parquet'
  WHERE active = true
  ORDER BY n_synonyms DESC
  LIMIT 10"

Concepts with the most parents (broadest polyhierarchy membership)

Each entry in parents is a direct IS-A parent. High parent counts indicate concepts that sit at the intersection of multiple classification axes.

duckdb -c "
  SELECT id, preferred_term, json_array_length(parents) AS parent_count
  FROM 'snomed.parquet'
  WHERE active = true
  ORDER BY parent_count DESC
  LIMIT 10"

Expand synonyms to one row each

from_json parses the JSON array string into a typed list; unnest explodes it into individual rows.

duckdb -c "
  SELECT id, preferred_term,
         unnest(from_json(synonyms, '[\"VARCHAR\"]')) AS synonym
  FROM 'snomed.parquet'
  WHERE active = true
    AND json_array_length(synonyms) > 0
    AND preferred_term ILIKE '%myocardial infarction%'
  ORDER BY id, synonym"

Window functions

Most common finding sites across Clinical findings, with share percentage

duckdb -c "
  WITH sites AS (
    SELECT
      json_extract_string(attributes, '$.finding_site[0].fsn') AS site,
      COUNT(*) AS n,
      ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS pct
    FROM 'snomed.parquet'
    WHERE json_extract(attributes, '$.finding_site') IS NOT NULL
      AND active = true
    GROUP BY site
  )
  SELECT site, n, pct
  FROM sites
  ORDER BY n DESC
  LIMIT 15"

Top 2 most-synonymised concepts per hierarchy

QUALIFY filters rows after window functions are evaluated — equivalent to wrapping in a subquery and filtering on the ranked result, but more concise.

duckdb -c "
  SELECT id, preferred_term, hierarchy,
         json_array_length(synonyms) AS n_synonyms,
         RANK() OVER (
           PARTITION BY hierarchy
           ORDER BY json_array_length(synonyms) DESC
         ) AS rank_in_hierarchy
  FROM 'snomed.parquet'
  WHERE active = true
  QUALIFY rank_in_hierarchy <= 2
  ORDER BY hierarchy, rank_in_hierarchy"

Load into Python

Polars:

import polars as pl

df = pl.read_parquet("snomed.parquet")
df.filter(pl.col("hierarchy") == "Clinical finding").head(10)

pandas:

import pandas as pd

df = pd.read_parquet("snomed.parquet")
df[df.hierarchy == "Procedure"].preferred_term.head(20)

Tips

  • DuckDB reads Parquet files in-place with zero import overhead — just reference the file path directly in queries.
  • The Parquet file is ~250 MB for the full UK Monolith (vs ~1.2 GB NDJSON), owing to columnar compression.
  • Written in batches of 50,000 rows using Arrow for memory efficiency.
  • For analytics workloads, Parquet is faster than SQLite; for FTS and exact lookups, prefer sct sqlite.