0006: XSD Support — Library Choice, Security Posture, and Type Mapping
Status: Accepted
Date: 2026-05-28
Authors: Ben Lin, Claude Sonnet 4.6
Context
weirding's primary authoring format is the plain-attribute annotation convention (ADR-0001). XSD was explicitly rejected as the primary format: it is too verbose for LLM prompt templates, and no maintained Python runtime XSD converter existed at ADR-0001 decision time.
However, XSD input support is a legitimate secondary need. Enterprise environments
frequently have existing XSD contracts (SOAP services, document interchange schemas,
regulatory definitions) that users want to compile into Pydantic models via weirding's
pipeline. Phase 03 adds weirding[xsd], an optional extra that enables compile() to
accept XSD documents.
Three decisions had non-obvious rationale and required a record:
1. Library selection. Any XSD runtime implementation requires a library that can parse and walk XSD schema components (complex types, element declarations, facets). The constraints were:
- Must be a pure-Python dependency (no Rust/C extensions; see MEMORY.md rule 6)
- Must be actively maintained
- Must expose a component-level object model (not just validation), so we can walk types to produce JSON Schema IR
- Must support XSD 1.0 at minimum
Candidates surveyed:
- xmlschema (Davide Brunato, active since 2016, latest release 2024): pure Python,
full XSD 1.0 and 1.1 support, rich component object model, defuse security parameter,
active maintenance. The only viable candidate.
- lxml.etree: ships an XSD validator but exposes no component object model — cannot
walk element declarations or type hierarchies programmatically.
- generateDS / pyxb: generate Python code from XSD at build time, not at
runtime. Not applicable to a library that compiles XSD on-demand from user input.
- datamodel-code-generator: explicitly banned as a runtime dependency by MEMORY.md
rule 7 (ruff subprocess startup cost, black cold start cost, Databricks binary
compatibility risk).
- No other maintained Python runtime XSD parser exists.
2. Security posture. weirding receives XSD as arbitrary user-provided input. XSD
documents can reference external resources via xs:import, xs:include, and XML entity
declarations. An unrestricted parse can trigger:
- XXE (XML External Entity) injection — reading local files or making network requests
through entity expansion
- SSRF (Server-Side Request Forgery) via xs:import resolving to attacker-controlled URLs
- Billion-laughs / quadratic blowup via entity nesting
The xmlschema library offers a defuse parameter with three levels:
- "remote" (default): blocks only remote resource fetching; local file resolution and
entity expansion are permitted
- "local": blocks remote and local file resolution; entity expansion permitted
- "always": blocks all external resolution and entity expansion unconditionally
The decision was to use defuse="always" exclusively, not the default. The threat model
for weirding — a library that runs in Databricks, Kubernetes, and serverless environments
and processes user-provided XSD strings — requires zero external I/O on the schema parse
path. defuse="remote" would still permit local file reads, which is unacceptable when
weirding is deployed in a multi-tenant environment or receives schema content over a
network boundary.
Defense-in-depth: weirding's make_parser() (the lxml XMLParser with
resolve_entities=False, no_network=True, load_dtd=False, huge_tree=False) runs
first and produces the lxml.etree._Element that is passed to XMLSchema(root, defuse="always").
The lxml layer prevents entity injection before xmlschema ever processes the input.
defuse="always" is a second independent layer ensuring xmlschema's own resource
resolution logic is also disabled.
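The two layers can be sketched as follows. This is a minimal illustration and not weirding's actual code: the function name load_xsd_defused and the SAMPLE_XSD constant are hypothetical, and the parser options are the ones listed above. The imports sit inside the function so the sketch only needs lxml and xmlschema when it is called.

```python
SAMPLE_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="note" type="xs:string"/>
</xs:schema>"""


def load_xsd_defused(xsd_text: str):
    """Parse an XSD string with a hardened lxml parser, then build the schema.

    Hypothetical helper; the name and signature are assumptions.
    """
    # Lazy imports: both packages are third-party and optional.
    from lxml import etree
    import xmlschema

    # Layer 1: lxml parser with entity resolution, network access,
    # DTD loading, and oversized-tree support all switched off.
    parser = etree.XMLParser(
        resolve_entities=False,
        no_network=True,
        load_dtd=False,
        huge_tree=False,
    )
    # Bytes input avoids lxml's error for str input that carries an
    # XML encoding declaration.
    root = etree.fromstring(xsd_text.encode("utf-8"), parser)

    # Layer 2: xmlschema builds its component model from the parsed
    # element, with its own defusing set to the strictest level.
    return xmlschema.XMLSchema(root, defuse="always")
```

With both packages installed, load_xsd_defused(SAMPLE_XSD) should return a schema object whose elements mapping contains "note".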
A consequence of defuse="always" is that multi-file XSD schemas using xs:import or
xs:include do not work. This is a deliberate scope limitation: weirding accepts XSD as
an in-memory string, not as a file path, so relative xs:include paths have no meaningful
resolution base anyway. Single-document self-contained XSDs are the supported scope.
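Because multi-file schemas are out of scope, a caller may want to reject them early with a clear message. The sketch below is one possible pre-flight check using only the standard library; check_self_contained and UnsupportedXsdError are hypothetical names, not part of weirding. It stops at the first DOCTYPE declaration, so entity declarations are never processed, and it matches xs:import and xs:include by namespace URI, not by prefix.

```python
import xml.parsers.expat

XSD_NS = "http://www.w3.org/2001/XMLSchema"
# With namespace_separator=" ", expat reports names as "<uri> <local>".
_BLOCKED = {f"{XSD_NS} import", f"{XSD_NS} include"}


class UnsupportedXsdError(ValueError):
    """Raised when an XSD string falls outside the single-document scope."""


def check_self_contained(xsd_text: str) -> None:
    """Raise UnsupportedXsdError for DOCTYPEs, xs:import, or xs:include."""
    parser = xml.parsers.expat.ParserCreate(namespace_separator=" ")

    def on_doctype(name, sysid, pubid, has_internal_subset):
        # Abort before any entity declaration can be read.
        raise UnsupportedXsdError("DOCTYPE declarations are not accepted")

    def on_start(tag, attrs):
        if tag in _BLOCKED:
            local = tag.split(" ", 1)[1]
            raise UnsupportedXsdError(
                f"xs:{local} is not supported; inline the referenced schema"
            )

    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.Parse(xsd_text, True)
```

A check like this complements the parser hardening described above and does not replace it; its main value is a readable error for users who pass a multi-file schema.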
3. Type map key format. When walking xmlschema type objects, type names are exposed
via type_obj.name. The value is a qualified name in Clark notation, with the namespace URI in braces:
{http://www.w3.org/2001/XMLSchema}string. The type_obj.prefixed_name attribute
returns xs:string, but this prefix is the schema document's own declared prefix for the
XSD namespace — it could be xsd:, x:, or anything else the document author chose.
Using prefixed_name as a map key would fail silently for any schema that uses a
non-xs prefix for the XSD namespace.
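The difference can be shown with plain dictionaries. The sketch below is illustrative only: CLARK_TYPE_MAP is a hypothetical three-entry stand-in for the real map, and to_clark is a hypothetical helper that expands a prefixed name using a document's prefix-to-namespace mapping.

```python
XSD_NS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical subset of a Clark-keyed map (XSD built-in -> JSON Schema).
CLARK_TYPE_MAP = {
    f"{{{XSD_NS}}}string": {"type": "string"},
    f"{{{XSD_NS}}}integer": {"type": "integer"},
    f"{{{XSD_NS}}}boolean": {"type": "boolean"},
}

# A prefix-keyed map only works for documents that happen to use "xs".
PREFIXED_TYPE_MAP = {"xs:string": {"type": "string"}}


def to_clark(prefixed_name: str, nsmap: dict[str, str]) -> str:
    """Expand 'prefix:local' to '{namespace-uri}local' using nsmap."""
    prefix, _, local = prefixed_name.rpartition(":")
    return f"{{{nsmap[prefix]}}}{local}"


# Two documents that bind different prefixes to the same XSD namespace.
key_a = to_clark("xs:string", {"xs": XSD_NS})
key_b = to_clark("xsd:string", {"xsd": XSD_NS})

same_key = key_a == key_b                        # both expand identically
prefix_hit = "xsd:string" in PREFIXED_TYPE_MAP   # prefix-keyed lookup misses
clark_hit = key_b in CLARK_TYPE_MAP              # Clark-keyed lookup succeeds
```

Both prefixed names expand to the same Clark key, so a single entry in the Clark-keyed map serves every document regardless of the prefix its author chose.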
Decision
We will use xmlschema>=3.0 as the sole XSD parsing and component-walking library,
scoped to the optional weirding[xsd] extra. It is never a base dependency.
Every call to xmlschema.XMLSchema() must pass defuse="always". The default
(defuse="remote") is never used in weirding code. This applies to all call sites in
src/weirding/xsd/ and any future modules that invoke xmlschema.
The XSD type map (_XSD_TYPE_MAP in src/weirding/xsd/_bridge.py) uses Clark-notation
URI keys and is looked up via type_obj.name. The prefixed_name attribute is never used
for type dispatch.
The following are explicitly out of scope for the XSD bridge:
- xs:import and xs:include (blocked by defuse="always")
- xs:extension / xs:restriction on complex content (inheritance chains)
- Multiple top-level element declarations (first declaration used as root)
Consequences
Positive
- xmlschema is the only maintained option; no meaningful library selection risk.
- defuse="always" + lxml make_parser() provides defense-in-depth against XXE, SSRF, and entity-expansion attacks on all code paths that process XSD input.
- Clark-notation keys in _XSD_TYPE_MAP make type dispatch correct regardless of the XSD namespace prefix the schema author chose.
- Scoping xmlschema to [xsd] keeps the base install footprint small; users who only use the plain-attribute annotation path do not pay the import cost.
Negative
- Multi-file XSD schemas using xs:import or xs:include are not supported. Users with large enterprise XSD suites that split across files must pre-process (inline) the schema before passing it to compile().
- xmlschema adds approximately 1–2 MB to the installed package size for [xsd] users.
- xmlschema.XMLSchema(root, defuse="always") performs full schema validation on every compile() call. For very large XSD files (thousands of type definitions), this may add noticeable latency. Applications that call compile() in a hot path should cache the result.
- Schema inheritance (xs:extension, xs:restriction on complex content) is not supported. Fields from base types will not appear in the produced IR.
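For the hot-path case, one simple caching approach is to memoize on the XSD string itself. The sketch below uses functools.lru_cache around a stand-in function; expensive_compile and cached_compile are hypothetical names, and the stand-in body only simulates the real compile step, whose signature is not assumed here.

```python
from functools import lru_cache

# Call counter, present only to make the caching effect observable.
CALLS = {"count": 0}


def expensive_compile(xsd_text: str) -> tuple:
    """Stand-in for the real compile step (hypothetical)."""
    CALLS["count"] += 1
    return ("compiled", len(xsd_text))


@lru_cache(maxsize=32)
def cached_compile(xsd_text: str) -> tuple:
    """Return the cached result for an XSD string seen before."""
    return expensive_compile(xsd_text)


doc = "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'/>"
first = cached_compile(doc)
second = cached_compile(doc)  # served from the cache, no second compile
```

The cache keeps each distinct XSD string alive as a key, so for very large schemas a digest of the text may be a better key than the text itself.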
Neutral
- Any future code that calls xmlschema.XMLSchema() must pass defuse="always". This is a permanent requirement, not a configuration choice.
- Any new XSD type mappings added to _XSD_TYPE_MAP must use Clark-notation URI keys.
- xs:choice model groups are supported and emit oneOf; xs:sequence and xs:all emit standard properties/required objects.
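The JSON Schema shapes named in the last item might look like the following. This is a sketch under stated assumptions, not the bridge's actual output: the function names are hypothetical, a field is treated as required when the caller flags it (for example because minOccurs is at least 1), and each xs:choice alternative is assumed to become a single-property object.

```python
def sequence_to_schema(fields: list[tuple[str, dict, bool]]) -> dict:
    """Build a properties/required object from (name, schema, required) triples."""
    schema: dict = {
        "type": "object",
        "properties": {name: sub for name, sub, _ in fields},
    }
    required = [name for name, _, is_required in fields if is_required]
    if required:
        schema["required"] = required
    return schema


def choice_to_schema(alternatives: list[tuple[str, dict]]) -> dict:
    """Build a oneOf where each branch requires exactly one alternative."""
    return {
        "oneOf": [
            sequence_to_schema([(name, sub, True)])
            for name, sub in alternatives
        ]
    }


seq = sequence_to_schema([
    ("id", {"type": "integer"}, True),
    ("note", {"type": "string"}, False),
])
choice = choice_to_schema([
    ("email", {"type": "string"}),
    ("phone", {"type": "string"}),
])
```

Here seq carries both properties with only "id" listed as required, while choice holds two oneOf branches, one per alternative.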