A Customer Couldn’t Upload an Invoice. The Fix Re-Taught Me XSD, JAXB, and Maven.
A Polish customer reports their invoice won’t upload.
The file looks valid. Other Polish customers upload fine.
Three days later I’ve read more of the Polish Tax Code than I’d like to admit.
This is the story of how “a customer can’t upload an invoice” turned into adding an entire third XML standard to a working integration — and what that taught me about parts of the Java stack I’d forgotten I’d ever used. It’s also a story about Maven, JAXB, and four obscure errors that surfaced between “got the schemas” and “the build is green.”
If you haven’t worked with XML schema in a while (most modern Java developers haven’t), the early sections explain XSD, JAXB, and cxf-xjc-plugin from scratch. If you have — skip down to the four errors; that’s where the war-story material lives.
What I was looking at
A B2B procurement SaaS I worked with ingests invoices in two XML formats:
- UBL (Universal Business Language) — the OASIS standard most of the EU adopted.
- CII (Cross Industry Invoice) — the UN/CEFACT standard, used heavily in France and Germany.
Both are wrapped under the EU’s EN 16931 framework: a common semantic data model that UBL and CII each have a syntactic mapping for. When a customer uploads an invoice, the system tries UBL, falls back to CII, and rejects anything else.
A Polish customer was uploading something that didn’t match either schema.
The first three hypotheses were all wrong:
- “Malformed XML”: no, it parsed cleanly as XML.
- “Customer using an older UBL version”: no, the root element wasn’t Invoice (UBL) or CrossIndustryInvoice (CII). It was Faktura.
- “Some custom Polish extension to UBL”: no, the namespace declared was http://crd.gov.pl/wzor/2025/06/25/13775/. That’s not UBL. That’s a Polish government domain.
Pasting the schema reference into Google sent me to KSeF (Krajowy System e-Faktur — Poland’s National e-Invoicing System) and to a GitHub repository owned by the Polish Ministry of Finance hosting the XSDs.
This wasn’t a bug. It was an entire missing standard.
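The check that exposed the problem, reading the root element and its namespace, can be sketched in a few lines of JDK-only Java. This is an illustration rather than the codebase's actual detection logic: the class name RootElementSniffer and the DetectedFormat enum are hypothetical, the UBL and CII namespaces are the published ones, and the KSeF namespace is the one quoted above.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper: classifies an invoice by its root element only. */
final class RootElementSniffer {

    enum DetectedFormat { UBL, CII, KSEF_FA3, UNKNOWN }

    private static final QName UBL_ROOT =
        new QName("urn:oasis:names:specification:ubl:schema:xsd:Invoice-2", "Invoice");
    private static final QName CII_ROOT =
        new QName("urn:un:unece:uncefact:data:standard:CrossIndustryInvoice:100", "CrossIndustryInvoice");
    private static final QName KSEF_ROOT =
        new QName("http://crd.gov.pl/wzor/2025/06/25/13775/", "Faktura");

    static DetectedFormat detect(String xml) {
        XMLInputFactory factory = XMLInputFactory.newFactory();
        // Uploaded files are untrusted input: switch off DTDs and external entities.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        XMLStreamReader reader = null;
        try {
            reader = factory.createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                // Skip the prolog (comments, processing instructions) until the root element.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    QName root = reader.getName(); // compares namespace + local name, not prefix
                    if (UBL_ROOT.equals(root)) return DetectedFormat.UBL;
                    if (CII_ROOT.equals(root)) return DetectedFormat.CII;
                    if (KSEF_ROOT.equals(root)) return DetectedFormat.KSEF_FA3;
                    return DetectedFormat.UNKNOWN;
                }
            }
            return DetectedFormat.UNKNOWN;
        } catch (XMLStreamException e) {
            return DetectedFormat.UNKNOWN; // not well-formed XML
        } finally {
            if (reader != null) {
                try { reader.close(); } catch (XMLStreamException ignored) { /* nothing to do */ }
            }
        }
    }
}
```

Because the streaming reader stops at the first start tag, this stays cheap even for large uploads. It only identifies the format; it says nothing about whether the document is valid against that format's schema.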
KSeF and FA(3) in three paragraphs
Most EU countries adopt UBL or CII directly for B2B e-invoicing. A few, including Italy (FatturaPA), Poland (KSeF/FA), and one or two others, chose to define their own national format. (France’s Factur-X is a different kind of special case: a hybrid PDF with CII XML embedded in it.) The reasoning is regulatory: each country’s tax authority has fields it cares about (Polish KSeF-specific identifiers, VAT scheme codes, tax point dates) that don’t fit cleanly into UBL or CII without custom extensions, and the tax authorities preferred a clean schema they fully control.
KSeF is Poland’s centralised e-invoicing platform. FA(3) is its current invoice schema version (schemat_FA(3)_v1-0E.xsd). KSeF has been available on a voluntary basis since 2022; the B2B mandate, originally planned for 2024, was postponed and now phases in during 2026. So anyone operating a procurement system that touches Polish vendors is, right now, on the hook to support FA(3) or watch their customers’ uploads quietly fail.
Three sibling formats exist alongside FA(3) in the same Polish government repo: FA-RR (for flat-rate-VAT farmer invoices), PEF (an older public-sector e-invoicing format that predates KSeF), and the FA(2) version that’s being deprecated. They’re all real but rare. The right call here is YAGNI: ship FA(3), structure the code so adding KsefRrProcessor later is one extra case in an enum, and skip the rest until a customer actually needs them.
So the work in front of me: add a third format alongside UBL and CII, with all the schema-validation and XML-to-Java-object plumbing the other two already have.
A primer on the parts I’d forgotten
Most of what follows you can skip if you build SOAP services or work in heavily-regulated XML domains (banking, healthcare, government). For everyone else — i.e. me, and the 80% of Java developers who’ve been writing REST APIs with Jackson for the last decade — here’s what’s involved.
What’s an XSD?
An XSD (XML Schema Definition) is a formal grammar that says what a valid XML document for some format looks like — which elements are required, in what order, with what data types, with what enumerated values. It’s itself an XML document.
If you have an XSD and an XML file, a validator can tell you “matches” or “doesn’t match” in milliseconds. Without an XSD, you’d hand-write a parser that walks the DOM and hopes the producer didn’t change anything. With one, you get free machine-checked validation.
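To make “matches or doesn’t match” concrete, here is a minimal sketch using the javax.xml.validation API that ships with the JDK. The toy schema and the XsdValidationSketch class are invented for illustration and are far smaller than FA(3); only the SchemaFactory and Validator calls are the real API.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class XsdValidationSketch {

    // Toy schema (an assumption for this example): three required children, in order,
    // one restricted to an enumeration and one typed as a decimal.
    static final String TOY_XSD = String.join("\n",
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>",
        "  <xs:element name='Invoice'>",
        "    <xs:complexType>",
        "      <xs:sequence>",
        "        <xs:element name='Number' type='xs:string'/>",
        "        <xs:element name='Currency'>",
        "          <xs:simpleType>",
        "            <xs:restriction base='xs:string'>",
        "              <xs:enumeration value='PLN'/>",
        "              <xs:enumeration value='EUR'/>",
        "            </xs:restriction>",
        "          </xs:simpleType>",
        "        </xs:element>",
        "        <xs:element name='Total' type='xs:decimal'/>",
        "      </xs:sequence>",
        "    </xs:complexType>",
        "  </xs:element>",
        "</xs:schema>");

    /** Returns true if the XML document conforms to the schema, false if it does not. */
    static boolean isValid(String xsd, String xml) {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema;
        try {
            schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            // A broken schema is a programming error, not an invalid document.
            throw new IllegalArgumentException("The schema itself is broken", e);
        }
        try {
            // With no ErrorHandler set, the first validation error is thrown as a SAXException.
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Wrong element order, an unknown enumeration value such as USD, and a non-numeric Total are all rejected without any hand-written checks. In a real service the compiled Schema object is thread-safe and worth caching; a Validator is not and should be created per document.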
KSeF’s FA(3) schema is a single ~2,000-line XSD file that imports a handful of base XSDs from a bazowe/ folder (“base” in Polish — common type definitions like country codes, address structures, VAT enumerations). The Polish government publishes them all on GitHub under an open licence so anyone integrating with KSeF can vendor them.
What’s JAXB?
JAXB (Java Architecture for XML Binding) is a Java standard for converting XML ↔ Java objects automatically. Given an XSD, JAXB can generate Java classes that match the XML structure exactly. Your application code then sees faktura.getFa().getKodWaluty() instead of poking through Element and NodeList like it’s 2008.
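For contrast, this is roughly what the hand-written route looks like with the JDK’s DOM API. It is a sketch under assumptions: the element names Fa and KodWaluty are inferred from the getter names above, the sample document is a drastically simplified fragment rather than a valid FA(3) invoice, and HandWrittenDomLookup is a hypothetical class.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class HandWrittenDomLookup {

    // Namespace quoted earlier in this post.
    static final String KSEF_NS = "http://crd.gov.pl/wzor/2025/06/25/13775/";

    /** Returns the text of Fa/KodWaluty, or null if either element is missing. */
    static String currencyCode(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            // Untrusted input: refuse DOCTYPE declarations outright.
            dbf.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = dbf.newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            NodeList faNodes = doc.getDocumentElement().getElementsByTagNameNS(KSEF_NS, "Fa");
            if (faNodes.getLength() == 0) {
                return null;
            }
            NodeList codes = ((Element) faNodes.item(0)).getElementsByTagNameNS(KSEF_NS, "KodWaluty");
            return codes.getLength() == 0 ? null : codes.item(0).getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read the currency code", e);
        }
    }
}
```

Every field needs its own version of this: a namespace constant, a string-typed element name, a null check, and a cast. None of it is checked by the compiler, which is the case for generating classes from the XSD instead.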
Two directions:
- Unmarshal — XML bytes → Java object tree
- Marshal — Java object tree → XML bytes
JAXB shipped inside the JDK from Java 6 through Java 10. It was deprecated in Java 9 and removed in Java 11. From 11 onward, you have to pull it in as an explicit Maven dependency. It also went through a rename: the package moved from javax.xml.bind.* to jakarta.xml.bind.* after Oracle handed Java EE to the Eclipse Foundation, with the new namespace arriving in Jakarta EE 9 (JAXB 3.0) in 2020. That rename will surface as one of the four errors later in this post.
What’s cxf-xjc-plugin?
The JAXB classes don’t generate themselves — there’s a code generator, xjc, that reads an XSD and emits the Java source files. You can run it from the command line, but in any real project you want it bound to the Maven build, so the classes regenerate automatically when the XSD changes.
cxf-xjc-plugin is a Maven plugin (from the Apache CXF project) that wraps xjc and runs it during the build. You point it at an XSD, it spits Java source into target/generated-sources/xsd/, and you continue. Other plugins exist (maven-jaxb2-plugin, jaxb2-maven-plugin) with slightly different XML configuration syntax; the codebase used cxf-xjc-plugin, so that’s the one I had to learn.
How they fit together
End-to-end:
KSeF FA(3) XSD ← vendored from the Polish gov GitHub
↓
cxf-xjc-plugin (at build) ← reads the XSD, generates Java
↓
JAXB-annotated .java files ← in target/generated-sources/xsd/
↓
Compiled into the module's JAR
↓
KsefProcessor unmarshals ← uses the generated classes at runtime
That’s the whole flow. Almost every error I hit later in this post involves one of those arrows not connecting properly.
The architectural pattern already in the codebase
Before writing any new code, I went and read how UBL and CII were already plumbed in. The codebase has a separate Maven module called generated-xrechnung whose entire purpose is to:
- Hold the UBL + CII XSDs as src/main/resources files (vendored, checked into git).
- Configure cxf-xjc-plugin to read those XSDs and generate Java classes during the build.
- Package the generated classes into a JAR.
- Publish that JAR to GitHub Packages (Maven coordinates: com.b2bsaas:generated-xrechnung:1.0.1).
The clever bit lives in the parent pom and is easy to miss:
<!-- in the root pom.xml -->
<modules>
    <module>document-processing</module>
    <module>integration-api/integration-gmail</module>
    ...
    <!-- generated-xrechnung is DELIBERATELY NOT in this list -->
</modules>
The generated-xrechnung module is not registered in the parent <modules> block. So when a developer runs mvn clean install from the repo root, it skips that module entirely. The generated JAR isn’t built from source — it’s downloaded from GitHub Packages, just like any other third-party dependency.
The pom for generated-xrechnung itself spells out why, in a comment block:
This module is not registered in the parent pom.xml intentionally to force the download of the pre-generated artifact from the repository. If you need to regenerate the code, first bump the version and then run:
mvn -f generated-xrechnung/pom.xml install deploy.
The trade-off this makes: xjc is slow. Running it on every developer’s machine on every build means N developers × M builds per day × ~30 seconds of code generation that produces the same output 99% of the time. So the team treats the generated artifact as a release deliverable, not a build output: regenerate only when the XSD changes, bump the version, publish once, and every other build just pulls the JAR.
That’s the pattern I had to mirror for KSeF. New module — generated-ksef. Same publish-once-consume-many shape. Deliberately not in the parent <modules> list.
I’d never seen this pattern before. Most of my projects had been small enough to run code generation on every build, or used Spring Boot’s auto-configuration where this kind of thing is hidden. The “deliberately unregistered Maven module published as its own artifact” trick is genuinely useful for any slow code-gen — JAXB, Protobuf, gRPC, OpenAPI — and almost nobody writes about it. Worth filing.
A short detour through Maven phases
Before the new module’s pom would work, I had to actually understand what cxf-xjc-plugin was binding to in the build. My mental model of Maven was “it compiles things,” and that’s about as deep as it had ever needed to be.
Here’s the thing the docs don’t put on the first page: Maven’s default lifecycle is a fixed sequence of named phases, always walked in the same strict order. You normally don’t run plugins directly. You run up to a phase, and Maven walks every phase between validate and that target, executing whatever plugin goals each module has bound to each one.
The phases that actually mattered for this module, in order:
| Phase | What plugins bound to it produce |
|---|---|
| validate | (nothing here) |
| generate-sources | .java text files, generated from non-src/main/java sources |
| process-sources | (rarely used) |
| process-resources | copies src/main/resources/ → target/classes/ |
| compile | .java → .class bytecode in target/classes/ |
| test-compile | test sources compiled |
| test | unit tests run |
| package | target/classes/ bundled into the JAR |
| install | JAR copied into ~/.m2/repository/ (your local Maven cache) |
| deploy | JAR uploaded to the remote registry (GitHub Packages here) |
The insight I’d been missing my entire career: generate-sources runs before compile. I had vaguely assumed that compilation came first and code generation happened “somewhere afterward, maybe to expose annotations,” because that’s how IDE-driven Java feels. It’s the other way around. generate-sources produces text-form .java files; compile then turns every .java file it can find (in src/main/java and in any source directories registered earlier) into .class bytecode.
That ordering is the only thing that makes cxf-xjc-plugin work at all. If xjc ran after compilation, the generated .java files would never get turned into .class files in time to be packaged into the JAR.
There’s one more gotcha. By default, the Maven compiler only looks at src/main/java. It doesn’t know that cxf-xjc-plugin has just dumped 56 fresh .java files into target/generated-sources/xsd/. So you need a second plugin — build-helper-maven-plugin — whose only job in this module is to call add-source during generate-sources, registering that target folder as an additional source root for the upcoming compile phase:
<plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>build-helper-maven-plugin</artifactId>
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <goals><goal>add-source</goal></goals>
            <configuration>
                <sources>
                    <source>${project.build.directory}/generated-sources/xsd</source>
                </sources>
            </configuration>
        </execution>
    </executions>
</plugin>
cxf-xjc-plugin writes Java files into target/. Maven by default ignores target/. build-helper-maven-plugin add-source is the glue that makes Maven also look in target/generated-sources/xsd when compiling. Without it, you’d get a successful XJC step, then a compile step that processes zero files, then a JAR with zero classes inside. Useless.
The mental model that fixed Maven for me:
target/ is a layered cake being assembled phase by phase. First the generated-sources layer (.java text). Then the resources layer (copies of src/main/resources/). Then the compiled-classes layer (.class bytecode). Then the whole thing zipped into a JAR at the root. Each phase produces one slice; the next phase reads what came before.
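The “run up to a phase” rule is small enough to model directly. The sketch below is a toy, not Maven’s API: LifecycleSketch and its Phase enum are hypothetical names, and the enum lists only the phases from the table above, in the same order.

```java
import java.util.ArrayList;
import java.util.List;

final class LifecycleSketch {

    // Subset of the default lifecycle, declared in execution order.
    enum Phase {
        VALIDATE, GENERATE_SOURCES, PROCESS_SOURCES, PROCESS_RESOURCES,
        COMPILE, TEST_COMPILE, TEST, PACKAGE, INSTALL, DEPLOY
    }

    /** Every phase that runs when you ask for {@code target}, in order. */
    static List<Phase> phasesRunFor(Phase target) {
        List<Phase> run = new ArrayList<>();
        for (Phase phase : Phase.values()) {
            run.add(phase);          // each earlier phase always runs first
            if (phase == target) {
                break;               // nothing after the requested phase runs
            }
        }
        return run;
    }
}
```

Asking for COMPILE yields a list that already contains GENERATE_SOURCES and stops before PACKAGE, which is exactly why generated code is available to the compiler and why mvn compile never produces a JAR.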
Mirroring the pattern for KSeF
With that understood, scaffolding the new module was straightforward in shape:
integration-api/generated-ksef/
├── pom.xml ← mirrors generated-xrechnung/pom.xml
└── src/main/resources/
    └── ksef-xsd/
        └── fa/
            ├── schemat_FA(3)_v1-0E.xsd
            ├── bazowe/
            │   ├── StrukturyDanych_v10-0E.xsd
            │   ├── ElementarneTypyDanych_v10-0E.xsd
            │   └── KodyKrajow_v10-0E.xsd
            └── catalog.xml ← coming next
Three touchpoints in the parent pom, mirroring the xrechnung pattern:
- New property: <ksef.version>1.0.0</ksef.version>
- Managed dependency: <dependency><groupId>com.b2bsaas</groupId><artifactId>generated-ksef</artifactId><version>${ksef.version}</version></dependency>
- The module deliberately not added to <modules>
Plus a <dependency> on generated-ksef in the one downstream module (document-processing) where the new KsefProcessor was going to live.
That was the shape. Compiling it was where every assumption I made turned out to be a surprise.
Error 1 — The schema imports from an absolute URL
First mvn install from inside the new module’s pom:
[ERROR] (no protocol: bazowe/StrukturyDanych_v10-0E.xsd)
[ERROR] Failed to read schema document
'http://crd.gov.pl/xml/schematy/dziedzinowe/mf/2025/06/25/eD/...'
I opened the FA(3) schema and there it was, near the top:
<xs:import schemaLocation="http://crd.gov.pl/xml/schematy/dziedzinowe/mf/2025/06/25/eD/StrukturyDanych_v10-0E.xsd"
namespace="..."/>
The schema imports its base types via an absolute URL — http://crd.gov.pl/.... The Polish government publishes the schemas at that public URL. The expectation is that XML tooling either resolves them from the network or has a way to redirect the URL to a local copy.
cxf-xjc-plugin, by default, tries to fetch the URL. Inside a corporate network with restricted egress (which is the right default for any CI runner), the fetch fails. Even if it succeeded, depending on the public URL of a government service every time the build runs is exactly the wrong amount of fragility.
The fix is something called an OASIS XML Catalog: an XML file that tells XML tooling “when you see this URL, look at this local file instead.” The OASIS specification dates from the early 2000s, and the catalog idea goes back further still to SGML. I’d never used one.
The catalog file:
<?xml version="1.0" encoding="UTF-8"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog" prefer="system">
    <system systemId="http://crd.gov.pl/xml/schematy/dziedzinowe/mf/2025/06/25/eD/StrukturyDanych_v10-0E.xsd"
            uri="bazowe/StrukturyDanych_v10-0E.xsd"/>
    <uri name="http://crd.gov.pl/xml/schematy/dziedzinowe/mf/2025/06/25/eD/StrukturyDanych_v10-0E.xsd"
         uri="bazowe/StrukturyDanych_v10-0E.xsd"/>
</catalog>
Two entries pointing at the same target, because XML tooling has two different lookup mechanisms. <system> fires when the schema reference comes from an <xsd:import schemaLocation="..."> — which is the case here. <uri> fires when something resolves via a URI lookup (less common for XSD; more common for XSLT). Including both is defensive — the <system> is the load-bearing one for XSDs; the <uri> is harmless redundancy that protects against any tool in the chain that uses the other mechanism.
The catalog file goes next to the XSDs (src/main/resources/ksef-xsd/fa/catalog.xml), so the relative uri="bazowe/..." resolves correctly. And then it’s wired into cxf-xjc-plugin’s configuration per-XSD:
<xsdOption>
    <xsd>${basedir}/src/main/resources/ksef-xsd/fa/schemat_FA(3)_v1-0E.xsd</xsd>
    <catalog>${basedir}/src/main/resources/ksef-xsd/fa/catalog.xml</catalog>
</xsdOption>
Practical note: if you ever consume third-party XSDs from government or standards bodies — KSeF, FatturaPA, ISO 20022, FedRAMP — and the schema imports from absolute URLs, you’ll need an XML catalog. The pattern is universal; the syntax is the only thing that ever needs Googling.
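The same catalog format can be read from plain Java through the javax.xml.catalog API (in the JDK since Java 9), which is a handy way to check what a catalog actually resolves to before blaming the build plugin. The sketch below is an assumption-laden illustration: CatalogLookupSketch is a hypothetical class, and it writes a throwaway one-entry catalog to a temp directory instead of reading the module’s real catalog.xml.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.Catalog;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;

final class CatalogLookupSketch {

    // The absolute URL the FA(3) schema imports, as quoted in this post.
    static final String REMOTE_XSD =
        "http://crd.gov.pl/xml/schematy/dziedzinowe/mf/2025/06/25/eD/StrukturyDanych_v10-0E.xsd";

    /** Returns the local URI the catalog maps systemId to, or null if there is no entry. */
    static String resolveLocally(Path catalogFile, String systemId) {
        Catalog catalog = CatalogManager.catalog(CatalogFeatures.defaults(), catalogFile.toUri());
        return catalog.matchSystem(systemId);
    }

    /** Writes a demo catalog with a single <system> entry and returns its path. */
    static Path writeDemoCatalog() {
        String xml = String.join("\n",
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>",
            "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\" prefer=\"system\">",
            "  <system systemId=\"" + REMOTE_XSD + "\"",
            "          uri=\"bazowe/StrukturyDanych_v10-0E.xsd\"/>",
            "</catalog>");
        try {
            Path dir = Files.createTempDirectory("ksef-catalog-demo");
            Path file = dir.resolve("catalog.xml");
            Files.write(file, xml.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The relative uri attribute is resolved against the location of the catalog file itself, so the returned value is an absolute URI ending in bazowe/StrukturyDanych_v10-0E.xsd. That is why the catalog has to sit next to the XSDs. A lookup for a URL with no entry returns null.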
That fixed the first error. The next one had been hiding behind it.
Error 2 — A 5,000-node parser limit
Current configuration of the parser doesn't allow the expansion of
a content model for a complex type to contain more than 5,000 nodes.
This one was new to me. Java’s built-in XML parser (Xerces, shipped with the JDK) has a default safety cap: it refuses to expand a single complex type’s content model beyond 5,000 nodes. It’s an anti-DoS measure: schemas with very large maxOccurs values can blow up memory or CPU when the parser builds their content models, and the limit is there to fail fast.
UBL and CII never hit this limit. They’re large, but they’re not Polish-tax-law-large. KSeF FA(3) is genuinely heftier: it covers every VAT scheme variant, every Adnotacje (notation) field, every line-level tax code, every payment instrument type. The schema’s effective node count for some of its complex types tips over 5,000.
The fix is to bump jdk.xml.maxOccurLimit — a JVM system property that controls the limit. Bumping it to 50,000 is comfortably more than any reasonable schema actually needs:
-Djdk.xml.maxOccurLimit=50000
But there’s a wrinkle that took me an embarrassingly long time to spot. MAVEN_OPTS="-Djdk.xml.maxOccurLimit=50000" didn’t work. Setting it in ~/.mavenrc didn’t work. The error kept appearing.
The reason: cxf-xjc-plugin forks a separate JVM process to run xjc. It does this because XJC has historically had classpath isolation requirements, and forking a clean JVM is the safe way to satisfy them. The forked process doesn’t inherit system properties set on Maven’s own JVM.
cxf-xjc-plugin exposes <additionalJvmArgs> exactly for this case:
<configuration>
    <sourceRoot>${project.build.directory}/generated-sources/xsd</sourceRoot>
    <additionalJvmArgs>-Djdk.xml.maxOccurLimit=50000</additionalJvmArgs>
    <xsdOptions>...</xsdOptions>
</configuration>
That one line is the fix. With it in place, the next mvn install got past schema parsing and started generating Java files — 56 of them, with package pl.gov.crd.wzor._2025._06._25._13775 (the Polish gov namespace transliterated). Faktura. TNaglowek. TKodWaluty. The schema was being read.
Two lessons stacked here: (1) when your fix doesn’t take effect, check whether the tooling forks a JVM you don’t control. (2) language-level XML defaults in the JDK have been hardened over the years — for security reasons that are real — and any schema bigger than mainstream business data will surface them.
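As an aside, the odd-looking package name xjc produced is mechanical: it is derived from the schema’s target namespace. The sketch below is a deliberately simplified version of that mapping, not the full algorithm from the JAXB specification; it ignores cases such as file extensions, hyphens, and Java keywords, and NamespaceToPackageSketch is a hypothetical class.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class NamespaceToPackageSketch {

    static String toPackage(String namespaceUri) {
        // 1. Drop the scheme, e.g. "http://".
        String rest = namespaceUri.replaceFirst("^[a-zA-Z][a-zA-Z0-9+.-]*://", "");

        // 2. Split into host + path segments, skipping empties from trailing slashes.
        List<String> segments = new ArrayList<>();
        for (String segment : rest.split("/")) {
            if (!segment.isEmpty()) {
                segments.add(segment);
            }
        }
        if (segments.isEmpty()) {
            return "";
        }

        // 3. Reverse the host labels: crd.gov.pl -> pl.gov.crd.
        List<String> parts = new ArrayList<>(Arrays.asList(segments.get(0).split("\\.")));
        Collections.reverse(parts);
        parts.addAll(segments.subList(1, segments.size()));

        // 4. Lower-case everything; a Java identifier cannot start with a digit,
        //    so such segments get a leading underscore.
        List<String> identifiers = new ArrayList<>();
        for (String part : parts) {
            String id = part.toLowerCase();
            identifiers.add(Character.isDigit(id.charAt(0)) ? "_" + id : id);
        }
        return String.join(".", identifiers);
    }
}
```

Feeding it the KSeF namespace http://crd.gov.pl/wzor/2025/06/25/13775/ gives pl.gov.crd.wzor._2025._06._25._13775, the package seen in the generated sources. The practical consequence: if the ministry publishes a schema under a new dated namespace, every generated class moves to a new package.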
Error 3 — package jakarta.xml.bind.annotation does not exist
XJC succeeded. Then mvn compile produced 100+ errors, all variations of:
cannot find symbol: class XmlAccessorType
location: package jakarta.xml.bind.annotation
Every generated .java file was annotated with @XmlAccessorType(XmlAccessType.FIELD), @XmlType, @XmlRootElement, @XmlEnumValue. The compiler had no idea what those classes were.
The story behind this one is worth knowing because it’ll bite anyone using JAXB on Java 11 or newer for the first time. Two things changed:
- Java 9 deprecated JAXB and Java 11 removed it from the JDK. Before Java 11, javax.xml.bind.* shipped with the JDK for free. From Java 11 on, you have to declare it as a Maven dependency or your code won’t compile.
- The package was renamed javax.* → jakarta.* after the Oracle → Eclipse Foundation handover of Java EE (the new namespace landed with Jakarta EE 9 in 2020). So the modern artifact lives under jakarta.xml.bind, and all the annotations have moved to jakarta.xml.bind.annotation.*.
cxf-xjc-plugin 4.x generates Jakarta-namespaced code (you can see this in the error — “package jakarta.xml.bind.annotation does not exist”). The matching Maven dependency is:
<dependency>
    <groupId>jakarta.xml.bind</groupId>
    <artifactId>jakarta.xml.bind-api</artifactId>
    <version>4.0.2</version>
</dependency>
One thing to know about this dependency: it’s the API, not the runtime. JAXB is split into two artifacts on purpose:
| Artifact | What it provides | When you need it |
|---|---|---|
| jakarta.xml.bind:jakarta.xml.bind-api | Annotations (@XmlElement, @XmlType, …) and interfaces (Marshaller, Unmarshaller, JAXBContext) | Compile time |
| org.glassfish.jaxb:jaxb-runtime | The actual implementation that parses/writes XML | Run time |
Think of it as a power outlet specification vs. the generator that produces electricity. The data-classes-only module (our new generated-ksef) only needs the specification — its classes are annotated; nothing in there calls JAXBContext.newInstance() or unmarshal(). The downstream module that uses the generated classes (document-processing, where KsefProcessor.unmarshal() will live) pulls in the runtime.
That separation matters because pulling in the runtime everywhere adds startup time and dependency surface area you don’t need. Only the consumer module that actually calls JAXB needs it.
Added the API dependency. Compile succeeded. 56 generated classes, all green.
Error 4 — The build cache silently ate package
The compile step succeeded. The next phase up — install — failed:
[INFO] BUILD SUCCESS
[WARNING] Could not find generated-ksef-1.0.0.jar to install
Looking at the full log: no jar: step had run. The Maven lifecycle had jumped straight from compile to install, skipping package (which is the phase that actually builds the JAR file). The classes were sitting in target/classes/ but nothing had bundled them into an artifact.
Buried earlier in the log:
[INFO] Attempting to restore project com.b2bsaas:generated-ksef from build cache
[WARNING] s3://b2bsaas-mvn-cache/cache - Connection refused
[INFO] Remote cache is incomplete or missing, trying local build
[INFO] Saved Build to local file: ~/.m2/build-cache/v1.1/com.b2bsaas/generated-ksef/.../buildinfo.xml
The codebase uses a Maven build cache extension — an S3-backed shared cache that lets developers and CI runners skip work they’ve already done. When the cache detects that the module’s source inputs haven’t changed since the last successful build, it short-circuits the lifecycle: “I’ve seen this exact input combination before, here’s the output.” On CI with the cache server reachable, this saves hours per week across the team.
Locally, without VPN credentials for the cache S3 bucket, the extension can’t reach the remote cache and falls back to a local-only cache. In that fallback path it had recorded a previous run in which the build failed at package (one of the earlier errors), yet the cache metadata entry survived, telling Maven “this module has been built, here’s where the artifact is.” The artifact didn’t exist; only the metadata pointing at it did.
The result: Maven would happily skip package because the cache said “already done,” then install would fail trying to copy a JAR that was never produced.
The fix had two steps:
# 1. wipe the stale local cache entry for this module
rm -rf ~/.m2/build-cache/v1.1/com.b2bsaas/generated-ksef
# 2. rebuild with the cache extension disabled
mvn -f integration-api/generated-ksef/pom.xml clean install \
-Dmaven.build.cache.enabled=false
With the cache disabled, the full lifecycle ran end-to-end, package produced the JAR, and install copied it into ~/.m2/repository/.
Maven build cache extensions are excellent for CI throughput and brittle on a developer laptop without network access to the cache backend. When you see BUILD SUCCESS but no artifact, suspect the cache before suspecting your pom, and disable it with -Dmaven.build.cache.enabled=false to confirm.
What I’d tell another engineer
The work I was about to do, “add a new XML format to a system that already handles UBL and CII”, turned out to be straightforward to describe and full of layered surprises in the doing. Seven distinct things had to line up before the new module could even produce a JAR:
- The codebase’s existing convention for vendoring XSDs and code-generating Java was non-obvious (unregistered Maven module, published as its own artifact).
- JAXB and the cxf-xjc-plugin tooling needed re-learning if you, like me, last touched it in Java 8.
- Maven’s generate-sources → compile → package ordering had to be internalised, and build-helper-maven-plugin understood as the glue that registers generated source directories.
- The schema imported from an absolute URL, which is normal for government-published XSDs and requires an XML catalog to redirect.
- The schema’s complexity tripped a JDK-default parser limit, and the fix had to be passed to cxf-xjc-plugin’s forked JVM via additionalJvmArgs.
- The generated classes referenced the Jakarta-namespaced JAXB API, which post-Java-11 needs an explicit Maven dependency.
- The local build cache had recorded a stale half-build and needed clearing.
Seven is a lot of layers for “add support for one more XML standard.” But there’s a pattern in all seven: each layer is one piece of default behaviour from one tool, and each one is silent until the moment it isn’t. The job, when integrating any third-party schema in 2026, is to expect a half-dozen of those and budget time for them.
What’s next
The build chain is now green. generated-ksef-1.0.0.jar is built, published, and document-processing can pull it. Faktura and friends are real Java classes I can instantiate.
What hasn’t happened yet:
- The runtime integration of KSeF into the upload flow: wiring KsefProcessor into the XMLStandard enum and the ProcessorFactory that picks the right one per file.
- Loading the FA(3) schema at runtime for validation (needs another catalog dance, because SchemaFactory has its own opinions about absolute URL imports).
- Mapping the XJC-generated Polish field names (getFa().getKodWaluty(), getPodmiot1(), getFa().getWiersz()) onto the generic InvoiceRepresentation the system uses for everything.
- Whatever surprises production has waiting once a real customer’s FA(3) file hits the unmarshaller.
That’s the next post — the runtime integration war story, written after the feature has actually shipped and we’ve seen what’s hard about it for real users. The build chain is the foundations. The runtime is where the lessons that matter for production live.
For now: if you’re staring at “customer can’t upload their invoice” and the file doesn’t match any schema you know, before reaching for “validation bug” — check whether you’re looking at an entire parallel standard you’ve never heard of. Especially if the namespace points at a gov.<country> domain.



