NEW273435
PDF.js contains binary code
https://bugs.webkit.org/show_bug.cgi?id=273435
Summary PDF.js contains binary code
Michael Catanzaro
Reported 2024-04-29 14:26:00 PDT
We should update to PDF.js v4.2.67. However, this is going to be more complicated than usual because this release contains a binary wasm module for processing JPEG 2000 images: https://github.com/mozilla/pdf.js/pull/17946/files#diff-0c3dc243c4697cac89b08327394744c65a35d3ff8f7e0badcd98f40229c7e2cb. I think that goes too far and we don't want it. JPEG 2000 image format is old and obscure, and I think it's OK to not fully support PDFs that use it. It's bad enough that we ship prebuilt JS files instead of the preferred modifiable source, but at least those are still human-readable and not minified or obfuscated or anything (and we can't do much about it, because the build process depends on a large number of node.js modules that we surely don't want WebKit to depend on). Even though probably nobody will ever inspect the built JS code to see what it's doing, at least it's *possible* to do so. That's no longer true if we ship binary code compiled by somebody else. (Interestingly, it *looks* like a text file rather than a binary, because it's base64 encoded and embedded into pdf.worker.mjs.) Anyway, removing the code that uses the wasm should be easy enough. We can add a patch to sabotage it: static decode(data, ignoreColorSpace) { this.#module ||= OpenJPEG(); const imageData = this.#module.decode(data, ignoreColorSpace); if (!imageData) { throw new JpxError("JPX decode failed"); } return imageData; } We'll also need a script to automatically remove the jpx.js source section from pdf.worker.mjs. (The same script could apply the patch and maybe run some of the other steps that are currently manual.) I thought the above was a good plan, but then I decided to check to be sure there isn't more wasm binary content in pdf.js. Unfortunately there is, it's quickjs-eval.js which is required to implement the PDF.js subsandbox, so removing it would have security implications. :S I'm pretty sure it's unacceptable to have this in WebKit, but we might have to discuss this with upstream before deciding what to do. One possibility is to actually depend on node.js, which would be sad.
Attachments
Michael Catanzaro
Comment 1 2024-04-29 14:28:58 PDT
(To be clear, quickjs-eval.js has been there since the first time we imported PDF.js into WebKit, and I just hadn't noticed.)
Alexey Proskuryakov
Comment 2 2024-04-29 16:47:38 PDT
Cf. bug 271726, where we did re-enable JPEG 2000 for PDFs.
Michael Catanzaro
Comment 3 2024-04-30 07:21:40 PDT
(In reply to Michael Catanzaro from comment #0) > I thought the above was a good plan, but then I decided to check to be sure > there isn't more wasm binary content in pdf.js. Unfortunately there is, it's > quickjs-eval.js which is required to implement the PDF.js subsandbox, so > removing it would have security implications. :S I'm pretty sure it's > unacceptable to have this in WebKit, but we might have to discuss this with > upstream before deciding what to do. One possibility is to actually depend > on node.js, which would be sad. Interestingly, Firefox does not ship this binary code at all. It's in build/pdf.sandbox.mjs, but Firefox has a different file instead, pdf.sandbox.external.sys.mjs. Firefox takes the upstream releases but builds everything itself with gulp, and must be doing something differently than upstream does. So there is a path forward here, though I don't yet understand how exactly. One non-ideal option would be to just switch to the Firefox releases, although we'll want to wait and see how Firefox chooses to handle OpenJPEG. Might be better to investigate how the Firefox build works.
Michael Catanzaro
Comment 4 2024-05-01 14:27:22 PDT
So after talking with one of the PDF.js developers, we won't need to use gulp. With some effort, we can build the binaries as part of the CMake/XCode build, depending only on Emscripten: * https://github.com/mozilla/pdf.js.quickjs/blob/main/compile.sh * https://github.com/mozilla/pdf.js.openjpeg/blob/main/compile.sh Unfortunately, if we depend on Emscripten then the PDF.js support will surely wind up disabled in places we care about, like Fedora (no Emscripten package) and possibly also Epiphany Tech Preview (not sure about adding it to GNOME and WebKit SDKs, and definitely won't want to bundle it in Epiphany). Right now the most likely plan is to (a) explore whether we can run JS in a fresh/isolated JSContext rather than using quick.js, and (b) manually downgrade to the previous version of PDF.js's JPEG 2000 decoder or just sabotage the JPEG 2000 support. PDF.js contains stubs in https://github.com/mozilla/pdf.js/blob/33732ff2cba88e262dcd542b97a237fe8b9bbe35/src/pdf.sandbox.external.js that allow the quick.js sandbox to be replaced by a custom sandbox. Firefox does so in https://searchfox.org/mozilla-central/rev/b41bb321fe4bd7d03926083698ac498ebec0accf/toolkit/components/pdfjs/content/PdfSandbox.sys.mjs. We can use user scripts to register a custom JS function that would only be accessible to PDF.js. PDFDocument.cpp would then run the script, equivalent to Services.scriptloader.loadSubScript. We'd also need to implement exportValueToSandbox, importValueFromSandbox, and createErrorForSandbox. This will probably require PDFDocument to keep a map of JSContexts, since each PDF document could have multiple JS sandboxes. I'm not sure if this would actually work or if I'm underestimating the difficulty, but it sounds doable? Additional context: * The purpose of the sandbox is to allow executing PDF JS, i.e. untrusted JS embedded in PDF documents. * Unfortunately a Firefox developer encountered a PDF with JPEG 2000 image in the wild, and the previous decoder didn't handle it correctly.
Michael Catanzaro
Comment 5 2024-05-01 15:11:57 PDT
Another problem with emscripten, from https://github.com/emscripten-core/emscripten/blob/3745ad8c37f0eab93de7ef4989dc4f6665f3675e/docs/packaging.md """ One important thing to note here is that emscripten doesn't currently support the stable LLVM releases. It requires a build of LLVM from top-of-tree, or at least very close to it. This means that depending on a packaged version of LLVM is unlikely to work. """ But we're not *not* going to use a packaged version of LLVM, so failure seems likely. In particular, the Debian package's README has some concerning caveats: """ Using features requiring newer LLVM may lead to errors like this: * Function addresses with offsets not supported * TLS symbol is undefined, but TLS symbols cannot yet be imported * unknown argument: '-mrelaxed-simd' """ This means bundling our own Emscripten is also a no-go unless we're also willing to add LLVM to Source/ThirdParty, which would be overkill and probably double the time it takes to build WebKit. Let's not. Their packaging documentation is really good though. Too bad!
Elliott Williams
Comment 6 2024-05-06 13:39:11 PDT
PDF.js is not enabled by default in Apple ports (https://bugs.webkit.org/show_bug.cgi?id=242263), but we'd have similar problems with packaging emscripten, as it is not part of the Xcode toolchain. Neither is node.js.
Radar WebKit Bug Importer
Comment 7 2024-05-06 14:26:14 PDT
Michael Catanzaro
Comment 8 2024-06-09 08:46:45 PDT
(In reply to Michael Catanzaro from comment #4) > Right now the most likely plan is to (a) explore whether we can run JS in a > fresh/isolated JSContext rather than using quick.js, and (b) manually > downgrade to the previous version of PDF.js's JPEG 2000 decoder or just > sabotage the JPEG 2000 support. This is still on my to-do list, but I'm not sure when I'll get to it. Pretty sure this is the only reasonable path forward. We should not update PDF.js again unless we're able to resolve this; it's better to ship old PDF.js than to ship a new precompiled blob.
Michael Catanzaro
Comment 9 2024-08-26 13:27:39 PDT
(In reply to Michael Catanzaro from comment #4) > Right now the most likely plan is to (a) explore whether we can run JS in a > fresh/isolated JSContext rather than using quick.js, and (b) manually > downgrade to the previous version of PDF.js's JPEG 2000 decoder or just > sabotage the JPEG 2000 support. This isn't a good plan. Turns out JPEG 2000 is important, see bug #278675.
Michael Catanzaro
Comment 10 2024-08-26 14:16:55 PDT
(In reply to Michael Catanzaro from comment #8) > This is still on my to-do list, but I'm not sure when I'll get to it. I'm afraid I might need to declare to-do bankruptcy again. And while my plan in comment #4 should eliminate the need for quick.js, I fear there's no way around needing OpenJPEG. Not sure what to do about that.
Michael Catanzaro
Comment 11 2024-09-04 07:58:33 PDT
Adrian thinks we can compile OpenJPEG as part of the CMake build by using https://github.com/WebAssembly/wasi-libc rather than using Emscripten. That is probably going to be the path forward here. We could also possibly do that for quick.js, though it might not be much harder to simply remove quick.js (comment #4). I don't plan to work on this myself. In the meantime, we should not update Source/ThirdParty/pdfjs to ensure the preexisting binary code does not change before we're ready to remove it.
Michael Catanzaro
Comment 12 2024-09-04 09:49:22 PDT
Surprise: I checked out the Firefox source code, and it does NOT have this same problem, despite using a newer PDF.js than we do. The problem is still present in upstream PDF.js, so Firefox is doing something differently. * Check out https://archive.mozilla.org/pub/firefox/releases/130.0/source/firefox-130.0.source.tar.xz * Open firefox-130.0.source/firefox-130.0/toolkit/components/pdfjs/pdf.worker.mjs * Search for ";// CONCATENATED MODULE: ./external/openjpeg/openjpeg.js" * Notice lots of normal JavaScript code rather than the wasm binary that's present in the upstream version If we could figure out how Firefox has accomplished this, that might present a path forward.
Calvin Walton
Comment 13 2025-04-22 06:44:06 PDT
In current releases of Firefox (checked in 137.0), Mozilla ships a WASM blob for OpenJPEG, in firefox-137.0/toolkit/components/pdfjs/content/web/wasm/openjpeg.wasm They also ship a technically-javascript file firefox-137.0/toolkit/components/pdfjs/content/web/wasm/openjpeg_nowasm_fallback.js - but this is also a compiled binary blob, not readable source code (it is compiled to asm.js using emscripten). The build of openjpeg used is coming from here: https://github.com/mozilla/pdf.js.openjpeg/
Note You need to log in before you can comment on or make changes to this bug.