There’s a lot of help online about making Portable Document Format (PDF) accessible. Even with all the advice out there, I still encounter people who find it difficult to make their documents friendly to people with disabilities.
It seems a lot of people have a hard time translating what’s in a tutorial to the PDF they’re working on. Unlike fixing accessibility errors in HTML, the solution for most PDF errors is usually the same for any PDF. It’s actually difficult to have multiple solutions when fixing accessibility errors in PDFs. Unlike HTML, a PDF that’s only got rudimentary accessibility is usually still too hard to use.
But why? What makes PDF inaccessible in the first place?
Different approaches to the same problem
Both HTML and PDF are documents. Both need some sort of application to read them. But what’s the fundamental difference between an HTML document and a PDF document?
An HTML document is made with its structure in mind (at least it should be). On the other hand, PDFs are made with the layout in mind. PDFs essentially make your computer act like a printer. The output strives to be an exact representation of what the original document looked like.
In HTML, the structure needs to be established in order to achieve the desired output. Typing a bunch of text into a text editor will have different outcomes when making HTML than a PDF. Formatting such as a clump of spaces or many hits of the enter key disappears when an HTML file is rendered. A PDF will look exactly the way it did in your text editor with formatting applied.
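To make the whitespace point concrete, here is a rough Python sketch. It assumes a simplified model of how a browser treats text in normal flow; the function name `collapse_like_html` is made up for illustration, and real browsers follow the CSS `white-space` rules, which are more involved.

```python
import re

def collapse_like_html(source: str) -> str:
    """Approximate HTML rendering of plain text: every run of
    spaces, tabs, and newlines collapses into a single space."""
    return re.sub(r"\s+", " ", source).strip()

# Text as typed in an editor, with extra blank lines and a clump of spaces.
typed = "Title\n\n\n     indented    with   spaces"

print(collapse_like_html(typed))
```

A PDF made from the same text would keep every gap, because it records where each piece of text was drawn rather than the characters that were typed.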
Yet, there’s an important consideration going on with these two files. HTML is displaying text on a screen, while the PDF is displaying a representation of the text on the screen. In reality, what you’re seeing in a PDF is not actually text.
First some history
Remember I said that PDFs make your computer act like a printer? There’s a reason for that.
The PDF file format is derived from the PostScript page description language. A page description language is a programming language that describes the appearance of a printed page.
PostScript changed the way printers work. Before PostScript, it was hard to print text along with images. Vector drawings took expensive machines to print. Glyphs for text had to be physically changed if you wanted to use a different font. Printing raster images required separate people to write programs for various printers.
PostScript converts pictures and text into a series of straight lines and Bézier curves. Vectors let you create a file with a code-based representation for everything that exists on a page. In years past, many printers shipped with fonts, but with PostScript, people could use fonts like those on their computer. While these features were super cool, PostScript came with its share of nuisances.
For example, while it was possible to display PostScript on the screen, it wasn’t very practical. If you wanted to view page 150, pages 1 through 149 had to process first. That took a lot of processing power on computers, which made the process very expensive.
Enter PDF
The Portable Document Format still generates output like PostScript. But it also introduced a structure storage system and advanced compression. And it interpreted the PostScript through tokenization. Tokenization is a way of breaking up something into a series of chunks to classify and label what the code means. Storing tokens that repeat allows software to display page 150 without having to render any of the pages before it.
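The jump-to-page-150 idea can be sketched in Python. This is only an analogy under stated assumptions: a real PDF keeps a cross-reference table of byte offsets for its objects, while the toy file layout and the helper names `build_file` and `read_page` below are invented for illustration.

```python
import io

def build_file(pages):
    """Write each page's instructions to a buffer and remember
    the byte offset where every page starts."""
    buffer = io.BytesIO()
    offsets = []
    for body in pages:
        offsets.append(buffer.tell())
        buffer.write(body.encode("ascii") + b"\n")
    return buffer.getvalue(), offsets

def read_page(data, offsets, number):
    """Jump straight to one page using the offset index,
    without touching any of the pages before it."""
    start = offsets[number - 1]
    end = data.index(b"\n", start)
    return data[start:end].decode("ascii")

pages = [f"page {n} drawing instructions" for n in range(1, 201)]
data, offsets = build_file(pages)
print(read_page(data, offsets, 150))
```

The design point is the index: once the software knows where each piece lives, viewing a late page costs about the same as viewing the first one.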
PDF also offered the ability to embed fonts inside the file format. This meant that end users didn’t need to have a bazillion fonts on their computers. PDF draws text on the page as vector lines and curves that are derived from the embedded fonts.
A PDF is a visual representation of the source document. To display where text resides on a page, it uses the page as the reference for the text’s location. The structure of the page defines its content because it is a set of printing instructions.
On its own, there is no inherent structure assigned to the objects in a PDF. Everything in a PDF exists in relation to the layout of the page. For PDF documents to have structure, there needs to be a means to separate the useless content from the content that matters (After many confusing conversations with me, this has become a dream of my wife’s for years).
There are three basic ways content in a PDF is rendered. The first is how content is presented to the end user. This is the primary purpose of PDFs: displaying electronic print output on your computer screen. The second is how content is organized based on the way the document looks when you’re looking at it on the computer. The last way is optional. Defined tags can provide meaning to those tokens I mentioned earlier.
A closer look behind the scenes
A PDF is basically a sequence of bytes. Most of those bytes can be read as ASCII characters plus white space. When the bytes are grouped together, they’re called tokens. Tokens are the means of describing what you see in a PDF. Objects are data structures built from those tokens, and they come from a small set of basic object types.
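As a minimal sketch of bytes becoming tokens, the Python below splits a tiny page content stream into labeled chunks. It is a deliberate simplification: the pattern ignores nested parentheses, hex strings, arrays, dictionaries, and comments, all of which real PDF software has to handle, and the `tokenize` helper is an illustrative name rather than part of any library.

```python
import re

TOKEN = re.compile(
    r"""
    (?P<string>\((?:\\.|[^\\()])*\))        # literal string such as (Hello)
  | (?P<name>/[^\s/()<>\[\]]+)              # name object such as /F1
  | (?P<number>[+-]?(?:\d+\.?\d*|\.\d+))    # integer or real number
  | (?P<operator>[A-Za-z*]+)                # operator keyword such as Tf or Tj
    """,
    re.VERBOSE,
)

def tokenize(stream: bytes):
    """Return (kind, text) pairs for each token found in the stream."""
    text = stream.decode("latin-1")
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

# Select a font, move to a position on the page, then draw a string.
sample = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"

for kind, value in tokenize(sample):
    print(f"{kind:9} {value}")
```

Nothing in those tokens says whether "Hello" is a heading, a caption, or a table cell. They only say which font to use and where to draw.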
A PDF file consists of one or more pages and the objects in it. The fact that it actually contains a page is very important. All content that came from the application that created the PDF is broken down into tokenized objects. These tokens describe the graphics and text on each page.
Images in PDF documents aren’t added to the PDF file. Images are extracted and represented as data. Objects describe their transparency, values, location, color space, and so on. But the raw image is stored in the PDF file structure so it can be extracted from the PDF.
Like images, text is extracted when the PDF is created. Therefore, text in a PDF isn’t made of actual characters on the page. It is a rendering of lines and Bézier curves that represent the shapes of letters and numbers that you’d see on the page. From an object perspective, there’s no actual correlation between the text on the page and what you see when you look at a PDF. When you see a table, your mind visually interprets it as a table. In a PDF, a table is just a sequence of objects, representing other objects, representing bytes on the page. The content in a PDF needs to be structured if you want to be able to extract text from the PDF.
When you need to interact with multimedia on a page, PDF adds something called annotations to the file. Annotations are objects that are layered over the content and provide interactivity.
Objects that make up text, images, and annotations aren’t connected to each other. Instead, these objects relate to where they are in relation to the page. As objects are independent from one another, they only describe what the page looked like before it became a PDF.
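Here is a small Python sketch of what that independence means for a two-row table. The `Fragment` class, the coordinates, and both helper functions are hypothetical; the only PDF fact relied on is that page coordinates start at the bottom-left, so a larger y value sits higher on the page.

```python
from dataclasses import dataclass

@dataclass
class Fragment:
    x: float    # distance from the left edge of the page
    y: float    # distance from the bottom edge of the page
    text: str

# Fragments in the order the authoring software happened to draw them.
drawn = [
    Fragment(300, 700, "Price"),
    Fragment(72, 680, "Apples"),
    Fragment(72, 700, "Item"),
    Fragment(300, 680, "1.20"),
]

def stream_order(fragments):
    """Read the fragments in the order they appear in the file."""
    return " ".join(f.text for f in fragments)

def visual_guess(fragments):
    """Guess a reading order from position alone: top to bottom, left to right."""
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))
    return " ".join(f.text for f in ordered)

print(stream_order(drawn))
print(visual_guess(drawn))
```

Sorting by position recovers a plausible order here, but it is still a guess. Neither version knows that "Item" and "Price" are column headers or that the four fragments form a table.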
What tags do
PDFs are essentially an organized mess. Letters, numbers, and spaces aren’t automatically connected to themselves or to the graphics they sit next to. Any hyperlinks on the page are only clickable rectangles that float willy-nilly above the content. You can actually move these clickable rectangles separately from the URLs that happen to be underneath them.
For content to make any kind of sense, you must define the relationships between these objects. You do that by giving a PDF what’s called a logical structure. The logical structure ranks groups of objects on the page. Objects receive attributes that further define their meaning. The logical structure lets you organize objects independently from the chaos of the PDF content. Under the hood, logical structure adds operands around the tokens in the content streams.
Operands give special instructions to the tokens to provide undefined semantics. To define tokens, tags must be included along with the logical structure. Tags are a set of instructions for PDF Objects that provide meaning and purpose within the logical structure. Tags can be organized by adding them to the “structure tree,” also known as a “tag tree.” This contains the logical structure of a PDF. The tag tree has a root, where all structure begins from. Inside this root is the first child and subsequent grouped structure elements, represented by tags. That first child generally describes the type of document that’s being structured. The first child of most PDFs will be a <Document> tag explaining that the PDF is, well, a document.
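A tag tree can be pictured as an ordinary nested data structure. The Python sketch below is an illustration, not how a PDF stores it on disk: the `StructElem` class and `outline` helper are made up, and the list items are simplified (a fully tagged list usually wraps each item’s label and body in further tags). The tag names Document, H1, P, L, and LI are standard structure types from ISO 32000.

```python
from dataclasses import dataclass, field

@dataclass
class StructElem:
    tag: str
    text: str = ""
    children: list["StructElem"] = field(default_factory=list)

# A root-level <Document> with a heading, a paragraph, and a list.
tree = StructElem("Document", children=[
    StructElem("H1", "Quarterly report"),
    StructElem("P", "Sales rose in the third quarter."),
    StructElem("L", children=[
        StructElem("LI", "North region"),
        StructElem("LI", "South region"),
    ]),
])

def outline(elem, depth=0):
    """Return one indented line per structure element, depth first."""
    label = f"<{elem.tag}>" + (f" {elem.text}" if elem.text else "")
    lines = ["  " * depth + label]
    for child in elem.children:
        lines.extend(outline(child, depth + 1))
    return lines

print("\n".join(outline(tree)))
```

The tree says nothing about where anything sits on the page. It only records what each piece of content is and how the pieces nest.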
Consider a simplified example: Suppose that you were to add a tag root to the PDF in Acrobat. At this point, there’d be no actual tags associated with the document yet. In the PDF’s source code, all PDF objects now have a bunch of operands, or markers, assigned to them. These markers expose every single object in the PDF, regardless of how useful this information is.
In order to provide a semantic relationship between what’s important and what’s not, we associate tags to the content. This separates real content from the content used to construct the page. For the parts of the document that are not important to understand the meaning of the content, we add a special designation. That designation is called an artifact.
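To show what separating real content from artifacts can look like, here is a hedged Python sketch. In ISO 32000 terms, the markers are marked-content operators: BMC or BDC opens a sequence and EMC closes it. The content stream below is hand-written for illustration, and the regular expression only understands this exact simplified shape; it is not a general PDF parser.

```python
import re

# Hand-written sample: a page footer marked as an artifact,
# followed by a tagged heading and a tagged paragraph.
content_stream = """
/Artifact BMC (Page 3 of 10) Tj EMC
/H1 <</MCID 0>> BDC (Quarterly report) Tj EMC
/P <</MCID 1>> BDC (Sales rose in the third quarter.) Tj EMC
"""

SEQUENCE = re.compile(
    r"/(?P<tag>\w+)\s*(?:<<.*?>>\s*BDC|BMC)\s*\((?P<text>[^()]*)\)\s*Tj\s*EMC"
)

def split_content(stream):
    """Sort marked-content sequences into real content and artifacts."""
    real, artifacts = [], []
    for match in SEQUENCE.finditer(stream):
        target = artifacts if match.group("tag") == "Artifact" else real
        target.append((match.group("tag"), match.group("text")))
    return real, artifacts

real, artifacts = split_content(content_stream)
print("real content:", real)
print("artifacts:", artifacts)
```

Assistive technology can then skip the page footer while still reading the heading and paragraph in order.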
The PDF specification, ISO 32000, defines how certain content types need to be marked up. For example, hyperlinks must be tagged to associate the hyperlink to both the clickable rectangle and to the text or image that needs to be associated to it. The PDF specification explains the requirements about how to use specific tags. But it doesn’t always require you to use specific tags when associating content.
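The hyperlink requirement can be sketched as a simple check in Python. The dictionary layout and the `link_is_complete` helper are hypothetical; what comes from ISO 32000 is the idea that a Link structure element holds both a marked-content reference (MCR) for the visible text and an object reference (OBJR) to the link annotation, which is the clickable rectangle.

```python
# A Link element that ties the visible text to the clickable rectangle.
link_elem = {
    "tag": "Link",
    "kids": [
        {"type": "MCR", "mcid": 4},          # the visible link text on the page
        {"type": "OBJR", "annotation": 12},  # the link annotation (clickable rectangle)
    ],
}

# A Link element that only points at text, so the rectangle is left floating.
orphan = {"tag": "Link", "kids": [{"type": "MCR", "mcid": 7}]}

def link_is_complete(elem):
    """True when the element references both page content and an annotation."""
    kinds = {kid["type"] for kid in elem.get("kids", [])}
    return {"MCR", "OBJR"} <= kinds

print(link_is_complete(link_elem))
print(link_is_complete(orphan))
```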
For example, a PowerPoint presentation that’s exported to a PDF will represent its slides as containers. No matter what layout you use for your slides, each slide within that PDF will contain a first-level heading and a slew of paragraphs, even if the slide itself were to have the appearance of a bulleted list. In other words, nothing requires elements within a PDF to be tagged as the type they look like.
“Tagged PDF” vs Accessible PDF
ISO 32000 doesn’t include a requirement for software to create PDFs as Tagged PDFs. It only requires software that supports Tagged PDFs to adhere to the chapter in ISO 32000 that covers Tagged PDF. Further, making a Tagged PDF does not necessarily render it accessible.
Making PDFs work with assistive technologies (AT) was a bit of an afterthought. Tagged PDF was initially used to export to other formats like HTML or XML. The requirements for Tagged PDF leave much to interpretation. That means that software companies can determine which tags they want to include when exporting to a PDF. Worse, there’s no consensus for which tags matter most to AT. What we’ve ended up with is a bunch of PDFs that could have provided more semantic information to AT. But AT tends to ignore information that it doesn’t recognize because that information conflicts with its own interpretation. It has basically boiled down to some really poor communication between developers and assistive technology companies in the past.
Not a solution, but a step in the right direction
PDF/UA was released to try to resolve this miscommunication. PDF/UA is what we call a companion standard. A companion standard is a standard that’s used alongside other standards. It defines requirements for both the software that creates Tagged PDFs and the AT that reads them. Yet, believe it or not, saving a PDF file as PDF/UA does not automatically make the PDF accessible.
PDF/UA primarily deals with making the content of a PDF programmatically determinable. It places the responsibility for the content’s accessibility on the author. To make a PDF accessible, the document must be able to do a few things:
- The content that matters needs to be separate from the content that doesn’t;
- Content needs to be organized to reflect the visual representation;
- Relevant stuff needs to make enough sense to someone who isn’t looking at it;
- It needs to follow the rules of ISO 32000;
- Something needs to tell assistive technologies what to expect when reading it.
Putting it all together
When you create a PDF, the file format creates a vector reproduction of the file’s visual presentation. To convey the presentation to other technology, the PDF must include some logic to define what’s what. Tagging a PDF does that.
When assistive technology accesses a Tagged PDF, it needs to understand what’s going on in the document. It uses something called a Document Object Model, or DOM, to understand it. In PDF, the DOM is the structure tree.
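As a rough sketch of how assistive technology might consume that tree, the Python below walks a small structure tree depth first and turns each tag into something a screen reader could announce. Everything here is an assumption for illustration: the tuple layout, the `ROLE_NAMES` mapping, and the wording of the announcements do not come from any real screen reader.

```python
# Hypothetical mapping from tags to spoken role names.
ROLE_NAMES = {"H1": "heading level 1", "P": "paragraph", "LI": "list item"}

# Each node is (tag, text) for a leaf or (tag, [children]) for a container.
structure_tree = ("Document", [
    ("H1", "Quarterly report"),
    ("P", "Sales rose in the third quarter."),
    ("L", [("LI", "North region"), ("LI", "South region")]),
])

def announce(node):
    """Walk the tree depth first and return the phrases to speak, in order."""
    tag, body = node
    if isinstance(body, str):
        return [f"{ROLE_NAMES.get(tag, tag)}: {body}"]
    spoken = []
    if tag == "L":
        spoken.append(f"list with {len(body)} items")
    for child in body:
        spoken.extend(announce(child))
    return spoken

for phrase in announce(structure_tree):
    print(phrase)
```

The walk never consults page coordinates. The reading order and the roles both come from the structure tree, which is why the quality of the tags matters so much.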
The authoring software decides the tags that are assigned to the PDF objects. The assignment of tags to objects can be (and usually must be) fine-tuned by an end user. Unfortunately, one person’s interpretation of how to tag a document can be different from someone else’s. We agree to a common interpretation by following standards. Standards represent an agreement on how to structure content.
The PDF/UA specification helps to define a common set of semantics for PDF. The Web Content Accessibility Guidelines provide guidance on how to describe content. Combined, those can go a long way toward helping you use the most semantic approach when you provide structure and meaning to PDFs.
In my next article, I’ll be covering how using PDF/UA and WCAG together will help to provide better accessibility for PDFs.
References
- ISO 32000-1:2008 (PDF)
- Developing with PDF, by Leonard Rosenthol.
Comments
Thanks Jon, will you address the ‘accessibility support’ (or rather the current lack of it on mobile devices) in your next article?
This is a stunning summary, and should be on everyone’s reading list, especially for anyone who works with accessibility. Thank you for this great insight into the nuts and bolts behind the scenes.
Andrew, thanks for the question!
My follow-up to this post will be more about interpreting existing accessibility recommendations in relation to PDF. As far as the current lack of accessibility support on mobile devices, I don’t intend to cover too much in that area. Unfortunately, my accessibility kung fu is not as stellar when dealing with mobile devices as I’d like it to be!
Though based on my limited understanding of iOS (as well as their desktop OS), Apple doesn’t adhere to the Tagged PDF specification when creating a PDF. Apple generates PDF the same way it draws vector graphics. This output is very limited, and doesn’t provide enough of a semantic relationship to the PDF Objects in order to make heads or tails of what they’re supposed to mean. Creating a PDF on a Mac is accomplished by Quartz2D, which seems to create a set of instructions for drawing a page. When describing the details of the drawing, the only available option is to provide an alternate description, but not the drilled down structure of what’s inside that drawing. Consider this example: “Jon walks around in the shape of a box in a room” is less meaningful than “Jon stands 10 feet from two walls, then walks two steps forward, turns right, and repeats three times.”
This would explain why Apple Preview relies so heavily on the content view of the document rather than the structure tree. The content view in PDF is a representation of how content visually appears to a user, but it’s not a great way to parse organization. A PDF on an iOS device is essentially a graphical representation of text. Any relevant structure one applies to a PDF is ignored in iOS, because Apple doesn’t make Tagged PDF, and therefore is not required to meet that part of the specification. In order to make a Tagged PDF, it would require the way it creates a PDF to be dramatically different and probably more complicated.
I suspect that third party applications that can open and view PDFs (like Adobe Reader, for example) are limited by what iOS makes available to developers to make a PDF. Apple doesn’t provide the functionality for third party developers to create a structured PDF on iOS, because it doesn’t create PDF in such a way that would expose the tokenization of the PDF Objects in a meaningful way. Therefore, it appears the onus for accessibility support on iOS devices is on Apple to fix, because it would require them to change the way they allow developers to create PDFs in the first place.
I’m eagerly looking forward to the next “installment.” I believe that WCAG 2.0 requires PDF editors to be able to produce a PDF that is PDF/UA compliant. Assuming that, is there any realistic hope that Adobe will build that into Acrobat Pro anytime soon?
I am constantly frustrated by how poor PDFs created from Microsoft Word or PowerPoint are. They sometimes require extensive PDF editing to make them pass the PDF Accessibility Checker, which checks 136 failure conditions in PDF/UA-1.
I have found, by the way, that using the commercial axesPDF for Word AddIn does let me produce extremely high quality PDFs as long as the source DOCX file is as accessible as possible. Even complex documents such as ones that contain properly constructed complex tables require considerably less editing than if I use the Acrobat “Create PDF” process.
As someone who has to generate PDFs for reports, the code that’s available always fails to do anything accessibility related… one system will even split paragraphs up into individual characters.
To be honest, I’ve kind of given up trying to make the PDFs I generate accessible, and instead provide an HTML alternative.
That said, I know PDFs have lots of other problems as well (try reading an A4 formatted PDF document on a small screen).
So as a side project, I’m trying to get browsers to implement a simple way of bundling a web page as a single (secure/contained) document:
https://github.com/craigfrancis/wdoc
I’ve been calling for people to abandon PDF and move to EPUB for years. I’ve met some resistance from people in the PDF world, but haven’t heard any resistance from people with disabilities.