UPM Facultad de Informática researchers compile information on security and privacy for authors or readers of PDF documents, the most popular format for publication of digital documents.
This work by researchers from the Universidad Politécnica de Madrid’s Facultad de Informática surveys security and privacy threats related to digital document publishing. It addresses publisher-related information that is leaked once the document is sent over the Internet, as well as reader-related information that might be disclosed every time a reader opens a downloaded document. The work focuses mainly on PDF, which is the most popular format for digital document publishing.
Publication of digital documents over the Internet poses serious security and privacy threats to both authors and readers. Previous research by the UPM Facultad de Informática’s Distributed Systems Laboratory researchers addressed information leakage in popular Microsoft Office document formats. This research focuses on the PDF document format, which is the de facto standard for digital document exchange. Many institutions worldwide have adopted PDF as their document standard, and it has been estimated that billions of PDF documents are published or downloaded every day. The results of this research were published in the Journal of Systems and Software.
Published documents could include additional author-related data, such as user name, document location on the author’s machine and even parts of the documents that were deleted before publication.
Some of this information, such as the user name or the date the document was last edited, is referred to as metadata and is used by reader or editor applications to improve the user experience; however, it could lead to privacy breaches, mainly because authors are not aware that it is disclosed when the document is published. Other sensitive information is leaked because of the poor design of the document format. For example, whenever a paragraph of a document is deleted, PDF authoring applications do not remove the paragraph but rather mark it as “invisible.” This way, the reader application does not display the deleted text when the document is opened for reading. Hence, deleted data is kept along with the document and can be read by any malicious user who knows where to look for it. UPM researchers have developed several tools to extract information from PDF documents that is not accessible with standard document readers.
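To make the metadata risk concrete, the following is a minimal Python sketch of how author-related fields could be read straight from the bytes of a PDF file. It is not one of the UPM tools. The function name find_metadata is hypothetical, and the sketch assumes the metadata is stored as uncompressed literal strings under the standard document information keys (such as /Author or /ModDate); metadata held in compressed streams, hex strings or XMP packets would be missed.

```python
import re

# Standard document-information keys defined by the PDF specification.
INFO_KEYS = ("Title", "Author", "Subject", "Keywords",
             "Creator", "Producer", "CreationDate", "ModDate")

# Matches "/Key (literal string)". Simplification (assumption): nested
# parentheses, escape sequences and hex strings are not handled.
_INFO_PATTERN = re.compile(
    rb"/(" + "|".join(INFO_KEYS).encode("ascii") + rb")\s*\(([^)]*)\)"
)


def find_metadata(pdf_bytes: bytes) -> dict[str, list[str]]:
    """Return every value found for each information key.

    All occurrences are kept, in file order, so a key that appears
    more than once in the file yields more than one value.
    """
    found: dict[str, list[str]] = {}
    for match in _INFO_PATTERN.finditer(pdf_bytes):
        key = match.group(1).decode("ascii")
        value = match.group(2).decode("latin-1")
        found.setdefault(key, []).append(value)
    return found


if __name__ == "__main__":
    import sys

    # Usage: python find_metadata.py document.pdf
    with open(sys.argv[1], "rb") as handle:
        for key, values in find_metadata(handle.read()).items():
            print(key, values)
```

An author could run a check of this kind on a file before publishing it to see which fields a curious reader would be able to recover.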
Avoiding information leakage
There are many well-known incidents in which document publication has revealed much more information than the publishers intended to communicate. For example, the Coalition Provisional Authority in Iraq published a PDF document on the “Sgrena-Calipari Incident” in May 2005. Black boxes were used to conceal the names of some of the people involved in the incident, but the names were all easily revealed by copying the text from the original document into a text editor. After the media reported on documents published on the Web that contained sensitive information not meant to be made public, several companies and institutions distributed guidelines to avoid information leakage in published documents.
From the reader’s point of view, opening a downloaded PDF document could expose sensitive information like the IP address of the user’s machine, the user name and potentially any other information that is stored on the machine used to open the document. This is due to the interactive features of PDF applications. Several actions, like connecting to a website or reading data from a disk, can be automatically triggered every time a PDF is opened for reading. Ideally, the user should be warned of the risks of the action being taken and asked for confirmation. This research has highlighted that in many settings, especially when opening PDF documents within an Internet browser, triggered actions are performed without user notification or agreement. In their work, the UPM researchers elaborate on how it would be possible to retrieve and abuse information about each user that downloads and reads a PDF document.
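As an illustration of the reader-side risk, the sketch below shows one way a downloaded file could be screened for the interactive features described above before it is opened. It is a rough heuristic in Python, not the method used by the UPM researchers. The function name find_auto_actions is hypothetical; the names it searches for (/OpenAction, /AA, /JavaScript, /JS, /Launch, /URI, /SubmitForm) are the ones the PDF specification uses for automatic triggers and action types. The sketch assumes these names appear uncompressed and unescaped in the file, so a clean result does not prove the document is harmless.

```python
import re
from collections import Counter

# PDF names associated with automatically triggered or outbound actions.
ACTION_NAMES = ("OpenAction", "AA", "JavaScript", "JS",
                "Launch", "URI", "SubmitForm")

# The negative lookahead stops "/AA" from matching inside a longer
# name such as "/AAPL". Assumption: names are not hidden inside
# compressed object streams or written with "#xx" escapes.
_ACTION_PATTERN = re.compile(
    rb"/(" + "|".join(ACTION_NAMES).encode("ascii") + rb")(?![A-Za-z0-9])"
)


def find_auto_actions(pdf_bytes: bytes) -> dict[str, int]:
    """Count occurrences of each action-related name in the raw file."""
    counts = Counter(
        match.group(1).decode("ascii")
        for match in _ACTION_PATTERN.finditer(pdf_bytes)
    )
    return dict(counts)


if __name__ == "__main__":
    import sys

    # Usage: python find_auto_actions.py document.pdf
    with open(sys.argv[1], "rb") as handle:
        hits = find_auto_actions(handle.read())
    print(hits if hits else "No action-related names found in plain text.")
```

A file that reports /OpenAction together with /URI or /SubmitForm is a candidate for the kind of silent network connection the researchers describe, and is better opened in a viewer with those features disabled.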
Finally, the UPM researchers believe that the PDF document format is a powerful document exchange medium. The main goal of their work is to make users aware of the risks that they face every time they publish a document on the Internet and to provide effective guidelines to minimize the leakage of sensitive information.