Re: [libreoffice-users] A word of warning about PDF text

Cley Faye <cleyfaye -AT- gmail.com>
Fri, 31 Jan 2014 11:00:58 +0100

2014-01-31 Peter West <lists@pbw.id.au>:

A word of warning about text retrieved from PDF documents.

Recovering text blocks from PDFs is inherently risky.  PDF is a page
description format, and so it has no notion of the semantics of the text it
contains. It places bits of text at certain positions on the page. You can
create a whole page of text by taking the individual characters and their
attributes and position on the page, shuffling them, and writing them to
the file.  That will produce a readable file, but try extracting the text
from that file. Unless you have a very, very smart text extractor that
reverse-engineers the process of creating the page, then calculates the
_visual_ order of the text elements, you will end up with gibberish.
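
The shuffling argument above can be illustrated with a small sketch. It
does not read a real PDF and is not how LO or any particular extractor
works; the Glyph class and the layout, naive_extract and visual_extract
functions are hypothetical names invented for this example. It assumes a
fixed-pitch layout with the y axis increasing upward, and it assumes space
characters are stored as glyphs, which real PDFs often do not do.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    char: str
    x: float  # horizontal position, increasing to the right
    y: float  # vertical position, increasing upward


def layout(lines, char_width=6.0, line_height=12.0, top=700.0):
    """Place every character of every line at a fixed-pitch position."""
    glyphs = []
    for row, line in enumerate(lines):
        for col, ch in enumerate(line):
            glyphs.append(Glyph(ch, col * char_width, top - row * line_height))
    return glyphs


def naive_extract(glyphs):
    """Read characters in storage order, ignoring their positions."""
    return "".join(g.char for g in glyphs)


def visual_extract(glyphs, line_tolerance=2.0):
    """Group glyphs into lines by y, then read each line left to right."""
    lines = []
    for g in sorted(glyphs, key=lambda g: -g.y):
        # A glyph joins the current line if its y is close to that line's first glyph.
        if lines and abs(lines[-1][0].y - g.y) <= line_tolerance:
            lines[-1].append(g)
        else:
            lines.append([g])
    return "\n".join(
        "".join(g.char for g in sorted(line, key=lambda g: g.x))
        for line in lines
    )


if __name__ == "__main__":
    glyphs = layout(["Hello world", "second line"])
    random.Random(0).shuffle(glyphs)      # the page still renders the same
    print(repr(naive_extract(glyphs)))    # storage order: scrambled characters
    print(visual_extract(glyphs))         # position order: readable again
```

Even this toy version needs a tolerance to decide what counts as the same
line; columns, rotated text and missing spaces make the real problem much
harder.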

_Most_ PDF text, _most_ of the time, is laid on the page in visual order,
but in even the best-behaved files, you are likely to be surprised.

If you don't _know_ that your PDF text extractor program is completely
visually accurate by design, don't tell your boss that you can easily
extract that PDF text, without allowing time for proof-reading every page.
You will get burned.

I don't know how LO extracts PDF text; perhaps it is very sophisticated. I
have my doubts.

You are right that a PDF is not meant to be opened for modification or
text recovery. However, that is hardly relevant here, as LO is not (as far
as I know...) marketed as a PDF extractor.

While it is possible to open a PDF with Draw, even the simplest file will
show you that it is not meant for full and easy recovery: embedded fonts
are not used, some graphics are off by a few pixels (sometimes more), and
yes, text gets split into an unexpected number of parts, even when the PDF
content is layered correctly in the final file.
For example, you can get a single line of text split into three text
elements, or a single text element with (seemingly) random spaces inserted
in the middle of words. General page layout is also an issue: a very simple
PDF, containing only a single page of text, shows up as two pages in Draw,
with the footer of the first page at the beginning of the second one.
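
To make the "one line split into three text elements" case concrete, here
is a minimal sketch of how such pieces could be stitched back together
after import. It is not what Draw does; the Fragment class, the
merge_fragments function and both thresholds are assumptions made up for
this example, and real documents would need thresholds scaled to the font
size.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float      # left edge of the fragment
    y: float      # baseline, increasing upward
    width: float  # rendered width of the fragment


def merge_fragments(fragments, baseline_tolerance=1.0, space_gap=2.0):
    """Join fragments that share a baseline into one string per line.

    A space is inserted only where the horizontal gap between two
    neighbouring fragments is larger than space_gap.
    """
    rows = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        if rows and abs(rows[-1][0].y - frag.y) <= baseline_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])

    result = []
    for row in rows:
        row.sort(key=lambda f: f.x)
        text = row[0].text
        for prev, cur in zip(row, row[1:]):
            gap = cur.x - (prev.x + prev.width)
            text += (" " if gap > space_gap else "") + cur.text
        result.append(text)
    return result


if __name__ == "__main__":
    pieces = [
        Fragment("raction is ", 42, 100, 66),
        Fragment("PDF ext", 0, 100, 42),
        Fragment("risky", 108, 100, 30),
        Fragment("line", 40, 88, 24),
        Fragment("Second", 0, 88, 36),
    ]
    for line in merge_fragments(pieces):
        print(line)
```

The gap threshold is the fragile part: set it too low and you get the
random spaces inside words described above, set it too high and separate
words run together.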

But I do not think any of this matters as long as users know that opening
a PDF is at most useful for recovering some select elements. Unless the
documentation states otherwise, that is fine, as it works very well for
this specific use. Opening a PDF in Draw does just this: it shows the
various elements present in the PDF.


