

Much simpler than using a hex editor:

1. Rename the .odt to .zip and open it with a zip archive manager. You want
to look in content.xml.

2. Save the file as an .fodt (it's the same document in plain text) and view
that with a plain text editor. This works great unless you have a lot of
images in it. Textified images are too much of a good thing.
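
For anyone who prefers a script to an archive manager, option 1 can be sketched in Python. This is only an illustration, not something from the thread: the function name read_content_xml is made up here, and it relies on the fact that an .odt file is an ordinary ZIP archive, so no renaming is needed.

```python
import zipfile


def read_content_xml(odt_path):
    """Return the text of content.xml stored inside an .odt file.

    read_content_xml is a hypothetical helper name used for illustration.
    """
    # An .odt file is a ZIP archive, so zipfile can open it directly.
    with zipfile.ZipFile(odt_path) as archive:
        return archive.read("content.xml").decode("utf-8")
```

Calling read_content_xml on your own document and printing the first few hundred characters is usually enough to see the <text:p> elements.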


On Tue, Jul 21, 2020 at 12:18 AM John Kaufmann <kaufmann@nb.net> wrote:

Hi Brian,

Thanks for introducing a fundamental concept that my brain had not yet
grasped. As for the details:

On 2020-07-20 18:49, Brian Barker wrote:
At 16:38 20/07/2020 -0400, John Kaufmann wrote:
Documents archived in Project Gutenberg are typically simple text, with
each line ending in <CR><LF> (Hex:0D0A), so that paragraphs are separated
by an empty line <CR><LF><CR><LF>. I thought it would be simple to convert
one such (5657.txt) to format in Writer, ...
... but stumbled on elementary problems in Find-&-Replace [Ctrl-H]
using regular expressions:
(1) "\n" is not found. Should not "\n" match one of the codes in
<CR><LF>? [If not, what code(s) should "\n" match?]

First, once you have your text in a word processor, you do not have <CR>
or <LF> or <CR><LF> or anything else like that in your text; instead you
have *paragraph breaks*. There is no character there, despite the pilcrow
that you can get Writer to display. And what you are calling "empty lines"
are actually empty paragraphs. "\n" in the "Search for" field matches line
breaks, not paragraph breaks. (And line breaks are line breaks - also no
"codes".)

You sent me to do something I should have done before asking the question:
examine a hex dump of an ODT content.xml file.  I see what you mean about
"no codes": A paragraph is just a text string between XML bounds <text:p
...> and </text:p>, and a line break (inside the paragraph bounds) is just
<text:line-break/>. Sorry; I should have done that sooner, rather than rely
on a different paradigm that once worked for me.
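
To make the "no codes" point concrete, here is a small Python sketch that lists the paragraphs in a fragment of ODF-style XML. The SAMPLE string is invented for illustration and is far simpler than a real content.xml; the function name paragraphs is likewise an assumption. A line break inside a paragraph is shown as a newline purely for display.

```python
import xml.etree.ElementTree as ET

# Namespace that the "text:" prefix stands for in OpenDocument files.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def paragraphs(content_xml):
    """Return each <text:p> as a string; <text:line-break/> becomes a newline."""
    root = ET.fromstring(content_xml)
    result = []
    for p in root.iter(f"{{{TEXT_NS}}}p"):
        parts = [p.text or ""]
        for child in p:
            if child.tag == f"{{{TEXT_NS}}}line-break":
                parts.append("\n")
            else:
                # Simplification: text inside spans etc. is flattened.
                parts.append("".join(child.itertext()))
            parts.append(child.tail or "")
        result.append("".join(parts))
    return result


# Made-up sample: two paragraphs with an empty paragraph between them.
SAMPLE = (
    f'<doc xmlns:text="{TEXT_NS}">'
    "<text:p>first line<text:line-break/>second line</text:p>"
    "<text:p/>"
    "<text:p>next paragraph</text:p>"
    "</doc>"
)

print(paragraphs(SAMPLE))
# ['first line\nsecond line', '', 'next paragraph']
```

The empty string in the middle is the "empty paragraph" Brian describes: an element with nothing in it, not a pair of control characters.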


(2) Although "$" is found (matches to <CR><LF>), ...

No, "$" does not match anything; instead, it anchors the expression
before it to the end of a paragraph. So an expression ending with "$" will
match text only if it comes at the end of its paragraph.

... "$$" (for successive occurrences of <CR><LF>) is not found. Why?

"$$" makes no sense. If anything it means "this pattern needs to match
something that is *really, really* at the end of a paragraph"!
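
The idea that "$" is an anchor rather than a character can be tried out in Python. This is an analogy only: Python's re module is not the regex engine Writer uses, and each paragraph is modelled here as a separate string.

```python
import re

paragraph = "the end"

# "$" matches a position, not a character, so the matched text is empty.
m = re.search(r"$", paragraph)
print(m.span(), repr(m.group()))  # (7, 7) ''

# "end$" matches only when "end" is the last thing in the paragraph.
at_end = bool(re.search(r"end$", "the end"))
not_at_end = bool(re.search(r"end$", "the end is near"))
print(at_end, not_at_end)  # True False
```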

:-)!  Given what I learned after your answer on (1), this (2) is obvious.


(3) Doing Find "$" & Replace with " " (single space), <CR><LF> is
replaced by " " (single space). However, doing Find "$" & Replace with "@"
(single @char), <CR><LF> is replaced by "@@" (double @char). Why?

I don't think that's true. In any case, there are no <CR><LF>s present.

You're right. Examining again, I note that the "@@" was a remnant of a
prior attempt to replace the <CR><LF><CR><LF> of the original text file
with "@@" (prior to replacing all other <CR><LF> with " "). A faulty mental
model led to a bad approach and a misreading of the result.


To achieve what you want:

First combine single-line paragraphs:
o Apply Default paragraph style to all the text.
o Select all the text.
o Apply AutoCorrect.
(You may need to adjust the minimum length of such paragraphs in
AutoCorrect Options - possibly to 0%.)
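
A different route, not the AutoCorrect method described above, is to join the hard-wrapped lines before the text ever reaches Writer. The Python sketch below assumes the layout John describes (lines ending in <CR><LF>, paragraphs separated by an empty line); the name unwrap and the SAMPLE text are made up for illustration. It would also flatten verse, tables or anything else where line ends matter.

```python
import re


def unwrap(text):
    """Join hard-wrapped lines so that each paragraph becomes one string."""
    # Normalise <CR><LF> line endings to a single newline.
    text = text.replace("\r\n", "\n")
    # One or more empty lines separate paragraphs.
    blocks = re.split(r"\n{2,}", text.strip("\n"))
    # Within a paragraph, replace each line end with a single space.
    return [" ".join(line.strip() for line in block.split("\n"))
            for block in blocks]


# Made-up sample in the layout described above.
SAMPLE = "First line of\r\nparagraph one.\r\n\r\nParagraph two\r\nends here.\r\n"

print(unwrap(SAMPLE))
# ['First line of paragraph one.', 'Paragraph two ends here.']
```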

Thank you! - that is not obvious (and I had not found it in the Help or
the Writer Guide), but was worth the price of having this problem arise.
AutoCorrect had the effect of changing most non-empty "Default Style"
paragraphs to "Text Body" style, with the rest chosen [by spacing hints?]
to be "Hanging Indent", "Heading", "Heading 1", "List", "List 1",
"Numbering 2" or "Text Body Indent". (Empty paragraphs remained "Default
Style".)  That was a MUCH more elaborate and sophisticated AutoCorrect than
I ever would have imagined.

Now I understand the point of AutoCorrect Option "Combine single line
paragraphs if length greater than 50%". But how do you "adjust the minimum
length of such paragraphs in AutoCorrect Options - possibly to 0%"? (The
fact that I don't find the setting suggests that I may have also missed
something basic in your explanation.)

Note: Even after having this excellent explanation on a use of
AutoCorrect, I went back to the Writer Guide and still don't find it. So it
may take quite some time to learn enough of such subtleties to feel that I
have a solid grasp.


Then remove empty paragraphs:
o Search for "^$" (no quotes) and replace with nothing.
("^" anchors your pattern to the start of a paragraph and "$" to the
end. So "^$" matches a paragraph with nothing in it.)
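
The same "^$" reasoning can be checked in Python, again only as an analogy to Writer's Find & Replace, with each paragraph modelled as a separate string and the sample list invented for illustration.

```python
import re

# Made-up paragraphs; the empty strings stand for empty paragraphs.
paragraphs = ["Chapter 1", "", "It was a dark night.", "", ""]

# "^" anchors to the start and "$" to the end, so "^$" needs both
# anchors at the same position: a paragraph with nothing in it.
EMPTY = re.compile(r"^$")

kept = [p for p in paragraphs if not EMPTY.search(p)]
print(kept)  # ['Chapter 1', 'It was a dark night.']
```

Note that a paragraph containing only spaces is not empty in this sense and would be kept.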

Again I like your pedagogical approach, matching the action with the
reasoning. You should be a teacher.


I trust this helps.

Helps?! It was an education. (If you could just answer the follow-up
question on AutoCorrect Options...)

Thanks again,
John

--
To unsubscribe e-mail to: users+unsubscribe@global.libreoffice.org
Problems?
https://www.libreoffice.org/get-help/mailing-lists/how-to-unsubscribe/
Posting guidelines + more: https://wiki.documentfoundation.org/Netiquette
List archive: https://listarchives.libreoffice.org/global/users/
Privacy Policy: https://www.documentfoundation.org/privacy


