PDF to Word, only with basic formatting?

incifinci · October 18, 2021, 7:52pm

I often need PDF >>> DOCX conversion. What I want is to keep only the basic formatting: bold, italic, font size, font name, font color, and paragraph alignment (plus embedded images, preferably without conversion). I do not want many hundreds of styles such as "Normal + Times New Roman, 12 pt, Italic, Custom Color (RGB (35; 31; 32)), sparse placement, indented 1.237...". Unfortunately, this is what Adobe Acrobat and the online converters produce.
There are many smart and experienced people here. Does anybody know a solution?

ohrenkino · October 18, 2021, 8:09pm

I am not really sure where you draw the line between basic formatting and other options.
IMHO, embedding pictures and changing font names, sizes, expression and alignment are characteristic of full-grown text formatters.
If you do not like the style names - then ignore them.
If you find that you need all the special formatting, then I doubt that there will be any other option than to select all the paragraphs that should look the same and assign the same style name.
The other option would be to select all the text and press Ctrl-Space to reset all paragraphs to the initial formatting of the style definition.

incifinci · October 18, 2021, 8:29pm

Ohrenkino, thank you for your quick answer.

Because I need 10-20 styles, I usually do something similar; unfortunately, it takes a lot of time.

Unfortunately, this removes not only paragraph formatting, but font formatting too -- bold, italic, and so on.

ohrenkino · October 18, 2021, 8:34pm

Exactly. This describes what I would call "local formatting" which is usually a variation of the underlying paragraph definition.
As soon as one starts writing texts like this, they become more or less unmaintainable and lead to a lot of time-consuming work, just as you found out.
So you can die only one death: either you reduce the whole thing to a less rich format, which means less work, or you keep all the specialities and spend the effort.

incifinci · October 18, 2021, 8:36pm

But while I live, I hope.

There are options. For example: convert PDF >>> HTML with Adobe; then, in EditPad (or PowerGREP), remove all the unwanted formatting with RegEx, creating a macro for this task; then convert HTML >>> DOCX in MS Word. (Or create a macro in MS Word itself, but its language is more cumbersome, more complicated, and less reliable.) So far I have been lazy about it; I was hoping and waiting for a ready-made solution.
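For anyone who prefers a script to an EditPad macro, the RegEx step could look roughly like the sketch below. It is only an illustration, not what I actually run: it assumes the exported HTML puts its formatting in double-quoted inline style="..." attributes with no semicolons inside the values, and the KEEP whitelist is my guess at the CSS properties that correspond to the basic formatting from the first post. The function names are made up for the example.

```python
import re

# Assumed whitelist: CSS properties that correspond to "basic formatting"
# (bold, italic, font size, font name, font color, paragraph alignment).
KEEP = {"font-weight", "font-style", "font-size", "font-family", "color", "text-align"}

# Matches a double-quoted inline style attribute, including the leading whitespace.
STYLE_ATTR = re.compile(r'\sstyle="([^"]*)"', re.IGNORECASE)


def simplify_style(match):
    """Rebuild one style attribute from the whitelisted declarations only."""
    kept = []
    for declaration in match.group(1).split(";"):
        name, sep, value = declaration.partition(":")
        if sep and name.strip().lower() in KEEP:
            kept.append(f"{name.strip().lower()}: {value.strip()}")
    if not kept:
        return ""  # nothing worth keeping: drop the whole attribute
    return ' style="' + "; ".join(kept) + '"'


def strip_formatting(html):
    """Remove every non-whitelisted property from all inline style attributes."""
    return STYLE_ATTR.sub(simplify_style, html)
```

Real converter output may use CSS classes or a <style> block instead of inline styles, in which case this pattern matches nothing and a different set of replaces is needed.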

incifinci · April 28, 2023, 8:57am

I did it, unfortunately with only partial success. First, I have to go through all the RegEx replaces one by one, because some of them have to be checked match by match. Second, when converting HTML >>> DOC(X), Word (2003) degrades the quality of embedded images/figures, so I have to go through the entire file in Word and replace the images. (This does not take too much time, although of course it depends on the text.)
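The split between replaces that can run blindly and replaces that need a look can be sketched like this. The two rules are hypothetical placeholders, not my real collection, and apply_rules is a name invented for the example: rules marked automatic are applied directly, the others are only reported with some surrounding context so that each match can be checked by hand.

```python
import re

# Hypothetical rules: (pattern, replacement, safe to apply without review).
RULES = [
    (r'\s(?:class|lang)="[^"]*"', "", True),   # drop class/lang attributes
    (r"<span>(.*?)</span>", r"\1", False),     # unwrap bare spans: check nesting first
]


def apply_rules(html, rules=RULES, context=20):
    """Apply the automatic rules; collect matches of the other rules for review."""
    review = []
    for pattern, replacement, automatic in rules:
        if automatic:
            html = re.sub(pattern, replacement, html)
        else:
            for m in re.finditer(pattern, html):
                start = max(m.start() - context, 0)
                review.append((pattern, html[start:m.end() + context]))
    return html, review
```

The rules run in list order, so a review rule sees the text as the earlier automatic rules left it.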

All in all, with the ready-made RegEx collection, I can save maybe 30-40% of the time of the entire PDF >>> DOC(X) conversion.

cathcam · July 6, 2023, 3:09am

I'm late to this subject, but I hope others might find this useful. Personally I used to loathe pdf files. Back circa 2002 I started using the phrase "PDF where information goes to die."

Mostly because of Adobe licensing and upgrade options. I had so many actual licensed copies of Adobe PDF, but none of them would work in the n+1 system that I'd use.

I've almost completely changed my opinion. A few years back, for my current gig as a music researcher, I needed to update some older PDFs for which I no longer had the software or source to change. I licensed a copy of Tracker Software PDF-XChange Editor and later upgraded to Plus.

It can add/insert/delete pages, watermark pages, export to ppt/docx/jpg/png, etc. They have a free trial version, and Pro and support for a downloadable, non-cloud "Plus" version is less than $75.

I wrote this back in 2011... anyone been to Phoenix Sky Harbor airport lately?

Are PDF’s where information goes to die? | Adventures in systems land (wordpress.com)

LyricsLover · July 6, 2023, 6:47am

How do you use this software to copy only the basic formatting and no styles as the OP asked?

cathcam · July 11, 2023, 5:17pm

At least in the PDFs I've exported to DOCX format, selecting "No Embed" and "Retain flowing text" rather than "Retain page layout", and leaving comments, images and links at their default of NO, I seem to get something akin to what the OP asked for. You don't get the images, etc.; I didn't experiment with the options for images.