[help] Are there tools for document manipulation that can provide an approximate size of components (text included)?

Programming@programming.dev – 11 points – 10 months ago

Long story short, I want to build a system that reorders some components in a document file (be it a docx or odt, I don't have a hard constraint atm).

So my input would be a document file. I need to approximate the number of pages it takes up, and I also need the height of individual components (like a single paragraph or a table), so that I have the data to rearrange them and make the document use fewer pages.

I don't have a hard constraint on the programming language of the tool either (Python preferred), but I'd prefer not to embed LibreOffice into my system.

Also I'm willing to hear other solutions (maybe my input is not the optimal thing I can use for this problem).

Thanks in advance!

A docx is just a renamed zip archive with the XML data. You should be able to unzip it and use a parser to access that info directly. There are likely tools to do this for any relevant language. You can also find the official spec online with some more info.
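As a rough illustration of the unzip-and-parse idea, here is a minimal sketch using only the Python standard library. The function name `docx_paragraphs` is made up for this example, and it assumes the main text lives in `word/document.xml` as body-level `w:p` elements. Note that it only recovers the text of each paragraph; the XML does not store rendered heights, so this is just a starting point for any size estimate.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each body-level paragraph in a .docx file.

    `source` is a path or a binary file-like object.
    (Hypothetical helper, not part of any library.)
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    body = root.find(f"{{{W_NS}}}body")
    paragraphs = []
    for para in body.findall(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; w:t holds the characters
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

Paragraphs nested inside tables are not visited by this sketch, since it only looks at direct children of the body.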

Unfortunately, I can't get into much more detail than that as my company actively develops similar tools and I've worked on their document renderers not too long ago.

No clue on the odt stuff. I worked on the MS fidelity part.

I would look into a library that does manipulation of odt (or docx). Code whatever algorithm you need to do the restructuring. Now you're left with an in-memory representation of the document, from which you can hopefully figure out how many pages it spans, or which you can save to a temporary file.

All depends really on how feature rich the odt libraries are and/or how deep you want to dive into the spec.

I feel like this is an XY problem. Is there an underlying issue you're trying to resolve?

Yeah, my main issue is figuring out how many pages it spans. I've looked at some docx and odt libs, and none seemed to have an API for getting the number of pages or the height of a component (except for stuff with fixed heights like images...).

The underlying issue is that I want to create an exam paper that uses as few sheets of paper as possible per exam, so I guess I should at least be able to get the height of each question and rearrange the questions (using an algorithm) so that they use fewer sheets.
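Once per-question heights are known (however they are obtained), the rearranging step looks like bin packing. Below is a minimal sketch using the first-fit decreasing heuristic; it is not guaranteed to find the optimum. The name `pack_questions` is made up, and the sketch assumes that a question never splits across pages and that each height already includes any spacing around the question.

```python
def pack_questions(heights, page_height):
    """Assign questions to pages with first-fit decreasing.

    `heights` maps a question id to its estimated height, in the same
    unit as `page_height`. Returns a list of pages, each a list of ids.
    (Hypothetical helper for illustration.)
    """
    pages = []      # question ids per page
    remaining = []  # free height left on each page

    # Place the tallest questions first; they are the hardest to fit
    for qid, h in sorted(heights.items(), key=lambda kv: kv[1], reverse=True):
        if h > page_height:
            raise ValueError(f"question {qid!r} does not fit on one page")
        for i, free in enumerate(remaining):
            if h <= free:
                pages[i].append(qid)
                remaining[i] -= h
                break
        else:
            # No existing page has room, so open a new one
            pages.append([qid])
            remaining.append(page_height - h)
    return pages
```

For example, with heights `{"q1": 6, "q2": 5, "q3": 4, "q4": 5}` and a page height of 10, keeping the original order needs three pages, while this packing fits everything on two.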

Is using something like typst to generate your exams an option? There'd be a learning curve but it's full of utilities to format and arrange content and whatnot so it feels like it could be a cleaner way of achieving what you want. Plus, it'd make iterating easier and give you more consistency over time going forward

Not really no, I need something that I can embed into my application, rather than 3rd party software, my application must work offline too :/

How about generating LaTeX source code, compiling it and getting the page count of the generated PDF? Reorder your set of questions and see if the result is better or worse. Optionally do it in a smart way to reduce the number of PDF compilations you have to do. (Simulated annealing comes to mind, for example.)

I think it would be easier to find a library that locates the last line on a PDF page than to parse unzipped odt files and basically write a layout engine that does the same as LibreOffice just to get the number of pages.

Maybe you can even get TeX to put it in the log during compilation. That would be the most convenient option and seems reasonable to achieve.
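On the log idea: pdfTeX-style engines normally end a successful run with a line like `Output written on exam.pdf (3 pages, 48213 bytes).`, so the page count can be read back without opening the PDF. A minimal sketch, assuming that line format; the function name `pages_from_log` is made up, and very long output paths may be wrapped across log lines, which this does not handle.

```python
import re

# Matches the summary line a pdfTeX-style engine writes on success, e.g.
#   Output written on exam.pdf (3 pages, 48213 bytes).
_PAGES_RE = re.compile(r"Output written on .*?\((\d+) pages?\b")


def pages_from_log(log_text):
    """Return the page count reported in a LaTeX log, or None if absent."""
    match = _PAGES_RE.search(log_text)
    return int(match.group(1)) if match else None
```

In practice you would compile with something like `pdflatex -interaction=nonstopmode exam.tex` through `subprocess.run`, then pass the contents of `exam.log` (or the captured stdout) to this function.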

Use Google Apps Script to open the document in Google Docs, read the number of pages that Google Docs renders, close the document, then delete the document (optional).

I need to automate the process to use it during an algorithm, this is far from practical.

My suggestion was to automate the process with Google Apps Script as part of your algorithm. You've not given a lot of details about what you actually want to do, but for what you did give, Google Apps Script would let you automate the task.

This is very different from docx or odt, but maybe it's worth looking into converting Markdown or LaTeX to PDF with something like pandoc. Maybe that, or some other more open and less complex format, might help with this?

My requirements on the format itself are not that high, at best I need to be able to add images and tables, I can reason with any format that will work with that, maybe convert it later if I need to.

Markdown supports images and tables. It may depend on the renderer though. The GitHub flavour of Markdown supports this, for example, and I expect LaTeX supports it too. If there are no existing tools to get the height of elements, you can probably build that yourself fairly easily if you know the specific font and styling the renderer uses. You'd just have to parse the file, which is basically plain text, and run the same calculations the renderer would. An approximation might be fine for that, depending on the use case.
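To give an idea of what such an approximation could look like, here is a very rough sketch for a single plain-text paragraph. Everything in it is an assumption rather than something a real renderer guarantees: the function name, the average character width of 0.5 em, the line spacing factor of 1.2, and wrapping on word boundaries with no hyphenation. Real output will differ, especially with proportional fonts.

```python
import textwrap


def estimate_paragraph_height(text, line_width_pt, font_size_pt,
                              avg_char_width_em=0.5, line_spacing=1.2):
    """Roughly estimate the rendered height of a paragraph, in points.

    The default width and spacing factors are guesses, not measured values.
    """
    # Approximate how many characters fit on one line
    avg_char_width_pt = avg_char_width_em * font_size_pt
    chars_per_line = max(1, int(line_width_pt // avg_char_width_pt))

    # Wrap on word boundaries, as a renderer without hyphenation would
    lines = textwrap.wrap(text, width=chars_per_line) or [""]
    return len(lines) * font_size_pt * line_spacing
```

Calibrating the two factors against a handful of real renders from the target engine would likely matter more than anything else in this estimate.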

Yeah that's what I'm searching for atm :/

Ultimately, no, not really, these formats are built to be "render-agnostic", and there's really no way to pre-calculate aspects of what the render will be without actually running it through the rendering engine. Which is, in theory, doable, without having to send the render output to an actual screen or printer, but the followup problem is that all renderers are not created equal. I.E. an engine for rendering a docx that you grab from NuGet or somewhere else is not guaranteed to produce the same output as what Microsoft Word will, not exactly.

If you need accuracy in predicting the rendered-size of various things, you really need to be running the documents through the same renderer that will be used to actually print/draw the documents for the user. If this is Microsoft Office, you can look into Office Interop protocols, which will let you make programmatic calls into the actual Office programs installed on the system, from your program. There ought to be a way to kick off rendering from there.