
Why Copying Text from a PDF Breaks Formatting

You have done it before: open a PDF, select text, copy it, paste it into a document or email, and the result is a mess. Line breaks land in the wrong places, headings flatten into body text, bullet lists turn into run-on paragraphs, and bold and italic vanish. This is not a bug in your clipboard. It is a direct consequence of how PDF stores text.

How PDF stores text internally

A PDF file is not a document in the way a Word file or HTML page is. It is a set of drawing instructions. Each piece of text is positioned on the page using exact x/y coordinates, like placing stickers on a canvas. Unless the file carries optional structure tags, which many PDFs lack, it does not record that three lines of text form a paragraph, or that a larger font means a heading. It only knows “draw these characters at this position in this font.”

When you copy text from a PDF, your viewer tries to reconstruct reading order from these coordinates. It walks through the positioned text fragments and guesses which ones form lines and paragraphs. This guessing process is where formatting breaks.
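A rough sketch of that guessing step, assuming each fragment is reduced to an x position, a baseline y position measured from the top of the page, and its text. The names `Fragment` and `group_into_lines` and the `y_tolerance` value are hypothetical, chosen for illustration rather than taken from any particular viewer.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # left edge of the fragment
    y: float   # baseline, measured from the top of the page
    text: str


def group_into_lines(fragments, y_tolerance=2.0):
    """Cluster fragments with nearly equal baselines, then order each line by x."""
    lines = []
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        # Same line if the baseline is close to the first fragment of the current line.
        if lines and abs(frag.y - lines[-1][0].y) <= y_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


fragments = [
    Fragment(72, 100.0, "Quarterly"),
    Fragment(140, 100.4, "Report"),
    Fragment(72, 130.0, "Revenue grew"),
    Fragment(160, 130.2, "in every region."),
]
print(group_into_lines(fragments))
# ['Quarterly Report', 'Revenue grew in every region.']
```

Note that "Quarterly Report" comes back as an ordinary string. Nothing in the coordinates alone says it was a heading.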

What breaks when you copy-paste

Headings become body text

PDF headings are just text drawn in a larger font. Plain-text copy carries no font size, so a heading and a paragraph arrive as the same flat text.

Line breaks appear mid-sentence

PDF text does not reflow. Line breaks are fixed when the file is created, and each visual line is typically stored as a separate text fragment. When pasted, these fragments often get hard line breaks between them, splitting sentences awkwardly.
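If you are stuck with pasted text, the hard breaks can often be undone with a small heuristic. The sketch below assumes blank lines mark paragraph boundaries and that a trailing hyphen followed by a lowercase letter is a word split across the wrap. `unwrap_lines` is a hypothetical helper, not part of any library.

```python
def unwrap_lines(lines):
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs, current = [], ""
    for line in lines:
        line = line.strip()
        if not line:
            # Blank line: close the paragraph collected so far.
            if current:
                paragraphs.append(current)
                current = ""
        elif current.endswith("-") and line[:1].islower():
            # Likely a word hyphenated across the wrap: drop the hyphen and join.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return paragraphs


pasted = [
    "The converter reads font infor-",
    "mation for every span and uses",
    "it to detect headings.",
    "",
    "Tables are handled separately.",
]
for paragraph in unwrap_lines(pasted):
    print(paragraph)
```

The hyphen rule is a guess: it will also join genuine compounds such as "well-" / "known" into "wellknown", so the output still needs a read-through.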

Bullet lists lose structure

Bullet characters in PDFs are often special glyphs or symbols positioned near text. Copy-paste may drop the bullets entirely or turn a structured list into a single block of text.

Bold and italic disappear

Bold and italic in PDFs are achieved by using different font variants (e.g., “Helvetica-Bold”). Viewers typically put only plain text on the clipboard, so all emphasis is stripped.

Tables collapse

Table cells in a PDF are individually positioned text fragments. Copy-paste often jumbles the reading order, interleaving columns or dropping cell boundaries entirely.

Multi-column text interleaves

If a PDF has two columns, copy-paste may read left-to-right across both columns per line, mixing text from column A and column B into a single unreadable stream.
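The difference between the two reading orders can be shown with a toy page. This sketch assumes fragments are `(x, y, text)` tuples and that the two columns split exactly at the middle of the page, which is a simplification: real layouts need the gutter to be detected from the gaps between fragments. Both function names are hypothetical.

```python
def naive_order(fragments):
    """Sort top-to-bottom, then left-to-right, ignoring columns."""
    return [text for x, y, text in sorted(fragments, key=lambda f: (f[1], f[0]))]


def column_aware_order(fragments, page_width):
    """Read the left half of the page top-to-bottom, then the right half."""
    middle = page_width / 2
    left = [f for f in fragments if f[0] < middle]
    right = [f for f in fragments if f[0] >= middle]
    ordered = sorted(left, key=lambda f: f[1]) + sorted(right, key=lambda f: f[1])
    return [text for x, y, text in ordered]


# Two columns (A on the left, B on the right), two lines each.
fragments = [
    (72, 100, "A1"), (320, 100, "B1"),
    (72, 115, "A2"), (320, 115, "B2"),
]
print(naive_order(fragments))              # ['A1', 'B1', 'A2', 'B2']
print(column_aware_order(fragments, 612))  # ['A1', 'A2', 'B1', 'B2']
```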

Why a proper converter does better

A PDF-to-Markdown converter like ours does not rely on the clipboard. It reads the raw PDF structure using a library like PyMuPDF, which gives access to:

  • Font sizes for each text span, allowing heading detection by comparing sizes to the page median.
  • Text flags that mark bold and italic variants, which are preserved as ** and * in the Markdown output.
  • Block positions that define paragraphs, lists, and table cells, allowing accurate reconstruction of document structure.
  • Bounding boxes for detecting gaps between words and paragraphs, producing proper spacing in the output.
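As an illustration of the first two points, here is a minimal sketch of heading and emphasis detection. It is not the converter's actual code. The span dictionaries imitate the `text`, `size`, and `flags` fields that PyMuPDF returns from `page.get_text("dict")`, but the sample data is hand-written so the example runs without the library. The bit values follow PyMuPDF's documented span flags (bit 1 for italic, bit 4 for bold). The function names and the `heading_ratio` threshold of 1.2 are assumptions.

```python
from statistics import median

ITALIC = 1 << 1  # bit 1 of a PyMuPDF span's "flags"
BOLD = 1 << 4    # bit 4 of a PyMuPDF span's "flags"


def span_to_markdown(span):
    """Wrap a span's text in Markdown emphasis markers based on its flags."""
    text = span["text"].strip()
    if not text:
        return ""
    if span["flags"] & BOLD:
        text = f"**{text}**"
    if span["flags"] & ITALIC:
        text = f"*{text}*"
    return text


def lines_to_markdown(lines, heading_ratio=1.2):
    """Render lines (lists of spans) as Markdown, promoting oversized lines to headings."""
    body_size = median(s["size"] for line in lines for s in line)
    out = []
    for line in lines:
        if max(s["size"] for s in line) >= body_size * heading_ratio:
            # Heading: keep the plain text, the "#" already marks it as prominent.
            out.append("# " + " ".join(s["text"].strip() for s in line))
        else:
            out.append(" ".join(filter(None, map(span_to_markdown, line))))
    return "\n\n".join(out)


lines = [
    [{"text": "Results", "size": 18.0, "flags": 16}],
    [{"text": "Revenue grew", "size": 11.0, "flags": 0},
     {"text": "sharply", "size": 11.0, "flags": 16},
     {"text": "in Q3.", "size": 11.0, "flags": 0}],
    [{"text": "Costs stayed", "size": 11.0, "flags": 0},
     {"text": "flat", "size": 11.0, "flags": 2}],
]
print(lines_to_markdown(lines))
```

With this sample the 18-point line becomes `# Results`, the bold span becomes `**sharply**`, and the italic span becomes `*flat*`. A single heading level is used here for brevity; mapping several font sizes to `#`, `##`, and so on would need a further rule.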

The converter still has to infer structure, but it works from font sizes, style flags, and bounding boxes rather than from a flattened stream of plain text. The result is a Markdown file with real headings, formatted emphasis, proper lists, and intact tables, not the mangled text you get from Ctrl+C.

When copy-paste is fine

If you just need a single sentence or a short paragraph from a simple PDF, copy-paste works. The problems become serious when you need to extract longer sections with structure: headings, lists, tables, emphasis. That is when a proper conversion tool saves hours of manual reformatting.