2026-03-03 21:30:31 +00:00

pdf2imos

A Python CLI tool that converts PDF technical drawings into DXF 3D CAD files and structured JSON metadata for imos CAD furniture manufacturing workflows.

It parses vector geometry and text from PDF pages, identifies orthographic views, extracts dimensions and annotations, reconstructs 3D part geometry, and outputs industry-standard DXF files alongside validated JSON sidecar metadata.

Overview
Requirements
Installation
Usage
Pipeline Architecture
Output Files
Project Structure
Development
Error Handling
Design Notes

Overview

pdf2imos targets AutoCAD-style orthographic projection drawings of furniture parts. Given a directory of PDFs, it:

Extracts vector paths and text from each page
Segments the drawing into front, top, and side views
Classifies line roles (geometry, hidden, center, dimension, etc.)
Extracts dimension measurements and structured annotations (material, edgebanding, hardware, drilling)
Reconstructs 3D part geometry from orthographic measurements
Writes a DXF R2010 file with a 3D box mesh and a schema-validated JSON metadata file

Batch processing runs across all PDFs in the input directory. Only the first page of each PDF is processed.

Requirements

Python >= 3.11
ODA File Converter (optional, for --dwg output)

Installation

# Clone the repository
git clone <repo-url>
cd pdf2cad

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install in development mode with dev dependencies
pip install -e ".[dev]"

Usage

pdf2imos INPUT_DIR OUTPUT_DIR [OPTIONS]

Also runnable as a module:

python -m pdf2imos INPUT_DIR OUTPUT_DIR

Arguments

Argument	Description
`INPUT_DIR`	Directory containing PDF files to process
`OUTPUT_DIR`	Directory for output DXF and JSON files

Options

Option	Description
`--stage STAGE`	Stop after the given pipeline stage and dump its intermediate output as JSON
`--tolerance FLOAT`	Dimension matching tolerance in mm (default: 0.5)
`--dwg`	Also convert DXF to DWG format (requires ODAFileConverter on PATH)
`--verbose`	Enable DEBUG logging
`--version`	Show version and exit

Available --stage values: extract, segment, classify, dimensions, annotations, assemble, output

Examples

# Process all PDFs in a directory
pdf2imos ./drawings ./output

# Stop at extraction stage for debugging
pdf2imos ./drawings ./output --stage extract

# Verbose output with custom tolerance
pdf2imos ./drawings ./output --verbose --tolerance 1.0

# Also generate DWG files (requires ODAFileConverter)
pdf2imos ./drawings ./output --dwg

Exit Codes

Code	Meaning
`0`	All PDFs processed successfully
`1`	Some PDFs failed, some succeeded
`2`	All PDFs failed, or invalid arguments
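The mapping from batch results to exit codes can be sketched as follows. The function name and the treatment of an empty input directory (reported here as success) are assumptions for illustration, not taken from the pdf2imos source.

```python
def batch_exit_code(succeeded: int, failed: int) -> int:
    """Map per-file outcome counts to the documented exit codes."""
    if failed == 0:
        return 0  # all PDFs processed successfully (assumed: also when no PDFs were found)
    if succeeded > 0:
        return 1  # some PDFs failed, some succeeded
    return 2  # every PDF failed


print(batch_exit_code(succeeded=3, failed=1))
```

A shell script or CI job can branch on this value to tell a partial failure apart from a total one.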

Pipeline Architecture

The pipeline runs in seven sequential stages. Use --stage to halt after any stage and inspect its intermediate JSON output.

Stage 1: Extract

Parses the PDF page with PyMuPDF. Extracts vector paths (lines, curves, rectangles, quads) and text spans with font, size, and color metadata. Flips the y-axis from PyMuPDF's page coordinates (origin top-left, y downward) to CAD convention (origin bottom-left, y upward). Filters out degenerate and zero-area paths.
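The y-axis flip amounts to subtracting each y-coordinate from the page height. The sketch below is illustrative only: `flip_y` is a hypothetical helper, and the page height (595 pt, A4 landscape) is an assumed example value.

```python
def flip_y(points, page_height):
    """Convert (x, y) points from top-left/y-down to bottom-left/y-up coordinates."""
    return [(x, page_height - y) for x, y in points]


# A horizontal line 10 pt below the top edge of a 595 pt tall page
# ends up 585 pt above the bottom edge in CAD space.
top_edge_line = [(50.0, 10.0), (200.0, 10.0)]
print(flip_y(top_edge_line, page_height=595.0))
```

Applying the function twice returns the original points, which makes it easy to check in a unit test.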

Stage 2: Segment

Detects and removes the title block using a bottom-right rectangle heuristic. Clusters remaining geometry by spatial proximity into view regions. Classifies clusters as FRONT, TOP, or SIDE views using third-angle projection layout conventions (US/AutoCAD standard).
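In a third-angle layout expressed in CAD coordinates (y upward), the top view sits above the front view and the side view sits to its right. A minimal sketch of labelling three clustered regions under that assumption is shown below. `Box` and `label_views` are hypothetical names, and the real segmenter may use different rules.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box of one view cluster, in CAD coordinates."""
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    @property
    def center(self):
        return ((self.xmin + self.xmax) / 2, (self.ymin + self.ymax) / 2)


def label_views(boxes):
    """Label exactly three view regions as FRONT, TOP, and SIDE."""
    if len(boxes) != 3:
        raise ValueError("this sketch expects exactly three view regions")
    # The front view is the lower-left region of the L-shaped layout.
    front = min(boxes, key=lambda b: b.center[0] + b.center[1])
    rest = [b for b in boxes if b is not front]
    # Of the two remaining regions, the higher one is the top view.
    top = max(rest, key=lambda b: b.center[1])
    side = next(b for b in rest if b is not top)
    return {"FRONT": front, "TOP": top, "SIDE": side}


front = Box(0, 0, 600, 720)
top = Box(0, 800, 600, 818)
side = Box(700, 0, 718, 720)
print(label_views([side, top, front]))
```

A first-angle drawing would place the top view below the front view, so this rule would mislabel it.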

Stage 3: Classify

Classifies each path by visual properties:

Role	Visual Characteristics
`GEOMETRY`	Solid line, medium width
`HIDDEN`	Dashed line
`CENTER`	Dash-dot line
`DIMENSION`	Thin line, near arrowheads
`BORDER`	Thick line, large extent
`CONSTRUCTION`	Very thin line

Arrowheads are detected as small filled triangles.
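The table above can be read as a small decision procedure. The sketch below is one possible ordering of the checks. The width thresholds, the `extent_ratio` parameter, and the dash-pattern rule (four or more entries taken to mean dash-dot) are assumptions chosen for illustration, not the values used by pdf2imos.

```python
VERY_THIN = 0.15  # assumed stroke-width thresholds, in points
THIN = 0.3
THICK = 0.7


def classify_line(width, dashes, near_arrowhead=False, extent_ratio=0.0):
    """Return a line role from stroke width, dash pattern, and context.

    dashes: list of dash/gap lengths; an empty list means a solid line.
    extent_ratio: line length relative to the page dimension it spans.
    """
    if dashes:
        # Dash-dot patterns alternate long and short strokes, so they carry
        # at least four entries; a plain dashed line carries two.
        return "CENTER" if len(dashes) >= 4 else "HIDDEN"
    if near_arrowhead and width <= THIN:
        return "DIMENSION"
    if width >= THICK and extent_ratio >= 0.8:
        return "BORDER"
    if width < VERY_THIN:
        return "CONSTRUCTION"
    return "GEOMETRY"


print(classify_line(0.5, []))
print(classify_line(0.5, [8, 2, 1, 2]))
```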

Stage 4: Dimensions

Finds numeric text values via regex (digits with optional mm suffix). Converts text coordinates to CAD space. Matches each number to the nearest dimension or geometry line segment. Determines whether each matched dimension is horizontal or vertical.
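A minimal sketch of these three steps follows. The regex, the midpoint-distance matching rule, and all function names are assumptions for illustration. The 0.5 tolerance mirrors the documented default of --tolerance, though whether that option governs this particular check is not stated in this README.

```python
import math
import re

# Digits with an optional decimal part and an optional "mm" suffix.
DIM_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:mm)?\s*$", re.IGNORECASE)


def parse_dimension(text):
    """Return the numeric value of a dimension label, or None if it is not one."""
    match = DIM_RE.match(text)
    return float(match.group(1)) if match else None


def nearest_segment(point, segments):
    """Pick the segment whose midpoint lies closest to the text position."""
    px, py = point

    def distance_to_midpoint(segment):
        (x1, y1), (x2, y2) = segment
        return math.hypot((x1 + x2) / 2 - px, (y1 + y2) / 2 - py)

    return min(segments, key=distance_to_midpoint)


def orientation(segment, tol=0.5):
    """Classify a segment as horizontal, vertical, or angled."""
    (x1, y1), (x2, y2) = segment
    if abs(y2 - y1) <= tol:
        return "horizontal"
    if abs(x2 - x1) <= tol:
        return "vertical"
    return "angled"


segments = [((0, 0), (600, 0)), ((0, 0), (0, 720))]
hit = nearest_segment((300, -20), segments)
print(parse_dimension("600 mm"), orientation(hit))
```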

Stage 5: Annotations

Extracts structured annotations via regex:

Material specs -- type, thickness, finish
Edgebanding -- thickness, material
Hardware callouts -- brand, model
Drilling patterns -- diameter, depth, count

Also collects raw text annotations and title block metadata.
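As an illustration of regex-based annotation parsing, the sketch below extracts a drilling pattern. The callout format ("count x Ø diameter depth depth-value") is a hypothetical notation picked for the example. The formats pdf2imos actually recognizes are not documented in this README.

```python
import re

# Hypothetical callout format: "<count> x Ø<diameter> depth <depth>", units in mm.
DRILLING_RE = re.compile(
    r"(?P<count>\d+)\s*x\s*(?:Ø|dia\.?)\s*(?P<diameter>\d+(?:\.\d+)?)\s*(?:mm)?"
    r"\s*(?:x|depth)\s*(?P<depth>\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def parse_drilling(text):
    """Return count, diameter, and depth from a drilling callout, or None."""
    match = DRILLING_RE.search(text)
    if match is None:
        return None
    return {
        "count": int(match.group("count")),
        "diameter_mm": float(match.group("diameter")),
        "depth_mm": float(match.group("depth")),
    }


print(parse_drilling("4x Ø5 depth 12"))
```

Text that does not match any structured pattern would still be kept as a raw annotation, as described above.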

Stage 6: Assemble

Reconstructs 3D part geometry (width × height × depth) from orthographic dimension measurements. Cross-validates dimensions across views: front height should match side height, and front width should match top width. When depth cannot be extracted, falls back to 18 mm, the standard furniture panel thickness.
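The cross-validation and depth fallback can be sketched as below. This assumes each view is reduced to a (horizontal, vertical) pair of measurements in mm, and that a mismatch raises an error. The real assembler may instead warn or pick one value, and `assemble` is a hypothetical name.

```python
DEFAULT_DEPTH_MM = 18.0  # documented fallback panel thickness


def assemble(front, top=None, side=None, tolerance=0.5):
    """Combine per-view (horizontal, vertical) sizes into (width, height, depth)."""
    width, height = front
    # Top view shows width x depth; its width must agree with the front view.
    if top is not None and abs(top[0] - width) > tolerance:
        raise ValueError(f"front width {width} != top width {top[0]}")
    # Side view shows depth x height; its height must agree with the front view.
    if side is not None and abs(side[1] - height) > tolerance:
        raise ValueError(f"front height {height} != side height {side[1]}")
    if top is not None:
        depth = top[1]
    elif side is not None:
        depth = side[0]
    else:
        depth = DEFAULT_DEPTH_MM
    return (width, height, depth)


print(assemble(front=(600.0, 720.0), top=(600.0, 400.0), side=(400.0, 720.0)))
print(assemble(front=(600.0, 720.0)))
```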

Stage 7: Output

Generates:

DXF R2010 file with a 3D box mesh on the GEOMETRY layer, dimension text on DIMENSIONS, and part name on ANNOTATIONS
JSON metadata file validated against metadata.schema.json
Optionally converts DXF to DWG via ODAFileConverter (ACAD2018 format, 30-second timeout)
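The optional DWG step can be sketched with the standard library alone. The argument order below follows the ODA File Converter command-line convention as commonly documented (input folder, output folder, output version, output type, recurse flag, audit flag, file filter). Treat it as an assumption and verify it against the installed version. Both function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_oda_command(exe, in_dir, out_dir):
    """Build the converter command line for DXF -> DWG (ACAD2018)."""
    # Positional arguments: input dir, output dir, version, type,
    # recurse (0 = no), audit (1 = yes), input file filter.
    return [exe, str(in_dir), str(out_dir), "ACAD2018", "DWG", "0", "1", "*.DXF"]


def convert_to_dwg(in_dir, out_dir, timeout=30.0):
    """Run the converter if it is on PATH; return False when it is absent."""
    exe = shutil.which("ODAFileConverter")
    if exe is None:
        return False  # converter not installed: only DXF output remains
    subprocess.run(build_oda_command(exe, in_dir, out_dir), check=True, timeout=timeout)
    return True


print(build_oda_command("ODAFileConverter", Path("out"), Path("out")))
```

`subprocess.run` raises `subprocess.TimeoutExpired` when the 30-second limit is exceeded, so a caller can turn that into an output error.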

Output Files

For each input example.pdf, two files are written to the output directory:

File	Description
`example.dxf`	DXF R2010 file with 3D mesh geometry
`example.json`	JSON metadata validated against the bundled schema

JSON Metadata Structure

{
  "source_pdf": "example.pdf",
  "extraction_timestamp": "2026-03-03T12:00:00Z",
  "part_name": "Panel A",
  "overall_dimensions": {
    "width_mm": 600.0,
    "height_mm": 720.0,
    "depth_mm": 18.0
  },
  "parts": [
    {
      "material": "...",
      "edgebanding": "...",
      "hardware": "...",
      "drilling": "..."
    }
  ],
  "raw_annotations": ["..."]
}

Coordinate System

PDF space (as reported by PyMuPDF): origin top-left, y increases downward
CAD/DXF space: origin bottom-left, y increases upward
The pipeline flips y-coordinates during extraction
DXF axes: X = width, Y = depth, Z = height
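Under the axis mapping above, the eight corners of a part's bounding box can be generated as follows. This is a sketch only: `box_vertices` is a hypothetical helper, and placing one corner at the origin is an assumption.

```python
def box_vertices(width_mm, height_mm, depth_mm):
    """Return the 8 corners of a box with X = width, Y = depth, Z = height."""
    return [
        (x, y, z)
        for z in (0.0, height_mm)
        for y in (0.0, depth_mm)
        for x in (0.0, width_mm)
    ]


for vertex in box_vertices(600.0, 720.0, 18.0):
    print(vertex)
```

For a 600 x 720 x 18 mm panel, the thin 18 mm dimension therefore runs along Y and the 720 mm height along Z.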

Project Structure

pdf2cad/
├── pyproject.toml
├── src/
│   └── pdf2imos/
│       ├── __init__.py              # Version
│       ├── __main__.py              # python -m entry point
│       ├── cli.py                   # Typer CLI and pipeline orchestration
│       ├── errors.py                # Exception hierarchy
│       ├── extract/                 # Stage 1: PDF parsing
│       │   ├── geometry.py          #   Vector path extraction
│       │   └── text.py              #   Text span extraction
│       ├── interpret/               # Stages 2-3: Layout understanding
│       │   ├── line_classifier.py   #   Line role classification
│       │   ├── title_block.py       #   Title block detection
│       │   └── view_segmenter.py    #   Orthographic view segmentation
│       ├── models/                  # Data models (frozen dataclasses)
│       │   ├── annotations.py       #   Material, edgeband, hardware, drilling
│       │   ├── classified.py        #   ClassifiedLine, LineRole
│       │   ├── geometry.py          #   PartGeometry (3D dimensions)
│       │   ├── pipeline.py          #   PipelineResult
│       │   ├── primitives.py        #   RawPath, RawText, PageExtraction
│       │   └── views.py             #   ViewRegion, ViewType
│       ├── output/                  # Stage 7: File generation
│       │   ├── dxf_writer.py        #   DXF 3D mesh output
│       │   ├── dwg_converter.py     #   Optional DWG conversion
│       │   └── json_writer.py       #   JSON metadata output
│       ├── parse/                   # Stages 4-5: Content extraction
│       │   ├── annotations.py       #   Annotation regex parsing
│       │   └── dimensions.py        #   Dimension measurement extraction
│       ├── reconstruct/             # Stage 6: 3D assembly
│       │   └── assembler.py         #   Part geometry reconstruction
│       └── schema/                  # JSON Schema
│           ├── metadata.schema.json #   Metadata validation schema
│           └── validator.py         #   Schema validation wrapper
└── tests/
    ├── conftest.py
    ├── generate_fixtures.py         # Synthetic test PDF generator
    ├── fixtures/
    │   ├── input/                   # Test PDFs (4 files)
    │   └── expected/                # Expected JSON outputs
    ├── integration/
    │   ├── test_golden.py           # Golden file comparison
    │   └── test_pipeline.py         # End-to-end pipeline tests
    ├── test_annotation_extractor.py
    ├── test_assembler.py
    ├── test_cli.py
    ├── test_dimension_extractor.py
    ├── test_dwg_converter.py
    ├── test_dxf_writer.py
    ├── test_error_handling.py
    ├── test_geometry_extractor.py
    ├── test_json_writer.py
    ├── test_line_classifier.py
    ├── test_models.py
    ├── test_schema.py
    ├── test_text_extractor.py
    ├── test_title_block.py
    └── test_view_segmenter.py

Development

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=pdf2imos

# Run only unit tests (skip integration)
pytest tests/ --ignore=tests/integration

# Run only integration tests
pytest tests/integration/

# Regenerate test fixture PDFs
python tests/generate_fixtures.py

Test Fixtures

Four synthetic PDFs are generated by tests/generate_fixtures.py:

File	Description
`simple_panel.pdf`	600x720x18mm flat panel, 3 orthographic views
`cabinet_basic.pdf`	600x720x400mm cabinet with material and edgebanding annotations
`panel_with_drilling.pdf`	600x720x18mm panel with shelf pin holes and drilling annotations
`edge_cases.pdf`	600x720x3mm ultra-thin back panel with closely spaced and redundant dimensions

Linting

ruff check src/ tests/
ruff format src/ tests/

Ruff is configured with line-length 100, target Python 3.11, and E/F/I rule sets.

Error Handling

All exceptions inherit from Pdf2ImosError:

Exception	Raised When
`PdfExtractionError`	Invalid, corrupt, or empty PDF; no vector content found
`ViewSegmentationError`	View segmentation fails to identify orthographic regions
`DimensionExtractionError`	No dimensions found, or 3D assembly fails
`OutputWriteError`	DXF, JSON, or DWG file cannot be written

When processing a batch, failures on individual PDFs are caught and reported. The exit code reflects whether all, some, or none of the files succeeded.
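A minimal sketch of this pattern is shown below. The exception names come from the table above, but the flat hierarchy (each class deriving directly from Pdf2ImosError) and the `run_batch` helper are assumptions for illustration.

```python
class Pdf2ImosError(Exception):
    """Base class for all pipeline errors."""


class PdfExtractionError(Pdf2ImosError):
    pass


class ViewSegmentationError(Pdf2ImosError):
    pass


class DimensionExtractionError(Pdf2ImosError):
    pass


class OutputWriteError(Pdf2ImosError):
    pass


def run_batch(paths, process):
    """Process each path; collect failures instead of aborting the batch."""
    succeeded, failed = [], []
    for path in paths:
        try:
            process(path)
        except Pdf2ImosError as exc:
            failed.append((path, exc))
        else:
            succeeded.append(path)
    return succeeded, failed


def fake_process(path):
    # Stand-in for the real pipeline, used only to demonstrate the loop.
    if path == "bad.pdf":
        raise PdfExtractionError("no vector content found")


ok, bad = run_batch(["a.pdf", "bad.pdf", "b.pdf"], fake_process)
print(ok, [(p, str(e)) for p, e in bad])
```

Catching only the base class lets unexpected exceptions (for example a programming error) surface normally rather than being reported as a failed PDF.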

Design Notes

Frozen dataclasses. All pipeline models use frozen dataclasses, making intermediate data immutable and safe to pass between stages without defensive copying.
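The effect of freezing can be seen in a short example. `PartGeometry` is named in the project structure, but the field names below are assumed from the JSON metadata keys rather than taken from the source.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class PartGeometry:
    # Field names are assumptions based on the JSON metadata keys.
    width_mm: float
    height_mm: float
    depth_mm: float


part = PartGeometry(width_mm=600.0, height_mm=720.0, depth_mm=18.0)

try:
    part.depth_mm = 19.0  # mutation is rejected on a frozen dataclass
except FrozenInstanceError:
    print("PartGeometry is immutable")

# A later stage that needs a changed value creates a new instance instead.
thicker = replace(part, depth_mm=19.0)
print(part.depth_mm, thicker.depth_mm)
```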

Third-angle projection. The view segmenter assumes US/AutoCAD third-angle projection layout. First-angle (ISO) drawings will produce incorrect view assignments.

Depth fallback. When depth cannot be extracted from the drawing (common for flat panels), the assembler defaults to 18mm, the standard furniture panel thickness.

32mm system. Drilling hole placement assumes the 32mm system spacing standard used in European furniture manufacturing.
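In the 32mm system, holes in a row sit at multiples of 32 mm from the first hole. The sketch below generates such a row. The function name and the example start and end offsets are hypothetical and do not reflect how pdf2imos stores drilling data.

```python
def hole_row(first_mm, last_mm, pitch_mm=32.0):
    """Return hole positions from first_mm up to last_mm at a fixed pitch."""
    count = int((last_mm - first_mm) // pitch_mm) + 1
    return [first_mm + i * pitch_mm for i in range(count)]


# Shelf pin holes between 100 mm and 260 mm from the panel's bottom edge.
print(hole_row(100.0, 260.0))
```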

DWG conversion. ODA File Converter must be installed and available on PATH for --dwg to have any effect. If it's absent, the flag is silently ignored and only DXF output is written.

Page scope. Only the first page of each PDF is processed. Multi-page drawings are not currently supported.
