pdf2imos

A Python CLI tool that converts PDF technical drawings into DXF 3D CAD files and structured JSON metadata for imos CAD furniture manufacturing workflows.

It parses vector geometry and text from PDF pages, identifies orthographic views, extracts dimensions and annotations, reconstructs 3D part geometry, and outputs industry-standard DXF files alongside validated JSON sidecar metadata.



Overview

pdf2imos targets AutoCAD-style orthographic projection drawings of furniture parts. Given a directory of PDFs, it:

  1. Extracts vector paths and text from each page
  2. Segments the drawing into front, top, and side views
  3. Classifies line roles (geometry, hidden, center, dimension, etc.)
  4. Extracts dimension measurements and structured annotations (material, edgebanding, hardware, drilling)
  5. Reconstructs 3D part geometry from orthographic measurements
  6. Writes a DXF R2010 file with a 3D box mesh and a schema-validated JSON metadata file

Batch processing runs across all PDFs in the input directory. Only the first page of each PDF is processed.


Requirements


Installation

# Clone the repository
git clone <repo-url>
cd pdf2cad

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install in development mode with dev dependencies
pip install -e ".[dev]"

Usage

pdf2imos INPUT_DIR OUTPUT_DIR [OPTIONS]

Also runnable as a module:

python -m pdf2imos INPUT_DIR OUTPUT_DIR

Arguments

| Argument | Description |
| --- | --- |
| INPUT_DIR | Directory containing PDF files to process |
| OUTPUT_DIR | Directory for output DXF and JSON files |

Options

| Option | Description |
| --- | --- |
| --stage STAGE | Stop after the given pipeline stage and dump its intermediate output as JSON |
| --tolerance FLOAT | Dimension matching tolerance in mm (default: 0.5) |
| --dwg | Also convert DXF to DWG format (requires ODAFileConverter on PATH) |
| --verbose | Enable DEBUG logging |
| --version | Show version and exit |

Available --stage values: extract, segment, classify, dimensions, annotations, assemble, output

Examples

# Process all PDFs in a directory
pdf2imos ./drawings ./output

# Stop at extraction stage for debugging
pdf2imos ./drawings ./output --stage extract

# Verbose output with custom tolerance
pdf2imos ./drawings ./output --verbose --tolerance 1.0

# Also generate DWG files (requires ODAFileConverter)
pdf2imos ./drawings ./output --dwg

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | All PDFs processed successfully |
| 1 | Some PDFs failed, some succeeded |
| 2 | All PDFs failed, or invalid arguments |
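
As an illustration of the table above, the mapping from batch results to exit code could look like the following sketch. The function name `batch_exit_code` is hypothetical, and the handling of an empty input directory is an assumption, since the table does not cover that case.

```python
def batch_exit_code(succeeded: int, failed: int) -> int:
    """Map per-file results to the documented process exit code."""
    if failed == 0:
        # Assumption: an empty batch (0 succeeded, 0 failed) also lands here.
        return 0
    if succeeded > 0:
        return 1  # mixed result: some PDFs failed, some succeeded
    return 2  # every PDF failed
```

Invalid arguments also produce exit code 2, but that case would be handled by the argument parser before any batch runs.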

Pipeline Architecture

The pipeline runs in 7 sequential stages. Use --stage to halt after any stage and inspect intermediate output.
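
A minimal sketch of how halting after a named stage might be resolved, assuming the stage names listed under Usage run in that order. The helper `stages_to_run` is hypothetical and is not taken from cli.py.

```python
from typing import List, Optional

# Stage names as accepted by --stage, in pipeline order.
STAGES = [
    "extract", "segment", "classify", "dimensions",
    "annotations", "assemble", "output",
]


def stages_to_run(stop_after: Optional[str] = None) -> List[str]:
    """Return the stages to execute, ending with `stop_after` if given."""
    if stop_after is None:
        return list(STAGES)
    if stop_after not in STAGES:
        raise ValueError(f"unknown stage: {stop_after!r}")
    return STAGES[: STAGES.index(stop_after) + 1]
```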

Stage 1: Extract

Parses the PDF page with PyMuPDF. Extracts vector paths (lines, curves, rectangles, quads) and text spans with font, size, and color metadata. Flips the y-axis from PDF convention (origin top-left, y downward) to CAD convention (origin bottom-left, y upward). Filters out degenerate and zero-area paths.
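
The y-axis flip and the degenerate-path filter can be sketched in a few lines. The helper names and the `eps` threshold are assumptions for illustration; the real extractor works on PyMuPDF objects rather than bare tuples.

```python
from typing import Tuple

Point = Tuple[float, float]


def flip_y(point: Point, page_height: float) -> Point:
    """Convert a PDF-space point (y downward) to CAD space (y upward)."""
    x, y = point
    return (x, page_height - y)


def is_degenerate(start: Point, end: Point, eps: float = 1e-6) -> bool:
    """True when a segment's endpoints coincide within `eps`."""
    return abs(start[0] - end[0]) <= eps and abs(start[1] - end[1]) <= eps
```

A point on the top edge of an 842 pt page (y = 0 in PDF space) ends up at y = 842 in CAD space, and a point on the bottom edge ends up at y = 0.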

Stage 2: Segment

Detects and removes the title block using a bottom-right rectangle heuristic. Clusters remaining geometry by spatial proximity into view regions. Classifies clusters as FRONT, TOP, or SIDE views using third-angle projection layout conventions (US/AutoCAD standard).
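
To make the layout convention concrete, here is a rough sketch of labeling three already-clustered view regions. It assumes a third-angle sheet with the front view at the lower left, the top view above it, and the side view to its right. `BBox` and `label_views` are hypothetical names, and the actual segmenter may use different heuristics.

```python
from typing import Dict, List, NamedTuple


class BBox(NamedTuple):
    """Axis-aligned bounding box in CAD space (y increases upward)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def cx(self) -> float:
        return (self.x0 + self.x1) / 2.0

    @property
    def cy(self) -> float:
        return (self.y0 + self.y1) / 2.0


def label_views(clusters: List[BBox]) -> Dict[str, BBox]:
    """Assign FRONT/TOP/SIDE to three clusters by relative position."""
    if len(clusters) != 3:
        raise ValueError("expected exactly three view clusters")
    # The front view is the cluster closest to the lower-left corner.
    front = min(clusters, key=lambda b: b.cx + b.cy)
    others = [b for b in clusters if b is not front]
    top = max(others, key=lambda b: b.cy)   # highest remaining cluster
    side = max(others, key=lambda b: b.cx)  # rightmost remaining cluster
    if top is side:
        raise ValueError("ambiguous layout: one cluster is both highest and rightmost")
    return {"FRONT": front, "TOP": top, "SIDE": side}
```

On a first-angle sheet the same positions carry different meanings, so this kind of rule would mislabel the views.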

Stage 3: Classify

Classifies each path by visual properties:

| Role | Visual Characteristics |
| --- | --- |
| GEOMETRY | Solid line, medium width |
| HIDDEN | Dashed line |
| CENTER | Dash-dot line |
| DIMENSION | Thin line, near arrowheads |
| BORDER | Thick line, large extent |
| CONSTRUCTION | Very thin line |

Arrowheads are detected as small filled triangles.
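
The rules in the table could be expressed as a small decision function. This is only a sketch: every numeric threshold below is an assumption chosen for illustration, the dash pattern is modeled as a plain tuple of on/off lengths, and the members of `LineRole` are inferred from the table rather than copied from models/classified.py.

```python
from enum import Enum
from typing import Sequence


class LineRole(Enum):
    GEOMETRY = "geometry"
    HIDDEN = "hidden"
    CENTER = "center"
    DIMENSION = "dimension"
    BORDER = "border"
    CONSTRUCTION = "construction"


# Hypothetical thresholds in PDF points; real values would need tuning.
THICK_WIDTH = 0.8
THIN_WIDTH = 0.3
VERY_THIN_WIDTH = 0.15
LARGE_EXTENT_FRACTION = 0.5


def classify_line(
    width: float,
    dashes: Sequence[float],
    length: float,
    near_arrowhead: bool,
    page_extent: float,
) -> LineRole:
    """Pick a role from stroke width, dash pattern, length, and context."""
    if len(dashes) >= 4:
        return LineRole.CENTER  # dash-dot: two on/off pairs or more
    if len(dashes) >= 2:
        return LineRole.HIDDEN  # simple dashed pattern
    if width >= THICK_WIDTH and length >= LARGE_EXTENT_FRACTION * page_extent:
        return LineRole.BORDER
    if width <= VERY_THIN_WIDTH:
        return LineRole.CONSTRUCTION
    if width <= THIN_WIDTH and near_arrowhead:
        return LineRole.DIMENSION
    return LineRole.GEOMETRY
```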

Stage 4: Dimensions

Finds numeric text values via regex (digits with optional mm suffix). Converts text coordinates to CAD space. Matches each number to the nearest dimension or geometry line segment. Determines horizontal vs. vertical orientation.
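
A sketch of the three steps described above: parsing a dimension value, finding the nearest segment, and deciding orientation. The regex and helper names are assumptions for illustration, and the actual pattern in parse/dimensions.py may differ.

```python
import math
import re
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point]

# Digits with an optional decimal part and an optional "mm" suffix.
DIMENSION_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:mm)?\s*$", re.IGNORECASE)


def parse_dimension(text: str) -> Optional[float]:
    """Return the numeric value of a dimension label, or None."""
    match = DIMENSION_RE.match(text)
    return float(match.group(1)) if match else None


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Shortest distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Project p onto the segment and clamp to its endpoints.
    t = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def nearest_segment(p: Point, segments: Sequence[Segment]) -> int:
    """Index of the segment closest to p."""
    return min(range(len(segments)),
               key=lambda i: point_segment_distance(p, *segments[i]))


def orientation(a: Point, b: Point) -> str:
    """Classify a segment as horizontal or vertical by its larger extent."""
    return "horizontal" if abs(b[0] - a[0]) >= abs(b[1] - a[1]) else "vertical"
```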

Stage 5: Annotations

Extracts structured annotations via regex:

  • Material specs -- type, thickness, finish
  • Edgebanding -- thickness, material
  • Hardware callouts -- brand, model
  • Drilling patterns -- diameter, depth, count

Also collects raw text annotations and title block metadata.
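
As a sketch of regex-based annotation parsing, the example below handles drilling callouts only. The callout syntax it accepts (for example "4x Ø5mm x 12mm", meaning count, diameter, depth) is invented for illustration, as are the names `DrillingAnnotation` and `parse_drilling`. The patterns in parse/annotations.py may look quite different.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DrillingAnnotation:
    count: int
    diameter_mm: float
    depth_mm: float


# Hypothetical callout format: "<count>x Ø<diameter>mm x <depth>mm".
DRILLING_RE = re.compile(
    r"(?P<count>\d+)\s*[xX]\s*(?:Ø|D)\s*(?P<diameter>\d+(?:\.\d+)?)\s*(?:mm)?"
    r"\s*[xX]\s*(?P<depth>\d+(?:\.\d+)?)\s*(?:mm)?"
)


def parse_drilling(text: str) -> Optional[DrillingAnnotation]:
    """Extract count, diameter, and depth from a drilling callout."""
    match = DRILLING_RE.search(text)
    if match is None:
        return None
    return DrillingAnnotation(
        count=int(match.group("count")),
        diameter_mm=float(match.group("diameter")),
        depth_mm=float(match.group("depth")),
    )
```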

Stage 6: Assemble

Reconstructs 3D part geometry (width x height x depth) from orthographic dimension measurements. Cross-validates dimensions across views: front height should match side height, and front width should match top width. When depth cannot be extracted, falls back to 18mm, the standard furniture panel thickness.
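
The cross-validation and depth fallback might look roughly like this. It is a sketch under several assumptions: that the --tolerance value is what bounds the cross-view comparison, that a mismatch is treated as an error, and that `PartGeometry` carries the three fields shown. The function signatures are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_DEPTH_MM = 18.0  # fallback panel thickness named in this README


@dataclass(frozen=True)
class PartGeometry:
    width_mm: float
    height_mm: float
    depth_mm: float


def dimensions_agree(a: float, b: float, tolerance_mm: float) -> bool:
    """True when two measurements differ by at most the tolerance."""
    return abs(a - b) <= tolerance_mm


def assemble(
    front_width: float,
    front_height: float,
    top_width: Optional[float] = None,
    side_height: Optional[float] = None,
    depth: Optional[float] = None,
    tolerance_mm: float = 0.5,
) -> PartGeometry:
    """Build part geometry from view measurements, checking consistency."""
    if top_width is not None and not dimensions_agree(front_width, top_width, tolerance_mm):
        raise ValueError("front width and top width disagree")
    if side_height is not None and not dimensions_agree(front_height, side_height, tolerance_mm):
        raise ValueError("front height and side height disagree")
    # Fall back to the default depth when none was extracted.
    resolved_depth = depth if depth is not None else DEFAULT_DEPTH_MM
    return PartGeometry(front_width, front_height, resolved_depth)
```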

Stage 7: Output

Generates:

  • DXF R2010 file with a 3D box mesh on the GEOMETRY layer, dimension text on DIMENSIONS, and part name on ANNOTATIONS
  • JSON metadata file validated against metadata.schema.json
  • Optionally converts DXF to DWG via ODAFileConverter (ACAD2018 format, 30-second timeout)

Output Files

For each input example.pdf, two files are written to the output directory:

| File | Description |
| --- | --- |
| example.dxf | DXF R2010 file with 3D mesh geometry |
| example.json | JSON metadata validated against the bundled schema |

JSON Metadata Structure

{
  "source_pdf": "example.pdf",
  "extraction_timestamp": "2026-03-03T12:00:00Z",
  "part_name": "Panel A",
  "overall_dimensions": {
    "width_mm": 600.0,
    "height_mm": 720.0,
    "depth_mm": 18.0
  },
  "parts": [
    {
      "material": "...",
      "edgebanding": "...",
      "hardware": "...",
      "drilling": "..."
    }
  ],
  "raw_annotations": ["..."]
}

Coordinate System

  • PDF space: origin top-left, y increases downward
  • CAD/DXF space: origin bottom-left, y increases upward
  • The pipeline flips y-coordinates during extraction
  • DXF axes: X = width, Y = depth, Z = height
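
To show how the axis mapping plays out, the following sketch builds the vertices and quad faces of a box with X = width, Y = depth, and Z = height. The function `box_mesh` is hypothetical, and placing one corner at the origin is an assumption. The real writer in output/dxf_writer.py may organize its mesh differently.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]
Face = Tuple[int, int, int, int]


def box_mesh(width: float, height: float, depth: float) -> Tuple[List[Vertex], List[Face]]:
    """Return 8 vertices and 6 quad faces for an axis-aligned box."""
    w, d, h = width, depth, height
    vertices: List[Vertex] = [
        (0.0, 0.0, 0.0), (w, 0.0, 0.0), (w, d, 0.0), (0.0, d, 0.0),  # z = 0
        (0.0, 0.0, h), (w, 0.0, h), (w, d, h), (0.0, d, h),          # z = height
    ]
    # Vertex order within each face gives an outward-facing normal.
    faces: List[Face] = [
        (0, 3, 2, 1),  # bottom (z = 0)
        (4, 5, 6, 7),  # top (z = height)
        (0, 1, 5, 4),  # front (y = 0)
        (1, 2, 6, 5),  # right (x = width)
        (2, 3, 7, 6),  # back (y = depth)
        (3, 0, 4, 7),  # left (x = 0)
    ]
    return vertices, faces
```

For a 600 x 720 x 18 panel, the largest X is 600, the largest Y is 18, and the largest Z is 720, so the panel stands upright with its thickness along Y.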

Project Structure

pdf2cad/
├── pyproject.toml
├── src/
│   └── pdf2imos/
│       ├── __init__.py              # Version
│       ├── __main__.py              # python -m entry point
│       ├── cli.py                   # Typer CLI and pipeline orchestration
│       ├── errors.py                # Exception hierarchy
│       ├── extract/                 # Stage 1: PDF parsing
│       │   ├── geometry.py          #   Vector path extraction
│       │   └── text.py              #   Text span extraction
│       ├── interpret/               # Stages 2-3: Layout understanding
│       │   ├── line_classifier.py   #   Line role classification
│       │   ├── title_block.py       #   Title block detection
│       │   └── view_segmenter.py    #   Orthographic view segmentation
│       ├── models/                  # Data models (frozen dataclasses)
│       │   ├── annotations.py       #   Material, edgeband, hardware, drilling
│       │   ├── classified.py        #   ClassifiedLine, LineRole
│       │   ├── geometry.py          #   PartGeometry (3D dimensions)
│       │   ├── pipeline.py          #   PipelineResult
│       │   ├── primitives.py        #   RawPath, RawText, PageExtraction
│       │   └── views.py             #   ViewRegion, ViewType
│       ├── output/                  # Stage 7: File generation
│       │   ├── dxf_writer.py        #   DXF 3D mesh output
│       │   ├── dwg_converter.py     #   Optional DWG conversion
│       │   └── json_writer.py       #   JSON metadata output
│       ├── parse/                   # Stages 4-5: Content extraction
│       │   ├── annotations.py       #   Annotation regex parsing
│       │   └── dimensions.py        #   Dimension measurement extraction
│       ├── reconstruct/             # Stage 6: 3D assembly
│       │   └── assembler.py         #   Part geometry reconstruction
│       └── schema/                  # JSON Schema
│           ├── metadata.schema.json #   Metadata validation schema
│           └── validator.py         #   Schema validation wrapper
└── tests/
    ├── conftest.py
    ├── generate_fixtures.py         # Synthetic test PDF generator
    ├── fixtures/
    │   ├── input/                   # Test PDFs (4 files)
    │   └── expected/                # Expected JSON outputs
    ├── integration/
    │   ├── test_golden.py           # Golden file comparison
    │   └── test_pipeline.py         # End-to-end pipeline tests
    ├── test_annotation_extractor.py
    ├── test_assembler.py
    ├── test_cli.py
    ├── test_dimension_extractor.py
    ├── test_dwg_converter.py
    ├── test_dxf_writer.py
    ├── test_error_handling.py
    ├── test_geometry_extractor.py
    ├── test_json_writer.py
    ├── test_line_classifier.py
    ├── test_models.py
    ├── test_schema.py
    ├── test_text_extractor.py
    ├── test_title_block.py
    └── test_view_segmenter.py

Development

Running Tests

# Run all tests
pytest

# Run with coverage
pytest --cov=pdf2imos

# Run only unit tests (skip integration)
pytest tests/ --ignore=tests/integration

# Run only integration tests
pytest tests/integration/

# Regenerate test fixture PDFs
python tests/generate_fixtures.py

Test Fixtures

Four synthetic PDFs are generated by tests/generate_fixtures.py:

| File | Description |
| --- | --- |
| simple_panel.pdf | 600x720x18mm flat panel, 3 orthographic views |
| cabinet_basic.pdf | 600x720x400mm cabinet with material and edgebanding annotations |
| panel_with_drilling.pdf | 600x720x18mm panel with shelf pin holes and drilling annotations |
| edge_cases.pdf | 600x720x3mm ultra-thin back panel with closely spaced and redundant dimensions |

Linting

ruff check src/ tests/
ruff format src/ tests/

Ruff is configured with line-length 100, target Python 3.11, and E/F/I rule sets.


Error Handling

All exceptions inherit from Pdf2ImosError:

| Exception | Raised When |
| --- | --- |
| PdfExtractionError | Invalid, corrupt, or empty PDF; no vector content found |
| ViewSegmentationError | View segmentation fails to identify orthographic regions |
| DimensionExtractionError | No dimensions found, or 3D assembly fails |
| OutputWriteError | DXF, JSON, or DWG file cannot be written |

When processing a batch, failures on individual PDFs are caught and reported. The exit code reflects whether all, some, or none of the files succeeded.
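
The hierarchy and the per-file error handling can be sketched as follows. The exception names come from the table above. The `run_batch` helper, its signature, and the demo callback are hypothetical and only illustrate the pattern of catching the base class per file.

```python
from typing import Callable, Iterable, List, Tuple


class Pdf2ImosError(Exception):
    """Base class for all pipeline errors."""


class PdfExtractionError(Pdf2ImosError):
    pass


class ViewSegmentationError(Pdf2ImosError):
    pass


class DimensionExtractionError(Pdf2ImosError):
    pass


class OutputWriteError(Pdf2ImosError):
    pass


def run_batch(
    pdf_names: Iterable[str],
    process_one: Callable[[str], None],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Process each PDF, collecting failures instead of aborting the batch."""
    succeeded: List[str] = []
    failed: List[Tuple[str, str]] = []
    for name in pdf_names:
        try:
            process_one(name)
        except Pdf2ImosError as exc:
            failed.append((name, f"{type(exc).__name__}: {exc}"))
        else:
            succeeded.append(name)
    return succeeded, failed


def _demo_process(name: str) -> None:
    # Stand-in for the real pipeline: one file is made to fail.
    if name == "bad.pdf":
        raise PdfExtractionError("no vector content found")


succeeded, failed = run_batch(["a.pdf", "bad.pdf"], _demo_process)
```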


Design Notes

Frozen dataclasses. All pipeline models use frozen dataclasses, making intermediate data immutable and safe to pass between stages without defensive copying.
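
A short illustration of what freezing buys: attribute assignment raises, and changes are made by building a new instance. The fields given to `RawText` here are assumptions; only the class name appears in the project structure.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class RawText:
    text: str
    x: float
    y: float


label = RawText("600", 120.0, 40.0)

try:
    label.x = 0.0  # frozen instances reject attribute assignment
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False

# dataclasses.replace returns a modified copy and leaves the original intact.
moved = replace(label, x=0.0)
```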

Third-angle projection. The view segmenter assumes US/AutoCAD third-angle projection layout. First-angle (ISO) drawings will produce incorrect view assignments.

Depth fallback. When depth cannot be extracted from the drawing (common for flat panels), the assembler defaults to 18mm, the standard furniture panel thickness.

32mm system. Drilling hole placement assumes the 32mm system, the hole-spacing standard used in European furniture manufacturing.
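
In practice this means holes in a row sit at multiples of 32mm from the first hole. A minimal sketch, with a hypothetical helper name and an arbitrary starting offset:

```python
from typing import List

SYSTEM_PITCH_MM = 32.0  # centre-to-centre hole spacing in the 32mm system


def hole_row(first_mm: float, count: int, pitch_mm: float = SYSTEM_PITCH_MM) -> List[float]:
    """Positions of `count` holes along one axis, starting at `first_mm`."""
    return [first_mm + i * pitch_mm for i in range(count)]
```

For example, `hole_row(64.0, 4)` yields holes at 64, 96, 128, and 160 mm.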

DWG conversion. ODA File Converter must be installed and available on PATH for --dwg to have any effect. If it's absent, the flag is silently ignored and only DXF output is written.
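
The lookup-then-skip behaviour could be sketched like this. The positional argument order passed to ODAFileConverter (input folder, output folder, output version, output type, recurse flag, audit flag, file filter) reflects the converter's command-line usage as best understood here and should be checked against the ODA documentation. The function names are hypothetical, and the 30-second timeout and ACAD2018 target come from Stage 7.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_oda_command(exe: str, in_dir: Path, out_dir: Path) -> List[str]:
    """Assemble the converter command line (argument order is an assumption)."""
    return [exe, str(in_dir), str(out_dir), "ACAD2018", "DWG", "0", "1", "*.DXF"]


def convert_to_dwg(in_dir: Path, out_dir: Path, timeout_s: float = 30.0) -> bool:
    """Run the converter if it is on PATH; return False when it is absent."""
    exe = shutil.which("ODAFileConverter")
    if exe is None:
        return False  # converter absent: skip quietly, DXF output remains
    subprocess.run(build_oda_command(exe, in_dir, out_dir), check=True, timeout=timeout_s)
    return True
```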

Page scope. Only the first page of each PDF is processed. Multi-page drawings are not currently supported.