
# pdf2imos

A Python CLI tool that converts PDF technical drawings into DXF 3D CAD files and structured JSON metadata for imos CAD furniture manufacturing workflows.

It parses vector geometry and text from PDF pages, identifies orthographic views, extracts dimensions and annotations, reconstructs 3D part geometry, and outputs industry-standard DXF files alongside validated JSON sidecar metadata.

---

## Table of Contents

- [Overview](#overview)
- [Requirements](#requirements)
- [Installation](#installation)
- [Usage](#usage)
- [Pipeline Architecture](#pipeline-architecture)
- [Output Files](#output-files)
- [Project Structure](#project-structure)
- [Development](#development)
- [Error Handling](#error-handling)
- [Design Notes](#design-notes)

---

## Overview

pdf2imos targets AutoCAD-style orthographic projection drawings of furniture parts. Given a directory of PDFs, it:

1. Extracts vector paths and text from each page
2. Segments the drawing into front, top, and side views
3. Classifies line roles (geometry, hidden, center, dimension, etc.)
4. Extracts dimension measurements and structured annotations (material, edgebanding, hardware, drilling)
5. Reconstructs 3D part geometry from orthographic measurements
6. Writes a DXF R2010 file with a 3D box mesh and a schema-validated JSON metadata file

Batch processing runs across all PDFs in the input directory. Only the first page of each PDF is processed.

---

## Requirements

- Python >= 3.11
- [ODA File Converter](https://www.opendesign.com/guestfiles/oda_file_converter) (optional, for `--dwg` output)

---

## Installation

```bash
# Clone the repository
git clone <repo-url>
cd pdf2cad

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install in development mode with dev dependencies
pip install -e ".[dev]"
```

---

## Usage

```
pdf2imos INPUT_DIR OUTPUT_DIR [OPTIONS]
```

Also runnable as a module:

```bash
python -m pdf2imos INPUT_DIR OUTPUT_DIR
```

### Arguments

| Argument | Description |
|---|---|
| `INPUT_DIR` | Directory containing PDF files to process |
| `OUTPUT_DIR` | Directory for output DXF and JSON files |

### Options

| Option | Description |
|---|---|
| `--stage STAGE` | Stop at a pipeline stage and dump intermediate JSON |
| `--tolerance FLOAT` | Dimension matching tolerance in mm (default: 0.5) |
| `--dwg` | Also convert DXF to DWG format (requires ODAFileConverter on PATH) |
| `--verbose` | Enable DEBUG logging |
| `--version` | Show version and exit |

**Available `--stage` values:** `extract`, `segment`, `classify`, `dimensions`, `annotations`, `assemble`, `output`

### Examples

```bash
# Process all PDFs in a directory
pdf2imos ./drawings ./output

# Stop at extraction stage for debugging
pdf2imos ./drawings ./output --stage extract

# Verbose output with custom tolerance
pdf2imos ./drawings ./output --verbose --tolerance 1.0

# Also generate DWG files (requires ODAFileConverter)
pdf2imos ./drawings ./output --dwg
```

### Exit Codes

| Code | Meaning |
|---|---|
| `0` | All PDFs processed successfully |
| `1` | Some PDFs failed, some succeeded |
| `2` | All PDFs failed, or invalid arguments |
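
The table above can be read as a small decision rule. The following is a minimal sketch, not the tool's actual code: `exit_code` and its parameters are hypothetical names, and treating an empty batch as success is an assumption, since the table does not cover that case.

```python
def exit_code(succeeded: int, failed: int) -> int:
    """Map batch results to the documented exit codes (hypothetical helper)."""
    if failed == 0:
        # All PDFs succeeded. Assumption: an empty batch is also reported as 0.
        return 0
    if succeeded > 0:
        # Mixed result: at least one success and at least one failure.
        return 1
    # Every PDF failed.
    return 2
```

In a shell script, this lets callers distinguish a partial failure (`1`) from a total failure or a usage error (`2`).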

---

## Pipeline Architecture

The pipeline runs in 7 sequential stages. Use `--stage` to halt after any stage and inspect intermediate output.

### Stage 1: Extract

Parses the PDF page with PyMuPDF. Extracts vector paths (lines, curves, rectangles, quads) and text spans with font, size, and color metadata. Flips the y-axis from PDF convention (origin top-left, y downward) to CAD convention (origin bottom-left, y upward). Filters degenerate and zero-area paths.
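
The y-axis flip and degenerate-path filter described above could look roughly like this. It is a sketch under stated assumptions: coordinates are plain `(x, y)` tuples, `page_height` is the page height in the same units, and the function names and `eps` threshold are hypothetical rather than taken from the package.

```python
Point = tuple[float, float]


def flip_y(point: Point, page_height: float) -> Point:
    """Convert a PDF-space point (y downward) to CAD space (y upward)."""
    x, y = point
    return (x, page_height - y)


def is_degenerate(start: Point, end: Point, eps: float = 1e-6) -> bool:
    """True when a segment has effectively zero length (eps is a hypothetical threshold)."""
    return abs(start[0] - end[0]) < eps and abs(start[1] - end[1]) < eps
```

A point on the top edge of the page (`y = 0` in PDF space) ends up at `y = page_height` in CAD space, and vice versa.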

### Stage 2: Segment

Detects and removes the title block using a bottom-right rectangle heuristic. Clusters remaining geometry by spatial proximity into view regions. Classifies clusters as FRONT, TOP, or SIDE views using third-angle projection layout conventions (US/AutoCAD standard).
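
In third-angle projection the top view sits above the front view and the side view sits to its right. A simplified labeling rule based on that layout is sketched below. The `Box` type, the `label_views` function, and the "front view is the cluster nearest the bottom-left" heuristic are all assumptions made for illustration; the real segmenter may use different logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a geometry cluster in CAD space (y upward)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self) -> tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def label_views(boxes: list[Box]) -> dict[str, Box]:
    """Label clusters as FRONT, TOP, or SIDE using third-angle layout."""
    # Assumption: the front view is the cluster closest to the bottom-left corner.
    front = min(boxes, key=lambda b: b.center[0] + b.center[1])
    labels = {"FRONT": front}
    fx, fy = front.center
    for box in boxes:
        if box is front:
            continue
        dx = box.center[0] - fx
        dy = box.center[1] - fy
        # Mostly above the front view means TOP; mostly to the right means SIDE.
        labels["TOP" if dy > dx else "SIDE"] = box
    return labels
```

A first-angle drawing places the top view below the front view, so a rule like this one would mislabel it.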

### Stage 3: Classify

Classifies each path by visual properties:

| Role | Visual Characteristics |
|---|---|
| `GEOMETRY` | Solid line, medium width |
| `HIDDEN` | Dashed line |
| `CENTER` | Dash-dot line |
| `DIMENSION` | Thin line, near arrowheads |
| `BORDER` | Thick line, large extent |
| `CONSTRUCTION` | Very thin line |

Arrowheads are detected as small filled triangles.
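
The table above maps naturally onto a rule cascade. The sketch below is illustrative only: the width thresholds (in PDF points), the dash-pattern conventions, and the `classify_line` signature are hypothetical, and the enum merely mirrors the role names from the table.

```python
from enum import Enum


class LineRole(Enum):
    GEOMETRY = "geometry"
    HIDDEN = "hidden"
    CENTER = "center"
    DIMENSION = "dimension"
    BORDER = "border"
    CONSTRUCTION = "construction"


# Hypothetical stroke-width thresholds in PDF points; real values would need tuning.
VERY_THIN = 0.1
THIN = 0.3
THICK = 1.0


def classify_line(
    width: float,
    dashes: tuple[float, ...] = (),
    near_arrowhead: bool = False,
    large_extent: bool = False,
) -> LineRole:
    """Assign a role from stroke width and dash pattern (illustrative rules only)."""
    if len(dashes) >= 4:
        return LineRole.CENTER  # dash, gap, dot, gap
    if dashes:
        return LineRole.HIDDEN  # simple dash, gap pattern
    if width >= THICK and large_extent:
        return LineRole.BORDER
    if width <= VERY_THIN:
        return LineRole.CONSTRUCTION
    if width <= THIN and near_arrowhead:
        return LineRole.DIMENSION
    return LineRole.GEOMETRY
```

Dash patterns are checked first because a dashed or dash-dot line keeps its role regardless of stroke width in this sketch.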

### Stage 4: Dimensions

Finds numeric text values via regex (digits with optional `mm` suffix). Converts text coordinates to CAD space. Matches each number to the nearest dimension or geometry line segment. Determines horizontal vs. vertical orientation.
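
A regex for "digits with optional `mm` suffix" might look like the following. The exact pattern used by the tool is not shown in this README, so `DIMENSION_RE` and `parse_dimension` are assumptions; this version also accepts a decimal fraction.

```python
import re
from typing import Optional

# Hypothetical pattern: an integer or decimal number, optionally followed by "mm".
DIMENSION_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:mm)?\s*$", re.IGNORECASE)


def parse_dimension(text: str) -> Optional[float]:
    """Return the numeric value in millimetres, or None if the text is not a dimension."""
    match = DIMENSION_RE.match(text)
    return float(match.group(1)) if match else None
```

Text such as a part name or a material callout does not match and is left for the annotation stage.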

### Stage 5: Annotations

Extracts structured annotations via regex:

- **Material specs:** type, thickness, finish
- **Edgebanding:** thickness, material
- **Hardware callouts:** brand, model
- **Drilling patterns:** diameter, depth, count

Also collects raw text annotations and title block metadata.

### Stage 6: Assemble

Reconstructs 3D part geometry (width × height × depth) from orthographic dimension measurements. Cross-validates dimensions across views: front height should match side height, and front width should match top width. When depth cannot be extracted, falls back to 18 mm, the standard furniture panel thickness.
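
The cross-validation and depth fallback can be sketched as follows. This is not the assembler's actual code: the `assemble` function and its parameters are hypothetical, raising `ValueError` on a mismatch is an assumption, and so is reusing the 0.5 mm `--tolerance` default for the cross-view comparison.

```python
from typing import Optional

DEFAULT_DEPTH_MM = 18.0  # fallback panel thickness stated in this README


def assemble(
    front_width: float,
    front_height: float,
    top_width: Optional[float] = None,
    side_height: Optional[float] = None,
    depth: Optional[float] = None,
    tolerance: float = 0.5,
) -> tuple[float, float, float]:
    """Return (width, height, depth) in mm after cross-checking the views."""
    if top_width is not None and abs(front_width - top_width) > tolerance:
        raise ValueError("front and top views disagree on width")
    if side_height is not None and abs(front_height - side_height) > tolerance:
        raise ValueError("front and side views disagree on height")
    # Fall back to the default thickness when no depth was extracted.
    return (front_width, front_height, DEFAULT_DEPTH_MM if depth is None else depth)
```

With only a front view available, `assemble(600.0, 720.0)` yields a depth of 18 mm.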

### Stage 7: Output

Generates:

- **DXF R2010** file with a 3D box mesh on the `GEOMETRY` layer, dimension text on `DIMENSIONS`, and part name on `ANNOTATIONS`
- **JSON metadata** file validated against `metadata.schema.json`
- Optional **DWG** file, converted from the DXF via ODAFileConverter (ACAD2018 format, 30-second timeout)

---

## Output Files

For each input `example.pdf`, two files are written to the output directory (plus a DWG file when `--dwg` is given and ODAFileConverter is available):

| File | Description |
|---|---|
| `example.dxf` | DXF R2010 file with 3D mesh geometry |
| `example.json` | JSON metadata validated against the bundled schema |

### JSON Metadata Structure

```json
{
  "source_pdf": "example.pdf",
  "extraction_timestamp": "2026-03-03T12:00:00Z",
  "part_name": "Panel A",
  "overall_dimensions": {
    "width_mm": 600.0,
    "height_mm": 720.0,
    "depth_mm": 18.0
  },
  "parts": [
    {
      "material": "...",
      "edgebanding": "...",
      "hardware": "...",
      "drilling": "..."
    }
  ],
  "raw_annotations": ["..."]
}
```

### Coordinate System

- **PDF space:** origin top-left, y increases downward
- **CAD/DXF space:** origin bottom-left, y increases upward
- The pipeline flips y-coordinates during extraction
- **DXF axes:** X = width, Y = depth, Z = height
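
Given the axis mapping above, the eight corners of the box mesh can be derived from the part dimensions. The sketch below assumes the box has one corner at the origin, which this README does not state, and `box_vertices` is a hypothetical helper rather than part of the DXF writer.

```python
def box_vertices(width: float, depth: float, height: float) -> list[tuple[float, float, float]]:
    """Corners of an axis-aligned box with X = width, Y = depth, Z = height."""
    # Assumption: one corner sits at the origin (0, 0, 0).
    return [
        (x, y, z)
        for z in (0.0, height)
        for y in (0.0, depth)
        for x in (0.0, width)
    ]
```

For a 600 × 720 × 18 mm panel, the call would be `box_vertices(600.0, 18.0, 720.0)`, since depth maps to Y and height to Z.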

---

## Project Structure

```
pdf2cad/
├── pyproject.toml
├── src/
│   └── pdf2imos/
│       ├── __init__.py              # Version
│       ├── __main__.py              # python -m entry point
│       ├── cli.py                   # Typer CLI and pipeline orchestration
│       ├── errors.py                # Exception hierarchy
│       ├── extract/                 # Stage 1: PDF parsing
│       │   ├── geometry.py          #   Vector path extraction
│       │   └── text.py              #   Text span extraction
│       ├── interpret/               # Stages 2-3: Layout understanding
│       │   ├── line_classifier.py   #   Line role classification
│       │   ├── title_block.py       #   Title block detection
│       │   └── view_segmenter.py    #   Orthographic view segmentation
│       ├── models/                  # Data models (frozen dataclasses)
│       │   ├── annotations.py       #   Material, edgeband, hardware, drilling
│       │   ├── classified.py        #   ClassifiedLine, LineRole
│       │   ├── geometry.py          #   PartGeometry (3D dimensions)
│       │   ├── pipeline.py          #   PipelineResult
│       │   ├── primitives.py        #   RawPath, RawText, PageExtraction
│       │   └── views.py             #   ViewRegion, ViewType
│       ├── output/                  # Stage 7: File generation
│       │   ├── dxf_writer.py        #   DXF 3D mesh output
│       │   ├── dwg_converter.py     #   Optional DWG conversion
│       │   └── json_writer.py       #   JSON metadata output
│       ├── parse/                   # Stages 4-5: Content extraction
│       │   ├── annotations.py       #   Annotation regex parsing
│       │   └── dimensions.py        #   Dimension measurement extraction
│       ├── reconstruct/             # Stage 6: 3D assembly
│       │   └── assembler.py         #   Part geometry reconstruction
│       └── schema/                  # JSON Schema
│           ├── metadata.schema.json #   Metadata validation schema
│           └── validator.py         #   Schema validation wrapper
└── tests/
    ├── conftest.py
    ├── generate_fixtures.py         # Synthetic test PDF generator
    ├── fixtures/
    │   ├── input/                   # Test PDFs (4 files)
    │   └── expected/                # Expected JSON outputs
    ├── integration/
    │   ├── test_golden.py           # Golden file comparison
    │   └── test_pipeline.py         # End-to-end pipeline tests
    ├── test_annotation_extractor.py
    ├── test_assembler.py
    ├── test_cli.py
    ├── test_dimension_extractor.py
    ├── test_dwg_converter.py
    ├── test_dxf_writer.py
    ├── test_error_handling.py
    ├── test_geometry_extractor.py
    ├── test_json_writer.py
    ├── test_line_classifier.py
    ├── test_models.py
    ├── test_schema.py
    ├── test_text_extractor.py
    ├── test_title_block.py
    └── test_view_segmenter.py
```

---

## Development

### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=pdf2imos

# Run only unit tests (skip integration)
pytest tests/ --ignore=tests/integration

# Run only integration tests
pytest tests/integration/

# Regenerate test fixture PDFs
python tests/generate_fixtures.py
```

### Test Fixtures

Four synthetic PDFs are generated by `tests/generate_fixtures.py`:

| File | Description |
|---|---|
| `simple_panel.pdf` | 600 × 720 × 18 mm flat panel, 3 orthographic views |
| `cabinet_basic.pdf` | 600 × 720 × 400 mm cabinet with material and edgebanding annotations |
| `panel_with_drilling.pdf` | 600 × 720 × 18 mm panel with shelf pin holes and drilling annotations |
| `edge_cases.pdf` | 600 × 720 × 3 mm ultra-thin back panel with closely spaced and redundant dimensions |

### Linting

```bash
ruff check src/ tests/
ruff format src/ tests/
```

Ruff is configured with line-length 100, target Python 3.11, and E/F/I rule sets.

---

## Error Handling

All exceptions inherit from `Pdf2ImosError`:

| Exception | Raised When |
|---|---|
| `PdfExtractionError` | Invalid, corrupt, or empty PDF; no vector content found |
| `ViewSegmentationError` | View segmentation fails to identify orthographic regions |
| `DimensionExtractionError` | No dimensions found, or 3D assembly fails |
| `OutputWriteError` | DXF, JSON, or DWG file cannot be written |

When processing a batch, failures on individual PDFs are caught and reported. The exit code reflects whether all, some, or none of the files succeeded.
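
A minimal sketch of this pattern is shown below. The exception names come from the table above, but the class bodies, the `run_batch` function, and its `process` callback are assumptions made for illustration.

```python
from collections.abc import Callable, Iterable


class Pdf2ImosError(Exception):
    """Base class for all pipeline errors."""


class PdfExtractionError(Pdf2ImosError): ...
class ViewSegmentationError(Pdf2ImosError): ...
class DimensionExtractionError(Pdf2ImosError): ...
class OutputWriteError(Pdf2ImosError): ...


def run_batch(
    paths: Iterable[str],
    process: Callable[[str], None],
) -> tuple[list[str], list[tuple[str, str]]]:
    """Process each PDF, collecting failures instead of aborting the batch."""
    succeeded: list[str] = []
    failed: list[tuple[str, str]] = []
    for path in paths:
        try:
            process(path)  # hypothetical per-file pipeline entry point
        except Pdf2ImosError as exc:
            failed.append((path, str(exc)))
        else:
            succeeded.append(path)
    return succeeded, failed
```

Catching only `Pdf2ImosError` keeps expected pipeline failures contained while letting unexpected exceptions surface.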

---

## Design Notes

**Frozen dataclasses.** All pipeline models use frozen dataclasses, making intermediate data immutable and safe to pass between stages without defensive copying.
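
As a brief illustration of the pattern, a frozen dataclass rejects attribute assignment, and `dataclasses.replace` produces a modified copy instead. The `PartGeometrySketch` class and its field names are hypothetical and do not reflect the actual `PartGeometry` model.

```python
import dataclasses
from dataclasses import dataclass


@dataclass(frozen=True)
class PartGeometrySketch:
    """Illustrative stand-in for a pipeline model; field names are assumptions."""
    width_mm: float
    height_mm: float
    depth_mm: float


panel = PartGeometrySketch(width_mm=600.0, height_mm=720.0, depth_mm=18.0)

# Assigning to panel.depth_mm would raise dataclasses.FrozenInstanceError.
# To change a value, build a new instance and leave the original untouched.
thin_panel = dataclasses.replace(panel, depth_mm=3.0)
```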

**Third-angle projection.** The view segmenter assumes US/AutoCAD third-angle projection layout. First-angle (ISO) drawings will produce incorrect view assignments.

**Depth fallback.** When depth cannot be extracted from the drawing (common for flat panels), the assembler defaults to 18mm, the standard furniture panel thickness.

**32mm system.** Drilling hole placement assumes the 32mm system spacing standard used in European furniture manufacturing.
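
In the 32mm system, holes in a column are spaced at a fixed 32 mm pitch. The sketch below generates such positions along one axis; `hole_positions` is a hypothetical helper, and the starting offset is an arbitrary example value, not something the tool prescribes.

```python
PITCH_MM = 32.0  # hole-to-hole spacing in the 32mm system


def hole_positions(first_mm: float, count: int, pitch_mm: float = PITCH_MM) -> list[float]:
    """Positions of `count` holes along one axis, starting at `first_mm`."""
    return [first_mm + i * pitch_mm for i in range(count)]
```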

**DWG conversion.** ODA File Converter must be installed and available on PATH for `--dwg` to have any effect. If it's absent, the flag is silently ignored and only DXF output is written.
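
One way to implement this kind of availability check is `shutil.which` from the standard library. The sketch below is an assumption about the approach, not the converter module's actual code, and the function names are hypothetical.

```python
import shutil
from typing import Optional


def find_converter(executable: str = "ODAFileConverter") -> Optional[str]:
    """Return the converter's full path if it is on PATH, else None."""
    return shutil.which(executable)


def should_convert_to_dwg(dwg_requested: bool, executable: str = "ODAFileConverter") -> bool:
    """Convert only when --dwg was given and the converter is available."""
    return dwg_requested and find_converter(executable) is not None
```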

**Page scope.** Only the first page of each PDF is processed. Multi-page drawings are not currently supported.