# pdf2imos

A Python CLI tool that converts PDF technical drawings into DXF 3D CAD files and structured JSON metadata for imos CAD furniture manufacturing workflows.

It parses vector geometry and text from PDF pages, identifies orthographic views, extracts dimensions and annotations, reconstructs 3D part geometry, and outputs industry-standard DXF files alongside validated JSON sidecar metadata.

---

## Table of Contents

- [Overview](#overview)
- [Requirements](#requirements)
- [Installation](#installation)
- [Usage](#usage)
- [Pipeline Architecture](#pipeline-architecture)
- [Output Files](#output-files)
- [Project Structure](#project-structure)
- [Development](#development)
- [Error Handling](#error-handling)
- [Design Notes](#design-notes)

---

## Overview

pdf2imos targets AutoCAD-style orthographic projection drawings of furniture parts. Given a directory of PDFs, it:

1. Extracts vector paths and text from each page
2. Segments the drawing into front, top, and side views
3. Classifies line roles (geometry, hidden, center, dimension, etc.)
4. Extracts dimension measurements and structured annotations (material, edgebanding, hardware, drilling)
5. Reconstructs 3D part geometry from orthographic measurements
6. Writes a DXF R2010 file with a 3D box mesh and a schema-validated JSON metadata file

Batch processing runs across all PDFs in the input directory. Only the first page of each PDF is processed.

---

## Requirements

- Python >= 3.11
- [ODA File Converter](https://www.opendesign.com/guestfiles/oda_file_converter) (optional, for `--dwg` output)

---

## Installation

```bash
# Clone the repository
git clone <repo-url>
cd pdf2cad

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install in development mode with dev dependencies
pip install -e ".[dev]"
```

---

## Usage

```
pdf2imos INPUT_DIR OUTPUT_DIR [OPTIONS]
```

Also runnable as a module:

```bash
python -m pdf2imos INPUT_DIR OUTPUT_DIR
```

### Arguments

| Argument | Description |
|---|---|
| `INPUT_DIR` | Directory containing PDF files to process |
| `OUTPUT_DIR` | Directory for output DXF and JSON files |

### Options

| Option | Description |
|---|---|
| `--stage STAGE` | Stop at a pipeline stage and dump intermediate JSON |
| `--tolerance FLOAT` | Dimension matching tolerance in mm (default: 0.5) |
| `--dwg` | Also convert DXF to DWG format (requires ODAFileConverter on PATH) |
| `--verbose` | Enable DEBUG logging |
| `--version` | Show version and exit |

**Available `--stage` values:** `extract`, `segment`, `classify`, `dimensions`, `annotations`, `assemble`, `output`

### Examples

```bash
# Process all PDFs in a directory
pdf2imos ./drawings ./output

# Stop at extraction stage for debugging
pdf2imos ./drawings ./output --stage extract

# Verbose output with custom tolerance
pdf2imos ./drawings ./output --verbose --tolerance 1.0

# Also generate DWG files (requires ODAFileConverter)
pdf2imos ./drawings ./output --dwg
```

### Exit Codes

| Code | Meaning |
|---|---|
| `0` | All PDFs processed successfully |
| `1` | Some PDFs failed, some succeeded |
| `2` | All PDFs failed, or invalid arguments |
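
The table above can be read as a small decision rule. The following is a minimal sketch, not the tool's actual code; `batch_exit_code` is a hypothetical name, and treating an empty batch as code `2` is an assumption the table does not settle.

```python
def batch_exit_code(succeeded: int, failed: int) -> int:
    """Map batch results to an exit code (hypothetical helper)."""
    if succeeded > 0 and failed == 0:
        return 0  # all PDFs processed successfully
    if succeeded > 0:
        return 1  # some PDFs failed, some succeeded
    # All PDFs failed. An empty batch also lands here, which is an assumption.
    return 2


print(batch_exit_code(3, 0), batch_exit_code(2, 1), batch_exit_code(0, 3))
```

A shell script wrapping the CLI could use the same distinction to decide between continuing, warning, or aborting.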

---

## Pipeline Architecture

The pipeline runs in 7 sequential stages. Use `--stage` to halt after any stage and inspect intermediate output.
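
As a rough illustration of halting after a named stage, the sketch below selects which stages would run for a given `--stage` value. The stage names come from the options list; `stages_to_run` is a hypothetical helper, not a function from the codebase.

```python
from typing import Optional

# Stage names in pipeline order, as listed under "Available --stage values".
STAGES = ["extract", "segment", "classify", "dimensions",
          "annotations", "assemble", "output"]


def stages_to_run(stop_after: Optional[str] = None) -> list[str]:
    """Return the stages executed when halting after `stop_after` (hypothetical)."""
    if stop_after is None:
        return list(STAGES)  # no --stage given: run the full pipeline
    if stop_after not in STAGES:
        raise ValueError(f"unknown stage: {stop_after!r}")
    return STAGES[: STAGES.index(stop_after) + 1]


print(stages_to_run("classify"))
```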

### Stage 1: Extract

Parses the PDF page with PyMuPDF. Extracts vector paths (lines, curves, rectangles, quads) and text spans with font, size, and color metadata. Flips the y-axis from PDF convention (origin top-left, y downward) to CAD convention (origin bottom-left, y upward). Filters degenerate and zero-area paths.
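
The y-axis flip is a reflection about the page height. A minimal sketch, assuming the page height is known in the same units as the coordinates; `flip_y` and `flip_point` are illustrative names, not the project's functions.

```python
def flip_y(y_pdf: float, page_height: float) -> float:
    """Convert a y-coordinate from PDF space (y down) to CAD space (y up)."""
    return page_height - y_pdf


def flip_point(point: tuple[float, float], page_height: float) -> tuple[float, float]:
    """Flip a single (x, y) point; x is unchanged."""
    x, y = point
    return (x, flip_y(y, page_height))


# A point near the bottom of an 842-unit-tall page ends up near y = 0 in CAD space.
print(flip_point((10.0, 800.0), 842.0))
```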

### Stage 2: Segment

Detects and removes the title block using a bottom-right rectangle heuristic. Clusters remaining geometry by spatial proximity into view regions. Classifies clusters as FRONT, TOP, or SIDE views using third-angle projection layout conventions (US/AutoCAD standard).
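
In a third-angle layout, the top view sits above the front view and the right-side view sits to its right. The sketch below labels three already-clustered bounding boxes on that basis, in CAD coordinates (y up). It is a simplified heuristic under those assumptions; `Box` and `label_views` are hypothetical and not the segmenter's real interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of one view cluster (hypothetical type)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self) -> tuple[float, float]:
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)


def label_views(boxes: list[Box]) -> dict[str, Box]:
    """Label three clusters as FRONT, TOP, SIDE assuming third-angle layout."""
    if len(boxes) != 3:
        raise ValueError("expected exactly three view clusters")
    # The front view is the lower-left cluster: smallest x + y of its center.
    front = min(boxes, key=lambda b: b.center[0] + b.center[1])
    others = [b for b in boxes if b is not front]
    # Of the remaining two, the one displaced further upward is the top view.
    top = max(others, key=lambda b: b.center[1] - front.center[1])
    side = next(b for b in others if b is not top)
    return {"FRONT": front, "TOP": top, "SIDE": side}


views = label_views([
    Box(700, 0, 718, 720),   # narrow box to the right
    Box(0, 800, 600, 818),   # flat box above
    Box(0, 0, 600, 720),     # large box at lower left
])
print(views["FRONT"], views["TOP"], views["SIDE"])
```

A first-angle drawing would place the same views on the opposite sides, so this rule would mislabel them.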

### Stage 3: Classify

Classifies each path by visual properties:

| Role | Visual Characteristics |
|---|---|
| `GEOMETRY` | Solid line, medium width |
| `HIDDEN` | Dashed line |
| `CENTER` | Dash-dot line |
| `DIMENSION` | Thin line, near arrowheads |
| `BORDER` | Thick line, large extent |
| `CONSTRUCTION` | Very thin line |

Arrowheads are detected as small filled triangles.
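
The role table can be approximated with a few threshold checks. This is only a sketch: the width thresholds, the "large extent" rule, and the dash-pattern test are invented for illustration, and the `LineRole` enum here is a stand-in for the one in `models/classified.py`.

```python
from enum import Enum


class LineRole(Enum):
    GEOMETRY = "geometry"
    HIDDEN = "hidden"
    CENTER = "center"
    DIMENSION = "dimension"
    BORDER = "border"
    CONSTRUCTION = "construction"


# Hypothetical stroke-width thresholds in PDF points.
VERY_THIN, THIN, THICK = 0.1, 0.3, 0.7


def classify_line(width: float, dashes: tuple[float, ...], length: float,
                  near_arrowhead: bool, page_diagonal: float) -> LineRole:
    """Assign a role from stroke width, dash pattern, and context (sketch)."""
    if dashes:
        # Assumption: a dash-dot pattern has at least four on/off entries.
        return LineRole.CENTER if len(dashes) >= 4 else LineRole.HIDDEN
    if width >= THICK and length >= 0.5 * page_diagonal:
        return LineRole.BORDER
    if width <= THIN and near_arrowhead:
        return LineRole.DIMENSION
    if width <= VERY_THIN:
        return LineRole.CONSTRUCTION
    return LineRole.GEOMETRY


print(classify_line(0.5, (), 100.0, False, 1000.0))
```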

### Stage 4: Dimensions

Finds numeric text values via regex (digits with optional `mm` suffix). Converts text coordinates to CAD space. Matches each number to the nearest dimension or geometry line segment. Determines horizontal vs. vertical orientation.
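
One way to picture this stage: parse the number, find the closest segment by point-to-segment distance, and read the orientation from the segment's direction. The regex and helper names below are assumptions made for the sketch (for example, accepting decimals), not the project's actual pattern.

```python
import math
import re
from typing import Optional

# Assumed pattern: digits, optional decimal part, optional "mm" suffix.
DIM_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(?:mm)?\s*$", re.IGNORECASE)

Point = tuple[float, float]
Segment = tuple[Point, Point]


def parse_dimension(text: str) -> Optional[float]:
    """Return the numeric value of a dimension label, or None if not numeric."""
    match = DIM_RE.match(text)
    return float(match.group(1)) if match else None


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Shortest distance from point p to the segment a-b."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    length_sq = dx * dx + dy * dy
    if length_sq == 0:
        return math.hypot(p[0] - a[0], p[1] - a[1])
    # Project p onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / length_sq))
    return math.hypot(p[0] - (a[0] + t * dx), p[1] - (a[1] + t * dy))


def match_dimension(text_pos: Point, segments: list[Segment]) -> tuple[Segment, str]:
    """Pick the nearest segment and report its orientation."""
    seg = min(segments, key=lambda s: point_segment_distance(text_pos, s[0], s[1]))
    (ax, ay), (bx, by) = seg
    orientation = "horizontal" if abs(bx - ax) >= abs(by - ay) else "vertical"
    return seg, orientation


segments = [((0.0, 0.0), (600.0, 0.0)), ((0.0, 0.0), (0.0, 720.0))]
print(parse_dimension("600 mm"), match_dimension((300.0, -20.0), segments))
```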

### Stage 5: Annotations

Extracts structured annotations via regex:

- **Material specs** -- type, thickness, finish
- **Edgebanding** -- thickness, material
- **Hardware callouts** -- brand, model
- **Drilling patterns** -- diameter, depth, count

Also collects raw text annotations and title block metadata.
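
To give a feel for regex-based annotation parsing, here is a sketch for a drilling callout. The annotation wording (`4x dia 5mm x 12mm deep`) and the pattern are invented for this example; the formats pdf2imos actually recognizes are not documented here.

```python
import re
from typing import Optional

# Hypothetical callout format: "<count>x dia <diameter>mm x <depth>mm".
# \u00d8 is the diameter sign, accepted as an alternative to "dia".
DRILLING_RE = re.compile(
    r"(?P<count>\d+)\s*x\s*(?:\u00d8|dia\.?)\s*(?P<diameter>\d+(?:\.\d+)?)\s*mm"
    r"\s*x\s*(?P<depth>\d+(?:\.\d+)?)\s*mm",
    re.IGNORECASE,
)


def parse_drilling(text: str) -> Optional[dict]:
    """Extract count, diameter, and depth from a drilling callout, if present."""
    match = DRILLING_RE.search(text)
    if match is None:
        return None
    return {
        "count": int(match.group("count")),
        "diameter_mm": float(match.group("diameter")),
        "depth_mm": float(match.group("depth")),
    }


print(parse_drilling("4x dia 5mm x 12mm deep"))
```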

### Stage 6: Assemble

Reconstructs 3D part geometry (width x height x depth) from orthographic dimension measurements. Cross-validates dimensions across views: front height should match side height, front width should match top width. Falls back to 18mm depth when depth cannot be extracted (standard furniture panel thickness).
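
The cross-validation and depth fallback might look like the sketch below. Two points are assumptions: that the comparison reuses the 0.5 mm `--tolerance` default, and that a mismatch raises an error rather than logging a warning. `assemble` is a hypothetical function name.

```python
from typing import Optional

DEFAULT_DEPTH_MM = 18.0  # fallback panel thickness named in the text


def assemble(front_width: float, front_height: float,
             top_width: Optional[float] = None,
             side_height: Optional[float] = None,
             depth: Optional[float] = None,
             tolerance: float = 0.5) -> tuple[float, float, float]:
    """Return (width, height, depth) in mm after cross-checking views (sketch)."""
    if top_width is not None and abs(front_width - top_width) > tolerance:
        raise ValueError("front width does not match top width")
    if side_height is not None and abs(front_height - side_height) > tolerance:
        raise ValueError("front height does not match side height")
    # Use the fallback only when no depth was measured at all.
    return (front_width, front_height, DEFAULT_DEPTH_MM if depth is None else depth)


print(assemble(600.0, 720.0, top_width=600.2, side_height=720.0))
```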

### Stage 7: Output

Generates:

- **DXF R2010** file with a 3D box mesh on the `GEOMETRY` layer, dimension text on `DIMENSIONS`, and part name on `ANNOTATIONS`
- **JSON metadata** file validated against `metadata.schema.json`
- Optionally converts DXF to DWG via ODAFileConverter (ACAD2018 format, 30-second timeout)

---

## Output Files

For each input `example.pdf`, two files are written to the output directory:

| File | Description |
|---|---|
| `example.dxf` | DXF R2010 file with 3D mesh geometry |
| `example.json` | JSON metadata validated against the bundled schema |

### JSON Metadata Structure

```json
{
  "source_pdf": "example.pdf",
  "extraction_timestamp": "2026-03-03T12:00:00Z",
  "part_name": "Panel A",
  "overall_dimensions": {
    "width_mm": 600.0,
    "height_mm": 720.0,
    "depth_mm": 18.0
  },
  "parts": [
    {
      "material": "...",
      "edgebanding": "...",
      "hardware": "...",
      "drilling": "..."
    }
  ],
  "raw_annotations": ["..."]
}
```

### Coordinate System

- **PDF space:** origin top-left, y increases downward
- **CAD/DXF space:** origin bottom-left, y increases upward
- The pipeline flips y-coordinates during extraction
- **DXF axes:** X = width, Y = depth, Z = height
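
To make the axis mapping concrete, the sketch below builds the eight vertices and six quad faces of a box with X = width, Y = depth, Z = height. It is plain Python with no DXF library involved; `box_mesh` is a hypothetical helper, and placing one corner at the origin is an assumption.

```python
Vertex = tuple[float, float, float]
Face = tuple[int, int, int, int]


def box_mesh(width: float, height: float, depth: float) -> tuple[list[Vertex], list[Face]]:
    """Return vertices and quad faces for a box with one corner at the origin."""
    w, d, h = width, depth, height
    vertices = [
        (0, 0, 0), (w, 0, 0), (w, d, 0), (0, d, 0),   # bottom ring, z = 0
        (0, 0, h), (w, 0, h), (w, d, h), (0, d, h),   # top ring, z = height
    ]
    # Each face is wound counter-clockwise when viewed from outside the box.
    faces = [
        (0, 3, 2, 1),  # bottom (z = 0)
        (4, 5, 6, 7),  # top (z = height)
        (0, 1, 5, 4),  # front (y = 0)
        (2, 3, 7, 6),  # back (y = depth)
        (1, 2, 6, 5),  # right (x = width)
        (3, 0, 4, 7),  # left (x = 0)
    ]
    return vertices, faces


vertices, faces = box_mesh(width=600.0, height=720.0, depth=18.0)
print(len(vertices), len(faces))
```

Note that height goes to Z, not Y, so a 600x720x18 panel spans 18 units along the DXF Y axis.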

---

## Project Structure

```
pdf2cad/
├── pyproject.toml
├── src/
│   └── pdf2imos/
│       ├── __init__.py  # Version
│       ├── __main__.py  # python -m entry point
│       ├── cli.py  # Typer CLI and pipeline orchestration
│       ├── errors.py  # Exception hierarchy
│       ├── extract/  # Stage 1: PDF parsing
│       │   ├── geometry.py  # Vector path extraction
│       │   └── text.py  # Text span extraction
│       ├── interpret/  # Stages 2-3: Layout understanding
│       │   ├── line_classifier.py  # Line role classification
│       │   ├── title_block.py  # Title block detection
│       │   └── view_segmenter.py  # Orthographic view segmentation
│       ├── models/  # Data models (frozen dataclasses)
│       │   ├── annotations.py  # Material, edgeband, hardware, drilling
│       │   ├── classified.py  # ClassifiedLine, LineRole
│       │   ├── geometry.py  # PartGeometry (3D dimensions)
│       │   ├── pipeline.py  # PipelineResult
│       │   ├── primitives.py  # RawPath, RawText, PageExtraction
│       │   └── views.py  # ViewRegion, ViewType
│       ├── output/  # Stage 7: File generation
│       │   ├── dxf_writer.py  # DXF 3D mesh output
│       │   ├── dwg_converter.py  # Optional DWG conversion
│       │   └── json_writer.py  # JSON metadata output
│       ├── parse/  # Stages 4-5: Content extraction
│       │   ├── annotations.py  # Annotation regex parsing
│       │   └── dimensions.py  # Dimension measurement extraction
│       ├── reconstruct/  # Stage 6: 3D assembly
│       │   └── assembler.py  # Part geometry reconstruction
│       └── schema/  # JSON Schema
│           ├── metadata.schema.json  # Metadata validation schema
│           └── validator.py  # Schema validation wrapper
└── tests/
    ├── conftest.py
    ├── generate_fixtures.py  # Synthetic test PDF generator
    ├── fixtures/
    │   ├── input/  # Test PDFs (4 files)
    │   └── expected/  # Expected JSON outputs
    ├── integration/
    │   ├── test_golden.py  # Golden file comparison
    │   └── test_pipeline.py  # End-to-end pipeline tests
    ├── test_annotation_extractor.py
    ├── test_assembler.py
    ├── test_cli.py
    ├── test_dimension_extractor.py
    ├── test_dwg_converter.py
    ├── test_dxf_writer.py
    ├── test_error_handling.py
    ├── test_geometry_extractor.py
    ├── test_json_writer.py
    ├── test_line_classifier.py
    ├── test_models.py
    ├── test_schema.py
    ├── test_text_extractor.py
    ├── test_title_block.py
    └── test_view_segmenter.py
```

---

## Development

### Running Tests

```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=pdf2imos

# Run only unit tests (skip integration)
pytest tests/ --ignore=tests/integration

# Run only integration tests
pytest tests/integration/

# Regenerate test fixture PDFs
python tests/generate_fixtures.py
```

### Test Fixtures

Four synthetic PDFs are generated by `tests/generate_fixtures.py`:

| File | Description |
|---|---|
| `simple_panel.pdf` | 600x720x18mm flat panel, 3 orthographic views |
| `cabinet_basic.pdf` | 600x720x400mm cabinet with material and edgebanding annotations |
| `panel_with_drilling.pdf` | 600x720x18mm panel with shelf pin holes and drilling annotations |
| `edge_cases.pdf` | 600x720x3mm ultra-thin back panel with closely spaced and redundant dimensions |

### Linting

```bash
ruff check src/ tests/
ruff format src/ tests/
```

Ruff is configured with line-length 100, target Python 3.11, and E/F/I rule sets.

---

## Error Handling

All exceptions inherit from `Pdf2ImosError`:

| Exception | Raised When |
|---|---|
| `PdfExtractionError` | Invalid, corrupt, or empty PDF; no vector content found |
| `ViewSegmentationError` | View segmentation fails to identify orthographic regions |
| `DimensionExtractionError` | No dimensions found, or 3D assembly fails |
| `OutputWriteError` | DXF, JSON, or DWG file cannot be written |

When processing a batch, failures on individual PDFs are caught and reported. The exit code reflects whether all, some, or none of the files succeeded.
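
The per-file isolation described above could be sketched as follows. The exception classes are re-declared here only so the snippet runs on its own (the real ones live in `errors.py`), and `run_batch` and its `process` callback are hypothetical.

```python
from typing import Callable


class Pdf2ImosError(Exception):
    """Base class; stand-in for the project's own definition."""


class PdfExtractionError(Pdf2ImosError):
    pass


class ViewSegmentationError(Pdf2ImosError):
    pass


def run_batch(paths: list[str],
              process: Callable[[str], None]) -> tuple[list[str], list[tuple[str, str]]]:
    """Process each PDF, collecting failures instead of aborting the batch."""
    succeeded: list[str] = []
    failed: list[tuple[str, str]] = []
    for path in paths:
        try:
            process(path)
        except Pdf2ImosError as exc:
            # Only pipeline errors are caught; unexpected exceptions still propagate.
            failed.append((path, f"{type(exc).__name__}: {exc}"))
        else:
            succeeded.append(path)
    return succeeded, failed


def fake_process(path: str) -> None:
    """Stand-in for the real pipeline: fails on one named file."""
    if path == "bad.pdf":
        raise PdfExtractionError("no vector content found")


print(run_batch(["a.pdf", "bad.pdf", "c.pdf"], fake_process))
```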

---

## Design Notes

**Frozen dataclasses.** All pipeline models use frozen dataclasses, making intermediate data immutable and safe to pass between stages without defensive copying.
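
A short illustration of what freezing buys: assignment to a field raises, and a modified copy is made with `dataclasses.replace`. The `PartGeometry` fields shown are guesses based on the JSON keys, not the actual model definition.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class PartGeometry:
    """Illustrative model; field names are assumptions."""
    width_mm: float
    height_mm: float
    depth_mm: float


part = PartGeometry(600.0, 720.0, 18.0)

try:
    part.depth_mm = 19.0  # frozen instances reject attribute assignment
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False

# replace() builds a new instance and leaves the original untouched.
thicker = replace(part, depth_mm=19.0)
print(mutation_blocked, part.depth_mm, thicker.depth_mm)
```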

**Third-angle projection.** The view segmenter assumes US/AutoCAD third-angle projection layout. First-angle (ISO) drawings will produce incorrect view assignments.

**Depth fallback.** When depth cannot be extracted from the drawing (common for flat panels), the assembler defaults to 18mm, the standard furniture panel thickness.

**32mm system.** Drilling hole placement assumes the 32mm system spacing standard used in European furniture manufacturing.
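
In the 32mm system, holes in a row sit on a 32 mm pitch. A minimal sketch of generating such a row, assuming the first hole's offset is supplied by the caller; `hole_positions` is a hypothetical helper and the 64 mm start is an arbitrary example value.

```python
PITCH_MM = 32.0  # centre-to-centre spacing in the 32mm system


def hole_positions(start_mm: float, count: int, pitch_mm: float = PITCH_MM) -> list[float]:
    """Return positions along one axis for `count` holes on a fixed pitch."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return [start_mm + i * pitch_mm for i in range(count)]


print(hole_positions(64.0, 4))
```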

**DWG conversion.** ODA File Converter must be installed and available on PATH for `--dwg` to have any effect. If it's absent, the flag is silently ignored and only DXF output is written.

**Page scope.** Only the first page of each PDF is processed. Multi-page drawings are not currently supported.