From 120f018906eb3c686ef16fc773e5d43f6ebe1400 Mon Sep 17 00:00:00 2001
From: repi
Date: Tue, 3 Mar 2026 21:30:31 +0000
Subject: [PATCH] chore: README.md

---
 README.md | 353 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 353 insertions(+)
 create mode 100644 README.md

diff --git a/README.md b/README.md
new file mode 100644
index 0000000..65c827b
--- /dev/null
+++ b/README.md
@@ -0,0 +1,353 @@
+# pdf2imos
+
+A Python CLI tool that converts PDF technical drawings into DXF 3D CAD files and structured JSON metadata for imos CAD furniture manufacturing workflows.
+
+It parses vector geometry and text from PDF pages, identifies orthographic views, extracts dimensions and annotations, reconstructs 3D part geometry, and outputs industry-standard DXF files alongside validated JSON sidecar metadata.
+
+---
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Requirements](#requirements)
+- [Installation](#installation)
+- [Usage](#usage)
+- [Pipeline Architecture](#pipeline-architecture)
+- [Output Files](#output-files)
+- [Project Structure](#project-structure)
+- [Development](#development)
+- [Error Handling](#error-handling)
+- [Design Notes](#design-notes)
+
+---
+
+## Overview
+
+pdf2imos targets AutoCAD-style orthographic projection drawings of furniture parts. Given a directory of PDFs, it:
+
+1. Extracts vector paths and text from each page
+2. Segments the drawing into front, top, and side views
+3. Classifies line roles (geometry, hidden, center, dimension, etc.)
+4. Extracts dimension measurements and structured annotations (material, edgebanding, hardware, drilling)
+5. Reconstructs 3D part geometry from orthographic measurements
+6. Writes a DXF R2010 file with a 3D box mesh and a schema-validated JSON metadata file
+
+Batch processing runs across all PDFs in the input directory. Only the first page of each PDF is processed.
+
+---
+
+## Requirements
+
+- Python >= 3.11
+- [ODA File Converter](https://www.opendesign.com/guestfiles/oda_file_converter) (optional, for `--dwg` output)
+
+---
+
+## Installation
+
+```bash
+# Clone the repository
+git clone
+cd pdf2cad
+
+# Create a virtual environment
+python -m venv venv
+source venv/bin/activate  # Windows: venv\Scripts\activate
+
+# Install in development mode with dev dependencies
+pip install -e ".[dev]"
+```
+
+---
+
+## Usage
+
+```
+pdf2imos INPUT_DIR OUTPUT_DIR [OPTIONS]
+```
+
+Also runnable as a module:
+
+```bash
+python -m pdf2imos INPUT_DIR OUTPUT_DIR
+```
+
+### Arguments
+
+| Argument | Description |
+|---|---|
+| `INPUT_DIR` | Directory containing PDF files to process |
+| `OUTPUT_DIR` | Directory for output DXF and JSON files |
+
+### Options
+
+| Option | Description |
+|---|---|
+| `--stage STAGE` | Stop at a pipeline stage and dump intermediate JSON |
+| `--tolerance FLOAT` | Dimension matching tolerance in mm (default: 0.5) |
+| `--dwg` | Also convert DXF to DWG format (requires ODAFileConverter on PATH) |
+| `--verbose` | Enable DEBUG logging |
+| `--version` | Show version and exit |
+
+**Available `--stage` values:** `extract`, `segment`, `classify`, `dimensions`, `annotations`, `assemble`, `output`
+
+### Examples
+
+```bash
+# Process all PDFs in a directory
+pdf2imos ./drawings ./output
+
+# Stop at extraction stage for debugging
+pdf2imos ./drawings ./output --stage extract
+
+# Verbose output with custom tolerance
+pdf2imos ./drawings ./output --verbose --tolerance 1.0
+
+# Also generate DWG files (requires ODAFileConverter)
+pdf2imos ./drawings ./output --dwg
+```
+
+### Exit Codes
+
+| Code | Meaning |
+|---|---|
+| `0` | All PDFs processed successfully |
+| `1` | Some PDFs failed, some succeeded |
+| `2` | All PDFs failed, or invalid arguments |
+
+---
+
+## Pipeline Architecture
+
+The pipeline runs in 7 sequential stages. Use `--stage` to halt after any stage and inspect intermediate output.
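As a rough illustration of how halting with `--stage` could work, the sketch below runs the seven stages in order and stops after a named one. The stage names are the `--stage` values listed under Options; `run_pipeline`, `stage_funcs`, and the toy stages are hypothetical names for this example and are not taken from the actual orchestration in `cli.py`.

```python
from __future__ import annotations

from typing import Any, Callable

# Stage names as listed for --stage, in execution order.
STAGES = ("extract", "segment", "classify", "dimensions",
          "annotations", "assemble", "output")


def run_pipeline(
    stage_funcs: dict[str, Callable[[Any], Any]],
    data: Any,
    stop_after: str | None = None,
) -> tuple[str, Any]:
    """Feed each stage's output into the next; halt after `stop_after` if given.

    Returns the name of the last stage that ran and its result.
    """
    if stop_after is not None and stop_after not in STAGES:
        raise ValueError(f"unknown stage: {stop_after!r}")
    last = ""
    for name in STAGES:
        data = stage_funcs[name](data)
        last = name
        if name == stop_after:
            break
    return last, data


# Toy stages that only record their own name, to make the control flow visible.
toy = {name: (lambda trace, n=name: trace + [n]) for name in STAGES}
last, trace = run_pipeline(toy, [], stop_after="classify")
print(last, trace)
```

Keeping the stage order in a single tuple means the CLI choices and the execution order cannot drift apart, which is one plausible reason to structure a staged pipeline this way.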
+
+### Stage 1: Extract
+
+Parses the PDF page with PyMuPDF. Extracts vector paths (lines, curves, rectangles, quads) and text spans with font, size, and color metadata. Flips the y-axis from PDF convention (origin top-left, y downward) to CAD convention (origin bottom-left, y upward). Filters degenerate and zero-area paths.
+
+### Stage 2: Segment
+
+Detects and removes the title block using a bottom-right rectangle heuristic. Clusters remaining geometry by spatial proximity into view regions. Classifies clusters as FRONT, TOP, or SIDE views using third-angle projection layout conventions (US/AutoCAD standard).
+
+### Stage 3: Classify
+
+Classifies each path by visual properties:
+
+| Role | Visual Characteristics |
+|---|---|
+| `GEOMETRY` | Solid line, medium width |
+| `HIDDEN` | Dashed line |
+| `CENTER` | Dash-dot line |
+| `DIMENSION` | Thin line, near arrowheads |
+| `BORDER` | Thick line, large extent |
+| `CONSTRUCTION` | Very thin line |
+
+Arrowheads are detected as small filled triangles.
+
+### Stage 4: Dimensions
+
+Finds numeric text values via regex (digits with optional `mm` suffix). Converts text coordinates to CAD space. Matches each number to the nearest dimension or geometry line segment. Determines horizontal vs. vertical orientation.
+
+### Stage 5: Annotations
+
+Extracts structured annotations via regex:
+
+- **Material specs** -- type, thickness, finish
+- **Edgebanding** -- thickness, material
+- **Hardware callouts** -- brand, model
+- **Drilling patterns** -- diameter, depth, count
+
+Also collects raw text annotations and title block metadata.
+
+### Stage 6: Assemble
+
+Reconstructs 3D part geometry (width x height x depth) from orthographic dimension measurements. Cross-validates dimensions across views: front height should match side height, front width should match top width. Falls back to 18mm depth when depth cannot be extracted (standard furniture panel thickness).
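The cross-validation and depth fallback described for Stage 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `assembler.py`: the `Box` class, the `assemble` signature, raising `ValueError` on a mismatch, and reusing the 0.5 mm default tolerance for the cross-view check are all hypothetical choices made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

DEFAULT_DEPTH_MM = 18.0  # fallback named in the text: standard panel thickness


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for the assembled part geometry."""
    width_mm: float
    height_mm: float
    depth_mm: float


def _check(label: str, a: float, b: float | None, tol: float) -> None:
    """Raise if two views disagree on a shared measurement by more than tol."""
    if b is not None and abs(a - b) > tol:
        raise ValueError(f"{label} mismatch: {a} vs {b} (tolerance {tol} mm)")


def assemble(
    front_width: float,
    front_height: float,
    top_width: float | None = None,
    side_height: float | None = None,
    depth: float | None = None,
    tolerance: float = 0.5,
) -> Box:
    """Cross-check views, then build the box, defaulting depth when missing."""
    _check("width", front_width, top_width, tolerance)      # front vs. top
    _check("height", front_height, side_height, tolerance)  # front vs. side
    return Box(front_width, front_height,
               DEFAULT_DEPTH_MM if depth is None else depth)


# A flat panel whose depth could not be read from the drawing.
print(assemble(600.0, 720.0, top_width=600.3, side_height=720.0))
```

Whether a real mismatch should abort the file or only log a warning is a design decision this sketch does not settle; it raises simply to make the check visible.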
+
+### Stage 7: Output
+
+Generates:
+
+- **DXF R2010** file with a 3D box mesh on the `GEOMETRY` layer, dimension text on `DIMENSIONS`, and part name on `ANNOTATIONS`
+- **JSON metadata** file validated against `metadata.schema.json`
+- Optionally converts DXF to DWG via ODAFileConverter (ACAD2018 format, 30-second timeout)
+
+---
+
+## Output Files
+
+For each input `example.pdf`, two files are written to the output directory:
+
+| File | Description |
+|---|---|
+| `example.dxf` | DXF R2010 file with 3D mesh geometry |
+| `example.json` | JSON metadata validated against the bundled schema |
+
+### JSON Metadata Structure
+
+```json
+{
+  "source_pdf": "example.pdf",
+  "extraction_timestamp": "2026-03-03T12:00:00Z",
+  "part_name": "Panel A",
+  "overall_dimensions": {
+    "width_mm": 600.0,
+    "height_mm": 720.0,
+    "depth_mm": 18.0
+  },
+  "parts": [
+    {
+      "material": "...",
+      "edgebanding": "...",
+      "hardware": "...",
+      "drilling": "..."
+    }
+  ],
+  "raw_annotations": ["..."]
+}
+```
+
+### Coordinate System
+
+- **PDF space:** origin top-left, y increases downward
+- **CAD/DXF space:** origin bottom-left, y increases upward
+- The pipeline flips y-coordinates during extraction
+- **DXF axes:** X = width, Y = depth, Z = height
+
+---
+
+## Project Structure
+
+```
+pdf2cad/
+├── pyproject.toml
+├── src/
+│   └── pdf2imos/
+│       ├── __init__.py  # Version
+│       ├── __main__.py  # python -m entry point
+│       ├── cli.py  # Typer CLI and pipeline orchestration
+│       ├── errors.py  # Exception hierarchy
+│       ├── extract/  # Stage 1: PDF parsing
+│       │   ├── geometry.py  # Vector path extraction
+│       │   └── text.py  # Text span extraction
+│       ├── interpret/  # Stages 2-3: Layout understanding
+│       │   ├── line_classifier.py  # Line role classification
+│       │   ├── title_block.py  # Title block detection
+│       │   └── view_segmenter.py  # Orthographic view segmentation
+│       ├── models/  # Data models (frozen dataclasses)
+│       │   ├── annotations.py  # Material, edgeband, hardware, drilling
+│       │   ├── classified.py  # ClassifiedLine, LineRole
+│       │   ├── geometry.py  # PartGeometry (3D dimensions)
+│       │   ├── pipeline.py  # PipelineResult
+│       │   ├── primitives.py  # RawPath, RawText, PageExtraction
+│       │   └── views.py  # ViewRegion, ViewType
+│       ├── output/  # Stage 7: File generation
+│       │   ├── dxf_writer.py  # DXF 3D mesh output
+│       │   ├── dwg_converter.py  # Optional DWG conversion
+│       │   └── json_writer.py  # JSON metadata output
+│       ├── parse/  # Stages 4-5: Content extraction
+│       │   ├── annotations.py  # Annotation regex parsing
+│       │   └── dimensions.py  # Dimension measurement extraction
+│       ├── reconstruct/  # Stage 6: 3D assembly
+│       │   └── assembler.py  # Part geometry reconstruction
+│       └── schema/  # JSON Schema
+│           ├── metadata.schema.json  # Metadata validation schema
+│           └── validator.py  # Schema validation wrapper
+└── tests/
+    ├── conftest.py
+    ├── generate_fixtures.py  # Synthetic test PDF generator
+    ├── fixtures/
+    │   ├── input/  # Test PDFs (4 files)
+    │   └── expected/  # Expected JSON outputs
+    ├── integration/
+    │   ├── test_golden.py  # Golden file comparison
+    │   └── test_pipeline.py  # End-to-end pipeline tests
+    ├── test_annotation_extractor.py
+    ├── test_assembler.py
+    ├── test_cli.py
+    ├── test_dimension_extractor.py
+    ├── test_dwg_converter.py
+    ├── test_dxf_writer.py
+    ├── test_error_handling.py
+    ├── test_geometry_extractor.py
+    ├── test_json_writer.py
+    ├── test_line_classifier.py
+    ├── test_models.py
+    ├── test_schema.py
+    ├── test_text_extractor.py
+    ├── test_title_block.py
+    └── test_view_segmenter.py
+```
+
+---
+
+## Development
+
+### Running Tests
+
+```bash
+# Run all tests
+pytest
+
+# Run with coverage
+pytest --cov=pdf2imos
+
+# Run only unit tests (skip integration)
+pytest tests/ --ignore=tests/integration
+
+# Run only integration tests
+pytest tests/integration/
+
+# Regenerate test fixture PDFs
+python tests/generate_fixtures.py
+```
+
+### Test Fixtures
+
+Four synthetic PDFs are generated by `tests/generate_fixtures.py`:
+
+| File | Description |
+|---|---|
+| `simple_panel.pdf` | 600x720x18mm flat panel, 3 orthographic views |
+| `cabinet_basic.pdf` | 600x720x400mm cabinet with material and edgebanding annotations |
+| `panel_with_drilling.pdf` | 600x720x18mm panel with shelf pin holes and drilling annotations |
+| `edge_cases.pdf` | 600x720x3mm ultra-thin back panel with closely spaced and redundant dimensions |
+
+### Linting
+
+```bash
+ruff check src/ tests/
+ruff format src/ tests/
+```
+
+Ruff is configured with line-length 100, target Python 3.11, and E/F/I rule sets.
+
+---
+
+## Error Handling
+
+All exceptions inherit from `Pdf2ImosError`:
+
+| Exception | Raised When |
+|---|---|
+| `PdfExtractionError` | Invalid, corrupt, or empty PDF; no vector content found |
+| `ViewSegmentationError` | View segmentation fails to identify orthographic regions |
+| `DimensionExtractionError` | No dimensions found, or 3D assembly fails |
+| `OutputWriteError` | DXF, JSON, or DWG file cannot be written |
+
+When processing a batch, failures on individual PDFs are caught and reported. The exit code reflects whether all, some, or none of the files succeeded.
+
+---
+
+## Design Notes
+
+**Frozen dataclasses.** All pipeline models use frozen dataclasses, making intermediate data immutable and safe to pass between stages without defensive copying.
+
+**Third-angle projection.** The view segmenter assumes US/AutoCAD third-angle projection layout. First-angle (ISO) drawings will produce incorrect view assignments.
+
+**Depth fallback.** When depth cannot be extracted from the drawing (common for flat panels), the assembler defaults to 18mm, the standard furniture panel thickness.
+
+**32mm system.** Drilling hole placement assumes the 32mm system spacing standard used in European furniture manufacturing.
+
+**DWG conversion.** ODA File Converter must be installed and available on PATH for `--dwg` to have any effect. If it's absent, the flag is silently ignored and only DXF output is written.
+
+**Page scope.** Only the first page of each PDF is processed. Multi-page drawings are not currently supported.
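To illustrate the frozen-dataclass note above, here is a small sketch of what immutability means in practice. The class name `PartGeometry` appears in the project structure, but the fields shown are assumptions borrowed from the `overall_dimensions` keys in the JSON example; the real model may be defined differently.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class PartGeometry:
    """Illustrative fields only; the real model may differ."""
    width_mm: float
    height_mm: float
    depth_mm: float


panel = PartGeometry(600.0, 720.0, 18.0)

# Assigning to a field of a frozen dataclass raises instead of mutating.
try:
    panel.depth_mm = 3.0  # type: ignore[misc]
except FrozenInstanceError:
    mutated = False
else:
    mutated = True

# A later stage that needs a changed value builds a new object instead.
thin_panel = replace(panel, depth_mm=3.0)
print(mutated, panel.depth_mm, thin_panel.depth_mm)
```

Because the original instance cannot change, an earlier stage's result can be handed to several later stages without copying it first.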