From 120f018906eb3c686ef16fc773e5d43f6ebe1400 Mon Sep 17 00:00:00 2001
From: repi <repi@repi.fun>
Date: Tue, 3 Mar 2026 21:30:31 +0000
Subject: [PATCH] chore: README.md

---
 README.md | 353 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 353 insertions(+)
 create mode 100644 README.md
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..65c827b
--- /dev/null
+++ b/README.md
@@ -0,0 +1,353 @@
+# pdf2imos
+
+A Python CLI tool that converts PDF technical drawings into DXF 3D CAD files and structured JSON metadata for imos CAD furniture manufacturing workflows.
+
+It parses vector geometry and text from PDF pages, identifies orthographic views, extracts dimensions and annotations, reconstructs 3D part geometry, and outputs industry-standard DXF files alongside validated JSON sidecar metadata.
+
+---
+
+## Table of Contents
+
+- [Overview](#overview)
+- [Requirements](#requirements)
+- [Installation](#installation)
+- [Usage](#usage)
+- [Pipeline Architecture](#pipeline-architecture)
+- [Output Files](#output-files)
+- [Project Structure](#project-structure)
+- [Development](#development)
+- [Error Handling](#error-handling)
+- [Design Notes](#design-notes)
+
+---
+
+## Overview
+
+pdf2imos targets AutoCAD-style orthographic projection drawings of furniture parts. Given a directory of PDFs, it:
+
+1. Extracts vector paths and text from each page
+2. Segments the drawing into front, top, and side views
+3. Classifies line roles (geometry, hidden, center, dimension, etc.)
+4. Extracts dimension measurements and structured annotations (material, edgebanding, hardware, drilling)
+5. Reconstructs 3D part geometry from orthographic measurements
+6. Writes a DXF R2010 file with a 3D box mesh and a schema-validated JSON metadata file
+
+Batch processing runs across all PDFs in the input directory. Only the first page of each PDF is processed.
+
+---
+
+## Requirements
+
+- Python >= 3.11
+- [ODA File Converter](https://www.opendesign.com/guestfiles/oda_file_converter) (optional, for `--dwg` output)
+
+---
+
+## Installation
+
+```bash
+# Clone the repository
+git clone <repo-url>
+cd pdf2cad
+
+# Create a virtual environment
+python -m venv venv
+source venv/bin/activate  # Windows: venv\Scripts\activate
+
+# Install in development mode with dev dependencies
+pip install -e ".[dev]"
+```
+
+---
+
+## Usage
+
+```
+pdf2imos INPUT_DIR OUTPUT_DIR [OPTIONS]
+```
+
+Also runnable as a module:
+
+```bash
+python -m pdf2imos INPUT_DIR OUTPUT_DIR
+```
+
+### Arguments
+
+| Argument | Description |
+|---|---|
+| `INPUT_DIR` | Directory containing PDF files to process |
+| `OUTPUT_DIR` | Directory for output DXF and JSON files |
+
+### Options
+
+| Option | Description |
+|---|---|
+| `--stage STAGE` | Stop after the named pipeline stage and dump its intermediate output as JSON |
+| `--tolerance FLOAT` | Dimension matching tolerance in mm (default: 0.5) |
+| `--dwg` | Also convert DXF to DWG format (requires ODAFileConverter on PATH) |
+| `--verbose` | Enable DEBUG logging |
+| `--version` | Show version and exit |
+
+**Available `--stage` values:** `extract`, `segment`, `classify`, `dimensions`, `annotations`, `assemble`, `output`
+
+### Examples
+
+```bash
+# Process all PDFs in a directory
+pdf2imos ./drawings ./output
+
+# Stop at extraction stage for debugging
+pdf2imos ./drawings ./output --stage extract
+
+# Verbose output with custom tolerance
+pdf2imos ./drawings ./output --verbose --tolerance 1.0
+
+# Also generate DWG files (requires ODAFileConverter)
+pdf2imos ./drawings ./output --dwg
+```
+
+### Exit Codes
+
+| Code | Meaning |
+|---|---|
+| `0` | All PDFs processed successfully |
+| `1` | Some PDFs failed, some succeeded |
+| `2` | All PDFs failed, or invalid arguments |
+
+---
+
+## Pipeline Architecture
+
+The pipeline runs in 7 sequential stages. Use `--stage` to halt after any stage and inspect intermediate output.
+
+### Stage 1: Extract
+
+Parses the PDF page with PyMuPDF. Extracts vector paths (lines, curves, rectangles, quads) and text spans with font, size, and color metadata. Flips the y-axis from PDF convention (origin top-left, y downward) to CAD convention (origin bottom-left, y upward). Filters out degenerate and zero-area paths.
+
+### Stage 2: Segment
+
+Detects and removes the title block using a bottom-right rectangle heuristic. Clusters the remaining geometry into view regions by spatial proximity. Classifies each cluster as a FRONT, TOP, or SIDE view using third-angle projection layout conventions (US/AutoCAD standard).
+
+### Stage 3: Classify
+
+Classifies each path by visual properties:
+
+| Role | Visual Characteristics |
+|---|---|
+| `GEOMETRY` | Solid line, medium width |
+| `HIDDEN` | Dashed line |
+| `CENTER` | Dash-dot line |
+| `DIMENSION` | Thin line, near arrowheads |
+| `BORDER` | Thick line, large extent |
+| `CONSTRUCTION` | Very thin line |
+
+Arrowheads are detected as small filled triangles.
+
+### Stage 4: Dimensions
+
+Finds numeric text values via regex (digits with an optional `mm` suffix). Converts text coordinates to CAD space. Matches each number to the nearest dimension or geometry line segment. Determines whether each measurement is horizontal or vertical.
+
+### Stage 5: Annotations
+
+Extracts structured annotations via regex:
+
+- **Material specs** -- type, thickness, finish
+- **Edgebanding** -- thickness, material
+- **Hardware callouts** -- brand, model
+- **Drilling patterns** -- diameter, depth, count
+
+Also collects raw text annotations and title block metadata.
+
+### Stage 6: Assemble
+
+Reconstructs 3D part geometry (width x height x depth) from orthographic dimension measurements. Cross-validates dimensions across views: front height should match side height, and front width should match top width. When depth cannot be extracted, falls back to 18mm, the standard furniture panel thickness.
+
+### Stage 7: Output
+
+Generates:
+
+- **DXF R2010** file with a 3D box mesh on the `GEOMETRY` layer, dimension text on `DIMENSIONS`, and part name on `ANNOTATIONS`
+- **JSON metadata** file validated against `metadata.schema.json`
+- **DWG** file (optional, with `--dwg`), converted from the DXF via ODAFileConverter (ACAD2018 format, 30-second timeout)
+
+---
+
+## Output Files
+
+For each input `example.pdf`, two files are written to the output directory (a DWG file is also written when `--dwg` is given):
+
+| File | Description |
+|---|---|
+| `example.dxf` | DXF R2010 file with 3D mesh geometry |
+| `example.json` | JSON metadata validated against the bundled schema |
+
+### JSON Metadata Structure
+
+```json
+{
+  "source_pdf": "example.pdf",
+  "extraction_timestamp": "2026-03-03T12:00:00Z",
+  "part_name": "Panel A",
+  "overall_dimensions": {
+    "width_mm": 600.0,
+    "height_mm": 720.0,
+    "depth_mm": 18.0
+  },
+  "parts": [
+    {
+      "material": "...",
+      "edgebanding": "...",
+      "hardware": "...",
+      "drilling": "..."
+    }
+  ],
+  "raw_annotations": ["..."]
+}
+```
+
+### Coordinate System
+
+- **PDF space:** origin top-left, y increases downward
+- **CAD/DXF space:** origin bottom-left, y increases upward
+- The pipeline flips y-coordinates during extraction
+- **DXF axes:** X = width, Y = depth, Z = height
+
+---
+
+## Project Structure
+
+```
+pdf2cad/
+├── pyproject.toml
+├── src/
+│   └── pdf2imos/
+│       ├── __init__.py              # Version
+│       ├── __main__.py              # python -m entry point
+│       ├── cli.py                   # Typer CLI and pipeline orchestration
+│       ├── errors.py                # Exception hierarchy
+│       ├── extract/                 # Stage 1: PDF parsing
+│       │   ├── geometry.py          #   Vector path extraction
+│       │   └── text.py              #   Text span extraction
+│       ├── interpret/               # Stages 2-3: Layout understanding
+│       │   ├── line_classifier.py   #   Line role classification
+│       │   ├── title_block.py       #   Title block detection
+│       │   └── view_segmenter.py    #   Orthographic view segmentation
+│       ├── models/                  # Data models (frozen dataclasses)
+│       │   ├── annotations.py       #   Material, edgeband, hardware, drilling
+│       │   ├── classified.py        #   ClassifiedLine, LineRole
+│       │   ├── geometry.py          #   PartGeometry (3D dimensions)
+│       │   ├── pipeline.py          #   PipelineResult
+│       │   ├── primitives.py        #   RawPath, RawText, PageExtraction
+│       │   └── views.py             #   ViewRegion, ViewType
+│       ├── output/                  # Stage 7: File generation
+│       │   ├── dxf_writer.py        #   DXF 3D mesh output
+│       │   ├── dwg_converter.py     #   Optional DWG conversion
+│       │   └── json_writer.py       #   JSON metadata output
+│       ├── parse/                   # Stages 4-5: Content extraction
+│       │   ├── annotations.py       #   Annotation regex parsing
+│       │   └── dimensions.py        #   Dimension measurement extraction
+│       ├── reconstruct/             # Stage 6: 3D assembly
+│       │   └── assembler.py         #   Part geometry reconstruction
+│       └── schema/                  # JSON Schema
+│           ├── metadata.schema.json #   Metadata validation schema
+│           └── validator.py         #   Schema validation wrapper
+└── tests/
+    ├── conftest.py
+    ├── generate_fixtures.py         # Synthetic test PDF generator
+    ├── fixtures/
+    │   ├── input/                   # Test PDFs (4 files)
+    │   └── expected/                # Expected JSON outputs
+    ├── integration/
+    │   ├── test_golden.py           # Golden file comparison
+    │   └── test_pipeline.py         # End-to-end pipeline tests
+    ├── test_annotation_extractor.py
+    ├── test_assembler.py
+    ├── test_cli.py
+    ├── test_dimension_extractor.py
+    ├── test_dwg_converter.py
+    ├── test_dxf_writer.py
+    ├── test_error_handling.py
+    ├── test_geometry_extractor.py
+    ├── test_json_writer.py
+    ├── test_line_classifier.py
+    ├── test_models.py
+    ├── test_schema.py
+    ├── test_text_extractor.py
+    ├── test_title_block.py
+    └── test_view_segmenter.py
+```
+
+---
+
+## Development
+
+### Running Tests
+
+```bash
+# Run all tests
+pytest
+
+# Run with coverage
+pytest --cov=pdf2imos
+
+# Run only unit tests (skip integration)
+pytest tests/ --ignore=tests/integration
+
+# Run only integration tests
+pytest tests/integration/
+
+# Regenerate test fixture PDFs
+python tests/generate_fixtures.py
+```
+
+### Test Fixtures
+
+Four synthetic PDFs are generated by `tests/generate_fixtures.py`:
+
+| File | Description |
+|---|---|
+| `simple_panel.pdf` | 600x720x18mm flat panel, 3 orthographic views |
+| `cabinet_basic.pdf` | 600x720x400mm cabinet with material and edgebanding annotations |
+| `panel_with_drilling.pdf` | 600x720x18mm panel with shelf pin holes and drilling annotations |
+| `edge_cases.pdf` | 600x720x3mm ultra-thin back panel with closely spaced and redundant dimensions |
+
+### Linting
+
+```bash
+ruff check src/ tests/
+ruff format src/ tests/
+```
+
+Ruff is configured with line-length 100, target Python 3.11, and E/F/I rule sets.
+
+---
+
+## Error Handling
+
+All exceptions inherit from `Pdf2ImosError`:
+
+| Exception | Raised When |
+|---|---|
+| `PdfExtractionError` | Invalid, corrupt, or empty PDF; no vector content found |
+| `ViewSegmentationError` | View segmentation fails to identify orthographic regions |
+| `DimensionExtractionError` | No dimensions found, or 3D assembly fails |
+| `OutputWriteError` | DXF, JSON, or DWG file cannot be written |
+
+When processing a batch, failures on individual PDFs are caught and reported. The exit code reflects whether all, some, or none of the files succeeded.
+
+---
+
+## Design Notes
+
+**Frozen dataclasses.** All pipeline models use frozen dataclasses, making intermediate data immutable and safe to pass between stages without defensive copying.
+
+**Third-angle projection.** The view segmenter assumes US/AutoCAD third-angle projection layout. First-angle (ISO) drawings will produce incorrect view assignments.
+
+**Depth fallback.** When depth cannot be extracted from the drawing (common for flat panels), the assembler defaults to 18mm, the standard furniture panel thickness.
+
+**32mm system.** Drilling hole placement assumes the 32mm system spacing standard used in European furniture manufacturing.
+
+**DWG conversion.** ODA File Converter must be installed and available on PATH for `--dwg` to have any effect. If it's absent, the flag is silently ignored and only DXF output is written.
+
+**Page scope.** Only the first page of each PDF is processed. Multi-page drawings are not currently supported.