# AGENTS.md - Requirements & Decisions

This document records the requirements, architecture decisions, and rationale for the S3 to HF Bucket Importer.

## Requirements

### Functional

1. **OAuth Login**: Users sign in with Hugging Face OAuth. Scopes: `manage-repos` (create buckets), `jobs` (run Jobs), plus default `openid` + `profile`.
2. **S3 Configuration**: Users provide AWS credentials (Access Key ID, Secret Access Key), region, bucket name, optional endpoint URL (for S3-compatible services like MinIO), and optional source prefix.
3. **File Browser**: Display S3 bucket contents in a tree view with:
   - Lazy loading (only load folder contents on expand)
   - Checkboxes for file/folder selection
   - Select all / Deselect all
   - File count and size statistics
   - CORS fallback mode if browser can't access S3 directly
4. **Destination Configuration**: User provides:
   - Bucket name (`bucket` or `namespace/bucket` format)
   - Optional destination prefix
   - Auto-create bucket if it doesn't exist
5. **Import Execution**: Launch an HF Job that:
   - Installs `huggingface_hub[s3]` from branch `cursor/s3-to-hf-bucket-ingestion-f144`
   - Runs `hf buckets import` with appropriate arguments
   - Passes S3 credentials as encrypted secrets
   - Uses CPU hardware (I/O-bound task)
6. **No Local Storage**: Nothing persists in the browser. Page refresh = start over. All credentials are in-memory only.
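The two destination formats in requirement 4 (`bucket` or `namespace/bucket`) could be normalized with a small helper. This is a minimal sketch, not the code in `app.js`: the function name `parseDestination` and the `defaultNamespace` argument (assumed here to be the signed-in user's username) are hypothetical.

```javascript
// Hypothetical helper: split a destination string into namespace and bucket name.
// Assumption: a bare bucket name belongs to `defaultNamespace` (the signed-in user).
function parseDestination(input, defaultNamespace) {
  const value = input.trim();
  const parts = value.split("/");

  // "bucket" form: one non-empty segment
  if (parts.length === 1 && parts[0] !== "") {
    return { namespace: defaultNamespace, name: parts[0] };
  }

  // "namespace/bucket" form: exactly two non-empty segments
  if (parts.length === 2 && parts[0] !== "" && parts[1] !== "") {
    return { namespace: parts[0], name: parts[1] };
  }

  throw new Error(`Expected "bucket" or "namespace/bucket", got "${input}"`);
}
```

Rejecting anything with more than one slash up front keeps malformed names from reaching the bucket-creation call or the Job command.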
### Non-Functional

- Professional, confidence-inspiring design
- Dark theme inspired by HF Buckets announcement
- Responsive (works on mobile)
- Clear error messages and graceful degradation

## Architecture Decisions

### Decision 1: No Build Step

**Choice**: Vanilla HTML/CSS/JS with ES modules from CDN (esm.sh)

**Alternatives considered**:
- Vite + vanilla JS (tree-shaking, npm packages)
- React/Vue/Svelte (component model)

**Rationale**:
- Simplest deployment: just upload files to a static Space
- No `node_modules`, no `package.json`, no build config
- CDN imports are cached by the browser across sessions
- The app is a single page with ~4 states - doesn't warrant a framework
- AWS SDK v3 is ~500KB from CDN but cached effectively
- Easy for anyone to maintain: just HTML/CSS/JS, no tooling knowledge needed

### Decision 2: CDN Provider (esm.sh)

**Choice**: esm.sh for ES module CDN

**Rationale**:
- Reliable ESM CDN that handles CommonJS -> ESM conversion
- Supports import maps natively
- Handles transitive dependencies for AWS SDK v3
- Alternative was jsdelivr+esm or unpkg, but esm.sh has better ESM support

### Decision 3: CORS Fallback Strategy

**Choice**: Try browser-side S3 listing, gracefully degrade to manual mode

**Alternatives considered**:
- Skip browser listing entirely (simpler but less interactive)
- Require CORS (better UX but higher friction)

**Rationale**:
- Most S3 buckets won't have CORS configured for our domain
- Attempting listing first gives the best UX for configured buckets
- Manual fallback (include/exclude patterns) lets everyone use the tool
- Clear CORS instructions help users enable browsing if they want

### Decision 4: Selection-to-Pattern Mapping

**Choice**: Compute include or exclude patterns, whichever set is smaller

**Rationale**:
- `hf buckets import` supports `--include` and `--exclude` with fnmatch patterns
- When most files are selected, use `--exclude` for the few deselected
- When few files are selected, use `--include` for the selected ones
- Folder-level patterns (e.g., `folder/*`) keep command lines short
- Shell-escaping prevents injection in the Job command

### Decision 5: Job Configuration

**Choice**: `python:3.12` Docker image with `pip install` + `bash -c`

**Alternatives considered**:
- Custom Docker image with huggingface_hub pre-installed
- `ghcr.io/astral-sh/uv` image with `uv run`

**Rationale**:
- `python:3.12` is a standard, well-maintained image
- `pip install` from git is simple and reliable
- `bash -c` allows chaining install + import in one command
- Once `hf buckets import` is released, change to `pip install 'huggingface_hub[s3]>=X.Y.Z'`
- CPU-only hardware (`cpu-basic`) is sufficient - this is purely I/O-bound

### Decision 6: Bucket Creation

**Choice**: Call `POST /api/repos/create` with `type: "bucket"` before submitting the Job

**Rationale**:
- The Job needs the bucket to exist before it can write to it
- Creating from the browser (pre-Job) ensures the user has proper permissions
- Handle 409 (already exists) gracefully

### Decision 7: Token Handling

**Choice**: Pass OAuth token as Job secret (`HF_TOKEN`), S3 credentials as secrets too

**Rationale**:
- Secrets are encrypted server-side by the Jobs API
- Injected as environment variables at runtime
- Never appear in logs or the HF UI
- The `hf` CLI automatically uses `HF_TOKEN` for authentication
- boto3/s3fs automatically uses `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`

## OAuth Scopes

| Scope | Purpose |
|-------|---------|
| `openid` | ID token (always included) |
| `profile` | Username, avatar (always included) |
| `manage-repos` | Create and manage buckets |
| `jobs` | Create and monitor HF Jobs |

## API Endpoints Used

| Method | Endpoint | Purpose |
|--------|----------|---------|
| GET | `/api/whoami-v2` | Get user info + orgs |
| POST | `/api/repos/create` | Create destination bucket |
| POST | `/api/jobs/{namespace}` | Submit import Job |
| GET | `/api/jobs/{namespace}/{id}` | Poll Job status |

## Security Model

1. **S3 credentials**: Only transmitted from browser to HF Jobs API as encrypted secrets. Never stored in localStorage, cookies, or any persistent storage.
2. **OAuth token**: Held in JavaScript memory only. Not persisted. Passed to Job as encrypted secret for HF API authentication.
3. **Shell injection prevention**: All user inputs interpolated into the Job command are escaped using single-quote wrapping with internal quote escaping.
4. **No backend**: The static site has no server-side code. All API calls are made directly from the browser to S3 and HF APIs.

## File Structure

```
s3-importer/
├── README.md      # Space YAML frontmatter + project docs
├── AGENTS.md      # This file
├── index.html     # HTML structure + import maps
├── style.css      # Dark theme (CSS custom properties)
└── app.js         # Application logic (ES module, ~500 lines)
```

## Future Improvements

- [ ] Show Job logs inline (stream from `/api/jobs/{namespace}/{id}/logs`)
- [ ] Support AWS session tokens (for temporary credentials / assumed roles)
- [ ] Remember recent imports (optional, opt-in localStorage)
- [ ] Support importing to existing bucket with merge strategy
- [ ] Add a "dry run" option that shows what would be imported
- [ ] Update `huggingface_hub` install to stable release once available
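
For reference, two mechanisms described earlier in this document, the smaller-set pattern choice (Decision 4) and the single-quote escaping (Security Model, item 3), can be sketched together. This is an illustrative sketch, not the code in `app.js`: the names `shellQuote` and `choosePatterns` are hypothetical, and sending ties to `--include` is an assumption.

```javascript
// Wrap a value in single quotes for a POSIX shell. Each embedded single quote
// becomes '\'' (close the quote, emit an escaped quote, reopen the quote).
function shellQuote(value) {
  return "'" + String(value).replace(/'/g, "'\\''") + "'";
}

// Hypothetical helper: return whichever pattern set is smaller, with the
// flag it belongs to. Assumption: ties go to --include.
function choosePatterns(selected, deselected) {
  return selected.length <= deselected.length
    ? { flag: "--include", patterns: selected }
    : { flag: "--exclude", patterns: deselected };
}

// Two entries selected, one deselected: the exclude set is smaller.
const choice = choosePatterns(["data/*", "readme.txt"], ["tmp/it's old.log"]);
const quotedPatterns = choice.patterns.map(shellQuote);
console.log(choice.flag, quotedPatterns.join(" "));
```

Quoting every pattern individually, rather than the assembled command, means a file name containing spaces or quotes stays a single shell word when the command runs under `bash -c`.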