AGENTS.md - Requirements & Decisions

This document records the requirements, architecture decisions, and rationale for the S3 to HF Bucket Importer.

Requirements

Functional

  1. OAuth Login: Users sign in with Hugging Face OAuth. Scopes: manage-repos (create buckets), jobs (run Jobs), plus default openid + profile.

  2. S3 Configuration: Users provide AWS credentials (Access Key ID, Secret Access Key), region, bucket name, optional endpoint URL (for S3-compatible services like MinIO), and optional source prefix.

  3. File Browser: Display S3 bucket contents in a tree view with:

    • Lazy loading (only load folder contents on expand)
    • Checkboxes for file/folder selection
    • Select all / Deselect all
    • File count and size statistics
    • CORS fallback mode if browser can't access S3 directly
  4. Destination Configuration: User provides:

    • Bucket name (bucket or namespace/bucket format)
    • Optional destination prefix
    • Auto-create bucket if it doesn't exist
  5. Import Execution: Launch an HF Job that:

    • Installs huggingface_hub[s3] from branch cursor/s3-to-hf-bucket-ingestion-f144
    • Runs hf buckets import with appropriate arguments
    • Passes S3 credentials as encrypted secrets
    • Uses CPU hardware (I/O-bound task)
  6. No Local Storage: Nothing persists in the browser. Page refresh = start over. All credentials are in-memory only.
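The lazy-loading file browser in item 3 can be sketched as a pure conversion from one S3 listing response to one level of tree nodes. This is a minimal sketch, assuming the listing request was made with `Prefix` set to the folder and `Delimiter` set to `/`, so that sub-folders arrive in `CommonPrefixes` and files in `Contents` (both are ListObjectsV2 response fields). The node shape (`type`, `key`, `name`, `loaded`, `children`, `size`) and the function names are hypothetical, not taken from app.js.

```javascript
// Convert one ListObjectsV2 response into tree nodes for a single folder level.
// Assumes the request used Prefix = `prefix` and Delimiter = "/".
function toTreeNodes(response, prefix = "") {
  const folders = (response.CommonPrefixes ?? []).map(({ Prefix }) => ({
    type: "folder",
    key: Prefix,
    name: Prefix.slice(prefix.length).replace(/\/$/, ""),
    loaded: false, // children are fetched only when the folder is expanded
    children: [],
  }));

  const files = (response.Contents ?? [])
    // The listing may contain a placeholder object for the folder itself; skip it.
    .filter(({ Key }) => Key !== prefix)
    .map(({ Key, Size }) => ({
      type: "file",
      key: Key,
      name: Key.slice(prefix.length),
      size: Size ?? 0,
    }));

  return [...folders, ...files];
}

// File count and total size for the statistics line.
function summarize(nodes) {
  const files = nodes.filter((node) => node.type === "file");
  return {
    count: files.length,
    bytes: files.reduce((sum, file) => sum + file.size, 0),
  };
}
```

Because each call covers one folder level only, expanding a folder maps to one more listing request for that folder's `key`.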

Non-Functional

  • Professional, confidence-inspiring design
  • Dark theme inspired by HF Buckets announcement
  • Responsive (works on mobile)
  • Clear error messages and graceful degradation

Architecture Decisions

Decision 1: No Build Step

Choice: Vanilla HTML/CSS/JS with ES modules from CDN (esm.sh)

Alternatives considered:

  • Vite + vanilla JS (tree-shaking, npm packages)
  • React/Vue/Svelte (component model)

Rationale:

  • Simplest deployment: just upload files to a static Space
  • No node_modules, no package.json, no build config
  • CDN imports are cached by the browser across sessions
  • The app is a single page with ~4 states - doesn't warrant a framework
  • AWS SDK v3 is ~500KB from CDN but cached effectively
  • Easy for anyone to maintain: just HTML/CSS/JS, no tooling knowledge needed

Decision 2: CDN Provider (esm.sh)

Choice: esm.sh for ES module CDN

Rationale:

  • Reliable ESM CDN that handles CommonJS -> ESM conversion
  • Works with the import maps that browsers support natively
  • Handles transitive dependencies for AWS SDK v3
  • Alternatives were jsDelivr (with its +esm endpoint) and unpkg, but esm.sh has better ESM support

Decision 3: CORS Fallback Strategy

Choice: Try browser-side S3 listing, gracefully degrade to manual mode

Alternatives considered:

  • Skip browser listing entirely (simpler but less interactive)
  • Require CORS (better UX but higher friction)

Rationale:

  • Most S3 buckets won't have CORS configured for our domain
  • Attempting listing first gives the best UX for configured buckets
  • Manual fallback (include/exclude patterns) lets everyone use the tool
  • Clear CORS instructions help users enable browsing if they want
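The fallback described above might look roughly like the following. This is a sketch under stated assumptions: `listFolder` is a hypothetical async function that lists one folder level from the browser, and the returned `mode` / `entries` / `reason` shape is illustrative only.

```javascript
// Try to list the bucket from the browser; degrade to manual pattern entry on failure.
// `listFolder` is a hypothetical async function supplied by the caller.
async function loadBrowserOrManual(listFolder, prefix = "") {
  try {
    const entries = await listFolder(prefix);
    return { mode: "browser", entries };
  } catch (error) {
    // A CORS rejection typically surfaces as an opaque network error in the
    // browser, so it cannot be reliably told apart from other network failures.
    // Any listing failure therefore switches to manual mode and keeps the
    // message so the UI can show it next to the CORS instructions.
    return {
      mode: "manual",
      entries: [],
      reason: error?.message ?? String(error),
    };
  }
}
```

In manual mode the tree view would be replaced by include/exclude pattern inputs, so the import can still be launched without any browser-side listing.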

Decision 4: Selection-to-Pattern Mapping

Choice: Compute include or exclude patterns, whichever set is smaller

Rationale:

  • hf buckets import supports --include and --exclude with fnmatch patterns
  • When most files are selected, use --exclude for the few deselected
  • When few files are selected, use --include for the selected ones
  • Folder-level patterns (e.g., folder/*) keep command lines short
  • Shell-escaping prevents injection in the Job command
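One way to compute the smaller pattern set is sketched below. It assumes a fully loaded tree in which every file node carries a boolean `selected` flag; the node shape and the function names are hypothetical. A folder whose files all share the wanted state collapses to a single `folder/*` pattern, which relies on fnmatch-style `*` also matching `/`.

```javascript
// True when every file under `node` has selected === wanted.
function allMatch(node, wanted) {
  if (node.type === "file") return node.selected === wanted;
  return node.children.every((child) => allMatch(child, wanted));
}

// Patterns covering the files under `node` whose selection state is `wanted`.
function collect(node, wanted) {
  if (node.type === "file") return node.selected === wanted ? [node.key] : [];
  if (node.children.length === 0) return []; // empty folder contributes nothing
  if (allMatch(node, wanted)) return [node.key + "*"]; // folder-level pattern
  return node.children.flatMap((child) => collect(child, wanted));
}

// Pick whichever pattern list is shorter; ties go to include.
function selectionToPatterns(roots) {
  const include = roots.flatMap((node) => collect(node, true));
  const exclude = roots.flatMap((node) => collect(node, false));
  if (include.length === 0) return { mode: "none", patterns: [] }; // nothing selected
  if (exclude.length === 0) return { mode: "all", patterns: [] }; // no filter needed
  return include.length <= exclude.length
    ? { mode: "include", patterns: include }
    : { mode: "exclude", patterns: exclude };
}
```

Keys that contain fnmatch metacharacters such as `*`, `?`, or `[` would be read as wildcards, so a real implementation would need to escape them. Every pattern still has to be shell-escaped before it is placed in the Job command.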

Decision 5: Job Configuration

Choice: python:3.12 Docker image with pip install + bash -c

Alternatives considered:

  • Custom Docker image with huggingface_hub pre-installed
  • ghcr.io/astral-sh/uv image with uv run

Rationale:

  • python:3.12 is a standard, well-maintained image
  • pip install from git is simple and reliable
  • bash -c allows chaining install + import in one command
  • Once hf buckets import is released, change to pip install 'huggingface_hub[s3]>=X.Y.Z'
  • CPU-only hardware (cpu-basic) is sufficient - this is purely I/O-bound
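The install-then-import chain could be assembled as in the sketch below. Several things here are assumptions: that the branch lives in the main `huggingface/huggingface_hub` GitHub repository, and that the returned object (`image`, `flavor`, `command`) is only an internal shape, not the Jobs API payload. The arguments to `hf buckets import` are left to the caller, since this document names only `--include` and `--exclude`.

```javascript
// Branch name is from this document; the repository URL is an assumption.
const HUB_GIT_REF =
  "git+https://github.com/huggingface/huggingface_hub@cursor/s3-to-hf-bucket-ingestion-f144";

// Build the Job settings. `quotedImportArgs` must already be shell-escaped
// by the caller, because they are pasted into a `bash -c` string.
function buildJobSpec(quotedImportArgs) {
  const install = `pip install 'huggingface_hub[s3] @ ${HUB_GIT_REF}'`;
  const run = ["hf buckets import", ...quotedImportArgs].join(" ");
  return {
    image: "python:3.12",
    flavor: "cpu-basic", // I/O-bound task, no GPU needed
    // `&&` makes the Job fail fast if the install step fails.
    command: ["bash", "-c", `${install} && ${run}`],
  };
}
```

After the stable release, only the `install` line would change to the versioned `pip install` noted above.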

Decision 6: Bucket Creation

Choice: Call POST /api/repos/create with type: "bucket" before submitting the Job

Rationale:

  • The Job needs the bucket to exist before it can write to it
  • Creating from the browser (pre-Job) ensures the user has proper permissions
  • Handle 409 (already exists) gracefully
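A minimal sketch of the pre-Job creation call follows. The endpoint and `type: "bucket"` come from this document; the `name` and `organization` body fields and the bearer-token header are assumptions based on how other repo types are created, and `fetchImpl` is a hypothetical injection point so the function can be exercised without a network.

```javascript
// Create the destination bucket, treating "already exists" (409) as success.
async function ensureBucket({ token, name, organization }, fetchImpl = fetch) {
  const body = { type: "bucket", name }; // `name` field is an assumption
  if (organization) body.organization = organization; // assumed field name

  const response = await fetchImpl("https://huggingface.co/api/repos/create", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(body),
  });

  if (response.ok) return { created: true };
  if (response.status === 409) return { created: false }; // bucket already exists
  // Anything else (for example a permissions error) stops before a Job is submitted.
  throw new Error(`Bucket creation failed with HTTP ${response.status}`);
}
```

Failing here, before the Job is submitted, means a permissions problem shows up in the browser instead of in a Job that has already started.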

Decision 7: Token Handling

Choice: Pass OAuth token as Job secret (HF_TOKEN), S3 credentials as secrets too

Rationale:

  • Secrets are encrypted server-side by the Jobs API
  • Injected as environment variables at runtime
  • Never appear in logs or the HF UI
  • The hf CLI automatically uses HF_TOKEN for authentication
  • boto3 and s3fs automatically read AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the environment
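The secret set passed to the Job can be sketched as a small mapping from in-memory values to the environment variable names listed above. The function name and its input field names are hypothetical. How the map is attached to the Jobs API request is not shown, since this document does not specify the payload.

```javascript
// Map in-memory credentials to the environment variable names the Job expects.
// Throws early so a Job is never submitted with a missing credential.
function buildJobSecrets({ hfToken, accessKeyId, secretAccessKey }) {
  const required = { hfToken, accessKeyId, secretAccessKey };
  for (const [label, value] of Object.entries(required)) {
    if (!value) throw new Error(`Missing required credential: ${label}`);
  }
  return {
    HF_TOKEN: hfToken, // read by the hf CLI
    AWS_ACCESS_KEY_ID: accessKeyId, // read by boto3 / s3fs
    AWS_SECRET_ACCESS_KEY: secretAccessKey, // read by boto3 / s3fs
  };
}
```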

OAuth Scopes

| Scope | Purpose |
|---|---|
| openid | ID token (always included) |
| profile | Username, avatar (always included) |
| manage-repos | Create and manage buckets |
| jobs | Create and monitor HF Jobs |

API Endpoints Used

| Method | Endpoint | Purpose |
|---|---|---|
| GET | /api/whoami-v2 | Get user info + orgs |
| POST | /api/repos/create | Create destination bucket |
| POST | /api/jobs/{namespace} | Submit import Job |
| GET | /api/jobs/{namespace}/{id} | Poll Job status |

Security Model

  1. S3 credentials: Used in the browser to sign S3 listing requests, and sent to the HF Jobs API as Job secrets, which the API encrypts server-side. Never stored in localStorage, cookies, or any persistent storage.

  2. OAuth token: Held in JavaScript memory only. Not persisted. Passed to Job as encrypted secret for HF API authentication.

  3. Shell injection prevention: All user inputs interpolated into the Job command are escaped using single-quote wrapping with internal quote escaping.

  4. No backend: The static site has no server-side code. All API calls are made directly from the browser to S3 and HF APIs.
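The single-quote escaping in item 3 can be sketched as follows. The function name `shellQuote` is hypothetical. The technique is the standard POSIX one: inside single quotes nothing is interpreted, so the only character that needs handling is the single quote itself, which is written as `'\''` (close the quote, emit an escaped quote, reopen).

```javascript
// Wrap a value in single quotes so the shell treats it as one literal word.
// Each embedded single quote becomes '\'' (close, escaped quote, reopen).
function shellQuote(value) {
  return "'" + String(value).replace(/'/g, "'\\''") + "'";
}
```

For example, `shellQuote("$(rm -rf /)")` yields `'$(rm -rf /)'`, which the shell passes through as plain text instead of running it as a command substitution.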

File Structure

s3-importer/
├── README.md       # Space YAML frontmatter + project docs
├── AGENTS.md       # This file
├── index.html      # HTML structure + import maps
├── style.css       # Dark theme (CSS custom properties)
└── app.js          # Application logic (ES module, ~500 lines)

Future Improvements

  • Show Job logs inline (stream from /api/jobs/{namespace}/{id}/logs)
  • Support AWS session tokens (for temporary credentials / assumed roles)
  • Remember recent imports (optional, opt-in localStorage)
  • Support importing to existing bucket with merge strategy
  • Add a "dry run" option that shows what would be imported
  • Update huggingface_hub install to stable release once available