API Documentation

Welcome to the crawlkit.dev API. Use these endpoints to create and manage website crawling jobs.

Authentication

All API endpoints require authentication using an API token.

You can provide the token as:

  • Query parameter: ?api_token=YOUR_TOKEN
  • HTTP header: Authorization: Bearer YOUR_TOKEN
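
The two ways of passing the token can be sketched in Python as follows. This is a minimal illustration, not an official client: the helper name build_request is hypothetical, and the base URL is assumed from the curl examples in this document. The function only builds the request and does not send it.

```python
import urllib.request
from urllib.parse import urlencode

# Assumption: base URL taken from the curl examples in this document.
BASE_URL = "https://crawlkit.dev"


def build_request(path, token, method="GET", use_header=True):
    """Build (but do not send) an authenticated request.

    use_header=True  -> Authorization: Bearer <token>
    use_header=False -> ?api_token=<token> appended to the URL
    """
    url = BASE_URL + path
    headers = {}
    if use_header:
        headers["Authorization"] = f"Bearer {token}"
    else:
        # urlencode escapes characters that are not URL-safe.
        url += "?" + urlencode({"api_token": token})
    return urllib.request.Request(url, headers=headers, method=method)


# Usage: the same endpoint, authenticated both ways.
via_header = build_request("/v1/health", "YOUR_TOKEN")
via_query = build_request("/v1/health", "YOUR_TOKEN", use_header=False)
```

When both options are available, the header form is usually preferable, since query strings tend to end up in server logs and browser history.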

POST /v1/jobs

Create a new crawl job

Start a new website crawl from the specified URL.

Request Body (JSON)

{
  "url": "https://example.com",        // Required: The root URL to start crawling from
  "depth": 2,                         // Optional: Crawl depth (default: 1, max: 10)
  "max_pages": 1000,                  // Optional: Maximum pages to crawl (default: 1000, max: 10000)
  "raw_html": false,                  // Optional: Save raw HTML in addition to cleaned HTML (default: false)
  "extract_main": false,              // Optional: Extract main content from pages (default: false)
  "ignore_robots": false,             // Optional: Ignore robots.txt restrictions (default: false)
  "callback_url": "https://..."       // Optional: URL to receive a webhook when job completes
}

Response (201 Created)

{
  "job_id": 123,                      // The job ID for tracking status
  "status": "queued"                  // Initial job status
}
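
A job-creation request might be assembled in Python roughly as shown here. This is a sketch under stated assumptions: build_create_job_request is a hypothetical helper, the upper limits are the ones listed in the request body above, and lower bounds are not checked because this page does not state them. The request is built but not sent.

```python
import json
import urllib.request

# Assumption: endpoint URL taken from the curl examples in this document.
API_URL = "https://crawlkit.dev/v1/jobs"

# Optional fields documented in the request body above.
OPTIONAL_FIELDS = {"raw_html", "extract_main", "ignore_robots", "callback_url"}


def build_create_job_request(token, url, depth=1, max_pages=1000, **options):
    """Validate the documented limits and build the POST request."""
    if depth > 10:
        raise ValueError("depth must not exceed 10")
    if max_pages > 10000:
        raise ValueError("max_pages must not exceed 10000")
    unknown = set(options) - OPTIONAL_FIELDS
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")

    payload = {"url": url, "depth": depth, "max_pages": max_pages, **options}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Usage: build the request, then send it with urllib.request.urlopen(req)
# and read "job_id" from the JSON body of the 201 response.
req = build_create_job_request(
    "YOUR_TOKEN", "https://example.com", depth=2, extract_main=True
)
```

Checking the limits on the client side avoids a round trip for requests the API would presumably reject anyway.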

GET /v1/jobs/{job_id}

Get job status

Check the status of a crawl job.

Path Parameters

  • job_id - The ID of the job to check

Response (200 OK)

{
  "job_id": 123,
  "url": "https://example.com",
  "depth": 2,
  "max_pages": 1000,
  "extract_main": false,
  "raw_html": false,
  "ignore_robots": false,
  "status": "running",                // Status: queued, running, completed, failed, cancelled
  "pages_done": 42,                   // Number of pages processed so far
  "pages_total": 1000,                // Maximum number of pages to process
  "created_at": "2023-10-15T14:30:00Z",
  "started_at": "2023-10-15T14:30:05Z",
  "finished_at": null,                // Will be set when job completes or fails
  "error_message": null,              // Will contain error message if job fails
  "download_url": null                // Will contain download URL if job is completed
}
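
Because a job moves through several statuses, a client will typically poll this endpoint until the job stops. The sketch below assumes that completed, failed, and cancelled are final states (this page lists them but does not say so explicitly). The names wait_for_job and fetch_status are hypothetical; fetch_status stands for any callable that performs the GET request and returns the decoded JSON response.

```python
import time

# Assumption: these three statuses are final; "queued" and "running" are not.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(fetch_status, job_id, interval=5.0, timeout=600.0,
                 sleep=time.sleep):
    """Poll until the job reaches a final status or the timeout expires.

    fetch_status(job_id) must return the status response as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']!r}")
        sleep(interval)


# Usage with a stand-in fetcher that replays canned responses.
responses = iter([
    {"job_id": 123, "status": "queued"},
    {"job_id": 123, "status": "running", "pages_done": 42},
    {"job_id": 123, "status": "completed",
     "download_url": "/v1/jobs/123/download"},
])
final = wait_for_job(lambda job_id: next(responses), 123, sleep=lambda s: None)
```

Passing the fetcher in as a parameter keeps the loop independent of any particular HTTP library. If a callback_url is set on the job, polling can be skipped in favor of the webhook.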

GET /v1/jobs/{job_id}/download

Download job result

Download the ZIP file containing the crawled pages in HTML and Markdown formats.

Path Parameters

  • job_id - The ID of the completed job

Response

Returns a ZIP file containing the crawled pages, or an error message if the job is not completed.
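
Since the response body may be an error message rather than a ZIP file, it can be worth verifying the download before using it. The following is a minimal sketch: save_crawl_archive is a hypothetical helper, and response stands for any binary file-like object, such as the one returned by urllib.request.urlopen. The format of the error message is not specified on this page, so the sketch only checks whether the saved file is a valid ZIP archive.

```python
import os
import shutil
import zipfile


def save_crawl_archive(response, dest_path):
    """Stream the response body to dest_path and return the archive's file list.

    Raises ValueError (and removes the file) if the body is not a ZIP,
    for example when the job has not completed yet.
    """
    with open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)

    if not zipfile.is_zipfile(dest_path):
        os.remove(dest_path)
        raise ValueError("response was not a ZIP file; is the job completed?")

    with zipfile.ZipFile(dest_path) as archive:
        return archive.namelist()
```

Streaming with shutil.copyfileobj avoids holding a large archive in memory.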

DELETE /v1/jobs/{job_id}

Cancel a job

Cancel a running or queued crawl job.

Path Parameters

  • job_id - The ID of the job to cancel

Response (200 OK)

{
  "status": "cancelled"
}

GET /v1/health

Health check

Check the health status of the API. This endpoint requires authentication.

Response (200 OK)

{
  "status": "ready",                  // Status: ready, initializing
  "queue_size": 3,                    // Number of jobs in queue
  "service": "crawlkit.dev"
}

Webhook Notifications

When you provide a callback_url in your job creation request, crawlkit.dev sends a POST request to that URL when the job finishes. The status field of the payload reports whether the job completed, failed, or was cancelled.

Webhook Payload

{
  "job_id": 123,
  "status": "completed",              // Status: completed, failed, cancelled
  "download_url": "/v1/jobs/123/download"  // Only included for completed jobs
}
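
On the receiving side, the payload can be handled along these lines. This is a sketch only: parse_webhook is a hypothetical helper, and resolving the relative download_url against https://crawlkit.dev is an assumption based on the curl examples in this document.

```python
import json
from urllib.parse import urljoin

# Assumption: relative download URLs resolve against the API host.
BASE_URL = "https://crawlkit.dev"


def parse_webhook(body):
    """Decode a webhook body and return (job_id, status, download_url).

    download_url is an absolute URL for completed jobs and None otherwise.
    """
    payload = json.loads(body)
    job_id = payload["job_id"]
    status = payload["status"]

    download_url = None
    if status == "completed" and payload.get("download_url"):
        download_url = urljoin(BASE_URL, payload["download_url"])
    return job_id, status, download_url


# Usage with the sample payload shown above.
sample = b'{"job_id": 123, "status": "completed", "download_url": "/v1/jobs/123/download"}'
job_id, status, download_url = parse_webhook(sample)
```

The download endpoint still requires authentication, so the API token has to be supplied when fetching the resolved URL.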

Example curl commands

Here are some example curl commands to help you get started with the API:

POST Create a new crawl job

curl -X POST "https://crawlkit.dev/v1/jobs" \
     -H "Content-Type: application/json" \
     -H "Authorization: Bearer YOUR_API_TOKEN" \
     -d '{
       "url": "https://example.com",
       "depth": 2,
       "max_pages": 100,
       "raw_html": false,
       "extract_main": true,
       "ignore_robots": false
     }'

GET Check job status

curl "https://crawlkit.dev/v1/jobs/123?api_token=YOUR_API_TOKEN"

GET Download crawl results

curl "https://crawlkit.dev/v1/jobs/123/download?api_token=YOUR_API_TOKEN" \
     --output crawl_result.zip

DELETE Cancel a job

curl -X DELETE "https://crawlkit.dev/v1/jobs/123" \
     -H "Authorization: Bearer YOUR_API_TOKEN"

GET Check API health

curl "https://crawlkit.dev/v1/health?api_token=YOUR_API_TOKEN"