API Documentation
Welcome to the crawlkit.dev API. Use these endpoints to create and manage website crawling jobs.
Authentication
All API endpoints require authentication using an API token.
You can provide the token as:
- Query parameter: ?api_token=YOUR_TOKEN
- HTTP header: Authorization: Bearer YOUR_TOKEN
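For example, in Python (a minimal sketch assuming the third-party requests library; the health endpoint is used here only as a convenient target):

import requests

TOKEN = "YOUR_TOKEN"
BASE = "https://crawlkit.dev"

# Option 1: token as a query parameter
resp = requests.get(f"{BASE}/v1/health", params={"api_token": TOKEN})

# Option 2: token as a Bearer header (equivalent; pick one)
resp = requests.get(f"{BASE}/v1/health", headers={"Authorization": f"Bearer {TOKEN}"})

print(resp.json())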
POST /v1/jobs
Create a new crawl job
Start a new website crawl from the specified URL.
Request Body (JSON)
{
"url": "https://example.com", // Required: The root URL to start crawling from
"depth": 2, // Optional: Crawl depth (default: 1, max: 10)
"max_pages": 1000, // Optional: Maximum pages to crawl (default: 1000, max: 10000)
"raw_html": false, // Optional: Save raw HTML in addition to cleaned HTML (default: false)
"extract_main": false, // Optional: Extract main content from pages (default: false)
"ignore_robots": false, // Optional: Ignore robots.txt restrictions (default: false)
"callback_url": "https://..." // Optional: URL to receive a webhook when job completes
}
Response (201 Created)
{
"job_id": 123, // The job ID for tracking status
"status": "queued" // Initial job status
}
GET /v1/jobs/{job_id}
Get job status
Check the status of a crawl job.
Path Parameters
- job_id - The ID of the job to check
Response (200 OK)
{
"job_id": 123,
"url": "https://example.com",
"depth": 2,
"max_pages": 1000,
"extract_main": false,
"raw_html": false,
"ignore_robots": false,
"status": "running", // Status: queued, running, completed, failed, cancelled
"pages_done": 42, // Number of pages processed so far
"pages_total": 1000, // Maximum number of pages to process
"created_at": "2023-10-15T14:30:00Z",
"started_at": "2023-10-15T14:30:05Z",
"finished_at": null, // Will be set when job completes or fails
"error_message": null, // Will contain error message if job fails
"download_url": null // Will contain download URL if job is completed
}
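Unless you use a callback_url, you poll this endpoint until the job reaches a terminal status. A Python sketch (assuming the requests library; the 5-second interval is an arbitrary choice, not a documented recommendation):

import time
import requests

TOKEN = "YOUR_API_TOKEN"
job_id = 123  # from the create response

while True:
    resp = requests.get(
        f"https://crawlkit.dev/v1/jobs/{job_id}",
        headers={"Authorization": f"Bearer {TOKEN}"},
    )
    resp.raise_for_status()
    job = resp.json()
    print(f"{job['status']}: {job['pages_done']}/{job['pages_total']} pages")
    if job["status"] in ("completed", "failed", "cancelled"):
        break
    time.sleep(5)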
GET /v1/jobs/{job_id}/download
Download job result
Download the ZIP file containing the crawled pages in HTML and Markdown formats.
Path Parameters
- job_id - The ID of the completed job
Response
Returns a ZIP file containing the crawled pages, or an error message if the job is not completed.
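A Python sketch that streams the ZIP to disk (assuming the requests library; streaming avoids buffering a large archive in memory):

import requests

TOKEN = "YOUR_API_TOKEN"
job_id = 123

resp = requests.get(
    f"https://crawlkit.dev/v1/jobs/{job_id}/download",
    headers={"Authorization": f"Bearer {TOKEN}"},
    stream=True,
)
resp.raise_for_status()  # a non-2xx response means the job is not completed
with open("crawl_result.zip", "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)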
DELETE /v1/jobs/{job_id}
Cancel a job
Cancel a running or queued crawl job.
Path Parameters
- job_id - The ID of the job to cancel
Response (200 OK)
{
"status": "cancelled"
}
GET /v1/health
Health check
Check the health status of the API. Requires authentication.
Response (200 OK)
{
"status": "ready", // Status: ready, initializing
"queue_size": 3, // Number of jobs in queue
"service": "crawlkit.dev"
}
Webhook Notifications
When you provide a callback_url in your job creation request, crawlkit.dev will send a POST request to that URL when the job finishes (completed, failed, or cancelled).
Webhook Payload
{
"job_id": 123,
"status": "completed", // Status: completed, failed, cancelled
"download_url": "/v1/jobs/123/download" // Only included for completed jobs
}
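A minimal receiver sketch using Flask (an assumption; any server reachable at your callback_url that accepts POST requests will do, and the /crawlkit-webhook route name is hypothetical):

from flask import Flask, request

app = Flask(__name__)

@app.route("/crawlkit-webhook", methods=["POST"])
def crawlkit_webhook():
    payload = request.get_json()
    if payload["status"] == "completed":
        # download_url is relative, e.g. "/v1/jobs/123/download"
        print("ready:", "https://crawlkit.dev" + payload["download_url"])
    else:
        print("job", payload["job_id"], "ended with status", payload["status"])
    return "", 200

if __name__ == "__main__":
    app.run(port=8000)  # port is arbitrary for this sketch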
Example curl commands
Here are some example curl commands to help you get started with the API:
POST Create a new crawl job
curl -X POST "https://crawlkit.dev/v1/jobs" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer YOUR_API_TOKEN" \
-d '{
"url": "https://example.com",
"depth": 2,
"max_pages": 100,
"raw_html": false,
"extract_main": true,
"ignore_robots": false
}'
GET Check job status
curl "https://crawlkit.dev/v1/jobs/123?api_token=YOUR_API_TOKEN"
GET Download crawl results
curl "https://crawlkit.dev/v1/jobs/123/download?api_token=YOUR_API_TOKEN" \
--output crawl_result.zip
DELETE Cancel a job
curl -X DELETE "https://crawlkit.dev/v1/jobs/123" \
-H "Authorization: Bearer YOUR_API_TOKEN"
GET Check API health
curl "https://crawlkit.dev/v1/health?api_token=YOUR_API_TOKEN"