Async Scraper Docs
- Schedule a new async job with the public API
To create a new scraping job:
- Endpoint:
POST /job/async
- Authentication: API KEY required.
- Request Body: JSON object describing the job (see example below)
- Response: A job object with a unique identifier
Example Request:
Imagine you want to scrape product names from a website and need to click a button to reveal the product list.
{
  "actions": [
    {
      "selector": "#button",
      "type": "click",
      "text": "buttonnameexample"
    }
  ],
  "data_to_scrape": [
    {
      "field": "product_name",
      "selector": {
        "type": "css-selector",
        "selector": ".product-title"
      }
    }
  ],
  "frequency": {
    "type": "daily", // when type is daily, weekly, monthly, or hourly, the value field is not needed
    "value": ""
  },
  "website": [
    "https://websitetoscrape.com/"
  ]
}
website: The target URL(s) to scrape. It is possible to schedule multiple URLs that share the same configuration (data_to_scrape, frequency, and actions) in a single job.
actions: A list of actions for the scraper to perform before extracting data. Each action has a type (e.g., "click", "type") and a selector to identify the element. A variant using a "type" action is sketched after the frequency notes below.
data_to_scrape: Defines what data to extract. Each item specifies a field (the name you want to give the data) and a selector. The selector is an object composed of two components: type, which can be either css-selector or xpath, and selector, which holds the selector expression itself. The sketch below also shows an xpath selector.
frequency: Determines how often the job should run. Options include:
- daily: Runs once a day.
- weekly: Runs once a week.
- monthly: Runs once a month.
- hourly: Runs once an hour.
- every_x_minutes: Runs every specified number of minutes.
- every_x_hours: Runs every specified number of hours.
- every_x_days: Runs every specified number of days.
- custom: Allows for a custom schedule.
Note: every_x_minutes, every_x_hours, and every_x_days require the value field to be a string containing the number that replaces the x. For example, the following frequency schedules extraction every 10 minutes:
"frequency": {
  "type": "every_x_minutes",
  "value": "10"
}
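For reference, here is a sketch of a variant job using the options described above. It is illustrative rather than an official example: it assumes a "type" action takes the text to enter in the text field, that an xpath selector uses the same object shape as css-selector, and that additional URLs are simply appended to the website array; all selectors and URLs are placeholders.
{
  "actions": [
    {
      "selector": "#search-input",
      "type": "type",
      "text": "laptops"
    }
  ],
  "data_to_scrape": [
    {
      "field": "product_name",
      "selector": {
        "type": "xpath",
        "selector": "//h2[contains(@class, 'product-title')]"
      }
    }
  ],
  "frequency": {
    "type": "every_x_hours",
    "value": "6"
  },
  "website": [
    "https://websitetoscrape.com/laptops",
    "https://websitetoscrape.com/tablets"
  ]
}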
Example Response:
The API responds with the created job details.
{
  "id": 5,
  "website": "https://websitetoscrape.com/",
  "actions": [
    {
      "selector": "#button",
      "type": "click",
      "text": "buttonnameexample"
    }
  ],
  "data_to_scrape": [
    {
      "field": "product_name",
      "selector": {
        "type": "css-selector",
        "selector": ".product-title"
      }
    }
  ],
  "frequency": {
    "type": "daily",
    "value": ""
  },
  "status": "Pending",
  "created_at": "2024-06-17T13:44:06.185683Z",
  "updated_at": "2024-06-17T13:44:06.185683Z"
}
Once the job is created, use the Data endpoint to retrieve the collected data, passing the id returned by the previous request:
- Endpoint:
POST /data/job/{job_id}
- Authentication: API KEY required.
- Request Body: No body needed
- Response: A JSON array of objects containing the data collected during the job's executions
Example Response:
The API responds with the data collected for the job.
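The exact payload is not reproduced in this document; based on the description above, a response for the job created earlier might look like the following, where each key comes from a field name in data_to_scrape and the values are placeholders:
[
  {
    "product_name": "Example Product A"
  },
  {
    "product_name": "Example Product B"
  }
]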