
Async Scraper Docs

  1. Schedule a new async job with the public API

To create a new scraping job:

  • Endpoint: POST /job/async
  • Authentication: API KEY required.
  • Request Body: JSON object describing the job (see example below)
  • Response: A job object with a unique identifier
Example Request:

Imagine you want to scrape product names from a website and need to click a button to reveal the product list.

{
  "actions": [
      {
        "selector": "#button",
        "type": "click",
        "text": "buttonnameexample"
     }
  ],
  "data_to_scrape": [
    {
      "field": "product_name",
      "selector": {
        "type": "css-selector",
        "selector": ".product-title"
      }
    }
  ],
  "frequency": {
    "type": "daily",
    "value": ""
  },
  "website": [
    "https://websitetoscrape.com/"
  ]
}
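The request body above can be built and submitted from a short script. A minimal sketch in Python: the payload mirrors the example exactly, while the base URL and the `X-API-KEY` header name are assumptions (check your deployment), so the actual HTTP call is left commented out.

```python
import json

# Build the request body for POST /job/async, mirroring the example above.
def build_job_payload(website, data_selector, field_name):
    return {
        "actions": [
            {"selector": "#button", "type": "click", "text": "buttonnameexample"}
        ],
        "data_to_scrape": [
            {
                "field": field_name,
                "selector": {"type": "css-selector", "selector": data_selector},
            }
        ],
        "frequency": {"type": "daily", "value": ""},
        "website": [website],
    }

payload = build_job_payload(
    "https://websitetoscrape.com/", ".product-title", "product_name"
)

# Hypothetical submission -- base URL and header name are assumptions:
# import requests
# resp = requests.post(
#     "https://api.example.com/job/async",
#     headers={"X-API-KEY": "your-api-key"},
#     json=payload,
# )
# job = resp.json()

print(json.dumps(payload, indent=2))
```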
  1. website: The target URL(s) to scrape. You can schedule multiple URLs that share the same configuration (data_to_scrape, frequency, and actions).
  2. actions: A list of actions for the scraper to perform before extracting data. Each action has a type (e.g., "click", "type") and a selector to identify the element.
  3. data_to_scrape: Defines what data to extract. Each item specifies a field (the name you want to give the data) and a selector. The selector is an object composed of two parts: type, which is either css-selector or xpath, and selector, the selector string itself.
  4. frequency: Determines how often the job should run. Options include:
    • daily: Runs once a day.
    • weekly: Runs once a week.
    • monthly: Runs once a month.
    • hourly: Runs once an hour.
    • every_x_minutes: Runs every specified number of minutes.
    • every_x_hours: Runs every specified number of hours.
    • every_x_days: Runs every specified number of days.
    • custom: Allows for a custom schedule.

Note: every_x_minutes, every_x_hours, and every_x_days require value to be a numeric string representing the x in every_x_*. For daily, weekly, monthly, and hourly, value can be left empty.

For example:

{
    "frequency": {
        "type": "every_x_minutes",
        "value": "10"
    }
}

In the example above, the scheduled job runs an extraction every 10 minutes.
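The value rule can be captured in a small helper that builds a valid frequency object. A sketch; the function and constant names below are ours, not part of the API:

```python
# Frequency types that take a numeric-string value ("every 10 minutes", etc.)
INTERVAL_TYPES = {"every_x_minutes", "every_x_hours", "every_x_days"}
# Frequency types where value is left empty.
FIXED_TYPES = {"daily", "weekly", "monthly", "hourly"}

def make_frequency(freq_type, value=""):
    """Build a frequency object for the job payload, enforcing the value rule."""
    if freq_type in INTERVAL_TYPES:
        if not str(value).isdigit():
            raise ValueError(
                f"{freq_type} requires a numeric string value, got {value!r}"
            )
        return {"type": freq_type, "value": str(value)}
    if freq_type in FIXED_TYPES:
        return {"type": freq_type, "value": ""}
    if freq_type == "custom":
        return {"type": "custom", "value": str(value)}
    raise ValueError(f"unknown frequency type: {freq_type}")
```

For instance, `make_frequency("every_x_minutes", 10)` produces the `{"type": "every_x_minutes", "value": "10"}` object from the example above.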

Example Response:

The API responds with the created job details.

{
  "id": 5,
  "website": "https://websitetoscrape.com/",
  "actions": [
      {
        "selector": "#button",
        "type": "click",
        "text": "buttonnameexample"
     }
  ],
  "data_to_scrape": [
    {
      "field": "product_name",
      "selector": {
        "type": "css-selector",
        "selector": ".product-title"
      }
    }
  ],
  "frequency": {
    "type": "daily",
    "value": ""
  },
  "status": "Pending",
  "created_at": "2024-06-17T13:44:06.185683Z",
  "updated_at": "2024-06-17T13:44:06.185683Z"
}
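The id field of this response is what you pass to the data endpoint in the next step. A short sketch parsing the sample response (trimmed here to the fields actually used):

```python
import json

# Sample response body from POST /job/async, trimmed to the fields used below.
response_body = """
{
  "id": 5,
  "website": "https://websitetoscrape.com/",
  "status": "Pending",
  "created_at": "2024-06-17T13:44:06.185683Z",
  "updated_at": "2024-06-17T13:44:06.185683Z"
}
"""

job = json.loads(response_body)
job_id = job["id"]
print(f"Job {job_id} is {job['status']}")  # Job 5 is Pending
```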

Then use the Data endpoint to retrieve the collected data, passing the id returned by the previous request:

  • Endpoint: POST /data/job/{job_id}
  • Authentication: API KEY required.
  • Request Body: No body needed
  • Response: A JSON array of objects with data collected related to a job execution
Example Response:

The API responds with the data collected for the job.

[
  {
    "created_at": "2024-10-25 02:56:11",
    "data": {
        "product_name": "This is awesome product!"
    },
    "html_url": "https://urltodownloadhtmlexample.com/",
    "id": "dataid",
    "jobid": "jobid"
  }
]
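The scraped values live under each record's data key, using the field names from your data_to_scrape configuration. A short sketch pulling them out of the sample response:

```python
import json

# Sample response body from POST /data/job/{job_id}, as shown above.
data_response = """
[
  {
    "created_at": "2024-10-25 02:56:11",
    "data": {"product_name": "This is awesome product!"},
    "html_url": "https://urltodownloadhtmlexample.com/",
    "id": "dataid",
    "jobid": "jobid"
  }
]
"""

records = json.loads(data_response)
product_names = [r["data"]["product_name"] for r in records]
print(product_names)  # ['This is awesome product!']
```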