
Async Scraper Docs

  1. Schedule a new async job with the public API

To create a new scraping job:

  • Endpoint: POST /job/async
  • Authentication: API KEY required.
  • Request Body: JSON object describing the job (see example below)
  • Response: A job object with a unique identifier
Example Request:

Imagine you want to scrape product names from a website and need to click a button to reveal the product list.

{
  "actions": [
      {
        "selector": "#button",
        "type": "click",
        "text": "buttonnameexample"
     }
  ],
  "data_to_scrape": [
    {
      "field": "product_name",
      "selector": {
        "type": "css-selector",
        "selector": ".product-title"
      }
    }
  ],
  "frequency": {
    "type": "daily",
    "value": ""
  },
  "website": [
    "https://websitetoscrape.com/"
  ]
}
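The request body above can be built and submitted from a short script. A minimal sketch in Python: the payload mirrors the example exactly, while the base URL and the `X-API-KEY` header name are assumptions (check your deployment), so the actual HTTP call is left commented out.

```python
import json

# Build the request body for POST /job/async, mirroring the example above.
def build_job_payload(website, data_selector, field_name):
    return {
        "actions": [
            {"selector": "#button", "type": "click", "text": "buttonnameexample"}
        ],
        "data_to_scrape": [
            {
                "field": field_name,
                "selector": {"type": "css-selector", "selector": data_selector},
            }
        ],
        "frequency": {"type": "daily", "value": ""},
        "website": [website],
    }

payload = build_job_payload(
    "https://websitetoscrape.com/", ".product-title", "product_name"
)

# Hypothetical submission -- base URL and header name are assumptions:
# import requests
# resp = requests.post(
#     "https://api.example.com/job/async",
#     headers={"X-API-KEY": "your-api-key"},
#     json=payload,
# )
# job = resp.json()

print(json.dumps(payload, indent=2))
```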
  1. website: The target URL(s) to scrape. You can schedule multiple URLs that share the same configuration (data_to_scrape, frequency, and actions).
  2. actions: A list of actions for the scraper to perform before extracting data. Each action has a type (e.g., "click", "type") and a selector to identify the element.
  3. data_to_scrape: Defines what data to extract. Each item specifies a field (the name you want to give the data) and a selector. The selector is an object composed of two parts: type, which is either css-selector or xpath, and selector, the selector string itself.
  4. frequency: Determines how often the job should run. Options include:
    • daily: Runs once a day.
    • weekly: Runs once a week.
    • monthly: Runs once a month.
    • hourly: Runs once an hour.
    • every_x_minutes: Runs every specified number of minutes.
    • every_x_hours: Runs every specified number of hours.
    • every_x_days: Runs every specified number of days.
    • custom: Allows for a custom schedule.

Note: every_x_minutes, every_x_hours, and every_x_days require value to be a numeric string representing the x in every_x_*. For daily, weekly, monthly, and hourly, value can be left empty.

For example:

{
    "frequency": {
        "type": "every_x_minutes",
        "value": "10"
    }
}

In the example above, the scheduled job runs an extraction every 10 minutes.
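The value rule can be captured in a small helper that builds a valid frequency object. A sketch; the function and constant names below are ours, not part of the API:

```python
# Frequency types that take a numeric-string value ("every 10 minutes", etc.)
INTERVAL_TYPES = {"every_x_minutes", "every_x_hours", "every_x_days"}
# Frequency types where value is left empty.
FIXED_TYPES = {"daily", "weekly", "monthly", "hourly"}

def make_frequency(freq_type, value=""):
    """Build a frequency object for the job payload, enforcing the value rule."""
    if freq_type in INTERVAL_TYPES:
        if not str(value).isdigit():
            raise ValueError(
                f"{freq_type} requires a numeric string value, got {value!r}"
            )
        return {"type": freq_type, "value": str(value)}
    if freq_type in FIXED_TYPES:
        return {"type": freq_type, "value": ""}
    if freq_type == "custom":
        return {"type": "custom", "value": str(value)}
    raise ValueError(f"unknown frequency type: {freq_type}")
```

For instance, `make_frequency("every_x_minutes", 10)` produces the `{"type": "every_x_minutes", "value": "10"}` object from the example above.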

Example Response:

The API responds with the created job details.

{
  "id": 5,
  "website": "https://websitetoscrape.com/",
  "actions": [
      {
        "selector": "#button",
        "type": "click",
        "text": "buttonnameexample"
     }
  ],
  "data_to_scrape": [
    {
      "field": "product_name",
      "selector": {
        "type": "css-selector",
        "selector": ".product-title"
      }
    }
  ],
  "frequency": {
    "type": "daily",
    "value": ""
  },
  "status": "Pending",
  "created_at": "2024-06-17T13:44:06.185683Z",
  "updated_at": "2024-06-17T13:44:06.185683Z"
}
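The id field of this response is what you pass to the data endpoint in the next step. A short sketch parsing the sample response (trimmed here to the fields actually used):

```python
import json

# Sample response body from POST /job/async, trimmed to the fields used below.
response_body = """
{
  "id": 5,
  "website": "https://websitetoscrape.com/",
  "status": "Pending",
  "created_at": "2024-06-17T13:44:06.185683Z",
  "updated_at": "2024-06-17T13:44:06.185683Z"
}
"""

job = json.loads(response_body)
job_id = job["id"]
print(f"Job {job_id} is {job['status']}")  # Job 5 is Pending
```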

Then use the Data endpoint to retrieve the collected data, passing the id returned by the previous request:

  • Endpoint: POST /data/job/{job_id}
  • Authentication: API KEY required.
  • Request Body: No body needed
  • Response: A JSON array of objects with data collected related to a job execution
Example Response:

The API responds with the data collected for the job.

[
  {
    "created_at": "2024-10-25 02:56:11",
    "data": {
        "product_name": "This is awesome product!"
    },
    "html_url": "https://urltodownloadhtmlexample.com/",
    "id": "dataid",
    "jobid": "jobid"
  }
]
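The scraped values live under each record's data key, using the field names from your data_to_scrape configuration. A short sketch pulling them out of the sample response:

```python
import json

# Sample response body from POST /data/job/{job_id}, as shown above.
data_response = """
[
  {
    "created_at": "2024-10-25 02:56:11",
    "data": {"product_name": "This is awesome product!"},
    "html_url": "https://urltodownloadhtmlexample.com/",
    "id": "dataid",
    "jobid": "jobid"
  }
]
"""

records = json.loads(data_response)
product_names = [r["data"]["product_name"] for r in records]
print(product_names)  # ['This is awesome product!']
```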