
Public API First Steps

The Scraper API is a powerful and flexible RESTful service designed for developers who need to perform web scraping tasks efficiently and at scale. This API is part of a scraping SaaS platform that allows you to submit, monitor, and manage web scraping jobs programmatically.

Manage Account

First, you need to create an account at the DataChaser AI platform. After that, you can create an API Key to authenticate requests to other resources or update your account details.

Update User Account

To update your account information:

  • Endpoint: PUT /account
  • Authentication: API KEY required. You can create an API Key on your DataChaser AI dashboard.

For example, if you are sending the request via Postman, you will add the API KEY in the headers as shown below:

(Screenshot: API key authentication via Postman)

  • Request Body: JSON object with the fields you want to update (see example below)
  • Response: The updated account object

Example Request:

{
  "name": "string",
  "email": "string",
  "password": "string",
  "cs_id": "string"
}

Here are the key components:

  1. name: Your full name (optional)
  2. email: Your new email address (optional)
  3. password: Your new password (optional, must be at least 8 characters long)
  4. cs_id: Your customer ID (optional). You are automatically assigned one after signing up on the DataChaser AI platform.

Example Response:

The API responds with the updated account details.

{
  "name": "Jane Doe",
  "email": "jane.doe@example.com",
  "create_date": "2023-06-17T13:44:06.185683Z"
}
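
If you are calling the API from code instead of Postman, here is a minimal sketch using Python's requests library. The base URL is a placeholder, and the Key header name follows the request examples later in this guide; adjust both to match your environment.

# Minimal sketch: update the account name and email.
# BASE_URL is a placeholder, not the real API host.
import requests

BASE_URL = "https://api.example.com"
API_KEY = "your_api_key_here"

payload = {
    "name": "Jane Doe",
    "email": "jane.doe@example.com",
}

response = requests.put(
    f"{BASE_URL}/account",
    headers={"Key": API_KEY},  # API key sent as a request header
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the updated account object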

Manage API Keys

API Keys are essential for authenticating your requests to the Web Scraper AI system.

1. Create a New API Key

To create a new API key:

  • Endpoint: POST /keys

  • Authentication: Existing API KEY required.

Note: If this is your first time creating an API key, you need to sign in and create one manually. Follow these steps:

  1. Sign in with your email and password on the DataChaser AI platform.

(Screenshot: Log in with email and password)

  2. Click on your user profile and, in the dropdown, click Account Settings.

(Screenshot: Account Settings)

  3. Under API Keys, click CREATE API KEY.

(Screenshot: CREATE API KEY)

  • Request Body: JSON object with rate limit information (optional)

  • Response: The newly created API key object

Example Request:

{
  "rate_limit": {
    "interval": "string",
    "limit": int
  }
}

Here are the key components:

  • rate_limit: An optional object to set usage limits for the new key
     • interval: The time frame for the limit (e.g., "hourly", "daily")
     • limit: The maximum number of requests allowed in the specified interval

Example Response:

{
  "apikey": "your_new_api_key_here",
  "rate_limit": {
    "interval": "hourly",
    "limit": 1000
  },
  "created_at": "2024-06-28T13:44:06.185683Z"
}
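
As a rough sketch (same placeholder base URL and Key header assumptions as above), creating a key with an hourly rate limit from Python could look like this:

# Create a new API key limited to 1000 requests per hour.
import requests

BASE_URL = "https://api.example.com"  # placeholder
EXISTING_API_KEY = "your_existing_api_key"

payload = {"rate_limit": {"interval": "hourly", "limit": 1000}}

response = requests.post(
    f"{BASE_URL}/keys",
    headers={"Key": EXISTING_API_KEY},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print("Created key:", response.json()["apikey"])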

2. Retrieve All API Keys

To get a list of all your API keys:

  • Endpoint: GET /keys
  • Authentication: API KEY required
  • Response: An array of API key objects

Example Response:

[
  {
    "apikey": "your_api_key_1",
    "rate_limit": {
      "interval": "hourly",
      "limit": 1000
    },
    "created_at": "2024-06-28T13:44:06.185683Z"
  },
  {
    "apikey": "your_api_key_2",
    "rate_limit": {
      "interval": "daily",
      "limit": 10000
    },
    "created_at": "2024-06-29T10:15:30.123456Z"
  }
]
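
A quick sketch for listing keys, under the same placeholder assumptions as the earlier snippets:

# List every API key on the account and show its rate limit.
import requests

BASE_URL = "https://api.example.com"  # placeholder
API_KEY = "your_api_key_here"

response = requests.get(f"{BASE_URL}/keys", headers={"Key": API_KEY}, timeout=30)
response.raise_for_status()
for key in response.json():
    print(key["apikey"], key.get("rate_limit"))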

3. Delete an API Key

To delete an existing API key:

  • Endpoint: DELETE /keys
  • Authentication: API KEY required
  • Request Body: JSON object with the API key to delete
  • Response: The deleted API key object

Example Request:

{
  "apikey": "api_key_to_delete"
}

Example Response:

{
  "apikey": "api_key_to_delete",
  "rate_limit": {
    "interval": "hourly",
    "limit": 1000
  },
  "created_at": "2024-06-28T13:44:06.185683Z"
}
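
Deleting a key from code follows the same pattern; note that the key to delete goes in the request body, not the URL. A sketch with a placeholder base URL:

# Delete a specific API key by passing it in the request body.
import requests

BASE_URL = "https://api.example.com"  # placeholder
API_KEY = "your_api_key_here"

response = requests.delete(
    f"{BASE_URL}/keys",
    headers={"Key": API_KEY},
    json={"apikey": "api_key_to_delete"},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # the deleted key object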

Now that you have created an account and can manage API Keys, you can access the protected endpoints as follows:


Manage Jobs

A job represents a specific web scraping task that you want to perform.

Job Lifecycle

A typical job goes through the following stages:

  1. Pending: The job is queued and waiting to be processed.
  2. In Progress: The scraper is actively working on the job.
  3. Paused: The scraper has temporarily stopped working on the job.
  4. Completed: The job has finished successfully.
  5. Failed: The job encountered an error and could not complete.
  6. Cancelled: The job was manually cancelled before completion.

Managing Jobs

1. Create a New Job

To create a new scraping job:

  • Endpoint: POST /job
  • Authentication: API KEY required.
  • Request Body: JSON object describing the job (see example below)
  • Response: A job object with a unique identifier

Example Request:

Imagine you want to scrape product names from a website and need to click a button to reveal the product list.

{
  "website": "string",
  "actions": [
    {
      "type": "string",
      "selector": "string",
      "text": "string"
    }
  ],
  "data_to_scrape": [
    {
      "field": "string",
      "selector": {
        "type": "string",
        "selector": "string"
      }
    }
  ],
  "frecuency": {
    "value": "string",
    "type": "string"
  }
}

Here are the key components:

  1. website: The target URL to scrape.
  2. actions: A list of actions for the scraper to perform before extracting data. Each action has a type (e.g., "click", "type") and a selector to identify the element.
  3. data_to_scrape: Defines what data to extract. Each item specifies a field (the name you want to give the data) and a selector. The selector is an object with two components: type, which is either css-selector or xpath, and selector, which is the selector expression itself.
  4. frequency: Determines how often the job should run. Options include:
     • daily: Runs once a day.
     • weekly: Runs once a week.
     • monthly: Runs once a month.
     • hourly: Runs once an hour.
     • every_x_minutes: Runs every specified number of minutes.
     • every_x_hours: Runs every specified number of hours.
     • every_x_days: Runs every specified number of days.
     • custom: Allows for a custom schedule.

Note: The every_x_minutes, every_x_hours, every_x_days, and custom options are not yet supported programmatically; when they are, the value is expected to be a cron schedule expression.

Example Response:

The API responds with the created job details.

{
  "id": 5,
  "website": "https://websitetoscrape.com/",
  "actions": [
      {
        "selector": "#button",
        "type": "click",
        "text": "buttonnameexample"
     }
  ],
  "data_to_scrape": [
    {
      "field": "product_name",
      "selector": {
        "type": "css-selector",
        "selector": ".product-title"
      }
    }
  ],
  "frequency": {
    "type": "daily",
    "value": "daily"
  },
  "status": "Pending",
  "created_at": "2024-06-17T13:44:06.185683Z",
  "updated_at": "2024-06-17T13:44:06.185683Z"
}
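
The sketch below submits the same job from Python, mirroring the request and response shown in this section (placeholder base URL, Key header as in the request examples in this guide):

# Create a daily job: click a button, then scrape product titles.
import requests

BASE_URL = "https://api.example.com"  # placeholder
API_KEY = "your_api_key_here"

job = {
    "website": "https://websitetoscrape.com/",
    "actions": [
        {"type": "click", "selector": "#button", "text": "buttonnameexample"}
    ],
    "data_to_scrape": [
        {
            "field": "product_name",
            "selector": {"type": "css-selector", "selector": ".product-title"},
        }
    ],
    "frequency": {"type": "daily", "value": "daily"},
}

response = requests.post(f"{BASE_URL}/job", headers={"Key": API_KEY}, json=job, timeout=30)
response.raise_for_status()
print("Created job with id:", response.json()["id"])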

2. Retrieve a Specific Job

To get details about a particular job:

  • Endpoint: GET /job/{id}
  • Authentication: API KEY required
  • Parameters: Replace {id} with the job's unique identifier
  • Response: The job object

Example Request:

GET /job/5
Key: <your_api_key>
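
Because jobs run asynchronously, a common pattern is to poll this endpoint until the job reaches a terminal state. A minimal sketch (placeholder base URL):

# Poll a job until it is Completed, Failed, or Cancelled.
import time
import requests

BASE_URL = "https://api.example.com"  # placeholder
API_KEY = "your_api_key_here"
JOB_ID = 5

while True:
    response = requests.get(f"{BASE_URL}/job/{JOB_ID}", headers={"Key": API_KEY}, timeout=30)
    response.raise_for_status()
    status = response.json()["status"]
    print("Job status:", status)
    if status in ("Completed", "Failed", "Cancelled"):
        break
    time.sleep(60)  # wait a minute between checks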

3. Update an Existing Job

To modify a job that hasn't started yet:

  • Endpoint: PUT /job/{id}
  • Authentication: API KEY required
  • Parameters: Replace {id} with the job's unique identifier
  • Request Body: Updated job configuration (similar to job creation)
  • Response: The updated job object
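
A sketch of an update call (placeholder base URL; the body follows the same shape as job creation):

# Change a pending job to run weekly using an XPath selector.
import requests

BASE_URL = "https://api.example.com"  # placeholder
API_KEY = "your_api_key_here"
JOB_ID = 5

updated_job = {
    "website": "https://websitetoscrape.com/",
    "actions": [],
    "data_to_scrape": [
        {
            "field": "product_name",
            "selector": {"type": "xpath", "selector": "//h2[@class='product-title']"},
        }
    ],
    "frequency": {"type": "weekly", "value": "weekly"},
}

response = requests.put(f"{BASE_URL}/job/{JOB_ID}", headers={"Key": API_KEY}, json=updated_job, timeout=30)
response.raise_for_status()
print(response.json())  # the updated job object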

4. Cancel a Job

To cancel a job:

  • Endpoint: DELETE /job/{id}
  • Authentication: API KEY required
  • Parameters: Replace {id} with the job's unique identifier
  • Response: The cancelled job object

Example Request:

DELETE /job/5
Key: <your_api_key>

Example Response:

{
  ...
  "status": "Cancelled",
  ...
}

The job status changes to Cancelled.

5. List All Jobs

To retrieve all your jobs:

  • Endpoint: GET /job
  • Authentication: API KEY required
  • Response: An array of job objects

Example Request:

GET /job
Key: <your_api_key>

Example Response:

You get a list of all jobs.

6. Synchronous Job Processing

For small scraping tasks or testing, you can use the synchronous job processing endpoint:

  • Endpoint: POST /job/sync
  • Authentication: API Key required
  • Request Body: Same as job creation
  • Response: The scraped data
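
The sketch below runs a one-off synchronous job and prints the result (placeholder base URL; omitting frequency here is an assumption, since a synchronous job runs only once):

# Run a small scrape synchronously and receive the data directly.
import requests

BASE_URL = "https://api.example.com"  # placeholder
API_KEY = "your_api_key_here"

job = {
    "website": "https://websitetoscrape.com/",
    "actions": [],
    "data_to_scrape": [
        {
            "field": "product_name",
            "selector": {"type": "css-selector", "selector": ".product-title"},
        }
    ],
}

response = requests.post(f"{BASE_URL}/job/sync", headers={"Key": API_KEY}, json=job, timeout=120)
response.raise_for_status()
print(response.json())  # the scraped data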

Retrieve Data

The Data endpoint allows you to retrieve, manage, and delete the data collected by your web scraping jobs.

1. Retrieve All Data

To get all data associated with your account:

  • Endpoint: GET /data

  • Authentication: API KEY required

  • Response: An array of data objects

[
  {
    "id": "string",
    "jobid": "string",
    "data": {},
    "created_at": "string"
  }
]

Example Response:

[
  {
    "id": "data_123456",
    "jobid": "job_789012",
    "data": {
      "product_name": "Example Product",
      "price": "$19.99",
      "description": "This is an example product description."
    },
    "created_at": "2024-06-28T13:44:06.185683Z"
  },
  {
    "id": "data_789012",
    "jobid": "job_345678",
    "data": {
      "article_title": "Breaking News",
      "author": "John Doe",
      "content": "This is the content of the breaking news article."
    },
    "created_at": "2024-06-29T10:15:30.123456Z"
  }
]

The scraped data is returned in a flexible JSON format.

Note: The fields in the data object directly correspond to the data_to_scrape configuration in your job setup. Each field you specify in data_to_scrape will appear as a key-value pair in this data object, where the key is the field name you specified, and the value is the data scraped from the website according to the selector you provided.

For example, in the above example (first element of the array):

  1. product_name: This field contains the name of the product that was scraped from the website. In this example, it's "Example Product".

  2. price: This field contains the price of the product that was scraped. In this case, it's "$19.99".

The structure and content of the data object can vary significantly depending on:

  1. The website being scraped, and

  2. The specific data points you've configured your job to collect.

For instance, if you had set up your scraping job to collect additional information like product manufacturer or customer ratings, you might see a data object that looks more like this:

"data": {
  "product_name": "Example Product",
  "price": "$19.99",
  "description": "This is an example product description.",
  "manufacturer": "Manufacturer_XYZ",
  "ratings": "4.5/5.0"
}
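
Fetching and iterating over the collected data from Python could look like this sketch (placeholder base URL):

# Print the scraped fields of every data entry on the account.
import requests

BASE_URL = "https://api.example.com"  # placeholder
API_KEY = "your_api_key_here"

response = requests.get(f"{BASE_URL}/data", headers={"Key": API_KEY}, timeout=30)
response.raise_for_status()
for entry in response.json():
    print(f"Job {entry['jobid']}: {entry['data']}")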

2. Retrieve Data by Job ID

To get data generated by a specific job:

  • Endpoint: GET /data/job/{id}
  • Authentication: API KEY required
  • Parameters: Replace {id} with the job's unique identifier
  • Response: An array of data objects associated with the specified job

Example Request:

GET /data/job/job_789012

Example Response:

[
  {
    "id": "data_123456",
    "jobid": "job_789012",
    "data": {
      "product_name": "Example Product",
      "price": "$19.99",
      "description": "This is an example product description."
    },
    "created_at": "2024-06-28T13:44:06.185683Z"
  }
]

3. Delete Data by ID

To delete a specific data entry:

  • Endpoint: DELETE /data/{id}
  • Authentication: API KEY required
  • Parameters: Replace {id} with the data's unique identifier
  • Response: The deleted data object

Example Request:

DELETE /data/data_123456

Example Response:

{
  "id": "data_123456",
  "jobid": "job_789012",
  "data": {
    "product_name": "Example Product",
    "price": "$19.99",
    "description": "This is an example product description."
  },
  "created_at": "2024-06-28T13:44:06.185683Z"
}

4. WebHook for Data Updates

To receive notifications when new data is available:

  • Endpoint: POST /data/user/{id}
  • Authentication: API KEY required
  • Parameters: Replace {id} with your user ID
  • Description: This is a webhook that will send a POST request to your specified URL when new data is available for any of your jobs.

Example Webhook Payload:

{
  "id": "data_123456",
  "jobid": "job_789012",
  "data": {
    "product_name": "New Product",
    "price": "$24.99",
    "description": "This is a newly scraped product description."
  },
  "created_at": "2024-06-30T09:30:15.987654Z"
}
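
On your side, you need an HTTP endpoint that accepts this POST payload. The sketch below uses Flask purely as an illustration; the /webhook path and port are arbitrary choices, and how the receiving URL is registered depends on your account setup.

# Minimal Flask receiver for the webhook payload shown above.
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def handle_new_data():
    payload = request.get_json()
    print("New data for job", payload["jobid"], ":", payload["data"])
    return "", 204  # acknowledge receipt with an empty response

if __name__ == "__main__":
    app.run(port=8000)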

Access Logs

Logs provide a detailed record of events that occur during the operation of your web scraping jobs.

1. Retrieve All Logs

To get all logs associated with your account:

  • Endpoint: GET /logs

  • Authentication: API KEY required

  • Response: An array of log objects

[
  {
    "jobid": "string",
    "message": "string",
    "level": "string",
    "created_at": "string"
  }
]

Example Response:

[
  {
    "jobid": "job_123456",
    "message": "Job started successfully",
    "level": "INFO",
    "created_at": "2024-06-28T13:44:06.185683Z"
  },
  {
    "jobid": "job_123456",
    "message": "Successfully scraped 50 items",
    "level": "INFO",
    "created_at": "2024-06-28T13:45:12.345678Z"
  },
  {
    "jobid": "job_789012",
    "message": "Failed to access URL: Connection timeout",
    "level": "ERROR",
    "created_at": "2024-06-29T10:15:30.123456Z"
  }
]

Each log entry contains the following fields:

  • jobid: The unique identifier of the job that generated this log.
  • message: A descriptive message about the event that occurred.
  • level: The severity level of the log. Common levels include:
     • INFO: General information about job execution.
     • WARNING: Potential issues that didn't stop the job but might require attention.
     • ERROR: Serious issues that prevented the job from completing successfully.
  • created_at: The timestamp when the log entry was created.
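
For example, to surface only the problems, you can fetch all logs and filter by level. A sketch (placeholder base URL):

# Print only ERROR-level log entries.
import requests

BASE_URL = "https://api.example.com"  # placeholder
API_KEY = "your_api_key_here"

response = requests.get(f"{BASE_URL}/logs", headers={"Key": API_KEY}, timeout=30)
response.raise_for_status()
for log in response.json():
    if log["level"] == "ERROR":
        print(f"[{log['created_at']}] job {log['jobid']}: {log['message']}")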

2. Retrieve Logs by Job ID

To get logs generated by a specific job:

  • Endpoint: GET /log/job/{id}
  • Authentication: API KEY required
  • Parameters: Replace {id} with the job's unique identifier
  • Response: An array of log objects associated with the specified job

Example Request:

GET /log/job/job_123456

Example Response:

[
  {
    "jobid": "job_123456",
    "message": "Job started successfully",
    "level": "INFO",
    "created_at": "2024-06-28T13:44:06.185683Z"
  },
  {
    "jobid": "job_123456",
    "message": "Successfully scraped 50 items",
    "level": "INFO",
    "created_at": "2024-06-28T13:45:12.345678Z"
  }
]