Public API First Steps
The Scraper API is a powerful and flexible RESTful service designed for developers who need to perform web scraping tasks efficiently and at scale. This API is part of a scraping SaaS platform that allows you to submit, monitor, and manage web scraping jobs programmatically.
Manage Account
First, you need to create an account at the DataChaser AI platform. After that, you can create an API Key to authenticate requests to other resources or update your account details.
Update User Account
To update your account information:
- Endpoint: PUT /account
- Authentication: API KEY required. You can create an API Key on your DataChaser AI dashboard. For example, if you are sending the request via Postman, add the API KEY in the request headers.
- Request Body: JSON object with the fields you want to update (see example below)
- Response: The updated account object
Example Request:
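A minimal sketch of the request; the base URL and the X-API-Key header name are placeholders, so use the values provided by the DataChaser AI platform:

# Illustrative only: <base-url> and the X-API-Key header name are assumptions.
curl -X PUT "https://<base-url>/account" \
  -H "X-API-Key: your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "Jane Doe",
    "email": "jane.doe@example.com",
    "password": "new_secure_password"
  }'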
Here are the key components:
- name: Your full name (optional)
- email: Your new email address (optional)
- password: Your new password (optional, must be at least 8 characters long)
- cs_id: Your customer ID (optional). You are automatically assigned one after signing up on the DataChaser AI platform.
Example Response:
The API responds with the updated account details.
{
"name": "Jane Doe",
"email": "jane.doe@example.com",
"create_date": "2023-06-17T13:44:06.185683Z"
}
Manage API Keys
API Keys are essential for authenticating your requests to the Web Scraper AI system.
1. Create a New API Key
To create a new API key:
- Endpoint: POST /keys
- Authentication: Existing API KEY required.
  Note: If it is your first time creating an API key, you need to sign in and create one manually. Follow these steps:
  - Sign in with your email and password here.
  - Click on your user profile. In the dropdown, click Account Settings.
  - Under Api Keys, click CREATE API KEY.
- Request Body: JSON object with rate limit information (optional)
- Response: The newly created API key object
Example Request:
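A sketch of the request with an optional rate limit; the base URL and the X-API-Key header name are placeholders:

# Illustrative only: <base-url> and the X-API-Key header name are assumptions.
curl -X POST "https://<base-url>/keys" \
  -H "X-API-Key: your_existing_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "rate_limit": {
      "interval": "hourly",
      "limit": 1000
    }
  }'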
Here are the key components:
- rate_limit: An optional object to set usage limits for the new key
  - interval: The time frame for the limit (e.g., "hourly", "daily")
  - limit: The maximum number of requests allowed in the specified interval
Example Response:
{
"apikey": "your_new_api_key_here",
"rate_limit": {
"interval": "hourly",
"limit": 1000
},
"created_at": "2024-06-28T13:44:06.185683Z"
}
2. Retrieve All API Keys
To get a list of all your API keys:
- Endpoint: GET /keys
- Authentication: API KEY required
- Response: An array of API key objects
Example Response:
[
{
"apikey": "your_api_key_1",
"rate_limit": {
"interval": "hourly",
"limit": 1000
},
"created_at": "2024-06-28T13:44:06.185683Z"
},
{
"apikey": "your_api_key_2",
"rate_limit": {
"interval": "daily",
"limit": 10000
},
"created_at": "2024-06-29T10:15:30.123456Z"
}
]
3. Delete an API Key
To delete an existing API key:
- Endpoint: DELETE /keys
- Authentication: API KEY required
- Request Body: JSON object with the API key to delete
- Response: The deleted API key object
Example Request:
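A sketch of the request; the base URL, the X-API-Key header name, and the assumption that the body uses the same apikey field as the response are illustrative:

# Illustrative only: <base-url>, the X-API-Key header name, and the apikey body field are assumptions.
curl -X DELETE "https://<base-url>/keys" \
  -H "X-API-Key: your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "apikey": "api_key_to_delete"
  }'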
Example Response:
{
"apikey": "api_key_to_delete",
"rate_limit": {
"interval": "hourly",
"limit": 1000
},
"created_at": "2024-06-28T13:44:06.185683Z"
}
Now that you have created an account and can manage API Keys, you can access the protected endpoints as follows:
Manage Jobs
A job represents a specific web scraping task that you want to perform.
Job Lifecycle
A typical job goes through the following stages:
- Pending: The job is queued and waiting to be processed.
- In Progress: The scraper is actively working on the job.
- Paused: The scraper has temporarily stopped working on the job.
- Completed: The job has finished successfully.
- Failed: The job encountered an error and could not complete.
- Cancelled: The job was manually cancelled before completion.
Managing Jobs
1. Create a New Job
To create a new scraping job:
- Endpoint: POST /job
- Authentication: API KEY required.
- Request Body: JSON object describing the job (see example below)
- Response: A job object with a unique identifier
Example Request:
Imagine you want to scrape product names from a website and need to click a button to reveal the product list.
{
"website": "string",
"actions": [
{
"type": "string",
"selector": "string",
"text": "string"
}
],
"data_to_scrape": [
{
"field": "string",
"selector": {
"type": "string",
"selector": "string"
}
}
],
"frecuency": {
"value": "string",
"type": "string"
}
}
Here are the key components:
- website: The target URL to scrape.
- actions: A list of actions for the scraper to perform before extracting data. Each action has a type (e.g., "click", "type") and a selector to identify the element.
- data_to_scrape: Defines what data to extract. Each item specifies a field (the name you want to give the data) and a selector. The selector is an object composed of two components: type, which can be either css-selector or xpath, and selector, which is the selector expression itself.
- frequency: Determines how often the job should run. Options include:
  - daily: Runs once a day.
  - weekly: Runs once a week.
  - monthly: Runs once a month.
  - hourly: Runs once an hour.
  - every_x_minutes: Runs every specified number of minutes.
  - every_x_hours: Runs every specified number of hours.
  - every_x_days: Runs every specified number of days.
  - custom: Allows for a custom schedule.
Note: The every_x_minutes, every_x_hours, every_x_days, and custom options are not yet supported programmatically, but we expect to receive a cron schedule expression for each of these options.
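Putting this together for the scenario above (click a button, then scrape product names once a day), a filled-in request body might look like the following; the URL and selectors are illustrative:

{
  "website": "https://websitetoscrape.com/",
  "actions": [
    {
      "type": "click",
      "selector": "#button",
      "text": "buttonnameexample"
    }
  ],
  "data_to_scrape": [
    {
      "field": "product_name",
      "selector": {
        "type": "css-selector",
        "selector": ".product-title"
      }
    }
  ],
  "frequency": {
    "value": "daily",
    "type": "daily"
  }
}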
Example Response:
The API responds with the created job details.
{
"id": 5,
"website": "https://websitetoscrape.com/",
"actions": [
{
"selector": "#button",
"type": "click",
"text": "buttonnameexample"
}
],
"data_to_scrape": [
{
"field": "product_name",
"selector": {
"type": "css-selector",
"selector": ".product-title"
}
}
],
"frequency": {
"type": "daily",
"value": "daily"
},
"status": "Pending",
"created_at": "2024-06-17T13:44:06.185683Z",
"updated_at": "2024-06-17T13:44:06.185683Z"
}
2. Retrieve a Specific Job
To get details about a particular job:
- Endpoint: GET /job/{id}
- Authentication: API KEY required
- Parameters: Replace {id} with the job's unique identifier
- Response: The job object
Example Request:
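A sketch of the request for the job created above (id 5); the base URL and the X-API-Key header name are placeholders:

# Illustrative only: <base-url> and the X-API-Key header name are assumptions.
curl -X GET "https://<base-url>/job/5" \
  -H "X-API-Key: your_api_key_here"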
3. Update an Existing Job
To modify a job that hasn't started yet:
- Endpoint: PUT /job/{id}
- Authentication: API KEY required
- Parameters: Replace {id} with the job's unique identifier
- Request Body: Updated job configuration (similar to job creation; see the sketch below)
- Response: The updated job object
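A sketch of an update that changes the example job to run weekly; the base URL, the X-API-Key header name, and the field values are illustrative, and the body mirrors the job-creation format:

# Illustrative only: <base-url> and the X-API-Key header name are assumptions.
curl -X PUT "https://<base-url>/job/5" \
  -H "X-API-Key: your_api_key_here" \
  -H "Content-Type: application/json" \
  -d '{
    "website": "https://websitetoscrape.com/",
    "actions": [
      {
        "type": "click",
        "selector": "#button",
        "text": "buttonnameexample"
      }
    ],
    "data_to_scrape": [
      {
        "field": "product_name",
        "selector": {
          "type": "css-selector",
          "selector": ".product-title"
        }
      }
    ],
    "frequency": {
      "value": "weekly",
      "type": "weekly"
    }
  }'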
4. Cancel a Job
To cancel a job:
- Endpoint: DELETE /job/{id}
- Authentication: API KEY required
- Parameters: Replace {id} with the job's unique identifier
- Response: The cancelled job object
Example Request:
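A sketch of the request; the base URL and the X-API-Key header name are placeholders:

# Illustrative only: <base-url> and the X-API-Key header name are assumptions.
curl -X DELETE "https://<base-url>/job/5" \
  -H "X-API-Key: your_api_key_here"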
Example Response:
The job status changes to Cancelled.
5. List All Jobs
To retrieve all your jobs:
- Endpoint: GET /job
- Authentication: API KEY required
- Response: An array of job objects
Example Request:
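A sketch of the request; the base URL and the X-API-Key header name are placeholders:

# Illustrative only: <base-url> and the X-API-Key header name are assumptions.
curl -X GET "https://<base-url>/job" \
  -H "X-API-Key: your_api_key_here"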
Example Response:
You get a list of all jobs.
6. Synchronous Job Processing
For small scraping tasks or testing, you can use the synchronous job processing endpoint:
- Endpoint: POST /job/sync
- Authentication: API Key required
- Request Body: Same as job creation
- Response: The scraped data
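For example, you can send the same JSON body used for job creation to the synchronous endpoint; the base URL and the X-API-Key header name are placeholders, and job.json stands for a file containing the job-creation body shown earlier:

# Illustrative only: <base-url>, the X-API-Key header name, and job.json are assumptions.
curl -X POST "https://<base-url>/job/sync" \
  -H "X-API-Key: your_api_key_here" \
  -H "Content-Type: application/json" \
  -d @job.json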
Retrieve Data
The Data endpoint allows you to retrieve, manage, and delete the data collected by your web scraping jobs.
1. Retrieve All Data
To get all data associated with your account:
- Endpoint: GET /data
- Authentication: API KEY required
- Response: An array of data objects
Example Response:
[
{
"id": "data_123456",
"jobid": "job_789012",
"data": {
"product_name": "Example Product",
"price": "$19.99",
"description": "This is an example product description."
},
"created_at": "2024-06-28T13:44:06.185683Z"
},
{
"id": "data_789012",
"jobid": "job_345678",
"data": {
"article_title": "Breaking News",
"author": "John Doe",
"content": "This is the content of the breaking news article."
},
"created_at": "2024-06-29T10:15:30.123456Z"
}
]
The scraped data is returned in a flexible JSON format.
Note: The fields in the data object directly correspond to the data_to_scrape configuration in your job setup. Each field you specify in data_to_scrape will appear as a key-value pair in this data object, where the key is the field name you specified, and the value is the data scraped from the website according to the selector you provided.
For example, in the above example (first element of the array):
- product_name: This field contains the name of the product that was scraped from the website. In this example, it's "Example Product".
- price: This field contains the price of the product that was scraped. In this case, it's "$19.99".
The structure and content of the data object can vary significantly depending on:
- The website being scraped, and
- The specific data points you've configured your job to collect.
For instance, if you had set up your scraping job to collect additional information like product manufacturer or customer ratings, you might see a data object that looks more like this:
"data": {
"product_name": "Example Product",
"price": "$19.99",
"description": "This is an example product description.",
"manufacturer": "Manufacturer_XYZ",
"ratings": "4.5/5.0"
}
2. Retrieve Data by Job ID
To get data generated by a specific job:
- Endpoint: GET /data/job/{id}
- Authentication: API KEY required
- Parameters: Replace {id} with the job's unique identifier
- Response: An array of data objects associated with the specified job
Example Request:
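A sketch of the request for the job shown in the response below; the base URL and the X-API-Key header name are placeholders:

# Illustrative only: <base-url> and the X-API-Key header name are assumptions.
curl -X GET "https://<base-url>/data/job/job_789012" \
  -H "X-API-Key: your_api_key_here"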
Example Response:
[
{
"id": "data_123456",
"jobid": "job_789012",
"data": {
"product_name": "Example Product",
"price": "$19.99",
"description": "This is an example product description."
},
"created_at": "2024-06-28T13:44:06.185683Z"
}
]
3. Delete Data by ID
To delete a specific data entry:
- Endpoint: DELETE /data/{id}
- Authentication: API KEY required
- Parameters: Replace {id} with the data's unique identifier
- Response: The deleted data object
Example Request:
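A sketch of the request for the data entry shown in the response below; the base URL and the X-API-Key header name are placeholders:

# Illustrative only: <base-url> and the X-API-Key header name are assumptions.
curl -X DELETE "https://<base-url>/data/data_123456" \
  -H "X-API-Key: your_api_key_here"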
Example Response:
{
"id": "data_123456",
"jobid": "job_789012",
"data": {
"product_name": "Example Product",
"price": "$19.99",
"description": "This is an example product description."
},
"created_at": "2024-06-28T13:44:06.185683Z"
}
4. WebHook for Data Updates
To receive notifications when new data is available:
- Endpoint: POST /data/user/{id}
- Authentication: API KEY required
- Parameters: Replace {id} with your user ID
with your user ID - Description: This is a webhook that will send a POST request to your specified URL when new data is available for any of your jobs.
Example Webhook Payload:
{
"id": "data_123456",
"jobid": "job_789012",
"data": {
"product_name": "New Product",
"price": "$24.99",
"description": "This is a newly scraped product description."
},
"created_at": "2024-06-30T09:30:15.987654Z"
}
Access Logs
Logs provide a detailed record of events that occur during the operation of your web scraping jobs.
1. Retrieve All Logs
To get all logs associated with your account:
- Endpoint: GET /logs
- Authentication: API KEY required
- Response: An array of log objects
Example Response:
[
{
"jobid": "job_123456",
"message": "Job started successfully",
"level": "INFO",
"created_at": "2024-06-28T13:44:06.185683Z"
},
{
"jobid": "job_123456",
"message": "Successfully scraped 50 items",
"level": "INFO",
"created_at": "2024-06-28T13:45:12.345678Z"
},
{
"jobid": "job_789012",
"message": "Failed to access URL: Connection timeout",
"level": "ERROR",
"created_at": "2024-06-29T10:15:30.123456Z"
}
]
Each log entry contains the following fields:
- jobid: The unique identifier of the job that generated this log.
- message: A descriptive message about the event that occurred.
- level: The severity level of the log. Common levels include:
  - INFO: General information about job execution.
  - WARNING: Potential issues that didn't stop the job but might require attention.
  - ERROR: Serious issues that prevented the job from completing successfully.
- created_at: The timestamp when the log entry was created.
2. Retrieve Logs by Job ID
To get logs generated by a specific job:
- Endpoint: GET /log/job/{id}
- Authentication: API KEY required
- Parameters: Replace {id} with the job's unique identifier
- Response: An array of log objects associated with the specified job
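Example Request:
A sketch of the request for the job shown in the logs above; the base URL and the X-API-Key header name are placeholders:

# Illustrative only: <base-url> and the X-API-Key header name are assumptions.
curl -X GET "https://<base-url>/log/job/job_123456" \
  -H "X-API-Key: your_api_key_here"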