Custom Dataset API - Bright Data Docs

This update provides a more granular and streamlined way to request and manage your data collections, so that the datasets you generate match your specific needs.

Understanding When to Use Each API:

Initial Collection Without Customer-Defined View:

The three primary API endpoints serve distinct purposes in the data collection workflow: requesting a collection, checking the status of that request, and initiating the collection.

Requesting a Collection:

Endpoint: POST https://api.brightdata.com/datasets/request_collection

Parameters:

dataset_id (string, required): Dataset ID
type (string, required): discover_new OR url_collection
inputs (array): JSON array of inputs
file (multipart): CSV file sent as multipart

Example

curl "https://api.brightdata.com/datasets/request_collection?dataset_id=gd_l1viktl72bvl7bjuj0&type=discover_new" -H "Authorization: Bearer API_KEY" -H "Content-Type: application/json" -k -d '[{"id":"user-id"}]'

Processing may take several minutes, depending on the number of inputs. When the type is discover_new, finding all links (PDPs) may take additional time.
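The curl call above can be mirrored in Python. The following is a minimal sketch using only the standard library; the helper name build_request_collection is hypothetical, and it only builds the request (endpoint, query parameters, and headers taken from the example above) without sending it.

```python
import json
import urllib.parse
import urllib.request

# Base URL taken from the curl examples in this page.
API_BASE = "https://api.brightdata.com/datasets"


def build_request_collection(dataset_id, collection_type, inputs, api_key):
    """Build (but do not send) the POST request for /request_collection.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if collection_type not in ("discover_new", "url_collection"):
        raise ValueError("type must be 'discover_new' or 'url_collection'")
    # dataset_id and type travel in the query string, as in the curl example.
    query = urllib.parse.urlencode({"dataset_id": dataset_id, "type": collection_type})
    return urllib.request.Request(
        f"{API_BASE}/request_collection?{query}",
        data=json.dumps(inputs).encode("utf-8"),  # inputs go in the JSON body
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request_collection(
    "gd_l1viktl72bvl7bjuj0", "discover_new", [{"id": "user-id"}], "API_KEY"
)
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would perform the actual call; that step is left out here so the sketch runs without network access or a real API key.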

Checking Status of the Collection Above:

Endpoint: GET https://api.brightdata.com/datasets/request_collection

Parameters:

request_id (string, required): Obtain from the previous API.
freshness_ms (string, required): Sets data freshness. If data is within this period (e.g., requested 1 week, collected 5 days ago), no new scrape occurs. If data is not fresh, we scrape it now.

1 week: 604,800,000 ms
1 month: 2,592,000,000 ms
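The two values above follow from plain unit conversion, with 1 month taken as 30 days. A small Python sketch of that arithmetic (the helper name days_to_freshness_ms is illustrative and not part of the API):

```python
# Milliseconds in one day: 24 h * 60 min * 60 s * 1000 ms = 86,400,000
MS_PER_DAY = 24 * 60 * 60 * 1000


def days_to_freshness_ms(days):
    """Convert a freshness window in days to milliseconds."""
    return int(days * MS_PER_DAY)


print(days_to_freshness_ms(7))   # 604800000  (1 week)
print(days_to_freshness_ms(30))  # 2592000000 (1 month, counted as 30 days)
```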

Example

curl -k "https://api.brightdata.com/datasets/request_collection?request_id=REQUEST_ID&freshness_ms=2592000000" -H "Authorization: Bearer API_KEY"

Response Indicating Number of Records and Freshness Found:

{
    "dataset_id": "DATASET_ID",
    "total_lines": 100,
    "fresh_count": 30,
    "name": "linkedin_companies custom input",
    "status": "done",
    "request_id": "XXXX"
}

The request is still running:

{
    "total_lines": 100,
    "status": "running"
}

Issue with one (or more) inputs. In this case the required field url was sent in uppercase as URL, so validation failed:

{
    "request_id": "xxxx",  
    "error": "Validation failed",
    "error_code": "validation",
    "validation_errors": [
        {
            "line": "{\"URL\":\"https://www.tiktok.com/search?q=tjd\"}",
            "index": 1,
            "errors": [
                ["url", "Required field"]
            ]
        }
    ]
}
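The three response shapes above (done, running, validation error) can be told apart with a few dictionary checks. The sketch below assumes the response has already been parsed from JSON into a Python dict; the function name summarize_status and its return format are hypothetical, and only the field names come from the examples above.

```python
def summarize_status(payload):
    """Classify a parsed status response into (state, details).

    Hypothetical helper: field names follow the sample responses above.
    """
    if payload.get("error_code") == "validation":
        problems = []
        for item in payload.get("validation_errors", []):
            # Each entry in "errors" is a [field, message] pair.
            for field, message in item.get("errors", []):
                problems.append(f"input {item.get('index')}: {field}: {message}")
        return "invalid", problems
    if payload.get("status") == "done":
        details = f"total_lines={payload.get('total_lines')}, fresh_count={payload.get('fresh_count')}"
        return "done", [details]
    # Anything else (for example "running") is passed through unchanged.
    return payload.get("status", "unknown"), []


validation_response = {
    "request_id": "xxxx",
    "error": "Validation failed",
    "error_code": "validation",
    "validation_errors": [
        {
            "line": "{\"URL\":\"https://www.tiktok.com/search?q=tjd\"}",
            "index": 1,
            "errors": [["url", "Required field"]],
        }
    ],
}
print(summarize_status(validation_response))
print(summarize_status({"total_lines": 100, "status": "running"}))
```

A caller could keep polling while the state is "running", stop on "done", and surface the listed problems on "invalid".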

Initiating a Collection:

Endpoint: POST https://api.brightdata.com/datasets/initiate_collection

Parameters:

request_id (string, required): The unique identifier of the collection request you want to initiate.
freshness_ms (string, required): The time in milliseconds indicating the desired data freshness.

Example

curl -X POST -k "https://api.brightdata.com/datasets/initiate_collection" -d '{"request_id":"j_ln2x567b2961de0d1x","freshness_ms":2592000000}' -H "Authorization: Bearer API_KEY" -H "content-type: application/json"
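For comparison with the curl call above, here is a minimal Python sketch that builds the same request with the standard library. Unlike the request_collection call, both parameters go in the JSON body. The helper name build_initiate_collection is hypothetical, and the request is built but not sent.

```python
import json
import urllib.request

INITIATE_URL = "https://api.brightdata.com/datasets/initiate_collection"


def build_initiate_collection(request_id, freshness_ms, api_key):
    """Build (but do not send) the POST request for /initiate_collection.

    Hypothetical helper: the name and signature are illustrative only.
    """
    body = {"request_id": request_id, "freshness_ms": freshness_ms}
    return urllib.request.Request(
        INITIATE_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same values as the curl example: a 30-day freshness window.
req = build_initiate_collection("j_ln2x567b2961de0d1x", 2592000000, "API_KEY")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```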

Collection After Defining a View:

Initiating a Collection:

Endpoint: POST https://api.brightdata.com/datasets/initiate

Parameters:

dataset_id (string, required): Dataset ID
view (string, required): View ID
type (string, required): discover_new OR url_collection
inputs (array): JSON array of inputs
file (multipart): CSV file sent as multipart

Example

curl "https://api.brightdata.com/datasets/initiate?dataset_id=XXX_DATASET_ID&type=url_collection&view=XXX_VIEW_ID" -H "Authorization: Bearer API_KEY" -H "Content-Type: application/json" -k -d '[{"id":"user-id"}]'

The dataset will be delivered according to the settings configured for this view.

With these capabilities, you can tailor your data collection process so that the generated datasets match your project requirements.

How to Retrieve Results of a Snapshot That Was Already Collected:

curl "https://api.brightdata.com/datasets/snapshots/SNAPSHOT_ID/download" -H "Authorization: Bearer API_KEY"
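The download call takes the snapshot ID as a path segment rather than a query parameter. A minimal Python sketch of building that request follows; the helper name build_snapshot_download and the sample ID s_example123 are hypothetical, and the request is built but not sent.

```python
import urllib.parse
import urllib.request

SNAPSHOTS_URL = "https://api.brightdata.com/datasets/snapshots"


def build_snapshot_download(snapshot_id, api_key):
    """Build (but do not send) the GET request that downloads a snapshot.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Percent-encode the ID so it is safe to place inside the URL path.
    safe_id = urllib.parse.quote(snapshot_id, safe="")
    return urllib.request.Request(
        f"{SNAPSHOTS_URL}/{safe_id}/download",
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


req = build_snapshot_download("s_example123", "API_KEY")  # made-up snapshot ID
print(req.get_method(), req.full_url)
```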
