Index of /db/procurement (tender.sarthaksidhant.com)

Date: 2026-06-26 | Host: tender.sarthaksidhant.com | Mirror Status: Sync Complete
Database Downloads
File Name / Mirror Link   Format     Record Count    Description
aoc_tenders.db            SQLite 3   ~4.9M records   Awards of Contract database. Contains winner metadata, contract values, dates.
tenders_vps.db            SQLite 3   ~3.9M records   Active/Archived Tenders database. Contains published notices, EMDs, fee information.
================================================================================
AN OPEN INVITATION TO DATA SCIENTISTS, RESEARCHERS, JOURNALISTS,
AND ANYONE WHO BELIEVES PUBLIC MONEY DESERVES PUBLIC SCRUTINY:

Every rupee the government spends goes through a tender. A notice is published,
bids are invited, a winner is selected, a contract is signed. This process
is, in theory, the immune system of public finance. In practice, it is where
most large-scale procurement corruption quietly occurs.

I spent two weeks building a high-throughput scraper that has systematically
crawled and archived the Central Public Procurement Portal (CPPP) of India
-- the government's own public-facing procurement database. The result is two
flat SQLite databases totalling over 8.8 million records, covering both the
initial tender notices and the final award outcomes, with full structured detail.

No paywalls. No throttled APIs. No pagination traps. No obfuscated HTML.
Just the data, as it was published.

The databases are now yours. Here is what we ask in return:

  1. Download it.
  2. Look at who keeps winning.
  3. Look at how short the bid windows are.
  4. Look at how many tenders receive exactly one bid.
  5. Run the numbers. Write the queries. Publish the findings.

Corruption is not invisible. It is just tedious to find. This dataset removes
the tedium. The analysis, and the courage to share it, is still on us.

COORDINATION GROUP (Discuss, share queries, and post findings):
Discord: https://discord.gg/7Zsgyg86Mq
================================================================================
Scope of Data
SOURCE
  Portal   : Central Public Procurement Portal (CPPP) -- eprocure.gov.in
  Coverage : National and state-level procurement across all sectors
             (Works, Goods, Services, Consultancies)
  Portal Types Covered:
    - Central Government Ministries & Departments
    - State Government Portals (via CPPP aggregation)
    - Defence Procurement (defproc.gov.in mirror entries)
    - State-specific portals (e.g., wbtenders.gov.in, etenders.kerala.gov.in)

CORPUS SIZE
  aoc_tenders.db  :  4,921,960  award of contract listing records
  aoc_details     :  4,540,739  fully parsed detail pages (JSON)
  tenders_vps.db  :  3,952,191  published tender notice records
  tender_details  :  3,178,485  fully parsed tender detail pages (JSON)
  -------------------------
  Total           :  16,593,375 structured records across both databases

DATA COLLECTION METHOD
  - Concurrent HTTP crawlers with session rotation and backoff
  - HTML table extraction into normalized JSON key-value payloads
  - MD5-hashed internal IDs derived from source URLs (stable keys for deduplication)
  - Partition sharding for parallel ingestion
  - All timestamps preserved from source in IST
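
The ID and partition steps above can be sketched roughly as follows. This is an illustration, not the crawler's actual code: the function names, the example URL, the UTF-8 encoding with no URL normalisation, and the partition count of 16 are all assumptions.

```python
import hashlib

NUM_PARTITIONS = 16  # assumed; the real shard count is not documented


def make_internal_id(detail_url: str) -> str:
    """Return the hex MD5 digest of a detail-page URL (assumed UTF-8, no normalisation)."""
    return hashlib.md5(detail_url.encode("utf-8")).hexdigest()


def partition_for(internal_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a hex ID to a shard index: integer value of the hash modulo the shard count."""
    return int(internal_id, 16) % num_partitions


# Hypothetical URL, for illustration only.
url = "https://example.invalid/aoc/detail?id=123"
iid = make_internal_id(url)
print(iid, partition_for(iid))
```

Because the ID depends only on the URL, re-crawling the same page yields the same key, so a repeat insert can be rejected or overwritten instead of duplicated.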

KEY AUDITABLE FIELDS & RESEARCH ANGLES
  Bid Competition Analysis:
    * Number of bids received per award -- identify single-bid contracts at scale
    * Bid submission windows (e_published_date vs bid_submission_closing_date)
      -- artificially short windows are a known indicator of pre-selected winners
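
A minimal sketch of the bid-window check, run against a tiny in-memory table with the same column names as `tenders`. The date format, the 7-day threshold, and the sample rows are assumptions; inspect a few real rows and adjust `DATE_FMT` before trusting any result.

```python
import sqlite3
from datetime import datetime
from typing import Optional

DATE_FMT = "%d-%b-%Y %I:%M %p"  # assumed format, e.g. "01-Jan-2024 10:00 AM"
THRESHOLD_DAYS = 7              # assumed cut-off for a "short" window


def window_days(published: Optional[str], closing: Optional[str]) -> Optional[float]:
    """Days between publication and bid closing, or None if either date fails to parse."""
    try:
        start = datetime.strptime(published, DATE_FMT)
        end = datetime.strptime(closing, DATE_FMT)
    except (TypeError, ValueError):
        return None
    return (end - start).total_seconds() / 86400


# Sample data only; in real use, connect to tenders_vps.db instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tenders (internal_id TEXT PRIMARY KEY, organisation_name TEXT, "
    "e_published_date TEXT, bid_submission_closing_date TEXT)"
)
conn.executemany("INSERT INTO tenders VALUES (?, ?, ?, ?)", [
    ("a", "Dept A", "01-Jan-2024 10:00 AM", "03-Jan-2024 10:00 AM"),
    ("b", "Dept A", "01-Jan-2024 10:00 AM", "22-Jan-2024 10:00 AM"),
    ("c", "Dept B", "not a date", "05-Jan-2024 10:00 AM"),
])

short = []
query = ("SELECT internal_id, organisation_name, e_published_date, "
         "bid_submission_closing_date FROM tenders ORDER BY internal_id")
for iid, org, pub, close in conn.execute(query):
    days = window_days(pub, close)
    if days is not None and days < THRESHOLD_DAYS:
        short.append((iid, org, days))
print(short)
```

Rows whose dates do not parse are skipped rather than guessed at, so the count of skipped rows is worth reporting alongside any finding.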

  Vendor Concentration:
    * Name of the selected bidder(s) -- aggregate winner frequency by vendor name
    * Address of the selected bidder(s) -- cluster by geography or address overlap
    * Cross-reference multiple contracts won by same entity across departments
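
Winner frequency is only meaningful once spelling variants of the same vendor are merged. The sketch below shows one crude way to do that; the normalisation rules and the vendor names are made up for illustration, and real data will likely need fuzzier matching.

```python
import json
import re
from collections import Counter

BIDDER_KEY = "Name of the selected bidder(s)"


def normalise_vendor(name: str) -> str:
    """Crude normalisation: uppercase, replace punctuation with spaces, collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]+", " ", name.upper())
    return " ".join(cleaned.split())


# Stand-ins for aoc_details.details_json values (hypothetical vendors).
rows = [
    '{"Name of the selected bidder(s)": "M/s. Acme Traders"}',
    '{"Name of the selected bidder(s)": "M/S ACME TRADERS "}',
    '{"Name of the selected bidder(s)": "Beta Infra Pvt. Ltd."}',
    '{"Tender Type": "Open"}',
]

wins = Counter()
for raw in rows:
    name = json.loads(raw).get(BIDDER_KEY)
    if name:  # skip records with no winner field
        wins[normalise_vendor(name)] += 1
print(wins.most_common(2))
```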

  Financial Anomaly Detection:
    * Contract Value vs. EMD ratio -- unusually high EMDs can deter legitimate bidders
    * Tender Fee structures -- high document fees as gatekeeping mechanism
    * Contract Value outliers per category and per organisation
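
Ratio and outlier checks depend on turning currency strings into numbers first. "Contract Value" lives in aoc_details.details_json and "EMD" in tender_details.details_json, both as strings. The parser below is a sketch that assumes values look like "Rs. 12,34,567.00" or a bare number; the helper names are hypothetical and the real formatting should be checked on a sample.

```python
import re
from typing import Optional


def parse_amount(text: Optional[str]) -> Optional[float]:
    """Extract the first numeric amount from a currency string; None if there is none."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group(0).replace(",", ""))


def emd_ratio(emd: Optional[str], contract_value: Optional[str]) -> Optional[float]:
    """EMD as a fraction of contract value; None if either is missing or the value is zero."""
    e, v = parse_amount(emd), parse_amount(contract_value)
    if e is None or v is None or v == 0:
        return None
    return e / v


print(parse_amount("Rs. 12,34,567.00"))
print(emd_ratio("25,000", "12,50,000.00"))
print(emd_ratio("NA", "12,50,000.00"))
```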

  Timeline & Process Integrity:
    * Corrigendum frequency -- repeated amendments can signal process manipulation
    * Bid opening vs. closing gap -- very short gaps reduce competitive legitimacy
    * Document download window vs. submission window asymmetries

  Sector & Department Mapping:
    * Organisation Type (Central / State / Defence / PSU)
    * Product Category & Sub-category breakdowns for sector-level analysis
    * Department-level award concentration over time
    
Database Schemas

1. aoc_tenders.db Schema Overview

-- TABLE: aoc_tenders (~4,921,960 rows)
-- Preliminary listing metadata for award notifications.
CREATE TABLE aoc_tenders (
    internal_id   TEXT PRIMARY KEY,  -- MD5 hash of detail_url (used as unique key)
    portal_type   TEXT,              -- Source portal classifier
    year          INTEGER,           -- Calendar year of listing
    sl_no         TEXT,              -- List serial number
    aoc_date      TEXT,              -- Contract award date timestamp
    closing_date  TEXT,              -- Original bid closing date
    title         TEXT,              -- Subject line/title of the award
    ref_no        TEXT,              -- Tender reference number
    tender_id     TEXT,              -- Public tender ID key
    org_name      TEXT,              -- Purchasing department or state agency
    detail_url    TEXT,              -- CPPP details page source URL
    partition_id  INTEGER            -- Hash partition index
);

-- TABLE: aoc_details (~4,540,739 rows)
-- Deep crawled values corresponding to award details.
CREATE TABLE aoc_details (
    internal_id   TEXT PRIMARY KEY,  -- FK mapping to aoc_tenders.internal_id
    tender_id     TEXT,              -- Public tender ID key
    scraped_at    TEXT,              -- Timestamp of crawler execution
    details_json  TEXT               -- JSON representation of raw HTML table key-values
);

-- JSON Schema keys inside aoc_details.details_json:
{
  "Tender Type": string,
  "Contract Date": string,
  "Contract Value": string (currency value),
  "Published Date": string,
  "Tender Document": string (URL),
  "Tender Ref. No.": string,
  "Organisation Name": string,
  "Tender Description": string,
  "Number of bids received": string,
  "Name of the selected bidder(s)": string,
  "Address of the selected bidder(s)": string,
  "Date of Completion/Completion Period in Days": string
}
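
The two tables above join on internal_id, and the row counts show that some listings have no parsed detail page. The sketch below uses a LEFT JOIN so those listings stay visible, then counts single-bid awards per organisation. It runs on sample rows in an in-memory database; the data and the handling of non-numeric bid counts are assumptions.

```python
import json
import sqlite3
from collections import defaultdict

# Sample data only; in real use, connect to aoc_tenders.db instead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE aoc_tenders (internal_id TEXT PRIMARY KEY, org_name TEXT);
CREATE TABLE aoc_details (internal_id TEXT PRIMARY KEY, details_json TEXT);
""")
conn.executemany("INSERT INTO aoc_tenders VALUES (?, ?)",
                 [("a", "Dept A"), ("b", "Dept A"), ("c", "Dept B"), ("d", "Dept B")])
conn.executemany("INSERT INTO aoc_details VALUES (?, ?)", [
    ("a", json.dumps({"Number of bids received": "1"})),
    ("b", json.dumps({"Number of bids received": "4"})),
    ("c", json.dumps({"Number of bids received": "NA"})),
])

stats = defaultdict(lambda: {"listed": 0, "with_count": 0, "single_bid": 0})
query = """
SELECT t.org_name, d.details_json
FROM aoc_tenders AS t
LEFT JOIN aoc_details AS d ON d.internal_id = t.internal_id
"""
for org, raw in conn.execute(query):
    s = stats[org]
    s["listed"] += 1
    if raw is None:
        continue  # listing without a parsed detail page
    bids = str(json.loads(raw).get("Number of bids received", "")).strip()
    if bids.isdigit():  # ignore blanks and non-numeric values
        s["with_count"] += 1
        if int(bids) == 1:
            s["single_bid"] += 1
print(dict(stats))
```

Reporting single_bid against with_count rather than listed avoids understating the share for organisations with many unparsed or non-numeric records.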
    

2. tenders_vps.db Schema Overview

-- TABLE: tenders (~3,952,191 rows)
-- Listing metadata for active/archived tenders.
CREATE TABLE tenders (
    internal_id                  TEXT PRIMARY KEY,  -- Base64 decoded internal identifier
    tender_id                    TEXT,              -- Public tender ID key
    detail_url                   TEXT,              -- URL link to detail view
    status                       TEXT,              -- Scraper classification ('active' / 'archived')
    organisation_name            TEXT,              -- Organisation or state department name
    title                        TEXT,              -- Tender title or description
    reference_number             TEXT,              -- Department tender reference code
    portal_type                  TEXT,              -- Source category (org / state)
    serial_number                TEXT,              -- List serial number
    e_published_date             TEXT,              -- Published date timestamp
    bid_submission_closing_date  TEXT,              -- Closing date timestamp
    tender_opening_date          TEXT,              -- Bid opening date timestamp
    corrigendum_url              TEXT,              -- Link to corrigendum updates page (if any)
    scraped_at                   TEXT,              -- Crawl execution timestamp
    partition_id                 INTEGER            -- Hash partition index
);

-- TABLE: tender_details (~3,178,485 rows)
-- Deep metadata and full parsed html values.
CREATE TABLE tender_details (
    internal_id   TEXT PRIMARY KEY,  -- FK mapping to tenders.internal_id
    tender_id     TEXT,              -- Public tender ID key
    details_json  TEXT,              -- JSON representation of raw HTML details table
    scraped_at    TEXT               -- Timestamp of deep crawl
);

-- JSON Schema keys inside tender_details.details_json:
{
  "Tender Reference Number": string,
  "Tender Title": string,
  "Organisation Name": string,
  "Organisation Type": string,
  "Tender Category": string,
  "Tender Type": string,
  "Product Category": string,
  "Product Sub-Category": string,
  "ePublished Date": string,
  "Bid Opening Date": string,
  "Bid Submission Start Date": string,
  "Bid Submission End Date": string,
  "Document Download Start Date": string,
  "Document Download End Date": string,
  "EMD": string (Earnest Money Deposit),
  "Tender Fee": string,
  "Location": string,
  "Address": string,
  "Name": string (Contact officer name),
  "Work Description": string,
  "Tender Document": string (URL)
}
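
Both databases carry a tender_id column, so a notice can in principle be followed through to its award. The sketch below uses SQLite's ATTACH DATABASE to query both files from one connection. It assumes tender_id values match across the two databases, which should be verified on real data; the IDs and rows here are made up.

```python
import sqlite3

# Sample data only. In real use, connect to "tenders_vps.db" and run
# ATTACH DATABASE 'aoc_tenders.db' AS aoc instead of using in-memory databases.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS aoc")
conn.executescript("""
CREATE TABLE main.tenders (internal_id TEXT PRIMARY KEY, tender_id TEXT, title TEXT);
CREATE TABLE aoc.aoc_tenders (internal_id TEXT PRIMARY KEY, tender_id TEXT, aoc_date TEXT);
INSERT INTO main.tenders VALUES
    ('t1', '2024_X_1', 'Road repair'),
    ('t2', '2024_X_2', 'Laptops');
INSERT INTO aoc.aoc_tenders VALUES ('a1', '2024_X_1', '15-Feb-2024');
""")

# LEFT JOIN keeps notices that have no matching award record.
rows = conn.execute("""
SELECT n.tender_id, n.title, a.aoc_date
FROM main.tenders AS n
LEFT JOIN aoc.aoc_tenders AS a ON a.tender_id = n.tender_id
ORDER BY n.tender_id
""").fetchall()
print(rows)
```

On the full files, an index on tender_id in each table will likely be needed before a join of this size finishes in reasonable time.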