OTK Prediction API

A high-performance scientific computing API for ecDNA (extrachromosomal DNA) prediction, based on the OTK and GCAP projects.

🌐 Public API Address

Production API: http://biotree.top:38123/otk/

API Base URL: http://biotree.top:38123/otk/api/v1/

✨ Features

Intelligent Resource Scheduling: Automatically selects optimal model and available resources
Model Management: Auto-discovers models from models/ directory
Data Validation: Comprehensive integrity checks during upload
Asynchronous & Synchronous Processing: Supports both async tasks and sync predictions
Real-time Statistics: Task counts, processing times, resource usage
User-friendly Web Interface: For task upload, status viewing, and management
Complete REST API: Supports curl and other HTTP clients
Multi-language Support: English and Chinese interfaces
Job Record Management: Task metadata retained permanently, results for 3 days
Security: Job IDs are masked in the web interface for privacy

🚀 Quick Start

Using the Public API

You can immediately start using the public API without any installation:

# Health check
curl http://biotree.top:38123/otk/api/v1/health

# Submit prediction (async)
curl -X POST "http://biotree.top:38123/otk/api/v1/predict" \
  -F "file=@your_data.csv"

# Submit prediction (sync)
curl -X POST "http://biotree.top:38123/otk/api/v1/predict-sync" \
  -F "file=@your_data.csv"

Running Locally

1. Install Dependencies

cd otk/otk_api
pip install -r requirements.txt

2. Start the API

cd otk/otk_api
./start_api.sh

3. Access

API: http://localhost:8000/api/v1/
Web Interface: http://localhost:8000/

📡 API Documentation

1. Health Check

Endpoint: GET /api/v1/health

Response:

{
  "status": "healthy",
  "version": "1.0.0",
  "gpu_available": false,
  "gpu_count": 0,
  "cpu_count": 192,
  "active_jobs": 0,
  "queue_size": 0
}
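For callers working from Python rather than curl, the health response can be read with the standard library alone. This is a minimal sketch, not part of the service: `fetch_health` and `summarize_health` are illustrative names, and the code assumes the endpoint returns exactly the fields shown in the sample response above.

```python
import json
import urllib.request

BASE_URL = "http://biotree.top:38123/otk/api/v1"  # public base URL from this README


def fetch_health(base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """GET /health and decode the JSON body (performs a network request)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.load(resp)


def summarize_health(payload: dict) -> str:
    """Turn a health payload into a one-line summary.

    Assumes the field names from the sample response in this README.
    """
    if payload.get("gpu_available"):
        device = f"{payload['gpu_count']} GPU(s)"
    else:
        device = "CPU only"
    return (
        f"{payload['status']} (v{payload['version']}, {device}, "
        f"{payload['active_jobs']} active, {payload['queue_size']} queued)"
    )
```

Calling `summarize_health(fetch_health())` before submitting a large batch gives a quick view of the queue depth.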

2. Submit Prediction (Async)

Endpoint: POST /api/v1/predict

Parameters:

file: CSV file with prediction data

Response:

{
  "id": "af0e5298-b326-40ca-83b5-76f54ad212e6",
  "status": "pending",
  "created_at": "2026-02-12T09:54:25.495083",
  "validation_report": {
    "is_valid": true,
    "errors": [],
    "warnings": ["Optional column missing: intersect_ratio, using default value 1.0"]
  }
}
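The curl `-F "file=@..."` upload can be reproduced in Python without third-party packages by encoding the multipart body by hand. The sketch below is one possible client, under the assumption that the server accepts a standard `multipart/form-data` request with a single `file` field (as the curl example suggests); `build_multipart` and `submit_async` are hypothetical helper names.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, content_type)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: text/csv\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def submit_async(csv_path: str, base_url: str, timeout: float = 60.0) -> dict:
    """POST the CSV to /predict and return the decoded JSON (network request)."""
    with open(csv_path, "rb") as fh:
        body, content_type = build_multipart(
            "file", os.path.basename(csv_path), fh.read()
        )
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/predict",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

The returned dictionary carries the `id` needed for the status and download endpoints, so it is worth persisting immediately.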

3. Submit Prediction (Sync)

Endpoint: POST /api/v1/predict-sync

Parameters:

file: CSV file with prediction data

Response:

Returns the result CSV file directly in the response, for immediate use in pipelines

4. Get Task Status

Endpoint: GET /api/v1/jobs/{job_id}

Response:

{
  "id": "af0e5298-b326-40ca-83b5-76f54ad212e6",
  "status": "completed",
  "progress": 1.0,
  "completed_at": "2026-02-12T09:54:26.292634"
}
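An async submission is usually followed by polling this endpoint until the job finishes. The loop below is a sketch under assumptions: only the `pending` and `completed` status values appear in this README, so `failed` is a guessed terminal status, and `wait_for_job` with its injected `fetch_status` callable is an illustrative design rather than an official client.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[str], dict],
    job_id: str,
    interval: float = 2.0,
    max_polls: int = 150,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reaches a terminal status or max_polls is exhausted.

    fetch_status(job_id) should return the decoded JSON of GET /jobs/{job_id}.
    "completed" comes from this README; "failed" is an assumed status name.
    """
    for _ in range(max_polls):
        job = fetch_status(job_id)
        if job.get("status") in ("completed", "failed"):
            return job
        sleep(interval)
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")
```

Passing the fetch function in, instead of hard-coding an HTTP call, keeps the loop easy to test and lets callers reuse whatever HTTP client they already have.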

5. Download Results

Endpoint: GET /api/v1/jobs/{job_id}/download

Response:

Returns a CSV file with the prediction results
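The status and download endpoints share the same job path, so both URLs can be derived from the job ID. A small sketch follows; `job_urls` and `download_results` are hypothetical helpers, and the download function simply streams whatever the server returns to disk.

```python
import shutil
import urllib.request
from urllib.parse import quote


def job_urls(base_url: str, job_id: str) -> dict:
    """Build the status and download URLs for one job."""
    root = f"{base_url.rstrip('/')}/jobs/{quote(job_id, safe='')}"
    return {"status": root, "download": f"{root}/download"}


def download_results(base_url: str, job_id: str, dest: str, timeout: float = 60.0) -> str:
    """Stream the result CSV to dest and return the path (network request)."""
    url = job_urls(base_url, job_id)["download"]
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```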

6. Get Statistics

Endpoint: GET /api/v1/statistics

Response:

{
  "total_jobs": 28,
  "completed_jobs": 14,
  "failed_jobs": 13,
  "avg_processing_time": 0.605,
  "cpu_jobs": 14,
  "gpu_jobs": 5
}
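The counters in this response can be combined into a few derived figures. The sketch below assumes that `total_jobs` counts every submitted job and that jobs which are neither completed nor failed are still pending or running; neither assumption is stated by the API, and `summarize_statistics` is an illustrative name.

```python
def summarize_statistics(stats: dict) -> dict:
    """Derive a success rate and an unfinished-job count from /statistics.

    Assumes total_jobs = completed + failed + not-yet-finished jobs.
    """
    finished = stats["completed_jobs"] + stats["failed_jobs"]
    return {
        # Share of finished jobs that completed; None when nothing has finished.
        "success_rate": stats["completed_jobs"] / finished if finished else None,
        "unfinished_jobs": stats["total_jobs"] - finished,
    }
```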

📊 Data Format Requirements

Minimal Required Columns

For basic prediction, your CSV file only needs these minimum columns:

Column	Description
`sample`	Sample ID
`gene_id`	Gene identifier
`segVal`	Segment value (gene total copy number)

However, for optimal prediction accuracy, we recommend including as many features as possible.
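A quick local check of the header can catch a missing column before the file is uploaded. The following sketch only inspects the first CSV row and assumes column names are matched case-sensitively, which this README does not state; `missing_required_columns` is a hypothetical helper.

```python
import csv
import io

REQUIRED_COLUMNS = ("sample", "gene_id", "segVal")  # minimal set listed above


def missing_required_columns(csv_text: str) -> list[str]:
    """Return the minimal required columns that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {name.strip() for name in header}
    return [col for col in REQUIRED_COLUMNS if col not in present]
```

An empty list means the header satisfies the minimal requirement; the server-side validation remains the authoritative check.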

Recommended Columns

Column	Description	Required by API	Auto-fill Default
`sample`	Sample ID	✅ Yes	-
`gene_id`	Gene identifier (e.g., ENSG00000284662)	✅ Yes	-
`segVal`	Gene total copy number	✅ Yes	-
`minor_cn`	Minor copy number	✅ Yes	0
`purity`	Tumor purity	✅ Yes	0.8
`ploidy`	Ploidy level	✅ Yes	2.0
`AScore`	A-score value	✅ Yes	10.0
`pLOH`	Loss of heterozygosity probability	✅ Yes	0.1
`cna_burden`	Copy number alteration burden	✅ Yes	0.2
`CN1` to `CN19`	Chromosome copy number signatures	⚠️ Recommended	0.05 each
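To preview what the model would receive for a sparse input, the auto-fill defaults from the table can be mirrored on the client. This is a sketch of the documented defaults only, not the server's actual implementation; treating empty strings as missing is an assumption, and `fill_defaults` is an illustrative name.

```python
# Defaults copied from the "Auto-fill Default" column of the table above.
DEFAULTS = {
    "minor_cn": 0,
    "purity": 0.8,
    "ploidy": 2.0,
    "AScore": 10.0,
    "pLOH": 0.1,
    "cna_burden": 0.2,
}
DEFAULTS.update({f"CN{i}": 0.05 for i in range(1, 20)})  # CN1 .. CN19


def fill_defaults(row: dict) -> dict:
    """Return a copy of row with absent or empty columns set to the defaults."""
    filled = dict(row)
    for column, default in DEFAULTS.items():
        if filled.get(column) in (None, ""):
            filled[column] = default
    return filled
```

Values that are present are left untouched, so real measurements always take precedence over the defaults.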

Optional Columns

Column	Description	Auto-fill Behavior
`type`	Cancer type (e.g., BRCA, LUAD)	Auto-converts to `type_*` columns
`age`	Sample age	Filled with mean value
`gender`	Gender (0/1 or Male/Female)	Filled with 0
`intersect_ratio`	Intersection ratio	Filled with 1.0
`y`	Ground truth label (for validation)	Not used in prediction

Auto-Generated Features

The system automatically generates these features - you do NOT need to provide them:

Feature Type	Columns	Source
Cancer Type	`type_BLCA`, `type_BRCA`, ... (24 columns)	Converted from `type` column
Gene Frequency	`freq_Linear`, `freq_BFB`, `freq_Circular`, `freq_HR`	Matched from `gene_id` using precomputed prior data

Cancer Types

The following cancer types are supported (for the `type` column):

BLCA, BRCA, CESC, COAD, DLBC, ESCA, GBM, HNSC,
KICH, KIRC, KIRP, LGG, LIHC, LUAD, LUSC, OV,
PRAD, READ, SARC, SKCM, STAD, THCA, UCEC, UVM

If an invalid cancer type is provided, all type_* columns will be set to 0.
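The conversion from `type` to the 24 `type_*` columns is a one-hot encoding, which can be sketched as follows. The code reproduces the documented behavior (an unknown type yields all zeros) but assumes exact, case-sensitive matching; `encode_cancer_type` is a hypothetical name, not the server's function.

```python
CANCER_TYPES = [
    "BLCA", "BRCA", "CESC", "COAD", "DLBC", "ESCA", "GBM", "HNSC",
    "KICH", "KIRC", "KIRP", "LGG", "LIHC", "LUAD", "LUSC", "OV",
    "PRAD", "READ", "SARC", "SKCM", "STAD", "THCA", "UCEC", "UVM",
]


def encode_cancer_type(cancer_type: str) -> dict:
    """One-hot encode a cancer type into type_* columns.

    An unsupported type produces all zeros, as described above.
    """
    return {f"type_{name}": int(name == cancer_type) for name in CANCER_TYPES}
```

For example, `encode_cancer_type("LUSC")` sets `type_LUSC` to 1 and the other 23 columns to 0.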

Example Data

Minimal input (3 columns):

sample,gene_id,segVal
TCGA-TEST-01,ENSG00000284662,3.2
TCGA-TEST-01,ENSG00000187634,2.5

Recommended input (with type column):

sample,gene_id,segVal,minor_cn,purity,ploidy,AScore,pLOH,cna_burden,age,gender,type,CN1,CN2,CN3,CN4,CN5,CN6,CN7,CN8,CN9,CN10,CN11,CN12,CN13,CN14,CN15,CN16,CN17,CN18,CN19
TCGA-TEST-01,ENSG00000284662,3.2,1.1,0.85,2.8,12.5,0.15,0.25,65,1,LUSC,0.1,0.2,0.3,0.1,0.05,0.05,0.05,0.05,0.02,0.02,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01,0.01

Alternative: Using pre-encoded type_* columns:

sample,gene_id,segVal,minor_cn,purity,ploidy,AScore,pLOH,cna_burden,type_BRCA,type_LUAD,...(other type_* columns),CN1,CN2,...
TCGA-TEST-01,ENSG00000284662,3.2,1.1,0.85,2.8,12.5,0.15,0.25,1,0,...,0.1,0.2,...

Data Validation

The API validates your data and returns a detailed report:

{
  "validation_report": {
    "is_valid": true,
    "errors": [],
    "warnings": [
      "Optional column missing: intersect_ratio, will use default value 1.0",
      "CN signature columns incomplete: found 15/19 columns",
      "Missing type column, cannot validate cancer type"
    ],
    "info": {
      "total_rows": 100,
      "unique_samples": 50,
      "unique_genes": 100
    }
  }
}
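A client can use this report to stop early on errors while still surfacing warnings. The sketch below assumes the report is nested under `validation_report` with the `is_valid`, `errors`, and `warnings` keys shown above; `check_validation_report` is an illustrative helper.

```python
def check_validation_report(response: dict) -> list[str]:
    """Raise ValueError if validation failed; otherwise return the warnings."""
    report = response.get("validation_report", {})
    errors = report.get("errors", [])
    if not report.get("is_valid", False) or errors:
        raise ValueError("validation failed: " + "; ".join(errors))
    return list(report.get("warnings", []))
```

Warnings such as a missing `intersect_ratio` column do not block the job, but they indicate that defaults were used and accuracy may be affected.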

🎯 Prediction Output

Output Format

The prediction result CSV includes:

Column	Description
`sample`	Sample ID
`gene_id`	Gene identifier
`prediction_prob`	Probability of ecDNA occurrence
`prediction`	Binary prediction (0=no, 1=yes)
`sample_level_prediction_label`	Overall sample prediction label
`sample_level_prediction`	Overall sample prediction (0/1)

Example Output

sample,gene_id,prediction_prob,prediction,sample_level_prediction_label,sample_level_prediction
TCGA-TEST-01,ENSG00000284662,0.000279,0,nofocal,0
TCGA-TEST-01,ENSG00000187634,0.002650,0,nofocal,0
TCGA-TEST-01,ENSG00000243073,0.000036,0,nofocal,0

🌐 Web Interface

The API includes a user-friendly web interface:

Access

Homepage: http://biotree.top:38123/otk/
Task Upload: http://biotree.top:38123/otk/web/upload
Task List: http://biotree.top:38123/otk/web/jobs
Statistics: http://biotree.top:38123/otk/web/stats

Language Support

Add ?lang=en for English: http://biotree.top:38123/otk/?lang=en
Add ?lang=zh for Chinese: http://biotree.top:38123/otk/?lang=zh

📁 Project Structure

otk_api/
├── api/                  # API implementation
│   ├── main.py           # FastAPI application
│   ├── predictor_wrapper.py  # Prediction job handler
│   └── routes/           # API endpoints
├── config.yml           # Configuration file
├── models/              # Model storage
│   └── baseline/         # Example model
├── uploads/              # Uploaded files
├── results/              # Prediction results
├── logs/                 # Log files
├── start_api.sh          # Startup script
└── README.md             # This documentation

⚠️ Important Notes

Job ID Security: Save your Job ID securely for async tasks. It's needed to query status and download results.
Data Retention:
  Result files: Automatically deleted after 3 days
  Job records: Kept permanently for audit purposes
File Size Limit: Maximum upload size is 100MB
Processing Time: Depends on data size and server load, typically 1-5 seconds per sample
Error Handling: If a request fails, check the errors list in the returned validation_report, correct the data format accordingly, and resubmit
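The upload limit can be checked locally before a large file is sent. This is a sketch only: it interprets "100MB" as 100 × 1024 × 1024 bytes, which is an assumption (the server may count decimal megabytes), and `check_upload_size` is an illustrative name.

```python
import os

MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # assumes "100MB" means 100 MiB


def check_upload_size(path: str, limit: int = MAX_UPLOAD_BYTES) -> int:
    """Return the file size in bytes, or raise ValueError if it exceeds limit."""
    size = os.path.getsize(path)
    if size > limit:
        raise ValueError(f"{path} is {size} bytes, above the {limit}-byte limit")
    return size
```

Files close to the limit can be split by sample and submitted as separate jobs.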

🛠️ Troubleshooting

Common Issues

File Upload Errors
  Ensure your file is a valid CSV
  Check that all required columns are present
  Verify file size is under 100MB
Prediction Failed
  Check server logs for detailed error messages
  Verify your data format matches requirements
  Try with a smaller dataset first
API Unresponsive
  Check if the server is running
  Verify network connectivity
  Try the health check endpoint

📞 Support

For questions or issues:

GitHub Issues: OTK Repository
Email: Contact the maintainers
Documentation: This README and API endpoints

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Last Updated: February 12, 2026
Version: 1.1.0
Maintainers: Wang Lab @ CSU