Contribution guidelines: Data
Thank you very much for considering contributing to the data collection of NetworkCommons! In order to make the resource as user-friendly as possible, we aim to be as transparent as possible, which means that all contributions should contain at least the following elements. For other examples, see the Datasets details.
1. Data information
Experimental design: number of samples, number of experiments (if applicable), confounding factors.
Data production and processing: tools used, how the data processing was performed (if applicable).
Files: number and type of files, with a short description of their contents.
Link to the database from which the data was retrieved.
Link to the dataset publication.
Path information explaining the structure of the data directories.
This information should be appended to the existing YAML file in networkcommons/data/datasets.yaml.
An example of this can be found below:
NCI60:
  name: NCI60
  description: NCI-60 cell line data
  publication_link: https://doi.org/10.1038/nrc1951
  detailed_description: >-
    This dataset contains data from the NCI-60 cell line panel.
    It includes three files: TF activities from transcriptomics data,
    metabolite abundances and gene reads.
  path: NCI60/{cell_line}/{cell_line}__{data_type}.tsv
This information can then be accessed via nc.data.omics.datasets():
[2]:
import networkcommons as nc
[3]:
nc.data.omics.datasets()
[3]:
| | name | description | publication_link | detailed_description |
|---|---|---|---|---|
| decryptm | DecryptM | Drug perturbation proteomics and phosphoproteomics data | https://doi.org/10.1126/science.ade3925 | This dataset contains the profiling of 31 cancer drugs in 13 human cancer cell line models resulted in 1.8 million dose-response curves, including 47,502 regulated phosphopeptides, 7316 ubiquitinylated peptides, and 546 regulated acetylated peptides. |
| panacea | Panacea | Pancancer Analysis of Chemical Entity Activity RNA-Seq data | https://doi.org/10.1016/j.xcrm.2021.100492 | PANACEA contains dose-response and perturbational profiles for 32 kinase inhibitors in 11 cancer cell lines, in addition to a DMSO control. Originally, this resource served as the basis for a DREAM Challenge assessing the accuracy and sensitivity of computational algorithms for de novo drug polypharmacology predictions. |
| CPTAC | CPTAC | Clinical Proteomic Tumor Analysis Consortium data | https://doi.org/10.1158/2159-8290.CD-13-0219 | This dataset contains data from the Clinical Proteomic Tumor Analysis Consortium. It includes various cancer types and proteomic data. |
| NCI60 | NCI60 | NCI-60 cell line data | https://doi.org/10.1038/nrc1951 | This dataset contains data from the NCI-60 cell line panel. It includes three files: TF activities from transcriptomics data, metabolite abundances and gene reads. |
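Before appending a new entry to datasets.yaml, it may help to check it locally. The sketch below is only an illustration and not part of NetworkCommons: the helper names `check_entry` and `path_placeholders` are hypothetical, the list of required keys is an assumption taken from the NCI60 example entry, and filling the path template with `str.format` is an assumption about how the placeholders are meant to be read.

```python
import string

# Assumption: the keys of the NCI60 example entry are the expected ones.
REQUIRED_KEYS = (
    "name",
    "description",
    "publication_link",
    "detailed_description",
    "path",
)


def check_entry(entry):
    """Return the required keys that are missing from a dataset entry."""
    return [key for key in REQUIRED_KEYS if key not in entry]


def path_placeholders(template):
    """Return the placeholder names of a path template, without duplicates."""
    names = []
    for _, field, _, _ in string.Formatter().parse(template):
        if field and field not in names:
            names.append(field)
    return names


# The NCI60 entry from the YAML example, written as a plain dict.
entry = {
    "name": "NCI60",
    "description": "NCI-60 cell line data",
    "publication_link": "https://doi.org/10.1038/nrc1951",
    "detailed_description": (
        "This dataset contains data from the NCI-60 cell line panel. "
        "It includes three files: TF activities from transcriptomics data, "
        "metabolite abundances and gene reads."
    ),
    "path": "NCI60/{cell_line}/{cell_line}__{data_type}.tsv",
}

missing = check_entry(entry)                    # no key is missing here
placeholders = path_placeholders(entry["path"])  # names a retrieval function would need
resolved = entry["path"].format(cell_line="A498", data_type="RNA")
print(missing, placeholders, resolved)
```

The placeholder names are a useful hint for the next section: they are the arguments that a retrieval function for the dataset would most likely need to accept.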
2. API
The data can either be deposited on the NetworkCommons server or accessed directly from the original source. In either case, the following functions are required:
A function providing an overview of the subsets (if applicable). For example, check nc.data.omics.decryptm_experiments().
If the data contains different files (for example, different omics layers, metadata tables, etc.), a function that retrieves this information. For example, check nc.data.omics.nci60_datatypes().
A function that retrieves the data. For example, check nc.data.omics.nci60_table(). Ideally, it returns a pd.DataFrame, but we are planning to expand support for AnnData instances.
These new functions can be implemented in a new file, _{dataset}, inside the networkcommons/data/omics/ folder.
For example, nc.data.omics.nci60_table() retrieves a single pd.DataFrame by providing a data type and a cell line.
[20]:
nc.data.omics.nci60_table(cell_line='A498', data_type='RNA').head()
[20]:
| | ID | score |
|---|---|---|
| 0 | WASH7P | -2.109966 |
| 1 | NOC2L | -1.480194 |
| 2 | HES4 | -0.781522 |
| 3 | ISG15 | 0.406806 |
| 4 | AGRN | -0.324970 |
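To make the three required functions more concrete, here is a minimal sketch of what a new `_{dataset}` module could look like. Everything in it is hypothetical: the dataset name `mydataset`, the function names, the `base_dir` argument and the directory layout (copied from the NCI60 path template) are assumptions for illustration, and the existing NetworkCommons functions may be implemented quite differently, for instance by downloading the files first. The toy values written in the usage part are not real data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Assumed layout, mirroring the NCI60 path template:
#   {base_dir}/{sample}/{sample}__{data_type}.tsv
PATH_TEMPLATE = "{sample}/{sample}__{data_type}.tsv"


def mydataset_experiments(base_dir):
    """Overview of the subsets: one row per sample directory."""
    samples = sorted(p.name for p in Path(base_dir).iterdir() if p.is_dir())
    return pd.DataFrame({"sample": samples})


def mydataset_datatypes(base_dir, sample):
    """List the data types (one file each) available for a sample."""
    prefix = f"{sample}__"
    files = sorted(Path(base_dir, sample).glob(f"{prefix}*.tsv"))
    return [f.stem[len(prefix):] for f in files]


def mydataset_table(base_dir, sample, data_type):
    """Load a single file as a pd.DataFrame."""
    relative = PATH_TEMPLATE.format(sample=sample, data_type=data_type)
    return pd.read_csv(Path(base_dir) / relative, sep="\t")


# Usage on a throwaway directory holding one sample with two toy files.
with tempfile.TemporaryDirectory() as tmp:
    sample_dir = Path(tmp, "A498")
    sample_dir.mkdir()
    (sample_dir / "A498__RNA.tsv").write_text("ID\tscore\nWASH7P\t-2.1\n")
    (sample_dir / "A498__TF.tsv").write_text("ID\tscore\nMYC\t0.5\n")

    experiments = mydataset_experiments(tmp)
    datatypes = mydataset_datatypes(tmp, "A498")
    rna = mydataset_table(tmp, "A498", "RNA")

print(experiments)
print(datatypes)
print(rna)
```

Keeping the overview, the data type listing and the table retrieval as separate functions follows the split described above, so that users can explore what is available before loading any table.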