Contribution guidelines: Data

Thank you for considering a contribution to the NetworkCommons data collection! To keep the resource user-friendly and transparent, every contribution should include at least the elements described below. For further examples, see the Datasets details.

1. Data information

  • Experimental design: number of samples, number of experiments (if applicable), and confounding factors.

  • Data production and processing: tools used, how the data processing was performed (if applicable).

  • Files: number and type of files, with a small description of their contents.

  • Link to the database from which the data was retrieved.

  • Link to the dataset publication.

  • Path information explaining the structure of the data directories.

This information should be appended to the existing YAML file in networkcommons/data/datasets.yaml.

An example of this can be found below:

NCI60:
    name: NCI60
    description: NCI-60 cell line data
    publication_link: https://doi.org/10.1038/nrc1951
    detailed_description: >-
        This dataset contains data from the NCI-60 cell line panel.
        It includes three files: TF activities from transcriptomics data,
        metabolite abundances and gene reads.
    path: NCI60/{cell_line}/{cell_line}__{data_type}.tsv

This information can then be accessed via nc.data.omics.datasets():

[2]:
import networkcommons as nc
[3]:
nc.data.omics.datasets()
[3]:
         | name     | description | publication_link | detailed_description
decryptm | DecryptM | Drug perturbation proteomics and phosphoproteomics data | https://doi.org/10.1126/science.ade3925 | This dataset contains the profiling of 31 cancer drugs in 13 human cancer cell line models resulted in 1.8 million dose-response curves, including 47,502 regulated phosphopeptides, 7316 ubiquitinylated peptides, and 546 regulated acetylated peptides.
panacea  | Panacea  | Pancancer Analysis of Chemical Entity Activity RNA-Seq data | https://doi.org/10.1016/j.xcrm.2021.100492 | PANACEA contains dose-response and perturbational profiles for 32 kinase inhibitors in 11 cancer cell lines, in addition to a DMSO control. Originally, this resource served as the basis for a DREAM Challenge assessing the accuracy and sensitivity of computational algorithms for de novo drug polypharmacology predictions.
CPTAC    | CPTAC    | Clinical Proteomic Tumor Analysis Consortium data | https://doi.org/10.1158/2159-8290.CD-13-0219 | This dataset contains data from the Clinical Proteomic Tumor Analysis Consortium. It includes various cancer types and proteomic data.
NCI60    | NCI60    | NCI-60 cell line data | https://doi.org/10.1038/nrc1951 | This dataset contains data from the NCI-60 cell line panel. It includes three files: TF activities from transcriptomics data, metabolite abundances and gene reads.
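
As a quick sanity check before adding an entry to datasets.yaml, the sketch below verifies that an entry carries the fields shown in the NCI60 example and illustrates how its path template could be filled in. It assumes the placeholders follow Python's str.format syntax. The helper names (missing_keys, path_placeholders) are hypothetical and not part of NetworkCommons, and the library itself may resolve paths differently.

```python
from string import Formatter

# Fields present in the NCI60 example entry (assumed here to be the expected set).
EXPECTED_KEYS = (
    "name",
    "description",
    "publication_link",
    "detailed_description",
    "path",
)

# The NCI60 entry from datasets.yaml, written as a plain dict for illustration.
entry = {
    "name": "NCI60",
    "description": "NCI-60 cell line data",
    "publication_link": "https://doi.org/10.1038/nrc1951",
    "detailed_description": (
        "This dataset contains data from the NCI-60 cell line panel. "
        "It includes three files: TF activities from transcriptomics data, "
        "metabolite abundances and gene reads."
    ),
    "path": "NCI60/{cell_line}/{cell_line}__{data_type}.tsv",
}


def missing_keys(entry):
    """Hypothetical helper: list the expected fields absent from an entry."""
    return [key for key in EXPECTED_KEYS if key not in entry]


def path_placeholders(template):
    """Hypothetical helper: unique placeholder names, in order of appearance."""
    names = []
    for _, field, _, _ in Formatter().parse(template):
        if field and field not in names:
            names.append(field)
    return names


placeholders = path_placeholders(entry["path"])
# Fill the template with the values used in the nci60_table() example.
resolved = entry["path"].format(cell_line="A498", data_type="RNA")

print(missing_keys(entry))
print(placeholders)
print(resolved)
```

The placeholder names are the arguments a retrieval function would need to accept, which is why it helps to keep them short and descriptive.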

2. API

The data will either be deposited on the NetworkCommons server or accessed directly from the original source. In either case, the following functions are required:

  • A function providing an overview of the subsets (if applicable). For example, check nc.data.omics.decryptm_experiments().

  • If the data comprises several files (for example, different omics layers or metadata tables), a function that lists them. For example, check nc.data.omics.nci60_datatypes().

  • A function that retrieves the data. For example, check nc.data.omics.nci60_table(). Ideally, it should return a pd.DataFrame, but we are planning to expand support for AnnData instances.

These new functions can be implemented in a new file, _{dataset}, inside the networkcommons/data/omics/ folder.
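
A minimal sketch of what such a file might contain is given below for a hypothetical dataset called mydata. Everything in it is an assumption made for illustration: the function names, the root argument, and the local directory layout (borrowed from the NCI60 path template) are not part of NetworkCommons, and the real functions download from the server rather than read a local folder. The retrieval step is reduced to resolving the file path; an actual implementation would then load that file, for instance with pd.read_csv(path, sep='\t').

```python
import tempfile
from pathlib import Path

# Assumed layout, mirroring the NCI60 template:
#   <root>/<experiment>/<experiment>__<data_type>.tsv


def mydata_experiments(root):
    """Overview of the subsets: one per subdirectory of `root`."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_dir())


def mydata_datatypes(root, experiment):
    """Data types available for one experiment, parsed from the file names."""
    prefix = f"{experiment}__"
    folder = Path(root) / experiment
    return sorted(
        f.stem[len(prefix):]
        for f in folder.iterdir()
        if f.name.startswith(prefix) and f.suffix == ".tsv"
    )


def mydata_path(root, experiment, data_type):
    """Resolve the file for one table; raise if the data type is unknown."""
    available = mydata_datatypes(root, experiment)
    if data_type not in available:
        raise ValueError(
            f"Unknown data type {data_type!r}; available: {available}"
        )
    return Path(root) / experiment / f"{experiment}__{data_type}.tsv"


# Usage on a throwaway directory filled with placeholder files.
with tempfile.TemporaryDirectory() as root:
    for exp, dtype in [("A498", "RNA"), ("A498", "TF_scores"), ("786-0", "RNA")]:
        folder = Path(root) / exp
        folder.mkdir(exist_ok=True)
        # Placeholder content only, not real measurements.
        (folder / f"{exp}__{dtype}.tsv").write_text("ID\tscore\n")

    experiments = mydata_experiments(root)
    datatypes = mydata_datatypes(root, "A498")
    table_path = mydata_path(root, "A498", "RNA")

print(experiments)
print(datatypes)
print(table_path.name)
```

Validating the requested data type before building the path gives users a readable error that lists the valid options, instead of a missing-file failure further down.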

For example, nc.data.omics.nci60_table() retrieves a single pd.DataFrame by providing a data type and a cell line.

[20]:
nc.data.omics.nci60_table(cell_line='A498', data_type='RNA').head()
[20]:
       ID     score
0  WASH7P -2.109966
1   NOC2L -1.480194
2    HES4 -0.781522
3   ISG15  0.406806
4    AGRN -0.324970