Contribution guidelines: Data

Thank you very much for considering contributing to the data collection of NetworkCommons! To keep the resource user-friendly and transparent, every contribution should contain at least the following elements. For examples, see the Datasets details.

1. Data information

Experimental design: number of samples, number of experiments (if applicable), confounding factors.
Data production and processing: tools used, how the data processing was performed (if applicable).
Files: number and type of files, with a short description of their contents.
Link to the database from which the data was retrieved.
Link to the dataset publication.
Path information explaining the structure of the data directories.

This information should be appended to the existing YAML file in networkcommons/data/datasets.yaml.

An example of this can be found below:

NCI60:
    name: NCI60
    description: NCI-60 cell line data
    publication_link: https://doi.org/10.1038/nrc1951
    detailed_description: >-
        This dataset contains data from the NCI-60 cell line panel.
        It includes three files: TF activities from transcriptomics data,
        metabolite abundances and gene reads.
    path: NCI60/{cell_line}/{cell_line}__{data_type}.tsv
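The path field is a template with placeholders in curly braces. As a minimal sketch of how such a template can be read and filled in, the snippet below uses only Python string formatting. The helper names template_fields and resolve_path are hypothetical and are not part of NetworkCommons; the template string is copied from the NCI60 entry above.

```python
import string

# Path template copied from the NCI60 entry above.
PATH_TEMPLATE = "NCI60/{cell_line}/{cell_line}__{data_type}.tsv"


def template_fields(template: str) -> list:
    """List the distinct placeholder names in a path template (hypothetical helper)."""
    names = [field for _, field, _, _ in string.Formatter().parse(template) if field]
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(names))


def resolve_path(template: str, **fields: str) -> str:
    """Fill the placeholders of a path template (hypothetical helper)."""
    return template.format(**fields)


print(template_fields(PATH_TEMPLATE))  # ['cell_line', 'data_type']

relative_path = resolve_path(PATH_TEMPLATE, cell_line="A498", data_type="RNA")
print(relative_path)  # NCI60/A498/A498__RNA.tsv
```

Listing the placeholders first makes it easy to see which arguments a retrieval function for the dataset would need to accept.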

The dataset entries in the YAML file can then be listed with nc.data.omics.datasets().

[2]:

import networkcommons as nc

[3]:

nc.data.omics.datasets()

[3]:

	name	description	publication_link	detailed_description
decryptm	DecryptM	Drug perturbation proteomics and phosphoproteomics data	https://doi.org/10.1126/science.ade3925	This dataset contains the profiling of 31 cancer drugs in 13 human cancer cell line models resulted in 1.8 million dose-response curves, including 47,502 regulated phosphopeptides, 7316 ubiquitinylated peptides, and 546 regulated acetylated peptides.
panacea	Panacea	Pancancer Analysis of Chemical Entity Activity RNA-Seq data	https://doi.org/10.1016/j.xcrm.2021.100492	PANACEA contains dose-response and perturbational profiles for 32 kinase inhibitors in 11 cancer cell lines, in addition to a DMSO control. Originally, this resource served as the basis for a DREAM Challenge assessing the accuracy and sensitivity of computational algorithms for de novo drug polypharmacology predictions.
CPTAC	CPTAC	Clinical Proteomic Tumor Analysis Consortium data	https://doi.org/10.1158/2159-8290.CD-13-0219	This dataset contains data from the Clinical Proteomic Tumor Analysis Consortium. It includes various cancer types and proteomic data.
NCI60	NCI60	NCI-60 cell line data	https://doi.org/10.1038/nrc1951	This dataset contains data from the NCI-60 cell line panel. It includes three files: TF activities from transcriptomics data, metabolite abundances and gene reads.
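Before opening a pull request, it may help to check that a new entry carries the same keys as the NCI60 example. The sketch below does this on a plain Python dict. Treating all five keys as mandatory is an assumption made here for illustration, and missing_keys is a hypothetical helper, not a NetworkCommons function.

```python
# Keys used by the NCI60 example entry. Treating all of them as mandatory
# is an assumption of this sketch, not a documented rule.
EXPECTED_KEYS = (
    "name",
    "description",
    "publication_link",
    "detailed_description",
    "path",
)


def missing_keys(entry: dict) -> list:
    """Return the expected keys that are absent or empty in a dataset entry."""
    return [key for key in EXPECTED_KEYS if not entry.get(key)]


# The NCI60 entry from the YAML example, written as a Python dict.
nci60_entry = {
    "name": "NCI60",
    "description": "NCI-60 cell line data",
    "publication_link": "https://doi.org/10.1038/nrc1951",
    "detailed_description": (
        "This dataset contains data from the NCI-60 cell line panel. "
        "It includes three files: TF activities from transcriptomics data, "
        "metabolite abundances and gene reads."
    ),
    "path": "NCI60/{cell_line}/{cell_line}__{data_type}.tsv",
}

# A deliberately incomplete placeholder entry.
incomplete_entry = {"name": "MyDataset", "description": "Placeholder entry"}

print(missing_keys(nci60_entry))       # []
print(missing_keys(incomplete_entry))  # ['publication_link', 'detailed_description', 'path']
```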

2. API

The data will either be deposited on the NetworkCommons server or be accessed directly from the original source. In either case, the following functions are required:

A function providing an overview of the subsets (if applicable). For example, see nc.data.omics.decryptm_experiments().
If the data is split across different files (for example, different omics layers, metadata tables, etc.), a function that lists them. For example, see nc.data.omics.nci60_datatypes().
A function that retrieves the data. For example, see nc.data.omics.nci60_table(). Ideally, it returns a pd.DataFrame, but we are planning to expand support for AnnData instances.

These new functions can be implemented in a new file, _{dataset}, inside the networkcommons/data/omics/ folder.

For example, nc.data.omics.nci60_table() returns a single pd.DataFrame for a given cell line and data type.

[20]:

nc.data.omics.nci60_table(cell_line='A498', data_type='RNA').head()

[20]:

	ID	score
0	WASH7P	-2.109966
1	NOC2L	-1.480194
2	HES4	-0.781522
3	ISG15	0.406806
4	AGRN	-0.324970
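Putting the three required functions together, a module for a new dataset might be laid out roughly as follows. This is only a sketch under several assumptions: the dataset name MyDataset, the function names, the path template, and the file contents are all hypothetical placeholders, and an in-memory dict stands in for files on a server. To stay dependency-free, the retrieval function returns a list of row dicts; an actual contribution would more likely return a pd.DataFrame, as described above.

```python
import csv
import io

# Hypothetical path template, following the pattern of the NCI60 entry.
_PATH = "MyDataset/{experiment}/{experiment}__{data_type}.tsv"

# Stand-in for remote files: made-up placeholder content, not real data.
_FAKE_FILES = {
    "MyDataset/exp1/exp1__RNA.tsv": "ID\tscore\ngene_a\t0.0\ngene_b\t1.0\n",
    "MyDataset/exp1/exp1__meta.tsv": "sample\tgroup\ns1\tcontrol\n",
}


def mydataset_experiments() -> list:
    """Overview of the subsets (here: experiments) that are available."""
    return sorted({path.split("/")[1] for path in _FAKE_FILES})


def mydataset_datatypes() -> list:
    """Overview of the data types (one file per type) that are available."""
    suffix = ".tsv"
    return sorted({path.rsplit("__", 1)[1][:-len(suffix)] for path in _FAKE_FILES})


def mydataset_table(experiment: str, data_type: str) -> list:
    """Retrieve one table as a list of row dicts (stand-in for a pd.DataFrame)."""
    path = _PATH.format(experiment=experiment, data_type=data_type)
    if path not in _FAKE_FILES:
        raise ValueError(f"No file found for {experiment!r} / {data_type!r}")
    reader = csv.DictReader(io.StringIO(_FAKE_FILES[path]), delimiter="\t")
    return list(reader)


print(mydataset_experiments())  # ['exp1']
print(mydataset_datatypes())    # ['RNA', 'meta']
print(mydataset_table("exp1", "RNA")[0])  # {'ID': 'gene_a', 'score': '0.0'}
```

Keeping the two overview functions separate from the retrieval function lets users discover valid arguments before requesting a table.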