Vignette 4: Sensitive response to drug perturbation using phosphoproteomics

The aim of this vignette is to explore a new way of evaluating the networks. In this setting, we assume that the networks that better contextualise the perturbation will have a higher average pEC50 in the nodes included in the solution network, compared to the ones not included. On the contrary, the networks which are not able to contextualise the perturbation will not show differences in the EC50 values of the nodes included or excluded. For more information about this evaluation strategy, please check the dedicated section.

[1]:
import networkcommons as nc
import pandas as pd
import decoupler as dc

1. Data processing

To showcase this strategy, we will use data from the DecryptM resource, a compendium of 14 different experiments comprising 4 different proteomics profiles and 31 different drugs. This dataset contains 10 different concentration points, which were used to fit logistic curves to obtain sensitivity values (pEC50, which is -log(EC50), higher sensitivity results in a higher pEC50 or a lower EC50). EC50 is the concentration at which the response is halfway between the baseline activity and the maximum activity.

Jana Zecha et al., Decrypting drug actions and protein modifications by dose- and time-resolved proteomics. Science 380,93-101(2023). https://doi.org/10.1126/science.ade3925

To see in a nutshell the available experiments from the resource, we can use nc.data.omics.decryptm_datasets

Note

Here, we assume that the averaging of phosphosite-level information is a good approximation for changes at the protein level. However, this is not necessarily true. For example, in case of proteins with multiple phosphorylation sites, some of them can result in the protein becoming more active, while others can make it become less active. In addition, some proteins become active/inactive only when a minimum of n phosphorylation sites have been phosphorylated. All these examples are ingnored in this setting, and thus should be taken into account when interpreting the results.

For the sake of simplicity, we will focus on one drug targeting EGFR. We get the 3 EGFR inhibitors:

[2]:
decryptm_curves = nc.data.omics.decryptm_table('3_EGFR_Inhibitors', 'Phosphoproteome', 'curves.txt')

We first filter those fittings with a high R2. We define here an arbitrary threshold of 0.9

[3]:
filtered_df = decryptm_curves[decryptm_curves['R2']>0.9]

Now, we can extract the information from each sample by splitting the Experiment column:

[ ]:
filtered_df[['PTM', 'Cell_line', 'Drug', 'Time', 'Rep']] = filtered_df['Experiment'].str.split('_', expand=True)

We can see which drug has the most fitted curves, which will be the one we will choose for this analysis.

[5]:
filtered_df['Drug'].value_counts()
[5]:
Drug
Afatinib     4054
Gefitinib    1515
Dasatinib    1368
Name: count, dtype: int64

With 4054 fitted curves, we can continue the analysis focusing in Afatinib.

[6]:
filtered_df = filtered_df[filtered_df['Drug'] == 'Afatinib']

We binarise the Regulation column:

[7]:
filtered_df['Regulation'] = filtered_df['Regulation'].map({'up': 1, 'down': -1}).fillna(0)

Now, since our prior knowledge network is detailed to the protein level, we need to summarise our phosphosite-specific information. For this, we can just calculate the average across phosphosites and replicates, per protein.

[8]:
numeric_cols = filtered_df.select_dtypes(include=[float, int]).columns
grouped_df = filtered_df.groupby(['PTM', 'Cell_line', 'Drug', 'Time', 'Gene names'])[numeric_cols].mean()
grouped_df.reset_index(inplace=True)

We discard those proteins which showed no response:

[9]:
grouped_df = grouped_df[grouped_df['Regulation'] != 0]

Now, we split our dataset in two. On one side, we will use the transcription factors (TFs) for the measurement layer (upregulated or downregulated). The rest of measurements which are not TFs will be used for the evaluation.

[10]:
collectri = dc.get_collectri()
collectri_tfs = set(collectri.source)
[11]:
measurements = grouped_df[grouped_df['Gene names'].isin(collectri_tfs)]
measurements.head()
[11]:
PTM Cell_line Drug Time Gene names N duplicates Length Missed cleavages Phospho (STY) Max Score ... R2 Curve RMSE Log EC50 error Curve slope error Curve top error Curve bottom error EC50 pEC50 Curve effect size Regulation
37 ddPTM A431 Afatinib 30min AKNA 1.00 8.000000 0.000000 1.000000 97.565000 ... 0.930210 0.079855 0.078054 0.326630 0.022948 0.021961 1.081113e-08 7.966129 -0.633375 -1.000000
77 ddPTM A431 Afatinib 30min ARHGAP35 1.25 15.416667 0.583333 1.166667 219.326917 ... 0.951323 0.063784 0.113077 1.232567 0.022569 0.030057 1.907749e-07 7.873396 -0.673425 -0.833333
106 ddPTM A431 Afatinib 30min ATF7 1.20 15.400000 0.400000 1.600000 231.716000 ... 0.939197 0.046872 2.402108 0.517590 0.020224 2.005207 2.000098e-03 6.252465 0.540893 -0.200000
113 ddPTM A431 Afatinib 30min ATRX 1.00 12.125000 1.000000 1.250000 205.422625 ... 0.925583 0.059691 0.129673 0.607281 0.020850 0.023965 1.983393e-08 8.059884 -0.060007 -0.125000
128 ddPTM A431 Afatinib 30min BCL3 1.10 16.000000 1.000000 1.400000 131.791200 ... 0.945352 0.093729 0.093656 0.186224 0.040301 0.036615 7.787948e-09 8.132247 -0.125451 -0.400000

5 rows × 60 columns

[12]:
ec50_values = grouped_df[~grouped_df['Gene names'].isin(collectri_tfs)]
ec50_dict = dict(zip(grouped_df['Gene names'], grouped_df['pEC50']))
list(ec50_dict.items())[:5]
[12]:
[('ABLIM3', 8.038036722142818),
 ('ACACA', 7.919001347352594),
 ('ACIN1', 8.51245270397656),
 ('ADAM17', 8.088225031254964),
 ('ADAR', 8.209162778966169)]

2. Network inference

Afatinib is an EGFR inhibitor, therefore:

[13]:
source_dict = {'EGFR': -1}

We get the top 30 most sensitive TFs.

[20]:
measurements = measurements.sort_values('pEC50', ascending=False)
measurements = measurements.head(30)
[15]:
target_dict = dict(zip(measurements['Gene names'], measurements['Regulation']))

Now we compute the contextualised networks using four different topological methods: shortest paths and all paths, with and without sign consistency.

[16]:
network_df = nc.data.network.get_omnipath()
network = nc.utils.network_from_df(network_df)

Now, we will perform network inference with some topology methods and CORNETO-Carnival.

Warning

The following code block uses CORNETO, a package specialised in Integer Linear programming for network inference. Some CORNETO methods (such as Carnival) depend on GUROBI, a third-party solver that holds a commercial license. Therefore, in order to run this part of the code, you will have to install a license in your system. Please check their home page for more information.

[ ]:
network_dict = {}

network_dict['corneto_network'] = nc.methods.run_corneto_carnival(network, source_dict, target_dict, betaWeight=0.01, solver='GUROBI')
network_dict['shortest_paths'], shortest_paths = nc.methods.run_shortest_paths(network, source_dict, target_dict)
network_dict['shortest_paths_sc'], shortest_paths_sc = nc.methods.run_sign_consistency(network, shortest_paths, source_dict, target_dict)
network_dict['all_paths'], all_paths = nc.methods.run_all_paths(network, source_dict, target_dict, depth_cutoff=5)
network_dict['all_paths_sc'], all_paths_sc = nc.methods.run_sign_consistency(network, all_paths, source_dict, target_dict)

3. Evaluation using sensitivity to perturbation

Finally, we can assess whether these networks contain a differential share of highly sensitive nodes, which will be our evaluation metric. For more information, please check the Evaluation strategies details page.

[18]:
ec50_eval_results = nc.eval.get_metric_from_networks(network_dict, nc.eval.get_ec50_evaluation, ec50_dict=ec50_dict)
[19]:
p = nc.visual.plot_scatter(
    ec50_eval_results,
    x_col='diff_EC50',
    y_col='coverage',
    size_col=None,
    hue_col='method',
    numeric_cols=None,
    style_col=None,
    xlabel='Difference in average -log(EC50) scores',
    ylabel='Coverage',
    title='Coverage vs Difference in average sensitivity scores',
    figsize=(10, 6),
)
../_images/vignettes_4_evaluation_decryptm_40_0.png

Both corneto and shortest paths + sign consistency had the highest difference between EC50 values of nodes included and excluded of the solution networks. In addition, the corneto network has almost double coverage than the rest of the networks. Therefore, we would consider the corneto network a better performer, taking both dimensions into account.

That was it! Here, we used phosphoproteomics to infer mechanistic explanations of sensitivity changes in transcription factors, and then we used the same sensitivity information to evaluate the different methodologies. In our next vignette, we will explore the usage of multiomics in another evaluation setting. Bye!