Using the CLI#
In addition to our API, we provide an easy-to-use command-line interface (CLI) that can be used to train your own site-of-metabolism (SOM) prediction models and retrieve predictions from them.
In this interactive Jupyter notebook, we will walk you through using this tool by creating a new SOM prediction model based on a synthetic dataset. The resulting model is not expected to be useful for real metabolism prediction, but serves as an example of what can be done using our tool.
Tip
To get additional information, you can invoke any subcommand with the --help flag, which shows a summary of all supported arguments.
Building the model#
Hyperparameter search#
The hyperparameters command allows you to perform a cross-validated hyperparameter search on your data using the same setup that was used in the FAME3R paper. Hyperparameters are exported as JSON.
Note
Hyperparameter optimization is disabled by default in this notebook, as it takes a long time.
If you want to perform hyperparameter optimization and use the generated hyperparameters, you can uncomment the next code cell and add --hyperparameters hyperparameters.json as an option to the train command in the next section.
%%bash
#fame3r hyperparameters -i data/metatrans_autoannotated_cleaned/train.sdf -o hyperparameters.json
Training#
The train command is used to train a random forest model for predicting SOMs, as well as an auxiliary model for predicting the FAME score. Without additional parameters, the resulting model will be trained using exactly the same parameters that were used for the models reported in the FAME3R paper.
%%bash
fame3r train -i data/metatrans_autoannotated_cleaned/train.sdf -o models
Threshold post-tuning#
The threshold command can be used for threshold post-tuning, i.e., finding the classification threshold that results in the most balanced predictions.
Note
Threshold post-tuning is disabled by default in this notebook, as it takes a long time.
If you want to perform threshold post-tuning and use the resulting threshold, you can uncomment the next code cell and pass the tuned threshold to the predict command in the next section using the --threshold option.
%%bash
#fame3r threshold -i data/metatrans_autoannotated_cleaned/train.sdf -m models/random_forest_classifier.joblib
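To illustrate the general idea behind threshold post-tuning, the following is a minimal, self-contained sketch. It is not the implementation used by the threshold command: the choice of the Matthews correlation coefficient as the "balance" criterion, the candidate grid, the function names, and the toy data are all assumptions made for illustration.

```python
from math import sqrt


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def tune_threshold(y_true, y_prob, candidates=None):
    """Return (threshold, score) for the candidate with the highest MCC.

    Ties are resolved in favour of the first (lowest) candidate.
    """
    if candidates is None:
        # Hypothetical grid: 0.01, 0.02, ..., 0.99
        candidates = [i / 100 for i in range(1, 100)]
    best_threshold, best_score = None, float("-inf")
    for threshold in candidates:
        y_pred = [1 if p >= threshold else 0 for p in y_prob]
        score = matthews_corrcoef(y_true, y_pred)
        if score > best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Toy data (made up): per-atom SOM labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80]

threshold, score = tune_threshold(y_true, y_prob)
print(f"best threshold: {threshold:.2f}, MCC: {score:.3f}")
```

In practice, the probabilities used for tuning should come from data the model was not fitted on (for example, out-of-fold predictions); otherwise the selected threshold tends to be overly optimistic.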
Applying the model#
Generating predictions#
Now that we have some trained models, the predict command can be used to generate predictions, including predicted probabilities and binary predictions based on either the default or a provided threshold. The --uncertainty flag can be used to additionally generate uncertainty estimates for each input atom.
%%bash
fame3r predict -i data/metatrans_autoannotated_cleaned/test.sdf -m models -o predictions.csv --uncertainty fame-score --uncertainty shannon-entropy
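To give some intuition for the shannon-entropy uncertainty option, the sketch below computes the Shannon entropy of a binary prediction from its predicted probability. Whether the CLI uses exactly this formula and this unit (bits rather than nats) is an assumption; the function name is hypothetical.

```python
from math import log2


def binary_shannon_entropy(p):
    """Entropy in bits of a binary outcome with positive-class probability p."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    if p in (0.0, 1.0):
        # The limit of x * log2(x) as x approaches 0 is 0.
        return 0.0
    return -(p * log2(p) + (1.0 - p) * log2(1.0 - p))


# Entropy is highest for p = 0.5 and falls to zero for confident predictions.
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f} -> entropy = {binary_shannon_entropy(p):.3f} bits")
```

Under this definition, atoms with predicted probabilities close to 0.5 receive the highest uncertainty, regardless of the classification threshold used.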
Calculating metrics#
Given a prediction CSV file generated as shown above, the metrics command can then be used to calculate various classification metrics, including the Top-k metric (k=2), which is commonly used in metabolism prediction. Metrics are exported as JSON.
%%bash
fame3r metrics -i predictions.csv -o metrics.json
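For readers unfamiliar with the Top-k metric, the following sketch shows one common way to define it: the fraction of molecules for which at least one annotated SOM is among the k highest-ranked atoms. This is an illustration under assumptions, not the code behind the metrics command; the input layout, the function name, and the handling of molecules without annotated SOMs are hypothetical.

```python
from collections import defaultdict


def top_k_success_rate(rows, k=2):
    """Compute the Top-k success rate.

    rows: iterable of (molecule_id, predicted_probability, is_true_som).
    """
    by_molecule = defaultdict(list)
    for mol_id, prob, is_som in rows:
        by_molecule[mol_id].append((prob, bool(is_som)))

    scored = hits = 0
    for atoms in by_molecule.values():
        if not any(is_som for _, is_som in atoms):
            # Assumption: molecules without any annotated SOM are skipped.
            continue
        scored += 1
        # Rank atoms by predicted probability, highest first.
        ranked = sorted(atoms, key=lambda atom: atom[0], reverse=True)
        if any(is_som for _, is_som in ranked[:k]):
            hits += 1
    return hits / scored if scored else float("nan")


# Toy data (made up): two molecules with three atoms each.
rows = [
    ("mol_a", 0.9, False), ("mol_a", 0.8, True), ("mol_a", 0.1, False),
    ("mol_b", 0.7, False), ("mol_b", 0.6, False), ("mol_b", 0.5, True),
]
print(top_k_success_rate(rows, k=2))
```

Here the annotated SOM of mol_a is ranked second and counts as a hit for k=2, while the SOM of mol_b is ranked third and does not.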
Using descriptors externally#
While our Python API can be used to seamlessly integrate our work into your Python-based chemoinformatics workflows, we recognize that other programming and modelling environments exist. To that end, you can use the descriptors command to generate FAME3R descriptors in various configurations. The generated descriptors are exported as CSV.
%%bash
fame3r descriptors -i data/metatrans_autoannotated_cleaned/train.sdf -o descriptors.csv
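As a rough sketch of how such a CSV file could be consumed without our Python API, the example below reads a descriptor table using only the standard library. The column names (mol_id, atom_id, desc_a, desc_b) and the demo file are hypothetical and do not reflect the actual layout of descriptors.csv; inspect the header of your own file and adjust id_columns accordingly.

```python
import csv
import os
import tempfile


def load_descriptor_csv(path, id_columns=()):
    """Read a CSV file into (descriptor names, row identifiers, float matrix).

    Columns listed in id_columns are kept as strings; all other columns are
    converted to float, raising ValueError if a value is not numeric.
    """
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        names = [n for n in (reader.fieldnames or []) if n not in id_columns]
        ids, matrix = [], []
        for row in reader:
            ids.append(tuple(row[c] for c in id_columns))
            matrix.append([float(row[n]) for n in names])
    return names, ids, matrix


# Demo on a small made-up file with hypothetical column names.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "descriptors_demo.csv")
    with open(demo_path, "w", newline="") as handle:
        handle.write("mol_id,atom_id,desc_a,desc_b\n")
        handle.write("m1,0,0.5,1.0\n")
        handle.write("m1,1,0.25,2.0\n")
    names, ids, matrix = load_descriptor_csv(
        demo_path, id_columns=("mol_id", "atom_id")
    )

print(names)
print(ids)
print(matrix)
```

The same row-per-atom table can be loaded analogously in other environments that support CSV import.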