ZINC15 for Drug Similarity Search

Author ORCID

Fuad Al Abir 0000-0002-9091-3078

Publication Date

4-29-2024

Abstract

Abstract: This dataset is a subset of the ZINC15 database, filtered and processed for molecular similarity search using MegaMolBART embeddings. The subset is restricted to drug-like molecules selected by molecular weight, LogP, reactivity level, and purchasability.

Keywords: ZINC15, Molecular Similarity Search, MegaMolBART, Drug Discovery, Cheminformatics.

Background: The ZINC15 database is a comprehensive collection of commercially available compounds for virtual screening. This subset was created to facilitate the development of machine learning models for drug discovery, particularly those based on molecular embeddings.

Methodology: The ZINC15 database was queried using the following criteria:

  • Molecular weight <= 500 Daltons
  • LogP <= 5
  • Reactivity level = "reactive"
  • Purchasability = "annotated"

The resulting dataset was then processed to extract MegaMolBART embeddings for each molecule.
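The filtering criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the query actually run against ZINC15: the `Molecule` record, its field names, and the example property values are hypothetical, and the reactivity and purchasability criteria are treated as exact string matches, which is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Molecule:
    """Hypothetical record; field names are illustrative, not a ZINC15 schema."""
    smiles: str
    mol_weight: float      # Daltons
    logp: float
    reactivity: str
    purchasability: str


def passes_filters(m: Molecule) -> bool:
    """Apply the four criteria listed in the methodology."""
    return (
        m.mol_weight <= 500.0
        and m.logp <= 5.0
        and m.reactivity == "reactive"          # assumed exact match
        and m.purchasability == "annotated"     # assumed exact match
    )


# Made-up example rows; property values are approximate and for illustration only.
candidates = [
    Molecule("CC(=O)Oc1ccccc1C(=O)O", 180.16, 1.2, "reactive", "annotated"),
    Molecule("CCCCCCCCCCCCCCCCCC(=O)O", 284.48, 8.2, "reactive", "annotated"),  # LogP too high
    Molecule("CCO", 46.07, -0.3, "clean", "annotated"),                         # wrong reactivity
]

selected = [m.smiles for m in candidates if passes_filters(m)]
print(selected)
```

Only the first example row satisfies all four conditions; the other two are rejected on LogP and reactivity level respectively.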

Data Description:

The dataset is organized into three folders:

  • /data/project/ubrite/drg-depot/zinc15-similarity-search/raw-data/ (66 GB): This folder contains the raw data files obtained from the ZINC15 database after applying the filtering criteria.
  • /data/project/ubrite/drg-depot/zinc15-similarity-search/processed-data/ (13 GB): This folder contains the processed data, including the extracted MegaMolBART embeddings for each molecule.
  • /data/project/ubrite/drg-depot/zinc15-similarity-search/query/: This folder contains sample SMILES strings and their corresponding embeddings for performing similarity searches.
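As a rough sketch of how a query embedding might be compared against the processed embeddings, the example below ranks molecules by cosine similarity. The record does not state the similarity metric or the on-disk format of the embeddings, so cosine similarity, the in-memory dictionary, the function names, and the toy 3-dimensional vectors are all assumptions made for illustration; real embeddings would be much longer vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, library, k=2):
    """Return the k (smiles, score) pairs most similar to the query embedding."""
    scored = [(smiles, cosine_similarity(query, emb)) for smiles, emb in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for precomputed embeddings keyed by SMILES string (values are made up).
library = {
    "CCO": [1.0, 0.0, 0.0],
    "CCN": [0.9, 0.1, 0.0],
    "c1ccccc1": [0.0, 1.0, 0.0],
}
query_embedding = [1.0, 0.0, 0.0]

hits = top_k(query_embedding, library, k=2)
for smiles, score in hits:
    print(f"{smiles}\t{score:.4f}")
```

A brute-force scan like this is only practical for small collections; at the scale of the processed data folder, an approximate nearest-neighbor index would likely be needed.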

Technical Specifications:

  • Format: SMILES strings, numerical data (embeddings)
  • Size: 79 GB (total)
  • License: This dataset is derived from the ZINC15 database and processed using MegaMolBART. It is subject to the licenses of both the ZINC15 database and the MegaMolBART model.
    • ZINC15 Database: ZINC15 data is made available under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license. For more information, please visit the ZINC15 website.
    • MegaMolBART: The MegaMolBART model and its associated data are copyrighted by AstraZeneca and NVIDIA. The usage of MegaMolBART is subject to the terms and conditions specified by the copyright holders.

By using this dataset, you agree to comply with the licenses and conditions imposed by the ZINC15 database and MegaMolBART.

Access and Usage:

The dataset is available for download through Zenodo. Users are encouraged to acknowledge this dataset and the corresponding Zenodo entry in any publications or research projects that utilize the data.

Contact: Fuad Al Abir, fuad021@uab.edu

Repository

Zenodo

Distribution License

This work is licensed under a Creative Commons Attribution 4.0 International License.
