Tutorial¶

From data to Knowledge Graph

In [1]:

# This cell is not important for understanding the tutorial;
# it only prints the version of the biokb_ipni package and the date of execution.
import os
import biokb_ipni
from datetime import datetime

# delete existing environment variables
if "CONNECTION_STR" in os.environ:
    del os.environ["CONNECTION_STR"]

print(
    "biokb_ipni version:",
    biokb_ipni.__version__,
    "Date:",
    datetime.now().strftime("%Y-%m-%d"),
)

biokb_ipni version: 0.1.8 Date: 2026-02-03

Tip: This notebook is available on GitHub. We recommend setting up a virtual environment (see the Installation section below), installing the biokb_ipni package, and running the Jupyter notebook cell by cell in that environment.

Abstract¶

The biokb_ipni library is part of the biokb family, which aims to integrate ontologies, terminologies, and knowledge from multiple domains (such as biochemistry, pharmacology, taxonomy, and ethnobotany) into a unified Knowledge Graph. Each library, like biokb_ipni for plant names, focuses on a different aspect, but all use the same workflow to import data into the database and the knowledge graph. Because the biokb family uses SQLAlchemy as its database layer, many relational database systems (SQLite, MySQL, MariaDB, PostgreSQL, ...) can be used. biokb_ipni uses Neo4j as the backend for the knowledge graph, but any triple store with a SPARQL server, such as Fuseki, can load the Resource Description Framework (RDF) files that the library creates. High connectivity between the various knowledge graphs (like IPNI) depends on all libraries using the same Uniform Resource Identifiers (URIs). Within each major version of the biokb family, the same URIs are used across all libraries.

This tutorial describes the use of biokb_ipni to generate a Knowledge Graph from primary data provided by the International Plant Names Index (IPNI). It also outlines the individual steps required to create either a relational database or RDF Turtle files.

Overview¶

The workflow is simple:

[data]-import_data()->[relational_database]-create_ttls()->[rdf_files]-import_ttls()->[knowledge_graph]

Main functions are:

import_data(): Import data into the database
create_ttls(): Create RDF files from the database
import_ttls(): Import RDF files into a Knowledge Graph

You can use the library in several ways, depending on your needs:

Command line interface (CLI)
Python API
Podman/Docker containers

If you want to use all features of the library it is recommended to use the Podman/Docker containers, since all dependencies are already installed and configured. If you only want to use parts of the library (like only the API or only the RDF generator) you can also install the library in a virtual environment and use it via CLI or Python API.

Installation¶

In general it is recommended to install Python libraries in a virtual environment to avoid conflicts between libraries. To do so, create a new directory, change into it, create the virtual environment, and activate it. On Linux or macOS, run:

mkdir biokb_ipni_test
cd biokb_ipni_test
python3 -m venv .venv
source .venv/bin/activate

If you are using Windows please check this tutorial.

Once the virtual environment is activated you can install the biokb_ipni library with pip:

pip install biokb_ipni
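
To confirm that the installation landed in the virtual environment, a short check like the following can help. It is a minimal sketch that uses only the Python standard library; the helper names in_virtualenv and installed_version are made up for this example and are not part of biokb_ipni.

```python
import sys
from importlib import metadata


def in_virtualenv():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points to the environment and
    # sys.base_prefix to the Python installation it was created from.
    return sys.prefix != sys.base_prefix


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("biokb_ipni version:", installed_version("biokb_ipni"))
```

If the second line prints None, pip most likely installed the package into a different interpreter than the one running the check.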

Tip: uv is a great and very fast alternative tool to create and manage virtual environments.

Use Cases¶

Depending on your needs, follow the links:

Relational database only
RDF turtle files only
Knowledge Graph

Each section gives step-by-step instructions for the available interfaces (CLI, RESTful API, Python API, or Podman/Docker containers).

Relational database only¶

Tip: A SQLite database can be opened with DB Browser for SQLite, which provides a user-friendly interface to explore the database structure and content, or with the SQLite Viewer extension for VS Code.
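
A quick look at the database is also possible from Python once the import described in the next section has run. The sketch below uses only the standard sqlite3 module and makes no assumption about the table names that biokb_ipni creates: it simply lists whatever tables it finds, with their row counts. The helper name list_tables is made up for this example.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    # Open read-only through a URI so that a wrong path raises an error
    # instead of silently creating an empty database file.
    uri = f"{Path(db_path).resolve().as_uri()}?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        names = [
            row[0]
            for row in con.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        return {
            name: con.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
            for name in names
        }
    finally:
        con.close()


if __name__ == "__main__":
    # Default location named in this tutorial.
    default_db = Path.home() / ".biokb" / "biokb.db"
    if default_db.exists():
        for table, rows in list_tables(default_db).items():
            print(f"{table}: {rows} rows")
    else:
        print("No database found at", default_db)
```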

Python¶

By default, import_data() creates (or updates) a SQLite database named biokb.db in the subfolder ~/.biokb/ of your home directory.

In [ ]:

from biokb_ipni import import_data

import_data()

If you want to change the logging level you can do it as follows:

import logging
from biokb_ipni import import_data
logging.getLogger('biokb_ipni').setLevel(logging.WARNING)
import_data()

Example output of import_data() with INFO-level logging:

2026-01-28 14:18:54,188 - biokb_ipni.db.manager - INFO - Using database engine: Engine(sqlite:////home/ceb/.biokb/biokb.db)
2026-01-28 14:18:54,231 - biokb_ipni.db.manager - INFO - Database recreated.
2026-01-28 14:18:54,231 - biokb_ipni.db.manager - INFO - Loading NCBI Taxonomy data for mapping families and names
2026-01-28 14:19:06,912 - biokb_ipni.db.manager - INFO - Importing references
2026-01-28 14:19:17,348 - biokb_ipni.db.manager - INFO - Importing families
2026-01-28 14:19:19,166 - biokb_ipni.db.manager - INFO - Importing names
2026-01-28 14:20:09,802 - biokb_ipni.db.manager - INFO - Importing type materials
2026-01-28 14:20:15,326 - biokb_ipni.db.manager - INFO - Importing name relations
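
The log lines above come from a logger named biokb_ipni.db.manager, which is a child of the biokb_ipni logger in Python's logging hierarchy. The following sketch illustrates why setting the level on the parent is enough to silence the INFO messages. It does not import biokb_ipni: it only reuses the two logger names, and it assumes that the library's child loggers do not set a level of their own. The ListHandler class is a helper made up for this example.

```python
import logging


class ListHandler(logging.Handler):
    """Collect formatted log lines in a list instead of printing them."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(self.format(record))


# Same layout as the log lines above: time - logger name - level - message.
handler = ListHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
)

parent = logging.getLogger("biokb_ipni")
parent.addHandler(handler)
child = logging.getLogger("biokb_ipni.db.manager")  # inherits its level from the parent

parent.setLevel(logging.INFO)
child.info("Importing names")        # kept: INFO passes the INFO threshold

parent.setLevel(logging.WARNING)
child.info("Importing families")     # dropped: INFO is below WARNING
child.warning("Something unusual")   # kept: WARNING passes the threshold

for line in handler.lines:
    print(line)
```

Only two of the three messages are collected, because the child logger has no level of its own and falls back to the level of its parent.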

In [4]:

print(import_data.__doc__)

Import all data in database.

Args:
    engine (Optional[Engine]): SQLAlchemy engine. Defaults to None.
    force_download (bool, optional): If True, will force download the data, even if
        files already exist. If False, it will skip the downloading part if files
        already exist locally. Defaults to False.
    delete_files (bool, optional): If True, downloaded files are deleted after import.
        Defaults to False.

Returns:
    Dict[str, int]: table=key and number of inserted=value
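
According to the docstring, import_data() returns a dictionary that maps table names to the number of inserted rows. A small report of that result could be sketched as follows. The helper summarize_counts is made up for this example, and the table names and numbers in the sample dictionary are invented placeholders, not real import results.

```python
def summarize_counts(counts):
    """Format a {table: inserted_rows} mapping as aligned lines plus a total."""
    width = max([len("total")] + [len(table) for table in counts])
    lines = [f"{table:<{width}}  {n:>12,}" for table, n in sorted(counts.items())]
    lines.append(f"{'total':<{width}}  {sum(counts.values()):>12,}")
    return "\n".join(lines)


if __name__ == "__main__":
    # With the library installed, the dictionary would come from the import itself,
    # using the keyword arguments named in the docstring:
    #   from biokb_ipni import import_data
    #   counts = import_data(force_download=True, delete_files=True)
    sample = {"table_a": 1200, "table_b": 34}  # hypothetical names and numbers
    print(summarize_counts(sample))
```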

Command line interface¶

biokb_ipni -v import-data

-v provides you with information about the processes currently running. If you do not wish to see this information, you can simply omit -v.

Options with the CLI¶

  -f, --force-download          Force re-download of the source file [default: False]
  -d, --delete-files            Delete downloaded source files after import [default: False]
  -c, --connection-string TEXT  SQLAlchemy engine URL [default:sqlite:////~/.biokb/biokb.db]
  --help                        Show this message and exit.

Here is an example that creates a database with a custom name, forces a re-download of the source file, and deletes the downloaded source files after import:

biokb_ipni -v import-data -f -d -c sqlite:///my_own_name.db

How to write a connection string¶

In the last (and also in the next) section we used a connection string to define the type and name of the database. Here are some examples how to write connection strings for different database management systems:

SQLite: sqlite:///ipni.db (creates a file named ipni.db in the current directory)
MySQL: mysql+pymysql://username:password@localhost/ipni_db
PostgreSQL: postgresql+psycopg2://username:password@localhost/ipni_db

If you are using a different host or port, adjust the connection string accordingly. For more details on connection strings, refer to the SQLAlchemy documentation. SQLite and MySQL are supported out of the box. For other databases you need to install the corresponding driver (such as the psycopg2 package for PostgreSQL) in your environment.
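
Two details of connection strings are easy to get wrong: an absolute SQLite path on Linux or macOS produces four slashes after "sqlite:", and special characters in a user name or password (such as @ or /) must be URL-encoded. The sketch below builds such strings with the standard library only. The helper names sqlite_url and server_url are made up for this example and are not part of biokb_ipni or SQLAlchemy.

```python
from pathlib import Path
from urllib.parse import quote_plus


def sqlite_url(db_path):
    """Build a SQLite URL; an absolute POSIX path yields sqlite:////..."""
    # A leading "~" is expanded here, so the URL always holds a concrete path.
    return "sqlite:///" + Path(db_path).expanduser().as_posix()


def server_url(dialect_driver, user, password, host, database, port=None):
    """Build dialect+driver://user:password@host[:port]/database."""
    # URL-encode the credentials so characters like "@" or "/" do not
    # break the structure of the URL.
    netloc = f"{quote_plus(user)}:{quote_plus(password)}@{host}"
    if port is not None:
        netloc += f":{port}"
    return f"{dialect_driver}://{netloc}/{database}"


if __name__ == "__main__":
    print(sqlite_url("ipni.db"))
    print(sqlite_url("~/.biokb/biokb.db"))
    print(server_url("mysql+pymysql", "username", "p@ss/word", "localhost", "ipni_db", 3306))
```

The resulting strings can be passed to the -c option of the CLI or to SQLAlchemy's create_engine.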

Python with other database management systems¶

To use a database other than the default (SQLite at ~/.biokb/biokb.db) from Python, first create an SQLAlchemy engine and then pass it to the import_data function. Here is an example with MySQL (assuming you have already created a database named ipni on your MySQL server; host: localhost, port: 3306, database: ipni, user: username, password: password):

from sqlalchemy import create_engine
from biokb_ipni import import_data

engine = create_engine("mysql+pymysql://username:password@localhost:3306/ipni")
import_data(engine)

RESTful API only¶

If you only want to use the RESTful API to access the data you can start it as follows:

biokb_ipni run-server

If you get an error message like:

biokb_ipni run-server
API server running at http://127.0.0.1:8000/docs#/
ERROR:    [Errno 98] error while attempting to bind on address ('0.0.0.0', 8000): address already in use

Use another port with the --port option:

biokb_ipni run-server --port 8080

Then open http://127.0.0.1:8080/docs#/ in your browser (adjust the port if you chose a different one) to access the API documentation.
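
To find out in advance which port is available, a check like the following can be used. It is a minimal sketch based on the standard socket module; the helper names port_is_free and first_free_port are made up for this example. It tests the loopback address 127.0.0.1, and a port that is free now can still be taken by another process before the server starts.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use", as in the error shown above.
            return False
        return True


def first_free_port(candidates, host="127.0.0.1"):
    """Return the first free port among the candidates, or None if all are taken."""
    for port in candidates:
        if port_is_free(port, host):
            return port
    return None


if __name__ == "__main__":
    port = first_free_port(range(8000, 8011))
    if port is None:
        print("No free port between 8000 and 8010")
    else:
        print(f"biokb_ipni run-server --port {port}")
```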

By default, the username admin and the password admin give access to the protected methods of the API (marked with a lock symbol). You can change them with the options --user and --password:

biokb_ipni run-server --user my_user --password my_password

If data has already been imported via biokb_ipni import-data, the API uses the existing database located at ~/.biokb/biokb.db. Otherwise, call the /import_data/ endpoint to import the data (this requires the user and password). The Turtle (ttl) files are created if they do not exist yet and exported via the /export_data/ endpoint. To import the data into a Neo4j Knowledge Graph, use /import_neo4j/.